784 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 6, JUNE 2010

Knowledge-Based Interactive Postmining of Association Rules Using Ontologies
Claudia Marinica and Fabrice Guillet

Abstract—In Data Mining, the usefulness of association rules is strongly limited by the huge amount of delivered rules. To overcome
this drawback, several methods were proposed in the literature such as itemset concise representations, redundancy reduction, and
postprocessing. However, being generally based on statistical information, most of these methods do not guarantee that the extracted
rules are interesting for the user. Thus, it is crucial to help the decision-maker with an efficient postprocessing step in order to reduce
the number of rules. This paper proposes a new interactive approach to prune and filter discovered rules. First, we propose to use
ontologies in order to improve the integration of user knowledge in the postprocessing task. Second, we propose the Rule Schema
formalism extending the specification language proposed by Liu et al. for user expectations. Furthermore, an interactive framework is
designed to assist the user throughout the analyzing task. Applying our new approach over voluminous sets of rules, we were able, by
integrating domain expert knowledge in the postprocessing step, to reduce the number of rules to several dozens or less. Moreover,
the quality of the filtered rules was validated by the domain expert at various points in the interactive process.

Index Terms—Clustering, classification, and association rules, interactive data exploration and discovery, knowledge management
applications.

• The authors are with KOD Team, LINA CNRS 6241, Polytech'Nantes, Site de la Chantrerie, Rue Christian Pauc, BP 50609, 44306 Nantes cedex 3, France. E-mail: {claudia.marinica, fabrice.guillet}@univ-nantes.fr.
Manuscript received 31 Mar. 2009; revised 23 Sept. 2009; accepted 7 Nov. 2009; published online 4 Feb. 2010.
Recommended for acceptance by C. Zhang, P.S. Yu, and D. Bell.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDESI-2009-03-0275.
Digital Object Identifier no. 10.1109/TKDE.2010.29.
1041-4347/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

ASSOCIATION rule mining, introduced in [1], is considered one of the most important tasks in Knowledge Discovery in Databases [2]. Among sets of items in transaction databases, it aims at discovering implicative tendencies that can be valuable information for the decision-maker.

An association rule is defined as the implication $X \rightarrow Y$, described by two interestingness measures, support and confidence, where $X$ and $Y$ are sets of items and $X \cap Y = \emptyset$. Apriori [1] is the first algorithm proposed in the association rule mining field, and many other algorithms were derived from it. Starting from a database, it extracts all association rules satisfying minimum thresholds of support and confidence. It is well known that mining algorithms can discover a prohibitive number of association rules; for instance, thousands of rules are extracted from a database of several dozen attributes and several hundred transactions. Furthermore, as suggested by Silberschatz and Tuzhilin [3], valuable information is often represented by those rare (low support) and unexpected association rules which are surprising to the user. So, the more we increase the support threshold, the more efficient the algorithms are and the more obvious the discovered rules are, and hence, the less interesting they are for the user. As a result, it is necessary to bring the support threshold low enough in order to extract valuable information. Unfortunately, the lower the support is, the larger the volume of rules becomes, making it intractable for a decision-maker to analyze the mining result. Experiments show that rules become almost impossible to use when the number of rules exceeds 100. Thus, it is crucial to help the decision-maker with an efficient technique for reducing the number of rules.

To overcome this drawback, several methods were proposed in the literature. On the one hand, different algorithms were introduced to reduce the number of itemsets by generating closed [4], maximal [5], or optimal itemsets [6], and several algorithms to reduce the number of rules, using nonredundant rules [7], [8], or pruning techniques [9]. On the other hand, postprocessing methods can improve the selection of discovered rules. Different complementary postprocessing methods may be used, like pruning, summarizing, grouping, or visualization [10]. Pruning consists in removing uninteresting or redundant rules. In summarizing, concise sets of rules are generated. Groups of rules are produced in the grouping process, and visualization improves the readability of a large number of rules by using adapted graphical representations.

However, most of the existing postprocessing methods are generally based on statistical information in the database. Since rule interestingness strongly depends on user knowledge and goals, these methods do not guarantee that interesting rules will be extracted. For instance, if the user looks for unexpected rules, all the already known rules should be pruned. Or, if the user wants to focus on specific schemas of rules, only this subset of rules should be selected. Moreover, as suggested in [11], the rule postprocessing methods should imperatively be based on a strong interactivity with the user.

The representation of user knowledge is an important issue. The more the knowledge is represented in a flexible,
MARINICA AND GUILLET: KNOWLEDGE-BASED INTERACTIVE POSTMINING OF ASSOCIATION RULES USING ONTOLOGIES 785

expressive, and accurate formalism, the more the rule selection is efficient. In the Semantic Web (http://www.w3.org/2001/sw/) field, ontology is considered the most appropriate representation to express the complexity of the user knowledge, and several specification languages were proposed.

This paper proposes a new interactive postprocessing approach, ARIPSO (Association Rule Interactive post-Processing using Schemas and Ontologies), to prune and filter discovered rules. First, we propose to use Domain Ontologies in order to strengthen the integration of user knowledge in the postprocessing task. Second, we introduce the Rule Schema formalism by extending the specification language proposed by Liu et al. [12] for user beliefs and expectations toward the use of ontology concepts. Furthermore, an interactive and iterative framework is designed to assist the user throughout the analyzing task. The interactivity of our approach relies on a set of rule mining operators defined over the Rule Schemas in order to describe the actions that the user can perform.

This paper is structured as follows: Section 2 introduces notations and definitions used throughout the paper. Section 3 justifies our motivations for using ontologies. Section 4 describes the research domain and reviews related works. Section 5 presents the proposed framework and its elements. Section 6 is devoted to the results obtained by applying our method over a questionnaire database. Finally, Section 7 presents conclusions and shows directions for future research.

2 NOTATIONS AND DEFINITIONS

The association rule mining task can be stated as follows: let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of literals, called items. Let $D = \{t_1, t_2, \ldots, t_m\}$ be a set of transactions over $I$. A nonempty subset of $I$ is called itemset and is defined as $X = \{i_1, i_2, \ldots, i_k\}$. In short, itemset $X$ can also be denoted as $X = i_1 i_2 \ldots i_k$. For an itemset, the number of items is called length of the itemset and an itemset of length $k$ is referred to as $k$-itemset. Each transaction $t_i$ contains an itemset $i_1 i_2 \ldots i_k$, with a variable number $k$ of items for each $t_i$.

Definition 1. Let $X \subseteq I$ and $T \subseteq D$. We define the set of all transactions that contain the itemset $X$ as:

$$t : I \rightarrow D, \quad t(X) = \{t \in D \mid X \subseteq t\}.$$

Similarly, we describe the itemsets contained in all the transactions $T$ by:

$$i : D \rightarrow I, \quad i(T) = \{x \in I \mid \forall t \in T,\ x \in t\}.$$

Definition 2. An association rule is an implication $X \rightarrow Y$, where $X$ and $Y$ are two itemsets and $X \cap Y = \emptyset$. The former, $X$, is called the antecedent of the rule, and the latter, $Y$, is called the consequent.

A rule $X \rightarrow Y$ is described using two important statistical factors:

• The support of the rule, defined as $supp(X \rightarrow Y) = supp(X \cup Y) = |t(X \cup Y)|$, is the ratio of the number of transactions containing $X \cup Y$. If $supp(X \rightarrow Y) = s$, $s\%$ of transactions contain the itemset $X \cup Y$.
• The confidence of the rule, defined as $conf(X \rightarrow Y) = supp(X \rightarrow Y)/supp(X) = supp(X \cup Y)/supp(X) = c$, is the ratio ($c\%$) of the number of transactions that, containing $X$, also contain $Y$.

Starting from a database and two thresholds minsupp and minconf for the minimal support and the minimal confidence, respectively, the problem of finding association rules, as discussed in [1], is to generate all rules that have support and confidence greater than the given thresholds. This problem can be divided into two main problems:

• first, all frequent itemsets are extracted. An itemset $X$ is called a frequent itemset in the transaction database $D$ if $supp(X) \geq minsupp$;
• and then, for each frequent itemset $X$, the set of rules $X \setminus Y \rightarrow Y$, with $Y \subset X$, and satisfying $conf(X \setminus Y \rightarrow Y) \geq minconf$ is generated.

If $X$ is frequent and no superset of $X$ is frequent, $X$ is denoted as a maximal itemset.

Theorem 1. Let $X \subseteq I$ and $T \subseteq D$. Let $c_{it}(X)$ denote the composition of the two mappings, $t \circ i(X) = i(t(X))$. Also, let $c_{ti}(T) = i \circ t(T) = t(i(T))$. Then, $c_{it}$ and $c_{ti}$ are both Galois closure operators [13] on itemsets and sets of transactions, respectively.

Definition 3. A closed itemset [14] is defined as an itemset $X$ which has the property of being the same as its closure, i.e., $X = c_{it}(X)$. The minimal closed itemset containing an itemset $Y$ is obtained by applying the closure operator $c_{it}$ to $Y$.

Definition 4. Let $R_1$ and $R_2$ be two association rules. We say that rule $R_1$ is more general than rule $R_2$, denoted $R_1 \preceq R_2$, if $R_2$ can be generated by adding additional items to either the antecedent or consequent of $R_1$. In this case, we say that a rule $R_j$ is redundant [15] if there exists some rule $R_i$ such that $R_i \preceq R_j$. In consequence, in a collection of rules, the nonredundant rules are the most general ones, i.e., those rules having minimal antecedents and consequents, in terms of subset relation.

Definition 5. A rule set is optimal [6] with respect to an interestingness metric if it contains all the rules except those with no greater interestingness than one of its more general rules. An optimal rule set is a subset of a nonredundant rule set.

Definition 6. Formally, an ontology is a quintuple $O = \{C, R, I, H, A\}$ [16]. $C = \{C_1, C_2, \ldots, C_n\}$ is a set of concepts and $R = \{R_1, R_2, \ldots, R_m\}$ is a set of relations defined over concepts. $I$ is a set of instances of concepts and $H$ is a Directed Acyclic Graph (DAG) defined by the subsumption relation (is-a relation, $\leq$) between concepts. We say that $C_2$ is-a $C_1$, $C_1 \leq C_2$, if the concept $C_1$ subsumes the concept $C_2$. $A$ is a set of axioms bringing additional constraints on the ontology.

3 MOTIVATIONS FOR THE GENERAL IMPRESSION IMPROVEMENT USING ONTOLOGIES

Since the early 2000s, in the Semantic Web context, the number of available ontologies has been increasing, covering a wide domain of applications. This could be a great advantage in an ontology-based user knowledge representation.
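The notations of Section 2 (the mappings t and i, support, confidence, the closure operator, and the second step of the mining problem) can be made concrete with a minimal Python sketch. The toy transactions, the function names, and the choice to normalize support by the number of transactions are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import combinations

# Toy transaction database D (hypothetical data, not from the paper).
D = [
    frozenset({"apple", "beef", "milk"}),
    frozenset({"apple", "beef"}),
    frozenset({"grape", "milk"}),
    frozenset({"apple", "grape", "beef", "milk"}),
]
ITEMS = frozenset().union(*D)


def t(X):
    """Definition 1: the transactions of D that contain every item of X."""
    return [tr for tr in D if X <= tr]


def i(T):
    """Definition 1: the items shared by all transactions in T."""
    return frozenset(x for x in ITEMS if all(x in tr for tr in T))


def supp(X):
    """Relative support: fraction of transactions containing X (assumed normalization)."""
    return len(t(X)) / len(D)


def conf(X, Y):
    """Confidence of the rule X -> Y: supp(X union Y) / supp(X)."""
    return supp(X | Y) / supp(X)


def closure(X):
    """c_it(X) = i(t(X)): the smallest closed itemset containing X."""
    return i(t(X))


def rules_from(Z, minconf):
    """Rules (Z minus Y) -> Y, for nonempty proper subsets Y of Z, with conf >= minconf."""
    out = []
    for r in range(1, len(Z)):
        for Y in map(frozenset, combinations(sorted(Z), r)):
            if conf(Z - Y, Y) >= minconf:
                out.append((Z - Y, Y))
    return out
```

With this data, `supp(frozenset({"apple", "beef"}))` is 0.75 and `closure(frozenset({"apple"}))` is `{"apple", "beef"}`, since every transaction containing apple also contains beef.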
Fig. 1. Supermarket item taxonomy [12].

This paper contributes on several levels to reducing the number of association rules. One of our most important contributions relies on using ontologies as user background knowledge representation. Thus, we extend the specification language proposed by Liu et al. [17], namely General Impressions (GI), Reasonably Precise Concepts (RPC), and Precise Knowledge (PK), by the use of ontology concepts.

Example. Let us consider the case of General Impressions. The user might believe that there exist some associations among milk OR cheese, Fruit items, and beef (assume that the user uses the taxonomy in Fig. 1). He/she could specify his/her beliefs using General Impressions:

gi(<{milk, cheese}, Fruit+, beef>).

The following rules are examples of association rules that conform to the specification:

apple $\rightarrow$ beef
grape, pear, beef $\rightarrow$ milk.

But when working directly with database items, developing specifications becomes a complex task. Moreover, it would be very useful for the user to be able to introduce additional interesting information in the GI language. For example, in the market case, it would be very useful to be able to find out whether the customers buying diet products also buy ecological products. In order to select this type of rules, the user should be able to create an RPC such as:

rpc(<DietProducts $\rightarrow$ EcologicalProducts>),

where DietProducts and EcologicalProducts represent, respectively, the set of the products integrated in diets, and those products which are produced in an ecological way. Defining such concepts is not possible using taxonomies.

Starting from the taxonomy presented in Fig. 1, we developed an ontology based on the earlier considerations. We propose to integrate two data properties of Boolean type in order to define the products that are useful in diets (IsDiet), and those that are ecological (IsEcological). Description logic [18], used in designing ontologies, allows concept definition using restrictions on properties. Therefore, the concept DietProducts is defined as a restriction on the FoodItem hierarchy using the data property IsDiet, describing the items useful in a diet. Similarly, we define the EcologicalProducts concept.

In our example, apple and chicken items are diet products, and milk, grape, and beef items are ecological products. In Fig. 2, we present the structure of the ontology resulting after applying a reasoner, and the ontology construction is detailed in Section 5.3.

Fig. 2. Visualization of the ontology created based on the supermarket item taxonomy.

4 RELATED WORK

4.1 Concise Representations of Frequent Itemsets

Interestingness measures represent metrics in the process of capturing dependencies and implications between database items, and express the strength of the pattern association. Since frequent itemset generation is considered an expensive operation, mining frequent closed itemsets (preliminary idea presented in [4]) was proposed in order to reduce the number of frequent itemsets. For example, an itemset $X$ is denoted as closed frequent itemset if $\nexists$ itemset $X' \supset X$ so that $t(X) = t(X')$. Thus, the number of frequent closed itemsets generated is reduced in comparison with the number of frequent itemsets.

The CLOSET algorithm was proposed in [19] as a new efficient method for mining closed itemsets. CLOSET uses a novel frequent pattern tree (FP-tree) structure, which is a compressed representation of all the transactions in the database. Moreover, it uses a recursive divide-and-conquer and database projection approach to mine long patterns.

Another solution for the reduction of the number of frequent itemsets is mining maximal frequent itemsets [5]. The authors proposed the MAFIA algorithm based on depth-first traversal and several pruning methods such as Parent Equivalence Pruning (PEP), FHUT, HUTMFI, or Dynamic Reordering. However, the main drawback of the methods extracting maximal frequent itemsets is the loss of information because the subset frequency is not available; thus, generating rules is not possible.

4.2 Redundancy Reduction of Association Rules

Conversely, generating all association rules that satisfy the confidence threshold is a combinatorial problem.

Zaki and Hsiao used frequent closed itemsets in the CHARM algorithm [20] in order to generate all frequent closed itemsets. They used an itemset-tid set search tree and pursued with the aim of generating a small nonredundant rule set [7]. To this goal, the authors first found minimal generators for closed itemsets, and then, they generated nonredundant association rules using two closed itemsets.

Pasquier et al. [8] proposed the Close algorithm in order to extract association rules. The Close algorithm is based on a new mining method: pruning of the closed set lattice (closed itemset lattice) in order to extract frequent closed itemsets. Association rules are generated starting from frequent itemsets generated from frequent closed itemsets. Nevertheless, Zaki and Hsiao [20] proved that their algorithm CHARM outperforms the CLOSET, Close, and MAFIA algorithms.
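The closed and maximal itemset notions used by the algorithms above can be illustrated with a brute-force Python sketch. It is not CHARM, CLOSET, Close, or MAFIA; it only applies the two definitions directly. The toy database, the absolute threshold `MINSUPP`, and all function names are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy database; items and counts are illustrative only.
D = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"c"}),
]
MINSUPP = 2  # absolute support threshold (assumption)


def tids(X):
    """t(X): indices of the transactions that contain itemset X."""
    return frozenset(k for k, tr in enumerate(D) if X <= tr)


def frequent_itemsets():
    """Brute-force enumeration, adequate only for tiny examples."""
    items = sorted(frozenset().union(*D))
    found = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            X = frozenset(combo)
            if len(tids(X)) >= MINSUPP:
                found.append(X)
    return found


def closed(frequent):
    """Keep X only if no proper superset X' satisfies t(X') == t(X)."""
    return [X for X in frequent
            if not any(X < Y and tids(X) == tids(Y) for Y in frequent)]


def maximal(frequent):
    """Keep X only if no proper superset of X is frequent."""
    return [X for X in frequent if not any(X < Y for Y in frequent)]


FREQ = frequent_itemsets()
CLOSED = closed(FREQ)
MAXIMAL = maximal(FREQ)
```

On this toy database the seven frequent itemsets reduce to three closed ones ({c}, {a, b}, {a, b, c}) and a single maximal one ({a, b, c}). The supports of all seven can be recovered from the closed itemsets, whereas the maximal itemset alone does not give the support of {a, b}, which is the loss of information mentioned in Section 4.1.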
More recently, Li [6] proposed optimal rule sets, defined with respect to an interestingness metric. An optimal rule set contains all rules except those with no greater interestingness than one of its more general rules.

A set of reduction techniques for redundant rules was proposed and implemented in [21]. The developed techniques are based on the generalization/specification of the antecedent/consequent of the rules and they are divided into methods for multiantecedent rules and multiconsequent rules.

Hahsler et al. [22] were interested in the idea of generating association rules from arbitrary sets of itemsets. This makes it possible for a user to propose a set of itemsets and to integrate another set generated by a data mining tool. In order to generate rules, a support counter is needed; consequently, the authors proposed an adequate data structure which provides fast access: prefix trees.

Toivonen et al. proposed in [9] a novel technique for redundancy reduction based on rule covers. The notion of rule cover is defined as the subset of a rule set describing the same database transaction set as the rule set. Thus, the authors developed an algorithm to efficiently extract a rule cover out of a set of given rules.

The notion of subsumed rules, discussed in [23], describes a set of rules having the same consequent and several additional conditions in the antecedent regarding a certain rule. Bayardo, Jr., et al. [24] proposed a new pruning measure (Minimum Improvement) described as the difference between the confidences of two rules in a specification/generalization relationship. The specific rule is pruned if the proposed measure is less than a prespecified threshold, so the rule does not bring more information compared to the general one.

Nevertheless, both closed and maximal itemset mining still break down at low support thresholds. To address these limitations, Omiecinski proposed in [25] three new important interestingness measures: any-confidence, all-confidence, and bond. All these measures are indicators of the degree of relatedness between the items in an association. The most interesting one, all-confidence, introduced as an alternative to support, represents the minimum confidence of all association rules extracted from an itemset. Bond is also similar to support, but with respect to a subset of the data rather than the entire database.

4.3 User-Driven Association Rule Mining

Interestingness measures were proposed in order to discover only those association rules that are interesting according to these measures. They have been divided into objective measures and subjective measures. Objective measures depend only on data structure. Many survey papers summarize and compare the objective measure definitions and properties [26], [27]. Unfortunately, being restricted to data evaluation, the objective measures are not sufficient to reduce the number of extracted rules and to capture the interesting ones. Several approaches integrating user knowledge have been proposed.

In addition, subjective measures were proposed to integrate explicitly the decision-maker knowledge and to offer a better selection of interesting association rules. Silberschatz and Tuzhilin [3] proposed a classification of subjective measures in unexpectedness (a pattern is interesting if it is surprising to the user) and actionability (a pattern is interesting if it can help the user take some actions).

As early as 1994, in the KEFIR system [28], the key finding and deviation notions were suggested. Grouped in findings, deviations represent the difference between the actual and the expected values. KEFIR defines interestingness of a key finding in terms of the estimated benefits, and potential savings of taking corrective actions that restore the deviation back to its expected value. These corrective actions are specified in advance by the domain expert for various classes of deviations.

Later, Klemettinen et al. [29] proposed templates to describe the form of interesting rules (inclusive templates) and not interesting rules (restrictive templates). The idea of using templates for association rule extraction was reused in [30]. Other approaches proposed to use a rule-like formalism to express user expectations [3], [12], [31], and the discovered association rules are pruned/summarized by comparing them to user expectations.

Imielinski et al. [32] proposed a query language for association rule pruning based on SQL, called M-SQL. It allows imposing constraints on the condition and/or the consequent of the association rules. In the same domain of query-based association rule pruning, but more constraints-driven, Ng et al. [33] proposed an architecture for exploratory mining of rules. The authors suggested a set of solutions for several problems: the lack of user exploration and control, the rigid notion of relationship, and the lack of focus. In order to overcome these problems, Ng et al. proposed a new query language called Constrained Association Query and they pointed out the importance of user feedback and user flexibility in choosing interestingness metrics.

Another related approach was proposed by An et al. in [34], where the authors introduced domain knowledge in order to prune and summarize discovered rules. The first algorithm uses a data taxonomy, defined by the user, in order to describe the semantic distance between rules, and in order to group the rules. The second algorithm groups the discovered rules that share at least one item in the antecedent and the consequent.

In 2007, a new methodology was proposed in [35] to prune and organize rules with the same consequent. The authors suggested transforming the database into an association rule base in order to extract second-level association rules. Called metarules, the extracted rules $r_1 \rightarrow r_2$ express relations between the two association rules and help pruning/grouping discovered rules.

4.4 Ontologies in Data Mining

In knowledge engineering and Semantic Web fields, ontologies have interested researchers since their first proposition in the philosophy branch by Aristotle. Ontologies have evolved over the years from controlled vocabularies to thesauri (glossaries), and later, to taxonomies [36].

In the early 1990s, an ontology was defined by Gruber as a formal, explicit specification of a shared conceptualization [37]. By conceptualization, we understand here an abstract model of some phenomenon described by its important concepts. The formal notion denotes the idea that machines should be able to interpret an ontology. Moreover, explicit refers to the transparent definition of ontology elements. Finally, shared outlines that an ontology brings together some knowledge common to a certain group, and not individual knowledge.

Several other definitions are proposed in the literature. For instance, in [38], an ontology is viewed as a logical theory
accounting for the intended meaning of a formal vocabulary, and
later, in 2001, Maedche and Staab proposed a more
artificial-intelligence-oriented definition. Thus, ontologies
are described as (meta)data schemas, providing a controlled
vocabulary of concepts, each with an explicitly defined and
machine processable semantics [16].
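To make the ontology notion more tangible, the following Python sketch encodes a tiny supermarket ontology in the spirit of Definition 6 and the example of Section 3: concepts linked by is-a edges, plus two concepts defined through Boolean properties. The intermediate class names (Dairy, Meat), the shape of the hierarchy, the function names, and the matching criterion in `matches_rpc` are all assumptions made for illustration; only the item names and the diet/ecological assignments come from the text.

```python
# Hypothetical is-a edges: concept -> set of direct parent concepts (the DAG H).
PARENTS = {
    "Fruit": {"FoodItem"}, "Dairy": {"FoodItem"}, "Meat": {"FoodItem"},
    "apple": {"Fruit"}, "grape": {"Fruit"}, "pear": {"Fruit"},
    "milk": {"Dairy"}, "cheese": {"Dairy"},
    "beef": {"Meat"}, "chicken": {"Meat"},
}
# Leaf concepts (database items): concepts that are nobody's parent.
LEAVES = set(PARENTS) - set().union(*PARENTS.values())

# Boolean data properties from the Section 3 example.
IS_DIET = {"apple", "chicken"}
IS_ECOLOGICAL = {"milk", "grape", "beef"}


def subsumes(general, specific):
    """True if `specific` reaches `general` through zero or more is-a edges."""
    stack, seen = [specific], set()
    while stack:
        concept = stack.pop()
        if concept == general:
            return True
        if concept not in seen:
            seen.add(concept)
            stack.extend(PARENTS.get(concept, ()))
    return False


def items_of(concept):
    """Database items covered by a concept, including property-defined concepts."""
    if concept == "DietProducts":
        return set(IS_DIET)
    if concept == "EcologicalProducts":
        return set(IS_ECOLOGICAL)
    return {leaf for leaf in LEAVES if subsumes(concept, leaf)}


def matches_rpc(rule, antecedent_concept, consequent_concept):
    """Assumed criterion: each side of the rule holds an item of the matching concept."""
    lhs, rhs = rule
    return (bool(set(lhs) & items_of(antecedent_concept))
            and bool(set(rhs) & items_of(consequent_concept)))
```

For instance, `matches_rpc((("apple",), ("milk",)), "DietProducts", "EcologicalProducts")` holds under this assumed criterion, while a rule with pear in the antecedent does not, because pear is not flagged as a diet product. A plain taxonomy could answer `items_of("Fruit")` but has no way to express DietProducts, which cuts across the Fruit and Meat branches.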
Depending on the granularity, four types of ontologies are proposed in the literature: upper (or top level) ontologies, domain ontologies, task ontologies, and application ontologies [38]. Top-level ontologies deal with general concepts, while the other three types deal with domain-specific concepts.

Ontologies, introduced in data mining for the first time in the early 2000s, can be used in several ways [39]: Domain and Background Knowledge Ontologies, Ontologies for Data Mining Process, or Metadata Ontologies. Background Knowledge Ontologies organize domain knowledge and play important roles at several levels of the knowledge discovery process. Ontologies for Data Mining Process codify mining process description and choose the most appropriate task according to the given problem, while Metadata Ontologies describe the construction process of items.

In this paper, we focus on Domain and Background Knowledge Ontologies. The first idea of using Domain Ontologies was introduced by Srikant and Agrawal with the concept of Generalized Association Rules (GAR) [40]. The authors proposed taxonomies of mined data (an is-a hierarchy) in order to generalize/specify rules.

In [41], it is suggested that an ontology of background knowledge can benefit all the phases of a KDD cycle described in CRISP-DM. The role of ontologies is based on the given mining task and method, and on data characteristics. From business understanding to deployment, the authors delivered a complete example of using ontologies in a cardiovascular risk domain.

Related to Generalized Association Rules, the notion of raising was presented in [42]. Raising is the operation of generalizing rules (making rules more abstract) in order to increase support while keeping confidence high enough. This allows strong rules to be discovered and also yields sufficient support for rules that, before raising, would not have minimum support due to the particular items they referred to. The difference with Generalized Association Rules is that this solution proposes to use a specific level for raising and mining.

Another contribution, very close to [40], [41], uses taxonomies to generalize and prune association rules. The authors developed an algorithm, called GART [43], which, having several taxonomies over attributes, iteratively uses each taxonomy in order to generalize rules, and then, prunes redundant rules at each step.

A very recent approach, [44], uses ontologies in a preprocessing step. Several domain-specific and user-defined constraints are introduced and grouped into two types: pruning constraints, meant to filter uninteresting items, and abstraction constraints permitting the generalization of items toward ontology concepts. The data set is first preprocessed according to the constraints extracted from the ontology, and then, the data mining step takes place. The difference with our approach is that, first, they apply constraints in the preprocessing task, whereas we work in the postprocessing task. The advantage of the pruning constraints is that they make it possible to exclude from the start the information that the user is not interested in, so that the Apriori algorithm can be applied to this new database. Let us consider that the user is not sure about which items he/she should prune. In this case, he/she should create several pruning tests, and for each test, he/she will have to apply the Apriori algorithm, whose execution time is very high. Second, they use SeRQL in order to express user knowledge, and we propose a more expressive and flexible language for user expectation representation, i.e., Rule Schemas.

The item-relatedness filter was proposed by Natarajan and Shekar [45]. Starting from the idea that the discovered rules are generally obvious, they introduced the idea of relatedness between the items, measuring their similarity according to item taxonomies. This measure computes the relatedness of all the couples of rule items. Note that the relatedness can be computed for the items of the condition and/or the consequent, or between the condition and the consequent of the rule.

While Natarajan and Shekar measure the item-relatedness of an association rule, Garcia et al. developed in [46] and extended in [47] a novel technique called Knowledge Cohesion (KC). The proposed metric is composed of two new ones: Semantic Distance (SD) and Relevance Assessment (RA). SD measures how close two items are semantically, using the ontology, each type of relation being weighted differently. The numerical value RA expresses the interest of the user for certain pairs of items in order to encourage the selection of rules containing those pairs. In that work, the ontology is used only for the SD computation, differing from our approach which uses ontologies for Rule Schemas definition. Moreover, the authors propose a metric-based approach for itemset selection, while we propose a pruning/filtering schemas-based method of association rules.

5 DESCRIPTION OF THE ARIPSO FRAMEWORK

The proposed approach is composed of two main parts (as shown in Fig. 3). First, the knowledge base allows formalizing user knowledge and goals. Domain knowledge offers a general view over user knowledge in database domain, and user expectations express the prior user knowledge over the discovered rules. Second, the postprocessing task consists in applying iteratively a set of filters over extracted rules in order to extract interesting rules: minimum improvement constraint filter, item-relatedness filter, rule schema filters/pruning.

Fig. 3. Framework description.

The novelty of this approach resides in supervising the knowledge discovery process using two different conceptual structures for user knowledge representation: one or several
ontologies and several rule schemas generalizing general impressions, and proposing an iterative process.

Fig. 4. Interactive process description.

5.1 Interactive Postmining Process
The ARIPSO framework offers the user an interactive process of rule discovery, presented in Fig. 4. Taking his/her own feedback into account, the user is able to revise his/her expectations according to intermediate results. The framework suggests the following steps to the user:
1. ontology construction: starting from the database, and possibly from existing ontologies, the user develops an ontology on database items;
2. defining Rule Schemas (as GIs and RPCs): the user expresses his/her local goals and expectations concerning the association rules that he/she wants to find;
3. choosing the right operators to be applied over the rule schemas created, and then applying the operators;
4. visualizing the results: the filtered association rules are proposed to the user;
5. selection/validation: starting from these preliminary results, the user can validate the results or revise his/her information;
6. two filters already existing in the literature, detailed in Section 5.5, are offered to the user. These two filters can be applied over rules whenever the user needs them, with the main goal of reducing the number of rules; and
7. the interactive loop allows the user to revise the information that he/she proposed. Thus, he/she can return to step 2 in order to modify the rule schemas, or to step 3 in order to change the operators. Moreover, in the interactive loop, the user can decide to apply one of the two predefined filters discussed in step 6.

5.2 Improving General Impressions with Ontologies
One existing approach interests us in particular: the specification language proposed by Liu et al. [17]. The authors proposed to represent user expectations in terms of discovered rules using three levels of specification: General Impressions, Reasonably Precise Concepts (representing the user's vague feelings), and finally, his/her Precise Knowledge.
The authors developed a representation formalism which is very close to the association rule formalism, flexible enough, and comprehensible for the user. For the case of General Impressions, the authors proposed the following syntax:

$gi(\langle S_1, \ldots, S_m \rangle)\ [support, confidence],$

where $S_i$ is an element of an item taxonomy or an expression defined using the +/? operators, and the support and confidence thresholds are optional.
In the GI formalism, the user knows that a set of items is associated, but he/she does not know the direction of the implication: which items he/she would put in the antecedent, and which ones in the consequent. This is the main difference between GIs and RPCs: RPCs are able to describe a complete implication. PKs use the same formalism as the RPCs, adding obligatory constraints of support and confidence.
Moreover, the authors proposed to filter four types of rules: conforming rules, and rules that are unexpected concerning the antecedent and/or the consequent:
• Conforming rules: association rules that conform to the specified beliefs;
• Unexpected antecedent rules: association rules that are unexpected regarding the antecedent of the specified beliefs;
• Unexpected consequent rules: association rules that are unexpected regarding the consequent of the specified beliefs; and
• Both side unexpected rules: association rules that are unexpected regarding both the antecedent and the consequent of the specified beliefs.
To improve association rule selection, we propose a new rule filtering model, called Rule Schemas (RS). A rule schema describes, in a rule-like formalism, the user expectations in terms of interesting/obvious rules. As a result, Rule Schemas act as a rule grouping, defining rule families.
The Rule Schema formalism is based on the specification language for user knowledge introduced by Liu et al. [12]. The model proposed by Liu et al. is described using elements from an item taxonomy allowing an is-a organization of database attributes. Using item taxonomies has many advantages: the representation of user expectations is more general, and thus, filtered rules are more interesting for the user.
However, a taxonomy of items might not be enough. The user might want to use concepts that are more expressive and accurate than generalized concepts and that result from relationships other than the is-a relation (e.g., IsEcological, IsCookedWith). This is why we consider the use of ontologies more appropriate. An ontology includes the features of taxonomies but adds more representation power. In a taxonomy, the means for subject description consist essentially of one relationship: the subsumption relationship used to build the hierarchy. The set of items is open, but the language used to describe them is closed [48], since it uses a single relationship (the subsumption). Thus, a taxonomy is simply a hierarchical categorization or classification of items in a domain. In contrast, an ontology is a specification of several characteristics of a domain, defined using an open vocabulary.
In addition, it is difficult for a domain expert to know exactly the support and confidence thresholds for each rule schema proposed, because of their statistical definition. That is why we consider that using Precise Knowledge in user expectation representation might be useless. Thus, we propose to improve only two of the three representations
introduced in [12]: General Impressions and Reasonably Precise Concepts.
Therefore, Rule Schemas bring the expressiveness of ontologies to the postprocessing task of association rules, combining not only item constraints, but also ontology concept constraints.
Definition 7. A Rule Schema expresses the fact that the user expects certain elements to be associated in the extracted association rules. This can be expressed as

$RS(\langle X_1, \ldots, X_n\ (\rightarrow)\ Y_1, \ldots, Y_m \rangle),$

where $X_i, Y_j \in C$ of $O = \{C, R, I, H, A\}$ and the implication "$\rightarrow$" is optional. In other words, the proposed formalism combines General Impressions and Reasonably Precise Concepts. If we use the formalism as an implication, an implicative Rule Schema is defined, extending the RPC. On the other hand, if we do not keep the implication, we define nonimplicative Rule Schemas, generalizing GI.
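As an illustration only, Definition 7 could be held in a small data structure. The sketch below is not the authors' implementation: the class name `RuleSchema` and its fields are hypothetical, and the concept names BioProducts and EcologicalProducts are borrowed from the paper's supermarket example. A schema is implicative when the optional arrow is kept, and nonimplicative otherwise.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RuleSchema:
    """Hypothetical container for RS(<X1, ..., Xn (->) Y1, ..., Ym>)."""
    antecedent: Tuple[str, ...]        # ontology concepts X1..Xn
    consequent: Tuple[str, ...] = ()   # ontology concepts Y1..Ym
    implicative: bool = False          # True when the optional "->" is kept

    def concepts(self) -> Tuple[str, ...]:
        # All concepts of the schema, regardless of direction.
        return self.antecedent + self.consequent

    def __str__(self) -> str:
        if self.implicative:
            body = ", ".join(self.antecedent) + " -> " + ", ".join(self.consequent)
        else:
            body = ", ".join(self.concepts())
        return f"RS(<{body}>)"


# Nonimplicative schema (generalizes a GI) and implicative schema (extends an RPC).
rs1 = RuleSchema(("BioProducts", "EcologicalProducts"))
rs2 = RuleSchema(("BioProducts",), ("EcologicalProducts",), implicative=True)
print(rs1)
print(rs2)
```

Keeping the direction as an explicit flag, rather than inferring it from an empty consequent, is one possible design choice; it makes the two families of schemas easy to tell apart when operators are applied later.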
Example. Let us consider the taxonomy in Fig. 1. Let us develop an ontology based on this taxonomy containing the description of the concepts BioProducts and EcologicalProducts, as shown in Fig. 2. Thus, we can define a nonimplicative Rule Schema:

$RS_1(\langle BioProducts, EcologicalProducts \rangle),$

and an implicative Rule Schema:

$RS_2(\langle BioProducts \rightarrow EcologicalProducts \rangle).$

5.3 Ontology Description
Domain knowledge, defined as the user information concerning the database, is described in our framework using ontologies.
Compared to the taxonomies used in the specification language proposed in [12], ontologies offer a more complex knowledge representation model by extending the single is-a relation of a taxonomy with the set R of relations. In addition, the axioms bring important improvements, permitting concepts to be defined starting from information already in the ontology.
In this scenario, it is fundamental to connect the ontology concepts C of $O = \{C, R, I, H, A\}$ to the database, each one of them being connected to one or several items of I. To this end, we consider three types of concepts: leaf-concepts, generalized concepts from the subsumption relation ($\leq$) in H of O, and restriction concepts, which only ontologies provide.
In order to proceed with the definition of each type of concept, let us recall that a set of items in a database is defined as $I = \{i_1, i_2, \ldots, i_n\}$.
The leaf-concepts ($C_0$) are defined as

$C_0 = \{c_0 \in C \mid \not\exists c' \in C,\ c' \leq c_0\}.$

They are connected to the database in the simplest way: each concept from $C_0$ is associated with one item in the database:

$f_0 : C_0 \rightarrow I,\ \forall c_0 \in C_0,\ \exists i \in I,\ i = f_0(c_0).$

Generalized concepts ($C_1$) are described as the concepts that subsume other concepts in the ontology. A generalized concept is connected to the database through its subsumed concepts. This means that, recursively, only the leaf-concepts subsumed by the generalized concept contribute to its database connection:

$f : C_1 \rightarrow 2^I,$
$\forall c_1 \in C_1,\ f(c_1) = \bigcup_{c_0 \in C_0} \{ i = f_0(c_0) \mid c_0 \leq c_1 \}.$

Restriction concepts are described using logical expressions defined over items and are organized in the $C_2$ subset. In a first attempt, we base the description of the concepts on restrictions over properties available in description logics. Thus, a restriction concept defined in this way can be connected to a disjunction of items.

Fig. 5. Ontology description.

Example. In order to explain restriction concepts, let us consider the database presented in Fig. 5 and described by three transactions. Moreover, let us consider the ontology presented in the same figure as being the ontology constructed over the items of the database and described as follows. The concepts of the ontology are

$\{FoodItems, Fruits, DailyProducts, Meat, DietProducts, EcologicalProducts, \ldots\}.$

And the three types of concepts are:

LeafConcepts: $\{grape, pear, apple, milk, cheese, butter, beef, chicken, pork\}$,
GeneralizedConcepts: $\{Fruits, DailyProducts, Meat, FoodItems\}$,
RestrictionConcepts: $\{DietProducts, EcologicalProducts\}$.

Two data properties are also integrated in order to define whether a product is useful for a diet, or is ecological. For example, the DietProducts restriction concept is described using description logics language by

$DietProducts \equiv FoodItems \sqcap \exists isDiet.TRUE$
defining all food items that have the Boolean property isDiet set to TRUE. For our example, isDiet is instantiated as follows:

$isDiet : \{(apple, TRUE), (chicken, TRUE)\}.$

Now, we are able to connect the ontology and the database. As already presented, leaf-concepts are connected to items in a very simple way; for example, the concept grape is connected to the item of the same name, $f_0(grape) = grape$.
In contrast, the generalized concept Fruits is connected through its three subsumed concepts:

$f(Fruits) = \{grape, pear, apple\}.$

Similarly, we can describe the connection for the other concepts.
More interestingly, the restriction concept DietProducts is connected through those concepts satisfying the restrictions in the definition of the concept. Thus, DietProducts is connected through the concepts apple and chicken:

$f(DietProducts) = \{apple, chicken\}.$

5.4 Operations over Rule Schemas
The rule schema filter is based on operators applied over rule schemas, allowing the user to perform several actions over the discovered rules. We propose two important operators: pruning and filtering operators. The filtering operator is composed of three different operators: conforming, unexpectedness, and exception. We propose to reuse the operators proposed by Liu et al. (conforming and unexpectedness), and we bring two new operators to the postprocessing task: pruning and exceptions.
These four operators are presented in this section. To this end, let us consider an implicative rule schema $RS_1 : (\langle X \rightarrow Y \rangle)$, a nonimplicative rule schema $RS_2 : (\langle U, V \rangle)$, and an association rule $AR_1 : A \rightarrow B$, where X, Y, U, and V are ontology concepts, and A and B are itemsets.
Definition 8. Let us consider an ontology concept C associated in the database to $f(C) = \{y_1, \ldots, y_n\}$ and an itemset $X = \{x_1, \ldots, x_k\}$. We say that the itemset X is conforming to the concept C if $\exists y_i,\ y_i \in X$.
Pruning. The pruning operator allows the user to remove families of rules that he/she considers uninteresting. In databases, there exist, in most cases, relations between items that we consider obvious or that we already know. Thus, it is not useful to find these relations among the discovered associations. The pruning operator applied over a rule schema, P(RS), eliminates all association rules matching the rule schema. To extract all the rules matching a rule schema, the conforming operator is used.
Conforming. The conforming operator applied over a rule schema, C(RS), confirms an implication or finds the implication between several concepts. As a result, rules matching all the elements of a nonimplicative rule schema are filtered. For an implicative rule schema, the condition and the conclusion of the association rule should match those of the schema.
Example. The rule $AR_1$ is selected by the operator $C(RS_1)$ if the condition and the conclusion of the rule $AR_1$ are conforming to the condition and, respectively, the conclusion of $RS_1$. Translating this description into the ontological definition of concepts means that $AR_1$ is conforming to $RS_1$ if the itemset A is conforming to the concept X and if the itemset B is conforming to the concept Y.
Similarly, rule $AR_1$ is filtered by $C(RS_2)$ if the condition and/or the conclusion of the rule $AR_1$ are conforming to the schema $RS_2$. In other words, if the itemset $A \cup B$ is conforming to the concept U and the itemset $A \cup B$ is conforming to the concept V, then the rule $AR_1$ is conforming to the nonimplicative rule schema $RS_2$.
Unexpectedness. Of higher interest for the user, the unexpectedness operator U(RS) filters a set of rules with a surprise effect for the user. This type of rule interests the user more than the conforming ones since, generally, a decision-maker seeks to discover new knowledge with regard to his/her prior knowledge.
Moreover, several types of unexpected rules can be filtered according to the rule schema: rules unexpected regarding the antecedent ($U_p$), rules unexpected regarding the consequent ($U_c$), and rules unexpected regarding both sides ($U_b$).
For instance, let us consider that the operator $U_p(RS_1)$ extracts the rule $AR_1$, which is unexpected according to the condition of the rule schema $RS_1$. This is possible if the rule consequent B is conforming to the concept Y, while the condition itemset A is not conforming to the concept X.
In a similar way, we define the two other unexpectedness operators.
Exceptions. Finally, the exception operator is defined only over implicative rule schemas (e.g., $RS_1$) and extracts conforming rules with respect to the following new implicative rule schema: $X \wedge Z \rightarrow \neg Y$, where Z is a set of items.
Example. Let us consider the implicative rule schema $RS : Fruits \rightarrow EcologicalProducts$, where

$f(Fruits) = \{grape, apple, pear\}$

and

$f(EcologicalProducts) = \{grape, milk\},$

and $I = \{grape, apple, pear, milk, beef\}$ (see Fig. 1 for the supermarket taxonomy). Also, let us consider that the following set of association rules is extracted by traditional techniques:

$R_1 : grape, beef \rightarrow milk, pear;$
$R_2 : apple \rightarrow beef;$
$R_3 : apple, pear, milk \rightarrow grape;$
$R_4 : grape, pear \rightarrow apple;$
$R_5 : beef \rightarrow grape;$
$R_6 : milk, beef \rightarrow grape.$

Thus, the operator C(RS) filters the rules $R_1$ and $R_3$, the operator $U_p(RS)$ filters the rules $R_5$ and $R_6$, and the operator $U_c(RS)$ filters the rules $R_2$ and $R_4$. The pruning operator P(RS) prunes the rules selected by the conforming operator C(RS). Let us explain the operator $U_c(RS)$: the $U_c$ operator filters the rules whose conclusion itemset is
not conforming to the conclusion concept of the RS (EcologicalProducts) and whose condition itemset is conforming to the condition concept of the RS (Fruits). The rule $R_4$ is filtered by $U_c(RS)$ because the itemset $\{apple\}$ does not contain an item corresponding to the EcologicalProducts concept, $apple \notin f(EcologicalProducts)$, and because the itemset $\{grape, pear\}$ contains at least one item corresponding to the Fruits concept, $pear \in f(Fruits)$.

TABLE 1. Examples of Questions and Meaning
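The example above can be replayed with a few lines of code. The following is a minimal sketch under stated assumptions, not the authors' tool: the function names (`conforms`, `classify`, `select`, `prune`) are hypothetical, conformity follows Definition 8 (a non-empty intersection between the itemset and f(C)), and the both-sides case $U_b$ is assumed to mean that neither side conforms, since the text only says it is defined in a similar way.

```python
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (condition itemset, conclusion itemset)

# Concept-to-items mapping f(.) taken from the example.
F: Dict[str, FrozenSet[str]] = {
    "Fruits": frozenset({"grape", "apple", "pear"}),
    "EcologicalProducts": frozenset({"grape", "milk"}),
}

# The six association rules R1..R6 of the example.
RULES: Dict[str, Rule] = {
    "R1": (frozenset({"grape", "beef"}), frozenset({"milk", "pear"})),
    "R2": (frozenset({"apple"}), frozenset({"beef"})),
    "R3": (frozenset({"apple", "pear", "milk"}), frozenset({"grape"})),
    "R4": (frozenset({"grape", "pear"}), frozenset({"apple"})),
    "R5": (frozenset({"beef"}), frozenset({"grape"})),
    "R6": (frozenset({"milk", "beef"}), frozenset({"grape"})),
}


def conforms(itemset: FrozenSet[str], concept: str) -> bool:
    """Definition 8: the itemset contains at least one item of f(concept)."""
    return bool(itemset & F[concept])


def classify(rule: Rule, x: str, y: str) -> str:
    """Label a rule against the implicative schema x -> y."""
    condition, conclusion = rule
    cond_ok = conforms(condition, x)
    concl_ok = conforms(conclusion, y)
    if cond_ok and concl_ok:
        return "C"    # conforming
    if concl_ok:
        return "Up"   # unexpected regarding the antecedent
    if cond_ok:
        return "Uc"   # unexpected regarding the consequent
    return "Ub"       # assumption: neither side conforms


def select(operator: str, x: str, y: str) -> List[str]:
    """Names of the rules selected by the given operator."""
    return [name for name, rule in RULES.items() if classify(rule, x, y) == operator]


def prune(x: str, y: str) -> List[str]:
    """P(RS): keep every rule except those selected by the conforming operator."""
    return [name for name, rule in RULES.items() if classify(rule, x, y) != "C"]


for op in ("C", "Up", "Uc"):
    print(op, select(op, "Fruits", "EcologicalProducts"))
print("after P(RS):", prune("Fruits", "EcologicalProducts"))
```

With the mapping and rules of the example, this sketch selects the same rules as stated in the text for C(RS), $U_p(RS)$, and $U_c(RS)$.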

5.5 Filters
In order to reduce the number of rules, three filters are integrated in the framework: operators applied over rule schemas, the minimum improvement constraint filter [24], and the item-relatedness filter [45].
The minimum improvement constraint filter [24] (MICF) selects only those rules whose confidence exceeds the confidence of each of their simplifications by at least minimp.
Example. Let us consider the following three association rules:

$grape, pear \rightarrow milk$ (Confidence = 85%),
$grape \rightarrow milk$ (Confidence = 90%),
$pear \rightarrow milk$ (Confidence = 83%).

We can note that the last two rules are simplifications of the first one. The theory of Bayardo et al. tells us that the first rule is interesting only if its confidence improves on the confidence of all its simplifications. In our case, the first rule (85 percent) does not improve on the 90 percent confidence of the best of its simplifications (the second rule), so it is not considered an interesting rule, and it is not selected.
The item-relatedness filter (IRF) was proposed by Shekar and Natarajan [45]. Starting from the idea that the discovered rules are generally obvious, they introduced the idea of relatedness between items, measuring their semantic distance in item taxonomies. This measure computes the relatedness of all the pairs of rule items. The relatedness can be computed among the items of the condition and/or of the consequent, or between the condition and the consequent of the rule.
In our approach, we use the last type of item-relatedness because users are interested in finding associations between itemsets with different functionalities, coming from different domains. This measure is computed as the minimum distance between the condition items and the consequent items, as presented hereafter. The distance between each pair of items from the condition and, respectively, the consequent is computed as the minimum path that connects the two items in the ontology, denoted $d(a, b)$. Thus, the item-relatedness (IR) for a rule is defined as the minimum of all the distances computed between the items in the condition and the consequent:

$RA_1 : A \rightarrow B,$
$IR(RA_1) = \mathrm{MIN}(d_{ij}(a_i, b_j)),\ \forall a_i \in A \text{ and } b_j \in B.$

Example. Let us consider the ontology in Fig. 2. For the association rule $R_1$, we can define the item-relatedness as follows:

$R_1 : grape, pear, butter \rightarrow milk;$
$IR(R_1) = \min(d(grape, milk), d(pear, milk), d(butter, milk)) = \min(4, 4, 2) = 2.$

6 EXPERIMENTAL STUDY
This study is based on a questionnaire database, provided by Nantes Habitat,2 dealing with customer satisfaction concerning accommodation. The database consists of an annual study (since 2003) performed by Nantes Habitat on a sample of 1,500 out of a total of 50,000 customers.
The questionnaire consists of 67 different questions with four possible answers expressing the degree of satisfaction: very satisfied, quite satisfied, rather not satisfied, and dissatisfied, coded as {1, 2, 3, 4}.
Table 1 introduces a question sample with the meaning of each question. For instance, the item q1 = 1 describes that a customer is very satisfied by the transport in his district (q1 = “Is your district transport practical?”).
In order to target the most interesting rules, we fixed a minimum support of 2 percent, a maximum support of 30 percent, and a minimum confidence of 80 percent for the association rule mining process. Among the available algorithms, we use the Apriori algorithm to extract association rules, and 358,072 rules are discovered.
For example, the following association rule describes the relationship between questions q2, q3, q47, and the question q70. Thus, if the customers are very satisfied by the access to the city center (q2), the shopping facilities (q3), and the apartment ventilation (q47), then they can be satisfied by the documents received from Nantes Habitat Agency (q70) with a confidence of 85.9 percent:

R1: q2 = 1 q3 = 1 q47 = 1 ==> q70 = 1,
Support = 15.2%, Confidence = 85.9%.

6.1 Ontology Structure and Ontology-Database Mapping
In the first step of the interactive process described in Section 5.1, the user develops an ontology on database items. In our case, starting from the database attributes, the ontology was created by the Nantes Habitat expert. During several sessions, we discussed the database attributes with the expert and asked her to classify them.

2. http://www.nantes-habitat.fr/.
Moreover, we found other interesting information by asking her to develop her expectations and knowledge connected to database attributes. In this section, we present the development of the ontology in our case study.

Fig. 6. Ontology structure visualized with Jambalaya Protégé plug-in.

6.1.1 Conceptual Structure of the Ontology
To describe the ontology, we propose to use the Semantic Web representation language OWL-DL [49]. Based on description logics, the OWL-DL language permits, along with the ontological structure, the creation of restriction concepts using necessary and sufficient conditions over other concepts. Also, we use the Protégé [50] software to edit the ontology and validate it. The Jambalaya [51] environment was used for ontology graph exploration.
During several exchanges with us, the Nantes Habitat expert developed an ontology composed of two main parts, a sample being presented in Fig. 6. The ontology has seven depth levels and a total of 130 concepts, among which 113 are primitive concepts and 17 are restriction concepts. Concerning siblings, the concepts have a mean of six child concepts, with a maximum of 13 child concepts. Moreover, two data properties are introduced.
The first part of the ontology is a database item organization with the root defined by the Attribute concept, grouping 113 subsumed concepts. The items are organized according to the question topics in the Nantes Habitat questionnaire. For instance, the District concept groups 14 questions (from q1 to q14) concerning the facilities and the quality of life in a district.
The second hierarchy, Topics, groups all 17 restriction concepts created by the expert using necessary and sufficient conditions over primitive concepts.
Moreover, the subsumption relation ($\leq$) is completed by the relation hasAnswer, associating the Attribute concepts with an integer from {1, 2, 3, 4} and simulating the attribute-value relation in the database.
For instance, let us consider the restriction concept SatisfactionDistrict. In natural language, it expresses the satisfaction answers of clients to the questions concerning the district. In other words, an item is instantiated by the SatisfactionDistrict concept if it represents a question between q1 and q14, subsumed by the District concept, with a satisfied answer (1 or 2). The SatisfactionDistrict restriction concept is described using description logics language by:

$SatisfactionDistrict \equiv District \sqcap \exists hasAnswer.1\ \mathrm{OR}\ hasAnswer.2$

Fig. 7. Restriction concept construction using necessary and sufficient conditions in Protégé.

6.1.2 Ontology-Database Mapping
As a part of rule schemas, ontology concepts are mapped to database items. Thus, several connections between ontology and database can be designed. Due to implementation requirements, the ontology and the database are mapped through instances.
The ontology-database connection is made manually by the expert. In our case, with the 67 attributes and four values, the expert did not meet any problems in making the connection, but we agree that for large databases, a manual connection could be very time-consuming. That is why integrating an automatic ontology construction plug-in in our tool is one of our principal perspectives.
Thus, using the simplest ontology-database mapping, the expert directly connected one instance of the ontology to an item (semantically, the nearest one). For example, the expert connected the instance Q11_1 to the item (q11 = 1): $f_0(Q11\_1) = (q11 = 1)$.
Then, leaf concepts ($C_0$) of the Attribute hierarchy were connected by the expert to a set of items (semantically, the nearest ones). Considering the concept Q11 of the ontology, it is associated with the attribute q1 = “Are you satisfied with the transport in your district?” Furthermore, the concept Q11 has two instances describing the question q11 with two possible answers: 1 and 3. Let us consider that the concept Q11 was connected by the expert to two items as follows: $f(Q11) = \{f_0(Q11\_1), f_0(Q11\_3)\} = \{q11 = 1, q11 = 3\}$. The connection of generalized concepts follows the same idea.
A second type of connection involves connecting concepts of the Topics hierarchy to the database. Let us consider the restriction concept DissastisfactionCalmDistrict (Fig. 7). In natural language, it is defined as all the concepts subsumed by CalmDistrict (connected to questions q8, q9, q10, and q11) and with a dissatisfied answer.
The DissastisfactionCalmDistrict restriction concept is described by the expert using description logics language by:

$DissastisfactionCalmDistrict \equiv CalmDistrict \sqcap \exists hasAnswer.3\ \mathrm{OR}\ hasAnswer.4.$

Considering that the user has instantiated the concept Q8 with answer 3, and the concept Q11 with the answers 1 and 3, then the concept DissastisfactionCalmDistrict is connected in the database as follows:
$f(DissastisfactionCalmDistrict) = \{f_0(Q8\_3), f_0(Q11\_3)\} = \{q8 = 3, q11 = 3\}.$

TABLE 2. Pruning Rule Schemas
TABLE 3. Filtering Rule Schemas
TABLE 4. Pruning Rate for Each Filter Combination
TABLE 5. Notation Meaning

6.2 Results
Example 1. This first example presents the efficiency of our new approach concerning the reduction of the number of rules. To this end, we ask the expert to test the four filters: on the one hand, the pruning filters (MICF, IRF, and pruning rule schemas), and on the other hand, the selection filters (rule schema filters); the meanings of the acronyms are given in Table 5. The expert could use each filter separately and in several combinations in order to compare the results and validate them.
Hence, the expert proposed a set of pruning rule schemas (Table 2) and a set of filtering rule schemas (Table 3). She constructed these rule schemas during several meetings spent testing the new tool and analyzing the generated results.
At the beginning, the expert is faced with the whole set of 358,072 extracted association rules. In a first attempt, we focus on pruning filters. If the MICF is applied, all the specialized rules not improving confidence are pruned. In Table 4, we can see that the MICF prunes 92.3 percent of rules, being a very efficient filter for redundancy pruning. In addition, IRF prunes 71 percent of rules, these rules involving items that are semantically close. The third pruning filter, Pruning Rule Schemas, prunes 43 percent of rules.
We propose to compare the three pruning filters and the combinations of the pruning filters, as presented in Table 4. The first column is the reference for our experiments. The rates of rules remaining after the three filters are used separately are presented in columns 2, 3, and 4. We can note that the MICF filter is the most discriminatory, pruning 92.3 percent of rules, compared to the other two, which prune 71 percent and 43 percent of rules, respectively.
We can also note that combining the first two filters, MICF and IRF, prunes more than combining the first one with the third one. Nevertheless, applying the three filters over the set of association rules yields a rule reduction of 96.3 percent.
However, even after applying the most reducing combination, number 8 (Table 4), the expert would have to analyze 13,382 rules, which is impossible manually. Thus, other filters should be applied. The expert was interested in the dissatisfaction phenomena, represented by answers 3 and 4 in the questionnaire. The expert is interested in applying all the rule schemas with the corresponding operator (Table 3) for each combination of the first three filters presented in Table 4. Table 6 presents the number of rules filtered by each rule schema.
In Table 6, the first column, Nb, identifies each filter combination as denoted in Table 4. We can note that the rule schema filters are very efficient. Moreover, studying the dissatisfaction of the clients improves the filtering power of the rule schemas.
Let us consider the second rule schema. Applied over the initial set of 358,072 association rules with the conforming operator, it filters 1,008 rules, representing 0.28 percent of the complete set. But it is obviously very difficult for an expert to analyze a set of rules of the order of thousands of rules. Thus, we can note the importance of the pruning filters, the set of rules extracted in each case having fewer than 500 rules. We can also note that the IRF filter is more powerful than the other pruning filters, and that combining two filters at the same time gives remarkable results:
• on the fifth line, combining MICF with IRF reduces the number of rules to 77 rules;
• combining IRF with pruning using Rule Schemas reduces the set of rules to three rules; and
• in the last two rows, the filters have the same results. We can explain this by the fact that we are working on an incomplete set of rules because of the maximum support threshold that we impose in the mining process.
It is very important to note that the quality of the selected rules was certified by the Nantes Habitat expert.
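The percentages quoted in Example 1 can be cross-checked from the rule counts given in the text. The short sketch below is only an arithmetic check; the helper name `share` is hypothetical, and it assumes that combination number 8 of Table 4 (13,382 remaining rules) is the one that applies all three pruning filters.

```python
TOTAL_RULES = 358_072  # rules discovered by Apriori in the experiment


def share(n_rules: int, total: int = TOTAL_RULES) -> float:
    """Percentage of the complete rule set represented by n_rules."""
    return 100.0 * n_rules / total


# Rules selected by the second rule schema with the conforming operator.
selected_by_rs2 = share(1_008)

# Reduction achieved when 13,382 rules remain (assumed: all three pruning filters).
reduction_all_filters = 100.0 - share(13_382)

print(f"{selected_by_rs2:.2f}% selected, {reduction_all_filters:.1f}% pruned")
```

Under this assumption, the computed values agree with the 0.28 percent and 96.3 percent figures reported in the text.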
TABLE 6. Rates for Rule Schema Filters Applied after the Other Three Filter Combinations

Example 2. This second example is proposed in order to outline the quality of the filtered rules, and to confirm the importance of interactivity in our framework. To this end, we present the sequence of steps (Fig. 8) performed by the expert during the interactive process, steps already described in Section 5.1. We have already presented the first step of the interactive process (ontology construction) in Section 6.1.

Fig. 8. Description of the interactive process during the experiment.

As in the first example, the expert is faced with the whole set of rules. In a first attempt (steps 2 and 3), she proposed to investigate the quality of the rules filtered by two of the rule schemas, $RS_2$ and $RS_3$, with the conforming operator. The first one deals with dissatisfaction concerning the tranquility in the district, and the second one searches for rules associating dissatisfaction with price with dissatisfaction concerning the common areas of the building.
Applying these two schemas to the whole rule set, an important selection is made:
• $C(RS_2)$ filters 1,008 association rules; and
• $C(RS_3)$ filters 96 association rules.
The expert is in the visualization and validation steps (4 and 5), and she analyzes the 96 rules filtered by $C(RS_3)$, because of the reduced number of rules compared to the 1,008 filtered by $C(RS_2)$. For example, let us consider the following set of association rules:

q17 = 4, q26 = 4, q97 = 4 ==> q28 = 4   C = 92.8%  S = 2.6%
q16 = 4, q17 = 4, q26 = 4, q97 = 4 ==> q28 = 4   C = 92.5%  S = 2.5%
q15 = 4, q17 = 4, q97 = 4 ==> q28 = 4   C = 80.5%  S = 1.9%
q15 = 4, q17 = 4, q97 = 4 ==> q26 = 4, q28 = 4   C = 80.5%  S = 1.9%.

The expert noted that the second rule is a specialization of the first rule (the item q16 = 4 is added in the antecedent), and she also noted that its confidence is lower than the confidence of the more general rule. Thus, the second rule does not bring important information to the whole set of rules; hence, it can be pruned. In the same way, the expert noted that the fourth rule is a specialization of the third one, and the confidence is not improved in this case either. The expert decided to modify her initial information (step 5) and to go back to the beginning of the process via the interactivity loop (step 7), choosing to apply the MICF (step 6), which extracts 27,602 rules. The expert decided to keep these results (steps 4 and 5) and to return in the interactivity loop, going back to steps 2 and 3 in order to redefine rule schemas and operators.
This time the expert proposed to use only the rule schema $C(RS_3)$, as a consequence of the high volume of rules extracted by the other one. Using $C(RS_3)$, 50 rules are filtered, and the presence of rules 1 and 3 and the absence of rules 2 and 4 (from the set presented above) validate the use of MICF (steps 4 and 5). Moreover, the high reduction in the number of rules validates the application of $C(RS_3)$. In this state, the expert returned to step 2 in order to modify the rule schema, proposing $RS_4$; first, she applied the operator for unexpectedness regarding the antecedent, $U_p(RS_4)$, and then she returned to step 3 in order to modify the operator, choosing the exception operator $E(RS_4)$. These results are briefly presented in Table 6, but due to space limits, they are not detailed in this section.
The expert analyzed the 50 rules extracted by $C(RS_3)$ and found several trivial implications, noting that the implication between several items did not interest her. For instance, let us consider the following set of rules:

q17 = 4, q97 = 4 ==> q16 = 4   C = 86.7%  S = 3.5%
q25 = 4, q28 = 4, q97 = 4 ==> q26 = 4   C = 100%  S = 2.0%.

These rules involve items from EntryHall and CloseSurrounding; thus, the expert proposed to apply rule schemas $RS_5$ to $RS_8$ with the pruning operator (steps 2 and 3) in order to prune those uninteresting rules. As a consequence, 15 rules are extracted, and the absence of the above rules validates the application of pruning rule schemas (steps 4 and 5). Let us consider the following two rules:

q28 = 4, q97 = 4 ==> q17 = 4   C = 81.1%  S = 2.9%
q8 = 4, q16 = 4, q97 = 4 ==> q9 = 4   C = 88.6%  S = 2.1%.

The expert noted that a great part of the 15 rules are implications between attributes subsumed by the same concept in the ontology. For instance, the attributes q28 and q17 of the first rule, described by the Q28 and the
Q17 concepts, are subsumed by the concept Stairwell. Similarly, for the second rule, q8 and q9 are subsumed by the CalmDistrict concept. Thus, the expert applied the IRF filter, and only three rules are filtered. One of these rules attracts the interest of the expert:

q15 = 4, q16 = 4, q97 = 4 ==> q9 = 4,
Support = 2.3%  Confidence = 79.1%,

which can be interpreted as follows: if a client is not satisfied with the cleaning of the close surrounding and the entry hall, and if he is not satisfied with the service charges, then it is possible with a confidence of 79.1 percent that he considers that his district has a bad reputation. This rule is very interesting because the expert thought that the building state does not influence the opinion concerning the district, but the rule shows that it does.

7 CONCLUSION
This paper discusses the problem of selecting interesting association rules from huge volumes of discovered rules. The major contributions of our paper are stated below. First, we propose to integrate user knowledge in association rule mining using two different types of formalism: ontologies and rule schemas. On the one hand, domain ontologies improve the integration of user domain knowledge concerning the database field in the postprocessing step. On the other hand, we propose a new formalism, called Rule Schemas, extending the specification language proposed by Liu et al. The latter is especially used to express the user expectations and goals concerning the discovered rules.
Second, a set of operators, applicable over the rule schemas, is proposed in order to guide the user throughout the postprocessing step. Thus, several types of actions, such as pruning and filtering, are available to the user. Finally, the interactivity of our ARIPSO framework, relying on the set of rule mining operators, assists the user throughout the analyzing task and allows him/her to select interesting rules more easily by reiterating the process of filtering rules.
By applying our new approach over a voluminous questionnaire database, we allowed the integration of domain expert knowledge in the postprocessing step in order to reduce the number of rules to several dozen or fewer. Moreover, the quality of the filtered rules was validated by the expert throughout the interactive process.

ACKNOWLEDGMENTS
The authors would like to thank Nantes Habitat, the Public Housing Unit in Nantes, France, and more especially Ms. Christelle Le Bouter, and also M. Loic Glimois for supporting this work.

REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proc. ACM SIGMOD, pp. 207-216, 1993.
[2] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[3] A. Silberschatz and A. Tuzhilin, “What Makes Patterns Interesting in Knowledge Discovery Systems,” IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 970-974, Dec. 1996.
[4] M.J. Zaki and M. Ogihara, “Theoretical Foundations of Association Rules,” Proc. Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD ’98), pp. 1-8, June 1998.
[5] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu, “Mafia: A Maximal Frequent Itemset Algorithm,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 11, pp. 1490-1504, Nov. 2005.
[6] J. Li, “On Optimal Rule Discovery,” IEEE Trans. Knowledge and Data Eng., vol. 18, no. 4, pp. 460-471, Apr. 2006.
[7] M.J. Zaki, “Generating Non-Redundant Association Rules,” Proc. Int’l Conf. Knowledge Discovery and Data Mining, pp. 34-43, 2000.
[8] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Efficient Mining of Association Rules Using Closed Itemset Lattices,” Information Systems, vol. 24, pp. 25-46, 1999.
[9] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen, and H. Mannila, “Pruning and Grouping of Discovered Association Rules,” Proc. ECML-95 Workshop Statistics, Machine Learning, and Knowledge Discovery in Databases, pp. 47-52, 1995.
[10] B. Baesens, S. Viaene, and J. Vanthienen, “Post-Processing of Association Rules,” Proc. Workshop Post-Processing in Machine Learning and Data Mining: Interpretation, Visualization, Integration, and Related Topics with Sixth ACM SIGKDD, pp. 20-23, 2000.
[11] J. Blanchard, F. Guillet, and H. Briand, “A User-Driven and Quality-Oriented Visualization for Mining Association Rules,” Proc. Third IEEE Int’l Conf. Data Mining, pp. 493-496, 2003.
[12] B. Liu, W. Hsu, K. Wang, and S. Chen, “Visually Aided Exploration of Interesting Association Rules,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), pp. 380-389, 1999.
[13] G. Birkhoff, Lattice Theory, vol. 25. Am. Math. Soc., 1967.
[14] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering Frequent Closed Itemsets for Association Rules,” Proc. Seventh Int’l Conf. Database Theory (ICDT ’99), pp. 398-416, 1999.
[15] M. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, pp. 223-248, 2004.
[16] A. Maedche and S. Staab, “Ontology Learning for the Semantic Web,” IEEE Intelligent Systems, vol. 16, no. 2, pp. 72-79, Mar. 2001.
[17] B. Liu, W. Hsu, L.-F. Mun, and H.-Y. Lee, “Finding Interesting Patterns Using User Expectations,” IEEE Trans. Knowledge and Data Eng., vol. 11, no. 6, pp. 817-832, Nov. 1999.
[18] I. Horrocks and P.F. Patel-Schneider, “Reducing owl Entailment to Description Logic Satisfiability,” J. Web Semantics, pp. 17-29, vol. 2870, 2003.
[19] J. Pei, J. Han, and R. Mao, “Closet: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery, pp. 21-30, 2000.
[20] M.J. Zaki and C.J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemset Mining,” Proc. Second SIAM Int’l Conf. Data Mining, pp. 34-43, 2002.
[21] M.Z. Ashrafi, D. Taniar, and K. Smith, “Redundant Association Rules Reduction Techniques,” AI 2005: Advances in Artificial Intelligence – Proc. 18th Australian Joint Conf. Artificial Intelligence, pp. 254-263, 2005.
[22] M. Hahsler, C. Buchta, and K. Hornik, “Selective Association Rule Generation,” Computational Statistics, vol. 23, no. 2, pp. 303-315, Kluwer Academic Publishers, 2008.
[23] R.J. Bayardo, Jr. and R. Agrawal, “Mining the Most Interesting Rules,” Proc. ACM SIGKDD, pp. 145-154, 1999.
[24] R.J. Bayardo, Jr., R. Agrawal, and D. Gunopulos, “Constraint-Based Rule Mining in Large, Dense Databases,” Proc. 15th Int’l Conf. Data Eng. (ICDE ’99), pp. 188-197, 1999.
[25] E.R. Omiecinski, “Alternative Interest Measures for Mining Associations in Databases,” IEEE Trans. Knowledge and Data Eng., vol. 15, no. 1, pp. 57-69, Jan./Feb. 2003.
[26] F. Guillet and H. Hamilton, Quality Measures in Data Mining. Springer, 2007.
[27] P.-N. Tan, V. Kumar, and J. Srivastava, “Selecting the Right Objective Measure for Association Analysis,” Information Systems, vol. 29, pp. 293-313, 2004.
[28] G. Piatetsky-Shapiro and C.J. Matheus, “The Interestingness of Deviations,” Proc. AAAI’94 Workshop Knowledge Discovery in Databases, pp. 25-36, 1994.
[29] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo, “Finding Interesting Rules from Large Sets of Discovered Association Rules,” Proc. Int’l Conf. Information and Knowledge Management (CIKM), pp. 401-407, 1994.
[30] E. Baralis and G. Psaila, “Designing Templates for Mining Association Rules,” J. Intelligent Information Systems, vol. 9, pp. 7-32, 1997.
[31] B. Padmanabhan and A. Tuzhilin, “Unexpectedness as a Measure of Interestingness in Knowledge Discovery,” Proc. Workshop Information Technology and Systems (WITS), pp. 81-90, 1997.
[32] T. Imielinski, A. Virmani, and A. Abdulghani, “Datamine: Application Programming Interface and Query Language for Database Mining,” Proc. Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 256-262, http://www.aaai.org/Papers/KDD/1996/KDD96-042.pdf, 1996.
[33] R.T. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang, “Exploratory Mining and Pruning Optimizations of Constrained Associations Rules,” Proc. ACM SIGMOD Int’l Conf. Management of Data, vol. 27, pp. 13-24, 1998.
[34] A. An, S. Khan, and X. Huang, “Objective and Subjective Algorithms for Grouping Association Rules,” Proc. Third IEEE Int’l Conf. Data Mining (ICDM ’03), pp. 477-480, 2003.
[35] A. Berrado and G.C. Runger, “Using Metarules to Organize and Group Discovered Association Rules,” Data Mining and Knowledge Discovery, vol. 14, no. 3, pp. 409-431, 2007.
[36] M. Uschold and M. Grüninger, “Ontologies: Principles, Methods, and Applications,” Knowledge Eng. Rev., vol. 11, pp. 93-155, 1996.
[37] T.R. Gruber, “A Translation Approach to Portable Ontology Specifications,” Knowledge Acquisition, vol. 5, pp. 199-220, 1993.
[38] N. Guarino, “Formal Ontology in Information Systems,” Proc. First Int’l Conf. Formal Ontology in Information Systems, pp. 3-15, 1998.
[39] H. Nigro, S.G. Cisaro, and D. Xodo, Data Mining with Ontologies: Implementations, Findings and Frameworks. Idea Group, Inc., 2007.
[40] R. Srikant and R. Agrawal, “Mining Generalized Association Rules,” Proc. 21st Int’l Conf. Very Large Databases, pp. 407-419, http://citeseer.ist.psu.edu/srikant95mining.html, 1995.
[41] V. Svatek and M. Tomeckova, “Roles of Medical Ontology in Association Mining Crisp-dm Cycle,” Proc. Workshop Knowledge Discovery and Ontologies in ECML/PKDD, 2004.
[42] X. Zhou and J. Geller, “Raising, to Enhance Rule Mining in Web Marketing with the Use of an Ontology,” Data Mining with Ontologies: Implementations, Findings and Frameworks, pp. 18-36, Idea Group Reference, 2007.
[43] M.A. Domingues and S.A. Rezende, “Using Taxonomies to Facilitate the Analysis of the Association Rules,” Proc. Second Int’l Workshop Knowledge Discovery and Ontologies, held with ECML/PKDD, pp. 59-66, 2005.
[44] A. Bellandi, B. Furletti, V. Grossi, and A. Romei, “Ontology-Driven Association Rule Extraction: A Case Study,” Proc. Workshop Context and Ontologies: Representation and Reasoning, pp. 1-10, 2007.
[45] R. Natarajan and B. Shekar, “A Relatedness-Based Data-Driven Approach to Determination of Interestingness of Association Rules,” Proc. 2005 ACM Symp. Applied Computing (SAC), pp. 551-552, 2005.
[46] A.C.B. Garcia and A.S. Vivacqua, “Does Ontology Help Make Sense of a Complex World or Does It Create a Biased Interpretation?” Proc. Sensemaking Workshop in CHI ’08 Conf. Human Factors in Computing Systems, 2008.
[47] A.C.B. Garcia, I. Ferraz, and A.S. Vivacqua, “From Data to Knowledge Mining,” Artificial Intelligence for Eng. Design, Analysis and Manufacturing, vol. 23, pp. 427-441, 2009.
[48] L.M. Garshol, “Metadata? Thesauri? Taxonomies? Topic Maps Making Sense of It All,” J. Information Science, vol. 30, no. 4, pp. 378-391, 2004.
[49] I. Horrocks and P.F. Patel-Schneider, “A Proposal for an owl Rules Language,” Proc. 13th Int’l Conf. World Wide Web, pp. 723-731, 2004.
[50] W.E. Grosso, H. Eriksson, R.W. Fergerson, J.H. Gennari, S.W. Tu, and M.A. Musen, “Knowledge Modeling at the Millennium (the Design and Evolution of Protege-2000),” Proc. 12th Workshop Knowledge Acquisition, Modeling and Management (KAW ’99), 1999.
[51] M.-A. Storey, N.F. Noy, M. Musen, C. Best, R. Fergerson, and N. Ernst, “Jambalaya: An Interactive Environment for Exploring Ontologies,” Proc. Seventh Int’l Conf. Intelligent User Interfaces (IUI ’02), pp. 239-239, 2002.

Claudia Marinica received the master’s degree in “KDD” from the Polytechnique School of Nantes University in 2006, and the Computer Science degree from Politehnica University of Bucharest, Romania, in 2006. She is currently working toward the PhD degree in computer science in the “Knowledge and Decision” Team, LINA UMR CNRS 6241 at Polytechnique School of Nantes University, France. Her main research interests are in Association Rule Mining and Semantic Web.

Fabrice Guillet received the PhD degree in computer sciences from the Ecole Nationale Superieure des Telecommunications de Bretagne in 1995. He has been an associate professor (HdR) in computer science at Polytech’Nantes, and a member of the “KnOwledge and Decision” team (KOD) in the Nantes-Atlantic Laboratory of Computer Sciences (LINA UMR CNRS 6241) since 1997. He is a founder of the “Knowledge Extraction and Management” French-speaking association of research (EGC, www.egc.asso.fr). His research interests include knowledge quality and visualization in the frameworks of Data Mining and Knowledge Engineering. He has recently coedited two refereed books of chapters entitled Quality Measures in Data Mining (Springer, 2007), and Statistical Implicative Analysis: Theory and Applications (Springer, 2008).
