Abstract
Conventional algorithms for mining association rules operate by combining smaller large itemsets. This paper presents a new efficient method, named cluster-decomposition association rule (CDAR), which combines the cluster concept with the decomposition of larger candidate itemsets, proceeding from mining the maximal large itemsets down to large 1-itemsets. First, the CDAR method creates clusters by reading the database only once, assigning each transaction record of length k to the kth cluster. The large k-itemsets are then generated by contrasts with the kth cluster only, unlike the combination approach, which contrasts candidates with the entire database. Experiments with real-life databases show that CDAR outperforms Apriori, a well-known and widely used association rule algorithm.
© 2003 Elsevier Inc. All rights reserved.
1. Introduction
* Corresponding author. Tel.: +886-8-7703202x6355; fax: +886-8-7740306. E-mail address: yjtsay@mail.npust.edu.tw (Y.-J. Tsay).
0020-0255/$ - see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2003.08.013
Y.-J. Tsay, Y.-W. Chang-Chien / Information Sciences 160 (2004) 161–171

Mining for association rules between items in large databases of sales transactions has
been recognized as an important area of database research [9]. These rules can
be effectively used to uncover unknown relationships, producing results that
can provide a basis for forecasting and decision-making. These results have
proven to be very useful for enterprises striving to enhance their competitive-
ness and profitability. One of the main challenges in mining association rules is
developing fast and efficient algorithms that can handle large volumes of data,
because most association rule algorithms perform computations over entire
databases, which frequently are very large.
Savasere et al. proposed the partition algorithm to further improve efficiency,
since it effectively reduces the number of database scans; however, this algo-
rithm wastes significant time scanning infrequent candidate itemsets [15].
Moreover, Park et al. proposed DHP (Direct Hashing and Pruning), an effective algorithm for initial candidate set generation. This method efficiently controls the number of candidate 2-itemsets and prunes the database size [14]. Han et al.
proposed a top-down method, which delves progressively deeper into the data,
for efficient mining of multiple-level association rules from large transaction
databases based on the Apriori principle [10]. Moreover, Cheung et al. proposed
the Fast Distributed Algorithm (FDA) for efficiently discovering association
rules in a distributed system [7]. Additionally, Toivonen proposed a sampling
algorithm, which requires just a single database scan, but still wastes consid-
erable time on candidate itemsets [16]. Furthermore, Carter et al. adjusted the counting method to include the actual quantity of product bought by customers, seeking to determine not only when customers purchase but also how much they purchase [6]. Brin et al. proposed the Dynamic Itemset Counting (DIC) algorithm for identifying large itemsets, which uses fewer passes over the data than classic algorithms, and fewer candidate itemsets than sampling-based methods [5]. Finally, Liu et al. proposed the Msapriori algorithm,
which resembles the Apriori algorithm. The difference between the two methods
is that the Msapriori algorithm automatically generates the minimum support
for each item [12]. Additionally, the Column-Wise Apriori algorithm [8] and the Tree-Based Association Rule (TBAR) algorithm [4] transformed the storage structure of the data to reduce the time needed for database scans, improving overall efficiency. Briefly, the above methods focus on decreasing the number of database reads and pruning the number of candidates.
Efficiency is improved if an alternative method can decrease the number of
database reads, and also reduce the number of contrasts or candidate itemsets.
Thus, this study proposes the efficient CDAR method for rapidly identifying large itemsets, with the main contributions as follows. CDAR requires only a single read of the database, followed by contrasts with the kth cluster, to generate large k-itemsets. This not only prunes considerable
amounts of data, thus reducing memory requirements and the time needed, but
also ensures the accuracy of the mined results.
Fig. 1. Conventional methods repeatedly read the database D, contrasting every candidate (M−1)-itemset and then every candidate M-itemset with the entire D to generate the large (M−1)-itemsets and large M-itemsets.
Fig. 2. CDAR contrasts candidates with only the kth cluster to generate the large k-itemsets.
specified length threshold of large itemsets. First, for each candidate k-itemset in Ck, its support is obtained simply by counting the times that the individual candidate k-itemset appears in Ck, avoiding the need to contrast all transaction records in the database. If the support equals or exceeds the user-specified minimum support threshold (MinSup), the candidate k-itemset becomes a large k-itemset. Then, let Tmp = Tmp ∪ Cluster(k), while deleting itemset X from Tmp if X ⊆ Y, where Y ∈ LM ∪ LM−1 ∪ ⋯ ∪ Lk+1 ∪ Lk. Finally, the itemsets in Tmp are decomposed into length-(k−1) subsets and added to IntC(k−1), while deleting itemset X from IntC(k−1) if X ⊆ Y, where Y ∈ LM ∪ LM−1 ∪ ⋯ ∪ Lk+1 ∪ Lk, to generate the candidate (k−1)-itemsets, Ck−1, as shown in Fig. 4. The algorithm terminates when no candidate itemsets remain below the minimum support threshold (MinSup), or when k falls below the user-specified minimum length threshold (MinLength) of large itemsets.
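As an illustration only, the level-wise procedure above can be sketched in Python. This is a minimal sketch of the cluster-and-decompose idea as described, not the authors' implementation; the per-level MinSup mapping and the helper names are assumptions made for the sketch.

```python
from collections import Counter
from itertools import combinations

def cdar(transactions, minsup, minlength=2):
    """Sketch of CDAR: cluster records by length, then mine large
    itemsets from the maximal length M down to minlength.
    `minsup` maps each level k to its minimum support count."""
    # Part 1: single database scan -- each length-k record goes to Cluster(k).
    clusters = {}
    for t in transactions:
        clusters.setdefault(len(t), []).append(frozenset(t))
    M = max(clusters)

    large = {}     # k -> set of large k-itemsets
    found = set()  # union of all large itemsets discovered so far
    tmp = []       # itemsets kept for decomposition (Tmp in the paper)

    for k in range(M, minlength - 1, -1):
        def uncovered(x):
            return not any(x <= y for y in found)
        # C_k: the kth cluster plus length-k subsets decomposed from Tmp,
        # pruning itemsets already covered by a large itemset.
        decomp = [frozenset(c) for x in tmp
                  for c in combinations(sorted(x), k)]
        ck = [x for x in clusters.get(k, []) + decomp if uncovered(x)]
        # Support is counted within C_k only, never against the whole database.
        counts = Counter(ck)
        large[k] = {x for x, n in counts.items() if n >= minsup[k]}
        found |= large[k]
        # Tmp accumulates cluster records not covered by any large itemset.
        tmp = [x for x in tmp + clusters.get(k, []) if uncovered(x)]
    return large
```

On the paper's seven-record example, this sketch reproduces L4 = {ABCE}, L3 = {CDE} and L2 = ∅ with minimum supports of 2, 2 and 3 at levels 4, 3 and 2 respectively.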
This study uses the following example to describe the program execution of
the CDAR algorithm, where candidate itemsets and large itemsets are gener-
ated. The example includes seven transaction records, as shown in Fig. 5. The
minimum length (number of items) threshold of large itemsets is defined as MinLength = 2, while the minimum support threshold (MinSup) is defined separately for each level during the generation of large itemsets of various lengths.
Part 1 establishes a set of candidate M-itemsets and creates M clusters by
scanning the database, clustering transaction records, and putting length-k
transaction records into Cluster(k), where 1 ≤ k ≤ M. Since the length of the longest transaction record in the example is 4, M = 4, and the generated candidate 4-itemsets are C4 = {ABCE, ABCE, BCDE}, as shown in Fig. 5.
Part 2 identifies large k-itemsets, Lk, and generates candidate (k−1)-itemsets, Ck−1, where k ranges from 4 down to 2, as shown in Fig. 5.
(1) When k = 4, assume that the minimum support is defined as MinSup = 2, and then discover the large 4-itemsets L4 and generate candidate 3-itemsets C3.
Database D (seven transaction records, scanned only once):
1 CDE; 2 BCE; 3 ABCE; 4 BE; 5 ABD; 6 ABCE; 7 BCDE

Fig. 5. Example of CDAR execution. C4 = {ABCE, ABCE, BCDE} yields L4 = {ABCE} under MinSup = 2; Tmp = {BCDE} is decomposed into IntC(3), giving C3 = {BCD, CDE, BDE, CDE, ABD} and L3 = {CDE} under MinSup = 2; Tmp = {BCDE, ABD} is decomposed into IntC(2), giving C2 = {BD, BD, AD} and, under MinSup = 3 with MinLength = 2, L2 = ∅.
The support of the candidate itemsets is obtained simply by counting the times that individual candidate itemsets have appeared in C4, avoiding the need to contrast all transaction records in the database. Since only the candidate itemset ABCE has appeared twice in C4, its support is 2, equaling or exceeding the minimum support threshold, and the generated large itemsets are L4 = {ABCE}. The other candidate itemset BCDE appears just once, less than the minimum support threshold, thus Tmp = {BCDE}. Itemset BCDE of Tmp then is decomposed into length-3 subsets {BCD}, {CDE}, {BCE}, {BDE}, which are added to IntC(3) (thus, IntC(3) = {BCD, CDE, BCE, BDE, CDE, BCE, ABD}), while the candidate itemset BCE in IntC(3) is deleted (since {BCE} ⊂ {ABCE} ∈ L4) to generate the candidate 3-itemsets C3 = {BCD, CDE, BDE, CDE, ABD}. The longest itemset in the example is L4 = {ABCE}.
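The decomposition of BCDE into its length-3 subsets can be reproduced directly with Python's itertools.combinations; this snippet is an illustration, not part of the paper's implementation.

```python
from itertools import combinations

# Decompose the infrequent candidate itemset BCDE into its length-3 subsets.
subsets = ["".join(c) for c in combinations("BCDE", 3)]
print(subsets)  # ['BCD', 'BCE', 'BDE', 'CDE']
```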
(2) When k = 3, assume that the minimum support is defined as MinSup = 2, and then identify the large 3-itemsets L3 and generate candidate 2-itemsets C2. The support is obtained simply by counting the times that individual candidate itemsets have appeared in C3. Since only the candidate itemset CDE has appeared twice in C3, its support is 2, equaling or exceeding the minimum support; the generated large 3-itemsets are L3 = {CDE}, and Tmp = {BCDE, ABD}. The itemsets BCDE and ABD of Tmp thus are decomposed into length-2 subsets {BC}, {BD}, {BE}, {CD}, {CE}, {DE}, {AB}, {AD}, {BD}, which are added to IntC(2) (thus, IntC(2) = {BE, BC, BD, BE, CD, CE, DE, AB, AD, BD}), while the candidate itemsets BE, BC, CD, CE, DE and AB are deleted from IntC(2) (since {BE}, {BC}, {CD}, {CE}, {DE}, {AB} are subsets of the itemsets of L4 ∪ L3) to generate the candidate 2-itemsets C2 = {BD, BD, AD}.
(3) When k = 2, assume that the minimum support is defined as MinSup = 3, and then identify the large 2-itemsets L2. The support is determined simply by counting the times that individual candidate itemsets have appeared in C2. Since the candidate itemset BD has appeared twice and AD has appeared once, both below the minimum support, L2 = ∅. The decomposition process then stops, since the minimum length of large itemsets is defined as MinLength = 2.
In the example, the longest generated large itemset that meets the minimum support is L4 = {ABCE}, while the other generated large itemset is L3 = {CDE}, as shown in Fig. 5. Individual large itemsets can then be converted into association rules relatively easily.
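To make that last remark concrete, the conversion of one large itemset into association rules can be sketched as follows. The function name, the confidence measure computed over the full database, and the minconf value are assumptions for illustration; they are not part of CDAR itself.

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf=0.6):
    """Derive rules X -> (itemset - X) from one large itemset, where
    confidence(X -> Y) = support(X union Y) / support(X)."""
    def support(s):
        # Number of transaction records containing every item of s.
        return sum(1 for t in transactions if s <= set(t))
    whole = support(set(itemset))
    rules = []
    for r in range(1, len(itemset)):
        for x in combinations(sorted(itemset), r):
            antecedent = set(x)
            conf = whole / support(antecedent)
            if conf >= minconf:
                rules.append((antecedent, set(itemset) - antecedent, conf))
    return rules
```

For the large itemset CDE over the seven example records, this sketch yields the rules D -> CE (confidence 2/3), CD -> E (confidence 1.0) and DE -> C (confidence 1.0).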
5. Experimental results
To evaluate the efficiency of the proposed method, the CDAR, along with
the Apriori algorithm, is implemented using Microsoft Visual Basic 6.0 on a
Pentium III 550 MHz PC with 256 MB of available physical memory. The test
database is the FoodMart transaction database provided with Microsoft SQL
Server 2000. In this experiment, the efficiency of the CDAR algorithm is
compared to that of the Apriori algorithm.
(1) 5000, 10,000, 15,000, 20,000 and 25,000 transaction records of experi-
mental data are randomly sampled from the FoodMart transaction database.
The test database contains 1600 items, and the longest transaction record
contains 18 items. The performance of the CDAR algorithm is stable when the
minimum supports (MinSup) are 0.44%, 0.42%, 0.40% and 0.38%, and the
number of transaction records is varied at levels 5000, 10,000, 15,000, 20,000
and 25,000. Fig. 6 shows the results.
(2) 5000 and 10,000 transaction records of experimental data are randomly sampled from the FoodMart transaction database. The test database contains 1600 items, with the longest transaction record containing 18 items. The experimental results of applying the CDAR algorithm are compared to those of applying the Apriori algorithm under various minimum support thresholds (MinSup), set at 0.44%, 0.42%, 0.40%, 0.38% and 0.36%. Figs. 7 and 8 show the results.
Fig. 6. Execution time (s) of CDAR versus database size (5000–25,000 transaction records) under MinSup = 0.44%, 0.42%, 0.40% and 0.38%.
Fig. 7. Execution time (s) of Apriori and CDAR under minimum supports from 0.44% to 0.36%.
Fig. 8. Execution time (s) of Apriori and CDAR under minimum supports from 0.44% to 0.36% (second database sample).
The experimental results in Figs. 6–8 show that CDAR performs better and more stably than the Apriori algorithm. This performance arises because CDAR eliminates considerable amounts of data, reducing both the time needed to perform data contrasts and the memory requirements, while still ensuring the correctness of the mined results.
6. Conclusions
An efficient method for identifying the large itemsets can be useful in various
data mining problems, such as the discovery of the association rules. Although
conventional algorithms can discover meaningful itemsets and construct as-
sociation rules from large databases, they suffer the disadvantage of also
generating numerous candidate itemsets that must be repeatedly contrasted
with the entire database, level by level, while mining large itemsets. Conventional algorithms thus consume large amounts of memory, and their performance suffers because the database is repeatedly read to contrast each candidate itemset with all of the transaction records.
The CDAR algorithm creates clusters to aid the discovery of large k-itemsets
by reading the database only once. Contrasts are performed only against the
kth cluster, which was created in advance, differing from conventional methods
that contrast with the entire database. Unlike the combination of smaller large
itemsets found in conventional algorithms, CDAR adopts the decomposition
of larger candidate itemsets, and executes in the opposite direction to conventional algorithms. This design not only prunes considerable amounts of data, reducing both the time needed for data reading and the memory requirements, but also ensures the correctness of the mined results. The performance gap between the algorithms becomes more evident as the number and size of the identified patterns grow. Experiments demonstrate that the improvement from using CDAR can be very significant, particularly for long maximal large itemsets, making the method better suited to the requirements of practical applications.
References