
Information Sciences 160 (2004) 161–171

www.elsevier.com/locate/ins

An efficient cluster and decomposition algorithm for mining association rules

Yuh-Jiuan Tsay *, Ya-Wen Chang-Chien

Department of Management Information Systems, National Ping-Tung University of Science and Technology, 1, Hseuh-Fu Rd., Nei-Pu Shan, Ping-Tung 91201, Taiwan

Received 13 May 2003; received in revised form 25 July 2003; accepted 13 August 2003

Abstract
Conventional algorithms for mining association rules operate by combining smaller large itemsets. This paper presents a new efficient algorithm, named the cluster-decomposition association rule (CDAR) algorithm, which combines the cluster concept with the decomposition of larger candidate itemsets, proceeding from the maximal large itemsets down to large 1-itemsets. First, the CDAR method creates clusters by reading the database only once, clustering each transaction record of length k into the kth cluster. The large k-itemsets are then generated by contrasts with the kth cluster only, unlike the combination approach, which contrasts with the entire database. Experiments with real-life databases show that CDAR outperforms Apriori, a well-known and widely used association rule algorithm.
© 2003 Elsevier Inc. All rights reserved.

Keywords: Association rule; Combination; Cluster-decomposition

1. Introduction

According to a META Group study, over 70% of Fortune 1000 companies have established data warehousing projects, integrating their internal databases and using data mining technologies to discover meaningful information. The mined results are then consulted during executive decision-making [13]. Mining

*
Corresponding author. Tel.: +886-8-7703202x6355; fax: +886-8-7740306.
E-mail address: yjtsay@mail.npust.edu.tw (Y.-J. Tsay).

0020-0255/$ - see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2003.08.013

for association rules between items in large databases of sales transactions has
been recognized as an important area of database research [9]. These rules can
be effectively used to uncover unknown relationships, producing results that
can provide a basis for forecasting and decision-making. These results have
proven to be very useful for enterprises striving to enhance their competitive-
ness and profitability. One of the main challenges in mining association rules is
developing fast and efficient algorithms that can handle large volumes of data,
because most association rule algorithms perform computations over entire
databases, which frequently are very large.
Savasere et al. proposed the partition algorithm to further improve efficiency,
since it effectively reduces the number of database scans; however, this algo-
rithm wastes significant time scanning infrequent candidate itemsets [15].
Moreover, Park et al. proposed an effective algorithm, DHP (Direct Hashing and Pruning), for initial candidate set generation. This method efficiently controls the number of candidate 2-itemsets while pruning the database size [14]. Han et al.
proposed a top-down method, which delves progressively deeper into the data,
for efficient mining of multiple-level association rules from large transaction
databases based on the Apriori principle [10]. Moreover, Cheung et al. proposed
the Fast Distributed Algorithm (FDA) for efficiently discovering association
rules in a distributed system [7]. Additionally, Toivonen proposed a sampling
algorithm, which requires just a single database scan, but still wastes consid-
erable time on candidate itemsets [16]. Furthermore, Carter et al. adjusted the counting method to include the actual quantity of product bought by customers, aiming to determine not only the time of customer purchase but also the purchase quantity [6]. Brin et al. proposed the Dynamic Itemset Counting (DIC) algorithm for identifying large itemsets, which uses fewer passes over the data than classic algorithms and fewer candidate itemsets than sampling-based methods [5]. Finally, Liu et al. proposed the Msapriori algorithm,
which resembles the Apriori algorithm. The difference between the two methods
is that the Msapriori algorithm automatically generates the minimum support
for each item [12]. Additionally, the Column-Wise Apriori algorithm [8] and the Tree-Based Association Rule (TBAR) algorithm [4] transformed the storage structure of the data to reduce the time needed for database scans, improving overall efficiency. Briefly, the above improved methods focus on decreasing the number of database reads and on pruning the number of candidate itemsets.
Efficiency is improved if an alternative method can decrease the number of
database reads, and also reduce the number of contrasts or candidate itemsets.
Thus this study proposes an efficient CDAR method for rapidly identifying
large itemsets, with the main contributions being as follows. The CDAR only
requires a single read of the database, followed by contrasts with the kth
cluster, to generate large k-itemsets. This not only prunes considerable
amounts of data, thus reducing memory requirements and the time needed, but
also ensures the accuracy of the mined results.

The remainder of this paper is organized as follows. Section 2 briefly describes Apriori-like level-wise association rules. Section 3 presents the proposed algorithm, called CDAR, and Section 4 presents an example of CDAR. Subsequently, Section 5 discusses the experiment design and the results; finally, Section 6 presents conclusions.

2. Apriori-like association rules

Let I = {i1, i2, i3, …, i_ItemNo} represent a set of ItemNo distinct literals, called items. Generally, a set of items is called an itemset, and the number of items in an itemset indicates its length. An itemset of length k is termed a k-itemset. A database D is a set of variable-length transactions, where the ith transaction Ti denotes a set of items such that Ti ⊆ I. |Ti| denotes the length (number of items) of transaction Ti. Each transaction is associated with a unique identifier, termed its TID. M denotes the length of the longest transaction record in database D. A transaction T is considered to support an itemset X ⊆ I if it contains all items of X, that is, X ⊆ T. The fraction of the transactions in D that support X is called the support of X, denoted support(X). An itemset is large if its support exceeds a user-specified minimum support threshold, denoted MinSup.
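This definition translates directly into code. The following Python sketch is illustrative only (transactions and itemsets are modeled as sets of items; it returns the fraction defined above, whereas the example in Section 4 works with absolute occurrence counts):

```python
def support(X, D):
    """Fraction of the transactions in D that contain every item of X."""
    X = frozenset(X)
    return sum(X <= frozenset(T) for T in D) / len(D)
```

An itemset X is then large exactly when support(X, D) >= MinSup.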
Conventional Apriori-like level-wise algorithms identify the set of all large itemsets by combining smaller large itemsets [1–5,7,8,10–12,14,15]. At the kth level, the algorithm identifies all large k-itemsets, denoted Lk, where Ck represents the set of candidate k-itemsets obtained from L(k−1), that is, the potentially large k-itemsets. For each transaction in D, the candidate k-itemsets in Ck contained in the transaction are determined and their support counts are increased by 1. After reading and contrasting with the entire database D, the candidate k-itemsets whose supports are greater than or equal to MinSup become the large k-itemsets. At the end of level k, all large itemsets of length k or less have been discovered. During execution, numerous candidate itemsets are generated from smaller itemsets, and every candidate itemset must be contrasted with the entire database, level by level, while discovering large itemsets. Performance suffers significantly, because the database is repeatedly read to contrast each candidate itemset with all transaction records of the database, as shown in Fig. 1.
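As a concrete sketch of this level-wise process, the following Python fragment implements Apriori-style candidate generation, pruning and counting (a minimal illustration of the scheme described above, not the implementation used in the cited papers; supports are absolute counts):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori sketch: L_k is built by contrasting every
    candidate in C_k with the entire transaction list."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items over the whole database.
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    L = {c for c, n in counts.items() if n >= min_sup}
    all_large = set(L)
    k = 2
    while L:
        # Candidate generation: join L_{k-1} with itself, keep size-k unions.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune candidates having an infrequent (k-1)-subset (Apriori property).
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Every candidate is contrasted with the ENTIRE database again.
        counts = {c: sum(c <= t for t in transactions) for c in C}
        L = {c for c, n in counts.items() if n >= min_sup}
        all_large |= L
        k += 1
    return all_large
```

Note how every level re-reads the whole transaction list; that repeated full-database contrast is exactly the cost CDAR is designed to avoid.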

3. Cluster-decomposition association rule (CDAR)

The CDAR algorithm employs efficient clusters to represent database D based on a single reading, then proceeds from mining the maximal large itemsets down to large 1-itemsets by decomposing larger candidate itemsets, as

Fig. 1. Apriori needs the contrasts with the entire database repeatedly. (The figure shows the level-wise flow: for each k from 1 to M, D is read and every candidate k-itemset is contrasted with the entire D to obtain the large k-itemsets.)

shown in Fig. 2. Unlike the combination of smaller large itemsets in conventional algorithms, CDAR adopts the decomposition of larger candidate itemsets, executing in the opposite direction to conventional algorithms. Additionally, the CDAR method creates clusters by reading the database only once, clustering each transaction record of length k into the kth cluster. The large k-itemsets are then generated by contrasts with the kth cluster only, unlike the combination-of-smaller-itemsets approach, which contrasts with the entire database. Fig. 3 illustrates the algorithmic form of CDAR, which, for ease of presentation, is divided into two parts.

Fig. 2. CDAR needs the contrasts with only the kth cluster to generate the large k-itemsets. (The figure shows the flow: reading database D only once to form Cluster(M), Cluster(M−1), …, Cluster(2), Cluster(1), then contrasting with Cluster(k) alone to obtain the large k-itemsets, for k = M down to 1.)

Fig. 3. Main program for the CDAR algorithm.

Part 1 creates M clusters and obtains the set of candidate M-itemsets, CM. The database is read once and the transaction records are clustered: if the length of a transaction record is k, the record is stored in the kth cluster, named Cluster(k), 1 ≤ k ≤ M, where M denotes the length of the longest transaction record in the database. Meanwhile, the set of candidate M-itemsets, CM = Cluster(M), is generated. Moreover, let Tmp = ∅, and initialize the candidate k-itemsets IntC(k) = Cluster(k) for 1 ≤ k < M.
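A minimal Python sketch of Part 1 follows (an illustrative reading of the description above, not the authors' Visual Basic implementation; records are modeled as frozensets and all names are ours):

```python
def build_clusters(transactions):
    """Part 1 of CDAR (sketch): a single database read, grouping each
    transaction record into Cluster(k), where k is the record's length."""
    clusters = {}
    for record in transactions:                 # the one and only database scan
        k = len(record)
        clusters.setdefault(k, []).append(frozenset(record))
    M = max(clusters)                           # length of the longest record
    # C_M is simply Cluster(M); IntC(k) starts out as Cluster(k) for k < M.
    C_M = list(clusters[M])
    int_c = {k: list(clusters.get(k, [])) for k in range(1, M)}
    return clusters, C_M, int_c, M
```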
Part 2 identifies large k-itemsets, Lk, and generates candidate (k−1)-itemsets, C(k−1), where k runs from M down to MinLength, the user-specified minimum length threshold of large itemsets. First, the support of each candidate k-itemset in Ck is obtained simply by counting the times that the individual candidate k-itemset appears in Ck, avoiding the need to contrast all transaction records in the database. If the support equals or exceeds the user-specified minimum support threshold (MinSup), the candidate k-itemset automatically becomes a large k-itemset. Next, let Tmp = Tmp ∪ Cluster(k), deleting any itemset X from Tmp for which X ⊆ Y, where Y ∈ LM ∪ L(M−1) ∪ … ∪ L(k+1) ∪ Lk. Finally, the itemsets in Tmp are decomposed into length-(k−1) subsets and added to IntC(k−1), again deleting any itemset X from IntC(k−1) for which X ⊆ Y, where Y ∈ LM ∪ L(M−1) ∪ … ∪ L(k+1) ∪ Lk, to generate the candidate (k−1)-itemsets, C(k−1), as shown in Fig. 4. The algorithm terminates either when no candidate remains to decompose (every calculated support equals or exceeds MinSup) or when k falls below the user-specified minimum length threshold (MinLength) of large itemsets.

Fig. 4. Procedure of decomposition for the CDAR algorithm.
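Part 2 can be sketched in Python as follows. This is an illustrative reading of Figs. 3 and 4, not the authors' implementation: clusters is assumed to map each length k to the list of length-k records as frozensets, and a single MinSup is used across all levels, unlike the per-level thresholds in the example of Section 4.

```python
from itertools import combinations
from collections import Counter

def cdar_part2(clusters, M, min_sup, min_length=1):
    """Part 2 of CDAR (sketch): walk k from M down to min_length, counting
    each candidate only inside C_k, then decomposing uncovered records."""
    large = {}                       # k -> set of large k-itemsets
    covered = set()                  # union of all large itemsets found so far
    tmp = set()                      # uncovered records awaiting decomposition
    C_k = list(clusters.get(M, []))  # C_M is simply Cluster(M)
    int_c = {k: list(clusters.get(k, [])) for k in range(1, M)}
    for k in range(M, min_length - 1, -1):
        counts = Counter(C_k)        # contrasts stay inside C_k
        large[k] = {x for x, n in counts.items() if n >= min_sup}
        covered |= large[k]
        # Tmp <- Tmp U Cluster(k), dropping records covered by a large itemset.
        tmp = {x for x in tmp | set(clusters.get(k, []))
               if not any(x <= y for y in covered)}
        if k - 1 < min_length:
            break
        # Decompose Tmp into (k-1)-subsets, merge with IntC(k-1), and drop
        # subsets of already-large itemsets to obtain C_{k-1}.
        C_k = [frozenset(s) for x in tmp
               for s in combinations(sorted(x), k - 1)]
        C_k += int_c.get(k - 1, [])
        C_k = [x for x in C_k if not any(x <= y for y in covered)]
    return large
```

On the example database of Section 4 with MinSup = 2 throughout, this sketch yields L4 = {ABCE}, L3 = {CDE} and L2 = {BD}; BD genuinely has support 2 in that database, and the example's L2 = ∅ arises only because it raises MinSup to 3 at the last level.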

4. Example of the CDAR

This study uses the following example to describe the program execution of the CDAR algorithm, showing how candidate itemsets and large itemsets are generated. The example includes seven transaction records, as shown in Fig. 5. The minimum length (number of items) threshold of large itemsets is defined as MinLength = 2, while the minimum support threshold (MinSup) is defined separately for the generation of large itemsets of each length.
Part 1 establishes a set of candidate M-itemsets and creates M clusters by scanning the database, clustering the transaction records, and putting length-k transaction records into Cluster(k), where 1 ≤ k ≤ M. Since the length of the longest transaction record in the example is 4, M = 4 and the generated candidate 4-itemsets are C4 = {ABCE, ABCE, BCDE}, as shown in Fig. 5.
Part 2 identifies large k-itemsets, Lk, and generates candidate (k−1)-itemsets, C(k−1), where k ranges from 4 down to 2, as shown in Fig. 5.
(1) When k = 4, assume that the minimum support is defined as MinSup = 2, and then discover the large 4-itemsets L4 and generate the candidate 3-itemsets C3.

Database D

TID  Items
1    CDE
2    BCE
3    ABCE
4    BE
5    ABD
6    ABCE
7    BCDE

Fig. 5. Example of the CDAR. (The figure traces the full run: a single scan of D yields Cluster(4) = {ABCE, ABCE, BCDE}, Cluster(3) = {CDE, BCE, ABD} and Cluster(2) = {BE}; Part 2 then produces C4, L4 = {ABCE}, C3, L3 = {CDE}, C2 = {BD, BD, AD} and, with MinSup = 3, L2 = ∅. The notation {Y}x denotes x occurrences of itemset Y in a cluster or in IntC.)

The support of the candidate itemsets is obtained simply by counting the times that individual candidate itemsets appear in C4, avoiding the need to contrast all transaction records in the database. Only one candidate itemset, ABCE, appears twice in C4; its support of 2 equals or exceeds the minimum support threshold, so the generated large itemsets are L4 = {ABCE}. The other candidate itemset, BCDE, appears just once, below the minimum support threshold, thus Tmp = {BCDE}. Itemset BCDE of Tmp is then decomposed into length-3 subsets {BCD}, {CDE}, {BCE}, {BDE}, which are added to IntC(3) (thus IntC(3) = {BCD, CDE, BCE, BDE, CDE, BCE, ABD}), while the candidate itemset BCE is deleted from IntC(3) (since BCE ⊆ ABCE ∈ L4) to generate the candidate 3-itemsets C3 = {BCD, CDE, BDE, CDE, ABD}. The longest itemset in the example is L4 = {ABCE}.
(2) When k = 3, again with the minimum support defined as MinSup = 2, identify the large 3-itemsets L3 and generate the candidate 2-itemsets C2. The support is obtained simply by counting the times that individual candidate itemsets appear in C3. Since only the candidate itemset CDE appears twice in C3, its support of 2 equals or exceeds the minimum support, so the generated large 3-itemsets are L3 = {CDE}, and Tmp = {BCDE, ABD}. The itemsets BCDE and ABD of Tmp are then decomposed into length-2 subsets {BC}, {BD}, {BE}, {CD}, {CE}, {DE}, {AB}, {AD}, {BD}, which are added to IntC(2) (thus IntC(2) = {BE, BC, BD, BE, CD, CE, DE, AB, AD, BD}), while the candidate itemsets BE, BC, CD, CE, DE, AB are deleted from IntC(2) (each being a subset of an itemset in L4 ∪ L3) to generate the candidate 2-itemsets C2 = {BD, BD, AD}.
(3) When k = 2, assume that the minimum support is defined as MinSup = 3, and then identify the large 2-itemsets L2. The support is determined simply by counting the times that individual candidate itemsets appear in C2. The candidate itemset BD appears twice and AD appears once, both below the minimum support, thus L2 = ∅. The decomposition process then stops, since the minimum length of large itemsets is defined as MinLength = 2.
In the example, the longest large itemset meeting the minimum support is L4 = {ABCE}, and the only other generated large itemset is L3 = {CDE}, as shown in Fig. 5. Individual large itemsets can then be converted into association rules relatively easily.
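This final conversion is the standard rule-generation step rather than anything CDAR-specific. A hedged Python sketch follows; support_of is assumed to map each itemset to its absolute support count, and min_conf is an assumed confidence threshold not defined in the paper:

```python
from itertools import combinations

def rules_from_itemset(itemset, support_of, min_conf):
    """Turn one large itemset into rules X -> (itemset minus X) whose
    confidence sup(itemset) / sup(X) meets min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = support_of[itemset] / support_of[lhs]
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, conf))
    return rules
```

For example, with the supports from the database above (CDE: 2, CD: 2, CE: 5, DE: 2, C: 5, D: 3, E: 6), the large itemset CDE yields exactly two rules at full confidence, CD → E and DE → C.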

5. Experimental results

To evaluate the efficiency of the proposed method, the CDAR, along with
the Apriori algorithm, is implemented using Microsoft Visual Basic 6.0 on a
Pentium III 550 MHz PC with 256 MB of available physical memory. The test
database is the FoodMart transaction database provided with Microsoft SQL
Server 2000. In this experiment, the efficiency of the CDAR algorithm is
compared to that of the Apriori algorithm.

(1) 5000, 10,000, 15,000, 20,000 and 25,000 transaction records of experi-
mental data are randomly sampled from the FoodMart transaction database.
The test database contains 1600 items, and the longest transaction record
contains 18 items. The performance of the CDAR algorithm remains stable when the minimum support (MinSup) is set at 0.44%, 0.42%, 0.40% or 0.38%, across 5000, 10,000, 15,000, 20,000 and 25,000 transaction records. Fig. 6 shows the results.
(2) 5000 and 10,000 transaction records of experimental data are randomly sampled from the FoodMart transaction database. The test database contains 1600 items, with the longest transaction record containing 18 items. The experimental results of applying the CDAR algorithm are compared to those of the Apriori algorithm under various minimum support thresholds (MinSup), set at 0.44%, 0.42%, 0.40%, 0.38% and 0.36%. Figs. 7 and 8 show the results.

Fig. 6. Performance of CDAR on various amounts of transactions. (The chart plots execution time in seconds against database size, from 5000 to 25,000 records, for MinSup = 0.44%, 0.42%, 0.40% and 0.38%.)

Fig. 7. Performance of CDAR and Apriori on 5000 transaction records. (The chart plots execution time in seconds against MinSup values from 0.44% down to 0.36% for both algorithms.)



Fig. 8. Performance of CDAR and Apriori on 10,000 transaction records. (The chart plots execution time in seconds against MinSup values from 0.44% down to 0.36% for both algorithms.)

The experimental results in Figs. 6–8 show that CDAR performs better and more stably than the Apriori algorithm. By clustering, CDAR eliminates considerable amounts of data, reducing both the time needed to perform data contrasts and the memory requirements, while ensuring the correctness of the mined results.

6. Conclusions

An efficient method for identifying large itemsets can be useful in various data mining problems, such as the discovery of association rules. Although
conventional algorithms can discover meaningful itemsets and construct as-
sociation rules from large databases, they suffer the disadvantage of also
generating numerous candidate itemsets that must be repeatedly contrasted
with the entire database, level by level, while mining large itemsets. The processing of conventional algorithms utilizes large amounts of memory. Performance also suffers, since the database is repeatedly read to contrast each candidate itemset with all of the transaction records in the database.
The CDAR algorithm creates clusters to aid the discovery of large k-itemsets
by reading the database only once. Contrasts are performed only against the
kth cluster, which was created in advance, differing from conventional methods
that contrast with the entire database. Unlike the combination of smaller large
itemsets found in conventional algorithms, CDAR adopts the decomposition
of larger candidate itemsets, and uses a program execution opposite to that in
conventional algorithms. This design not only prunes considerable amounts of
data, reducing time needed to perform data reading and also memory re-
quirements, but also ensures the correctness of the mined results. The algo-
rithm provides better performance improvements. The performance gap
Y.-J. Tsay, Y.-W. Chang-Chien / Information Sciences 160 (2004) 161–171 171

between the algorithms becomes more evident with the number and size of
patterns identified. Experiments demonstrating the improvement in using the
CDAR can be extremely significant, particularly given long maximal large
itemsets, and are better suited to the requirements of practical applications.

References

[1] R. Agrawal, T. Imielinski, A. Swami, Database mining: a performance perspective, IEEE Transactions on Knowledge and Data Engineering 5 (6) (1993) 914–925.
[2] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large
databases, in: Proc. of the ACM SIGMOD Int’l Conference on Management of Data,
Washington, DC, May 1993, pp. 207–216.
[3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in:
Proc. of 1994 Int’l Conf. VLDB, Santiago, Chile, June 1994, pp. 487–499.
[4] F. Berzal, J.C. Cubero, N. Marin, J.M. Serrano, TBAR: An efficient method for association rule mining in relational databases, Data & Knowledge Engineering 37 (2001) 47–64.
[5] S. Brin, R. Motwani, C. Silverstein, Beyond market baskets: generalizing association rules to
correlations, ACM SIGMOD Conference on Management of Data, Tucson, AZ, May 1997,
pp. 265–276.
[6] C. Carter, H. Hamilton, N. Cercone, Share based measures for itemsets, in: J. Komorowski, J. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery, vol. 1263, 1997, pp. 14–24.
[7] D.W. Cheung, J. Han, V.T. Ng, A.W. Fu, Y. Fu, A fast distributed algorithm for mining
association rules, in: Proc. of 1996 Int’l Conf. on PDIS’96, Miami Beach, FL, USA, December
1996.
[8] B. Dunkel, N. Soparkar, Data Organization and Access for Efficient Data Mining, ICDE,
Australia, 1999.
[9] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers,
2000.
[10] J. Han, Y. Fu, Mining multiple-level association rules in large databases, IEEE Transactions
on Knowledge and Data Engineering 11 (5) (1999).
[11] J.D. Holt, S.M. Chung, Mining association rules using inverted hashing and pruning,
Information Processing Letters 83 (2002) 211–220.
[12] B. Liu, W. Hsu, Y. Ma, Mining association rules with multiple minimum supports, in: Proc. of
the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
San Diego, USA, August 1999.
[13] META Group, Data Warehouse Marketing Trends/Opportunities: An in-depth analysis of
key market trends, META Group, January 1998.
[14] J.S. Park, M.S. Chen, P.S. Yu, An effective hash based algorithm for mining association rules,
ACM SIGMOD (1995) 175–186.
[15] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in
large databases, in: Proc. of 21st VLDB Conference, Zurich, Switzerland, September 1995, pp.
432–444.
[16] H. Toivonen, Sampling large databases for association rules, in: Proc. of 22nd VLDB
Conference, Mumbai, India, September 1996, pp. 134–145.
