
Improving UP-Growth Algorithm using Top-K High Utility Itemset Mining

Ramkrishna H. Patil
P.G. Student
SSBT's College of Engineering & Technology, Bambhori, Jalgaon, M.S., India
ramu@gmail.com
Sandip S. Patil
Associate Professor
SSBT's College of Engineering & Technology, Bambhori, Jalgaon, M.S., India
sspatiljalgaon@gmail.com

Abstract
Discovering useful patterns hidden in a database plays an essential role in several data mining
tasks. Mining high utility itemsets from databases is an emerging topic in data mining; it
refers to the discovery of itemsets with utilities higher than a user-specified minimum utility
threshold min_util. Although several studies have been carried out on this topic, setting an
appropriate minimum utility threshold is difficult for users. Existing high utility mining
algorithms generate a large number of candidate itemsets, and computing the utility of every
candidate takes considerable time, especially for dense datasets. If min_util is set too low,
too many high utility itemsets will be generated, which may cause the mining algorithms to
become inefficient or even run out of memory. On the other hand, if min_util is set too high, no
high utility itemset will be found. Setting appropriate minimum utility thresholds by trial and
error is a tedious process for users. Conventional frequent pattern mining algorithms likewise
require a user-specified minimum support, and then mine the frequent patterns whose support
values are higher than that minimum support. As it is difficult to predict how many frequent
patterns will be mined with a specified minimum support, the top-k mining concept has been
proposed. Top-k mining finds frequent patterns without a minimum support: it returns the k most
frequent patterns, ordered according to their support values. The top-k mining concept still
requires a parameter k, so users must decide the value of k before initiating mining. We address
the problem of setting min_util by proposing a new framework named top-k high utility itemset
mining, where k is the desired number of high utility itemsets to be mined. An efficient
algorithm named Top-K Utility itemsets mining is proposed for mining such itemsets without
setting min_util.

Introduction
Generally, data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful information, that is,
information that can be used to increase revenue, cut costs, or both. Data mining software is one
of a number of analytical tools for analyzing data. It allows users to analyze data from many
different dimensions or angles, categorize it, and summarize the relationships identified.
Technically, data mining is the process of finding correlations or patterns among dozens of fields
in large relational databases. The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for further use.

Literature Survey
Although many studies have been devoted to high utility itemset (HUI) mining, it is difficult
for users to choose an appropriate minimum utility threshold in practice. Depending on the
threshold, the output size can be very small or very large. The choice of the threshold also
greatly influences the performance of the algorithms. If the threshold is set too low, too many
high utility itemsets will be presented to the users, and it is difficult for the users to
comprehend the results. A large number of high utility itemsets also causes the mining
algorithms to become inefficient or even run out of memory, because the more high utility
itemsets the algorithms generate, the more resources they consume. Conversely, if the threshold
is set too high, no high utility itemset will be found. In this case, users need to try
different thresholds by guessing and re-executing the algorithms over and over until they are
satisfied with the results. This process is both inconvenient and time consuming.

Figure 2.1: Literature Survey

Frequent Pattern Mining:


Various studies have been proposed for mining frequent patterns. The two main approaches are
association rule mining [1] and sequential pattern mining [2]. In frequent pattern mining,
frequently occurring items are identified by various algorithms. A basic algorithm is the
Apriori algorithm.
Association Rule Mining:
The Apriori algorithm pioneered the efficient mining of association rules from large
databases. Afterwards, FP-Growth [6], a pattern-growth-based association rule mining
algorithm, was proposed. It achieves better performance than the Apriori algorithm because it
finds frequent itemsets without generating any candidate itemsets and scans the database just twice.
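To make the contrast with FP-Growth concrete, the following is a minimal sketch of the level-wise candidate generation that Apriori performs. The function name `apriori`, the absolute support count `minsup`, and the five toy transactions are illustrative assumptions, not taken from the cited papers.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for every itemset with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items with one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One more pass over the data to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy data: five market-basket transactions.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(toy, minsup=3)
```

Each level costs one pass over the data plus an explicit candidate set, which is the overhead that pattern-growth methods such as FP-Growth avoid.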
Sequential Pattern Mining:
The problem of discovering which items are bought together in a transaction over basket
data was introduced in earlier work on association rules. While related, the problem of finding
which items are bought together is concerned with finding intra-transaction patterns, whereas
the problem of finding sequential patterns is concerned with inter-transaction patterns. A
pattern in the first problem consists of an unordered set of items, whereas a pattern in the
latter case is an ordered list of sets of items. Each transaction in the database consists of
several fields. There cannot be more than one transaction with the same transaction-time. We do
not consider quantities of items bought in a transaction: each item is a binary variable
representing whether an item was bought or not. An itemset is a non-empty set of items. A
sequence is an ordered list of itemsets. However, the relative importance of items is not
considered in frequent itemset mining. Thus, weighted association rule mining [5] was brought
to attention.
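The difference between an unordered itemset and an ordered sequence of itemsets can be illustrated with a small containment check. This is a sketch under the usual assumption that a pattern is contained in a customer sequence when its itemsets can be matched, in order, to distinct itemsets of the sequence that are supersets of them. The function name `contains` and the sample data are hypothetical.

```python
def contains(sequence, pattern):
    """Return True if `pattern` (a list of itemsets) occurs in order in `sequence`."""
    pos = 0  # index of the next pattern itemset still to be matched
    for itemset in sequence:
        # Greedy matching: take the earliest itemset that covers pattern[pos].
        if pos < len(pattern) and set(pattern[pos]) <= set(itemset):
            pos += 1
    return pos == len(pattern)


# Hypothetical customer sequence: three transactions ordered by transaction-time.
customer = [{"a"}, {"b", "c"}, {"d"}]

in_order = contains(customer, [{"a"}, {"d"}])        # a bought before d
reversed_order = contains(customer, [{"d"}, {"a"}])  # order matters
same_basket = contains(customer, [{"b"}, {"c"}])     # b and c share one transaction
```

The last call returns False because `b` and `c` occur in the same transaction, so they form an intra-transaction pattern rather than an inter-transaction one.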
Weighted Association Rule Mining:
Mining association rules is an important issue in the field of data mining due to its wide
applications. Traditional association rules are, however, derived from frequent itemsets, which
only consider the occurrence of items and do not reflect other factors, such as price or profit.
Chan et al. thus proposed utility mining to solve this problem. They considered both the
individual profits and the quantities of products (items) in transactions, and used them to find
the actual utility values of itemsets. Several other studies on utility mining have been
proposed in recent years. Weighted mining [11][16][17] has recently been proposed. The mining
process of that algorithm can be divided into two phases. In the first phase, the possible
candidate transaction-weighted utility itemsets are found level by level. In the second phase,
the candidate transaction-weighted utility itemsets are further checked for their actual utility
values by an additional database scan. Finally, the itemsets whose actual weighted-utility
values are larger than or equal to a predefined threshold are output as the high
transaction-weighted utility itemsets [11][16][17].
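The quantities used by the two phases can be sketched as follows, assuming the usual definitions: the utility of an item in a transaction is its quantity times its unit profit, the utility of an itemset is summed over the transactions that contain it, and the transaction-weighted utility (TWU) of an itemset is the sum of the total utilities of those transactions. The profit table, the three transactions, and the function names are hypothetical.

```python
def itemset_utility(itemset, db, profit):
    """Exact utility: sum of quantity * unit profit over transactions containing the itemset."""
    total = 0
    for t in db:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total


def transaction_utility(t, profit):
    """Total utility of all items in one transaction."""
    return sum(q * profit[i] for i, q in t.items())


def twu(itemset, db, profit):
    """Transaction-weighted utility: an upper bound on the exact utility."""
    return sum(
        transaction_utility(t, profit) for t in db if all(i in t for i in itemset)
    )


# Hypothetical data: unit profits and transactions mapping item -> purchased quantity.
profit = {"a": 5, "b": 2, "c": 1}
db = [{"a": 1, "b": 2}, {"a": 2, "c": 3}, {"b": 1, "c": 1}]

u_b = itemset_utility({"b"}, db, profit)        # 2*2 + 1*2 = 6
u_ab = itemset_utility({"a", "b"}, db, profit)  # 1*5 + 2*2 = 9
twu_b = twu({"b"}, db, profit)                  # 9 + 3 = 12
twu_ab = twu({"a", "b"}, db, profit)            # 9
```

Here `{a, b}` has a higher utility than its subset `{b}`, so exact utility cannot be used for level-wise pruning. TWU never increases when an itemset grows, which is why the first phase prunes on TWU and the second phase checks actual utilities.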

Tree based HU-Mine Algorithm:


In recent years, high utility pattern mining has become one of the most important research
areas in data mining. The problem is challenging because the anti-monotone property does not
apply to utility. Existing high utility mining algorithms generate a large number of candidate
itemsets, and finding the utility value of all candidates takes much time, especially for dense
datasets. A novel conditional high utility tree (CHUT) is proposed in [12] to compress
transactional databases in two stages and reduce the search space, and a new algorithm called
HU-Mine is proposed to mine the complete set of high utility itemsets.
IHUP algorithms:
Another tree-based algorithm, named IHUP [3], was proposed to efficiently generate high
transaction-weighted utility itemsets (HTWUIs) and avoid scanning the database multiple times.
It uses a tree-based structure, the IHUP-Tree [3], to maintain information about itemsets and
their utilities. It first builds the IHUP-Tree, then generates HTWUIs from the tree, and finally
performs mining on those itemsets. This requires two database scans: the first scan builds the
tree, and during the second scan the FP-Growth algorithm is used.
Disk-resident High Utility Pattern Mining:
In earlier approaches, the tree structure is rebuilt every time the threshold value changes,
whereas a trie structure [14] is constructed once and can be reused as the minimum threshold
value changes. It therefore supports the "build once, mine many" property. A disk-resident
feature for mining the most valuable itemsets from a large database is then proposed, because
the whole trie structure [14], which contains all transactions of a large database, cannot be
stored in memory.
Top-k Frequent Itemset Mining:
In frequent pattern mining, several top-k pattern mining algorithms have been proposed.
Most of them follow the same general process for finding top-k patterns, although they also have
several differences. We describe this general process below and then highlight the challenges for
top-k high utility itemset mining.
The general process for mining top-k patterns from a database is the following. Initially, a
top-k pattern mining algorithm sets the minimum support threshold minsup to 0 to ensure that all
the top-k patterns will be found. Then, the algorithm starts searching for patterns by using a
search strategy. As soon as a pattern is found, it is added to a list of patterns L ordered by
the support of the patterns. The list L maintains the top-k patterns found so far. Once k
patterns are found, the value of minsup is raised to the support of the least interesting
pattern in L. Raising minsup prunes the search space when searching for more patterns.
Thereafter, each time a pattern is found that meets the minimum support threshold, the pattern
is inserted into L, the patterns in L that no longer respect the threshold are removed from L,
and the threshold is raised to the support of the least frequent pattern in L. The algorithm
continues searching for more patterns.
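The general process described above can be sketched as follows. This is an illustrative simplification, not any specific published algorithm: the search strategy is a plain depth-first enumeration, the list L is a min-heap, only itemsets that occur at least once are enumerated, and ties at the boundary are cut so that exactly k patterns are kept. The name `top_k_frequent` and the toy data are assumptions.

```python
import heapq


def top_k_frequent(transactions, k):
    """Return the k most frequent itemsets as (support, items) pairs, best first."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    minsup = 0  # internal threshold, starts at 0 and is only ever raised
    heap = []   # the list L: min-heap keyed on support, holds at most k patterns

    def search(prefix, start, covering):
        nonlocal minsup
        for idx in range(start, len(items)):
            item = items[idx]
            tids = [t for t in covering if item in t]
            support = len(tids)
            # Support never grows when a pattern is extended, so a pattern
            # below the current threshold can be pruned with all its extensions.
            if support == 0 or support < minsup:
                continue
            pattern = prefix + (item,)
            heapq.heappush(heap, (support, pattern))
            if len(heap) > k:
                heapq.heappop(heap)  # drop the least frequent pattern in L
            if len(heap) == k:
                minsup = heap[0][0]  # raise the threshold to the weakest support in L
            search(pattern, idx + 1, tids)

    search((), 0, transactions)
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))


# Hypothetical toy data.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
top3 = top_k_frequent(toy, 3)
```

This pruning is safe only because support is anti-monotone. Utility lacks that property, which is the main challenge when the same process is carried over to top-k high utility itemset mining.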
Proposed Solution
We propose an efficient algorithm named TKU (mining Top-K Utility itemsets) for
discovering top-k high utility itemsets without specifying min_util. We first present a baseline
approach named TKBase and then introduce effective strategies to enhance its performance.
Architecture:
The framework of TKBase consists of three parts:
(1) construction of UP-Tree,
(2) generation of potential top-k high utility itemsets (abbreviated as PKHUIs) from the
UP-Tree, and
(3) identifying top-k high utility itemsets from the set of PKHUIs.
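The three parts can be outlined with a deliberately simplified sketch. It does not implement the UP-Tree: part (1) is replaced by a plain scan that computes transaction utilities, part (2) enumerates candidates depth-first and prunes on transaction-weighted utility (TWU), and the internal border threshold is set once from the k-th highest single-item utility instead of being raised during mining. All names and the toy data are assumptions for illustration.

```python
import heapq


def top_k_high_utility(db, profit, k):
    """Return the k itemsets with the highest utility as (utility, items) pairs."""
    # Part 1 (stand-in for UP-Tree construction): scan the database once.
    tu = [sum(q * profit[i] for i, q in t.items()) for t in db]
    items = sorted({i for t in db for i in t})
    item_util = {i: sum(t[i] * profit[i] for t in db if i in t) for i in items}

    # Assumed border: the k-th highest single-item utility is a safe lower
    # bound on the utility of every top-k itemset (0 if fewer than k items).
    best = heapq.nlargest(k, item_util.values())
    border = best[-1] if len(best) == k else 0

    # Part 2: generate potential top-k itemsets (PKHUIs) with TWU >= border.
    candidates = []

    def grow(prefix, start, tids):
        for idx in range(start, len(items)):
            item = items[idx]
            sub = [tid for tid in tids if item in db[tid]]
            # TWU is anti-monotone, so pruning here also prunes all extensions.
            if not sub or sum(tu[tid] for tid in sub) < border:
                continue
            pattern = prefix + (item,)
            candidates.append((pattern, sub))
            grow(pattern, idx + 1, sub)

    grow((), 0, range(len(db)))

    # Part 3: compute exact utilities of the PKHUIs and keep the k best.
    scored = [
        (sum(db[tid][i] * profit[i] for tid in sub for i in pattern), pattern)
        for pattern, sub in candidates
    ]
    return heapq.nlargest(k, scored)


# Hypothetical data: unit profits and transactions mapping item -> quantity.
profit = {"a": 5, "b": 2, "c": 1}
db = [{"a": 1, "b": 2}, {"a": 2, "c": 3}, {"b": 1, "c": 1}]
top2 = top_k_high_utility(db, profit, 2)
```

With k = 2 the border is 6, so the candidate `{b, c}` (TWU 3) is never generated, while `{a, c}` survives and ranks second with utility 13.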

Results and Discussion


Experimental Evaluation:
In this section, we evaluate the performance of the proposed algorithm. Experiments were
performed on a computer with a 3.40 GHz Intel Core processor and 4 GB of memory, running
Windows 7. All of the algorithms are implemented in Java. Two real-world datasets were used in
the experiments. Chess, a dense dataset, was acquired from the SPMF database; Education, a dense
dataset, was obtained from our college feedback forms. The Chess dataset already contains unit
profits and purchased quantities. For the Education dataset, unit profits for items are
generated between 1 and 5 by using a normal skills distribution, and quantities of items are
generated according to the student feedback forms collected in the college. Table 4.2 shows
the characteristics of the datasets used in the experiments.
Dataset      Transactions   Avg. length   Items   Type
Chess        3450           37            74      dense
Education    500            -             -       dense

Table 4.2 Dataset characteristics

Experimental Results:
The proposed system is tested on the two real-world datasets shown in Table 4.2.

Figure 4.1: Proposed system results for the Chess dataset (itemsets found)


Figure 4.1 shows the number of itemsets found by the proposed system in the Chess dataset
for the k values listed in Table 4.3. For the initial value k = 10, 381 itemsets are found in
the Chess dataset. As k increases, the number of itemsets found increases, although for some
k values it remains constant.

Figure 4.2 shows the number of itemsets found by the proposed system in the Education
dataset for the k values listed in Table 4.5. For the initial value k = 10, 11 itemsets are
found in the Education dataset. As k increases, the number of itemsets found increases,
although for some k values it remains constant.

Conclusion and Future Work

We propose a top-k algorithm for mining high utility itemsets. It is based on the UP-Tree data
structure, which maintains the information of high utility itemsets. The literature survey of
previous and existing studies shows that various techniques have been proposed for mining high
utility itemsets.

References

[1] A. Erwin, R. P. Gopalan, and N. R. Achuthan, "Efficient Mining of High-utility Itemsets
from Large Datasets," In PAKDD 2008, LNAI 5012, pp. 554-561, 2008.
[2] Agrawal R., Imielinski T., and Swami A., "Mining association rules between sets of items in
large databases," In Proceedings of 1993 ACM SIGMOD Int'l Conf. on Management of Data,
pp. 207-216, May 1993.
[3] Agrawal R. and Srikant R., "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l
Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.
[4] Agrawal R. and Srikant R., "Mining Sequential Patterns," Proc. 1995 Int'l
Conf. Data Eng. (ICDE '95), pp. 3-14, Mar. 1995.
[5] B.-E. Shie, V. S. Tseng, and P. S. Yu, "Online Mining of Temporal Maximal Utility Itemsets
from Data Streams," In Proc. of the 25th Annual ACM Symposium on Applied Computing
(ACM SAC 2010), 2010.
[6] C.F. Ahmed, S.K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, "Efficient Tree Structures for High
Utility Pattern Mining in Incremental Databases," IEEE Trans. Knowledge and Data Eng., vol.
21, no. 12, pp. 1708-1721, Dec. 2009.
[7] C.H. Cai, A.W.C. Fu, C.H. Cheng, and W.W. Kwong, "Mining Association Rules with
Weighted Items," Proc. Int'l Database Eng. and Applications Symp. (IDEAS '98), pp. 68-77,
1998.
[8] Cheng Wei Wu, Bai-En Shie, Philip S. Yu, and Vincent S. Tseng, "Mining Top-K High Utility
Itemsets," KDD'12, August 12-16, 2012, Beijing, China, pp. 78-86, 2012.
[9] Chithra Ramaraju and Nickolas Savarimuthu, "A Conditional Tree Based Novel Algorithm for
High Utility Item set Mining," IEEE, 2011.
[10] FIMI (Frequent Itemset Mining Implementations Repository), http://fimi.cs.helsinki.fi/
[11] Guo-Cheng Lan, Tzung-Pei Hong, and Vincent S. Tseng, "Mining High Transaction-Weighted
Utility Itemsets," IEEE, 2010.
[12] H. Yao, H. J. Hamilton, and L. Geng, "A unified framework for utility-based measures for
mining itemsets," In Proc. of ACM SIGKDD 2nd Workshop on Utility-Based Data Mining, pp.
28-37, 2006.
[13] J. Han, J. Wang, Y. Lu, and P. Tzvetkov, "Mining Top-k Frequent Closed Patterns without
Minimum Support," In Proc. of ICDM, 2002.
[14] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc.
ACM-SIGMOD Int'l Conf. Management of Data, pp. 1-12, 2000.
[15] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation:
A frequent pattern tree approach," Data Mining and Knowledge Discovery, pp. 53-87, 2004.
[16] Dr. Yogendra Kumar Jain and Sandip S. Patil, "Design and Implementation of
Anomalies Detection System Using IP Gray Space Analysis," published in IEEE
Xplore and the Proceedings of the IEEE International Conference of Future Network
(ICFN 2008), Bangkok, Thailand, IEEE 2009, 7-9 March 2009, pp. 203-207.
DOI 10.1109/ICFN.2009.9
[17] Atul Chaudhari, Dr. Girish Kumar Patnaik, and Sandip S. Patil, "Implementation of Minutiae
based Fingerprint Identification System using Crossing Number Concept," published in the
International Journal of Computer Trends and Technology, Volume 8, Issue 4, ISSN 2231-2803,
pp. 178-183, Feb 2014.
