Abstract

Previous studies have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent patterns but only the closed ones, because the latter leads to not only a more compact yet complete result set but also better efficiency. However, most of the previously developed closed pattern mining algorithms work under the candidate maintenance-and-test paradigm, which is inherently costly in both runtime and space usage when the support threshold is low or the patterns become long.

In this paper we present BIDE, an efficient algorithm for mining frequent closed sequences without candidate maintenance. It adopts a novel sequence closure checking scheme called BI-Directional Extension, and it prunes the search space more deeply than the previous algorithms by using the BackScan pruning method and the ScanSkip optimization technique. A thorough performance study with both sparse and dense real-life data sets has demonstrated that BIDE significantly outperforms the previous algorithms: it consumes order(s) of magnitude less memory and can be more than an order of magnitude faster. It is also linearly scalable in terms of database size.

The work was supported in part by the National Science Foundation under Grant No. 02-09199, the Univ. of Illinois, and an IBM Faculty Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies. Currently with the Digital Technology Center, University of Minnesota.

¹ BIDE stands for BI-Directional Extension based frequent closed sequence mining.

1 Introduction

Sequential pattern mining, since its introduction in [2], has become an essential data mining task, with broad applications, including market and customer analysis, web log analysis, pattern discovery in protein sequences, and mining XML query access patterns for caching. Efficient mining methods have been studied extensively, including general sequential pattern mining [13, 20, 9, 24, 17, 4], constraint-based sequential pattern mining [6, 18, 19], frequent episode mining [12], cyclic association rule mining [14], temporal relation mining [5], partial periodic pattern mining [7], and long sequential pattern mining in noisy environments [23].

In recent years many studies have presented convincing arguments that for mining frequent patterns (for both itemsets and sequences), one should not mine all frequent patterns but the closed ones, because the latter leads to not only a more compact yet complete result set but also better efficiency [15, 25, 22, 21]. However, unlike mining frequent itemsets, there are not many methods proposed for mining closed sequential patterns. This is partly due to the complexity of the problem. To our best knowledge, CloSpan is currently the only such algorithm [22]. Like most of the frequent closed itemset mining algorithms, it follows a candidate maintenance-and-test paradigm, i.e., it needs to maintain the set of already mined closed sequence candidates, which can be used to prune the search space and to check whether a newly found frequent sequence is promising to be closed. Unfortunately, a closed pattern mining algorithm under such a paradigm has rather poor scalability in the number of frequent closed patterns, because a large number of frequent closed patterns (or just candidates) will occupy much memory and lead to a large search space for the closure checking of new patterns, which is usually the case when the support threshold is low or the patterns become long.

Can we find a way to mine frequent closed sequences without candidate maintenance? This seems to be a very difficult task. In this paper, we present a nice solution which leads to an algorithm, BIDE¹, that efficiently mines the complete set of frequent closed sequences. In BIDE, we do not need to keep track of any single historical frequent closed sequence (or candidate) for a new pattern's closure checking, which leads to our proposal of a deep search space pruning method and some other optimization techniques.
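The compactness-plus-completeness property of closed patterns discussed above can be made concrete with a brute-force sketch: a frequent sequence is closed if it has no proper super-sequence with the same support. The code below is illustrative only (toy data, exhaustive enumeration); it is emphatically not the BIDE algorithm, which avoids exactly this kind of candidate enumeration.

```python
from itertools import combinations

# Toy sequence database: each sequence is a list of single-item events.
SDB = [
    ["C", "A", "A", "B", "C"],
    ["A", "B", "C", "B"],
    ["C", "A", "B", "C"],
    ["A", "B", "B", "C", "A"],
]
MIN_SUP = 2  # absolute support threshold


def is_subseq(pat, seq):
    """True if pat occurs in seq as an order-preserving subsequence."""
    it = iter(seq)
    return all(item in it for item in pat)


def support(pat):
    """Number of input sequences containing pat."""
    return sum(is_subseq(pat, seq) for seq in SDB)


# Enumerate every subsequence of every input sequence (exponential;
# fine only for toy data), then keep the frequent ones.
candidates = {
    tuple(seq[i] for i in idx)
    for seq in SDB
    for r in range(1, len(seq) + 1)
    for idx in combinations(range(len(seq)), r)
}
frequent = {p: support(p) for p in candidates if support(p) >= MIN_SUP}

# A frequent sequence is closed if no proper super-sequence has the
# same support.
closed = {
    p: s
    for p, s in frequent.items()
    if not any(q != p and t == s and is_subseq(p, q)
               for q, t in frequent.items())
}

print(f"{len(frequent)} frequent vs. {len(closed)} closed")
```

The closed set is smaller, yet lossless: the support of any frequent sequence equals the maximum support among its closed super-sequences, which is the "compact yet complete" property the text refers to.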
Unlike most of the previous closed pattern mining algorithms, BIDE can prune the search space and check whether a pattern is closed without maintaining any already mined patterns (or candidates); hence it is very space efficient, and its worst-case memory usage can be calculated even before the mining. Assume the input sequence database SDB contains s_num input sequences (i.e., |SDB| = s_num) and totally i_num distinct items, each sequence contains avg_l events on average, and the longest sequence contains max_l events. The worst-case memory usage of BIDE is O((s_num × avg_l) + (max_l × s_num × (2 × max_l)) + (max_l × i_num)), where (s_num × avg_l) represents the input sequence database.

4 Performance evaluation

In this section, we present our thorough experimental results in order to verify the following claims: (1) a properly designed frequent closed sequence mining algorithm like BIDE or CloSpan can significantly outperform two efficient frequent sequence mining algorithms, PrefixSpan and SPADE, when the support threshold is low; (2) BIDE consumes much less memory and can be an order of magnitude faster than CloSpan; (3) BIDE has linear scalability in terms of database size; and (4) the BackScan pruning method and the ScanSkip technique are very effective in enhancing the performance.
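For concreteness, the worst-case memory bound stated above can be evaluated numerically. The helper below is hypothetical (not part of BIDE) and counts abstract units such as stored items and counters, not bytes; the sample figures are the Snake statistics reported in Table 2.

```python
def bide_worst_case_memory(s_num, i_num, avg_l, max_l):
    """Worst-case memory bound stated in the text, in abstract units:
    O((s_num * avg_l) + (max_l * s_num * (2 * max_l)) + (max_l * i_num)).
    The first term accounts for the in-memory sequence database; the
    derivation of the remaining two terms is not given in this excerpt.
    """
    return s_num * avg_l + max_l * s_num * (2 * max_l) + max_l * i_num


# Snake dataset statistics from Table 2: 175 sequences, 20 items,
# average length 67, maximum length 121.
units = bide_worst_case_memory(s_num=175, i_num=20, avg_l=67, max_l=121)
print(units)  # 5138495 abstract units
```

The point of the bound is that it depends only on these four database statistics, so the memory ceiling is known before mining starts, in contrast to candidate-maintenance algorithms whose footprint grows with the (unknown) number of closed patterns.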
All of our experiments were performed on an IBM ThinkPad R31 with 384MB memory and Windows XP Professional installed. In the experiments we compared BIDE with two frequent sequence mining algorithms, PrefixSpan and SPADE, and one frequent closed sequence mining algorithm, CloSpan. The source code of SPADE and PrefixSpan and the executable code of CloSpan were provided by their corresponding authors. We ran the four algorithms in the same Cygwin environment with the output turned off.

Because the synthetic datasets have far different characteristics from the real-world ones, in our experiments we used only real datasets. We cautiously chose real datasets that cover a range of distribution characteristics: sparse, a little dense, and very dense.

The first dataset, Gazelle, is very sparse, but it contains some very long frequent closed sequences at low support thresholds. This dataset was originally provided by the Blue Martini company and has been used in evaluating both frequent itemset mining algorithms [25, 8] and frequent sequence mining algorithms [22]. It contains the Web click-stream data of 29369 customers. For each customer there is a corresponding series of page views, and we treat each page view as an event. The dataset contains 29369 sequences (i.e., customers), 87546 events (i.e., page views), and 1423 distinct items (i.e., web pages). More detailed information about this dataset can be found in [11].

The second dataset, Snake, is a little dense and can generate a lot of frequent closed sequences at a medium support threshold like 60%. It is a bio-dataset which contains 175 Toxin-Snake protein sequences and 20 unique items. The Toxin-Snake dataset concerns a family of eukaryotic and viral DNA binding proteins and has been used in evaluating pattern discovery tasks [10].

We also found a very dense dataset, Pi, from which a huge number of frequent closed sequences can be mined even at a very high support threshold like 90%. It is also a bio-dataset, containing 190 protein sequences and 21 distinct items; it has been used to assess the reliability of functional inheritance [3]. The characteristics of these datasets are shown in Table 2.

Dataset   # seq.   # items   avg. seq. len.   max. seq. len.
Gazelle   29369    1423      3                651
Snake     175      20        67               121
Pi        190      21        258              757

Table 2. Dataset Characteristics.

Closed vs. all frequent sequence mining. A previous study [22] has shown that a properly designed closed sequential pattern mining algorithm like CloSpan can outperform an efficient sequential pattern mining algorithm, PrefixSpan, by more than an order of magnitude. As a result, it would be unfair to compare BIDE only with all-frequent-sequence mining algorithms. Here we use one dataset to show a previously unrevealed fact: at a high support threshold, a well-designed closed sequence mining algorithm may lose to some all-frequent-sequence mining algorithms. Afterwards we focus on the comparison of BIDE with CloSpan.

Fig. 4. Comparison among BIDE, CloSpan, SPADE and PrefixSpan (Gazelle dataset, runtime). Fig. 5. Comparison among BIDE, CloSpan, SPADE and PrefixSpan (Gazelle dataset, memory).

Fig. 4 and Fig. 5 depict the comparison results among SPADE, PrefixSpan, CloSpan and BIDE for the Gazelle dataset. From Fig. 4 we can see that when the support is greater than 0.03%, both PrefixSpan and SPADE outperform CloSpan and BIDE, but once we continue to lower the support threshold (e.g., to 0.02%), the two closed sequence mining algorithms greatly outperform the two all-frequent-sequence mining algorithms, due to the explosive number of frequent sequences generated by the latter. Fig. 4 also shows that BIDE outperforms CloSpan at every tested support threshold. From Fig. 5 we see that both PrefixSpan and BIDE use a rather stable amount of memory, which can be more than an order of magnitude less than that used by SPADE and CloSpan.

BIDE vs. CloSpan. We first compared BIDE with CloSpan using the Gazelle dataset. Fig. 6 depicts the distribution of the number of frequent closed sequences against the length of the frequent closed sequences for support thresholds varying from 0.02% to 0.01%. From Fig. 6 we can see that many long closed sequences can be discovered in this sparse dataset; for example, at support 0.01%, the longest frequent closed sequence has length 127.
Fig. 6. Dataset Gazelle (distribution).

Fig. 7 and Fig. 8 demonstrate the runtime and memory usage comparison between BIDE and CloSpan. We can see that BIDE always runs much faster than CloSpan and consumes much less memory. For example, at support 0.01%, BIDE can be more than an order of magnitude faster than CloSpan, while using over an order of magnitude less memory.

Fig. 9 depicts the distribution of the number of frequent closed sequences against the length of the frequent closed sequences for dataset Snake. We can see that it is a little dense dataset: a lot of closed sequences with a medium length from 6 to 12 can be mined at some not very low support thresholds.

Fig. 9. Dataset Snake (distribution; min_sup from 50% to 70%).
Fig. 10. Dataset Snake (runtime). Fig. 11. Dataset Snake (memory).

Fig. 10 and Fig. 11 report the runtime and memory comparison on the Snake dataset for support thresholds from 50% to 70%. Fig. 10 shows that at a high support threshold, BIDE is several times slower than CloSpan, but once the support is no greater than 60%, BIDE outperforms CloSpan.

We also used the very dense dataset, Pi, to compare BIDE with CloSpan. From Fig. 12 we can see that even at a very high support like 90%, there can be a large number of short frequent closed sequences with a length less than 10.

Fig. 12. Dataset Pi (distribution; min_sup from 88% to 96%).

Fig. 13 shows that at supports higher than 90% the two algorithms have very similar performance (CloSpan is only a little faster than BIDE), but once the support is no greater than 88%, BIDE outperforms CloSpan by a large margin. For example, at support 88%, BIDE is more than 6 times faster than CloSpan. From Fig. 14 we know that BIDE always uses much less memory than CloSpan; at support 88%, BIDE consumes over 2 orders of magnitude less memory than CloSpan.

Fig. 13. Dataset Pi (runtime). Fig. 14. Dataset Pi (memory).

All of the above performance results for BIDE use both the BackScan pruning method and the ScanSkip technique.

Scalability test. We tested BIDE's scalability in both runtime and memory usage on all three datasets, in terms of database size. In Fig. 15 and Fig. 16, we fixed the support threshold at a constant for each dataset and replicated the sequences from 2 to 16 times. Although these three datasets have rather different features, BIDE shows linear scalability in both runtime and memory usage against the increasing number of sequences. For example, for dataset Snake at support 60%, the runtime increases from 807 seconds to 11906 seconds and the memory usage increases from 1.883MB to 21.784MB when the number of sequences increases 16 times.
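Note that this replication protocol multiplies every pattern's absolute support by exactly the replication factor, so at a fixed relative support threshold the mined pattern set stays the same while the database grows linearly. A toy check of this invariant (illustrative code, not from the paper):

```python
def support(pat, sdb):
    """Count the sequences in sdb that contain pat as a subsequence."""
    def occurs(seq):
        it = iter(seq)
        return all(item in it for item in pat)
    return sum(occurs(seq) for seq in sdb)


sdb = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
pat = ("a", "c")
base = support(pat, sdb)  # pat occurs in 2 of the 3 sequences

for k in (2, 4, 8, 16):
    replicated = sdb * k  # the replication used in the scalability test
    # Absolute support scales exactly with the replication factor, so the
    # relative support, and hence the mined result set, is unchanged.
    assert support(pat, replicated) == k * base

print("relative support is invariant under replication")
```

This is what makes replication a clean scalability benchmark: only the input size changes, not the answer, so any growth in runtime or memory is attributable to database size alone.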
Fig. 15. Scalability test (runtime). Fig. 16. Scalability test (memory). (Replication factors 2 to 16; min_sup = 0.000136 for Gazelle, 0.60 for Snake, 0.92 for Pi.)

Effectiveness of the optimization techniques. Fig. 17 tests the effectiveness of the BackScan pruning method. We can see that the BackScan pruning method is very effective in pruning the search space and speeding up the mining process: for the Gazelle dataset at support threshold 0.02%, it gives several orders of magnitude of performance enhancement. The effectiveness of the BackScan pruning method assures that BIDE, which is based on only this single pruning method, can in most cases significantly outperform the CloSpan algorithm, which adopts several pruning methods.

Fig. 17. Effectiveness of BackScan pruning (Gazelle dataset). Fig. 18. Effectiveness of ScanSkip optimization (Snake dataset).

5 Conclusions

Many studies have elaborated that closed pattern mining has the same expressive power as all frequent pattern mining, yet leads to a more compact result set and significantly better efficiency. Our study shows that this is usually true when the number of frequent patterns is prohibitively huge, in which case the number of frequent closed patterns is also likely to be very large. Unfortunately, most of the previously developed closed pattern mining algorithms rely on the historical set of frequent closed patterns (or candidates) to check whether a newly found frequent pattern is closed, or whether it can invalidate some already mined closed candidates. Because the set of already mined frequent closed patterns keeps growing during the mining process, it not only consumes more memory but also leads to inefficiency, due to the growing search space for pattern closure checking.

In this paper, we proposed BIDE, a novel algorithm for mining frequent closed sequences. It avoids the curse of the candidate maintenance-and-test paradigm: it prunes the search space more deeply and checks pattern closure more efficiently, while consuming much less memory than the previously developed closed pattern mining algorithms. Because it does not need to maintain the set of historic closed patterns, it scales very well in the number of frequent closed patterns. BIDE adopts a strict depth-first search order and can output the frequent closed patterns in an online fashion. An extensive set of experiments on several real datasets with different distribution features has shown the effectiveness of the algorithm design: BIDE consumes order(s) of magnitude less memory and can be more than an order of magnitude faster than the previous algorithms.