
BIDE: Efficient Mining of Frequent Closed Sequences

Jianyong Wang and Jiawei Han


Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801, U.S.A.
{wangj, hanj}@cs.uiuc.edu

The work was supported in part by the National Science Foundation under Grant No. 02-09199,
the Univ. of Illinois, and an IBM Faculty Award. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s) and do not necessarily
reflect the views of the funding agencies.
Currently with the Digital Technology Center, University of Minnesota at Twin Cities,
email: jianyong@cs.umn.edu.

Proceedings of the 20th International Conference on Data Engineering (ICDE'04)
1063-6382/04 $20.00 © 2004 IEEE

Abstract

Previous studies have presented convincing arguments that a frequent pattern mining algorithm
should not mine all frequent patterns but only the closed ones, because the latter leads to not
only a more compact yet complete result set but also better efficiency. However, most of the
previously developed closed pattern mining algorithms work under the candidate
maintenance-and-test paradigm, which is inherently costly in both runtime and space usage when
the support threshold is low or the patterns become long.

In this paper, we present BIDE, an efficient algorithm for mining frequent closed sequences
without candidate maintenance. It adopts a novel sequence closure checking scheme called
BI-Directional Extension, and prunes the search space more deeply than the previous algorithms
by using the BackScan pruning method and the ScanSkip optimization technique. A thorough
performance study with both sparse and dense real-life data sets has demonstrated that BIDE
significantly outperforms the previous algorithms: it consumes order(s) of magnitude less
memory and can be more than an order of magnitude faster. It is also linearly scalable in terms
of database size.

1 Introduction

Sequential pattern mining, since its introduction in [2], has become an essential data mining
task, with broad applications, including market and customer analysis, web log analysis,
pattern discovery in protein sequences, and mining XML query access patterns for caching.
Efficient mining methods have been studied extensively, including the general sequential
pattern mining [13, 20, 9, 24, 17, 4], constraint-based sequential pattern mining [6, 18, 19],
frequent episode mining [12], cyclic association rule mining [14], temporal relation mining
[5], partial periodic pattern mining [7], and long sequential pattern mining in noisy
environment [23].

In recent years many studies have presented convincing arguments that for mining frequent
patterns (for both itemsets and sequences), one should not mine all frequent patterns but the
closed ones, because the latter leads to not only a more compact yet complete result set but
also better efficiency [15, 25, 22, 21]. However, unlike mining frequent itemsets, there are
not so many methods proposed for mining closed sequential patterns. This is partly due to the
complexity of the problem. To the best of our knowledge, CloSpan is currently the only such
algorithm [22]. Like most of the frequent closed itemset mining algorithms, it follows a
candidate maintenance-and-test paradigm, i.e., it needs to maintain the set of already mined
closed sequence candidates, which can be used to prune the search space and check if a newly
found frequent sequence is promising to be closed. Unfortunately, a closed pattern mining
algorithm under such a paradigm has rather poor scalability in the number of frequent closed
patterns, because a large number of frequent closed patterns (or just candidates) will occupy
much memory and lead to a large search space for the closure checking of new patterns, which is
usually the case when the support threshold is low or the patterns become long.

Can we find a way to mine frequent closed sequences without candidate maintenance? This seems
to be a very difficult task. In this paper, we present a solution which leads to an algorithm,
BIDE (BI-Directional Extension based frequent closed sequence mining), that efficiently mines
the complete set of frequent closed sequences. In BIDE, we do not need to keep track of any
single historical frequent closed sequence (or candidate) for a new pattern's closure checking,
which leads to our proposal of a deep search space pruning method and some other optimization
techniques. Our thorough performance study demonstrates the success of the algorithm design:
BIDE consumes order(s) of magnitude less memory and runs over an order of magnitude faster than
the previously developed frequent (closed) sequence mining algorithms, especially when the
support is low.

The rest of this paper is organized as follows: In section 2 we present the problem definition
of frequent closed sequence mining and discuss the related work and our contributions to this
problem. Section 3 is focused on the BIDE algorithm: mainly introducing the BI-Directional
Extension pattern closure checking mechanism, the BackScan pruning method and the ScanSkip
optimization technique. Some possible extensions are also discussed in this section. In section
4 we present an extensive experimental study. Finally, we conclude the study in section 5.

2 Problem definition and related work

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of distinct items. A sequence $S$ is an ordered
list of events, denoted as $\langle e_1, e_2, \ldots, e_m \rangle$, where $e_i$ is an item,
i.e., $e_i \in I$ for $1 \le i \le m$. For brevity, a sequence is also written as
$e_1 e_2 \ldots e_m$. From the definition we know that an item can occur multiple times in
different events of a sequence. The number of events (i.e., instances of items) in a sequence
is called the length of the sequence, and a sequence with a length $l$ is also called an
$l$-sequence. For example, AABCCA is a 6-sequence. A sequence $S_a = a_1 a_2 \ldots a_n$ is
contained in another sequence $S_b = b_1 b_2 \ldots b_m$ if there exist integers
$1 \le i_1 < i_2 < \ldots < i_n \le m$ such that $a_1 = b_{i_1}, a_2 = b_{i_2}, \ldots,
a_n = b_{i_n}$. If sequence $S_a$ is contained in sequence $S_b$, $S_a$ is called a subsequence
of $S_b$ and $S_b$ a supersequence of $S_a$, denoted as $S_a \sqsubseteq S_b$.

An input sequence database $SDB$ is a set of tuples $(sid, S)$, where $sid$ is a sequence
identifier and $S$ an input sequence. The number of tuples in $SDB$ is called the base size of
$SDB$, denoted as $|SDB|$. A tuple $(sid, S)$ is said to contain a sequence $S_\alpha$ if $S$
is a supersequence of $S_\alpha$, i.e., $S_\alpha \sqsubseteq S$. The absolute support of a
sequence $S_\alpha$ in a sequence database $SDB$ is the number of tuples in $SDB$ that contain
$S_\alpha$, denoted as $sup^{SDB}(S_\alpha)$, and the relative support is the percentage of
tuples in $SDB$ that contain $S_\alpha$ (i.e., $sup^{SDB}(S_\alpha)/|SDB|$). Without loss of
generality, we use the absolute support for describing the BIDE algorithm while using the
relative support to present the experimental results in the remainder of the paper.

Given a support threshold min_sup, a sequence $S_\alpha$ is a frequent sequence on $SDB$ if
$sup^{SDB}(S_\alpha) \ge min\_sup$. If sequence $S_\alpha$ is frequent and there exists no
proper supersequence of $S_\alpha$ with the same support, i.e., $\nexists S_\beta$ such that
$S_\alpha \sqsubset S_\beta$ and $sup^{SDB}(S_\alpha) = sup^{SDB}(S_\beta)$, we call $S_\alpha$
a frequent closed sequence. The problem of mining frequent closed sequences is to find the
complete set of frequent closed sequences for an input sequence database $SDB$, given a minimum
support threshold, min_sup.

    Sequence identifier    Sequence
    1                      CAABC
    2                      ABCB
    3                      CABC
    4                      ABBCA

Table 1. An example sequence database SDB.

Example 1 Table 1 shows the input sequence database $SDB$ in our running example. The database
has three unique items in total and four input sequences (i.e., $|SDB| = 4$). Suppose
min_sup = 2. The complete set of frequent closed sequences, $S_{fcs}$ = {AA:2, ABB:2, ABC:4,
CA:3, CABC:2, CB:3}, consists of only six sequences, while the whole set of frequent sequences
consists of 17 sequences, that is, $S_{fs}$ = {A:4, AA:2, AB:4, ABB:2, ABC:4, AC:4, B:4, BB:2,
BC:4, C:4, CA:3, CAB:2, CABC:2, CAC:2, CB:3, CBC:2, CC:2}. Obviously, $S_{fcs}$ is more compact
than $S_{fs}$. Also, if a frequent sequence, $S_\alpha$, has the same support as that of one of
its proper supersequences, $S_\beta$, $S_\alpha$ is absorbed by $S_\beta$. For example,
frequent sequence CBC:2 is absorbed by sequence CABC:2, because (CBC $\sqsubset$ CABC) and
($sup^{SDB}$(CBC) = $sup^{SDB}$(CABC) = 2).

Notice that in the above definition of a sequence, each event contains only a single item. Thus
the derived BIDE algorithm mines only frequent closed single-item sequences. In section 3.6, we
show that BIDE can be easily extended to mine closed sequences of subsets of items (e.g.,
sequences of shopping transactions). We first discuss mining single-item sequences because (1)
it makes the presentation clear by focusing on the methodology and optimization techniques
instead of tedious description, and (2) it represents one of the most important and popular
types of sequences, such as DNA strings, protein sequences, Web click streams, and sequences of
file block references in operating systems.

2.1 Related work

The sequential pattern mining problem was first proposed by Agrawal and Srikant in [2], and the
same authors further developed a generalized and refined algorithm, GSP [20], based on the
Apriori property [1]. Since then, many sequential pattern mining algorithms have also been
proposed for performance improvements. Among those, SPADE [24], PrefixSpan [17], and SPAM [4]
are quite interesting ones. SPADE is based on a vertical id-list format and uses a
lattice-theoretic approach to decompose the original search space into smaller spaces, while
PrefixSpan adopts
a horizontal format dataset representation and mines the sequential patterns under the
pattern-growth paradigm: grow a prefix pattern to get longer sequential patterns by building
and scanning its projected database. Both SPADE and PrefixSpan outperform GSP. SPAM is a
recently developed algorithm for mining long sequential patterns and adopts a vertical bitmap
representation. Its performance study shows that SPAM is more efficient in mining long patterns
than SPADE and PrefixSpan; however, it consumes more space in comparison with SPADE and
PrefixSpan.

Since the introduction of frequent closed itemset mining [15], several efficient frequent
closed itemset mining algorithms have been proposed, such as A-Close [15], CLOSET [16], CHARM
[25], and CLOSET+ [21]. Most of these algorithms need to maintain the already mined frequent
closed patterns in order to do pattern closure checking. To reduce the memory usage and search
space for pattern closure checking, two algorithms, TFP [8] and CLOSET+, adopt a compact
2-level hash indexed result-tree structure to store the already mined frequent closed itemset
candidates. (CLOSET+ adopts a hybrid closure checking scheme: the result tree method for dense
datasets and upward checking for sparse datasets, among which the upward checking can be
regarded as a simplified version of the backward-extension event checking described in Lemma 2
of this paper.) Some of the pruning methods and pattern closure checking schemes proposed there
can be extended for optimizing the mining of closed sequential patterns as well.

CloSpan is a recently proposed algorithm for mining frequent closed sequences [22]. It follows
the candidate maintenance-and-test approach: first generate a set of closed sequence candidates,
which is stored in a hash-indexed result-tree structure, and then do post-pruning on it. It uses
some pruning methods like CommonPrefix and Backward Sub-Pattern pruning to prune the search
space. Because CloSpan needs to maintain the set of historical closed sequence candidates, it
will consume much memory and lead to a huge search space for pattern closure checking when
there are many frequent closed sequences. As a result, it does not scale very well with respect
to the number of frequent closed sequences.

Contributions. In this paper, we introduce BIDE, an efficient algorithm for discovering the
complete set of frequent closed sequences. The contributions of this paper include: (1) A new
paradigm is proposed for mining closed sequences without candidate maintenance, called
BI-Directional Extension. The forward directional extension is used to grow the prefix patterns
and also check the closure of prefix patterns, while the backward directional extension can be
used to both check closure of a prefix pattern and prune the search space. (2) Under the
BI-Directional Extension paradigm, we designed an efficient algorithm for frequent closed
sequence mining, BIDE. The BI-Directional Extension pattern closure checking scheme, the
BackScan pruning method, and the ScanSkip optimization technique are proposed to speed up the
mining and also assure the correctness of the algorithm. (3) Our thorough performance study
shows that BIDE has surprisingly high efficiency: it can be an order of magnitude faster than
CloSpan while using order(s) of magnitude less memory in many cases. It also has very good
scalability w.r.t. the database size.

3 BIDE: An efficient algorithm for frequent closed sequence mining

In this section, we introduce the BIDE algorithm by answering the following questions: How to
enumerate the complete set of frequent sequences? Upon getting a frequent sequence, how to
check if it is closed? How to design some search space pruning methods or other optimization
techniques to accelerate the mining process?

    Level 1:  A:4  B:4  C:4
    Level 2:  AA:2  AB:4  AC:4  BB:2  BC:4  CA:3  CB:3  CC:2
    Level 3:  ABB:2  ABC:4  CAB:2  CAC:2  CBC:2
    Level 4:  CABC:2

Fig. 1. The lexicographic frequent sequence tree in our running example.

3.1 Frequent sequence enumeration

Assume there is a lexicographical ordering among the set of items $I$ in the input sequence
database (e.g., in our running example, one possible item ordering can be
$A \prec B \prec C$). Conceptually, the complete search space of sequence mining forms a
sequence tree [4], which can be constructed in the following way: the root node of the tree is
at the top level and labeled with $\emptyset$; recursively, we can extend a node $N$ at level
$L$ in the tree by adding one item in $I$ to get a child node at the next level $L+1$, and the
children of a node $N$ are generated and arranged according to the chosen lexicographical
ordering. By removing the infrequent sequences in the sequence tree, the remaining nodes form a
lexicographic frequent sequence tree, which contains the complete set of frequent sequences.
Fig. 1 shows the lexicographic frequent sequence tree built from our running example. In
Fig. 1, each node contains a frequent sequence and its corresponding support, and the sequences
in the dotted ellipses are non-closed ones.
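
The definitions of Section 2 and the tree of Fig. 1 can be checked by brute force on the
running example. The following Python sketch is only an illustration under stated assumptions:
it is not BIDE, it recomputes supports by rescanning the whole database, and the names
`contains`, `support`, `frequent_levels` and `closed_sequences` are hypothetical helpers that
do not come from the paper. Sequences are represented as plain strings of single-character
items, as in Table 1.

```python
SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]  # Table 1 of the running example
MIN_SUP = 2                               # absolute support threshold


def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq` (order kept, gaps allowed)."""
    it = iter(seq)
    # `item in it` advances the iterator past the first match, so the
    # items of `pattern` must be found in order.
    return all(item in it for item in pattern)


def support(pattern, sdb=SDB):
    """Absolute support: number of input sequences that contain `pattern`."""
    return sum(contains(s, pattern) for s in sdb)


def frequent_levels(sdb=SDB, min_sup=MIN_SUP):
    """Build the lexicographic frequent sequence tree level by level.

    Each node is extended by appending one item; infrequent children are
    dropped, which is safe by the Apriori property.
    """
    items = sorted(set("".join(sdb)))
    levels, current = [], [""]            # the root is the empty sequence
    while current:
        children = [p + i for p in current for i in items
                    if support(p + i, sdb) >= min_sup]
        if children:
            levels.append({p: support(p, sdb) for p in children})
        current = children
    return levels


def closed_sequences(levels):
    """Keep the frequent sequences that no proper supersequence absorbs."""
    freq = {p: s for level in levels for p, s in level.items()}
    # A supersequence with equal support is itself frequent, so it is
    # enough to compare against the other frequent sequences.
    return {p: s for p, s in freq.items()
            if not any(len(q) > len(p) and t == s and contains(q, p)
                       for q, t in freq.items())}


levels = frequent_levels()
closed = closed_sequences(levels)
```

On Table 1 with min_sup = 2, `levels` holds the four levels of Fig. 1 (17 frequent sequences in
total) and `closed` holds the six frequent closed sequences listed in Example 1. The closure
test here compares every frequent sequence against all the others, so the sketch is only
practical for very small databases.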

Many previous frequent pattern (either itemset or sequence) mining algorithms have shown that
depth-first searching is more efficient in mining long patterns than breadth-first searching
[4]. BIDE traverses the sequence tree in a strict depth-first search order. In our example
shown in Fig. 1, the frequent sequences will be mined and reported in the following order: A:4,
AA:2, AB:4, ABB:2, ABC:4, AC:4, B:4, BB:2, BC:4, C:4, CA:3, CAB:2, CABC:2, CAC:2, CB:3, CBC:2,
CC:2.

A certain node in the sequence tree can be treated as a prefix sequence, from which the set of
its children can be generated by adding one item in $I$. Some items may not be locally frequent
with respect to (abbreviated w.r.t.) the corresponding prefix sequence. Because we are only
interested in mining frequent sequences, according to the downward closure property (also
called the Apriori property [1]), we only need to grow a prefix sequence using the set of its
locally frequent items. To compute the locally frequent items w.r.t. a certain prefix, a
well-known method is to build the projected database for the prefix and scan it to count the
items. Two kinds of projection methods have been used in the past: physical projection and
pseudo projection [17]. Because the physical projection-based method needs to physically build
the conditional projected databases, it is not space- and runtime-efficient due to the cost of
allocating and freeing memory. In BIDE, we only use the pseudo projection method to find the
set of locally frequent items w.r.t. a certain prefix and use them to grow the corresponding
prefix. Here we briefly introduce the pseudo-projection method (for details, see [17]).

Definition 1 (First instance of a prefix sequence) Given an input sequence $S$ which contains a
prefix 1-sequence $e_1$, the subsequence from the beginning of $S$ to the first appearance of
item $e_1$ in $S$ is called the first instance of prefix 1-sequence $e_1$ in $S$. Recursively,
we can define the first instance of an $(i+1)$-sequence $e_1 e_2 \ldots e_i e_{i+1}$ from the
first instance of the $i$-sequence $e_1 e_2 \ldots e_i$ (where $i \ge 1$) as the subsequence
from the beginning of $S$ to the first appearance of item $e_{i+1}$ which also occurs after the
first instance of the $i$-sequence $e_1 e_2 \ldots e_i$. For example, the first instance of the
prefix sequence AB in sequence CAABC is CAAB.

Definition 2 (Projected sequence of a prefix sequence) Given an input sequence $S$ which
contains a prefix $i$-sequence $e_1 e_2 \ldots e_i$, the remaining part of $S$ after we remove
the first instance of the prefix $i$-sequence $e_1 e_2 \ldots e_i$ in $S$ is called the
projected sequence w.r.t. prefix $e_1 e_2 \ldots e_i$ in $S$. For example, the projected
sequence of prefix sequence AB in sequence ABBCA is BCA.

Definition 3 (Projected database of a prefix sequence) Given an input sequence database $SDB$,
the complete set of projected sequences in $SDB$ w.r.t. a prefix sequence $e_1 e_2 \ldots e_i$
is called the projected database w.r.t. prefix $e_1 e_2 \ldots e_i$ in $SDB$. For example, the
projected database of prefix sequence AB in our running example is {C, CB, C, BCA}.

After giving the definition of the projected database for a certain prefix sequence, the idea
of pseudo-projection can be described as follows. Instead of physically constructing the
projected database, we only need to record a set of pointers, one for each projected sequence,
pointing at the starting position in the corresponding projected sequence. By following the set
of pointers, it is easy to locate the set of projected sequences. And by scanning forward each
projected sequence w.r.t. a prefix $S_p$ and counting the items (this is the so-called
forward-extension step), we will find the locally frequent items w.r.t. prefix $S_p$, which can
be used to grow prefix $S_p$ in order to get longer frequent prefix sequences. For example, if
$S_p$ = AB, the set of its locally frequent items is {C:4, B:2}.

    Frequent-sequence-enumeration (SDB, min_sup, FS)
    Input: an input sequence database SDB, a minimum support threshold min_sup
    Output: the complete set of frequent sequences, FS
     1: FS = ∅;
     2: call Frequent-sequences(SDB, ∅, min_sup, FS);
     3: return FS;

    Frequent-sequences (Sp_SDB, Sp, min_sup, FS)
    Input: a projected sequence database Sp_SDB, a prefix sequence Sp,
           and a minimum support threshold min_sup
    Output: the current set of frequent sequences, FS
     4: if Sp is non-empty
     5:     FS = FS ∪ {Sp};
     6: LF_Sp = locally frequent items (Sp_SDB, Sp, min_sup);
     7: if LF_Sp is empty
     8:     return;
     9: for each locally frequent item i
    10:     Spi = <Sp, i>;
    11:     Spi_SDB = pseudo projected database (Spi, Sp_SDB);
    12:     call Frequent-sequences(Spi_SDB, Spi, min_sup, FS);

Fig. 2. Frequent sequence enumeration algorithm.

Fig. 2 shows the algorithm to enumerate the complete set of frequent sequences, which is
similar to the pseudo-projection-based PrefixSpan algorithm. It recursively calls subroutine
Frequent-sequences (Sp_SDB, Sp, min_sup, FS): for a certain prefix $S_p$, if it is non-empty,
output it (lines 4 and 5); scan the projected database Sp_SDB once to find the locally frequent
items (line 6); choose each frequent item $i$ in lexicographical ordering to grow $S_p$ and get
a new prefix $S_p^i$ (line 10); and scan Sp_SDB once again to build the pseudo-projected
database for each new prefix $S_p^i$ (line 11). Furthermore, one can easily figure out that the
order of the frequent sequence enumeration is consistent with
the depth-first traversal of the frequent sequence tree.

3.2 The BI-Directional Extension closure checking scheme

The frequent sequence enumeration algorithm in Fig. 2 can only be used to mine the complete set
of frequent sequences instead of the frequent closed ones. Usually upon getting a new frequent
prefix sequence, we need to do pattern closure checking in order to assure that it is really
closed, i.e., it cannot be absorbed by one of its supersequences with the same support.
Currently most of the frequent closed pattern (both itemset and sequence) mining algorithms,
like CLOSET [16], CHARM [25], TFP [8] and CloSpan [22], need to maintain the set of already
mined frequent closed patterns (or just candidates) in memory and do (1) sub-pattern checking,
which checks if a newly found pattern can be absorbed by an already mined frequent closed
pattern (or candidate); and/or (2) super-pattern checking, which checks whether the newly found
pattern can absorb some already mined closed pattern candidates.

A properly designed closed itemset mining algorithm like CHARM only needs to do the sub-pattern
checking, which is usually less expensive than super-pattern checking. However, a typical
frequent closed sequence mining algorithm usually needs to do both sub-pattern and super-pattern
checking. Here we can use our running example to explain it. Let the lexicographical ordering
be $B \prec A \prec C$. Upon getting a new frequent sequence ABC:4, another frequent sequence
BC:4 has already been mined, which can be absorbed by ABC:4. Therefore, we need to do
super-pattern checking in order to remove the previously mined but non-closed frequent
sequences. In addition, when we get a new prefix sequence C:4, another frequent sequence ABC:4
has already been mined, which can absorb C:4. As a result, we also need to do sub-pattern
checking in order to remove the newly found but non-closed sequence.

It is easy to see that because the above pattern closure checking scheme adopted by the
previous algorithms needs to maintain the already mined frequent closed patterns (or
candidates) in memory, algorithms like CloSpan may consume much memory, and the search space
for pattern closure checking will be huge when there exist a large number of frequent closed
sequences. Some closed itemset mining algorithms such as TFP [8] try to save space by storing
the closed itemset candidates in a compact prefix itemset-tree structure and reduce the search
space by applying a two-level hash-index. CloSpan adopts similar techniques; however, because a
prefix sequence tree is usually less compact than an itemset tree, it still consumes much
memory.

To avoid maintaining the set of already mined closed sequence candidates in memory, we have
designed a new sequence closure checking scheme, called BI-Directional Extension checking.
According to the definition of a frequent closed sequence, if an $n$-sequence,
$S = e_1 e_2 \ldots e_n$, is non-closed, there must exist at least one event, $e'$, which can
be used to extend sequence $S$ to get a new sequence, $S'$, which has the same support. The
sequence $S$ can be extended in three ways: (1) $S' = e_1 e_2 \ldots e_n e'$ and
$sup^{SDB}(S') = sup^{SDB}(S)$; (2) $\exists i$ ($1 \le i < n$),
$S' = e_1 e_2 \ldots e_i e' e_{i+1} \ldots e_n$ and $sup^{SDB}(S') = sup^{SDB}(S)$; and (3)
$S' = e' e_1 e_2 \ldots e_n$ and $sup^{SDB}(S') = sup^{SDB}(S)$. In the first case, event $e'$
occurs after event $e_n$; we call $e'$ a forward-extension event (or item) and $S'$ a
forward-extension sequence w.r.t. $S$. In the second and third cases, event $e'$ occurs before
event $e_n$; we call $e'$ a backward-extension event (or item) and $S'$ a backward-extension
sequence w.r.t. $S$. After giving the above definition, the following theorem will be evident
according to the definition of a frequent closed sequence.

Theorem 1 (BI-Directional Extension closure checking) If there exists no forward-extension
event nor backward-extension event w.r.t. a prefix sequence $S_p$, $S_p$ must be a closed
sequence; otherwise, $S_p$ must be non-closed.

From Theorem 1 we know that to judge if a frequent prefix sequence is closed, we need to check
whether there is any forward-extension event or backward-extension event. It is relatively easy
to find the forward-extension events according to the following lemma.

Lemma 1 (Forward-extension event checking) For a prefix sequence $S_p$, its complete set of
forward-extension events is equivalent to the set of its locally frequent items whose supports
are equal to $sup^{SDB}(S_p)$.
Proof. The locally frequent items are found by scanning the projected database w.r.t. $S_p$,
which consists of all the projected sequences. Since each event in a projected sequence always
occurs after the prefix sequence $S_p$, if it occurs in every projected sequence, it forms a
forward-extension event. Also, any event occurring after the first instance of $S_p$ must be
included in the projected database, which means the complete set of forward-extension events
can be found by scanning the projected database w.r.t. $S_p$.

Definition 4 (Last instance of a prefix sequence) Given an input sequence $S$ which contains a
prefix $i$-sequence $e_1 e_2 \ldots e_i$, the last instance of the prefix sequence
$e_1 e_2 \ldots e_i$ in $S$ is the subsequence from the beginning of $S$ to the last appearance
of item $e_i$ in $S$. For example, the last instance of the prefix sequence AB in sequence
ABBCA is ABB.

Definition 5 (The i-th last-in-last appearance w.r.t. a prefix sequence) For an input sequence
$S$ containing a prefix
$n$-sequence, $S_p = e_1 e_2 \ldots e_n$, the i-th last-in-last appearance w.r.t. the prefix
$S_p$ in $S$ is denoted as $LL_i$ and defined recursively as: (1) if $i = n$, it is the last
appearance of $e_i$ in the last instance of the prefix $S_p$ in $S$; (2) if $1 \le i < n$, it
is the last appearance of $e_i$ in the last instance of the prefix $S_p$ in $S$ while $LL_i$
must appear before $LL_{i+1}$. For example, if $S$ = CAABC and $S_p$ = AB, the 1st last-in-last
appearance w.r.t. prefix $S_p$ in $S$ is the second A in $S$.

Definition 6 (The i-th maximum period of a prefix sequence) For an input sequence $S$
containing a prefix $n$-sequence $S_p = e_1 e_2 \ldots e_n$, the i-th maximum period of the
prefix $S_p$ in $S$ is defined as: (1) if $1 < i \le n$, it is the piece of sequence between
the end of the first instance of prefix $e_1 e_2 \ldots e_{i-1}$ in $S$ and the i-th
last-in-last appearance w.r.t. prefix $S_p$; (2) if $i = 1$, it is the piece of sequence in $S$
located before the 1st last-in-last appearance w.r.t. prefix $S_p$. For example, if $S$ = ABCB
and the prefix sequence $S_p$ = AB, the 2nd maximum period of prefix $S_p$ in $S$ is BC, while
the 1st maximum period of prefix $S_p$ in $S$ is $\emptyset$.

Lemma 2 (Backward-extension event checking) Let the prefix sequence be an $n$-sequence,
$S_p = e_1 e_2 \ldots e_n$. If $\exists i$ ($1 \le i \le n$) and there exists an item $e'$
which appears in each of the i-th maximum periods of the prefix $S_p$ in $SDB$, $e'$ is a
backward-extension event (or item) w.r.t. prefix $S_p$. Otherwise, for any $i$
($1 \le i \le n$), if we cannot find any item which appears in each of the i-th maximum periods
of the prefix $S_p$ in $SDB$, there will be no backward-extension event w.r.t. prefix $S_p$.
Proof. From the definition of the i-th maximum period of a prefix sequence, we know that if
item $e'$ appears in each of the i-th maximum periods of the prefix sequence $S_p$, we can get
a new sequence $S_p' = e_1 e_2 \ldots e_{i-1} e' e_i \ldots e_n$ ($1 < i \le n$) or
$S_p' = e' e_1 e_2 \ldots e_n$ ($i = 1$), which satisfies $S_p \sqsubset S_p'$ and
$sup^{SDB}(S_p') = sup^{SDB}(S_p)$. Therefore, $e'$ is a backward-extension item w.r.t. prefix
$S_p$ and $S_p$ is not closed.
In addition, assume there exists a sequence $S_p' = e' e_1 e_2 \ldots e_n$ (for $i = 1$) or
$S_p' = e_1 e_2 \ldots e_{i-1} e' e_i \ldots e_n$ (for $1 < i \le n$), which can absorb $S_p$,
that is, item $e'$ is a backward-extension item w.r.t. $S_p$. In each sequence containing
$S_p$, item $e'$ must appear after the first instance of prefix $e_1 e_2 \ldots e_{i-1}$ (for
$1 < i \le n$) and before the i-th last-in-last appearance w.r.t. $S_p$, which means item $e'$
must appear in the i-th maximum period of $S_p$. As a result, for any $i$ ($1 \le i \le n$), if
we cannot find any item which appears in each of the i-th maximum periods of the prefix $S_p$
in $SDB$, there will be no backward-extension event w.r.t. prefix $S_p$.

We use an example to illustrate the sequence closure checking scheme in BIDE. First, we assume
$S_p$ = AC:4. It is easy to find that item B appears in each of the 2nd maximum periods of
$S_p$. As a result, AC:4 is not closed. In contrast, let $S_p$ = ABC:4: we cannot find any
backward-extension item for it. Also there is no forward-extension item for it; therefore ABC:4
is a frequent closed sequence.

3.3 The BackScan search space pruning method

Upon finding a new frequent sequence by the frequent sequence enumeration algorithm, we can use
the above BI-Directional Extension closure checking scheme to check if it is closed in order to
generate the complete set of non-redundant frequent sequences. Although the closure checking
scheme can lead to a more compact result set, it cannot improve the mining efficiency. As
Fig. 1 shows, the whole subtree under node B:4 contains no frequent closed sequences, which
means there is no hope to grow prefix B:4 to generate any frequent closed sequences. If we can
detect such unpromising prefix sequences and stop growing them, the search space will be
reduced.

Search space pruning in frequent closed sequence mining is trickier than that in frequent
closed itemset mining. Usually a depth-first search based closed itemset mining algorithm like
CLOSET can stop growing a prefix itemset once it finds that this itemset can be absorbed by an
already mined closed itemset. However, a closed sequence mining algorithm cannot do so. For
example, assume the lexicographical ordering in our running example is $A \prec B \prec C$, and
the current prefix sequence is C:4, which can be absorbed by an already mined sequence ABC:4,
but we cannot stop growing C:4. As Fig. 1 shows, we can still generate three frequent closed
sequences (i.e., CA:3, CABC:2, and CB:3) by growing prefix C:4. This complicated situation is
caused by the multiple instances of the same item in a sequence and the temporal ordering among
the events in a sequence.

Definition 7 (The i-th last-in-first appearance w.r.t. a prefix sequence) For an input sequence
$S$ containing a prefix $n$-sequence $S_p = e_1 e_2 \ldots e_n$, the i-th last-in-first
appearance w.r.t. the prefix $S_p$ in $S$ is denoted as $LF_i$ and defined recursively as: (1)
if $i = n$, it is the last appearance of $e_i$ in the first instance of the prefix $S_p$ in
$S$; (2) if $1 \le i < n$, it is the last appearance of $e_i$ in the first instance of the
prefix $S_p$ in $S$ while $LF_i$ must appear before $LF_{i+1}$. For example, if $S$ = CAABC and
$S_p$ = CA, the 2nd last-in-first appearance w.r.t. prefix $S_p$ in $S$ is the first A in $S$.

Definition 8 (The i-th semi-maximum period of a prefix sequence) For an input sequence $S$
containing a prefix $n$-sequence $S_p = e_1 e_2 \ldots e_n$, the i-th semi-maximum period of
the prefix $S_p$ in $S$ is defined as: (1) if $1 < i \le n$, it is the piece of sequence
between the end of the first instance of prefix $e_1 e_2 \ldots e_{i-1}$ in $S$ and the i-th
last-in-first appearance w.r.t. prefix $S_p$; (2) if $i = 1$, it is the piece of sequence in
$S$ located before the 1st last-in-first appearance w.r.t. prefix $S_p$. For example, if $S$ = ABCB and the prefix sequence $S_p$ = AC, the 2nd semi-maximum period of prefix AC in $S$ is B, while the 1st semi-maximum period of prefix AC in $S$ is $\emptyset$.

Theorem 2 (BackScan search space pruning) Let the prefix sequence be an $n$-sequence, $S_p = e_1 e_2 \ldots e_n$. If $\exists i$ ($1 \le i \le n$) and there exists an item $e'$ which appears in each of the $i$-th semi-maximum periods of the prefix $S_p$ in SDB, we can safely stop growing prefix $S_p$.

Proof. Because item $e'$ appears in each of the $i$-th semi-maximum periods of the prefix $S_p$ in SDB, we can get a new prefix $S'_p = e_1 e_2 \ldots e_{i-1} e' e_i \ldots e_n$ ($1 < i \le n$) or $S'_p = e' e_1 e_2 \ldots e_n$ ($i = 1$), and both $(S_p \sqsubset S'_p)$ and $(sup^{SDB}(S_p) = sup^{SDB}(S'_p))$ hold. Any locally frequent item $e$ w.r.t. prefix $S_p$ is also a locally frequent item w.r.t. $S'_p$; in the meantime, $(\langle S_p, e \rangle \sqsubset \langle S'_p, e \rangle)$ and $(sup^{SDB}(\langle S_p, e \rangle) = sup^{SDB}(\langle S'_p, e \rangle))$ hold. This means there is no hope to mine frequent closed sequences with prefix $S_p$.

For example, if the prefix sequence is $S_p$ = B:4, there is an item A which appears in each of the 1st semi-maximum periods of prefix $S_p$ in SDB, so we can safely stop mining frequent closed sequences with prefix B:4. In contrast, if $S_p$ = C:4, we cannot find any item which appears in each of the 1st semi-maximum periods of prefix $S_p$ in SDB. As a result, we cannot stop growing C:4.

Compared with some pruning methods used in previously developed algorithms [16, 25, 22], which are based on the relationships among the newly found frequent pattern and some already mined closed patterns (or just candidates), the BackScan pruning method is more aggressive and thus more effective. Consider another possible lexicographical ordering in our running example, B $\prec$ A $\prec$ C, which means we first mine the closed sequences with prefix B. According to Theorem 2, we can safely prune prefix B and directly mine frequent closed sequences with prefix A. However, because there are no other already mined frequent closed sequences (or candidates) for checking, algorithms based on the candidate-maintenance-and-test paradigm will still try to use B as a prefix to grow frequent closed sequences.

3.4 The ScanSkip optimization technique

The above closure checking scheme needs to scan backward a set of maximum periods w.r.t. a certain prefix; this is one of the most expensive operations in BIDE. Assume the current prefix sequence is $S_p = e_1 e_2 \ldots e_n$ with a support $SUP_{S_p}$. For any $i$ ($1 \le i \le n$), let $\{MP_1^i, MP_2^i, \ldots, MP_{SUP_{S_p}}^i\}$ be the $SUP_{S_p}$ $i$-th maximum periods w.r.t. $S_p$. By scanning the $k$-th $i$-th ($1 \le i \le n$) maximum period we can get a set of unique items, denoted as $SI_k^i$. The set of items which appear in each of the $i$-th maximum periods w.r.t. $S_p$ equals $SI_1^i \cap SI_2^i \cap \ldots \cap SI_{SUP_{S_p}}^i$ and is denoted as $SI^i$.

In the sequence closure checking scheme, we only care about whether $SI^i$ is empty or not. If we find that, after scanning the first $m$ $i$-th maximum periods, the intersection of the sets of items appearing in each of them has become empty, we do not need to scan the remaining ($SUP_{S_p} - m$) $i$-th maximum periods, because we already know $SI^i$ must be empty. We call this optimization the ScanSkip technique. Similarly, in the BackScan search space pruning method, BIDE needs to scan backward a number of semi-maximum periods w.r.t. a prefix sequence. The ScanSkip technique can also be used to speed up the BackScan search space pruning.

Here we use an example to illustrate the usefulness of the ScanSkip optimization technique. In our running example shown in Table 1, assume the current prefix sequence is ABC:4. The set of the 3rd maximum periods w.r.t. ABC:4 is {$\emptyset$, $\emptyset$, $\emptyset$, B}. After scanning the first 3rd maximum period we find that it contains no item, so we know that no item appears in each of the four 3rd maximum periods w.r.t. prefix ABC:4 without scanning the last three 3rd maximum periods. The set of the 2nd maximum periods w.r.t. ABC:4 is {A, $\emptyset$, $\emptyset$, B}. After scanning the first two 2nd maximum periods, we already know that no item appears in each of the four 2nd maximum periods w.r.t. prefix ABC:4. Similarly, the set of the 1st maximum periods w.r.t. ABC:4 is {CA, $\emptyset$, C, $\emptyset$}, so we can skip the scanning of the last two 1st maximum periods w.r.t. ABC:4.

3.5 The BIDE algorithm

Fig. 3 shows the BIDE algorithm. It first scans the database once to find the frequent 1-sequences (line 2), builds a pseudo projected database for each frequent 1-sequence (lines 3 and 4), treats each frequent 1-sequence as a prefix and uses the BackScan pruning method to check if it can be pruned (line 6); if not, it computes the number of backward-extension-items (line 7) and calls subroutine bide(Sp_SDB, Sp, min_sup, BEI, FCS) (line 8). Subroutine bide(Sp_SDB, Sp, min_sup, BEI, FCS) recursively calls itself and works as follows: for prefix Sp, scan its projected database Sp_SDB once to find its locally frequent items (line 10), compute the number of forward-extension-items (line 11), if there is no backward-extension-item nor forward-extension-item, output Sp as a frequent closed sequence (lines 12 and 13), grow Sp with each locally frequent item in lexicographical ordering to get a new prefix (line 15) and build the pseudo projected database for the new prefix (line 16), for each new prefix, first check if it can be

pruned (line 18); if not, compute the number of backward-extension-items (line 19) and call itself (line 20). We need to point out that both subroutines BackScan() and backward extension check() use the ScanSkip technique to speed up the mining process.

BIDE (SDB, min_sup, FCS)
Input: an input sequence database SDB, a minimum support threshold min_sup
Output: the complete set of frequent closed sequences, FCS
 1: FCS = ∅;
 2: F1 = frequent 1-sequences (SDB, min_sup);
 3: for (each 1-sequence f1 in F1) do
 4:     SDB^f1 = pseudo projected database (SDB);
 5: for (each f1 in F1) do
 6:     if (!BackScan(f1, SDB^f1))
 7:         BEI = backward extension check (f1, SDB^f1);
 8:         call bide(SDB^f1, f1, min_sup, BEI, FCS);
 9: return FCS;

bide (Sp_SDB, Sp, min_sup, BEI, FCS)
Input: a projected sequence database Sp_SDB, a prefix sequence Sp, a minimum support threshold min_sup, and the number of backward extension items BEI
Output: the current set of frequent closed sequences, FCS
10: LFI = locally frequent items (Sp_SDB);
11: FEI = |{z in LFI | z.sup = sup^SDB(Sp)}|;
12: if ((BEI + FEI) == 0)
13:     FCS = FCS ∪ {Sp};
14: for (each i in LFI) do
15:     Sp^i = <Sp, i>;
16:     SDB^Spi = pseudo projected database (Sp_SDB, Sp^i);
17: for (each i in LFI) do
18:     if (!BackScan(Sp^i, SDB^Spi))
19:         BEI = backward extension check (Sp^i, SDB^Spi);
20:         call bide(SDB^Spi, Sp^i, min_sup, BEI, FCS);

Fig. 3. BIDE algorithm.

3.6 Further discussions

Unlike most of the previous closed pattern mining algorithms, BIDE can prune search space and check if a pattern is closed without maintenance of any already mined patterns (or just candidates). It is very space efficient and its worst case memory usage can be calculated even before the mining. Assume the input sequence database SDB contains $s\_num$ input sequences (i.e., $|SDB| = s\_num$) and in total $i\_num$ distinct items; on average each sequence contains $avg\_l$ events, and the longest sequence contains $max\_l$ events. The worst case memory usage in BIDE is

$O((s\_num \times avg\_l) + (max\_l \times s\_num \times (2 \times max\_l)) + (max\_l \times i\_num))$.

($s\_num \times avg\_l$) represents the input sequence database. ($max\_l \times s\_num \times (2 \times max\_l)$) represents the worst case of space usage for building pseudo projected databases: the longest frequent closed sequence has a length no greater than $max\_l$, which means there are at most $max\_l$ pseudo projected databases which can co-exist at a certain time, and each pseudo projected database corresponds to a prefix with a length no greater than $max\_l$ and a support no greater than $s\_num$. Given a prefix in any input sequence, we need to record at most ($2 \times max\_l$) positions in order to locate all the last-in-last maximum periods (in our implementation, we only need to record ($max\_l + 1$) positions). ($max\_l \times i\_num$) represents the upper bound of the space cost in computing the backward-extension events.

The above BIDE algorithm can only mine frequent closed sequences of single items, but it is rather easy to extend it to mine frequent closed sequences of subsets of items, that is, each event may contain a set of un-ordered items. As [4] has shown, there are two kinds of extensions to grow a certain prefix sequence: sequence-extensions (abbreviated S-extension) and itemset-extensions (abbreviated I-extension). A sequence extension w.r.t. a prefix is generated by adding a new event consisting of a single item to the prefix sequence, while an itemset extension w.r.t. a prefix is a sequence generated by adding an item to any one event of the prefix sequence. Revising the frequent sequence enumeration algorithm shown in Fig. 2 in order to mine frequent sequences of subsets of items is straightforward, and the result is very similar to the pseudo-projection based PrefixSpan algorithm [17]. To check if a frequent sequence of subsets of items is closed, we need to figure out whether there exists any S-extension item or I-extension item which has the same support as the prefix sequence. The BI-Directional Extension closure checking scheme described in section 3.2 and the BackScan pruning method have shown how to compute the backward (or forward) S-extension items from the maximum periods (or semi-maximum periods), while it is a relatively straightforward process to extend it to compute the backward (or forward) I-extension items under the same framework. Due to limited space, we will leave it to the interested readers.

4 Performance evaluation

In this section, we will present our thorough experimental results in order to verify the following claims: (1) A properly designed frequent closed sequence mining algorithm like BIDE or CloSpan can significantly outperform two efficient frequent sequence mining algorithms, PrefixSpan and SPADE, when the support threshold is low; (2) BIDE consumes much less memory and can be an order of magnitude faster than CloSpan; (3) BIDE has linear scalability in terms of base size; and (4) The BackScan pruning method and the ScanSkip technique are very effective in enhancing the performance.
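As a concrete illustration of the ScanSkip technique of Section 3.4, which claim (4) evaluates, the following Python sketch intersects the item sets of a list of maximum periods and stops as soon as the running intersection becomes empty. It is only a minimal sketch under stated assumptions: the function name `common_items_with_scanskip` and the encoding of each period as a string of single-character items (with the empty string for an empty period) are ours, not part of the BIDE implementation.

```python
from typing import Iterable, Optional, Set, Tuple


def common_items_with_scanskip(periods: Iterable[str]) -> Tuple[Set[str], int]:
    """Intersect the item sets of the given periods, stopping early.

    Hypothetical helper: each period is a string of single-character
    items ("" stands for an empty period). Returns the set of items that
    appear in every period and the number of periods actually scanned.
    """
    common: Optional[Set[str]] = None  # running intersection SI^i
    scanned = 0
    for period in periods:
        scanned += 1
        items = set(period)  # SI_k^i: unique items of the k-th period
        common = items if common is None else common & items
        if not common:
            # ScanSkip: the intersection is already empty, so the
            # remaining periods cannot change the answer.
            break
    return (common or set()), scanned


# Maximum periods w.r.t. prefix ABC:4 from the running example.
print(common_items_with_scanskip(["", "", "", "B"]))    # 3rd maximum periods
print(common_items_with_scanskip(["A", "", "", "B"]))   # 2nd maximum periods
print(common_items_with_scanskip(["CA", "", "C", ""]))  # 1st maximum periods
```

On the three sets of maximum periods given for prefix ABC:4, the sketch scans one, two, and two periods respectively before stopping, which matches the number of periods the text says can be skipped.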

4.1 Test environment and datasets

All of our experiments were performed on an IBM ThinkPad R31 with 384MB memory and Windows XP Professional installed. In the experiments we compared BIDE with two frequent sequence mining algorithms, PrefixSpan and SPADE, and one frequent closed sequence mining algorithm, CloSpan. The source code of SPADE and PrefixSpan and the executable code of CloSpan were provided by their corresponding authors. We ran the four algorithms in the same Cygwin environment with the output turned off.

Because synthetic datasets have far different characteristics from real-world ones, in our experiments we only used real datasets to do the tests. However, we carefully chose real datasets which cover a range of distribution characteristics: sparse, a little dense, and very dense.

The first dataset, Gazelle, is very sparse, but it contains some very long frequent closed sequences with a low support threshold. This dataset was originally provided by the Blue Martini company and has been used in evaluating both frequent itemset mining algorithms [25, 8] and frequent sequence mining algorithms [22]. It contains the Web click-stream data of 29369 customers. For each customer there is a corresponding series of page views, and we treat each page view as an event. This dataset contains 29369 sequences (i.e., customers), 87546 events (i.e., page views), and 1423 distinct items (i.e., web pages). More detailed information about this dataset can be found in [11].

The second dataset, Snake, is a little dense and can generate a lot of frequent closed sequences with a medium support threshold like 60%. It is a bio-dataset which contains 175 Toxin-Snake protein sequences and 20 unique items. This Toxin-Snake dataset is about a family of eukaryotic and viral DNA binding proteins and has been used in evaluating a pattern discovery task [10].

We also found a very dense dataset, Pi, from which a huge number of frequent closed sequences can be mined even with a very high support threshold like 90%. This dataset is also a bio-dataset, which contains 190 protein sequences and 21 distinct items. This dataset has been used to assess the reliability of functional inheritance [3]. The characteristics of these datasets are shown in Table 2.

Dataset   # seq.   # items   avg. seq. len.   max. seq. len.
Gazelle   29369    1423      3                651
Snake     175      20        67               121
Pi        190      21        258              757

Table 2. Dataset Characteristics.

4.2 Experimental results

Closed vs. all frequent sequence mining. A previous study [22] has shown that a properly designed closed sequential pattern mining algorithm like CloSpan can outperform an efficient sequential pattern mining algorithm, PrefixSpan, by more than an order of magnitude. As a result, it would be unfair to compare BIDE with all-frequent-sequence mining algorithms. Here we only use one dataset to show a previously unrevealed fact: with a high support threshold, a well-designed closed sequence mining algorithm may lose to some all-frequent-sequence mining algorithms. Later we will focus on the comparison of BIDE with CloSpan.

[Fig. 4. Comparison among BIDE, CloSpan, SPADE and PrefixSpan (Gazelle dataset, runtime). x-axis: support threshold (in %); y-axis: runtime in seconds.]

[Fig. 5. Comparison among BIDE, CloSpan, SPADE and PrefixSpan (Gazelle dataset, memory). x-axis: support threshold (in %); y-axis: memory in MBs.]

Fig. 4 and Fig. 5 depict the comparison results among SPADE, PrefixSpan, CloSpan and BIDE for dataset Gazelle. From Fig. 4 we can see that when the support is greater than 0.03%, both PrefixSpan and SPADE outperform CloSpan and BIDE, but once we continue to lower the support threshold (e.g., to 0.02%), the two closed sequence mining algorithms outperform the two all-frequent-sequence mining algorithms by a wide margin, because the latter generate an explosive number of frequent sequences. It also shows that BIDE always outperforms CloSpan across the tested support thresholds. From Fig. 5 we know that both PrefixSpan and BIDE use a rather stable amount of memory, which can be more than an order of magnitude less than that used by SPADE and CloSpan.

BIDE vs. CloSpan. We first compared BIDE with CloSpan using the Gazelle dataset. Fig. 6 depicts the distribution of the number of frequent closed sequences against the length of the frequent closed sequences for support thresholds varying from 0.02% to 0.01%. From Fig. 6 we can see that many long closed sequences can be discovered for this sparse dataset. For example, at support 0.01%, the longest frequent closed sequence has a length of 127. Fig. 7 and Fig. 8 demonstrate the runtime and memory usage comparison

between BIDE and CloSpan. We can see that BIDE always runs much faster than CloSpan and consumes much less memory. For example, at support 0.01%, BIDE can be more than an order of magnitude faster than CloSpan, while using over an order of magnitude less memory.

[Fig. 6. Dataset Gazelle (distribution). x-axis: length of frequent closed sequences; y-axis: number of frequent closed sequences; series: min_sup = 0.01%, 0.0136%, 0.017%, 0.02%.]

[Fig. 7. Dataset Gazelle (runtime). x-axis: support threshold (in %); y-axis: runtime in seconds; series: CloSpan, BIDE.]

[Fig. 8. Dataset Gazelle (memory). x-axis: support threshold (in %); y-axis: memory in MBs; series: CloSpan, BIDE.]

[Fig. 9. Dataset Snake (distribution). x-axis: length of frequent closed sequences; y-axis: number of frequent closed sequences; series: min_sup = 50%, 55%, 60%, 65%, 70%.]

[Fig. 10. Dataset Snake (runtime). x-axis: support threshold (in %); y-axis: runtime in seconds; series: CloSpan, BIDE.]

[Fig. 11. Dataset Snake (memory). x-axis: support threshold (in %); y-axis: memory in MBs; series: CloSpan, BIDE.]

Fig. 9 depicts the distribution of the number of frequent closed sequences against the length of the frequent closed sequences for dataset Snake. We can see that it is a somewhat dense dataset: a lot of closed sequences with a medium length from 6 to 12 can be mined with moderately high support thresholds from 50% to 70%. Fig. 10 shows that at a high support threshold, BIDE is several times slower than CloSpan, but once the support is no greater than 60%, BIDE significantly outperforms CloSpan. For example, at support 50%, BIDE is about 40 times faster than CloSpan. From Fig. 11 we can see that BIDE uses more than 2 orders of magnitude less memory than CloSpan in almost all the cases. Although this dataset contains only 175 sequences, which is rather small, CloSpan needs to keep track of the already mined frequent closed sequence candidates and can therefore consume more than 300MB of memory. For example, at support 50%, BIDE only uses about 2MB of memory, while CloSpan uses about 328MB.

[Fig. 12. Dataset Pi (distribution). x-axis: length of frequent closed sequences; y-axis: number of frequent closed sequences; series: min_sup = 88%, 90%, 92%, 94%, 96%.]

We also use the very dense dataset, Pi, to compare BIDE with CloSpan. From Fig. 12 we can see that even with a very high support like 90%, there can be a large number of short frequent closed sequences with a length less than 10. Fig. 13 shows that with a support higher than 90%, the two algorithms have very similar performance, with CloSpan only a little faster than BIDE, but once the support is no greater than 88%, BIDE outperforms CloSpan by a wide margin. For example, at support 88%, BIDE is more than 6 times faster than CloSpan. From Fig. 14 we know that BIDE always uses much less memory than CloSpan. At support 88%, BIDE

Proceedings of the 20th International Conference on Data Engineering (ICDE04)


1063-6382/04 $ 20.00 2004 IEEE
consumes over 2 orders of magnitude less memory than CloSpan.

[Fig. 13. Dataset Pi (runtime). x-axis: support threshold (in %); y-axis: runtime in seconds; series: CloSpan, BIDE.]

[Fig. 14. Dataset Pi (memory). x-axis: support threshold (in %); y-axis: memory in MBs; series: CloSpan, BIDE.]

Scalability test. We tested BIDE's scalability in both runtime and memory usage using all three datasets in terms of the base size. In Fig. 15 and Fig. 16, we fixed the support threshold at a certain constant for each dataset and replicated the sequences from 2 to 16 times. Although these 3 datasets have rather different features, BIDE shows linear scalability in both runtime and memory usage against the increasing number of sequences for these datasets. For example, for dataset Snake with a given support of 60%, its runtime increases from 807 seconds to 11906 seconds and its memory usage increases from 1.883MB to 21.784MB when the number of sequences increases 16 times.

[Fig. 15. Scalability test (runtime). x-axis: replication factor; y-axis: runtime in seconds; series: Gazelle (min_sup = 0.000136), Snake (min_sup = 0.60), Pi (min_sup = 0.92).]

[Fig. 16. Scalability test (memory). x-axis: replication factor; y-axis: memory in MBs; series: Gazelle (min_sup = 0.000136), Snake (min_sup = 0.60), Pi (min_sup = 0.92).]

Effectiveness of the optimization techniques. Fig. 17 tests the effectiveness of the BackScan pruning method. We can see that the BackScan pruning method is very effective in pruning search space and speeding up the mining process: for the Gazelle dataset with the support threshold at 0.02%, it can give several orders of magnitude enhancement to the performance. The effectiveness of the BackScan pruning method assures that BIDE, which is based only on this single pruning method, can in most cases significantly outperform the CloSpan algorithm, which adopts several pruning methods.

Fig. 18 shows the effectiveness of the ScanSkip optimization technique for the Snake dataset. This technique always makes BIDE run faster across the tested support thresholds. All the above performance results for BIDE use both the BackScan pruning method and the ScanSkip technique.

[Fig. 17. Effectiveness of BackScan pruning (Gazelle dataset). x-axis: support threshold (in %); y-axis: runtime in seconds; series: without BackScan pruning, with BackScan pruning.]

[Fig. 18. Effectiveness of ScanSkip optimization (Snake dataset). x-axis: support threshold (in %); y-axis: runtime in seconds; series: without ScanSkip, with ScanSkip.]

5 Conclusions

Many studies have elaborated that closed pattern mining has the same expressive power as that of all frequent pattern mining yet leads to a more compact result set and significantly better efficiency. Our study showed that this is usually true when the number of frequent patterns is prohibitively huge, in which case the number of frequent closed patterns is also likely very large. Unfortunately, most of the previously developed closed pattern mining algorithms rely on the historical set of frequent closed patterns (or candidates) to check if a newly found frequent pattern is closed or if it can invalidate some already mined closed candidates. Because the set of already mined frequent closed patterns keeps growing during the mining process, not only will it consume more memory, but it will also lead to inefficiency due to the growing search space for pattern closure checking.

In this paper, we proposed BIDE, a novel algorithm for mining frequent closed sequences. It avoids the curse of the candidate maintenance-and-test paradigm, prunes the search space more deeply and checks the pattern closure in a more efficient way while consuming much less memory in contrast to the previously developed closed pattern mining algorithms. It does not need to maintain the set of historic closed patterns, thus it scales very well in the number of frequent closed patterns. BIDE adopts a strict depth-first search order and can output the frequent closed patterns in an online fashion. An extensive set of experiments on several real datasets with different distribution features have shown the effectiveness of the algorithm design: BIDE consumes order(s) of magnitude less memory and can be

over an order of magnitude faster than the CloSpan algorithm. It also has linear scalability in terms of the number of sequences in the database. Many studies have shown that constraints are essential for many sequential pattern mining applications. In the future, we plan to study how to push constraints (like gap constraint) into BIDE in order to mine a more compact and specific result set.

Acknowledgment

We are grateful to Dr. Mohammed Zaki for providing us the source code of SPADE. Thanks also go to Dr. George Karypis for providing us the source of some protein sequence data, and Xifeng Yan for providing us the executable code of CloSpan and some helpful discussions.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, Santiago, Chile, Sept. 1994.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In ICDE'95, Taipei, Taiwan, Mar. 1995.
[3] P. Aloy, E. Querol, F.X. Aviles and M.J.E. Sternberg. Automated Structure-based Prediction of Functional Sites in Proteins: Applications to Assessing the Validity of Inheriting Protein Function From Homology in Genome Annotation and to Protein Docking. Journal of Molecular Biology, 311, 2002.
[4] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential PAttern Mining using a Bitmap Representation. In SIGKDD'02, Edmonton, Canada, July 2002.
[5] C. Bettini, X. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21(1):32-38, 1998.
[6] M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential PAttern Mining with regular expression constraints. In VLDB'99, San Francisco, CA, Sept. 1999.
[7] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In ICDE'99, Sydney, Australia, Mar. 1999.
[8] J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. In ICDM'02, Maebashi, Japan, Dec. 2002.
[9] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. In SIGKDD'00, Boston, MA, Aug. 2000.
[10] I. Jonassen, J.F. Collins, and D.G. Higgins. Finding flexible patterns in unaligned protein sequences. Protein Science, 4(8), 1995.
[11] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-cup 2000 organizers' report: Peeling the Onion. SIGKDD Explorations, 2, 2000.
[12] H. Mannila, H. Toivonen, and A.I. Verkamo. Discovering frequent episodes in sequences. In SIGKDD'95, Montreal, Canada, Aug. 1995.
[13] F. Masseglia, F. Cathala, and P. Poncelet. The psp approach for mining sequential patterns. In PKDD'98, Nantes, France, Sept. 1998.
[14] B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In ICDE'98, Orlando, FL, Feb. 1998.
[15] N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal. Discovering frequent closed itemsets for association rules. In ICDT'99, Jerusalem, Israel, Jan. 1999.
[16] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In DMKD'01 workshop, Dallas, TX, May 2001.
[17] J. Pei, J. Han, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In ICDE'01, Heidelberg, Germany, April 2001.
[18] J. Pei, J. Han, and W. Wang. Constraint-based sequential pattern mining in large databases. In CIKM'02, McLean, VA, Nov. 2002.
[19] M. Seno and G. Karypis. SLPMiner: An algorithm for finding frequent sequential patterns using length-decreasing support constraint. In ICDM'02, Maebashi, Japan, Dec. 2002.
[20] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In EDBT'96, Avignon, France, Mar. 1996.
[21] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. In KDD'03, Washington, DC, Aug. 2003.
[22] X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Databases. In SDM'03, San Francisco, CA, May 2003.
[23] J. Yang, P.S. Yu, W. Wang and J. Han. Mining long sequential patterns in a noisy environment. In SIGMOD'02, Madison, WI, June 2002.
[24] M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 42:31-60, Kluwer Academic Publishers, 2001.
[25] M. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In SDM'02, Arlington, VA, April 2002.

