Database mining is motivated by the decision support problem faced by most large retail organizations. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of the transaction date and the items bought in the transaction. Very often, data records also contain customer-id, particularly when the purchase has been made using a credit card or a frequent-buyer card. Catalog companies also collect such data using the orders they receive.

We introduce the problem of mining sequential patterns over this data. An example of such a pattern is that customers typically rent "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi".

We denote an itemset i by (i1 i2 ... im), where ij is an item. We denote a sequence s by ⟨s1 s2 ... sn⟩, where sj is an itemset. A sequence ⟨a1 a2 ... an⟩ is contained in another sequence ⟨b1 b2 ... bm⟩ if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin. For example, the sequence ⟨(3) (4 5) (8)⟩ is contained in ⟨(7) (3 8) (9) (4 5 6) (8)⟩, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). However, the sequence ⟨(3) (5)⟩ is not contained in ⟨(3 5)⟩ (and vice versa). The former represents items 3 and 5 being bought one after the other, while the latter represents items 3 and 5 being bought together. In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.
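This containment test can be sketched in a few lines of Python. This is our illustrative implementation, not code from the paper: itemsets are modeled as Python sets, and a greedy left-to-right scan suffices, since matching each itemset of the candidate at the earliest possible position never rules out a match for the remaining itemsets.

```python
def is_contained(seq_a, seq_b):
    """Return True if sequence seq_a is contained in sequence seq_b.

    A sequence is a list of itemsets (sets of items). seq_a is contained
    in seq_b if each itemset of seq_a is a subset of a distinct itemset
    of seq_b, in the same order.
    """
    j = 0
    for itemset in seq_a:
        # Scan forward for the next itemset of seq_b that covers this one.
        while j < len(seq_b) and not itemset <= seq_b[j]:
            j += 1
        if j == len(seq_b):
            return False
        j += 1  # Each itemset of seq_b may be matched at most once.
    return True
```

On the example above, is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]) holds, while ⟨(3) (5)⟩ and ⟨(3 5)⟩ are contained in each other in neither direction.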
Figure 8: Customer Sequences

In each pass of the forward phase, we only count sequences of certain lengths. The function f determines exactly which sequences are counted, and balances the tradeoff between the time wasted counting non-maximal sequences and the time wasted counting extensions of small candidate sequences. One extreme is f(k) = k + 1 (where k is the length for which candidates were counted last), when all non-maximal sequences are counted but no extensions of small candidates are counted. The intuition behind the heuristic is that as the percentage of candidates counted in the current pass that have minimum support increases, the time wasted by counting extensions of small candidates when we skip a length goes down.

In the k-th pass, we may not have Lk−1 available, since we did not count the (k−1)-candidate sequences. In that case, we use the candidate set Ck−1 to generate Ck. Correctness is maintained because Ck−1 ⊇ Lk−1.

// Forward Phase
begin
    if ( Lk−1 known ) then
        Ck = New candidates generated from Lk−1;
    else
        Ck = New candidates generated from Ck−1;
    if ( k == f(last) ) then begin
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c.
        Lk = Candidates in Ck with minimum support.
        last = k;
    end
end

// Backward Phase
for ( k--; k >= 1; k-- ) do
    if ( Lk not found in forward phase ) then begin
        Delete all sequences in Ck that are contained in some Li, i > k;
        Lk = Candidates in Ck with minimum support.
    end
    else  // Lk already known
        Delete all sequences in Lk that are contained in some Li, i > k;
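In the backward phase, candidates of a skipped length are counted only after deleting those contained in an already-discovered longer large sequence, since such candidates cannot be maximal. One backward-phase step can be sketched in Python; this is our own sketch with hypothetical names, not the paper's code.

```python
def contained(a, b):
    """True if sequence a (a list of itemsets) is contained in sequence b."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def backward_count(candidates, longer_large, db, min_support):
    """One backward-phase step: drop candidates contained in some
    already-found longer large sequence (they cannot be maximal),
    then count the survivors against the customer sequences in db."""
    survivors = [c for c in candidates
                 if not any(contained(c, big) for big in longer_large)]
    return [c for c in survivors
            if sum(contained(c, cust) for cust in db) >= min_support]
```

With an empty set of longer sequences this degenerates into a plain counting pass; with a non-empty one, every pruned candidate is skipped before touching the database.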
Take for illustrative simplicity f(k) = 2k. In the second pass, we count C2 to get L2. We then start the backward phase. Nothing gets deleted from L4, since there are no longer sequences. We had skipped counting the support for sequences of length 3 in the forward phase. After deleting those sequences in C3 that are contained in some sequence in L4, i.e., subsequences of ⟨1 2 3 4⟩, we are left with the sequences ⟨1 3 5⟩ and ⟨3 4 5⟩. These would be counted to get ⟨1 3 5⟩ as a maximal large 3-sequence. Next, all the sequences in L2 except ⟨4 5⟩ are deleted since they are contained in some longer sequence. For the same reason, all sequences in L1 are also deleted.

In DynamicSome, we generate sequences of length 6 by joining sequences of length 3 with sequences of length 3, sequences of length 9 by joining sequences of length 6 with sequences of length 3, etc. However, to generate the sequences of length 3, we need the large sequences of lengths 1 and 2, and these are obtained in the initialization phase. With a step of 3, we determine L3 and L6, and L9 turns out to be empty in the forward phase; we then generate C7 and C8 in the intermediate phase, and count C8 followed by C7 after deleting non-maximal candidates. In an actual implementation, the intermediate phase is interspersed with the backward phase, but we have omitted this detail in Fig. 12 to simplify exposition.

We use apriori-generate in the initialization and intermediate phases, but use otf-generate in the forward phase. The otf-generate procedure is given in Section 3.3.1. The reason is that apriori-generate generates fewer candidates than otf-generate when we generate Ck+1 from Lk [2]. However, this may not hold when we generate Ck+step from Lk and Lstep.
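Candidate generation in the apriori-generate style can be sketched in Python. This is our own sketch, not the paper's code: after the transformation phase each sequence element can be treated atomically, so sequences are plain tuples here. Two large k-sequences that agree on their first k−1 elements are joined, and a candidate survives only if every k-subsequence obtained by dropping one element is also large.

```python
from itertools import product

def apriori_generate(large_k):
    """Generate candidate (k+1)-sequences from the large k-sequences."""
    large_k = [tuple(s) for s in large_k]
    large_set = set(large_k)
    k = len(large_k[0])
    # Join step: p and q share their first k-1 elements; the candidate
    # extends p by the last element of q (order matters for sequences).
    candidates = {p + (q[-1],)
                  for p, q in product(large_k, repeat=2)
                  if p[:k - 1] == q[:k - 1]}
    # Prune step: every k-subsequence of a candidate must itself be large.
    return sorted(c for c in candidates
                  if all(c[:i] + c[i + 1:] in large_set
                         for i in range(k + 1)))
```

For instance, from the large 2-sequences (1 2), (1 3), (2 3), the join produces several 3-sequences but only (1 2 3) survives the prune, illustrating why this generation is tighter than an unconstrained join.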
// Forward Phase: find Lk+step from Lk and Lstep
Ck+step = ∅;
foreach customer-sequence c in DT do begin
    X = otf-generate( Lk, Lstep, c );  // See Section 3.3.1
    For each sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary).
end
Lk+step = Candidates in Ck+step with min support.

// Intermediate Phase
for ( k--; k > 1; k-- ) do
    if ( Lk not yet determined ) then
        if ( Lk−1 known ) then
            Ck = New candidates generated from Lk−1;
        else
            Ck = New candidates generated from Ck−1;

Figure 12: Algorithm DynamicSome

The otf-generate function takes as arguments Lk, the set of large k-sequences, Lj, the set of large j-sequences, and the customer sequence c. It returns the set of candidate (k+j)-sequences contained in c:

function otf-generate( Lk, Lj, c: ⟨c1 c2 ... cn⟩ )
begin
    Xk = subseq( Lk, c );
    forall sequences x ∈ Xk do
        x.end = min{ j | x is contained in ⟨c1 c2 ... cj⟩ };
    Xj = subseq( Lj, c );
    forall sequences x ∈ Xj do
        x.start = max{ j | x is contained in ⟨cj cj+1 ... cn⟩ };
    Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;
end

The intuition behind this generation procedure is that if a large k-sequence and a large j-sequence are both contained in a customer sequence c, and the occurrence of the first ends strictly before the occurrence of the second starts, then their concatenation is a candidate (k+j)-sequence contained in c.

For example, consider L2 to be the set of sequences shown in Fig. 9. The end and start values for each sequence in L2 which is contained in c are shown in Fig. 13. Then, in the forward phase, we get 2 candidate sequences in C4, including ⟨1 2 3 4⟩.
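The otf-generate procedure can be sketched in Python as follows; this is our illustration of the pseudocode, with sequences represented as lists of itemsets and 1-based indices matching the ⟨c1 ... cn⟩ notation. x.end becomes the shortest prefix of c containing x, x.start the longest suffix containing y, and the join keeps pairs whose occurrences cannot overlap.

```python
def contained(a, b):
    """True if sequence a (a list of itemsets) is contained in sequence b."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def otf_generate(Lk, Lj, c):
    """Candidates generated on the fly from customer sequence c."""
    n = len(c)
    # Large k-sequences contained in c, paired with the 1-based index of
    # the shortest prefix of c that contains them (the "end" value).
    Xk = [(x, min(e for e in range(1, n + 1) if contained(x, c[:e])))
          for x in Lk if contained(x, c)]
    # Large j-sequences contained in c, paired with the 1-based index of
    # the longest suffix of c that contains them (the "start" value).
    Xj = [(y, max(s for s in range(1, n + 1) if contained(y, c[s - 1:])))
          for y in Lj if contained(y, c)]
    # Join: the k-sequence must end strictly before the j-sequence starts.
    return [x + y for x, end in Xk for y, start in Xj if end < start]
```

For c = ⟨(1) (2) (3) (4)⟩, the pair ⟨(1) (2)⟩ (end 2) and ⟨(3) (4)⟩ (start 3) joins into the candidate ⟨(1) (2) (3) (4)⟩, while the reversed pair fails the end < start condition.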
[Figure: Execution times (sec) of DynamicSome, Apriori, and AprioriSome as the minimum support decreases from 1 to 0.2, for the datasets C20-T2.5-S4-I1.25 and C20-T2.5-S8-I1.25.]
the number of candidates generated using AprioriSome can be larger. Second, although AprioriSome skips over counting candidates of some lengths, they are generated nonetheless and stay memory resident. If memory gets filled up, AprioriSome is forced to count the last set of candidates generated even if the heuristic suggests skipping some more candidate sets. This effect decreases the skipping distance between the two candidate sets that are indeed counted, and AprioriSome starts behaving more like AprioriAll. For lower supports, there are longer large sequences, and hence more non-maximal sequences, and AprioriSome does better.

We will present in this section the results of scale-up experiments for the AprioriSome algorithm. We also performed the same experiments for AprioriAll, and found the results to be very similar. We do not report the AprioriAll results to conserve space. We will present the scale-up results for some selected datasets; similar results were obtained for other datasets.

Fig. 15 shows how AprioriSome scales up as the number of customers is increased ten times from 250,000 to 2.5 million. (The scale-up graph for increasing the number of customers from 25,000 to 250,000 looks very similar.)
[Figure 15: Relative execution time of AprioriSome as the number of customers increases, for minimum support levels of 2%, 1%, and 0.5%.]

The main reason for the increase was that, in spite of setting the minimum support in terms of the number of customers, the number of large sequences increased with increasing customer-sequence size. A secondary reason was that finding the candidates present in a customer sequence took a little more time.
We kept the size of the database roughly constant by keeping the product of the average customer-sequence size and the number of customers constant. We fixed the minimum support in terms of the number of transactions in this experiment. Fixing the minimum support as a percentage would have led to large increases in the number of large sequences, and we wanted to keep the size of the answer set roughly the same. All the experiments had the large sequence length set to 4 and the large itemset size set to 1.25. The average transaction size was set to 2.5 in the first graph, while the number of transactions per customer was set to 10 in the second. The numbers in the key (e.g. 100) refer to the minimum support.

The results are shown in Fig. 16. As shown, the execution times usually increased with the customer-sequence size, but only gradually.

Some applications need the number of people who bought the first k items, for 0 < k < length of sequence. In this case, we will have to make an additional pass over the data to get counts for all prefixes of large sequences if we were using the AprioriSome algorithm. With the AprioriAll algorithm, we already have these counts. In such applications, therefore, AprioriAll will become the preferred algorithm.

These algorithms have been implemented on several data repositories, including the AIX file system and DB2/6000, as part of the Quest project, and have been run against data from several sources. In the future, we plan to extend this work along the following lines:

Extension of the algorithms to discover sequential patterns across item categories. An example of such a category hierarchy is that a dish washer is a kitchen appliance is a heavy electric appliance, etc.
[Figure 16: Relative execution time of AprioriSome as the number of transactions per customer increases from 10 to 50 (first graph) and as the transaction size increases from 2.5 to 12.5 (second graph), for minimum supports of 200, 100, and 50.]
Transposition of constraints into the discovery algorithms. There could be item constraints (e.g. sequential patterns involving home appliances) or time constraints (e.g. the elements of the patterns should come from transactions that are at least 1 day apart).

[6] T. G. Dietterich and R. S. Michalski. Discovering patterns in sequences of events. Artificial Intelligence, 25:187-232, 1985.

[7] L. Hui. Color set size problem with applications to string matching. In A. Apostolico,