
Frequent Pattern Mining

Christian Borgelt
Intelligent Data Analysis and Graphical Models Research Unit
European Centre for Soft Computing
c/ Gonzalo Gutiérrez Quirós s/n, 33600 Mieres, Spain
christian@borgelt.net
http://www.borgelt.net/
http://www.borgelt.net/teach/fpm/
http://www.softcomputing.es/
Christian Borgelt Frequent Pattern Mining 1
Overview
Frequent Pattern Mining comprises
  Frequent Item Set Mining and Association Rule Induction
  Frequent Sequence Mining
  Frequent Tree Mining
  Frequent Graph Mining
Application Areas of Frequent Pattern Mining include
  Market Basket Analysis
  Click Stream Analysis
  Web Link Analysis
  Genome Analysis
  Drug Design (Molecular Fragment Mining)

Frequent Item Set Mining

Frequent Item Set Mining: Motivation
Frequent Item Set Mining is a method for market basket analysis.
It aims at finding regularities in the shopping behavior of customers
of supermarkets, mail-order companies, on-line shops etc.
More specifically:
Find sets of products that are frequently bought together.
Possible applications of found frequent item sets:
  Improve arrangement of products in shelves, on a catalog's pages etc.
  Support cross-selling (suggestion of other products), product bundling.
  Fraud detection, technical dependence analysis etc.
Often found patterns are expressed as association rules, for example:
  If a customer buys bread and wine,
  then she/he will probably also buy cheese.

Frequent Item Set Mining: Basic Notions
Let $B = \{i_1, \ldots, i_m\}$ be a set of items. This set is called the item base.
Items may be products, special equipment items, service options etc.
Any subset $I \subseteq B$ is called an item set.
An item set may be any set of products that can be bought (together).
Let $T = (t_1, \ldots, t_n)$ with $\forall k, 1 \le k \le n: t_k \subseteq B$ be a vector of
transactions over $B$. This vector is called the transaction database.
A transaction database can list, for example, the sets of products
bought by the customers of a supermarket in a given period of time.
Every transaction is an item set, but some item sets may not appear in $T$.
Transactions need not be pairwise different: it may be $t_j = t_k$ for $j \ne k$.
$T$ may also be defined as a bag or multiset of transactions.
The set $B$ may not be explicitly given, but only implicitly as $B = \bigcup_{k=1}^{n} t_k$.

Frequent Item Set Mining: Basic Notions
Let $I \subseteq B$ be an item set and $T$ a transaction database over $B$.
A transaction $t \in T$ covers the item set $I$ or
the item set $I$ is contained in a transaction $t \in T$ iff $I \subseteq t$.
The set $K_T(I) = \{k \in \{1, \ldots, n\} \mid I \subseteq t_k\}$ is called the cover of $I$ w.r.t. $T$.
The cover of an item set is the index set of the transactions that cover it.
It may also be defined as a vector of all transactions that cover it
(which, however, is complicated to write in a formally correct way).
The value $s_T(I) = |K_T(I)|$ is called the (absolute) support of $I$ w.r.t. $T$.
The value $\sigma_T(I) = \frac{1}{n} |K_T(I)|$ is called the relative support of $I$ w.r.t. $T$.
The support of $I$ is the number or fraction of transactions that contain it.
Sometimes $\sigma_T(I)$ is also called the (relative) frequency of $I$ w.r.t. $T$.
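The definitions of cover and support can be sketched in a few lines of Python. This is only an illustration: the function names `cover`, `support` and `relative_support` are assumptions made here, and the database `T` is the ten-transaction example used elsewhere in these slides, stored as a list so that position $k$ holds $t_k$.

```python
from fractions import Fraction

# Transaction database as a vector: list position k-1 holds transaction t_k.
T = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                      "acd", "bc", "acde", "bce", "ade")]

def cover(itemset, transactions):
    """K_T(I): the 1-based indices k of all transactions t_k with I a subset of t_k."""
    items = set(itemset)
    return {k for k, t in enumerate(transactions, start=1) if items <= t}

def support(itemset, transactions):
    """s_T(I) = |K_T(I)|, the absolute support."""
    return len(cover(itemset, transactions))

def relative_support(itemset, transactions):
    """sigma_T(I) = s_T(I) / n, kept exact as a Fraction."""
    return Fraction(support(itemset, transactions), len(transactions))

print(sorted(cover("ad", T)), support("ad", T), relative_support("ad", T))
```

The empty item set is covered by every transaction, so its absolute support equals $n$.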

Frequent Item Set Mining: Basic Notions
Alternative Definition of Transactions
A transaction over an item base $B$ is a tuple $t = (tid, J)$, where
  $tid$ is a unique transaction identifier and
  $J \subseteq B$ is an item set.
A transaction database $T = \{t_1, \ldots, t_n\}$ is a set of transactions.
A simple set can be used, since transactions differ at least in their identifier.
A transaction $t = (tid, J)$ covers an item set $I$ iff $I \subseteq J$.
The set $K_T(I) = \{tid \mid \exists J \subseteq B: \exists t \in T: t = (tid, J) \wedge I \subseteq J\}$
is the cover of $I$ w.r.t. $T$.
Remark: If the transaction database is defined as a vector, there is an implicit
transaction identifier, namely the position of the transaction in the vector.

Frequent Item Set Mining: Formal Definition
Given:
  a set $B = \{i_1, \ldots, i_m\}$ of items, the item base,
  a vector $T = (t_1, \ldots, t_n)$ of transactions over $B$, the transaction database,
  a number $s_{\min} \in \mathbb{N}$, $0 < s_{\min} \le n$, or (equivalently)
  a number $\sigma_{\min} \in \mathbb{R}$, $0 < \sigma_{\min} \le 1$, the minimum support.
Desired:
  the set of frequent item sets, that is,
  the set $F_T(s_{\min}) = \{I \subseteq B \mid s_T(I) \ge s_{\min}\}$ or (equivalently)
  the set $\Phi_T(\sigma_{\min}) = \{I \subseteq B \mid \sigma_T(I) \ge \sigma_{\min}\}$.
Note that with the relations $s_{\min} = \lceil n \sigma_{\min} \rceil$ and $\sigma_{\min} = \frac{1}{n} s_{\min}$
the two versions can easily be transformed into each other.

Frequent Item Sets: Example
transaction database
   1: {a, d, e}
   2: {b, c, d}
   3: {a, c, e}
   4: {a, c, d, e}
   5: {a, e}
   6: {a, c, d}
   7: {b, c}
   8: {a, c, d, e}
   9: {b, c, e}
  10: {a, d, e}
frequent item sets (with absolute support)
  0 items: $\emptyset$: 10
  1 item:  {a}: 7, {b}: 3, {c}: 7, {d}: 6, {e}: 7
  2 items: {a, c}: 4, {a, d}: 5, {a, e}: 6, {b, c}: 3, {c, d}: 4, {c, e}: 4, {d, e}: 4
  3 items: {a, c, d}: 3, {a, c, e}: 3, {a, d, e}: 4
The minimum support is $s_{\min} = 3$ or $\sigma_{\min} = 0.3 = 30\%$ in this example.
There are $2^5 = 32$ possible item sets over $B = \{a, b, c, d, e\}$.
There are 16 frequent item sets (but only 10 transactions).
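The example can be checked with a brute-force enumeration of all $2^5$ item sets. The following is a minimal sketch, not an efficient miner; the function name `brute_force` is an assumption made for illustration, and the item base is derived implicitly as the union of the transactions.

```python
from itertools import combinations

T = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                      "acd", "bc", "acde", "bce", "ade")]

def brute_force(transactions, s_min):
    """Enumerate every subset of the item base; keep those with support >= s_min."""
    items = sorted(set().union(*transactions))   # B as the union of all transactions
    frequent = {}
    for size in range(len(items) + 1):
        for cand in combinations(items, size):
            s = sum(1 for t in transactions if set(cand) <= t)
            if s >= s_min:
                frequent[frozenset(cand)] = s
    return frequent

frequent = brute_force(T, 3)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)) or "(empty)", s)
```

The number of candidates doubles with every additional item, so this approach is only usable for very small item bases.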

Searching for Frequent Item Sets

Properties of the Support of Item Sets
A brute force approach that traverses all possible item sets, determines their
support, and discards infrequent item sets is usually infeasible:
  The number of possible item sets grows exponentially with the number of items.
  A typical supermarket offers thousands of different products.
Idea: Consider the properties of an item set's cover and support, in particular:
  $\forall I: \forall J \supseteq I: K_T(J) \subseteq K_T(I)$.
This property holds, since $\forall t: \forall I: \forall J \supseteq I: J \subseteq t \Rightarrow I \subseteq t$.
Each additional item is another condition a transaction has to satisfy.
Transactions that do not satisfy this condition are removed from the cover.
It follows: $\forall I: \forall J \supseteq I: s_T(J) \le s_T(I)$.
That is: If an item set is extended, its support cannot increase.
One also says that support is anti-monotone or downward closed.

Properties of the Support of Item Sets
From $\forall I: \forall J \supseteq I: s_T(J) \le s_T(I)$ it follows immediately
  $\forall s_{\min}: \forall I: \forall J \supseteq I: s_T(I) < s_{\min} \Rightarrow s_T(J) < s_{\min}$.
That is: No superset of an infrequent item set can be frequent.
This property is often referred to as the Apriori Property.
Rationale: Sometimes we can know a priori, that is, before checking its support
by accessing the given transaction database, that an item set cannot be frequent.
Of course, the contraposition of this implication also holds:
  $\forall s_{\min}: \forall I: \forall J \subseteq I: s_T(I) \ge s_{\min} \Rightarrow s_T(J) \ge s_{\min}$.
That is: All subsets of a frequent item set are frequent.
This suggests a compressed representation of the set of frequent item sets
(which will be explored later: maximal and closed frequent item sets).

Reminder: Partially Ordered Sets
A partial order is a binary relation $\le$ over a set $S$ which satisfies $\forall a, b, c \in S$:
  $a \le a$ (reflexivity)
  $a \le b \wedge b \le a \Rightarrow a = b$ (anti-symmetry)
  $a \le b \wedge b \le c \Rightarrow a \le c$ (transitivity)
A set with a partial order is called a partially ordered set (or poset for short).
Let $a$ and $b$ be two distinct elements of a partially ordered set $(S, \le)$.
  If $a \le b$ or $b \le a$, then $a$ and $b$ are called comparable.
  If neither $a \le b$ nor $b \le a$, then $a$ and $b$ are called incomparable.
If all pairs of elements of the underlying set $S$ are comparable,
the order is called a total order or a linear order.
In a total order the reflexivity axiom is replaced by the stronger axiom:
  $a \le b \vee b \le a$ (totality)

Properties of the Support of Item Sets
Monotonicity in Calculus and Analysis
A function $f: \mathbb{R} \to \mathbb{R}$ is called monotonically non-decreasing
if $\forall x, y: x \le y \Rightarrow f(x) \le f(y)$.
A function $f: \mathbb{R} \to \mathbb{R}$ is called monotonically non-increasing
if $\forall x, y: x \le y \Rightarrow f(x) \ge f(y)$.
Monotonicity in Order Theory
Order theory is concerned with arbitrary partially ordered sets.
The terms increasing and decreasing are avoided, because they lose their pictorial
motivation as soon as sets are considered that are not totally ordered.
A function $f: S \to R$, where $S$ and $R$ are two partially ordered sets, is called
monotone or order-preserving if $\forall x, y \in S: x \le_S y \Rightarrow f(x) \le_R f(y)$.
A function $f: S \to R$ is called
anti-monotone or order-reversing if $\forall x, y \in S: x \le_S y \Rightarrow f(x) \ge_R f(y)$.
In this sense the support of an item set is anti-monotone.

Properties of Frequent Item Sets
A subset $R$ of a partially ordered set $(S, \le)$ is called downward closed
if for any element of the set all smaller elements are also in it:
  $\forall x \in R: \forall y \in S: y \le x \Rightarrow y \in R$.
In this case the subset $R$ is also called a lower set.
The notions of upward closed and upper set are defined analogously.
For every $s_{\min}$ the set of frequent item sets $F_T(s_{\min})$ is downward closed
w.r.t. the partially ordered set $(2^B, \subseteq)$, where $2^B$ denotes the powerset of $B$:
  $\forall X \in F_T(s_{\min}): \forall Y \subseteq B: Y \subseteq X \Rightarrow Y \in F_T(s_{\min})$.
Since the set of frequent item sets is induced by the support function,
the notions of up- or downward closed are transferred to the support function:
Any set of item sets induced by a support threshold $\theta$ is up- or downward closed.
  $F_T(\theta) = \{S \subseteq B \mid s_T(S) \ge \theta\}$ is downward closed,
  $G_T(\theta) = \{S \subseteq B \mid s_T(S) < \theta\}$ is upward closed.

Reminder: Partially Ordered Sets and Hasse Diagrams
A finite partially ordered set $(S, \le)$ can be depicted as a (directed) acyclic graph $G$,
which is called Hasse diagram.
$G$ has the elements of $S$ as nodes.
The edges are selected according to:
  If $x$ and $y$ are elements of $S$ with $x < y$
  (that is, $x \le y$ and not $x = y$) and
  there is no element between $x$ and $y$
  (that is, no $z \in S$ with $x < z < y$),
  then there is an edge from $x$ to $y$.
Since the graph is acyclic
(there is no directed cycle),
the graph can always be depicted
such that all edges lead downwards.
The Hasse diagram of a total or
linear order is a chain.
[Figure: Hasse diagram of $(2^{\{a,b,c,d,e\}}, \subseteq)$, with the item sets arranged by size:
  a b c d e
  ab ac ad ae bc bd be cd ce de
  abc abd abe acd ace ade bcd bce bde cde
  abcd abce abde acde bcde
  abcde
Edge directions are omitted; all edges lead downwards.]
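For the powerset ordered by $\subseteq$, the edge rule above reduces to a simple condition: there is an edge from $X$ to $Y$ exactly if $Y$ is $X$ plus one additional item. A minimal sketch that lists these edges follows; the function name `hasse_edges` is an assumption made for illustration.

```python
from itertools import combinations

def hasse_edges(items):
    """Edges (X, Y) of the Hasse diagram of (2^items, subset): Y = X plus one item."""
    items = sorted(items)
    edges = []
    for size in range(len(items)):
        for x in combinations(items, size):
            for i in items:
                if i not in x:
                    edges.append((frozenset(x), frozenset(x) | {i}))
    return edges

edges = hasse_edges("abcde")
print(len(edges))   # every X has one outgoing edge per item it does not contain
```

Each of the $2^n$ nodes $X$ has $n - |X|$ outgoing edges, which gives $n \cdot 2^{n-1}$ edges in total.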

Searching for Frequent Item Sets
The standard search procedure is an enumeration approach,
that enumerates candidate item sets and checks their support.
It improves over the brute force approach by exploiting the apriori property
to skip item sets that cannot be frequent because they have an infrequent subset.
The search space is the partially ordered set $(2^B, \subseteq)$.
The structure of the partially ordered set $(2^B, \subseteq)$ helps to identify
those item sets that can be skipped due to the apriori property.
  top-down search (from empty set/one-element sets to larger sets)
Since a partially ordered set can conveniently be depicted by a Hasse diagram,
we will use such diagrams to illustrate the search.
Note that the search may have to visit an exponential number of item sets.
In practice, however, the search times are often bearable,
at least if the minimum support is not chosen too low.

Searching for Frequent Item Sets
Idea: Use the properties of the support to organize the search for all frequent
item sets, especially the apriori property:
  $\forall I: \forall J \supset I: s_T(I) < s_{\min} \Rightarrow s_T(J) < s_{\min}$.
Since these properties relate the support of an item set to the support of its
subsets and supersets, it is reasonable to organize the search based on the
structure of the partially ordered set $(2^B, \subseteq)$.
[Figure: Hasse diagram of $(2^B, \subseteq)$ for five items $\{a, b, c, d, e\} = B$.]

Hasse Diagrams and Frequent Item Sets
transaction database
   1: {a, d, e}
   2: {b, c, d}
   3: {a, c, e}
   4: {a, c, d, e}
   5: {a, e}
   6: {a, c, d}
   7: {b, c}
   8: {a, c, d, e}
   9: {b, c, e}
  10: {a, d, e}
Blue boxes are frequent item sets, white boxes infrequent item sets.
[Figure: Hasse diagram with frequent item sets ($s_{\min} = 3$).]

The Apriori Algorithm
[Agrawal and Srikant 1994]

Searching for Frequent Item Sets
One possible scheme for the search:
Determine the support of the one element item sets
and discard the infrequent items.
Form candidate item sets with two items (both items must be frequent),
determine their support, and discard the infrequent item sets.
Form candidate item sets with three items (all pairs must be frequent),
determine their support, and discard the infrequent item sets.
Continue by forming candidate item sets with four, five etc. items
until no candidate item set is frequent.
This is the general scheme of the Apriori Algorithm.
It is based on two main steps: candidate generation and pruning.
All enumeration algorithms are based on these two steps in some form.

The Apriori Algorithm 1
function apriori ($B$, $T$, $s_{\min}$)
begin                                              (* Apriori algorithm *)
  $k := 1$;                                        (* initialize the item set size *)
  $E_k := \bigcup_{i \in B} \{\{i\}\}$;            (* start with single element sets *)
  $F_k := \mathrm{prune}(E_k, T, s_{\min})$;       (* and determine the frequent ones *)
  while $F_k \ne \emptyset$ do begin               (* while there are frequent item sets *)
    $E_{k+1} := \mathrm{candidates}(F_k)$;         (* create candidates with one item more *)
    $F_{k+1} := \mathrm{prune}(E_{k+1}, T, s_{\min})$;   (* and determine the frequent item sets *)
    $k := k + 1$;                                  (* increment the item counter *)
  end;
  return $\bigcup_{j=1}^{k} F_j$;                  (* return the frequent item sets *)
end (* apriori *)
$E_j$: candidate item sets of size $j$, $F_j$: frequent item sets of size $j$.

The Apriori Algorithm 2
function candidates ($F_k$)
begin                                              (* generate candidates with k + 1 items *)
  $E := \emptyset$;                                (* initialize the set of candidates *)
  forall $f_1, f_2 \in F_k$                        (* traverse all pairs of frequent item sets *)
  with $f_1 = \{i_1, \ldots, i_{k-1}, i_k\}$       (* that differ only in one item and *)
  and  $f_2 = \{i_1, \ldots, i_{k-1}, i'_k\}$      (* are in a lexicographic order *)
  and  $i_k < i'_k$ do begin                       (* (the order is arbitrary, but fixed) *)
    $f := f_1 \cup f_2 = \{i_1, \ldots, i_{k-1}, i_k, i'_k\}$;   (* union has k + 1 items *)
    if $\forall i \in f: f - \{i\} \in F_k$        (* if all subsets with k items are frequent, *)
    then $E := E \cup \{f\}$;                      (* add the new item set to the candidates *)
  end;                                             (* (otherwise it cannot be frequent) *)
  return $E$;                                      (* return the generated candidates *)
end (* candidates *)

The Apriori Algorithm 3
function prune ($E$, $T$, $s_{\min}$)
begin                                              (* prune infrequent candidates *)
  forall $e \in E$ do                              (* initialize the support counters *)
    $s_T(e) := 0$;                                 (* of all candidates to be checked *)
  forall $t \in T$ do                              (* traverse the transactions *)
    forall $e \in E$ do                            (* traverse the candidates *)
      if $e \subseteq t$                           (* if transaction contains the candidate, *)
      then $s_T(e) := s_T(e) + 1$;                 (* increment the support counter *)
  $F := \emptyset$;                                (* initialize the set of frequent candidates *)
  forall $e \in E$ do                              (* traverse the candidates *)
    if $s_T(e) \ge s_{\min}$                       (* if a candidate is frequent, *)
    then $F := F \cup \{e\}$;                      (* add it to the set of frequent item sets *)
  return $F$;                                      (* return the pruned set of candidates *)
end (* prune *)
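The three functions can be transcribed into Python roughly as follows. This is a sketch under stated assumptions rather than a tuned implementation: item sets are `frozenset` objects, `prune` returns a dictionary from item set to support, and `candidates` pairs any two frequent item sets whose union has one item more instead of relying on a lexicographic order (the result is the same set of candidates, since a Python set absorbs duplicates).

```python
from itertools import combinations

T = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                      "acd", "bc", "acde", "bce", "ade")]

def prune(cands, transactions, s_min):
    """Count the support of every candidate and keep the frequent ones."""
    counts = {e: 0 for e in cands}
    for t in transactions:
        for e in cands:
            if e <= t:                    # transaction contains the candidate
                counts[e] += 1
    return {e: s for e, s in counts.items() if s >= s_min}

def candidates(frequent_k):
    """Candidates with k + 1 items, all of whose k-item subsets are frequent."""
    result = set()
    for f1, f2 in combinations(frequent_k, 2):
        f = f1 | f2
        if len(f) == len(f1) + 1 and all(f - {i} in frequent_k for i in f):
            result.add(f)
    return result

def apriori(transactions, s_min):
    """Levelwise search; returns {frequent item set: support} for sizes >= 1."""
    items = set().union(*transactions)
    level = prune({frozenset([i]) for i in items}, transactions, s_min)
    frequent = {}
    while level:                          # while there are frequent item sets
        frequent.update(level)
        level = prune(candidates(level), transactions, s_min)
    return frequent

result = apriori(T, 3)
print(len(result))
```

As in the pseudocode, the search starts with the one-element sets, so the empty item set is not part of the result.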

Improving the Candidate Generation

Searching for Frequent Item Sets
The Apriori algorithm searches the partial order top-down level by level.
Collecting the frequent item sets of size $k$ in a set $F_k$ has drawbacks:
A frequent item set of size $k + 1$ can be formed in
  $j = \frac{k(k + 1)}{2}$
possible ways. (For infrequent item sets the number may be smaller.)
As a consequence, the candidate generation step may carry out a lot of
redundant work, since it suffices to generate each candidate item set once.
Question: Can we reduce or even eliminate this redundant work?
More generally:
How can we make sure that any candidate item set is generated at most once?
Idea: Assign to each item set a unique parent item set,
from which this item set is to be generated.

Searching for Frequent Item Sets
A core problem is that an item set of size $k$ (that is, with $k$ items)
can be generated in $k!$ different ways (on $k!$ paths in the Hasse diagram),
because in principle the items may be added in any order.
If we consider an item by item process of building an item set
(which can be imagined as a levelwise traversal of the partial order),
there are $k$ possible ways of forming an item set of size $k$
from item sets of size $k - 1$ by adding the remaining item.
It is obvious that it suffices to consider each item set at most once in order
to find the frequent ones (infrequent item sets need not be generated at all).
Question: Can we reduce or even eliminate this variety?
More generally:
How can we make sure that any candidate item set is generated at most once?
Idea: Assign to each item set a unique parent item set,
from which this item set is to be generated.

Searching for Frequent Item Sets
We have to search the partially ordered set $(2^B, \subseteq)$ / its Hasse diagram.
Assigning unique parents turns the Hasse diagram into a tree.
Traversing the resulting tree explores each item set exactly once.
[Figure: Hasse diagram and a possible tree for five items.]

Searching with Unique Parents
Principle of a Search Algorithm based on Unique Parents:
Base Loop:
  Traverse all one-element item sets (their unique parent is the empty set).
  Recursively process all one-element item sets that are frequent.
Recursive Processing:
For a given frequent item set $I$:
  Generate all extensions $J$ of $I$ by one item (that is, $J \supset I$, $|J| = |I| + 1$)
  for which the item set $I$ is the chosen unique parent.
  For all $J$: if $J$ is frequent, process $J$ recursively, otherwise discard $J$.
Questions:
  How can we formally assign unique parents?
  How can we make sure that we generate only those extensions
  for which the item set that is extended is the chosen unique parent?

Assigning Unique Parents
Formally, the set of all possible parents of an item set $I$ is
  $P(I) = \{J \subset I \mid \nexists K: J \subset K \subset I\}$.
In other words, the possible parents of $I$ are its maximal proper subsets.
In order to single out one element of $P(I)$, the canonical parent $p_c(I)$,
we can simply define an (arbitrary, but fixed) global order of the items:
  $i_1 < i_2 < i_3 < \cdots < i_n$.
Then the canonical parent of an item set $I$ can be defined as the item set
  $p_c(I) = I - \{\max_{i \in I} i\}$ (or $p_c(I) = I - \{\min_{i \in I} i\}$),
where the maximum (or minimum) is taken w.r.t. the chosen order of the items.
Even though this approach is straightforward and simple,
we reformulate it now in terms of a canonical form of an item set,
in order to lay the foundations for the study of frequent (sub)graph mining.
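Both definitions are short enough to write down directly. In the sketch below the global item order is assumed to be the alphabetical order stored in `ORDER`; the names `possible_parents` and `canonical_parent` are chosen here for illustration.

```python
ORDER = "abcde"   # assumed (arbitrary, but fixed) global order of the items

def possible_parents(itemset):
    """P(I): the maximal proper subsets of I, obtained by removing one item each."""
    s = frozenset(itemset)
    return {s - {i} for i in s}

def canonical_parent(itemset, order=ORDER):
    """p_c(I) = I - {max I}, with the maximum taken w.r.t. the item order.

    Defined for non-empty item sets only.
    """
    s = frozenset(itemset)
    last = max(s, key=order.index)
    return s - {last}

I = "abde"
print(sorted("".join(sorted(p)) for p in possible_parents(I)))
print("".join(sorted(canonical_parent(I))))
```

Because every non-empty item set has exactly one canonical parent, following canonical parents from any item set leads along a single path to the empty set, which is what turns the Hasse diagram into a tree.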

Canonical Forms of Item Sets

Canonical Forms
The meaning of the word "canonical":
(source: Oxford Advanced Learner's Dictionary, Encyclopedic Edition)
  canon /ˈkænən/ n 1 general rule, standard or principle, by which sth is judged:
  This film offends against all the canons of good taste. ...
  canonical /kəˈnɒnɪkl/ adj ... 3 standard; accepted. ...
A canonical form of something is a standard representation of it.
The canonical form must be unique (otherwise it could not be standard).
Nevertheless there are often several possible choices for a canonical form.
However, one must fix one of them for a given application.
In the following we will define a standard representation of an item set,
and later standard representations of a graph, a sequence, a tree etc.
This canonical form will be used to assign unique parents to all item sets.

A Canonical Form for Item Sets
An item set is represented by a code word; each letter represents an item.
The code word is a word over the alphabet $B$, the set of all items.
There are $k!$ possible code words for an item set of size $k$,
because the items may be listed in any order.
By introducing an (arbitrary, but fixed) order of the items,
and by comparing code words lexicographically w.r.t. this order,
we can define an order on these code words.
Example: abc < bac < bca < cab etc. for the item set $\{a, b, c\}$ and $a < b < c$.
The lexicographically smallest (or, alternatively, greatest) code word
for an item set is defined to be its canonical code word.
Obviously the canonical code word lists the items in the chosen, fixed order.
Remark: These explanations may appear obfuscated, since the core idea and the result are very simple.
However, the view developed here will help us a lot when we turn to frequent (sub)graph mining.

Canonical Forms and Canonical Parents
Let $I$ be an item set and $w_c(I)$ its canonical code word.
The canonical parent $p_c(I)$ of the item set $I$ is the item set
described by the longest proper prefix of the code word $w_c(I)$.
Since the canonical code word of an item set lists its items in the chosen order,
this definition is equivalent to
  $p_c(I) = I - \{\max_{a \in I} a\}$.
General Recursive Processing with Canonical Forms:
For a given frequent item set $I$:
  Generate all possible extensions $J$ of $I$ by one item ($J \supset I$, $|J| = |I| + 1$).
  Form the canonical code word $w_c(J)$ of each extended item set $J$.
  For each $J$: if the last letter of $w_c(J)$ is the item added to $I$ to form $J$
  and $J$ is frequent, process $J$ recursively, otherwise discard $J$.

The Prefix Property
Note that the considered item set coding scheme has the prefix property:
The longest proper prefix of the canonical code word of any item set
is a canonical code word itself.
With the longest proper prefix of the canonical code word of an item set $I$
we not only know the canonical parent of $I$, but also its canonical code word.
Example: Consider the item set $I = \{a, b, d, e\}$:
  The canonical code word of $I$ is abde.
  The longest proper prefix of abde is abd.
  abd is the canonical code word of $p_c(I) = \{a, b, d\}$.
Note that the prefix property immediately implies:
Every prefix of a canonical code word is a canonical code word itself.
(In the following both statements are called the prefix property, since they are obviously equivalent.)

Searching with the Prefix Property
The prefix property allows us to simplify the search scheme:
The general recursive processing scheme with canonical forms requires
to construct the canonical code word of each created item set
in order to decide whether it has to be processed recursively or not.
We know the canonical code word of every item set that is processed recursively.
With this code word we know, due to the prefix property, the canonical
code words of all child item sets that have to be explored in the recursion
with the exception of the last letter (that is, the added item).
We only have to check whether the code word that results from appending
the added item to the given canonical code word is canonical or not.
Advantage:
Checking whether a given code word is canonical can be simpler/faster
than constructing a canonical code word from scratch.

Searching with the Prefix Property
Principle of a Search Algorithm based on the Prefix Property:
Base Loop:
  Traverse all possible items, that is,
  the canonical code words of all one-element item sets.
  Recursively process each code word that describes a frequent item set.
Recursive Processing:
For a given (canonical) code word of a frequent item set:
  Generate all possible extensions by one item.
  This is done by simply appending the item to the code word.
  Check whether the extended code word is the canonical code word
  of the item set that is described by the extended code word
  (and, of course, whether the described item set is frequent).
  If it is, process the extended code word recursively, otherwise discard it.

Searching with the Prefix Property: Examples
Suppose the item base is $B = \{a, b, c, d, e\}$ and let us assume that
we simply use the alphabetical order to define a canonical form (as before).
Consider the recursive processing of the code word acd
(this code word is canonical, because its letters are in alphabetical order):
  Since acd contains neither b nor e, its extensions are acdb and acde.
  The code word acdb is not canonical and thus it is discarded
  (because d > b; note that it suffices to compare the last two letters).
  The code word acde is canonical and therefore it is processed recursively.
Consider the recursive processing of the code word bc:
  The extended code words are bca, bcd and bce.
  bca is not canonical and thus discarded.
  bcd and bce are canonical and therefore processed recursively.
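The two examples can be reproduced with a small sketch. Code words are represented as Python strings, which assumes single-character items; the helper names `is_canonical` and `extensions` are introduced here only for illustration.

```python
def is_canonical(word, order="abcde"):
    """A code word is canonical iff its letters strictly ascend in the item order."""
    pos = [order.index(c) for c in word]
    return all(p < q for p, q in zip(pos, pos[1:]))

def extensions(word, order="abcde"):
    """Append every item not yet in `word`; return (canonical, discarded) code words."""
    ext = [word + i for i in order if i not in word]
    keep = [w for w in ext if is_canonical(w, order)]
    drop = [w for w in ext if not is_canonical(w, order)]
    return keep, drop

print(extensions("acd"))
print(extensions("bc"))
```

If the given code word is already canonical, comparing only its last letter with the appended item would be sufficient; the full check in `is_canonical` is kept here for clarity.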

Searching with the Prefix Property
Exhaustive Search
The prefix property is a necessary condition for ensuring
that all canonical code words can be constructed in the search
by appending extensions (items) to visited canonical code words.
Suppose the prefix property did not hold. Then:
  There exist a canonical code word $w$ and a prefix $v$ of $w$,
  such that $v$ is not a canonical code word.
  Forming $w$ by repeatedly appending items must form $v$ first
  (otherwise the prefix would differ).
  When $v$ is constructed in the search, it is discarded,
  because it is not canonical.
  As a consequence, the canonical code word $w$ can never be reached.
The simplified search scheme can be exhaustive only if the prefix property holds.

Searching with Canonical Forms
Straightforward Improvement of the Extension Step:
The considered canonical form lists the items in the chosen item order.
  If the added item succeeds all already present items in the chosen order,
  the result is in canonical form.
  If the added item precedes any of the already present items in the chosen order,
  the result is not in canonical form.
As a consequence, we have a very simple canonical extension rule
(that is, a rule that generates all children and only canonical code words).
Applied to the Apriori algorithm, this means that we generate candidates
of size $k + 1$ by combining two frequent item sets $f_1 = \{i_1, \ldots, i_{k-1}, i_k\}$
and $f_2 = \{i_1, \ldots, i_{k-1}, i'_k\}$ only if $i_k < i'_k$ and $\forall j, 1 \le j < k: i_j < i_{j+1}$.
Note that it suffices to compare the last letters/items $i_k$ and $i'_k$
if all frequent item sets are represented by canonical code words.

Searching with Canonical Forms
Final Search Algorithm based on Canonical Forms:
Base Loop:
  Traverse all possible items, that is,
  the canonical code words of all one-element item sets.
  Recursively process each code word that describes a frequent item set.
Recursive Processing:
For a given (canonical) code word of a frequent item set:
  Generate all possible extensions by a single item,
  where this item succeeds the last letter (item) of the given code word.
  This is done by simply appending the item to the code word.
  If the item set described by the resulting extended code word is frequent,
  process the code word recursively, otherwise discard it.
This search scheme generates each candidate item set at most once.
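The final search scheme can be sketched as a short recursive function. The version below is an illustration under simplifying assumptions: code words are strings over single-character items, support is recomputed by a full pass over the transactions, and no a priori subset check is done, so a candidate such as bcd is still counted although bd is infrequent. The names `mine`, `process` and `generated` are assumptions made here.

```python
T = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                      "acd", "bc", "acde", "bce", "ade")]

def mine(transactions, s_min, order):
    """Depth-first search that only appends items succeeding the last letter."""
    frequent = {}     # canonical code word -> support
    generated = []    # every candidate code word, in generation order

    def support(word):
        items = set(word)
        return sum(1 for t in transactions if items <= t)

    def process(word, start):
        for pos in range(start, len(order)):
            ext = word + order[pos]        # append the item to the code word
            generated.append(ext)
            s = support(ext)
            if s >= s_min:                 # frequent: process recursively
                frequent[ext] = s
                process(ext, pos + 1)

    process("", 0)                         # base loop over all single items
    return frequent, generated

frequent, generated = mine(T, 3, "abcde")
print(sorted(frequent))
```

Since every generated code word has strictly ascending letters and is reached only from its longest proper prefix, no item set appears twice in `generated`.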

Canonical Parents and Prefix Trees
Item sets, whose canonical code words share the same longest proper prefix,
are siblings, because they have (by definition) the same canonical parent.
This allows us to represent the canonical parent tree as a prefix tree or trie.
[Figure: Canonical parent tree/prefix tree and prefix tree with merged siblings for five items.]

Canonical Parents and Prefix Trees
[Figure: A (full) prefix tree for the five items a, b, c, d, e.]
Based on a global order of the items (which can be arbitrary).
The item sets counted in a node consist of
  all items labeling the edges to the node (common prefix) and
  one item following the last edge label in the item order.

Search Tree Pruning
In applications the search tree tends to get very large, so pruning is needed.
Structural Pruning:
  Extensions based on canonical code words remove superfluous paths.
  Explains the unbalanced structure of the full prefix tree.
Support Based Pruning:
  No superset of an infrequent item set can be frequent.
  (apriori property)
  No counters for item sets having an infrequent subset are needed.
Size Based Pruning:
  Prune the tree if a certain depth (a certain size of the item sets) is reached.
  Idea: Sets with too many items can be difficult to interpret.

The Order of the Items
The structure of the (structurally pruned) prefix tree
obviously depends on the chosen order of the items.
In principle, the order is arbitrary (that is, any order can be used).
However, the number and the size of the nodes that are visited in the search
differ considerably depending on the order.
As a consequence, the execution times of frequent item set mining algorithms
can differ considerably depending on the item order.
Which order of the items is best (leads to the fastest search)
can depend on the frequent item set mining algorithm used.
Advanced methods even adapt the order of the items during the search
(that is, use different, but "compatible" orders in different branches).
Heuristics for choosing an item order are usually based
on (conditional) independence assumptions.

The Order of the Items
Heuristics for Choosing the Item Order
Basic Idea: independence assumption
  It is plausible that frequent item sets consist of frequent items.
  Sort the items w.r.t. their support (frequency of occurrence).
  Sort descendingly: Prefix tree has fewer, but larger nodes.
  Sort ascendingly: Prefix tree has more, but smaller nodes.
Extension of this Idea:
  Sort items w.r.t. the sum of the sizes of the transactions that cover them.
  Idea: the sum of transaction sizes also captures implicitly the frequency
  of pairs, triplets etc. (though, of course, only to some degree).
  Empirical evidence: better performance than simple frequency sorting.
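Both sorting heuristics amount to computing one number per item and sorting by it. A minimal sketch follows, with hypothetical function names; breaking ties by item name is an assumption made here so that the result is deterministic.

```python
T = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                      "acd", "bc", "acde", "bce", "ade")]

def order_by_support(transactions, descending=False):
    """Sort the items by their support (ties broken by item name)."""
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    return sorted(counts, key=lambda i: (counts[i], i), reverse=descending)

def order_by_transaction_size_sum(transactions, descending=False):
    """Sort the items by the summed sizes of the transactions that cover them."""
    sums = {}
    for t in transactions:
        for i in t:
            sums[i] = sums.get(i, 0) + len(t)
    return sorted(sums, key=lambda i: (sums[i], i), reverse=descending)

print(order_by_support(T))
print(order_by_transaction_size_sum(T))
```

On this small database both heuristics happen to produce the same order; on larger data they generally differ.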

Searching the Prefix Tree
[Figure: the prefix tree for five items, shown twice (once per search strategy).]
Apriori: Breadth-first/levelwise search (item sets of same size).
  Subset tests on transactions to find the support of item sets.
Eclat: Depth-first search (item sets with same prefix).
  Intersection of transaction lists to find the support of item sets.

Searching the Prefix Tree Levelwise
(Apriori Algorithm Revisited)

Apriori: Basic Ideas
The item sets are checked in the order of increasing size
(breadth-first/levelwise traversal of the prefix tree).
The canonical form of item sets and the induced prefix tree are used
to ensure that each candidate item set is generated at most once.
The already generated levels are used to execute a priori pruning
of the candidate item sets (using the apriori property).
(a priori: before accessing the transaction database to determine the support)
Transactions are represented as simple arrays of items
(so-called horizontal transaction representation, see also below).
The support of a candidate item set is computed
by checking whether it is a subset of a transaction
or by generating and finding subsets of a transaction.
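The two ways of computing the support can be contrasted in a short sketch. The function names are hypothetical, transactions are stored as tuples to mimic the horizontal representation as arrays of items, and the candidates are the four item sets of size 3 that survive a priori pruning in the example database.

```python
from itertools import combinations

T = [tuple(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                        "acd", "bc", "acde", "bce", "ade")]
CANDIDATES = [frozenset(c) for c in ("acd", "ace", "ade", "cde")]

def count_by_subset_test(candidates, transactions):
    """For each candidate, test against every transaction whether it is a subset."""
    return {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}

def count_by_subset_generation(candidates, transactions, k):
    """For each transaction, generate its k-item subsets and look them up."""
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:
        for sub in combinations(sorted(t), k):
            key = frozenset(sub)
            if key in counts:
                counts[key] += 1
    return counts

by_test = count_by_subset_test(CANDIDATES, T)
by_generation = count_by_subset_generation(CANDIDATES, T, 3)
print({"".join(sorted(c)): s for c, s in by_test.items()})
```

Which variant is cheaper depends on the number of candidates relative to the number of $k$-item subsets per transaction.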

Apriori: Levelwise Search
transaction database
   1: {a, d, e}
   2: {b, c, d}
   3: {a, c, e}
   4: {a, c, d, e}
   5: {a, e}
   6: {a, c, d}
   7: {b, c}
   8: {a, c, d, e}
   9: {b, c, e}
  10: {a, d, e}
Level 1 counters: a: 7, b: 3, c: 7, d: 6, e: 7
Example transaction database with 5 items and 10 transactions.
Minimum support: 30%, that is, at least 3 transactions must contain the item set.
All one item sets are frequent: full second level is needed.

Apriori: Levelwise Search
Level 1 counters: a: 7, b: 3, c: 7, d: 6, e: 7
Level 2 counters: ab: 0, ac: 4, ad: 5, ae: 6, bc: 3, bd: 1, be: 1, cd: 4, ce: 4, de: 4
Determining the support of item sets: For each item set traverse the database
and count the transactions that contain it (highly inefficient).
Better: Traverse the tree for each transaction and find the item sets it contains
(efficient: can be implemented as a simple doubly recursive procedure).
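Counting by traversing the tree once per transaction can be sketched as follows. This is an illustrative variant that uses a loop plus one recursion instead of two recursions; the tree layout (a dictionary mapping each item to a `[counter, child node]` pair) and the names `build_tree` and `count_transaction` are assumptions made here, not the data structure of any particular implementation.

```python
ORDER = "abcde"
T = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]

def build_tree(order, depth):
    """Full prefix tree of the given depth: node = {item: [counter, child node]}."""
    def make(start, d):
        return {order[p]: [0, make(p + 1, d - 1) if d > 1 else {}]
                for p in range(start, len(order))}
    return make(0, depth)

def count_transaction(node, items, depth):
    """Increment the level-`depth` counters of all item sets contained in `items`.

    `items` must be sorted by the item order that was used to build the tree.
    """
    for pos, item in enumerate(items):
        entry = node.get(item)
        if entry is None:
            continue
        if depth == 1:
            entry[0] += 1                  # item set found: count it
        else:                              # descend with the remaining items
            count_transaction(entry[1], items[pos + 1:], depth - 1)

tree = build_tree(ORDER, 2)
for t in T:
    count_transaction(tree, sorted(t, key=ORDER.index), 2)

level2 = {a + b: entry[0] for a, (_, kids) in tree.items() for b, entry in kids.items()}
print(level2)
```

Each transaction is read only once per level, and only those branches of the tree are visited whose prefix is actually contained in the transaction.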

Apriori: Levelwise Search
Level 1 counters: a: 7, b: 3, c: 7, d: 6, e: 7
Level 2 counters: ab: 0, ac: 4, ad: 5, ae: 6, bc: 3, bd: 1, be: 1, cd: 4, ce: 4, de: 4
Minimum support: 30%, that is, at least 3 transactions must contain the item set.
Infrequent item sets: {a, b}, {b, d}, {b, e}.
The subtrees starting at these item sets can be pruned.

Apriori: Levelwise Search
Level 1 counters: a: 7, b: 3, c: 7, d: 6, e: 7
Level 2 counters: ab: 0, ac: 4, ad: 5, ae: 6, bc: 3, bd: 1, be: 1, cd: 4, ce: 4, de: 4
Level 3 counters: acd: ?, ace: ?, ade: ?, bcd: ?, bce: ?, cde: ?
Generate candidate item sets with 3 items (parents must be frequent).
Before counting, check whether the candidates contain an infrequent item set.
  An item set with $k$ items has $k$ subsets of size $k - 1$.
  The parent item set is only one of these subsets.
Christian Borgelt Frequent Pattern Mining 53
Apriori: Levelwise Search
The item sets {b, c, d} and {b, c, e} can be pruned, because
{b, c, d} contains the infrequent item set {b, d} and
{b, c, e} contains the infrequent item set {b, e}.
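The a priori check on the candidates of size 3 can be sketched as follows. The helper name `has_infrequent_subset` and the data layout (frozensets) are assumptions made for this illustration, not part of the slides.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of the k-item candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(sorted(candidate), k - 1))

# Frequent 2-item sets of the example (support >= 3).
frequent2 = {frozenset(s) for s in ("ac", "ad", "ae", "bc", "cd", "ce", "de")}

# Candidates of size 3 generated from frequent parents.
candidates3 = [frozenset(s) for s in ("acd", "ace", "ade", "bcd", "bce", "cde")]

# Keep only candidates whose subsets of size 2 are all frequent.
kept = [c for c in candidates3 if not has_infrequent_subset(c, frequent2)]
print(["".join(sorted(c)) for c in kept])
```

The two candidates below the node {b, c} are discarded without accessing the transaction database, which is the point of a priori pruning.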
Apriori: Levelwise Search
Level 3 (after counting):
ac → d : 3, e : 3
ad → e : 4
bc → d : ?, e : ? (pruned, not counted)
cd → e : 2
Only the remaining four item sets of size 3 are evaluated.
No other item sets of size 3 can be frequent.
Apriori: Levelwise Search
Minimum support: 30%, that is, at least 3 transactions must contain the item set.
Infrequent item set: {c, d, e}.
Apriori: Levelwise Search
Level 4 (candidate, not yet counted):
acd → e : ?
Generate candidate item sets with 4 items (parents must be frequent).
Before counting, check whether the candidates contain an infrequent item set.
Apriori: Levelwise Search
The item set {a, c, d, e} can be pruned,
because it contains the infrequent item set {c, d, e}.
Consequence: No candidate item sets with four items.
Fourth access to the transaction database is not necessary.
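The whole levelwise search on the example can be sketched in Python as below. This is a simplified illustration under stated assumptions: the function name `apriori` is hypothetical, candidates are kept in plain sets instead of an item set tree, and support is counted by the simple (inefficient) subset test per candidate rather than by the recursive tree traversal.

```python
from itertools import combinations

DB = [frozenset(t) for t in
      ("ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade")]
S_MIN = 3

def apriori(db, s_min):
    """Levelwise search; returns {item set: support} for all frequent sets."""
    items = sorted(set().union(*db))
    level = {frozenset([i]) for i in items}   # candidates of size 1
    frequent = {}
    k = 1
    while level:
        # One pass over the database per level (simple subset tests).
        counts = {c: sum(1 for t in db if c <= t) for c in level}
        current = {c: s for c, s in counts.items() if s >= s_min}
        frequent.update(current)
        # Merge frequent sets that differ in one item, then prune a priori.
        k += 1
        level = set()
        for x, y in combinations(current, 2):
            cand = x | y
            if len(cand) == k and all(frozenset(sub) in current
                                      for sub in combinations(cand, k - 1)):
                level.add(cand)
    return frequent

result = apriori(DB, S_MIN)
print(len(result))
```

In this run no candidate of size 4 survives the pruning step, so the loop ends after the third pass over the database, as on the slide.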
Apriori: Node Organization 1
Idea: Optimize the organization of the counters and the child pointers.
Direct Indexing:
Each node is a simple vector (array) of counters.
An item is used as a direct index to find the counter.
Advantage: Counter access is extremely fast.
Disadvantage: Memory usage can be high due to "gaps" in the index space.
Sorted Vectors:
Each node is a vector (array) of item/counter pairs.
A binary search is necessary to find the counter for an item.
Advantage: Memory usage may be smaller, no unnecessary counters.
Disadvantage: Counter access is slower due to the binary search.
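The difference between the two node layouts can be illustrated with a small Python sketch. The item codes and the names `direct`, `codes`, `counters` and `increment` are invented for this example; a real implementation would use compact arrays in a lower-level language.

```python
from bisect import bisect_left

# Direct indexing: one counter per item code, including unused "gaps".
direct = [0] * 8            # item codes 0..7; only 1, 4 and 6 occur below
for code in (1, 4, 6, 4):
    direct[code] += 1       # the item code is the index

# Sorted vector: parallel arrays of item codes and counters, no gaps.
codes = [1, 4, 6]
counters = [0, 0, 0]

def increment(code):
    """Find the counter by binary search and increment it if present."""
    pos = bisect_left(codes, code)
    if pos < len(codes) and codes[pos] == code:
        counters[pos] += 1

for code in (1, 4, 6, 4):
    increment(code)

print(direct, counters)
```

The direct vector needs eight slots for three used counters, while the sorted vector needs only three pairs but pays for each access with a binary search.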
Apriori: Node Organization 2
Hash Tables:
Each node is a vector (array) of item/counter pairs (closed hashing).
The index of a counter is computed from the item code.
Advantage: Faster counter access than with binary search.
Disadvantage: Higher memory usage than sorted vectors (pairs, fill rate).
The order of the items cannot be exploited.
Child Pointers:
The deepest level of the item set tree does not need child pointers.
Fewer child pointers than counters are needed.
It pays to represent the child pointers in a separate array.
The sorted array of item/counter pairs can be reused for a binary search.
Apriori: Item Coding
Items are coded as consecutive integers starting with 0
(needed for the direct indexing approach).
The size and the number of the "gaps" in the index space
depend on how the items are coded.
Idea: It is plausible that frequent item sets consist of frequent items.
Sort the items w.r.t. their frequency (group frequent items).
Sort descendingly: prefix tree has fewer nodes.
Sort ascendingly: there are fewer and smaller index "gaps".
Empirical evidence: sorting ascendingly is better.
Extension: Sort items w.r.t. the sum of the sizes
of the transactions that cover them.
Empirical evidence: better than simple item frequencies.
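Both sorting criteria can be computed for the running example with a short sketch. The tie-breaking by item name and all variable names are assumptions made for this illustration.

```python
from collections import Counter

DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]

# Criterion 1: item frequency (number of covering transactions).
freq = Counter(i for t in DB for i in set(t))

# Criterion 2: sum of the sizes of the transactions that cover the item.
size_sum = Counter()
for t in DB:
    for i in set(t):
        size_sum[i] += len(t)

# Ascending order, ties broken by item name (an arbitrary choice).
by_freq = sorted(freq, key=lambda i: (freq[i], i))
by_size_sum = sorted(size_sum, key=lambda i: (size_sum[i], i))

# Items are coded as consecutive integers starting with 0.
code = {item: n for n, item in enumerate(by_freq)}
print(by_freq, by_size_sum, code)
```

In this small database both criteria happen to produce the same order; on larger data they can differ.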
Apriori: Recursive Counting
The items in a transaction are sorted (ascending item codes).
Processing a transaction is a doubly recursive procedure.
To process a transaction for a node of the item set tree:
Go to the child corresponding to the first item in the transaction and
count the rest of the transaction recursively for that child.
(In the currently deepest level of the tree we increment the counter
corresponding to the item instead of going to the child node.)
Discard the first item of the transaction and
process it recursively for the node itself.
Optimizations:
Directly skip all items preceding the first item in the node.
Abort the recursion if the first item is beyond the last one in the node.
Abort the recursion if a transaction is too short to reach the deepest level.
Apriori: Recursive Counting
[Figure sequence: item set tree with counters for the candidate item sets of size 3; the transaction {a, c, d, e} is counted step by step.]
Transaction to count: {a, c, d, e}; current item set size: 3.
Processing a, then c: the suffix d e is processed in the node for {a, c}; the counters for {a, c, d} and {a, c, e} are incremented.
Processing a, then d: the suffix e is processed in the node for {a, d}; the counter for {a, d, e} is incremented.
Processing a, then e: skipped (too few items).
Processing c, then d: the suffix e is processed in the node for {c, d}; the counter for {c, d, e} is incremented.
Processing c, then e: skipped (too few items).
Processing d: skipped (too few items).
Processing an item set in a node is easily implemented as a simple loop.
For each item the remaining suffix is processed in the corresponding child.
If the (currently) deepest tree level is reached,
counters are incremented for each item in the transaction.
If the remaining transaction (suffix) is too short to reach
the (currently) deepest level, processing is terminated.
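The counting procedure can be sketched in Python with the loop formulation described above. The tree layout (a dict mapping an item to a `[counter, child]` pair), the helper `leaf` and the function name `count` are assumptions made for this illustration; the tree below holds the four size-3 candidates of the running example.

```python
DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]

def leaf(*items):
    """Deepest-level node: counters only, no child pointers."""
    return {i: [0, None] for i in items}

# Item set tree after level 2, extended by the size-3 candidates.
tree = {
    "a": [7, {"c": [4, leaf("d", "e")], "d": [5, leaf("e")]}],
    "c": [7, {"d": [4, leaf("e")]}],
}

def count(node, trans, depth_left):
    """Count the sorted item tuple `trans` in the subtree below `node`.

    depth_left is the number of levels from this node down to the
    currently deepest level of the tree.
    """
    if len(trans) < depth_left:           # too short to reach deepest level
        return
    for pos, item in enumerate(trans):    # loop replaces the second recursion
        entry = node.get(item)
        if entry is None:
            continue
        if depth_left == 1:
            entry[0] += 1                 # deepest level: increment counter
        elif entry[1] is not None:
            count(entry[1], trans[pos + 1:], depth_left - 1)

for t in DB:
    count(tree, tuple(sorted(t)), 3)

print(tree["a"][1]["c"][1], tree["a"][1]["d"][1], tree["c"][1]["d"][1])
```

For the transaction {a, c, d, e} the calls follow the trace shown in the figure sequence: the branches a-c and a-d and c-d reach the deepest level, while a-e, c-e and d are cut off because too few items remain.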
Apriori: Transaction Representation
Direct Representation:
Each transaction is represented as an array of items.
The transactions are stored in a simple list or array.
Organization as a Prefix Tree:
The items in each transaction are sorted (arbitrary, but fixed order).
Transactions with the same prefix are grouped together.
Advantage: a common prefix is processed only once.
Gains from this organization depend on how the items are coded:
Common transaction prefixes are more likely
if the items are sorted with descending frequency.
However: an ascending order is better for the search
and this dominates the execution time.
Apriori: Transactions as a Prefix Tree
transaction database:
{a, d, e}, {b, c, d}, {a, c, e}, {a, c, d, e}, {a, e}, {a, c, d}, {b, c}, {a, c, d, e}, {b, c, e}, {a, d, e}
lexicographically sorted:
{a, c, d}, {a, c, d, e}, {a, c, d, e}, {a, c, e}, {a, d, e}, {a, d, e}, {a, e}, {b, c}, {b, c, d}, {b, c, e}
prefix tree representation (item : number of transactions with this prefix):
a : 7
  c : 4
    d : 3
      e : 2
    e : 1
  d : 2
    e : 2
  e : 1
b : 3
  c : 3
    d : 1
    e : 1
Items in transactions are sorted w.r.t. some arbitrary order,
transactions are sorted lexicographically, then a prefix tree is constructed.
Advantage: identical transaction prefixes are processed only once.
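A prefix tree of the transactions can be built with nested dictionaries, as in the following sketch. The function name `build_prefix_tree` and the node layout (`item -> [count, children]`) are assumptions for illustration; with dictionaries the transactions need not be sorted lexicographically beforehand, only the items within each transaction.

```python
DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]

def build_prefix_tree(db):
    """Return nested dicts: item -> [count, children]."""
    root = {}
    for t in db:
        node = root
        for item in sorted(t):            # fixed item order inside a transaction
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1                 # one more transaction with this prefix
            node = entry[1]
    return root

tree = build_prefix_tree(DB)
print(tree["a"][0], tree["a"][1]["c"][0], tree["b"][1]["c"][0])
```

Each counter states how many transactions share the prefix that leads to the node, so a shared prefix has to be processed only once during counting.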
Summary Apriori
Basic Processing Scheme
Breadth-first/levelwise traversal of the partially ordered set $(2^B, \subseteq)$.
Candidates are formed by merging item sets that differ in only one item.
Support counting can be done with a doubly recursive procedure.
Advantages
"Perfect" pruning of infrequent candidate item sets (with infrequent subsets).
Disadvantages
Can require a lot of memory (since all frequent item sets are represented).
Support counting takes very long for large transactions.
Software
http://www.borgelt.net/apriori.html
Searching the Prefix Tree Depth-First
(Eclat, FP-growth and other algorithms)
Depth-First Search and Conditional Databases
A depth-first search can also be seen as a divide-and-conquer scheme:
First find all frequent item sets that contain a chosen item,
then all frequent item sets that do not contain it.
General search procedure:
Let the item order be $a < b < c < \cdots$.
Restrict the transaction database to those transactions that contain a.
This is the conditional database for the prefix a.
Recursively search this conditional database for frequent item sets
and add the prefix a to all frequent item sets found in the recursion.
Remove the item a from the transactions in the full transaction database.
This is the conditional database for item sets without a.
Recursively search this conditional database for frequent item sets.
With this scheme only frequent one-element item sets have to be determined.
Larger item sets result from adding possible prefixes.
Depth-First Search and Conditional Databases
[Figure: subset lattice over the items a, b, c, d, e with the prefix tree overlaid]
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Split into subproblems w.r.t. item a:
blue: item set containing only item a.
green: item sets containing item a (and at least one other item).
red: item sets not containing item a (but at least one other item).
green: needs cond. database with transactions containing item a.
red: needs cond. database with all transactions, but with item a removed.
Depth-First Search and Conditional Databases
[Figure: subset lattice over the items a, b, c, d, e with the prefix tree overlaid, subtree of item a]
Split into subproblems w.r.t. item b:
blue: item sets {a} and {a, b}.
green: item sets containing items a and b (and at least one other item).
red: item sets containing item a (and at least one other item), but not item b.
green: needs database with trans. containing both items a and b.
red: needs database with trans. containing item a, but with item b removed.
Depth-First Search and Conditional Databases
[Figure: subset lattice over the items a, b, c, d, e with the prefix tree overlaid, item sets without item a]
Split into subproblems w.r.t. item b:
blue: item set containing only item b.
green: item sets containing item b (and at least one other item), but not item a.
red: item sets containing neither item a nor b (but at least one other item).
green: needs database with trans. containing item b, but not item a.
red: needs database with all trans., but with both items a and b removed.
Formal Description of the Divide-and-Conquer Scheme
Generally, a divide-and-conquer scheme can be described as a set of (sub)problems.
The initial (sub)problem is the actual problem to solve.
A subproblem is processed by splitting it into smaller subproblems,
which are then processed recursively.
All subproblems that occur in frequent item set mining can be defined by
a conditional transaction database and
a prefix (of items).
The prefix is a set of items that has to be added to all frequent item sets
that are discovered in the conditional transaction database.
Formally, all subproblems are tuples $S = (D, P)$,
where $D$ is a conditional transaction database and $P \subseteq B$ is a prefix.
The initial problem, with which the recursion is started, is $S = (T, \emptyset)$,
where $T$ is the transaction database to mine and the prefix is empty.
Formal Description of the Divide-and-Conquer Scheme
A subproblem $S_0 = (T_0, P_0)$ is processed as follows:
Choose an item $i \in B_0$, where $B_0$ is the set of items occurring in $T_0$.
If $s_{T_0}(i) \ge s_{\min}$ (where $s_{T_0}(i)$ is the support of the item $i$ in $T_0$):
Report the item set $P_0 \cup \{i\}$ as frequent with the support $s_{T_0}(i)$.
Form the subproblem $S_1 = (T_1, P_1)$ with $P_1 = P_0 \cup \{i\}$.
$T_1$ comprises all transactions in $T_0$ that contain the item $i$,
but with the item $i$ removed (and empty transactions removed).
If $T_1$ is not empty, process $S_1$ recursively.
In any case (that is, regardless of whether $s_{T_0}(i) \ge s_{\min}$ or not):
Form the subproblem $S_2 = (T_2, P_2)$, where $P_2 = P_0$.
$T_2$ comprises all transactions in $T_0$ (whether they contain $i$ or not),
but again with the item $i$ removed (and empty transactions removed).
If $T_2$ is not empty, process $S_2$ recursively.
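The processing of a subproblem translates almost line by line into a recursive function. The following is a minimal sketch, not an efficient implementation: the name `mine` is hypothetical, conditional databases are plain lists of frozensets, and the split item is simply the smallest item in the conditional database (any item occurring in it would do).

```python
from collections import Counter

DB = [frozenset(t) for t in
      ("ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade")]

def mine(db, prefix, s_min, out):
    """Process the subproblem (db, prefix) and record frequent item sets."""
    if not db:
        return
    support = Counter(i for t in db for i in t)
    i = min(support)                          # split item (assumed choice)
    if support[i] >= s_min:
        out[prefix | {i}] = support[i]        # report P0 ∪ {i}
        # First subproblem: transactions containing i, with i removed.
        t1 = [t - {i} for t in db if i in t]
        mine([t for t in t1 if t], prefix | {i}, s_min, out)
    # Second subproblem: all transactions, with i removed.
    t2 = [t - {i} for t in db]
    mine([t for t in t2 if t], prefix, s_min, out)

found = {}
mine(DB, frozenset(), 3, found)
print(len(found))
```

On the example database with minimum support 3 this reports the same frequent item sets as the levelwise search.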
Divide-and-Conquer Recursion
Subproblem Tree
$(T, \emptyset)$
  include $a$: $(T_a, \{a\})$
    include $b$: $(T_{ab}, \{a, b\})$
      include $c$: $(T_{abc}, \{a, b, c\})$
      exclude $c$: $(T_{ab\bar{c}}, \{a, b\})$
    exclude $b$: $(T_{a\bar{b}}, \{a\})$
      include $c$: $(T_{a\bar{b}c}, \{a, c\})$
      exclude $c$: $(T_{a\bar{b}\bar{c}}, \{a\})$
  exclude $a$: $(T_{\bar{a}}, \emptyset)$
    include $b$: $(T_{\bar{a}b}, \{b\})$
      include $c$: $(T_{\bar{a}bc}, \{b, c\})$
      exclude $c$: $(T_{\bar{a}b\bar{c}}, \{b\})$
    exclude $b$: $(T_{\bar{a}\bar{b}}, \emptyset)$
      include $c$: $(T_{\bar{a}\bar{b}c}, \{c\})$
      exclude $c$: $(T_{\bar{a}\bar{b}\bar{c}}, \emptyset)$
Branch to the left: include an item (first subproblem).
Branch to the right: exclude an item (second subproblem).
(Items in the indices of the conditional transaction databases $T$ have been removed from them.)
Perfect Extensions
The search can easily be improved with so-called perfect extension pruning.
Let $T$ be a transaction database over an item base $B$.
Given an item set $I$, an item $a \notin I$ is called a perfect extension of $I$ w.r.t. $T$,
iff the item sets $I$ and $I \cup \{a\}$ have the same support: $s_T(I) = s_T(I \cup \{a\})$
(that is, if all transactions containing the item set $I$ also contain the item $a$).
Perfect extensions have the following properties:
If the item $a$ is a perfect extension of an item set $I$,
then $a$ is also a perfect extension of any item set $J \supseteq I$ (as long as $a \notin J$).
This can most easily be seen by considering that $K_T(I) \subseteq K_T(\{a\})$
and hence $K_T(J) \subseteq K_T(\{a\})$, since $K_T(J) \subseteq K_T(I)$.
If $X_T(I)$ is the set of all perfect extensions of an item set $I$ w.r.t. $T$
(that is, if $X_T(I) = \{ i \in B - I \mid s_T(I \cup \{i\}) = s_T(I) \}$),
then all sets $I \cup J$ with $J \in 2^{X_T(I)}$ have the same support as $I$
(where $2^M$ denotes the power set of a set $M$).
Perfect Extensions: Examples
transaction database
1: {a, d, e}
2: {b, c, d}
3: {a, c, e}
4: {a, c, d, e}
5: {a, e}
6: {a, c, d}
7: {b, c}
8: {a, c, d, e}
9: {b, c, e}
10: {a, d, e}
frequent item sets (with support)
0 items: ∅: 10
1 item: {a}: 7, {b}: 3, {c}: 7, {d}: 6, {e}: 7
2 items: {a, c}: 4, {a, d}: 5, {a, e}: 6, {b, c}: 3, {c, d}: 4, {c, e}: 4, {d, e}: 4
3 items: {a, c, d}: 3, {a, c, e}: 3, {a, d, e}: 4
c is a perfect extension of {b} as {b} and {b, c} both have support 3.
a is a perfect extension of {d, e} as {d, e} and {a, d, e} both have support 4.
There are no other perfect extensions in this example
for a minimum support of $s_{\min} = 3$.
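The two perfect extensions named above can be checked directly against the definition. The helper names `support` and `perfect_extensions` are assumptions for this sketch, which recomputes supports by scanning the database and is meant only as a check, not as an efficient method.

```python
DB = [frozenset(t) for t in
      ("ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade")]
ITEMS = frozenset("abcde")

def support(itemset, db):
    """Number of transactions that contain the item set."""
    return sum(1 for t in db if itemset <= t)

def perfect_extensions(itemset, db, items):
    """Items outside `itemset` whose addition leaves the support unchanged."""
    s = support(itemset, db)
    return {i for i in items - itemset if support(itemset | {i}, db) == s}

print(perfect_extensions(frozenset("b"), DB, ITEMS))
print(perfect_extensions(frozenset("de"), DB, ITEMS))
print(perfect_extensions(frozenset("a"), DB, ITEMS))
```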
Perfect Extension Pruning
Consider again the original divide-and-conquer scheme:
A subproblem $S_0 = (T_0, P_0)$ is split into
a subproblem $S_1 = (T_1, P_1)$ to find all frequent item sets
that contain an item $i \in B_0$,
a subproblem $S_2 = (T_2, P_2)$ to find all frequent item sets
that do not contain the item $i$.
Suppose the item $i$ is a perfect extension of the prefix $P_0$.
Let $F_1$ and $F_2$ be the sets of frequent item sets
that are reported when processing $S_1$ and $S_2$, respectively.
It is $I \cup \{i\} \in F_1 \Leftrightarrow I \in F_2$.
The reason is that generally $P_1 = P_2 \cup \{i\}$ and in this case $T_1 = T_2$,
because all transactions in $T_0$ contain item $i$ ($i$ is a perfect extension).
Therefore it suffices to solve one subproblem (namely $S_2$)
and to construct the solution of the other ($S_1$) by adding item $i$.
Perfect Extension Pruning
Perfect extensions can be exploited by collecting these items in the recursion,
in a third element of a subproblem description.
Formally, a subproblem is a triplet $S = (T, P, X)$, where
$T$ is a conditional transaction database,
$P$ is the set of prefix items for $T$,
$X$ is the set of perfect extension items.
Once identified, perfect extension items are no longer processed in the recursion,
but are only used to generate all supersets of the prefix having the same support.
Consequently, they are removed from the conditional databases.
This technique is also known as hypercube decomposition.
The divide-and-conquer scheme has basically the same structure
as without perfect extension pruning.
However, the exact way in which perfect extensions are collected
can depend on the specific algorithm used.
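How the collected perfect extension items are used at reporting time can be sketched as follows. The function `expand` and its arguments are hypothetical; it only illustrates that every subset of the perfect extension items can be added to the prefix without changing the support.

```python
from itertools import combinations

def expand(prefix, perfect, supp):
    """Yield (item set, support) for prefix ∪ J for every subset J of `perfect`."""
    for r in range(len(perfect) + 1):
        for extra in combinations(sorted(perfect), r):
            yield frozenset(prefix) | frozenset(extra), supp

# In the example, a is a perfect extension of {d, e}, which has support 4.
sets = dict(expand("de", "a", 4))
print(sorted("".join(sorted(s)) for s in sets))
```

With n perfect extension items this produces 2^n item sets from a single support value.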
Reporting Frequent Item Sets
With the described divide-and-conquer scheme,
item sets are reported in lexicographic order.
This can be exploited for efficient item set reporting:
The prefix P is a string, which is extended when an item is added to P.
Thus only one item needs to be formatted per reported frequent item set,
the rest is already formatted in the string.
Backtracking the search (return from recursion)
removes an item from the prefix string.
This scheme can speed up the output considerably.
Example:
a (7)
a c (4)
a c d (3)
a c e (3)
a d (5)
a d e (4)
a e (6)
b (3)
b c (3)
c (7)
c d (4)
c e (4)
d (6)
d e (4)
e (7)
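The reporting scheme can be imitated in Python as below. This is only a sketch: the nested-dict layout of the frequent item sets and the function name `report` are assumptions, and since Python strings are immutable, "removing an item on backtracking" simply means that the caller's prefix string is left unchanged.

```python
# Frequent item sets of the example as a prefix tree: item -> (support, subtree).
tree = {
    "a": (7, {"c": (4, {"d": (3, {}), "e": (3, {})}),
              "d": (5, {"e": (4, {})}),
              "e": (6, {})}),
    "b": (3, {"c": (3, {})}),
    "c": (7, {"d": (4, {}), "e": (4, {})}),
    "d": (6, {"e": (4, {})}),
    "e": (7, {}),
}

def report(node, prefix, lines):
    """Emit item sets in lexicographic order, reusing the formatted prefix."""
    for item in sorted(node):
        supp, sub = node[item]
        extended = prefix + item + " "     # only the new item is formatted
        lines.append(f"{extended}({supp})")
        report(sub, extended, lines)       # on return, `prefix` is unchanged

lines = []
report(tree, "", lines)
print("\n".join(lines))
```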
Global and Local Item Order
Up to now we assumed that the item order is (globally) fixed,
and determined at the very beginning based on heuristics.
However, the described divide-and-conquer scheme shows
that a globally fixed item order is more restrictive than necessary:
The item used to split the current subproblem can be any item
that occurs in the conditional transaction database of the subproblem.
There is no need to choose the same item for splitting sibling subproblems
(as a global item order would require us to do).
The same heuristics used for determining a global item order suggest
that the split item for a given subproblem should be selected from
the (conditionally) most frequent item(s).
As a consequence, the item orders may differ for every branch of the search tree.
However, two subproblems must share the item order that is fixed
by the common part of their paths from the root (initial subproblem).
Item Order: Divide-and-Conquer Recursion
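Subproblem Tree
$(T, \emptyset)$
  include $a$: $(T_a, \{a\})$
    include $b$: $(T_{ab}, \{a, b\})$
      include $d$: $(T_{abd}, \{a, b, d\})$
      exclude $d$: $(T_{ab\bar{d}}, \{a, b\})$
    exclude $b$: $(T_{a\bar{b}}, \{a\})$
      include $e$: $(T_{a\bar{b}e}, \{a, e\})$
      exclude $e$: $(T_{a\bar{b}\bar{e}}, \{a\})$
  exclude $a$: $(T_{\bar{a}}, \emptyset)$
    include $c$: $(T_{\bar{a}c}, \{c\})$
      include $f$: $(T_{\bar{a}cf}, \{c, f\})$
      exclude $f$: $(T_{\bar{a}c\bar{f}}, \{c\})$
    exclude $c$: $(T_{\bar{a}\bar{c}}, \emptyset)$
      include $g$: $(T_{\bar{a}\bar{c}g}, \{g\})$
      exclude $g$: $(T_{\bar{a}\bar{c}\bar{g}}, \emptyset)$
All local item orders start with $a < \ldots$
All subproblems on the left share $a < b < \ldots$,
All subproblems on the right share $a < c < \ldots$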
Global and Local Item Order
Local item orders have advantages and disadvantages:
Advantage
In some data sets the order of the conditional item frequencies
differs considerably from the global order.
Such data sets can sometimes be processed significantly faster
with local item orders (depending on the algorithm).
Disadvantage
The data structure of the conditional databases must allow us
to determine conditional item frequencies quickly.
Not having a globally fixed item order can make it more difficult
to determine conditional transaction databases w.r.t. split items
(depending on the employed data structure).
The gains from the better item order may be lost again
due to the more complex processing / conditioning scheme.
Transaction Database Representation
Transaction Database Representation
Eclat, FP-growth and several other frequent item set mining algorithms
rely on the described basic divide-and-conquer scheme.
They differ mainly in how they represent the conditional transaction databases.
The main approaches are horizontal and vertical representations:
In a horizontal representation, the database is stored as a list (or array)
of transactions, each of which is a list (or array) of the items contained in it.
In a vertical representation, a database is represented by first referring
with a list (or array) to the different items. For each item a list (or array) of
identifiers is stored, which indicate the transactions that contain the item.
However, this distinction is not pure, since there are many algorithms
that use a combination of the two forms of representing a database.
Frequent item set mining algorithms also differ in
how they construct new conditional databases from a given one.
Transaction Database Representation
The Apriori algorithm uses a horizontal transaction representation:
each transaction is an array of the contained items.
Note that the alternative prefix tree organization
is still an essentially horizontal representation.
The alternative is a vertical transaction representation:
For each item a transaction list is created.
The transaction list of item a indicates the transactions that contain it,
that is, it represents its cover $K_T(\{a\})$.
Advantage: the transaction list for a pair of items can be computed by
intersecting the transaction lists of the individual items.
Generally, a vertical transaction representation can exploit
$\forall I, J \subseteq B: K_T(I \cup J) = K_T(I) \cap K_T(J)$.
A combined representation is the frequent pattern tree (to be discussed later).
Transaction Database Representation
Horizontal Representation: List items for each transaction
Vertical Representation: List transactions for each item
horizontal representation
1: {a, d, e}
2: {b, c, d}
3: {a, c, e}
4: {a, c, d, e}
5: {a, e}
6: {a, c, d}
7: {b, c}
8: {a, c, d, e}
9: {b, c, e}
10: {a, d, e}
vertical representation
a: 1, 3, 4, 5, 6, 8, 10
b: 2, 7, 9
c: 2, 3, 4, 6, 7, 8, 9
d: 1, 2, 4, 6, 8, 10
e: 1, 3, 4, 5, 8, 9, 10
matrix representation
     a  b  c  d  e
 1   1  0  0  1  1
 2   0  1  1  1  0
 3   1  0  1  0  1
 4   1  0  1  1  1
 5   1  0  0  0  1
 6   1  0  1  1  0
 7   0  1  1  0  0
 8   1  0  1  1  1
 9   0  1  1  0  1
10   1  0  0  1  1
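The conversion from the horizontal to the vertical and the matrix representation is a single pass over the database. The following sketch uses illustrative variable names (`vertical`, `matrix`) that are not part of the slides.

```python
# Horizontal representation: one entry per transaction (ids 1..10).
DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]

# Vertical representation: for each item the list of transaction ids.
vertical = {}
for tid, t in enumerate(DB, start=1):
    for item in sorted(t):
        vertical.setdefault(item, []).append(tid)

# Matrix representation: one row per transaction, one column per item.
items = sorted(vertical)
matrix = [[int(i in t) for i in items] for t in DB]

for item in items:
    print(item, vertical[item])
```

Because the transactions are visited in increasing order of their identifiers, every transaction list comes out sorted, which the list intersection used by Eclat relies on.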
Transaction Database Representation
[Figure: the example transaction database, its lexicographically sorted form, and the resulting prefix tree with counters, as on the slide "Apriori: Transactions as a Prefix Tree"]
Note that a prefix tree representation is a compressed horizontal representation.
Principle: equal prefixes of transactions are merged.
This is most effective if the items are sorted descendingly w.r.t. their support.
The Eclat Algorithm
[Zaki, Parthasarathy, Ogihara, and Li 1997]
Eclat: Basic Ideas
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
The search scheme is the same as the general scheme for searching
with canonical forms having the prefix property and possessing
a perfect extension rule (generate only canonical extensions).
Eclat generates more candidate item sets than Apriori,
because it (usually) does not store the support of all visited item sets.
As a consequence it cannot fully exploit the Apriori property for pruning.
Eclat uses a purely vertical transaction representation.
No subset tests and no subset generation are needed to compute the support.
The support of item sets is rather determined by intersecting transaction lists.
Note that Eclat cannot fully exploit the Apriori property, because it does not store the support of all
explored item sets, not because it cannot know it. If all computed support values were stored, it could
be implemented in such a way that all support values needed for full a priori pruning were available.
Eclat: Subproblem Split
[Figure: vertical representation of the example database and its split into two conditional databases, shown once with transaction lists and once with supports only]
Transaction lists (support in parentheses):
a (7): 1, 3, 4, 5, 6, 8, 10
b (3): 2, 7, 9
c (7): 2, 3, 4, 6, 7, 8, 9
d (6): 1, 2, 4, 6, 8, 10
e (7): 1, 3, 4, 5, 8, 9, 10
Conditional database for prefix a (1st subproblem):
b (0): (empty)
c (4): 3, 4, 6, 8
d (5): 1, 4, 6, 8, 10
e (6): 1, 3, 4, 5, 8, 10
Conditional database with item a removed (2nd subproblem):
b (3): 2, 7, 9
c (7): 2, 3, 4, 6, 7, 8, 9
d (6): 1, 2, 4, 6, 8, 10
e (7): 1, 3, 4, 5, 8, 9, 10
Eclat: Depth-First Search
1: {a, d, e}
2: {b, c, d}
3: {a, c, e}
4: {a, c, d, e}
5: {a, e}
6: {a, c, d}
7: {b, c}
8: {a, c, d, e}
9: {b, c, e}
10: {a, d, e}
a : 7  b : 3  c : 7  d : 6  e : 7
Form a transaction list for each item. Here: bit vector representation.
grey: item is contained in transaction
white: item is not contained in transaction
Transaction database is needed only once (for the single item transaction lists).
Eclat: Depth-First Search
Search tree, prefix a: b : 0  c : 4  d : 5  e : 6
Intersect the transaction list for item a
with the transaction lists of all other items (conditional database for item a).
Count the number of bits that are set (number of containing transactions).
This yields the support of all item sets with the prefix a.
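The bit vector variant of this step can be sketched with Python integers used as bit vectors. The names `bits`, `cond_a` and `supp` are chosen for this illustration; bit number `tid` of an item's vector is set if the transaction with that (zero-based) index contains the item.

```python
DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]

# One bit vector per item: bit `tid` is set if transaction `tid` contains it.
bits = {}
for tid, t in enumerate(DB):
    for item in t:
        bits[item] = bits.get(item, 0) | (1 << tid)

# Conditional database for prefix a: intersect with all other items.
cond_a = {i: bits["a"] & bits[i] for i in "bcde"}

# Support = number of set bits in the intersection.
supp = {i: bin(v).count("1") for i, v in cond_a.items()}
print(supp)
```

The intersection is a single bitwise AND per item, and the support is obtained by counting the set bits of the result.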
Eclat: Depth-First Search
The item set {a, b} is infrequent and can be pruned.
All other item sets with the prefix a are frequent
and are therefore kept and processed recursively.
Eclat: Depth-First Search
Search tree, prefix ac: d : 3  e : 3
Intersect the transaction list for the item set {a, c}
with the transaction lists of the item sets {a, x}, x ∈ {d, e}.
Result: Transaction lists for the item sets {a, c, d} and {a, c, e}.
Count the number of bits that are set (number of containing transactions).
This yields the support of all item sets with the prefix ac.
Eclat: Depth-First Search
Search tree, prefix acd: e : 2
Intersect the transaction lists for the item sets {a, c, d} and {a, c, e}.
Result: Transaction list for the item set {a, c, d, e}.
With Apriori this item set could be pruned before counting,
because it was known that {c, d, e} is infrequent.
Eclat: Depth-First Search
The item set {a, c, d, e} is not frequent (support 2/20%) and therefore pruned.
Since there is no transaction list left (and thus no intersection possible),
the recursion is terminated and the search backtracks.
Eclat: Depth-First Search
Search tree, prefix ad: e : 4
The search backtracks to the second level of the search tree and
intersects the transaction lists for the item sets {a, d} and {a, e}.
Result: Transaction list for the item set {a, d, e}.
Since there is only one transaction list left (and thus no intersection possible),
the recursion is terminated and the search backtracks again.
Eclat: Depth-First Search
Search tree, prefix b: c : 3  d : 1  e : 1
The search backtracks to the first level of the search tree and
intersects the transaction list for b with the transaction lists for c, d, and e.
Result: Transaction lists for the item sets {b, c}, {b, d}, and {b, e}.
Eclat: Depth-First Search
Only one item set has sufficient support, so all subtrees are pruned.
Since there is only one transaction list left (and thus no intersection possible),
the recursion is terminated and the search backtracks again.
Eclat: Depth-First Search
Search tree, prefix c: d : 4  e : 4
Backtrack to the first level of the search tree and
intersect the transaction list for c with the transaction lists for d and e.
Result: Transaction lists for the item sets {c, d} and {c, e}.
Eclat: Depth-First Search
Search tree, prefix cd: e : 2
Intersect the transaction lists for the item sets {c, d} and {c, e}.
Result: Transaction list for the item set {c, d, e}.
Eclat: Depth-First Search
The item set {c, d, e} is not frequent (support 2/20%) and therefore pruned.
Since there is no transaction list left (and thus no intersection possible),
the recursion is terminated and the search backtracks.
Eclat: Depth-First Search
Search tree, prefix d: e : 4
The search backtracks to the first level of the search tree and
intersects the transaction list for d with the transaction list for e.
Result: Transaction list for the item set {d, e}.
With this step the search is finished.
Eclat: Depth-First Search
Complete search tree:
a : 7  b : 3  c : 7  d : 6  e : 7
a → b : 0, c : 4, d : 5, e : 6
ac → d : 3, e : 3
acd → e : 2
ad → e : 4
b → c : 3, d : 1, e : 1
c → d : 4, e : 4
cd → e : 2
d → e : 4
The found frequent item sets coincide, of course,
with those found by the Apriori algorithm.
However, a fundamental difference is that
Eclat usually only writes found frequent item sets to an output file,
while Apriori keeps the whole search tree in main memory.
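The depth-first search walked through on the preceding slides can be condensed into a short recursive sketch. It uses Python sets of transaction identifiers instead of bit vectors; the function name `eclat` and the data layout are assumptions for illustration, and no perfect extension pruning is included.

```python
DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
S_MIN = 3

def eclat(prefix, lists, s_min, out):
    """lists: (item, tid set) pairs for the frequent extensions of `prefix`."""
    for n, (item, tids) in enumerate(lists):
        itemset = prefix + (item,)
        out[itemset] = len(tids)               # report with its support
        # Conditional database: intersect with the lists of later items.
        cond = []
        for other, otids in lists[n + 1:]:
            common = tids & otids
            if len(common) >= s_min:           # keep frequent extensions only
                cond.append((other, common))
        eclat(itemset, cond, s_min, out)

# Single pass over the database to build the transaction lists.
tidsets = {}
for tid, t in enumerate(DB, start=1):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

start = [(i, tidsets[i]) for i in sorted(tidsets) if len(tidsets[i]) >= S_MIN]
found = {}
eclat((), start, S_MIN, found)
print(len(found))
```

Only the transaction lists along the current path are held at any time; the reported item sets are collected in `found` here merely to make the result inspectable.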
Eclat: Depth-First Search
1 a, d, e
2 b, c, d
3 a, c, e
! a, c, d, e
` a, e
o a, c, d
b, c
S a, c, d, e
9 b, c, e
10 a, d, e
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
c
d : 4 e : 4
d
e : 2
d
e : 4
Note that the item set {a, c, d, e} could be pruned by Apriori without computing
its support, because the item set {c, d, e} is infrequent.
The same can be achieved with Eclat if the depth-first traversal of the prefix tree
is carried out from right to left and computed support values are stored.
It is debatable whether the potential gains justify the memory requirement.
Eclat: Bit Matrices and Item Coding
Bit Matrices
Represent transactions as a bit matrix:
Each column corresponds to an item.
Each row corresponds to a transaction.
Normal and sparse representation of bit matrices:
Normal: one memory bit per matrix bit, zeros represented.
Sparse: lists of row indices of set bits (transaction lists).
Which representation is preferable depends on
the ratio of set bits to cleared bits.
Item Coding
Sorting the items ascendingly w.r.t. their frequency (individual or
transaction size sum) leads to a better structure of the search tree.
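The two bit matrix representations can be illustrated with a small Python sketch. It assumes, purely for illustration, that each matrix column is stored as one Python integer (bit k-1 is set if transaction k contains the item); the names `columns` and `support` are hypothetical.

```python
db = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]

# normal representation: one bit per matrix element, zeros included
columns = {item: 0 for item in "abcde"}
for row, transaction in enumerate(db):
    for item in transaction:
        columns[item] |= 1 << row

def support(itemset):
    """Count the rows in which all columns of the item set have a set bit."""
    bits = (1 << len(db)) - 1
    for item in itemset:
        bits &= columns[item]
    return bin(bits).count("1")

# sparse representation: row indices of the set bits (a transaction list)
tids_c = [row + 1 for row in range(len(db)) if columns["c"] >> row & 1]
```

With the normal representation an intersection is a single bitwise AND per machine word, whereas the sparse representation only pays for the set bits.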
Eclat: Intersecting Transaction Lists
function isect (src1, src2 : tidlist)
begin                                  (* intersect two transaction id lists *)
  var dst : tidlist;                   (* created intersection *)
  while both src1 and src2 are not empty do begin
    if     head(src1) < head(src2)     (* skip transaction identifiers that are *)
    then   src1 := tail(src1);         (* unique to the first source list *)
    elseif head(src1) > head(src2)     (* skip transaction identifiers that are *)
    then   src2 := tail(src2);         (* unique to the second source list *)
    else begin                         (* if transaction id is in both sources *)
      dst.append(head(src1));          (* append it to the output list *)
      src1 := tail(src1); src2 := tail(src2);
    end;                               (* remove the transferred transaction id *)
  end;                                 (* from both source lists *)
  return dst;                          (* return the created intersection *)
end;                                   (* function isect() *)
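The pseudo-code above translates almost line by line into Python. The sketch below assumes that both transaction lists are sorted ascendingly and uses index variables instead of head/tail operations; the variable names are illustrative.

```python
def isect(src1, src2):
    """Intersect two ascendingly sorted lists of transaction identifiers."""
    dst = []
    i = j = 0
    while i < len(src1) and j < len(src2):
        if src1[i] < src2[j]:        # identifier only in the first list
            i += 1
        elif src1[i] > src2[j]:      # identifier only in the second list
            j += 1
        else:                        # identifier in both lists: transfer it
            dst.append(src1[i])
            i += 1
            j += 1
    return dst

# transaction lists of {c, d} and {c, e} in the example database
tids_cd = [2, 4, 6, 8]
tids_ce = [3, 4, 8, 9]
tids_cde = isect(tids_cd, tids_ce)
```

The length of the result is the support of the combined item set, here 2 for {c, d, e}.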
Reminder (Apriori): Transactions as a Prefix Tree
transaction database:
a, d, e
b, c, d
a, c, e
a, c, d, e
a, e
a, c, d
b, c
a, c, d, e
b, c, e
a, d, e
lexicographically sorted:
a, c, d
a, c, d, e
a, c, d, e
a, c, e
a, d, e
a, d, e
a, e
b, c
b, c, d
b, c, e
prefix tree representation (item : number of transactions):
a : 7
  c : 4
    d : 3
      e : 2
    e : 1
  d : 2
    e : 2
  e : 1
b : 3
  c : 3
    d : 1
    e : 1
Items in transactions are sorted w.r.t. some arbitrary order,
transactions are sorted lexicographically, then a prefix tree is constructed.
Advantage: identical transaction prefixes are processed only once.
Eclat: Transaction Ranges
transaction database:
a, d, e
b, c, d
a, c, e
a, c, d, e
a, e
a, c, d
b, c
a, c, d, e
b, c, e
a, d, e
item frequencies:
a: 7   b: 3   c: 7   d: 6   e: 7
items sorted by frequency:
a, e, d
c, d, b
a, c, e
a, c, e, d
a, e
a, c, d
c, b
a, c, e, d
c, e, b
a, e, d
lexicographically sorted:
 1: a, c, e
 2: a, c, e, d
 3: a, c, e, d
 4: a, c, d
 5: a, e
 6: a, e, d
 7: a, e, d
 8: c, e, b
 9: c, d, b
10: c, b
transaction ranges per item:
a: 1-7
c: 1-4, 8-10
e: 1-3, 5-7, 8-8
d: 2-3, 4-4, 6-7, 9-9
b: 8-8, 9-9, 10-10
The transaction lists can be compressed by combining
consecutive transaction identifiers into ranges.
Exploit item frequencies and ensure subset relations between ranges
from lower to higher frequencies, so that intersecting the lists is easy.
Eclat: Difference Sets (Diffsets)
In a conditional database, all transaction lists are filtered by the prefix:
Only transactions contained in the transaction list for the prefix
can be in the transaction lists of the conditional database.
This suggests the idea to use diffsets to represent conditional databases:
$\forall I: \forall a \notin I: \quad D_T(a \mid I) = K_T(I) - K_T(I \cup \{a\})$
$D_T(a \mid I)$ contains the indices of the transactions that contain $I$ but not $a$.
The support of direct supersets of $I$ can now be computed as
$\forall I: \forall a \notin I: \quad s_T(I \cup \{a\}) = s_T(I) - |D_T(a \mid I)|.$
The diffsets for the next level can be computed by
$\forall I: \forall a, b \notin I, a \neq b: \quad D_T(b \mid I \cup \{a\}) = D_T(b \mid I) - D_T(a \mid I)$
For some transaction databases, using diffsets speeds up the search considerably.
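The three diffset formulas can be checked numerically on the example database used for the Eclat search. The Python sketch below is an illustration only; the helper names `cover` (for K_T) and `diffset` (for D_T) are assumptions of this sketch.

```python
db = {1: "ade", 2: "bcd", 3: "ace", 4: "acde", 5: "ae",
      6: "acd", 7: "bc", 8: "acde", 9: "bce", 10: "ade"}

def cover(itemset):
    """K_T(I): identifiers of the transactions that contain all items of I."""
    return {k for k, t in db.items() if set(itemset) <= set(t)}

def diffset(a, itemset):
    """D_T(a | I): transactions that contain I but not a."""
    return cover(itemset) - cover(set(itemset) | {a})

I = {"a"}
d_c = diffset("c", I)            # transactions with a but without c
d_d = diffset("d", I)            # transactions with a but without d
s_a = len(cover(I))
s_ac = s_a - len(d_c)            # support of {a, c} from the diffset
d_d_ac = d_d - d_c               # next level: D_T(d | {a, c})
s_acd = s_ac - len(d_d_ac)       # support of {a, c, d}
```

After the first level no transaction list has to be touched again: both the next-level diffset and the support follow from set differences of diffsets.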
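Eclat: Diffsets
Proof of the Formula for the Next Level:
\begin{align*}
D_T(b \mid I \cup \{a\})
&= K_T(I \cup \{a\}) - K_T(I \cup \{a, b\}) \\
&= \{k \mid I \cup \{a\} \subseteq t_k\} - \{k \mid I \cup \{a, b\} \subseteq t_k\} \\
&= \{k \mid I \subseteq t_k \wedge a \in t_k\} - \{k \mid I \subseteq t_k \wedge a \in t_k \wedge b \in t_k\} \\
&= \{k \mid I \subseteq t_k \wedge a \in t_k \wedge b \notin t_k\} \\
&= \{k \mid I \subseteq t_k \wedge b \notin t_k\} - \{k \mid I \subseteq t_k \wedge b \notin t_k \wedge a \notin t_k\} \\
&= \{k \mid I \subseteq t_k \wedge b \notin t_k\} - \{k \mid I \subseteq t_k \wedge a \notin t_k\} \\
&= (\{k \mid I \subseteq t_k\} - \{k \mid I \cup \{b\} \subseteq t_k\}) - (\{k \mid I \subseteq t_k\} - \{k \mid I \cup \{a\} \subseteq t_k\}) \\
&= (K_T(I) - K_T(I \cup \{b\})) - (K_T(I) - K_T(I \cup \{a\})) \\
&= D_T(b \mid I) - D_T(a \mid I)
\end{align*}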
Summary Eclat
Basic Processing Scheme
Depth-first traversal of the prefix tree.
Data is represented as lists of transaction identifiers (one per item).
Support counting is done by intersecting lists of transaction identifiers.
Advantages
Depth-first search reduces memory requirements.
Usually (considerably) faster than Apriori.
Disadvantages
With a sparse transaction list representation (row indices)
Eclat is difficult to execute for modern processors (branch prediction).
Software
http://www.borgelt.net/eclat.html
The SaM Algorithm
Split and Merge Algorithm [Borgelt 2008]
SaM: Basic Ideas
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
Step by step elimination of items from the transaction database.
Recursive processing of the conditional transaction databases.
While Eclat uses a purely vertical transaction representation,
SaM uses a purely horizontal transaction representation.
This demonstrates that the traversal order for the prefix tree and
the representation form of the transaction database can be combined freely.
The data structure used is a simple array of transactions.
The two conditional databases for the two subproblems formed in each step
are created with a split step and a merge step.
Due to these steps the algorithm is called Split and Merge (SaM).
SaM: Preprocessing the Transaction Database
1: Original transaction database
   a d
   a c d e
   b d
   b c d g
   b c f
   a b d
   b d e
   b c d e
   b c
   a b d f
2: Frequency of individual items (s_min = 3)
   g: 1   f: 2   e: 3   a: 4   c: 5   b: 8   d: 8
3: Items in transactions sorted ascendingly w.r.t. their frequency
   (the infrequent items f and g no longer appear)
   a d
   e a c d
   b d
   c b d
   c b
   a b d
   e b d
   e c b d
   c b
   a b d
4: Transactions sorted lexicographically in descending order
   (comparison of items inverted w.r.t. preceding step)
   e a c d
   e c b d
   e b d
   a b d
   a b d
   a d
   c b d
   c b
   c b
   b d
5: Data structure used by the algorithm (weight, items)
   1  e a c d
   1  e c b d
   1  e b d
   2  a b d
   1  a d
   1  c b d
   2  c b
   1  b d
SaM: Basic Operations
Transaction array (weight, items):
   1  e a c d
   1  e c b d
   1  e b d
   2  a b d
   1  a d
   1  c b d
   2  c b
   1  b d
split (conditional database for prefix e):
   1  a c d
   1  c b d
   1  b d
rest of the transaction array:
   2  a b d
   1  a d
   1  c b d
   2  c b
   1  b d
merge (database with e removed):
   1  a c d
   2  a b d
   1  a d
   2  c b d
   2  c b
   2  b d
Split Step: (for first subproblem)
Move all transactions starting with the same item to a new array.
Remove the common leading item (advance pointer into transaction).
Merge Step: (for second subproblem)
Merge the rest of the transaction array and the copied transactions.
The merge operation is similar to a mergesort phase.
SaM: Pseudo-Code
function SaM (a : array of transactions,     (* conditional database to process *)
              p : set of items,              (* prefix of the conditional database a *)
              s_min : int)                   (* minimum support of an item set *)
var i : item;                                (* buffer for the split item *)
    b : array of transactions;               (* split result *)
begin                                        (* split and merge recursion *)
  while a is not empty do                    (* while the database is not empty *)
    i := a[0].items[0];                      (* get leading item of first transaction *)
    move transactions starting with i to b;  (* split step: first subproblem *)
    merge b and the rest of a into a;        (* merge step: second subproblem *)
    if s(i) ≥ s_min then                     (* if the split item is frequent: *)
      p := p ∪ {i};                          (* extend the prefix item set and *)
      report p with support s(i);            (* report the found frequent item set *)
      SaM(b, p, s_min);                      (* process the split result recursively, *)
      p := p - {i};                          (* then restore the original prefix *)
    end;
  end;
end;                                         (* function SaM() *)
SaM: Pseudo-Code Split Step
var i : item;                          (* buffer for the split item *)
    s : int;                           (* support of the split item *)
    b : array of transactions;         (* split result *)
begin                                  (* split step *)
  b := empty; s := 0;                  (* initialize split result and item support *)
  i := a[0].items[0];                  (* get leading item of first transaction *)
  while a is not empty                 (* while database is not empty and *)
  and   a[0].items[0] = i do           (* next transaction starts with same item *)
    s := s + a[0].wt;                  (* sum occurrences (compute support) *)
    remove i from a[0].items;          (* remove split item from transaction *)
    if a[0].items is not empty         (* if transaction has not become empty *)
    then remove a[0] from a and append it to b;
    else remove a[0] from a; end;      (* move it to the conditional database, *)
  end;                                 (* otherwise simply remove it: *)
end;                                   (* empty transactions are eliminated *)
Note that the split step also determines the support of the item i.
SaM: Pseudo-Code Merge Step
var c : array of transactions;           (* buffer for rest of source array *)
begin                                    (* merge step *)
  c := a; a := empty;                    (* initialize the output array *)
  while b and c are both not empty do    (* merge split and rest of database *)
    if c[0].items > b[0].items           (* copy lex. smaller transaction from c *)
    then remove c[0] from c and append it to a;
    else if c[0].items < b[0].items      (* copy lex. smaller transaction from b *)
    then remove b[0] from b and append it to a;
    else b[0].wt := b[0].wt + c[0].wt;   (* sum the occurrences/weights *)
      remove b[0] from b and append it to a;
      remove c[0] from c;                (* move combined transaction and *)
    end;                                 (* delete the other, equal transaction: *)
  end;                                   (* keep only one copy per transaction *)
  while c is not empty do                (* copy rest of transactions in c *)
    remove c[0] from c and append it to a; end;
  while b is not empty do                (* copy rest of transactions in b *)
    remove b[0] from b and append it to a; end;
end;                                     (* second recursion: executed by loop *)
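The preprocessing, the split step and the merge step can be combined into a compact Python sketch. It is a hedged illustration, not the author's code: transactions are (weight, items) pairs, items are integer codes with the rarest frequent item receiving the highest code, and the array is sorted descendingly by the item tuples. The function names `prepare`, `merge` and `sam` and this coding scheme are assumptions of the sketch; unlike the pseudo-code, the merge builds a new list and leaves the split result intact.

```python
from collections import Counter

def prepare(db, smin):
    """Code the frequent items and build the sorted, weighted transaction array."""
    freq = Counter(item for t in db for item in set(t))
    # ascending frequency; the rarest frequent item receives the highest code
    order = sorted((i for i in freq if freq[i] >= smin), key=lambda i: (freq[i], i))
    code = {item: len(order) - 1 - rank for rank, item in enumerate(order)}
    weights = Counter(
        tuple(sorted((code[i] for i in set(t) if i in code), reverse=True))
        for t in db)
    weights.pop((), None)                       # drop empty transactions
    array = sorted(((w, items) for items, w in weights.items()),
                   key=lambda e: e[1], reverse=True)
    return array, {c: item for item, c in code.items()}

def merge(b, c):
    """Merge two descendingly sorted arrays, combining equal transactions."""
    out, i, j = [], 0, 0
    while i < len(b) and j < len(c):
        if c[j][1] > b[i][1]:
            out.append(c[j]); j += 1
        elif c[j][1] < b[i][1]:
            out.append(b[i]); i += 1
        else:                                   # equal: sum the weights
            out.append((b[i][0] + c[j][0], b[i][1])); i += 1; j += 1
    return out + b[i:] + c[j:]

def sam(a, prefix, smin, found):
    while a:
        i = a[0][1][0]                          # leading item of first transaction
        b, s, k = [], 0, 0
        while k < len(a) and a[k][1][0] == i:   # split step
            wt, items = a[k]
            s += wt                             # sum occurrences (support of i)
            if len(items) > 1:                  # keep non-empty remainders only
                b.append((wt, items[1:]))
            k += 1
        a = merge(b, a[k:])                     # merge step: i is eliminated
        if s >= smin:                           # first subproblem: prefix + i
            found[prefix + (i,)] = s
            sam(b, prefix + (i,), smin, found)

db = ["ad", "acde", "bd", "bcdg", "bcf", "abd", "bde", "bcde", "bc", "abdf"]
array, decode = prepare(db, 3)
found = {}
sam(array, (), 3, found)
result = {frozenset(decode[c] for c in codes): s for codes, s in found.items()}
```

The array returned by `prepare` corresponds to step 5 of the preprocessing slide, and the first pass of the loop reproduces the split and merge results shown for prefix e.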
SaM: Optimization
If the transaction database is sparse,
the two transaction arrays to merge can differ substantially in size.
In this case SaM can become fairly slow,
because the merge step processes many more transactions than the split step.
Intuitive explanation (extreme case):
Suppose mergesort always merged a single element
with the recursively sorted rest of the array (or list).
This version of mergesort would be equivalent to insertion sort.
As a consequence the time complexity worsens from O(n log n) to O(n^2).
Possible optimization:
Modify the merge step if the arrays to merge differ significantly in size.
Idea: use the same optimization as in binary search based insertion sort.
SaM: Pseudo-Code Binary Search Based Merge
function merge (a, b : array of transactions) : array of transactions
var l, m, r : int;                       (* binary search variables *)
    c : array of transactions;           (* output transaction array *)
begin                                    (* binary search based merge *)
  c := empty;                            (* initialize the output array *)
  while a and b are both not empty do    (* merge the two transaction arrays *)
    l := 0; r := length(a);              (* initialize the binary search range *)
    while l < r do                       (* while the search range is not empty *)
      m := ⌊(l+r)/2⌋;                    (* compute the middle index *)
      if a[m] < b[0]                     (* compare the transaction to insert *)
      then l := m + 1; else r := m;      (* and adapt the binary search range *)
    end;                                 (* according to the comparison result *)
    while l > 0 do                       (* while still before insertion position *)
      remove a[0] from a and append it to c;
      l := l - 1;                        (* copy lex. larger transaction and *)
    end;                                 (* decrement the transaction counter *)
    . . .
    . . .
    remove b[0] from b and append it to c;   (* copy the transaction to insert and *)
    i := length(c) - 1;                      (* get its index in the output array *)
    if a is not empty and a[0].items = c[i].items
    then c[i].wt := c[i].wt + a[0].wt;       (* if there is a transaction in the rest *)
         remove a[0] from a;                 (* that is equal to the one just copied, *)
    end;                                     (* then sum the transaction weights *)
  end;                                       (* and remove trans. from the rest *)
  while a is not empty do                    (* copy rest of transactions in a *)
    remove a[0] from a and append it to c; end;
  while b is not empty do                    (* copy rest of transactions in b *)
    remove b[0] from b and append it to c; end;
  return c;                                  (* return the merge result *)
end;                                         (* function merge() *)
Applying this merge procedure if the length ratio of the transaction arrays
exceeds 16:1 accelerates the execution on sparse data sets.
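A Python sketch of the binary search based merge follows. It assumes the same representation as before, (weight, items) pairs sorted descendingly by the item tuples, with `a` being the long array and `b` the short one; the function name `merge_bs` and the slice-based copying are illustrative choices.

```python
def merge_bs(a, b):
    """Merge the short array b into the long array a."""
    c = []
    pos = 0                                  # start of the unprocessed part of a
    for wt, items in b:
        l, r = pos, len(a)                   # binary search range
        while l < r:
            m = (l + r) // 2
            if a[m][1] > items:              # a[m] must precede the new transaction
                l = m + 1
            else:
                r = m
        c.extend(a[pos:l])                   # copy everything before the position
        pos = l
        if pos < len(a) and a[pos][1] == items:
            wt += a[pos][0]                  # equal transaction: sum the weights
            pos += 1
        c.append((wt, items))
    c.extend(a[pos:])                        # copy the rest of a
    return c

rest = [(2, (3, 1, 0)), (1, (3, 0)), (1, (2, 1, 0)), (2, (2, 1)), (1, (1, 0))]
split = [(1, (3, 2, 0)), (1, (2, 1, 0)), (1, (1, 0))]
merged = merge_bs(rest, split)
```

Each transaction of the short array costs a logarithmic number of comparisons instead of a linear scan, which is where the gain on sparse data comes from.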
SaM: Optimization and External Storage
Accepting a slightly more complicated processing scheme,
one may work with double source buffering:
Initially, one source is the input database and the other source is empty.
A split result, which has to be created by moving and merging transactions
from both sources, is always merged to the smaller source.
If both sources have become large,
they may be merged in order to empty one source.
Note that SaM can easily be implemented to work on external storage:
In principle, the transactions need not be loaded into main memory.
Even the transaction array can easily be stored on external storage
or as a relational database table.
The fact that the transaction array is processed linearly
is advantageous for external storage operations.
Summary SaM
Basic Processing Scheme
Depth-first traversal of the prefix tree.
Data is represented as an array of transactions (purely horizontal representation).
Support counting is done implicitly in the split step.
Advantages
Very simple data structure and processing scheme.
Easy to implement for operation on external storage / relational databases.
Disadvantages
Can be slow on sparse transaction databases due to the merge step.
Software
http://www.borgelt.net/sam.html
The RElim Algorithm
Recursive Elimination Algorithm [Borgelt 2005]
Recursive Elimination: Basic Ideas
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
Step by step elimination of items from the transaction database.
Recursive processing of the conditional transaction databases.
Avoids the main problem of the SaM algorithm:
does not use a merge operation to group transactions with the same leading item.
RElim rather maintains one list of transactions per item,
thus employing the core idea of radix sort.
However, only transactions starting with an item are in the corresponding list.
After an item has been processed, transactions are reassigned to other lists
(based on the next item in the transaction).
RElim is in several respects similar to the LCM algorithm
and closely related to the H-mine algorithm (but simpler data structure).
RElim: Preprocessing the Transaction Database
1: Original transaction database (same as for SaM)
2: Frequency of individual items (same as for SaM)
3: Items in transactions sorted ascendingly w.r.t. their frequency (same as for SaM)
4: Transactions sorted lexicographically in descending order
   (comparison of items inverted w.r.t. preceding step)
   e a c d
   e c b d
   e b d
   a b d
   a b d
   a d
   c b d
   c b
   c b
   b d
5: Data structure used by the algorithm (leading items implicit in list)
   item  weight  list of (weight, remaining items)
   d     0       (empty)
   b     1       1 d
   c     3       1 b d;  2 b
   a     3       2 b d;  1 d
   e     3       1 a c d;  1 c b d;  1 b d
RElim: Basic Operations
initial database:
   item  weight  list of (weight, remaining items)
   d     0       (empty)
   b     1       1 d
   c     3       1 b d;  2 b
   a     3       2 b d;  1 d
   e     3       1 a c d;  1 c b d;  1 b d
conditional database for prefix e:
   d     0       (empty)
   b     1       1 d
   c     1       1 b d
   a     1       1 c d
database with e eliminated:
   d     0       (empty)
   b     2       1 d;  1 d
   c     4       1 b d;  1 b d;  2 b
   a     4       1 c d;  2 b d;  1 d
The basic operations of the RElim algorithm:
The rightmost list is traversed and reassigned:
once to an initially empty list array (conditional database for the prefix e)
and once to the original list array (eliminating item e).
These two databases are then both processed recursively.
Note that after a simple reassignment there may be duplicate list elements.
RElim: Pseudo-Code
function RElim (a : array of transaction lists,  (* cond. database to process *)
                p : set of items,                (* prefix of the conditional database a *)
                s_min : int) : int               (* minimum support of an item set *)
var i, k : item;                                 (* buffer for the current item *)
    s : int;                                     (* support of the current item *)
    n : int;                                     (* number of found frequent item sets *)
    b : array of transaction lists;              (* conditional database for current item *)
    t, u : transaction list element;             (* to traverse the transaction lists *)
begin                                            (* recursive elimination *)
  n := 0;                                        (* initialize the number of found item sets *)
  while a is not empty do                        (* while conditional database is not empty *)
    i := last item of a; s := a[i].wt;           (* get the next item to process *)
    if s ≥ s_min then                            (* if the current item is frequent: *)
      p := p ∪ {i};                              (* extend the prefix item set and *)
      report p with support s;                   (* report the found frequent item set *)
      . . .                                      (* create conditional database for i *)
      p := p - {i};                              (* and process it recursively, *)
    end;                                         (* then restore the original prefix *)
    if s ≥ s_min then                            (* if the current item is frequent: *)
      . . .                                      (* report the found frequent item set *)
      b := array of transaction lists;           (* create an empty list array *)
      t := a[i].head;                            (* get the list associated with the item *)
      while t ≠ nil do                           (* while not at the end of the list *)
        u := copy of t; t := t.succ;             (* copy the transaction list element, *)
        k := u.items[0];                         (* go to the next list element, and *)
        remove k from u.items;                   (* remove the leading item from the copy *)
        if u.items is not empty                  (* add the copy to the conditional database *)
        then u.succ := b[k].head; b[k].head := u; end;
        b[k].wt := b[k].wt + u.wt;               (* sum the transaction weight *)
      end;                                       (* in the list weight/transaction counter *)
      n := n + 1 + RElim(b, p, s_min);           (* process the created database recursively *)
      . . .                                      (* and sum the found frequent item sets, *)
    end;                                         (* then restore the original item set prefix *)
    . . .                                        (* go on by reassigning *)
                                                 (* the processed transactions *)
    . . .
    t := a[i].head;                              (* get the list associated with the item *)
    while t ≠ nil do                             (* while not at the end of the list *)
      u := t; t := t.succ;                       (* note the current list element, *)
      k := u.items[0];                           (* go to the next list element, and *)
      remove k from u.items;                     (* remove the leading item from current *)
      if u.items is not empty                    (* reassign the noted list element *)
      then u.succ := a[k].head; a[k].head := u; end;
      a[k].wt := a[k].wt + u.wt;                 (* sum the transaction weight *)
    end;                                         (* in the list weight/transaction counter *)
    remove a[i] from a;                          (* remove the processed list *)
  end;
  return n;                                      (* return the number of frequent item sets *)
end;                                             (* function RElim() *)
In order to remove duplicate elements, it is usually advisable
to sort and compress the next transaction list before it is processed.
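The recursion above can be sketched in Python as follows. The sketch is an illustration under assumptions: each list is a pair [weight, elements], items are integer codes (the rarest frequent item has the highest code, so the last list is processed first), Python lists replace the linked lists of the pseudo-code, and duplicate list elements are not compressed. The names `add` and `relim` are chosen for this sketch.

```python
from collections import Counter

def add(lists, wt, items):
    """Enter a transaction into the list of its leading item (kept implicit)."""
    k = items[0]
    lists[k][0] += wt                        # sum the weight in the list counter
    if len(items) > 1:                       # store only a non-empty remainder
        lists[k][1].append((wt, items[1:]))

def relim(lists, prefix, smin, found):
    n = 0                                    # number of found frequent item sets
    while lists:
        s, elems = lists.pop()               # list of the last item
        i = len(lists)                       # code of that item
        if s >= smin:                        # first subproblem: prefix + i
            found[prefix + (i,)] = s
            cond = [[0, []] for _ in range(i)]
            for wt, items in elems:
                add(cond, wt, items)
            n += 1 + relim(cond, prefix + (i,), smin, found)
        for wt, items in elems:              # eliminate item i: reassign elements
            add(lists, wt, items)
    return n

db = ["ad", "acde", "bd", "bcdg", "bcf", "abd", "bde", "bcde", "bc", "abdf"]
smin = 3
freq = Counter(item for t in db for item in set(t))
order = sorted((i for i in freq if freq[i] >= smin), key=lambda i: (freq[i], i))
code = {item: len(order) - 1 - rank for rank, item in enumerate(order)}
decode = {c: item for item, c in code.items()}

lists = [[0, []] for _ in order]
for t in db:
    items = tuple(sorted((code[i] for i in set(t) if i in code), reverse=True))
    if items:
        add(lists, 1, items)
initial_weights = [w for w, _ in lists]      # order d, b, c, a, e

found = {}
n = relim(lists, (), smin, found)
result = {frozenset(decode[c] for c in codes): s for codes, s in found.items()}
```

The initial list weights match the data structure shown in the preprocessing slide, and no merge operation is needed at any point.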
Summary RElim
Basic Processing Scheme
Depth-first traversal of the prefix tree.
Data is represented as lists of transactions (one per item).
Support counting is implicit in the (re)assignment step.
Advantages
Simple data structures and processing scheme.
Competitive with the fastest algorithms despite this simplicity.
Disadvantages
RElim is usually outperformed by FP-growth (discussed next).
Software
http://www.borgelt.net/relim.html
The LCM Algorithm
Linear Closed Item Set Miner
[Uno, Asai, Uchida, and Arimura 2003] (version 1)
[Uno, Kiyomi and Arimura 2004, 2005] (versions 2 & 3)
LCM: Basic Ideas
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
Step by step elimination of items from the transaction database;
recursive processing of the conditional transaction databases.
Closely related to the Eclat algorithm.
Maintains both a horizontal and a vertical representation
of the transaction database in parallel.
Uses the vertical representation to filter the transactions
with the chosen split item.
Uses the horizontal representation to fill the vertical representation
for the next recursion step (no intersection as in Eclat).
Usually traverses the search tree from right to left
in order to reuse the memory for the vertical representation
(fixed memory requirement, proportional to database size).
LCM: Occurrence Deliver
Transaction database (horizontal representation):
 1: a d e
 2: b c d
 3: a c e
 4: a c d e
 5: a e
 6: a c d
 7: b c
 8: a c d e
 9: b c e
10: a d e
Vertical representation (item, support: transaction list):
a, 7: 1 3 4 5 6 8 10
b, 3: 2 7 9
c, 7: 2 3 4 6 7 8 9
d, 6: 1 2 4 6 8 10
e, 7: 1 3 4 5 8 9 10
Delivering the transactions in the list of the split item e:
after transaction 1 (a d e):     a, 1: 1       b, 0:     c, 0:         d, 1: 1
after transaction 3 (a c e):     a, 2: 1 3     b, 0:     c, 1: 3       d, 1: 1
after transaction 4 (a c d e):   a, 3: 1 3 4   b, 0:     c, 2: 3 4     d, 2: 1 4
etc.
Occurrence deliver scheme used by LCM to find the conditional
transaction database for the first subproblem
(needs a horizontal representation in parallel).
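Occurrence deliver can be sketched in a few lines of Python. The sketch assumes that items are single characters ordered alphabetically and that the horizontal representation is a dictionary from transaction identifier to item string; the function name `occurrence_deliver` is illustrative.

```python
db = {1: "ade", 2: "bcd", 3: "ace", 4: "acde", 5: "ae",
      6: "acd", 7: "bc", 8: "acde", 9: "bce", 10: "ade"}

def occurrence_deliver(db, tids, split):
    """Build the vertical representation of the conditional database:
    every transaction of the split item is delivered to the lists
    of the items that precede the split item."""
    lists = {}
    for tid in tids:                      # vertical: transactions of the split item
        for item in db[tid]:              # horizontal: items of this transaction
            if item < split:
                lists.setdefault(item, []).append(tid)
    return lists

tids_e = [tid for tid, t in db.items() if "e" in t]
cond_e = occurrence_deliver(db, tids_e, "e")
supports = {item: len(tids) for item, tids in cond_e.items()}
```

No list intersection is carried out: each transaction is visited once and its identifier is appended to the lists of its own items.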
LCM: Left to Right Processing
[Figure: the vertical representation (transaction lists of the items a to e) at successive stages of the processing. Color code: black = unprocessed part, blue = split item, red = conditional database.]
The second subproblem (exclude split item) is solved
before the first subproblem (include split item).
The algorithm is executed on only the memory
that stores the initial vertical representation.
If the transaction database can be loaded, the frequent item sets can be found.
LCM: k-items Machine
Problem of LCM (as of Eclat): it is difficult to combine equal transaction suffixes.
Idea: If the number of items is small, a bucket/bin sort scheme
can be used to perfectly combine equal transaction suffixes.
This scheme leads to the k-items machine (for small k).
All possible transaction suffixes are represented as bit patterns;
one bucket/bin is created for each possible bit pattern.
A RElim-like processing scheme is employed (on a fixed data structure).
Leading items are extracted with a table that is indexed with the bit pattern.
Items are eliminated with a bit mask.
Table of highest set bits for a 4-items machine:
highest items/set bits of transactions (constant)
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
____ ___a __b_ __ba _c__ _c_a _cb_ _cba d___ d__a d_b_ d_ba dc__ dc_a dcb_ dcba
*.* a.0 b.1 b.1 c.2 c.2 c.2 c.2 d.3 d.3 d.3 d.3 d.3 d.3 d.3 d.3
Transaction database:
 1: a, d, e
 2: b, c, d
 3: a, c, e
 4: a, c, d, e
 5: a, e
 6: a, c, d
 7: b, c
 8: a, c, d, e
 9: b, c, e
10: a, d, e
Empty 4-items machine (no transactions)
transaction weights/multiplicities (indexed by bit pattern):
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
transaction lists (one per item; item.code, weight: bit patterns):
a.0, 0: (empty)
b.1, 0: (empty)
c.2, 0: (empty)
d.3, 0: (empty)
4-items machine after inserting the transactions
transaction weights/multiplicities (indexed by bit pattern):
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
   0    1    0    0    0    1    2    0    0    2    0    0    0    3    1    0
transaction lists (one per item; item.code, weight: bit patterns):
a.0, 1: 0001
b.1, 0: (empty)
c.2, 3: 0101 0110
d.3, 6: 1001 1110 1101
In this state the 4-items machine represents a special form
of the initial transaction database of the RElim algorithm.
After propagating the transaction lists
transaction weights/multiplicities (indexed by bit pattern):
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
   0    7    3    0    0    4    3    0    0    2    0    0    0    3    1    0
transaction lists (one per item; item.code, weight: bit patterns):
a.0, 7: 0001
b.1, 3: 0010
c.2, 7: 0101 0110
d.3, 6: 1001 1110 1101
Propagating the transaction lists is equivalent to occurrence deliver.
Conditional transaction databases are created as in RElim plus propagation.
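The insertion and propagation steps of the 4-items machine can be reproduced with a short Python sketch. It assumes that the items a, b, c, d have the codes 0 to 3 (item e is ignored, as in the example), that a bit pattern is stored as an integer, and that a weight of zero marks a pattern that is not yet contained in any list. The names `HIGHEST`, `insert` and `propagate` are chosen for this sketch.

```python
K = 4                                        # items a, b, c, d (codes 0 to 3)
# table of highest set bits, indexed with the bit pattern (-1: empty pattern)
HIGHEST = [-1] + [p.bit_length() - 1 for p in range(1, 1 << K)]

def insert(weights, lists, pattern, wt):
    """Add a transaction (bit pattern) with the given weight."""
    if pattern == 0:                         # empty transactions are not stored
        return
    if weights[pattern] == 0:                # first occurrence: enter the pattern
        lists[HIGHEST[pattern]].append(pattern)   # into the list of its leading item
    weights[pattern] += wt

def propagate(weights, lists):
    """Eliminate the leading item of each pattern with a bit mask and
    pass the weight on to the resulting pattern (highest item first)."""
    for item in range(K - 1, 0, -1):
        for pattern in lists[item]:
            insert(weights, lists, pattern & ~(1 << item), weights[pattern])

db = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
weights = [0] * (1 << K)
lists = [[] for _ in range(K)]
for t in db:
    pattern = sum(1 << "abcd".index(i) for i in t if i in "abcd")
    insert(weights, lists, pattern, 1)
inserted = list(weights)                     # state after inserting the transactions

propagate(weights, lists)
supports = [sum(weights[p] for p in lst) for lst in lists]
```

After propagation the list weights equal the item supports, which is what occurrence deliver produces for the full database.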
Summary LCM
Basic Processing Scheme
Depth-first traversal of the prefix tree.
Parallel horizontal and vertical transaction representation.
Support counting is done during the occurrence deliver process.
Advantages
Fairly simple data structure and processing scheme.
Very fast if implemented properly (and with additional tricks).
Disadvantages
Simple, straightforward implementation is relatively slow.
Software
http://www.borgelt.net/eclat.html (option -ao)
The FP-Growth Algorithm
Frequent Pattern Growth Algorithm [Han, Pei, and Yin 2000]
FP-Growth: Basic Ideas
FP-Growth means Frequent Pattern Growth.
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
Step by step elimination of items from the transaction database.
Recursive processing of the conditional transaction databases.
The transaction database is represented as an FP-tree.
An FP-tree is basically a prefix tree with additional structure:
nodes of this tree that correspond to the same item are linked.
This combines a horizontal and a vertical database representation.
This data structure is used to compute conditional databases efficiently.
All transactions containing a given item can easily be found
by the links between the nodes corresponding to this item.
FP-Growth: Preprocessing the Transaction Database
1: Original transaction database
   a d f
   a c d e
   b d
   b c d
   b c
   a b d
   b d e
   b c e g
   c d f
   a b d
2: Frequency of individual items (s_min = 3)
   d: 8   b: 7   c: 5   a: 4   e: 3   f: 2   g: 1
3: Items in transactions sorted descendingly w.r.t. their frequency
   and infrequent items removed
   d a
   d c a e
   d b
   d b c
   b c
   d b a
   d b e
   b c e
   d c
   d b a
4: Transactions sorted lexicographically in ascending order
   (comparison of items is the same as in preceding step)
   d b
   d b c
   d b a
   d b a
   d b e
   d c
   d c a e
   d a
   b c
   b c e
5: Data structure used by the algorithm: FP-tree
   (details on next slide)
Transaction Representation: FP-Tree
Build a frequent pattern tree (FP-tree) from the transactions
(basically a prefix tree with links between the branches that link nodes
with the same item and a header table for the resulting item lists).
Frequent single item sets can be read directly from the FP-tree.
Simple Example Database
1: Original transaction database
   a d f
   a c d e
   b d
   b c d
   b c
   a b d
   b d e
   b c e g
   c d f
   a b d
4: Preprocessed and sorted transactions
   d b
   d b c
   d b a
   d b a
   d b e
   d c
   d c a e
   d a
   b c
   b c e
frequent pattern tree:
header table:  d: 8   b: 7   c: 5   a: 4   e: 3
root (10)
  d: 8
    b: 5
      c: 1
      a: 2
      e: 1
    c: 2
      a: 1
        e: 1
    a: 1
  b: 2
    c: 2
      e: 1
An FP-tree combines a horizontal and a vertical transaction representation.
Horizontal Representation: prefix tree of transactions
Vertical Representation: links between the prefix tree branches
Note: the prefix tree is inverted,
i.e. there are only parent pointers.
Child pointers are not needed
due to the processing scheme
(to be discussed).
In principle, all nodes referring
to the same item can be stored
in an array rather than a list.
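The structure described above can be sketched in Python. The sketch assumes that the transactions are already preprocessed (step 4 of the preprocessing), stores the nodes of each item in a Python list that serves as the header table entry, and keeps only parent pointers in the nodes; the dictionary `children` is a construction aid of this sketch and not part of the tree. The names `Node`, `build` and `path_to_root` are illustrative.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0

def build(transactions):
    """Return the header table (item -> list of nodes) of an FP-tree
    built from (weight, items) pairs."""
    root = Node(None, None)
    children = {}                      # (parent node, item) -> child node
    header = {}
    for wt, items in transactions:
        node = root
        for item in items:
            child = children.get((node, item))
            if child is None:          # new branch: create and link the node
                child = children[(node, item)] = Node(item, node)
                header.setdefault(item, []).append(child)
            child.count += wt
            node = child
    return header

def path_to_root(node):
    """Items on the path from the root to the parent of the given node."""
    path = []
    node = node.parent
    while node.item is not None:
        path.append(node.item)
        node = node.parent
    return tuple(reversed(path))

db = ["db", "dbc", "dba", "dba", "dbe", "dc", "dcae", "da", "bc", "bce"]
header = build((1, t) for t in db)
supports = {item: sum(n.count for n in nodes) for item, nodes in header.items()}
cond_e = [(n.count, path_to_root(n)) for n in header["e"]]
```

The node lists give the vertical view (all occurrences of an item), while following the parent pointers recovers the horizontal view (the transactions containing it).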
Recursive Processing
The initial FP-tree is projected w.r.t. the item corresponding to
the rightmost level in the tree (let this item be i).
This yields an FP-tree of the conditional database
(database of transactions containing the item i, but with this item removed:
it is implicit in the FP-tree and recorded as a common prefix).
From the projected FP-tree the frequent item sets
containing item i can be read directly.
The rightmost level of the original (unprojected) FP-tree is removed
(the item i is removed from the database).
The projected FP-tree is processed recursively; the item i is noted as a prefix
that is to be added in deeper levels of the recursion.
Afterwards the reduced original FP-tree is further processed
by working on the next level leftwards.
Projecting an FP-Tree
[Figure: the FP-tree of the example database with the attached projection for item e, and the detached projection.]
detached projection (FP-tree of the conditional database for prefix e):
header table:  d: 2   b: 2   c: 2   a: 1
root (3)
  d: 2
    b: 1
    c: 1
      a: 1
  b: 1
    c: 1
By traversing the node list for the rightmost item,
all transactions containing this item can be found.
The FP-tree of the conditional database for this item is created
by copying the nodes on the paths to the root.
A simpler, but usually equally efficient projection scheme
is to extract a path to the root as a (reduced) transaction
and to insert this transaction into a new FP-tree.
For the insertion into the new tree there are two approaches:
Apart from a parent pointer (which is needed for the path extraction),
each node possesses a pointer to its first child and its right sibling.
These pointers allow to insert a new transaction top-down.
If the initial FP-tree has been built from a lexicographically sorted
transaction database, the traversal of the item lists yields the
(reduced) transactions in lexicographical order.
This can be exploited to insert a transaction using only the header table.
By processing an FP-tree from left to right (or from top to bottom
w.r.t. the prefix tree), the projection may even reuse the already present nodes
and the already processed part of the header table (top-down fp-growth).
In this way the algorithm can be executed on a fixed amount of memory.
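The projection by extraction and insertion leads to a very short FP-growth sketch in Python. It is an illustration under assumptions and not the author's implementation: a dictionary replaces the child and sibling pointers during insertion, infrequent items in a conditional database are simply skipped instead of being pruned, and the names `Node`, `build` and `fpgrowth` are chosen for this sketch.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0

def build(transactions):
    """Insert (weight, items) pairs top-down; return the header table."""
    root = Node(None, None)
    children, header = {}, {}
    for wt, items in transactions:
        node = root
        for item in items:
            child = children.get((node, item))
            if child is None:
                child = children[(node, item)] = Node(item, node)
                header.setdefault(item, []).append(child)
            child.count += wt
            node = child
    return header

def fpgrowth(header, prefix, smin, found):
    for item, nodes in header.items():
        s = sum(n.count for n in nodes)          # support from the node list
        if s < smin:
            continue
        found[frozenset(prefix + (item,))] = s
        cond = []                                # conditional database for item
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:            # extract the path to the root
                path.append(p.item)
                p = p.parent
            if path:                             # reduced transaction, root first
                cond.append((n.count, path[::-1]))
        fpgrowth(build(cond), prefix + (item,), smin, found)

db = ["adf", "acde", "bd", "bcd", "bc", "abd", "bde", "bceg", "cdf", "abd"]
smin = 3
freq = Counter(i for t in db for i in set(t))
rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
transactions = [(1, sorted((i for i in set(t) if freq[i] >= smin), key=rank.get))
                for t in db]
found = {}
fpgrowth(build(transactions), (), smin, found)
```

Because a projection only follows parent pointers, the levels to the right of the processed item are never visited, so removing the rightmost level of the original tree is implicit in this sketch.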
Reducing the Original FP-Tree
FP-tree after removing the rightmost level (item e):
header table:  d: 8   b: 7   c: 5   a: 4
root (10)
  d: 8
    b: 5
      c: 1
      a: 2
    c: 2
      a: 1
    a: 1
  b: 2
    c: 2
The original FP-tree is reduced by removing the rightmost level.
This yields the conditional database for item sets not containing the item
corresponding to the rightmost level.
FP-growth: Divide-and-Conquer
[Figure: the original FP-tree is divided into two subproblems.]
First subproblem: conditional database for prefix e
(projected FP-tree with header table d: 2, b: 2, c: 2, a: 1).
Second subproblem: conditional database with item e removed
(reduced FP-tree with header table d: 8, b: 7, c: 5, a: 4).
Pruning a Projected FP-Tree
Trivial case: If the item corresponding to the rightmost level is infrequent,
the item and the FP-tree level are removed without projection.
More interesting case: An item corresponding to a middle level
is infrequent, but an item on a level further to the right is frequent.
Example FP-Tree with an infrequent item on a middle level:
before pruning:  header table a: 6  b: 1  c: 4  d: 3;  paths a: 6 → b: 1 → c: 1 → d: 1  and  a: 6 → c: 3 → d: 2
after pruning:   header table a: 6  b: 1  c: 4  d: 3;  path a: 6 → c: 4 → d: 3
So-called α-pruning or Bonsai pruning of a (projected) FP-tree.
Implemented by left-to-right levelwise merging of nodes with same parents.
Not needed if projection works by extraction and insertion.
FP-growth: Implementation Issues
Chains:
If an FP-tree has been reduced to a chain, no projections are computed anymore.
Rather all subsets of the set of items in the chain are formed and reported.
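The chain optimization can be illustrated with a small Python sketch. It assumes that the chain is given as (item, count) pairs from the root to the leaf, that all items in the chain are frequent, and that the counts do not increase along the chain, so that the support of a subset is the count of its deepest node; the function name `report_chain` is hypothetical.

```python
from itertools import combinations

def report_chain(chain, prefix, found):
    """Report all non-empty subsets of the chain items, joined with the prefix."""
    for size in range(1, len(chain) + 1):
        for combo in combinations(chain, size):   # keeps the root-to-leaf order
            items = frozenset(prefix) | {item for item, _ in combo}
            found[items] = combo[-1][1]           # count of the deepest node

found = {}
report_chain([("a", 6), ("c", 4), ("d", 3)], (), found)
```

For a chain of n items this reports 2^n - 1 item sets without building a single projected tree.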
Rebuilding the FP-tree:
An FP-tree may be projected by extracting the (reduced) transactions described
by the paths to the root and inserting them into a new FP-tree (see above).
This makes it possible to change the item order, with the following advantages:
No need for α- or Bonsai pruning, since the items can be reordered
so that all conditionally frequent items appear on the left.
No need for perfect extension pruning, because the perfect extensions can be
moved to the left and are processed at the end with the chain optimization.
However, there are also disadvantages:
Either the FP-tree has to be traversed twice or pair frequencies have to be
determined to reorder the items according to their conditional frequency.
The initial FP-tree is built from an array-based main memory representation
of the transaction database (eliminates the need for child pointers).
This has the disadvantage that the memory savings often resulting
from an FP-tree representation cannot be fully exploited.
However, it has the advantage that no child and sibling pointers are needed
and the transactions can be inserted in lexicographic order.
Each FP-tree node has a constant size of 16 bytes (2 pointers, 2 integers).
Allocating these through the standard memory management is wasteful.
(Allocating many small memory objects is highly inefficient.)
Solution: The nodes are allocated in one large array per FP-tree.
As a consequence, each FP-tree resides in a single memory block.
There is no allocation and deallocation of individual nodes.
(This may waste some memory, but is highly efficient.)
FP-growth: Implementation Issues
An FP-tree can be implemented with only two integer arrays [Rasz 2004]:
one array contains the transaction counters (support values) and
one array contains the parent pointers (as the indices of array elements).
This reduces the memory requirements to 8 bytes per node.
Such a memory structure has advantages due to the way in which modern processors access the main memory:
Linear memory accesses are faster than random accesses.
Main memory is organized as a table with rows and columns.
First the row is addressed and then, after some delay, the column.
Accesses to different columns in the same row can skip the row addressing.
However, there are also disadvantages:
Programming projection and α- or Bonsai pruning becomes more complex, because less structure is available.
Reordering the items is virtually ruled out.
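The two-array layout can be illustrated with a minimal C sketch. It is not the referenced implementation: the type FPArr, the helper names, and in particular the extra array first (which groups the nodes of each item into a contiguous index range, so that no item field is needed per node) are assumptions made only for this illustration.

```c
#include <assert.h>

/* Hypothetical array-based FP-tree: node i has the counter cnt[i] and the
 * parent index par[i]; node 0 is the root (par[0] = -1). As an assumption
 * of this sketch, the nodes of item k occupy first[k] .. first[k+1]-1. */
typedef struct {
    const int *cnt;     /* transaction counters (support values) */
    const int *par;     /* parent "pointers" as array indices */
    const int *first;   /* nitems+1 start indices of the item blocks */
    int        nitems;  /* number of items */
} FPArr;

/* Item of a node, found from the index range it lies in (-1 for the root). */
int fp_item(const FPArr *t, int node)
{
    for (int k = 0; k < t->nitems; k++)
        if (node >= t->first[k] && node < t->first[k + 1])
            return k;
    return -1;
}

/* Support of an item: a linear scan over its contiguous block of counters. */
int fp_support(const FPArr *t, int item)
{
    int s = 0;
    assert(item >= 0 && item < t->nitems);
    for (int i = t->first[item]; i < t->first[item + 1]; i++)
        s += t->cnt[i];
    return s;
}

/* Collect the items on the path from a node up to the root (the node's own
 * item first); returns the number of items written to path[]. */
int fp_path(const FPArr *t, int node, int *path, int max)
{
    int n = 0;
    while (node > 0 && n < max) {
        path[n++] = fp_item(t, node);
        node = t->par[node];
    }
    return n;
}
```

Summing the counters of one item is a linear scan over neighboring array elements, which is the access pattern the slide describes as fast; following parent indices in fp_path is the random-access part.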
Summary FP-Growth
Basic Processing Scheme
Transaction database is represented as a frequent pattern tree.
An FP-tree is projected to obtain a conditional database.
Recursive processing of the conditional database.
Advantages
Often the fastest algorithm or among the fastest algorithms.
Disadvantages
More difficult to implement than other approaches, complex data structure.
An FP-tree can need more memory than a list or array of transactions.
Software
http://www.borgelt.net/fpgrowth.html
Experimental Comparison
Experiments: Data Sets
Chess
A data set listing chess end game positions for king vs. king and rook.
This data set is part of the UCI machine learning repository.
75 items, 3196 transactions
(average) transaction size: 37, density: 0.5
Census
A data set derived from an extract of the US census bureau data of 1994, which was preprocessed by discretizing numeric attributes.
This data set is part of the UCI machine learning repository.
135 items, 48842 transactions
(average) transaction size: 14, density: 0.1
T10I4D100K
An artificial data set generated with IBM's data generator.
The name is formed from the parameters given to the generator (for example: 100K = 100000 transactions).
870 items, 100000 transactions
average transaction size: 10.1, density: 0.012
BMS-Webview-1
A web click stream from a leg-care company that no longer exists.
It has been used in the KDD cup 2000 and is a popular benchmark.
497 items, 59602 transactions
average transaction size: 2.5, density: 0.005
The density of a transaction database is the average fraction of all items occurring per transaction: density = average transaction size / number of items.
Experiments: Programs and Test System
All programs are my own implementations.
All use the same code for reading the transaction database and for writing the found frequent item sets.
Therefore differences in speed can only be the effect of the processing schemes.
These programs and their source code can be found on my web site:
http://www.borgelt.net/fpm.html
Apriori http://www.borgelt.net/apriori.html
Eclat http://www.borgelt.net/eclat.html
FP-Growth http://www.borgelt.net/fpgrowth.html
RElim http://www.borgelt.net/relim.html
SaM http://www.borgelt.net/sam.html
The test system was an IBM/Lenovo X60s laptop (Intel Centrino Duo L2400, 1.66 GHz, 1 GB main memory) running SuSE Linux 10.3; programs were compiled with gcc 4.2.1.
Experiments: Execution Times
[Figure: four panels (chess, T10I4D100K, census, webview1) comparing apriori, eclat, fpgrowth, relim and sam; the T10I4D100K panel additionally shows relim -h.]
Decimal logarithm of execution time in seconds over absolute minimum support.
Reminder: Perfect Extensions
The search can be improved with so-called perfect extension pruning.
Given an item set $I$, an item $a \notin I$ is called a perfect extension of $I$, iff $I$ and $I \cup \{a\}$ have the same support (all transactions containing $I$ contain $a$).
Perfect extensions have the following properties:
If the item $a$ is a perfect extension of an item set $I$, then $a$ is also a perfect extension of any item set $J \supseteq I$ (as long as $a \notin J$).
If $I$ is a frequent item set and $X$ is the set of all perfect extensions of $I$, then all sets $I \cup J$ with $J \in 2^X$ (where $2^X$ denotes the power set of $X$) are also frequent and have the same support as $I$.
This can be exploited by collecting perfect extension items in the recursion, in a third element of a subproblem description: $S = (D, P, X)$.
Once identified, perfect extension items are no longer processed in the recursion, but are only used to generate all supersets of the prefix having the same support.
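The definition of a perfect extension can be checked directly by comparing two support values. The following C sketch does this under assumptions of its own: item sets and transactions are encoded as bit masks (bit i stands for item i), and the names ItemSet, support, is_perfect_ext and perfect_exts are hypothetical helpers, not part of the programs mentioned in these slides.

```c
#include <assert.h>

/* Item sets and transactions are bit masks: bit i set <=> item i contained. */
typedef unsigned ItemSet;

/* Support of I: number of transactions t[k] with I being a subset of t[k]. */
int support(const ItemSet *t, int n, ItemSet I)
{
    int s = 0;
    for (int k = 0; k < n; k++)
        if ((I & ~t[k]) == 0) s++;
    return s;
}

/* 1 if item a (not contained in I) is a perfect extension of I, else 0. */
int is_perfect_ext(const ItemSet *t, int n, ItemSet I, int a)
{
    assert(a >= 0 && a < (int)(8 * sizeof(ItemSet)));
    ItemSet bit = (ItemSet)1 << a;
    if (I & bit) return 0;              /* a must not be in I */
    return support(t, n, I) == support(t, n, I | bit);
}

/* Set X of all perfect extensions of I among the items 0 .. nitems-1. */
ItemSet perfect_exts(const ItemSet *t, int n, int nitems, ItemSet I)
{
    ItemSet X = 0;
    for (int a = 0; a < nitems; a++)
        if (is_perfect_ext(t, n, I, a)) X |= (ItemSet)1 << a;
    return X;
}
```

With a = bit 0, ..., e = bit 4 and the ten transactions {a,d,e}, {b,c,d}, {a,c,e}, {a,c,d,e}, {a,e}, {a,c,d}, {b,c}, {a,c,d,e}, {b,c,e}, {a,d,e}, the item c comes out as a perfect extension of {b} and the item a as a perfect extension of {d,e}.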
Experiments: Perfect Extension Pruning
[Figure: four panels (chess, T10I4D100K, census, webview1) comparing apriori, eclat and fpgrowth with and without perfect extension pruning (w/o pep).]
Decimal logarithm of execution time in seconds over absolute minimum support.
Reducing the Output:
Closed and Maximal Item Sets
Maximal Item Sets
Consider the set of maximal (frequent) item sets:
$M_T(s_{\min}) = \{ I \subseteq B \mid s_T(I) \ge s_{\min} \wedge \forall J \supset I: s_T(J) < s_{\min} \}$.
That is: An item set is maximal if it is frequent, but none of its proper supersets is frequent.
Since with this definition we know that
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; I \in M_T(s_{\min}) \vee \exists J \supset I: s_T(J) \ge s_{\min}$
it follows (can easily be proven by successively extending the item set $I$)
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; \exists J \in M_T(s_{\min}): I \subseteq J$.
That is: Every frequent item set has a maximal superset.
Therefore: $\forall s_{\min}: \; F_T(s_{\min}) = \bigcup_{I \in M_T(s_{\min})} 2^I$
Mathematical Excursion: Maximal Elements
Let $R$ be a subset of a partially ordered set $(S, \le)$.
An element $x \in R$ is called maximal or a maximal element of $R$ if
$\forall y \in R: \; x \le y \Rightarrow x = y$.
The notions minimal and minimal element are defined analogously.
Maximal elements need not be unique, because there may be elements $x, y \in R$ with neither $x \le y$ nor $y \le x$.
Infinite partially ordered sets need not possess a maximal/minimal element.
Here we consider the set $F_T(s_{\min})$ as a subset of the partially ordered set $(2^B, \subseteq)$:
The maximal (frequent) item sets are the maximal elements of $F_T(s_{\min})$:
$M_T(s_{\min}) = \{ I \in F_T(s_{\min}) \mid \forall J \in F_T(s_{\min}): I \subseteq J \Rightarrow I = J \}$.
That is, no superset of a maximal (frequent) item set is frequent.
Maximal Item Sets: Example
transaction database
 1: {a, d, e}
 2: {b, c, d}
 3: {a, c, e}
 4: {a, c, d, e}
 5: {a, e}
 6: {a, c, d}
 7: {b, c}
 8: {a, c, d, e}
 9: {b, c, e}
10: {a, d, e}
frequent item sets (item set: support)
0 items    1 item     2 items       3 items
∅: 10      {a}: 7     {a, c}: 4     {a, c, d}: 3
           {b}: 3     {a, d}: 5     {a, c, e}: 3
           {c}: 7     {a, e}: 6     {a, d, e}: 4
           {d}: 6     {b, c}: 3
           {e}: 7     {c, d}: 4
                      {c, e}: 4
                      {d, e}: 4
The maximal item sets are:
{b, c}, {a, c, d}, {a, c, e}, {a, d, e}.
Every frequent item set is a subset of at least one of these sets.
Hasse Diagram and Maximal Item Sets
Hasse diagram with maximal item sets ($s_{\min} = 3$) for the transaction database of the preceding example.
[Figure legend: red boxes are maximal item sets, white boxes infrequent item sets.]
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Limits of Maximal Item Sets
The set of maximal item sets captures the set of all frequent item sets, but then we know at most the support of the maximal item sets exactly.
About the support of a non-maximal frequent item set we only know:
$\forall s_{\min}: \forall I \in F_T(s_{\min}) \setminus M_T(s_{\min}): \; s_T(I) \ge \max_{J \in M_T(s_{\min}),\, J \supset I} s_T(J)$.
This relation follows immediately from $\forall I: \forall J \supseteq I: s_T(I) \ge s_T(J)$, that is, an item set cannot have a lower support than any of its supersets.
Note that we have generally
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; s_T(I) \ge \max_{J \in M_T(s_{\min}),\, J \supseteq I} s_T(J)$.
Question: Can we find a subset of the set of all frequent item sets, which also preserves knowledge of all support values?
Closed Item Sets
Consider the set of closed (frequent) item sets:
$C_T(s_{\min}) = \{ I \subseteq B \mid s_T(I) \ge s_{\min} \wedge \forall J \supset I: s_T(J) < s_T(I) \}$.
That is: An item set is closed if it is frequent, but none of its proper supersets has the same support.
Since with this definition we know that
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; I \in C_T(s_{\min}) \vee \exists J \supset I: s_T(J) = s_T(I)$
it follows (can easily be proven by successively extending the item set $I$)
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; \exists J \in C_T(s_{\min}): I \subseteq J$.
That is: Every frequent item set has a closed superset.
Therefore: $\forall s_{\min}: \; F_T(s_{\min}) = \bigcup_{I \in C_T(s_{\min})} 2^I$
Closed Item Sets
However, not only has every frequent item set a closed superset, but it has a closed superset with the same support:
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; \exists J \supseteq I: J \in C_T(s_{\min}) \wedge s_T(J) = s_T(I)$.
(Proof: see (also) the considerations on the next slide.)
The set of all closed item sets preserves knowledge of all support values:
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; s_T(I) = \max_{J \in C_T(s_{\min}),\, J \supseteq I} s_T(J)$.
Note that the weaker statement
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; s_T(I) \ge \max_{J \in C_T(s_{\min}),\, J \supseteq I} s_T(J)$
follows immediately from $\forall I: \forall J \supseteq I: s_T(I) \ge s_T(J)$, that is, an item set cannot have a lower support than any of its supersets.
Closed Item Sets
Alternative characterization of closed (frequent) item sets:
$I \text{ is closed} \;\Leftrightarrow\; s_T(I) \ge s_{\min} \wedge I = \bigcap_{k \in K_T(I)} t_k$.
Reminder: $K_T(I) = \{ k \in \{1, \ldots, n\} \mid I \subseteq t_k \}$ is the cover of $I$ w.r.t. $T$.
This is derived as follows: since $\forall k \in K_T(I): I \subseteq t_k$, it is obvious that
$\forall s_{\min}: \forall I \in F_T(s_{\min}): \; I \subseteq \bigcap_{k \in K_T(I)} t_k$.
If $I \subset \bigcap_{k \in K_T(I)} t_k$, it is not closed, since $\bigcap_{k \in K_T(I)} t_k$ has the same support.
On the other hand, no superset of $\bigcap_{k \in K_T(I)} t_k$ has the cover $K_T(I)$.
Note that the above characterization allows us to construct for any item set the (uniquely determined) closed superset that has the same support.
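The construction of the closed superset can be written down almost literally: intersect all transactions in the cover of the item set. The following C sketch assumes a bit-mask encoding of item sets (bit i stands for item i); the names ItemSet, cover_size, closure and is_closed are hypothetical, and the treatment of an empty cover (the result is the item base) is a convention chosen for this sketch.

```c
#include <assert.h>

typedef unsigned ItemSet;   /* bit i set <=> item i is contained */

/* Size of the cover of I, that is, the support of I. */
int cover_size(const ItemSet *t, int n, ItemSet I)
{
    int s = 0;
    for (int k = 0; k < n; k++)
        if ((I & ~t[k]) == 0) s++;
    return s;
}

/* Closure of I: intersection of all transactions that contain I.
 * If no transaction contains I, the intersection over the empty cover
 * is taken to be the item base (a convention of this sketch). */
ItemSet closure(const ItemSet *t, int n, ItemSet base, ItemSet I)
{
    ItemSet c = base;
    assert((I & ~base) == 0);           /* I must be a subset of the base */
    for (int k = 0; k < n; k++)
        if ((I & ~t[k]) == 0) c &= t[k];
    return c;
}

/* I is closed iff it is frequent and equal to its closure. */
int is_closed(const ItemSet *t, int n, ItemSet base, ItemSet I, int smin)
{
    return cover_size(t, n, I) >= smin && closure(t, n, base, I) == I;
}
```

For the ten-transaction example database of these slides (a = bit 0, ..., e = bit 4) the closure of {b} is {b,c} and the closure of {d,e} is {a,d,e}; in both cases the closure has the same cover, and hence the same support, as the original item set.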
Mathematical Excursion: Closure Operators
A closure operator on a set $S$ is a function $cl: 2^S \to 2^S$, which satisfies the following conditions $\forall X, Y \subseteq S$:
$X \subseteq cl(X)$ ($cl$ is extensive)
$X \subseteq Y \Rightarrow cl(X) \subseteq cl(Y)$ ($cl$ is increasing or monotone)
$cl(cl(X)) = cl(X)$ ($cl$ is idempotent)
A set $R \subseteq S$ is called closed if it is equal to its closure:
$R \text{ is closed} \;\Leftrightarrow\; R = cl(R)$.
The closed (frequent) item sets are induced by the closure operator
$cl(I) = \bigcap_{k \in K_T(I)} t_k$
restricted to the set of frequent item sets:
$C_T(s_{\min}) = \{ I \in F_T(s_{\min}) \mid I = cl(I) \}$
Mathematical Excursion: Galois Connections
Let $(X, \preceq_X)$ and $(Y, \preceq_Y)$ be two partially ordered sets.
A function pair $(f_1, f_2)$ with $f_1: X \to Y$ and $f_2: Y \to X$ is called a (monotone) Galois connection iff
$\forall A_1, A_2 \in X: \; A_1 \preceq_X A_2 \Rightarrow f_1(A_1) \preceq_Y f_1(A_2)$,
$\forall B_1, B_2 \in Y: \; B_1 \preceq_Y B_2 \Rightarrow f_2(B_1) \preceq_X f_2(B_2)$,
$\forall A \in X: \forall B \in Y: \; A \preceq_X f_2(B) \Leftrightarrow B \preceq_Y f_1(A)$.
A function pair $(f_1, f_2)$ with $f_1: X \to Y$ and $f_2: Y \to X$ is called an anti-monotone Galois connection iff
$\forall A_1, A_2 \in X: \; A_1 \preceq_X A_2 \Rightarrow f_1(A_1) \succeq_Y f_1(A_2)$,
$\forall B_1, B_2 \in Y: \; B_1 \preceq_Y B_2 \Rightarrow f_2(B_1) \succeq_X f_2(B_2)$,
$\forall A \in X: \forall B \in Y: \; A \preceq_X f_2(B) \Leftrightarrow B \preceq_Y f_1(A)$.
In a monotone Galois connection, both $f_1$ and $f_2$ are monotone, in an anti-monotone Galois connection, both $f_1$ and $f_2$ are anti-monotone.
Mathematical Excursion: Galois Connections
Let the two sets $X$ and $Y$ be power sets of some sets $U$ and $V$, respectively, and let the partial orders be the subset relations on these power sets, that is, let
$(X, \preceq_X) = (2^U, \subseteq)$ and $(Y, \preceq_Y) = (2^V, \subseteq)$.
Then the combination $f_1 \circ f_2: X \to X$ of the functions of a Galois connection is a closure operator (as well as the combination $f_2 \circ f_1: Y \to Y$).
(i) $\forall A \subseteq U: \; A \subseteq f_2(f_1(A))$ (a closure operator is extensive):
Since $(f_1, f_2)$ is a Galois connection, we know
$\forall A \subseteq U: \forall B \subseteq V: \; A \subseteq f_2(B) \Leftrightarrow B \subseteq f_1(A)$.
Choose $B = f_1(A)$:
$\forall A \subseteq U: \; A \subseteq f_2(f_1(A)) \Leftrightarrow \underbrace{f_1(A) \subseteq f_1(A)}_{=\text{true}}$.
Choose $A = f_2(B)$:
$\forall B \subseteq V: \; \underbrace{f_2(B) \subseteq f_2(B)}_{=\text{true}} \Leftrightarrow B \subseteq f_1(f_2(B))$.
Mathematical Excursion: Galois Connections
(ii) $\forall A_1, A_2 \subseteq U: \; A_1 \subseteq A_2 \Rightarrow f_2(f_1(A_1)) \subseteq f_2(f_1(A_2))$
(a closure operator is increasing or monotone):
This property follows immediately from the fact that the functions $f_1$ and $f_2$ are both (anti-)monotone.
If $f_1$ and $f_2$ are both monotone, we have
$\forall A_1, A_2 \subseteq U: \; A_1 \subseteq A_2$
$\Rightarrow \forall A_1, A_2 \subseteq U: \; f_1(A_1) \subseteq f_1(A_2)$
$\Rightarrow \forall A_1, A_2 \subseteq U: \; f_2(f_1(A_1)) \subseteq f_2(f_1(A_2))$.
If $f_1$ and $f_2$ are both anti-monotone, we have
$\forall A_1, A_2 \subseteq U: \; A_1 \subseteq A_2$
$\Rightarrow \forall A_1, A_2 \subseteq U: \; f_1(A_1) \supseteq f_1(A_2)$
$\Rightarrow \forall A_1, A_2 \subseteq U: \; f_2(f_1(A_1)) \subseteq f_2(f_1(A_2))$.
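Mathematical Excursion: Galois Connections
(iii) $\forall A \subseteq U: \; f_2(f_1(f_2(f_1(A)))) = f_2(f_1(A))$ (a closure operator is idempotent):
Since both $f_1 \circ f_2$ and $f_2 \circ f_1$ are extensive (see above), we know
$\forall A \subseteq U: \; A \subseteq f_2(f_1(A)) \subseteq f_2(f_1(f_2(f_1(A))))$
$\forall B \subseteq V: \; B \subseteq f_1(f_2(B)) \subseteq f_1(f_2(f_1(f_2(B))))$
Choosing $B = f_1(A')$ with $A' \subseteq U$, we obtain
$\forall A' \subseteq U: \; f_1(A') \subseteq f_1(f_2(f_1(f_2(f_1(A')))))$.
Since $(f_1, f_2)$ is a Galois connection, we know
$\forall A \subseteq U: \forall B \subseteq V: \; A \subseteq f_2(B) \Leftrightarrow B \subseteq f_1(A)$.
Choosing $A = f_2(f_1(f_2(f_1(A'))))$ and $B = f_1(A')$, we obtain
$\forall A' \subseteq U: \; f_2(f_1(f_2(f_1(A')))) \subseteq f_2(f_1(A')) \Leftrightarrow \underbrace{f_1(A') \subseteq f_1(f_2(f_1(f_2(f_1(A')))))}_{=\text{true (see above)}}$.
Together with the inclusion $f_2(f_1(A')) \subseteq f_2(f_1(f_2(f_1(A'))))$ from extensivity, this yields the equality.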
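Galois Connections in Frequent Item Set Mining
Consider the partially ordered sets $(2^B, \subseteq)$ and $(2^{\{1,\ldots,n\}}, \subseteq)$.
Let $f_1: 2^B \to 2^{\{1,\ldots,n\}}, \; I \mapsto K_T(I) = \{ k \in \{1, \ldots, n\} \mid I \subseteq t_k \}$
and $f_2: 2^{\{1,\ldots,n\}} \to 2^B, \; J \mapsto \bigcap_{j \in J} t_j = \{ i \in B \mid \forall j \in J: i \in t_j \}$.
The function pair $(f_1, f_2)$ is an anti-monotone Galois connection:
$\forall I_1, I_2 \in 2^B: \; I_1 \subseteq I_2 \Rightarrow f_1(I_1) = K_T(I_1) \supseteq K_T(I_2) = f_1(I_2)$,
$\forall J_1, J_2 \in 2^{\{1,\ldots,n\}}: \; J_1 \subseteq J_2 \Rightarrow f_2(J_1) = \bigcap_{k \in J_1} t_k \supseteq \bigcap_{k \in J_2} t_k = f_2(J_2)$,
$\forall I \in 2^B: \forall J \in 2^{\{1,\ldots,n\}}: \; I \subseteq f_2(J) = \bigcap_{j \in J} t_j \Leftrightarrow J \subseteq f_1(I) = K_T(I)$.
As a consequence $f_1 \circ f_2: 2^B \to 2^B, \; I \mapsto \bigcap_{k \in K_T(I)} t_k$ is a closure operator.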
Galois Connections in Frequent Item Set Mining
Likewise $f_2 \circ f_1: 2^{\{1,\ldots,n\}} \to 2^{\{1,\ldots,n\}}, \; J \mapsto K_T(\bigcap_{j \in J} t_j)$ is also a closure operator.
Furthermore, if we restrict our considerations to the respective sets of closed sets in both domains, that is, to the sets
$\mathcal{C}_B = \{ I \subseteq B \mid I = f_2(f_1(I)) = \bigcap_{k \in K_T(I)} t_k \}$ and
$\mathcal{C}_T = \{ J \subseteq \{1, \ldots, n\} \mid J = f_1(f_2(J)) = K_T(\bigcap_{j \in J} t_j) \}$,
there exists a 1-to-1 relationship between these two sets, which is described by the Galois connection:
$f_1' = f_1|_{\mathcal{C}_B}$ is a bijection with $f_1'^{-1} = f_2' = f_2|_{\mathcal{C}_T}$.
(This follows immediately from the facts that the Galois connection describes closure operators and that a closure operator is idempotent.)
Therefore finding closed item sets with a given minimum support is equivalent to finding closed sets of transaction identifiers of a given minimum size.
Closed Item Sets: Example
transaction database
 1: {a, d, e}
 2: {b, c, d}
 3: {a, c, e}
 4: {a, c, d, e}
 5: {a, e}
 6: {a, c, d}
 7: {b, c}
 8: {a, c, d, e}
 9: {b, c, e}
10: {a, d, e}
frequent item sets (item set: support)
0 items    1 item     2 items       3 items
∅: 10      {a}: 7     {a, c}: 4     {a, c, d}: 3
           {b}: 3     {a, d}: 5     {a, c, e}: 3
           {c}: 7     {a, e}: 6     {a, d, e}: 4
           {d}: 6     {b, c}: 3
           {e}: 7     {c, d}: 4
                      {c, e}: 4
                      {d, e}: 4
All frequent item sets are closed with the exception of {b} and {d, e}.
{b} is a subset of {b, c}, both have a support of 3 (30%).
{d, e} is a subset of {a, d, e}, both have a support of 4 (40%).
Hasse diagram and Closed Item Sets
Hasse diagram with closed item sets ($s_{\min} = 3$) for the transaction database of the preceding example.
[Figure legend: red boxes are closed item sets, white boxes infrequent item sets.]
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Closed Item Sets and Perfect Extensions
transaction database
 1: {a, d, e}
 2: {b, c, d}
 3: {a, c, e}
 4: {a, c, d, e}
 5: {a, e}
 6: {a, c, d}
 7: {b, c}
 8: {a, c, d, e}
 9: {b, c, e}
10: {a, d, e}
frequent item sets (item set: support)
0 items    1 item     2 items       3 items
∅: 10      {a}: 7     {a, c}: 4     {a, c, d}: 3
           {b}: 3     {a, d}: 5     {a, c, e}: 3
           {c}: 7     {a, e}: 6     {a, d, e}: 4
           {d}: 6     {b, c}: 3
           {e}: 7     {c, d}: 4
                      {c, e}: 4
                      {d, e}: 4
c is a perfect extension of {b} as {b} and {b, c} both have support 3.
a is a perfect extension of {d, e} as {d, e} and {a, d, e} both have support 4.
Non-closed item sets possess at least one perfect extension, closed item sets do not possess any perfect extensions.
Relation of Maximal and Closed Item Sets
[Figure: two sketches of the item set lattice between the empty set and the item base, one marking the maximal (frequent) item sets, the other the closed (frequent) item sets.]
The set of closed item sets is the union of the sets of maximal item sets for all minimum support values at least as large as $s_{\min}$:
$C_T(s_{\min}) = \bigcup_{s \in \{s_{\min}, s_{\min}+1, \ldots, n-1, n\}} M_T(s)$
Types of Frequent Item Sets: Summary
Frequent Item Set
Any frequent item set (support is at least the minimum support):
$I \text{ frequent} \;\Leftrightarrow\; s_T(I) \ge s_{\min}$
Closed (Frequent) Item Set
A frequent item set is called closed if no proper superset has the same support:
$I \text{ closed} \;\Leftrightarrow\; s_T(I) \ge s_{\min} \wedge \forall J \supset I: s_T(J) < s_T(I)$
Maximal (Frequent) Item Set
A frequent item set is called maximal if no proper superset is frequent:
$I \text{ maximal} \;\Leftrightarrow\; s_T(I) \ge s_{\min} \wedge \forall J \supset I: s_T(J) < s_{\min}$
Obvious relations between these types of item sets:
All maximal item sets and all closed item sets are frequent.
All maximal item sets are closed.
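The three definitions can be compared on a small database by brute force. The following C sketch enumerates all subsets of a small item base and counts the frequent, closed and maximal item sets; the bit-mask encoding and the names ItemSet, Counts, supp and classify are assumptions of this illustration. Since support cannot increase when items are added, it is sufficient to test the supersets with exactly one additional item.

```c
#include <assert.h>

typedef unsigned ItemSet;   /* bit i set <=> item i is contained */

typedef struct { int frequent, closed, maximal; } Counts;

/* Support of I: number of transactions that contain I. */
static int supp(const ItemSet *t, int n, ItemSet I)
{
    int s = 0;
    for (int k = 0; k < n; k++)
        if ((I & ~t[k]) == 0) s++;
    return s;
}

/* Count frequent, closed and maximal item sets by enumerating all
 * subsets of the item base (only feasible for very few items). */
Counts classify(const ItemSet *t, int n, int nitems, int smin)
{
    Counts c = {0, 0, 0};
    assert(nitems > 0 && nitems < 20);
    for (ItemSet I = 0; I < ((ItemSet)1 << nitems); I++) {
        int s = supp(t, n, I);
        if (s < smin) continue;             /* not frequent */
        int closed = 1, maximal = 1;
        for (int a = 0; a < nitems; a++) {
            ItemSet J = I | ((ItemSet)1 << a);
            if (J == I) continue;           /* a is already in I */
            int sj = supp(t, n, J);
            if (sj == s)    closed  = 0;    /* superset with same support */
            if (sj >= smin) maximal = 0;    /* frequent superset */
        }
        c.frequent++;
        c.closed  += closed;
        c.maximal += maximal;
    }
    return c;
}
```

For the ten-transaction example database (a = bit 0, ..., e = bit 4) and a minimum support of 3 this counts 16 frequent, 14 closed and 4 maximal item sets, in agreement with the example tables.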
Types of Frequent Item Sets: Summary
frequent item sets (item set: support; + marks closed, * marks maximal item sets)
0 items     1 item      2 items         3 items
∅+: 10      {a}+: 7     {a, c}+: 4      {a, c, d}+*: 3
            {b}: 3      {a, d}+: 5      {a, c, e}+*: 3
            {c}+: 7     {a, e}+: 6      {a, d, e}+*: 4
            {d}+: 6     {b, c}+*: 3
            {e}+: 7     {c, d}+: 4
                        {c, e}+: 4
                        {d, e}: 4
Frequent Item Set
Any frequent item set (support is at least the minimum support).
Closed (Frequent) Item Set (marked with +)
A frequent item set is called closed if no proper superset has the same support.
Maximal (Frequent) Item Set (marked with *)
A frequent item set is called maximal if no proper superset is frequent.
Types of Frequent Item Sets: Experiments
[Figure: four panels (chess, T10I4D100K, census, webview1) showing the number of frequent, closed and maximal item sets.]
Decimal logarithm of the number of item sets over absolute minimum support.
Searching for Closed and Maximal Item Sets
Searching for Closed Frequent Item Sets
We know that it suffices to find the closed item sets together with their support: from them all frequent item sets and their support can be retrieved.
The characterization of closed item sets by
$I \text{ closed} \;\Leftrightarrow\; s_T(I) \ge s_{\min} \wedge I = \bigcap_{k \in K_T(I)} t_k$
suggests to find them by forming all possible intersections of the transactions (with at least $s_{\min}$ transactions) and checking their support.
However, on standard data sets, approaches using this idea are rarely competitive with other methods.
Special cases in which they are competitive are domains with few transactions and very many items.
Examples of such domains are gene expression analysis and the analysis of document collections.
Carpenter
[Pan, Cong, Tung, Yang, and Zaki 2003]
Carpenter: Enumerating Transaction Sets
The Carpenter algorithm implements the intersection approach by enumerating sets of transactions (or, equivalently, sets of transaction indices), intersecting them, and removing/pruning possible duplicates.
This is done with basically the same divide-and-conquer scheme as for the item set enumeration approaches, only that it is applied to transactions (that is, items and transactions exchange their meaning [Rioult et al. 2003]).
The task to enumerate all transaction index sets is split into two sub-tasks:
enumerate all transaction index sets that contain the index 1
enumerate all transaction index sets that do not contain the index 1.
These sub-tasks are then further divided w.r.t. the transaction index 2:
enumerate all transaction index sets containing
both indices 1 and 2, index 2, but not index 1,
index 1, but not index 2, neither index 1 nor index 2,
and so on recursively.
Carpenter: Enumerating Transaction Sets
Formally, all subproblems in the recursion can be described by triples $S = (I, K, k)$.
$K \subseteq \{1, \ldots, n\}$ is a set of transaction indices,
$I = \bigcap_{k \in K} t_k$ is their intersection, and
$k$ is a transaction index, namely the index of the next transaction to consider.
The initial problem, with which the recursion is started, is $S = (B, \emptyset, 1)$, where $B$ is the item base and no transactions have been intersected yet.
A subproblem $S_0 = (I_0, K_0, k_0)$ is processed as follows:
Let $K_1 = K_0 \cup \{k_0\}$ and form the intersection $I_1 = I_0 \cap t_{k_0}$.
If $I_1 = \emptyset$, do nothing (return from recursion).
If $|K_1| \ge s_{\min}$, and there is no transaction $t_j$ with $j \in \{1, \ldots, n\} \setminus K_1$ such that $I_1 \subseteq t_j$, report $I_1$ with support $s_T(I_1) = |K_1|$.
Let $k_1 = k_0 + 1$. If $k_1 \le n$, then form the subproblems $S_1 = (I_1, K_1, k_1)$ and $S_2 = (I_0, K_0, k_1)$ and process them recursively.
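The processing of a subproblem can be sketched in C as follows. This is only an illustration under stated assumptions, not the Carpenter program itself: item sets and transaction index sets are bit masks (so at most 32 items and 32 transactions), indices are 0-based, the duplicate check scans all transactions instead of using a repository, and the names Carp, ClosedSet, carp_rec, carpenter and carp_find are hypothetical. The sketch also reads the rule for an empty intersection as cutting only the branch that includes the current transaction; the branch that excludes it is still explored.

```c
#include <assert.h>

#define CARP_MAX 64               /* capacity of the result buffer */

typedef struct { unsigned items; int supp; } ClosedSet;

typedef struct {
    const unsigned *t;            /* transactions as item bit masks */
    int n, smin;                  /* number of transactions, min. support */
    ClosedSet out[CARP_MAX];      /* reported closed item sets */
    int cnt;                      /* number of reported item sets */
} Carp;

/* Process the subproblem (I0, K0, k0); size0 = |K0|, indices 0-based. */
static void carp_rec(Carp *c, unsigned I0, unsigned K0, int size0, int k0)
{
    if (k0 >= c->n) return;                 /* no transaction left */
    unsigned I1 = I0 & c->t[k0];            /* intersect with t[k0] */
    if (I1 != 0) {
        unsigned K1 = K0 | (1u << k0);      /* add index k0 to the set */
        int size1 = size0 + 1, dup = 0;
        /* is there a transaction outside K1 that contains I1? */
        for (int j = 0; j < c->n && !dup; j++)
            dup = !((K1 >> j) & 1u) && (I1 & ~c->t[j]) == 0;
        if (size1 >= c->smin && !dup) {     /* report I1 with |K1| */
            assert(c->cnt < CARP_MAX);
            c->out[c->cnt].items = I1;
            c->out[c->cnt].supp  = size1;
            c->cnt++;
        }
        carp_rec(c, I1, K1, size1, k0 + 1); /* index sets with k0 */
    }
    carp_rec(c, I0, K0, size0, k0 + 1);     /* index sets without k0 */
}

/* Find all non-empty closed item sets with support >= smin. */
void carpenter(Carp *c, const unsigned *t, int n, unsigned base, int smin)
{
    assert(n > 0 && n <= 32);
    c->t = t; c->n = n; c->smin = smin; c->cnt = 0;
    carp_rec(c, base, 0u, 0, 0);
}

/* Support of a reported item set, or -1 if it was not reported. */
int carp_find(const Carp *c, unsigned items)
{
    for (int i = 0; i < c->cnt; i++)
        if (c->out[i].items == items) return c->out[i].supp;
    return -1;
}
```

Because an empty intersection is never reported, the empty item set does not appear in the output. On the ten-transaction example database with minimum support 3 the sketch therefore reports 13 of the 14 closed item sets.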
Carpenter: List-based Implementation
Transaction identifier lists are used to represent the current item set $I$ (vertical transaction representation, as in the Eclat algorithm).
The intersection consists in collecting all lists with the next transaction index $k$.
Example:
transaction database
t1: a b c
t2: a d e
t3: b c d
t4: a b c d
t5: b c
t6: a b d
t7: d e
t8: c d e
transaction identifier lists
a: 1, 2, 4, 6
b: 1, 3, 4, 5, 6
c: 1, 3, 4, 5, 8
d: 2, 3, 4, 6, 7, 8
e: 2, 7, 8
collection for K = {1}
a: 2, 4, 6
b: 3, 4, 5, 6
c: 3, 4, 5, 8
for K = {1, 2}, {1, 3}
a: 4, 6
b: 4, 5, 6
c: 4, 5, 8
Carpenter: Table-/Matrix-based Implementation
Represent the data set by a $n \times |B|$ matrix $M$ as follows [Borgelt et al. 2011]
$m_{ki} = \begin{cases} 0, & \text{if item } i \notin t_k, \\ |\{ j \in \{k, \ldots, n\} \mid i \in t_j \}|, & \text{otherwise.} \end{cases}$
Example:
transaction database
t1: a b c
t2: a d e
t3: b c d
t4: a b c d
t5: b c
t6: a b d
t7: d e
t8: c d e
matrix representation
      a  b  c  d  e
t1    4  5  5  0  0
t2    3  0  0  6  3
t3    0  4  4  5  0
t4    2  3  3  4  0
t5    0  2  2  0  0
t6    1  1  0  3  0
t7    0  0  0  2  2
t8    0  0  1  1  1
The current item set $I$ is simply represented by the contained items.
An intersection collects all items $i \in I$ with $m_{ki} > \max\{0, s_{\min} - |K| - 1\}$.
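The matrix can be filled in a single backward pass over the transactions, and the intersection rule then becomes a simple threshold test per item. The following C sketch illustrates both steps; the bit-mask encoding of transactions, the row-major storage of the matrix and the names carp_matrix and carp_isect are assumptions of this illustration.

```c
#include <assert.h>

/* Fill the n x nitems matrix m (row-major): m[k][i] = 0 if item i is not
 * in t[k], otherwise the number of transactions t[j], j >= k, with item i. */
void carp_matrix(const unsigned *t, int n, int nitems, int *m)
{
    int rest[32] = {0};           /* occurrences seen from the end so far */
    assert(nitems > 0 && nitems <= 32);
    for (int k = n - 1; k >= 0; k--)
        for (int i = 0; i < nitems; i++)
            m[k * nitems + i] = ((t[k] >> i) & 1u) ? ++rest[i] : 0;
}

/* Intersect the item set I with transaction k (0-based row index), where
 * ksize = |K| is the number of transactions intersected so far: keep the
 * items i in I with m[k][i] > max(0, smin - ksize - 1). */
unsigned carp_isect(const int *m, int nitems, unsigned I,
                    int k, int ksize, int smin)
{
    int thresh = smin - ksize - 1;
    unsigned r = 0;
    if (thresh < 0) thresh = 0;
    for (int i = 0; i < nitems; i++)
        if (((I >> i) & 1u) && m[k * nitems + i] > thresh)
            r |= 1u << i;
    return r;
}
```

The threshold removes an item not only if it is missing from the transaction, but also if it occurs too rarely in the remaining transactions to reach the minimum support. For the example above, intersecting the item base with t1 keeps a, b and c for a minimum support of 1, but only b and c for a minimum support of 5.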
Carpenter: Duplicate Removal
The intersection of several transaction index sets can yield the same item set.
The support of the item set is the size of the largest transaction index set that yields the item set; smaller transaction index sets can be skipped/ignored.
This is the reason for the check whether there exists a transaction $t_j$ with $j \in \{1, \ldots, n\} \setminus K_1$ such that $I_1 \subseteq t_j$.
This check is split into the two checks whether there exists such a transaction $t_j$
with $j > k_0$ and with $j \in \{1, \ldots, k_0 - 1\} \setminus K_0$.
The first check is easy, because such transactions are considered in the recursive processing which can return whether one exists.
The problematic second check is solved by maintaining a repository of already found closed frequent item sets.
In order to make the look-up in the repository efficient, it is laid out as a prefix tree with a flat array top level.
Summary Carpenter
Basic Processing Scheme
Enumeration of transaction sets (transaction identifier sets)
Intersection of the transactions in any set yields a closed item set
Duplicate removal is done with a repository (prefix tree)
Advantages
Effectively linear in the number of items
Very fast for transaction databases with many more items than transactions
Disadvantages
Exponential in the number of transactions
Very slow for transaction databases with many more transactions than items
Software
http://www.borgelt.net/carpenter.html
IsTa
Intersecting Transactions
[Mielikäinen 2003] (simple repository, no prefix tree)
[Borgelt, Yang, Nogales-Cadenas, Carmona-Saez, and Pascual-Montano 2011]
Ista: Cumulative Transaction Intersections
Alternative approach: maintain a repository of all closed item sets, which is updated by intersecting it with the next transaction [Mielikäinen 2003].
To justify this approach formally, we consider the set of all closed frequent item sets for $s_{\min} = 1$, that is, the set
$\mathcal{C}_T(1) = \{ I \subseteq B \mid \exists S \subseteq T: S \ne \emptyset \wedge I = \bigcap_{t \in S} t \}$.
The set $\mathcal{C}_T(1)$ satisfies the following simple recursive relation:
$\mathcal{C}_{\emptyset}(1) = \emptyset$,
$\mathcal{C}_{T \cup \{t\}}(1) = \mathcal{C}_T(1) \cup \{t\} \cup \{ I \mid \exists s \in \mathcal{C}_T(1): I = s \cap t \}$.
As a consequence, we can start the procedure with an empty set of closed item sets and then process the transactions one by one.
In each step the set of closed item sets is updated by adding the new transaction $t$ itself and the additional closed item sets that result from intersecting it with $\mathcal{C}_T(1)$.
In addition, the support of already known closed item sets may have to be updated.
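The recursive relation translates directly into an update of a flat repository. The following C sketch uses a plain array of (item set, support) pairs, in the spirit of the simple repository without prefix tree; it is not the IsTa program. The bit-mask encoding, the capacity REPO_MAX, the decision to ignore empty intersections, and all function names are assumptions of this illustration.

```c
#include <assert.h>

#define REPO_MAX 256              /* capacity of the repository */

typedef struct { unsigned items; int supp; } Entry;
typedef struct { Entry e[REPO_MAX]; int n; } Repo;

/* Index of an item set in an entry array, or -1 if it is not stored. */
static int find(const Entry *e, int n, unsigned items)
{
    for (int i = 0; i < n; i++)
        if (e[i].items == items) return i;
    return -1;
}

/* Support of a stored closed item set, or -1 if it is not stored. */
int repo_supp(const Repo *r, unsigned items)
{
    int i = find(r->e, r->n, items);
    return (i >= 0) ? r->e[i].supp : -1;
}

/* Update the repository with the next transaction t. */
void repo_add(Repo *r, unsigned t)
{
    Entry cand[REPO_MAX + 1];     /* t and its intersections */
    int m = 0;
    if (t == 0) return;           /* empty transactions are ignored */
    cand[m].items = t; cand[m].supp = 1; m++;
    for (int i = 0; i < r->n; i++) {
        unsigned I = r->e[i].items & t;
        int s = r->e[i].supp + 1; /* support of s plus the new transaction */
        if (I == 0) continue;     /* empty intersections are not stored */
        int j = find(cand, m, I);
        if (j < 0) { cand[m].items = I; cand[m].supp = s; m++; }
        else if (cand[j].supp < s) cand[j].supp = s;
    }
    for (int j = 0; j < m; j++) { /* merge candidates into the repository */
        int i = find(r->e, r->n, cand[j].items);
        if (i >= 0) r->e[i].supp = cand[j].supp;
        else { assert(r->n < REPO_MAX); r->e[r->n++] = cand[j]; }
    }
}

/* Number of stored closed item sets with support >= smin. */
int repo_count(const Repo *r, int smin)
{
    int c = 0;
    for (int i = 0; i < r->n; i++)
        if (r->e[i].supp >= smin) c++;
    return c;
}
```

The support of an intersection is taken as the maximum over all stored sets that produce it, plus one for the new transaction. After all ten transactions of the example database have been added, filtering the repository with a minimum support of 3 leaves the 13 non-empty closed item sets.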
Ista: Cumulative Transaction Intersections
The core implementation problem is to find a data structure for storing the closed item sets that allows to quickly compute the intersections with a new transaction and to merge the result with the already stored closed item sets.
For this we rely on a prefix tree, each node of which represents an item set.
The algorithm works on the prefix tree as follows:
At the beginning an empty tree is created (dummy root node); then the transactions are processed one by one.
Each new transaction is first simply added to the prefix tree.
Any new nodes created in this step are initialized with a support of zero.
In the next step we compute the intersections of the new transaction with all item sets represented by the current prefix tree.
A recursive procedure traverses the prefix tree selectively (depth-first) and matches the items in the tree nodes with the items of the transaction.
Intersecting with and inserting into the tree can be combined.
Ista: Cumulative Transaction Intersections
transaction database
t1: e c a
t2: e d b
t3: d c b a
[Figure: the prefix tree (nodes labeled with item and support) in the states 0, 1, 2, 3.1, 3.2 and 3.3, that is, initially, after the first and the second transaction, and in three stages of processing the third transaction.]
Ista: Data Structure
    typedef struct node {           /* a prefix tree node */
      int step;                     /* most recent update step */
      int item;                     /* assoc. item (last in set) */
      int supp;                     /* support of item set */
      struct node *sibling;         /* successor in sibling list */
      struct node *children;        /* list of child nodes */
    } NODE;
Standard first child / right sibling node structure
Fixed size of each node allows for optimized allocation.
Flexible structure that can easily be extended.
The "step" field indicates whether the support field was already updated.
The step field is an "incremental marker", so that it need not be cleared in a separate traversal of the prefix tree.
Ista: Pseudo-Code
    /* assumed context (not shown on the slide): trans[i] != 0 iff item i */
    /* is in the current transaction, step is the current update step and */
    /* imin is the last (smallest) item of the current transaction */
    void isect (NODE *node, NODE **ins)
    {                               /* intersect with transaction */
      int  i;                       /* buffer for current item */
      NODE *d;                      /* to allocate new nodes */
      while (node) {                /* traverse the sibling list */
        i = node->item;             /* get the current item */
        if (trans[i]) {             /* if item is in intersection */
          while ((d = *ins) && (d->item > i))
            ins = &d->sibling;      /* find the insertion position */
          if (d                     /* if an intersection node with */
          &&  (d->item == i)) {     /* the item already exists */
            if (d->step >= step) d->supp--;
            if (d->supp < node->supp)
              d->supp = node->supp;
            d->supp++;              /* update intersection support */
            d->step = step; }       /* and set current update step */
          else {                    /* if there is no corresp. node */
            d = malloc(sizeof(NODE));
            d->step = step;         /* create a new node and */
            d->item = i;            /* set item and support */
            d->supp = node->supp+1;
            d->sibling = *ins; *ins = d;
            d->children = NULL;
          }                         /* insert node into the tree */
          if (i <= imin) return;    /* if beyond last item, abort */
          isect(node->children, &d->children); }
        else {                      /* if item is not in intersection */
          if (i <= imin) return;    /* if beyond last item, abort */
          isect(node->children, ins);
        }                           /* intersect with subtree */
        node = node->sibling;       /* go to the next sibling */
      }                             /* end of while (node) */
    }  /* isect() */
Ista: Keeping the Repository Small
In practice we will not work with a minimum support $s_{\min} = 1$.
Removing intersections early, because they do not reach the minimum support is difficult: in principle, enough of the transactions to be processed in the future could contain the item set under consideration.
Improved processing with item occurrence counters:
In an initial pass the frequency of the individual items is determined.
The obtained counters are updated with each processed transaction.
They always represent the item occurrences in the unprocessed transactions.
Based on these counters, we can apply the following pruning scheme:
Suppose that after having processed $k$ of a total of $n$ transactions the support of a closed item set $I$ is $s_{T_k}(I) = x$.
Let $y$ be the minimum of the counter values for the items contained in $I$.
If $x + y < s_{\min}$, then $I$ can be discarded, because it cannot reach $s_{\min}$.
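The pruning test itself needs only the item occurrence counters. The following C sketch shows the counter update and the test; the bit-mask encoding of item sets and the names consume and can_discard are assumptions of this illustration.

```c
#include <assert.h>
#include <limits.h>

/* Update the counters after a transaction t has been processed:
 * rest[i] is the number of occurrences of item i in the transactions
 * that have not been processed yet. */
void consume(int *rest, int nitems, unsigned t)
{
    assert(nitems > 0 && nitems <= 32);
    for (int i = 0; i < nitems; i++)
        if ((t >> i) & 1u) rest[i]--;
}

/* Pruning test: x is the support of the item set I in the processed
 * transactions, y the minimum counter value over the items in I.
 * Returns 1 if x + y < smin, that is, if I cannot reach smin anymore. */
int can_discard(unsigned I, int x, const int *rest, int nitems, int smin)
{
    int y = INT_MAX;
    assert(nitems > 0 && nitems <= 32);
    for (int i = 0; i < nitems; i++)
        if (((I >> i) & 1u) && rest[i] < y) y = rest[i];
    if (y == INT_MAX) return 0;   /* empty item set: no decision here */
    return x + y < smin;
}
```

The minimum is used because every future transaction that contains I must contain each of its items, so I can occur at most y more times.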

Ista: Keeping the Repository Small
One has to be careful, though, because $I$ may be needed in order to form subsets, namely those that result from intersections of it with new transactions.
These subsets may still be frequent, even though $I$ is not.
As a consequence, an item set $I$ is not simply removed, but those items are selectively removed from it that do not occur frequently enough in the remaining transactions.
Although in this way non-closed item sets may be constructed, no problems for the final output are created:
either the reduced item set also occurs as the intersection of enough transactions and thus is closed,
or it will not reach the minimum support threshold and then it will not be reported.
Christian Borgelt Frequent Pattern Mining 212
Summary IsTa
Basic Processing Scheme
  Cumulative intersection of transactions (incremental/on-line mining)
  Combined intersection and repository extensions (one traversal)
  Additional pruning is possible for batch processing
Advantages
  Effectively linear in the number of items
  Very fast for transaction databases with many more items than transactions
Disadvantages
  Exponential in the number of transactions
  Very slow for transaction databases with many more transactions than items
Software
  http://www.borgelt.net/ista.html
Experimental Comparison
[Figure: four plots of log(time/seconds) over the minimum support, one per data set: yeast (minimum support 0 to 30), ncbi60 (46 to 54), thrombin (25 to 40) and webview tpo. (0 to 20). The curves compare FP-close, LCM3, IsTa, Carp. table and Carp. lists; the ncbi60 plot shows only IsTa, Carp. table and Carp. lists.]
Searching for Closed and Maximal Item Sets
with Item Set Enumeration
Filtering Frequent Item Sets
If only closed item sets or only maximal item sets are to be found with item set enumeration approaches, the found frequent item sets have to be filtered.
Some useful notions for filtering and pruning:
  The head $H \subseteq B$ of a search tree node is the set of items on the path leading to it. It is the prefix of the conditional database for this node.
  The tail $L \subseteq B$ of a search tree node is the set of items that are frequent in its conditional database. They are the possible extensions of $H$.
  Note that $\forall h \in H: \forall l \in L: h < l$.
  $E = \{ i \in B - H \mid \exists h \in H: h > i \}$ is the set of eliminated items. These items are not considered anymore in the corresponding subtree.
Note that the items in the tail and their support in the conditional database are known, at least after the search returns from the recursive processing.
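The following sketch computes the eliminated items and the tail for a given head. It assumes that items are single characters ordered alphabetically and that the conditional database is obtained by simple filtering; the function names are hypothetical, and the data are the ten transactions of the example database used elsewhere in these slides.

```python
def eliminated_items(head, item_base):
    """E = {i in B - H | some h in H is greater than i}."""
    return {i for i in item_base - head if any(h > i for h in head)}

def tail_items(head, item_base, transactions, s_min):
    """Items after all head items that are frequent in the conditional database."""
    cond = [t for t in transactions if head <= t]          # conditional database
    later = {i for i in item_base - head if all(h < i for h in head)}
    return {i for i in later if sum(1 for t in cond if i in t) >= s_min}

DB = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                       "acd", "bc", "acde", "bce", "ade")]
B = set("abcde")
print(eliminated_items({"a", "c"}, B), tail_items({"a", "c"}, B, DB, 3))
print(eliminated_items({"b"}, B), tail_items({"b"}, B, DB, 3))
```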
Head, Tail and Eliminated Items
[Figure: full prefix tree over the item sets below; the edge labels are not reproduced.]
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
A (full) prefix tree for the five items $a, b, c, d, e$.
The blue boxes are the frequent item sets.
For the encircled search tree nodes we have:
  red: head $H = \{b\}$, tail $L = \{c\}$, eliminated items $E = \{a\}$
  green: head $H = \{a, c\}$, tail $L = \{d, e\}$, eliminated items $E = \{b\}$
Closed and Maximal Item Sets
When filtering frequent item sets for closed and maximal item sets, the following conditions are easy and efficient to check:
  If the tail of a search tree node is not empty, its head is not a maximal item set.
  If an item in the tail of a search tree node has the same support as the head, the head is not a closed item set.
However, the inverse implications need not hold:
  If the tail of a search tree node is empty, its head is not necessarily a maximal item set.
  If no item in the tail of a search tree node has the same support as the head, the head is not necessarily a closed item set.
The problem is the eliminated items, which can still render the head non-closed or non-maximal.
Closed and Maximal Item Sets
Check the Defining Condition Directly:
Closed Item Sets:
  Check whether $\exists a \in E: K_T(H) \subseteq K_T(\{a\})$
  or check whether $\bigcap_{k \in K_T(H)} (t_k - H) \neq \emptyset$.
  If either is the case, $H$ is not closed, otherwise it is.
  Note that with the latter condition, the intersection can be computed transaction by transaction. It can be concluded that $H$ is closed as soon as the intersection becomes empty.
Maximal Item Sets:
  Check whether $\exists a \in E: s_T(H \cup \{a\}) \geq s_{\min}$.
  If this is the case, $H$ is not maximal, otherwise it is.
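A direct check of the defining conditions might look like the sketch below. It works on the full transaction database rather than on a conditional one and tests all items outside the head, not only the eliminated ones; the names `is_closed` and `is_maximal` are hypothetical.

```python
def is_closed(head, transactions):
    """H is closed iff the intersection of (t - H) over all t containing H is empty."""
    rest = None
    for t in transactions:
        if head <= t:
            rest = t - head if rest is None else rest & (t - head)
            if not rest:              # intersection became empty: stop early
                return True
    return not rest                   # also True if no transaction contains H

def is_maximal(head, item_base, transactions, s_min):
    """H is maximal iff no extension by a single item is frequent."""
    def supp(s):
        return sum(1 for t in transactions if s <= t)
    return all(supp(head | {a}) < s_min for a in item_base - head)

DB = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                       "acd", "bc", "acde", "bce", "ade")]
B = set("abcde")
print(is_closed({"d", "e"}, DB), is_closed({"a", "d"}, DB))
print(is_maximal({"a", "d", "e"}, B, DB, 3), is_maximal({"d", "e"}, B, DB, 3))
```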
Closed and Maximal Item Sets
Checking the defining condition directly is trivial for the tail items, as their support values are available from the conditional transaction databases.
As a consequence, all item set enumeration approaches for closed and maximal item sets check the defining condition for the tail items.
However, checking the defining condition can be difficult for the eliminated items, since additional data (beyond the conditional transaction database) is needed to determine their occurrences in the transactions or their support values.
It can depend on the database structure used whether a check of the defining condition is efficient for the eliminated items or not.
As a consequence, some item set enumeration algorithms do not check the defining condition for the eliminated items, but rely on a repository of already found closed or maximal item sets.
With such a repository it can be checked in an indirect way whether an item set is closed or maximal.
Checking the Eliminated Items: Repository
Each found maximal or closed item set is stored in a repository.
(Preferred data structure for the repository: prefix tree)
It is checked whether a superset of the head $H$ with the same support has already been found. If yes, the head $H$ is neither closed nor maximal.
Even more: the head $H$ need not be processed recursively, because the recursion cannot yield any closed or maximal item sets. Therefore the current subtree of the search tree can be pruned.
Note that with a repository the depth-first search has to proceed from left to right:
  We need the repository to check for possibly existing closed or maximal supersets that contain one or more eliminated item(s).
  Item sets containing eliminated items are considered only in search tree branches to the left of the considered node.
  Therefore these branches must already have been processed in order to ensure that possible supersets have already been recorded.
Checking the Eliminated Items: Repository
[Figure: full prefix tree over the item sets below; the edge labels are not reproduced.]
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
A (full) prefix tree for the five items $a, b, c, d, e$.
Suppose the prefix tree would be traversed from right to left.
For none of the frequent item sets $\{d, e\}$, $\{c, d\}$ and $\{c, e\}$ it could be determined with the help of a repository that they are not maximal, because the maximal item sets $\{a, c, d\}$, $\{a, c, e\}$, $\{a, d, e\}$ have not been processed then.
Checking the Eliminated Items: Repository
If a superset of the current head $H$ with the same support has already been found, the head $H$ need not be processed, because it cannot yield any maximal or closed item sets.
The reason is that a found proper superset $I \supset H$ with $s_T(I) = s_T(H)$ contains at least one item $i \in I - H$ that is a perfect extension of $H$.
The item $i$ is an eliminated item, that is, $i \notin L$ (item $i$ is not in the tail).
(If $i$ were in $L$, the set $I$ would not be in the repository already.)
If the item $i$ is a perfect extension of the head $H$, it is a perfect extension of all supersets $J \supseteq H$ with $i \notin J$.
All item sets explored from the search tree node with head $H$ and tail $L$ are subsets of $H \cup L$ (because only the items in $L$ are conditionally frequent).
Consequently, the item $i$ is a perfect extension of all item sets explored from the search tree node with head $H$ and tail $L$, and therefore none of them can be closed.
Checking the Eliminated Items: Repository
It is usually advantageous to use not just a single, global repository, but to create conditional repositories for each recursive call, which contain only the found closed item sets that contain $H$.
With conditional repositories the check for a known superset reduces to the check whether the conditional repository contains an item set with the next split item and the same support as the current head.
(Note that the check is executed before going into recursion, that is, before constructing the extended head of a child node. If the check finds a superset, the child node is pruned.)
The conditional repositories are obtained by basically the same operation as the conditional transaction databases (projecting/conditioning on the split item).
A popular structure for the repository is an FP-tree, because it allows for simple and efficient projection/conditioning.
However, a simple prefix tree that is projected top-down may also be used.
Closed and Maximal Item Sets: Pruning
If only closed item sets or only maximal item sets are to be found, additional pruning of the search tree becomes possible.
Perfect Extension Pruning / Parent Equivalence Pruning (PEP)
Given an item set $I$, an item $a \notin I$ is called a perfect extension of $I$, iff the item sets $I$ and $I \cup \{a\}$ have the same support: $s_T(I) = s_T(I \cup \{a\})$
(that is, if all transactions containing $I$ also contain the item $a$).
Then we know: $\forall J \supseteq I: s_T(J \cup \{a\}) = s_T(J)$.
As a consequence, no superset $J \supseteq I$ with $a \notin J$ can be closed.
Hence $a$ can be added directly to the prefix of the conditional database.
Let $X_T(I) = \{ a \mid a \notin I \wedge s_T(I \cup \{a\}) = s_T(I) \}$ be the set of all perfect extension items. Then the whole set $X_T(I)$ can be added to the prefix.
Perfect extension / parent equivalence pruning can be applied for both closed and maximal item sets, since all maximal item sets are closed.
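As a small illustration, the set $X_T(I)$ of perfect extension items can be computed by scanning the cover of $I$. The function name `perfect_extensions` is hypothetical, and the sketch scans the full database instead of a conditional one.

```python
def perfect_extensions(item_set, item_base, transactions):
    """X_T(I): items outside I contained in every transaction that contains I."""
    cover = [t for t in transactions if item_set <= t]
    # if the cover is empty, every item trivially qualifies
    return {a for a in item_base - item_set if all(a in t for t in cover)}

DB = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                       "acd", "bc", "acde", "bce", "ade")]
B = set("abcde")
print(perfect_extensions({"d", "e"}, B, DB))   # a occurs in every transaction with d and e
print(perfect_extensions({"b"}, B, DB))        # c occurs in every transaction with b
print(perfect_extensions({"a"}, B, DB))
```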
Head Union Tail Pruning
If only maximal item sets are to be found, even more additional pruning of the search tree becomes possible.
General Idea: All frequent item sets in the subtree rooted at a node with head $H$ and tail $L$ are subsets of $H \cup L$.
Maximal Item Set Contains Head $\cup$ Tail Pruning (MFIHUT)
  If we find out that $H \cup L$ is a subset of an already found maximal item set, the whole subtree can be pruned.
  This pruning method requires a left to right traversal of the prefix tree.
Frequent Head $\cup$ Tail Pruning (FHUT)
  If $H \cup L$ is not a subset of an already found maximal item set and by some clever means we discover that $H \cup L$ is frequent, $H \cup L$ can immediately be recorded as a maximal item set.
Alternative Description of Closed Item Set Mining
In order to avoid redundant search in the partially ordered set $(2^B, \subseteq)$, we assigned a unique parent item set to each item set (except the empty set).
Analogously, we may structure the set of closed item sets by assigning unique closed parent item sets. [Uno et al. 2003]
Let $\leq$ be an item order and let $I$ be a closed item set with $I \neq \bigcap_{1 \leq k \leq n} t_k$.
Let $i_* \in I$ be the (uniquely determined) item satisfying
  $s_T(\{ i \in I \mid i < i_* \}) > s_T(I)$ and $s_T(\{ i \in I \mid i \leq i_* \}) = s_T(I)$.
Intuitively, the item $i_*$ is the greatest item in $I$ that is not a perfect extension.
(All items greater than $i_*$ can be removed without affecting the support.)
Let $I_* = \{ i \in I \mid i < i_* \}$ and $X_T(I) = \{ i \in B - I \mid s_T(I \cup \{i\}) = s_T(I) \}$.
Then the canonical parent $p_C(I)$ of $I$ is the item set
  $p_C(I) = I_* \cup \{ i \in X_T(I_*) \mid i > i_* \}$.
Intuitively, to find the canonical parent of the item set $I$, the reduced item set $I_*$ is enhanced by all perfect extension items following the item $i_*$.


Alternative Description of Closed Item Set Mining
Note that $\bigcap_{1 \leq k \leq n} t_k$ is the smallest closed item set for a given database $T$.
Note also that the set $\{ i \in X_T(I_*) \mid i > i_* \}$ need not contain all items $i > i_*$, because a perfect extension of $I_* \cup \{i_*\}$ need not be a perfect extension of $I_*$, since $K_T(I_*) \supset K_T(I_* \cup \{i_*\})$.
For the recursive search, the following formulation is useful:
Let $I \subseteq B$ be a closed item set. The canonical children of $I$ (that is, the closed item sets that have $I$ as their canonical parent) are the item sets
  $J = I \cup \{i\} \cup \{ j \in X_T(I \cup \{i\}) \mid j > i \}$
  with $\forall j \in I: i > j$ and $\{ j \in X_T(I \cup \{i\}) \mid j < i \} = X_T(J) = \emptyset$.
The union with $\{ j \in X_T(I \cup \{i\}) \mid j > i \}$ represents perfect extension or parent equivalence pruning: all perfect extensions in the tail of $I \cup \{i\}$ are immediately added.
The condition $\{ j \in X_T(I \cup \{i\}) \mid j < i \} = \emptyset$ expresses that there must not be any perfect extensions among the eliminated items.
Additional Frequent Item Set Filtering
General problem of frequent item set mining:
The number of frequent item sets, even the number of closed or maximal item sets, can exceed the number of transactions in the database by far.
Therefore: Additional filtering is necessary to find the relevant or interesting frequent item sets.
General idea: Compare support to expectation.
  Item sets consisting of items that appear frequently are likely to have a high support.
  However, this is not surprising: we expect this even if the occurrence of the items is independent.
  Additional filtering should remove item sets with a support close to the support expected from an independent occurrence.
Additional Frequent Item Set Filtering
Full Independence
Evaluate item sets with
  $\varrho_{\mathrm{fi}}(I) = \frac{s_T(I) \cdot n^{|I|-1}}{\prod_{a \in I} s_T(a)} = \frac{\hat{p}_T(I)}{\prod_{a \in I} \hat{p}_T(a)}$
and require a minimum value for this measure.
($\hat{p}_T$ is the probability estimate based on $T$.)
Assumes full independence of the items in order to form an expectation about the support of an item set.
Advantage: Can be computed from only the support of the item set and the support values of the individual items.
Disadvantage: If some item set $I$ scores high on this measure, then all $J \supset I$ are also likely to score high, even if the items in $J - I$ are independent of $I$.
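The full independence measure is simple enough to compute directly from support counts. The sketch below assumes that every item of the evaluated set occurs at least once (otherwise the denominator is zero); the names `support` and `rho_fi` are chosen for this illustration only.

```python
from math import prod

def support(item_set, transactions):
    """Absolute support: number of transactions containing the item set."""
    return sum(1 for t in transactions if item_set <= t)

def rho_fi(item_set, transactions):
    """s_T(I) * n^(|I|-1) divided by the product of the single item supports."""
    n = len(transactions)
    denom = prod(support({a}, transactions) for a in item_set)
    return support(item_set, transactions) * n ** (len(item_set) - 1) / denom

DB = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                       "acd", "bc", "acde", "bce", "ade")]
print(rho_fi({"a", "e"}, DB))   # 6 * 10 / (7 * 7)
print(rho_fi({"b", "c"}, DB))   # 3 * 10 / (3 * 7)
```

A value near 1 means the observed support is close to what independent items would produce.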
Additional Frequent Item Set Filtering
Incremental Independence
Evaluate item sets with
  $\varrho_{\mathrm{ii}}(I) = \min_{a \in I} \frac{n \, s_T(I)}{s_T(I - \{a\}) \cdot s_T(a)} = \min_{a \in I} \frac{\hat{p}_T(I)}{\hat{p}_T(I - \{a\}) \cdot \hat{p}_T(a)}$
and require a minimum value for this measure.
($\hat{p}_T$ is the probability estimate based on $T$.)
Advantage: If $I$ contains independent items, the minimum ensures a low value.
Disadvantages:
  We need to know the support values of all subsets $I - \{a\}$.
  If there exist high scoring independent subsets $I_1$ and $I_2$ with $|I_1| > 1$, $|I_2| > 1$, $I_1 \cap I_2 = \emptyset$ and $I_1 \cup I_2 = I$, the item set $I$ still receives a high evaluation.
Additional Frequent Item Set Filtering
Subset Independence
Evaluate item sets with
  $\varrho_{\mathrm{si}}(I) = \min_{J \subset I, J \neq \emptyset} \frac{n \, s_T(I)}{s_T(I - J) \cdot s_T(J)} = \min_{J \subset I, J \neq \emptyset} \frac{\hat{p}_T(I)}{\hat{p}_T(I - J) \cdot \hat{p}_T(J)}$
and require a minimum value for this measure.
($\hat{p}_T$ is the probability estimate based on $T$.)
Advantage: Detects all cases where a decomposition is possible and evaluates them with a low value.
Disadvantage: We need to know the support values of all proper subsets $J$.
Improvement: Use incremental independence and in the minimum consider only items $\{a\}$ for which $I - \{a\}$ has been evaluated high.
This captures subset independence incrementally.
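The incremental and the subset independence measures can be sketched in the same brute-force style. The sketch assumes item sets with at least two items whose subsets all have non-zero support; the function names are hypothetical, and the improvement mentioned above (restricting the minimum to highly evaluated subsets) is not implemented.

```python
from itertools import combinations

def support(item_set, transactions):
    return sum(1 for t in transactions if item_set <= t)

def rho_ii(item_set, transactions):
    """Minimum over all items a of n*s(I) / (s(I - {a}) * s({a}))."""
    n, s = len(transactions), support(item_set, transactions)
    return min(n * s / (support(item_set - {a}, transactions)
                        * support({a}, transactions)) for a in item_set)

def rho_si(item_set, transactions):
    """Minimum over all non-empty proper subsets J of n*s(I) / (s(I - J) * s(J))."""
    n, s = len(transactions), support(item_set, transactions)
    values = []
    for r in range(1, len(item_set)):
        for combo in combinations(sorted(item_set), r):
            j = set(combo)
            values.append(n * s / (support(item_set - j, transactions)
                                   * support(j, transactions)))
    return min(values)

DB = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                       "acd", "bc", "acde", "bce", "ade")]
print(rho_ii({"a", "d", "e"}, DB), rho_si({"a", "d", "e"}, DB))
```

For item sets with three items both measures coincide, because every proper subset is either a single item or the complement of one.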
Summary Frequent Item Set Mining
With a canonical form of an item set the Hasse diagram can be turned into a much simpler prefix tree
(divide-and-conquer scheme using conditional databases).
Item set enumeration algorithms differ in:
  the traversal order of the prefix tree:
    (breadth-first/levelwise versus depth-first traversal)
  the transaction representation:
    horizontal (item arrays) versus vertical (transaction lists)
    versus specialized data structures like FP-trees
  the types of frequent item sets found:
    frequent versus closed versus maximal item sets
    (additional pruning methods for closed and maximal item sets)
An alternative is transaction set enumeration or intersection algorithms.
Additional filtering is necessary to reduce the size of the output.
Example Application:
Finding Neuron Assemblies in Neural Spike Data
Biological Background
Structure of a prototypical neuron
[Figure: neuron with the labeled parts cell core, axon, myelin sheath, cell body (soma), terminal button, synapsis and dendrites]
Biological Background
[Figure: image credit © 2007 Ruiz-Villarreal]


Biological Background
(Very) simplified description of neural information processing
Axon terminal releases chemicals, called neurotransmitters.
These act on the membrane of the receptor dendrite to change its polarization.
(The inside is usually 70mV more negative than the outside.)
Decrease in potential difference: excitatory synapse
Increase in potential difference: inhibitory synapse
If there is enough net excitatory input, the axon is depolarized.
The resulting action potential travels along the axon.
(Speed depends on the degree to which the axon is covered with myelin.)
When the action potential reaches the terminal buttons, it triggers the release of neurotransmitters.
Neuronal Action Potential
A schematic view of an idealized action potential illustrates its various phases as the action potential passes a point on a cell membrane.
Actual recordings of action potentials are often distorted compared to the schematic view because of variations in electrophysiological techniques used to make the recording.
[Figure: schematic action potential, © en.wikipedia.org]
Higher Level Neural Processing
The low-level mechanisms of neural information processing are fairly well understood (neurotransmitters, excitation and inhibition, action potential).
The high-level mechanisms, however, are a topic of current research.
There are several competing theories (see the following slides) how neurons code and transmit the information they process.
Up to fairly recently it was not possible to record the spikes of enough neurons in parallel to decide between the different models.
However, new measurement techniques open up the possibility to record dozens or even up to a hundred neurons in parallel.
Currently methods are investigated by which it would be possible to check the validity of the different coding models.
Frequent item set mining, properly adapted, could provide a method to test the temporal coincidence hypothesis (see below).
Models of Neuronal Coding
(figures © Zoltán Nádasdy)
Frequency Code Hypothesis
[Sherrington 1906, Eccles 1957, Barlow 1972]
Neurons generate different frequency of spike trains as a response to different stimulus intensities.
Models of Neuronal Coding
Temporal Coincidence Hypothesis
[Gray et al. 1992, Singer 1993, 1994]
Spike occurrences are modulated by local field oscillation (gamma).
Tighter coincidence of spikes recorded from different neurons represent higher stimulus intensity.
Models of Neuronal Coding
Delay Coding Hypothesis
[Hopfield 1995, Buzsáki and Chrobak 1995]
The input current is converted to the spike delay.
Neuron 1 which was stimulated stronger reached the threshold earlier and initiated a spike sooner than neurons stimulated less.
Different delays of the spikes (d2-d4) represent relative intensities of the different stimulus.
Models of Neuronal Coding
Spatio-Temporal Code Hypothesis
Neurons display a causal sequence of spikes in relationship to a stimulus configuration.
The stronger stimulus induces spikes earlier and will initiate spikes in the other, connected cells in the order of relative threshold and actual depolarization.
The sequence of spike propagation is determined by the spatio-temporal configuration of the stimulus as well as the intrinsic connectivity of the network.
Spike sequences coincide with the local field activity.
Note that this model integrates both the temporal coincidence and the delay coding principles.
Models of Neuronal Coding
Markovian Process of Frequency Modulation
[Seidermann et al. 1996]
Stimulus intensities are converted to a sequence of frequency enhancements and decrements in the different neurons.
Different stimulus configurations are represented by different Markovian sequences across several seconds.
Finding Neuron Assemblies in Neuronal Spike Data
(data © Sonja Grün, Research Center Jülich, Germany)
Dot displays of (simulated) parallel spike trains:
  vertical: neurons (100)
  horizontal: time (10 seconds)
In one of these dot displays, 20 neurons are firing synchronously.
Without proper intelligent data analysis methods, it is virtually impossible to detect such synchronous firing.
Finding Neuron Assemblies in Neural Spike Data
If the neurons that fire together are grouped together, the synchronous firing becomes easily visible.
  left: copy of the right diagram of the previous slide
  right: same data, but with relevant neurons collected at the bottom
A synchronously firing set of neurons is called a neuron assembly.
Question: How can we find out which neurons to group together?
Finding Neuron Assemblies in Neural Spike Data
A Frequent Item Set Mining Approach
The neuronal spike trains are usually coded as pairs of a neuron id and a spike time, sorted by the spike time.
In order to make frequent item set mining applicable, time bins are formed.
Each time bin gives rise to one transaction.
It contains the set of neurons that fire in this time bin (items).
Frequent item set mining, possibly restricted to maximal item sets, is then applied with additional filtering of the frequent item sets.
For the (simulated) example data set such an approach detects the neuron assembly perfectly:
80 54 88 28 93 83 39 29 50 24 40 30 32 11 82 69 22 60 5 4
(0.5400%/54, 105.1679)
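Forming transactions from spike data by time binning could be done as in the following sketch. It assumes integer spike times (for example in milliseconds) and skips empty time bins; the function name `bin_spikes` and the toy data are made up for this illustration.

```python
def bin_spikes(spikes, bin_width):
    """Turn (neuron id, spike time) pairs into one transaction per non-empty time bin."""
    bins = {}
    for neuron, time in spikes:
        bins.setdefault(time // bin_width, set()).add(neuron)
    return [bins[b] for b in sorted(bins)]

# toy data: (neuron id, spike time in ms), sorted by spike time
spikes = [(1, 1), (2, 2), (3, 4), (1, 5), (2, 10)]
transactions = bin_spikes(spikes, 3)
print(transactions)
```

Skipping empty bins does not change any absolute support, but it changes the number of transactions, which matters for relative support values.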
Finding Neuron Assemblies in Neural Spike Data
Translation of Basic Notions
  mathematical problem   market basket analysis        spike train analysis
  item                   product                       neuron
  item base              set of all products           set of all neurons
  (transaction id)       customer                      time bin
  transaction            set of products               set of neurons
                         bought by a customer          firing in a time bin
  frequent item set      set of products               set of neurons
                         frequently bought together    frequently firing together
In both cases the input can be represented as a binary matrix (the so-called dot display in spike train analysis).
Note, however, that a dot display is usually rotated by 90°: usually customers refer to rows, products to columns, but in a dot display, rows are neurons, columns are time bins.
Finding Neuron Assemblies in Neural Spike Data
Core Problems of Detecting Synchronous Patterns:
Multiple Testing
  If several statistical tests are carried out, one loses control of the significance level.
  For fairly small numbers of tests, effective correction procedures exist.
  Here, however, the number of potential patterns and the number of tests is huge.
Induced Patterns
  If synchronous spiking activity is present in the data, not only the actual assembly, but also subsets, supersets and overlapping sets of neurons are detected.
Temporal Imprecision
  The spikes of neurons that participate in synchronous spiking cannot be expected to be perfectly synchronous.
Selective Participation
  Varying subsets of the neurons in an assembly may participate in different synchronous spiking events.
Neural Spike Data: Multiple Testing
If 1000 tests are carried out, each with a significance level $\alpha = 0.01 = 1\%$, around 10 tests will turn out positive, signifying nothing.
The positive test results can be explained as mere chance events.
Example: 100 recorded neurons allow for $\binom{100}{3} = 161{,}700$ triplets and $\binom{100}{4} = 3{,}921{,}225$ quadruplets.
As a consequence, even though it is very unlikely that, say, four specific neurons fire together three times if they are independent, it is fairly likely that we observe some set of four neurons firing together three times.
Example: 100 neurons, 20Hz firing rate, 3 seconds recording time, binned with 3ms time bins to obtain 1000 transactions.
The event of 4 neurons firing together 3 times has a $p$-value of $10^{-6}$ ($\chi^2$-test).
The average number of such patterns in independent data is greater than 1 (data generated as independent Poisson processes).
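The counts in the first example are easy to verify. The short sketch below only reproduces this arithmetic; the variable names are chosen for illustration.

```python
from math import comb

triplets = comb(100, 3)        # number of 3-element subsets of 100 neurons
quadruplets = comb(100, 4)     # number of 4-element subsets of 100 neurons
expected_positives = 1000 * 0.01   # tests times significance level

print(triplets, quadruplets, expected_positives)
```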
Neural Spike Data: Multiple Testing
Solution: shift statistical testing to pattern signatures $\langle z, c \rangle$, where $z$ is the number of neurons (pattern size) and $c$ the number of coincidences (pattern support). [Picado-Muiño et al. 2013]
Represent null hypothesis by generating sufficiently many surrogate data sets (e.g. by spike time randomization for constant firing rate).
(Surrogate data generation must take data properties into account.)
Remove all patterns found in the original data set for which a counterpart (same signature) was found in some surrogate data set (closed item sets).
(Idea: a counterpart indicates that the pattern could be a chance event.)
[Figure: four plots over pattern size or assembly size $z$ (2 to 12) and coincidences $c$ (2 to 12): log(#patterns) for the frequent patterns; rate of false negatives and exact detections; rate of all other patterns; avg. #patterns for 7 neurons and 7 coincidences.]
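The filtering step based on pattern signatures can be sketched as follows. The sketch assumes that the patterns of the original data and of each surrogate data set are already available as dictionaries that map item sets to supports; generating the surrogate data is not shown, and the name `filter_by_signatures` is hypothetical.

```python
def filter_by_signatures(patterns, surrogate_pattern_sets):
    """Keep only patterns whose signature (size, support) never occurs in a surrogate."""
    seen = {(len(p), s)
            for surrogate in surrogate_pattern_sets
            for p, s in surrogate.items()}
    return {p: s for p, s in patterns.items() if (len(p), s) not in seen}

patterns = {frozenset({1, 2, 3}): 5, frozenset({4, 5}): 3}
surrogates = [{frozenset({7, 8}): 3}, {frozenset({1, 9}): 2}]
kept = filter_by_signatures(patterns, surrogates)
print(kept)
```

Only the signature is compared, not the neurons involved, so a surrogate pattern over different neurons still removes an original pattern of the same size and support.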
Neural Spike Data: Induced Patterns
Let $A$ and $B$ with $B \subset A$ be two sets left over after primary pattern filtering, that is, after removing all sets $I$ with signatures $\langle z_I, c_I \rangle = \langle |I|, s(I) \rangle$ that occur in the surrogate data sets.
The set $A$ is preferred to the set $B$ iff $(z_A - 1) c_A \geq (z_B - 1) c_B$, that is, if the pattern $A$ covers at least as many spikes as the pattern $B$ if one neuron is neglected. Otherwise $B$ is preferred to $A$.
(This method is simple and effective, but there are several alternatives.)
Pattern set reduction keeps only sets that are preferred to all of their subsets and to all of their supersets. [Torre et al. 2013]
[Figure: four plots over pattern size or assembly size $z$ (2 to 12) and coincidences $c$ (2 to 12): log(#patterns) for the frequent patterns; rate of false negatives and exact detections; rate of all other patterns; avg. #patterns for 7 neurons and 7 coincidences.]
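Pattern set reduction with the preference relation above could be sketched like this. The patterns are assumed to be given as a dictionary from item sets (frozensets of neuron ids) to supports; the quadratic pairwise comparison and the names `covered` and `reduce_patterns` are simplifications chosen for this illustration.

```python
def covered(item_set, support):
    """(z - 1) * c: spikes covered by the pattern if one neuron is neglected."""
    return (len(item_set) - 1) * support

def reduce_patterns(patterns):
    """Keep patterns that are preferred to all their subsets and supersets."""
    kept = {}
    for p, c in patterns.items():
        ok = True
        for q, d in patterns.items():
            if q < p:       # q is a proper subset: the superset p wins ties
                ok = covered(p, c) >= covered(q, d)
            elif q > p:     # q is a proper superset: p must be strictly better
                ok = covered(p, c) > covered(q, d)
            if not ok:
                break
        if ok:
            kept[p] = c
    return kept

patterns = {frozenset({1, 2, 3, 4}): 5, frozenset({1, 2, 3}): 6,
            frozenset({1, 2, 3, 4, 5}): 2, frozenset({8, 9}): 4}
print(reduce_patterns(patterns))
```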
Neural Spike Data: Temporal Imprecision
The most common approach to cope with temporal imprecision, namely time binning, has several drawbacks:
Boundary Problem:
  Spikes almost as far apart as the bin width are synchronous if they fall into the same bin, but spikes close together are not seen as synchronous if a bin boundary separates them.
Bivalence Problem:
  Spikes are either synchronous (same time bin) or not, no graded notion of synchrony (precision of coincidence).
It is desirable to have continuous time approaches that allow for a graded notion of synchrony.
Solution: CoCoNAD (Continuous time ClOsed Neuron Assembly Detection)
  Extends frequent item set mining to point processes.
  Based on influence regions around spikes/points. [Picado-Muiño et al. 2013]
Neural Spike Data: Selective Participation
(data © Sonja Grün, Research Center Jülich, Germany)
Both diagrams show the same (simulated) data, but on the right the neurons of the assembly are collected at the bottom.
Only about 80% of the neurons (randomly chosen) participate in each synchronous firing. Hence there is no frequent item set comprising all of them.
Rather a frequent item set mining approach finds a large number of frequent item sets with 12 to 16 neurons.
Possible approach: fault-tolerant frequent item set mining
Association Rules
Association Rules: Basic Notions
Often found patterns are expressed as association rules, for example:
  If a customer buys bread and wine, then she/he will probably also buy cheese.
Formally, we consider rules of the form $X \to Y$, with $X, Y \subseteq B$ and $X \cap Y = \emptyset$.
Support of a Rule $X \to Y$:
  Either: $\varsigma_T(X \to Y) = \sigma_T(X \cup Y)$ (more common: rule is correct)
  Or: $\varsigma_T(X \to Y) = \sigma_T(X)$ (more plausible: rule is applicable)
Confidence of a Rule $X \to Y$:
  $c_T(X \to Y) = \frac{\sigma_T(X \cup Y)}{\sigma_T(X)} = \frac{s_T(X \cup Y)}{s_T(X)} = \frac{s_T(I)}{s_T(X)}$ (with $I = X \cup Y$)
The confidence can be seen as an estimate of $P(Y \mid X)$.
Association Rules: Formal Definition
Given:
  a set $B = \{i_1, \ldots, i_m\}$ of items,
  a vector $T = (t_1, \ldots, t_n)$ of transactions over $B$,
  a real number $\varsigma_{\min}$, $0 < \varsigma_{\min} \leq 1$, the minimum support,
  a real number $c_{\min}$, $0 < c_{\min} \leq 1$, the minimum confidence.
Desired:
  the set of all association rules, that is, the set
  $\mathcal{R} = \{ R: X \to Y \mid \varsigma_T(R) \geq \varsigma_{\min} \wedge c_T(R) \geq c_{\min} \}$.
General Procedure:
  Find the frequent item sets.
  Construct rules and filter them w.r.t. $\varsigma_{\min}$ and $c_{\min}$.


Generating Association Rules
Which minimum support has to be used for finding the frequent item sets depends on the definition of the support of a rule:
  If $\varsigma_T(X \to Y) = \sigma_T(X \cup Y)$, then $\sigma_{\min} = \varsigma_{\min}$ or equivalently $s_{\min} = \lceil n \varsigma_{\min} \rceil$.
  If $\varsigma_T(X \to Y) = \sigma_T(X)$, then $\sigma_{\min} = \varsigma_{\min} c_{\min}$ or equivalently $s_{\min} = \lceil n \varsigma_{\min} c_{\min} \rceil$.
After the frequent item sets have been found, the rule construction then traverses all frequent item sets $I$ and splits them into disjoint subsets $X$ and $Y$ ($X \cap Y = \emptyset$ and $X \cup Y = I$), thus forming rules $X \to Y$.
  Filtering rules w.r.t. confidence is always necessary.
  Filtering rules w.r.t. support is only necessary if $\varsigma_T(X \to Y) = \sigma_T(X)$.
Properties of the Confidence
From $\forall I: \forall J \subseteq I: s_T(I) \leq s_T(J)$ it obviously follows
  $\forall X, Y: \forall a \in X: \frac{s_T(X \cup Y)}{s_T(X)} \geq \frac{s_T(X \cup Y)}{s_T(X - \{a\})}$
and therefore
  $\forall X, Y: \forall a \in X: c_T(X \to Y) \geq c_T(X - \{a\} \to Y \cup \{a\})$.
That is: Moving an item from the antecedent to the consequent cannot increase the confidence of a rule.
As an immediate consequence we have
  $\forall X, Y: \forall a \in X: c_T(X \to Y) < c_{\min} \Rightarrow c_T(X - \{a\} \to Y \cup \{a\}) < c_{\min}$.
That is: If a rule fails to meet the minimum confidence, no rules over the same item set and with a larger consequent need to be considered.
Generating Association Rules
function rules (F);                      (* generate association rules *)
  R := ∅;                                (* initialize the set of rules *)
  forall f ∈ F do begin                  (* traverse the frequent item sets *)
    m := 1;                              (* start with rule heads (consequents) *)
    H_m := ⋃_{i ∈ f} {{i}};              (* that contain only one item *)
    repeat                               (* traverse rule heads of increasing size *)
      forall h ∈ H_m do                  (* traverse the possible rule heads *)
        if s_T(f) / s_T(f − h) ≥ c_min   (* if the confidence is high enough, *)
        then R := R ∪ {[(f − h) → h]};   (* add rule to the result *)
        else H_m := H_m − {h};           (* otherwise discard the head *)
      H_{m+1} := candidates(H_m);        (* create heads with one item more *)
      m := m + 1;                        (* increment the head item counter *)
    until H_m = ∅ or m ≥ |f|;            (* until there are no more rule heads *)
  end;                                   (* or antecedent would become empty *)
  return R;                              (* return the rules found *)
end; (* rules *)
Generating Association Rules
function candidates (F_k)                (* generate candidates with k + 1 items *)
begin
  E := ∅;                                (* initialize the set of candidates *)
  forall f_1, f_2 ∈ F_k                  (* traverse all pairs of frequent item sets *)
  with f_1 = {a_1, ..., a_{k−1}, a_k}    (* that differ only in one item and *)
  and  f_2 = {a_1, ..., a_{k−1}, a'_k}   (* are in a lexicographic order *)
  and  a_k < a'_k do begin               (* (the order is arbitrary, but fixed) *)
    f := f_1 ∪ f_2 = {a_1, ..., a_{k−1}, a_k, a'_k};   (* union has k + 1 items *)
    if ∀a ∈ f: f − {a} ∈ F_k             (* only if all subsets are frequent, *)
    then E := E ∪ {f};                   (* add the new item set to the candidates *)
  end;                                   (* (otherwise it cannot be frequent) *)
  return E;                              (* return the generated candidates *)
end (* candidates *)
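The two functions above can be turned into a runnable sketch. It assumes that the frequent item sets are given as a dictionary from frozensets to absolute supports, finds them here by brute force for the ten-transaction example database, and skips single-item sets so that no rule with an empty antecedent is produced. The Python names mirror the pseudocode but are otherwise choices made for this illustration.

```python
from itertools import combinations

def candidates(heads):
    """Join heads that differ in one item; keep unions whose subsets are all heads."""
    heads = set(heads)
    result = set()
    for h1, h2 in combinations(heads, 2):
        union = h1 | h2
        if len(union) == len(h1) + 1 and all(union - {a} in heads for a in union):
            result.add(union)
    return result

def rules(frequent, c_min):
    """Return (antecedent, consequent, confidence) for all rules reaching c_min."""
    found = []
    for f, s_f in frequent.items():
        heads = {frozenset([i]) for i in f}     # consequents with one item
        m = 1
        while heads and m < len(f):             # antecedent must stay non-empty
            for h in list(heads):
                conf = s_f / frequent[f - h]
                if conf >= c_min:
                    found.append((f - h, h, conf))
                else:
                    heads.discard(h)            # larger heads cannot succeed
            heads = candidates(heads)
            m += 1
    return found

DB = [set(t) for t in ("ade", "bcd", "ace", "acde", "ae",
                       "acd", "bc", "acde", "bce", "ade")]
frequent = {}
for k in range(1, 6):                           # brute-force frequent item sets
    for combo in combinations("abcde", k):
        s = sum(1 for t in DB if set(combo) <= t)
        if s >= 3:
            frequent[frozenset(combo)] = s
result = rules(frequent, 0.8)
for x, y, c in result:
    print(sorted(x), "->", sorted(y), round(c, 3))
```

Discarding a failed head before calling `candidates` is what implements the property that a larger consequent cannot raise the confidence.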
Frequent Item Sets: Example
transaction database
   1: {a, d, e}
   2: {b, c, d}
   3: {a, c, e}
   4: {a, c, d, e}
   5: {a, e}
   6: {a, c, d}
   7: {b, c}
   8: {a, c, d, e}
   9: {b, c, e}
  10: {a, d, e}
frequent item sets
  0 items    1 item     2 items      3 items
  ∅: 10      {a}: 7     {a, c}: 4    {a, c, d}: 3
             {b}: 3     {a, d}: 5    {a, c, e}: 3
             {c}: 7     {a, e}: 6    {a, d, e}: 4
             {d}: 6     {b, c}: 3
             {e}: 7     {c, d}: 4
                        {c, e}: 4
                        {d, e}: 4
The minimum support is $s_{\min} = 3$ or $\sigma_{\min} = 0.3 = 30\%$ in this example.
There are $2^5 = 32$ possible item sets over $B = \{a, b, c, d, e\}$.
There are 16 frequent item sets (but only 10 transactions).
Generating Association Rules
Example: $I = \{a, c, e\}$, $X = \{c, e\}$, $Y = \{a\}$.
  $c_T(c, e \to a) = \frac{s_T(\{a, c, e\})}{s_T(\{c, e\})} = \frac{3}{4} = 75\%$
Minimum confidence: 80%
  association rule   support of all items   support of antecedent   confidence
  b → c              3 (30%)                3 (30%)                 100%
  d → a              5 (50%)                6 (60%)                 83.3%
  e → a              6 (60%)                7 (70%)                 85.7%
  a → e              6 (60%)                7 (70%)                 85.7%
  d, e → a           4 (40%)                4 (40%)                 100%
  a, d → e           4 (40%)                5 (50%)                 80%
Support of an Association Rule
The two rule support definitions are not equivalent:
transaction database
  1: {a, c, e}
  2: {b, d}
  3: {b, c, d}
  4: {a, e}
  5: {a, b, c, d}
  6: {c, e}
  7: {a, b, d}
  8: {a, c, d}
two association rules
  association rule   support of all items   support of antecedent   confidence
  a → c              3 (37.5%)              5 (62.5%)               60.0%
  b → d              4 (50.0%)              4 (50.0%)               100.0%
Let the minimum confidence be $c_{\min} = 60\%$.
For $\varsigma_T(R) = \sigma_T(X \cup Y)$ and $3 < \varsigma_{\min} \leq 4$ (in absolute terms) only the rule $b \to d$ is generated, but not the rule $a \to c$.
For $\varsigma_T(R) = \sigma_T(X)$ there is no value $\varsigma_{\min}$ that generates only the rule $b \to d$, but not at the same time also the rule $a \to c$.
Rule Extraction from Prefix Tree
Restriction to rules with one item in the head/consequent.
Exploit the prefix tree to find the support of the body/antecedent.
Traverse the item set tree breadth-first or depth-first.
For each node traverse the path to the root and
generate and test one rule per node.
[Figure: prefix tree path from the root to the current node, with the labels hdnode, isnode, head, prev, body and same path]
First rule: Get the support of the body/antecedent from the parent node.
Next rules: Discard the head/consequent item from the downward path
and follow the remaining path from the current node.
Christian Borgelt Frequent Pattern Mining 266
Reminder: Prefix Tree
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
[Figure: the corresponding prefix tree with edges labeled a, b, c, d]
A (full) prefix tree for the five items a, b, c, d, e.
Based on a global order of the items (which can be arbitrary).
The item sets counted in a node consist of
all items labeling the edges to the node (common prefix) and
one item following the last edge label in the item order.
Christian Borgelt Frequent Pattern Mining 267
Additional Rule Filtering: Simple Measures
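General idea: Compare
$\hat{P}_T(Y \mid X) = c_T(X \to Y)$
and
$\hat{P}_T(Y) = c_T(\emptyset \to Y) = \sigma_T(Y)$.
(Absolute) confidence difference to prior:
$d_T(R) = |c_T(X \to Y) - \sigma_T(Y)|$
Lift value:
$l_T(R) = \frac{c_T(X \to Y)}{\sigma_T(Y)}$
(Absolute) difference of lift value to 1:
$q_T(R) = \left| \frac{c_T(X \to Y)}{\sigma_T(Y)} - 1 \right|$
(Absolute) difference of lift quotient to 1:
$r_T(R) = \left| 1 - \min\left\{ \frac{c_T(X \to Y)}{\sigma_T(Y)}, \frac{\sigma_T(Y)}{c_T(X \to Y)} \right\} \right|$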
Christian Borgelt Frequent Pattern Mining 268
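As a small worked example, the four simple measures can be computed from a rule's confidence and the relative support of its consequent. The sketch assumes both values are positive; the function name and the dictionary keys are chosen for illustration.

```python
def rule_measures(confidence, prior):
    """Simple rule evaluation measures.

    confidence: c_T(X -> Y), prior: sigma_T(Y); both assumed to be > 0.
    """
    lift = confidence / prior
    return {
        "d": abs(confidence - prior),           # confidence difference to prior
        "lift": lift,                           # lift value
        "q": abs(lift - 1),                     # difference of lift value to 1
        "r": abs(1 - min(lift, 1 / lift)),      # difference of lift quotient to 1
    }

# rule d -> a: confidence 5/6, relative support of the consequent a is 7/10
m = rule_measures(5 / 6, 7 / 10)
print({name: round(value, 4) for name, value in m.items()})
```

Unlike the lift value, the measures d, q and r are zero exactly when antecedent and consequent are independent, so larger values always indicate a more informative rule.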


Additional Rule Filtering: More Sophisticated Measures
Consider the 2 × 2 contingency table or the estimated probability table:
                        $X \not\subseteq t$    $X \subseteq t$
$Y \not\subseteq t$     $n_{00}$               $n_{01}$            $n_{0.}$
$Y \subseteq t$         $n_{10}$               $n_{11}$            $n_{1.}$
                        $n_{.0}$               $n_{.1}$            $n_{..}$

                        $X \not\subseteq t$    $X \subseteq t$
$Y \not\subseteq t$     $p_{00}$               $p_{01}$            $p_{0.}$
$Y \subseteq t$         $p_{10}$               $p_{11}$            $p_{1.}$
                        $p_{.0}$               $p_{.1}$            $1$
$n_{..}$ is the total number of transactions.
$n_{.1}$ is the number of transactions to which the rule is applicable.
$n_{11}$ is the number of transactions for which the rule is correct.
It is $p_{ij} = \frac{n_{ij}}{n_{..}}$, $p_{i.} = \frac{n_{i.}}{n_{..}}$, $p_{.j} = \frac{n_{.j}}{n_{..}}$ for $i, j \in \{0, 1\}$.
General idea: Use measures for the strength of dependence of X and Y.
There is a large number of such measures of dependence
originating from statistics, decision tree induction etc.
Christian Borgelt Frequent Pattern Mining 269
An Information-theoretic Evaluation Measure
Information Gain (Kullback and Leibler 1951, Quinlan 1986)
Based on Shannon Entropy $H = -\sum_{i=1}^{n} p_i \log_2 p_i$ (Shannon 1948)
$I_{\mathrm{gain}}(X, Y) = H(Y) - H(Y \mid X) = \underbrace{-\sum_{i=1}^{k_Y} p_{i.} \log_2 p_{i.}}_{H(Y)} - \underbrace{\sum_{j=1}^{k_X} p_{.j} \left( -\sum_{i=1}^{k_Y} p_{i|j} \log_2 p_{i|j} \right)}_{H(Y \mid X)}$
$H(Y)$: Entropy of the distribution of Y
$H(Y \mid X)$: Expected entropy of the distribution of Y
if the value of X becomes known
$H(Y) - H(Y \mid X)$: Expected entropy reduction or information gain
Christian Borgelt Frequent Pattern Mining 270
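The information gain of a rule can be computed directly from the contingency table. The following sketch assumes the table is given as a list of rows `n[i][j]`, with the row index i for Y and the column index j for X, as in the contingency table above; the names are illustrative.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(n):
    """I_gain(X, Y) = H(Y) - H(Y|X) from a contingency table n[i][j]."""
    total = sum(sum(row) for row in n)                 # n_..
    row_sums = [sum(row) for row in n]                 # n_i.
    col_sums = [sum(col) for col in zip(*n)]           # n_.j
    h_y = entropy(r / total for r in row_sums)
    h_y_given_x = sum(
        (c / total) * entropy(n[i][j] / c for i in range(len(n)))
        for j, c in enumerate(col_sums) if c > 0)
    return h_y - h_y_given_x

# rule d -> a in the ten-transaction example: n00 = 2, n01 = 1, n10 = 2, n11 = 5
print(round(information_gain([[2, 1], [2, 5]]), 4))
```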
Interpretation of Shannon Entropy
Let $S = \{s_1, \ldots, s_n\}$ be a finite set of alternatives
having positive probabilities $P(s_i)$, $i = 1, \ldots, n$, satisfying $\sum_{i=1}^{n} P(s_i) = 1$.
Shannon Entropy:
$H(S) = -\sum_{i=1}^{n} P(s_i) \log_2 P(s_i)$
Intuitively: Expected number of yes/no questions that have
to be asked in order to determine the obtaining alternative.
Suppose there is an oracle, which knows the obtaining alternative,
but responds only if the question can be answered with "yes" or "no".
A better question scheme than asking for one alternative after the other
can easily be found: Divide the set into two subsets of about equal size.
Ask for containment in an arbitrarily chosen subset.
Apply this scheme recursively: the number of questions is bounded by $\lceil \log_2 n \rceil$.
Christian Borgelt Frequent Pattern Mining 271
Question/Coding Schemes
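$P(s_1) = 0.10$, $P(s_2) = 0.15$, $P(s_3) = 0.16$, $P(s_4) = 0.19$, $P(s_5) = 0.40$
Shannon entropy: $-\sum_i P(s_i) \log_2 P(s_i) = 2.15$ bit/symbol
Linear Traversal
[Figure: question tree. $\{s_1, s_2, s_3, s_4, s_5\}$ is split into $s_1$ and $\{s_2, s_3, s_4, s_5\}$, then $s_2$ and $\{s_3, s_4, s_5\}$, then $s_3$ and $\{s_4, s_5\}$, then $s_4$ and $s_5$]
alternative    s_1     s_2     s_3     s_4     s_5
probability    0.10    0.15    0.16    0.19    0.40
questions      1       2       3       4       4
Code length: 3.24 bit/symbol
Code efficiency: 0.664
Equal Size Subsets
[Figure: question tree. $\{s_1, s_2, s_3, s_4, s_5\}$ is split into $\{s_1, s_2\}$ (0.25) and $\{s_3, s_4, s_5\}$ (0.75); $\{s_3, s_4, s_5\}$ is split into $s_3$ and $\{s_4, s_5\}$ (0.59)]
alternative    s_1     s_2     s_3     s_4     s_5
probability    0.10    0.15    0.16    0.19    0.40
questions      2       2       2       3       3
Code length: 2.59 bit/symbol
Code efficiency: 0.830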
Christian Borgelt Frequent Pattern Mining 272
Question/Coding Schemes
Splitting into subsets of about equal size can lead to a bad arrangement
of the alternatives into subsets, and thus to a high expected number of questions.
Good question schemes take the probability of the alternatives into account.
Shannon-Fano Coding (1948)
Build the question/coding scheme top-down.
Sort the alternatives w.r.t. their probabilities.
Split the set so that the subsets have about equal probability
(splits must respect the probability order of the alternatives).
Huffman Coding (1952)
Build the question/coding scheme bottom-up.
Start with one element sets.
Always combine those two sets that have the smallest probabilities.
Christian Borgelt Frequent Pattern Mining 273
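The bottom-up construction of a Huffman scheme can be sketched with a priority queue. The code below only tracks the number of questions (code length) per alternative rather than the codes themselves; the function name and the tie-breaking counter are implementation choices of this sketch.

```python
import heapq
from math import log2

def huffman_code_lengths(probs):
    """Number of yes/no questions per alternative in a Huffman scheme."""
    # heap entries: (probability, tie-breaker, indices of the alternatives)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        # always combine the two sets with the smallest probabilities
        p1, _, set1 = heapq.heappop(heap)
        p2, _, set2 = heapq.heappop(heap)
        for i in set1 + set2:
            lengths[i] += 1      # one more question for each merged alternative
        heapq.heappush(heap, (p1 + p2, counter, set1 + set2))
        counter += 1
    return lengths

probs = [0.10, 0.15, 0.16, 0.19, 0.40]
lengths = huffman_code_lengths(probs)
expected = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * log2(p) for p in probs)
print(lengths, round(expected, 2), round(entropy / expected, 3))
```

The code efficiency is the ratio of the Shannon entropy to the expected code length.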
Question/Coding Schemes
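$P(s_1) = 0.10$, $P(s_2) = 0.15$, $P(s_3) = 0.16$, $P(s_4) = 0.19$, $P(s_5) = 0.40$
Shannon entropy: $-\sum_i P(s_i) \log_2 P(s_i) = 2.15$ bit/symbol
Shannon-Fano Coding (1948)
[Figure: question tree. $\{s_1, s_2, s_3, s_4, s_5\}$ is split into $\{s_1, s_2, s_3\}$ (0.41) and $\{s_4, s_5\}$ (0.59); $\{s_1, s_2, s_3\}$ is split into $\{s_1, s_2\}$ (0.25) and $s_3$]
alternative    s_1     s_2     s_3     s_4     s_5
probability    0.10    0.15    0.16    0.19    0.40
questions      3       3       2       2       2
Code length: 2.25 bit/symbol
Code efficiency: 0.955
Huffman Coding (1952)
[Figure: question tree. $\{s_1, s_2, s_3, s_4, s_5\}$ is split into $\{s_1, s_2, s_3, s_4\}$ (0.60) and $s_5$; $\{s_1, s_2, s_3, s_4\}$ is split into $\{s_1, s_2\}$ (0.25) and $\{s_3, s_4\}$ (0.35)]
alternative    s_1     s_2     s_3     s_4     s_5
probability    0.10    0.15    0.16    0.19    0.40
questions      3       3       3       3       1
Code length: 2.20 bit/symbol
Code efficiency: 0.977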
Christian Borgelt Frequent Pattern Mining 274
Question/Coding Schemes
It can be shown that Huffman coding is optimal
if we have to determine the obtaining alternative in a single instance.
(No question/coding scheme has a smaller expected number of questions.)
Only if the obtaining alternative has to be determined in a sequence
of (independent) situations, this scheme can be improved upon.
Idea: Process the sequence not instance by instance,
but combine two, three or more consecutive instances and
ask directly for the obtaining combination of alternatives.
Although this enlarges the question/coding scheme, the expected number
of questions per identification is reduced (because each interrogation
identifies the obtaining alternative for several situations).
However, the expected number of questions per identification
of an obtaining alternative cannot be made arbitrarily small.
Shannon showed that there is a lower bound, namely the Shannon entropy.
Christian Borgelt Frequent Pattern Mining 275
Interpretation of Shannon Entropy
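$P(s_1) = \frac{1}{2}$, $P(s_2) = \frac{1}{4}$, $P(s_3) = \frac{1}{8}$, $P(s_4) = \frac{1}{16}$, $P(s_5) = \frac{1}{16}$
Shannon entropy: $-\sum_i P(s_i) \log_2 P(s_i) = 1.875$ bit/symbol
If the probability distribution allows for a
perfect Huffman code (code efficiency 1),
the Shannon entropy can easily be interpreted as follows:
$-\sum_i P(s_i) \log_2 P(s_i) = \sum_i \underbrace{P(s_i)}_{\text{occurrence probability}} \cdot \underbrace{\log_2 \frac{1}{P(s_i)}}_{\text{path length in tree}}$
In other words, it is the expected number
of needed yes/no questions.
Perfect Question Scheme
[Figure: question tree. $\{s_1, s_2, s_3, s_4, s_5\}$ is split into $s_1$ and $\{s_2, s_3, s_4, s_5\}$, then $s_2$ and $\{s_3, s_4, s_5\}$, then $s_3$ and $\{s_4, s_5\}$, then $s_4$ and $s_5$]
alternative    s_1     s_2     s_3     s_4     s_5
probability    1/2     1/4     1/8     1/16    1/16
questions      1       2       3       4       4
Code length: 1.875 bit/symbol
Code efficiency: 1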
Christian Borgelt Frequent Pattern Mining 276
A Statistical Evaluation Measure
$\chi^2$ Measure
Compares the actual joint distribution
with a hypothetical independent distribution.
Uses absolute comparison.
Can be interpreted as a difference measure.
$\chi^2(X, Y) = \sum_{i=1}^{k_X} \sum_{j=1}^{k_Y} n_{..} \frac{(p_{i.} p_{.j} - p_{ij})^2}{p_{i.} p_{.j}}$
Side remark: Information gain can also be interpreted as a difference measure.
$I_{\mathrm{gain}}(X, Y) = \sum_{j=1}^{k_X} \sum_{i=1}^{k_Y} p_{ij} \log_2 \frac{p_{ij}}{p_{i.} p_{.j}}$
Christian Borgelt Frequent Pattern Mining 277
A Statistical Evaluation Measure
$\chi^2$ Measure
$\chi^2(X, Y) = \sum_{i=1}^{k_X} \sum_{j=1}^{k_Y} n_{..} \frac{(p_{i.} p_{.j} - p_{ij})^2}{p_{i.} p_{.j}}$
For $k_X = k_Y = 2$ (as for rule evaluation) the $\chi^2$ measure simplifies to
$\chi^2(X, Y) = n_{..} \frac{(p_{1.} p_{.1} - p_{11})^2}{p_{1.} (1 - p_{1.}) \, p_{.1} (1 - p_{.1})} = n_{..} \frac{(n_{1.} n_{.1} - n_{..} n_{11})^2}{n_{1.} (n_{..} - n_{1.}) \, n_{.1} (n_{..} - n_{.1})}.$
Christian Borgelt Frequent Pattern Mining 278
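That the general and the simplified form agree for a 2 × 2 table can be checked numerically. The sketch below assumes a table `n[i][j]` of absolute counts in which all row and column sums are positive; the function names are illustrative.

```python
def chi_square(n):
    """General chi^2 measure for a contingency table n[i][j]."""
    total = sum(sum(row) for row in n)                 # n_..
    row_sums = [sum(row) for row in n]                 # n_i.
    col_sums = [sum(col) for col in zip(*n)]           # n_.j
    chi2 = 0.0
    for i, r in enumerate(row_sums):
        for j, c in enumerate(col_sums):
            expected = r * c / total                   # n_.. * p_i. * p_.j
            chi2 += (expected - n[i][j]) ** 2 / expected
    return chi2

def chi_square_2x2(n):
    """Simplified form for a 2 x 2 table."""
    total = sum(sum(row) for row in n)
    n1_, n_1, n11 = sum(n[1]), n[0][1] + n[1][1], n[1][1]
    numerator = total * (n1_ * n_1 - total * n11) ** 2
    return numerator / (n1_ * (total - n1_) * n_1 * (total - n_1))

table = [[2, 1], [2, 5]]    # n00, n01 / n10, n11
print(round(chi_square(table), 4), round(chi_square_2x2(table), 4))
```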
Examples from the Census Data
All rules are stated as
consequent <- antecedent (support%, confidence%, lift)
where the support of a rule is the support of the antecedent.
Trivial/Obvious Rules
edu_num=13 <- education=Bachelors (16.4, 100.0, 6.09)
sex=Male <- relationship=Husband (40.4, 99.99, 1.50)
sex=Female <- relationship=Wife (4.8, 99.9, 3.01)
Interesting Comparisons
marital=Never-married <- age=young sex=Female (12.3, 80.8, 2.45)
marital=Never-married <- age=young sex=Male (17.4, 69.9, 2.12)
salary>50K <- occupation=Exec-managerial sex=Male (8.9, 57.3, 2.40)
salary>50K <- occupation=Exec-managerial (12.5, 47.8, 2.00)
salary>50K <- education=Masters (5.4, 54.9, 2.29)
hours=overtime <- education=Masters (5.4, 41.0, 1.58)
Christian Borgelt Frequent Pattern Mining 279
Examples from the Census Data
salary>50K <- education=Masters (5.4, 54.9, 2.29)
salary>50K <- occupation=Exec-managerial (12.5, 47.8, 2.00)
salary>50K <- relationship=Wife (4.8, 46.9, 1.96)
salary>50K <- occupation=Prof-specialty (12.6, 45.1, 1.89)
salary>50K <- relationship=Husband (40.4, 44.9, 1.88)
salary>50K <- marital=Married-civ-spouse (45.8, 44.6, 1.86)
salary>50K <- education=Bachelors (16.4, 41.3, 1.73)
salary>50K <- hours=overtime (26.0, 40.6, 1.70)
salary>50K <- occupation=Exec-managerial hours=overtime
(5.5, 60.1, 2.51)
salary>50K <- occupation=Prof-specialty hours=overtime
(4.4, 57.3, 2.39)
salary>50K <- education=Bachelors hours=overtime
(6.0, 54.8, 2.29)
Christian Borgelt Frequent Pattern Mining 280
Examples from the Census Data
salary>50K <- occupation=Prof-specialty marital=Married-civ-spouse
(6.5, 70.8, 2.96)
salary>50K <- occupation=Exec-managerial marital=Married-civ-spouse
(7.4, 68.1, 2.85)
salary>50K <- education=Bachelors marital=Married-civ-spouse
(8.5, 67.2, 2.81)
salary>50K <- hours=overtime marital=Married-civ-spouse
(15.6, 56.4, 2.36)
marital=Married-civ-spouse <- salary>50K (23.9, 85.4, 1.86)
Christian Borgelt Frequent Pattern Mining 281
Examples from the Census Data
hours=half-time <- occupation=Other-service age=young
(4.4, 37.2, 3.08)
hours=overtime <- salary>50K (23.9, 44.0, 1.70)
hours=overtime <- occupation=Exec-managerial (12.5, 43.8, 1.69)
hours=overtime <- occupation=Exec-managerial salary>50K
(6.0, 55.1, 2.12)
hours=overtime <- education=Masters (5.4, 40.9, 1.58)
education=Bachelors <- occupation=Prof-specialty (12.6, 36.2, 2.20)
education=Bachelors <- occupation=Exec-managerial (12.5, 33.3, 2.03)
education=HS-grad <- occupation=Transport-moving (4.8, 51.9, 1.61)
education=HS-grad <- occupation=Machine-op-inspct (6.2, 50.7, 1.6)
Christian Borgelt Frequent Pattern Mining 282
Examples from the Census Data
occupation=Prof-specialty <- education=Masters (5.4, 49.0, 3.88)
occupation=Prof-specialty <- education=Bachelors sex=Female
(5.1, 34.7, 2.74)
occupation=Adm-clerical <- education=Some-college sex=Female
(8.6, 31.1, 2.71)
sex=Female <- occupation=Adm-clerical (11.5, 67.2, 2.03)
sex=Female <- occupation=Other-service (10.1, 54.8, 1.65)
sex=Female <- hours=half-time (12.1, 53.7, 1.62)
age=young <- hours=half-time (12.1, 53.3, 1.79)
age=young <- occupation=Handlers-cleaners (4.2, 50.6, 1.70)
age=senior <- workclass=Self-emp-not-inc (7.9, 31.1, 1.57)
Christian Borgelt Frequent Pattern Mining 283
Summary Association Rules
Association Rule Induction is a Two Step Process
Find the frequent item sets (minimum support).
Form the relevant association rules (minimum confidence).
Generating the Association Rules
Form all possible association rules from the frequent item sets.
Filter "interesting" association rules
based on minimum support and minimum confidence.
Filtering the Association Rules
Compare rule confidence and consequent support.
Information gain, $\chi^2$ measure
In principle: other measures used for decision tree induction.
Christian Borgelt Frequent Pattern Mining 284
Mining More Complex Patterns
Christian Borgelt Frequent Pattern Mining 285
Mining More Complex Patterns
The search scheme in Frequent Graph/Tree/Sequence mining is the same,
namely the general scheme of searching with a canonical form.
Frequent (Sub)Graph Mining comprises the other areas:
Trees are special graphs, namely graphs that are singly connected.
Sequences can be seen as special trees, namely chains
(only one or two branches, depending on the choice of the root).
Frequent Sequence Mining and Frequent Tree Mining can exploit:
Specialized canonical forms that allow for more efficient checks.
Special data structures to represent the database to mine,
so that support counting becomes more efficient.
We will treat Frequent Graph Mining first and
will discuss optimizations for the other areas later.
Christian Borgelt Frequent Pattern Mining 286
Motivation:
Molecular Fragment Mining
Christian Borgelt Frequent Pattern Mining 287
Molecular Fragment Mining
Motivation: Accelerating Drug Development
Phases of drug development: pre-clinical and clinical
Data gathering by high-throughput screening:
building molecular databases with activity information
Acceleration potential by intelligent data analysis:
(quantitative) structure-activity relationship discovery
Mining Molecular Databases
Example data: NCI DTP HIV Antiviral Screen data set
Description languages for molecules:
SMILES, SLN, SDfile/Ctab etc.
Finding common molecular substructures
Finding discriminative molecular substructures
Christian Borgelt Frequent Pattern Mining 288
Accelerating Drug Development
Developing a new drug can take 10 to 12 years
(from the choice of the target to the introduction into the market).
In recent years the duration of the drug development processes increased
continuously; at the same time the number of substances under development
has gone down drastically.
Due to high investments pharmaceutical companies must secure their market
position and competitiveness by only a few, highly successful drugs.
As a consequence the chances for the development
of drugs for target groups
with rare diseases or
with special diseases in developing countries
are considerably reduced.
A significant reduction of the development time could mitigate this trend
or even reverse it.
(Source: Bundesministerium für Bildung und Forschung, Germany)
Christian Borgelt Frequent Pattern Mining 289
Phases of Drug Development
Discovery and Optimization of Candidate Substances
  High-Throughput Screening
  Lead Discovery and Lead Optimization
Pre-clinical Test Series (tests with animals, ca. 3 years)
  Fundamental test w.r.t. effectiveness and side effects
Clinical Test Series (tests with humans, ca. 4-6 years)
  Phase 1: ca. 30-80 healthy humans
    Check for side effects
  Phase 2: ca. 100-300 humans exhibiting the symptoms of the target disease
    Check for effectiveness
  Phase 3: up to 3000 healthy and ill humans, at least 3 years
    Detailed check of effectiveness and side effects
Official Acceptance as a Drug
Christian Borgelt Frequent Pattern Mining 290
Drug Development: Acceleration Potential
The length of the pre-clinical and clinical test series can hardly be reduced,
since they serve the purpose of ensuring the safety of the patients.
Therefore approaches to speed up the development process
usually target the pre-clinical phase before the animal tests.
In particular, one tries to improve the search for new drug candidates
(lead discovery) and their optimization (lead optimization).
Here Intelligent Data Analysis and Frequent Pattern Mining can help.
One possible approach:
With high-throughput screening a very large number of substances
is tested automatically and their activity is determined.
The resulting molecular databases are analyzed by trying
to find common substructures of active substances.
Christian Borgelt Frequent Pattern Mining 291
High-Throughput Screening
On so-called micro-plates proteins/cells are automatically combined with a large
variety of chemical compounds.
[Pictures © www.matrixtechcorp.com, www.elisa-tek.com, www.thermo.com, www.arrayit.com]
Christian Borgelt Frequent Pattern Mining 292
High-Throughput Screening
The filled micro-plates are then evaluated in spectrometers
(w.r.t. absorption, fluorescence, luminescence, polarization etc.).
[Pictures © www.moleculardevices.com, www.biotek.com]
Christian Borgelt Frequent Pattern Mining 293
High-Throughput Screening
After the measurement the substances are classified as active or inactive.
[Figure © Christof Fattinger, Hoffmann-LaRoche, Basel]
By analyzing the results one tries
to understand the dependence
between molecular structure and
activity.
QSAR:
Quantitative Structure-Activity
Relationship Modeling
In this area a large
number of data mining
algorithms are used:
feature selection methods
decision trees
neural networks etc.
Christian Borgelt Frequent Pattern Mining 294
Example: NCI DTP HIV Antiviral Screen
Among other data sets, the National Cancer Institute (NCI) has made
the DTP HIV Antiviral Screen Data Set publicly available.
A large number of chemical compounds were tested
whether they protect human CEM cells against an HIV-1 infection.
Substances that provided 50% protection were retested.
Substances that reproducibly provided 100% protection
are listed as "confirmed active" (CA).
Substances that reproducibly provided at least 50% protection
are listed as "moderately active" (CM).
All other substances
are listed as "confirmed inactive" (CI).
325 CA, 877 CM, 35 969 CI (total: 37 171 substances)
Christian Borgelt Frequent Pattern Mining 295
Form of the Input Data
Excerpt from the NCI DTP HIV Antiviral Screen data set (SMILES format):
737, 0,CN(C)C1=[S+][Zn]2(S1)SC(=[S+]2)N(C)C
2018, 0,N#CC(=CC1=CC=CC=C1)C2=CC=CC=C2
19110,0,OC1=C2N=C(NC3=CC=CC=C3)SC2=NC=N1
20625,2,NC(=N)NC1=C(SSC2=C(NC(N)=N)C=CC=C2)C=CC=C1.OS(O)(=O)=O
22318,0,CCCCN(CCCC)C1=[S+][Cu]2(S1)SC(=[S+]2)N(CCCC)CCCC
24479,0,C[N+](C)(C)C1=CC2=C(NC3=CC=CC=C3S2)N=N1
50848,2,CC1=C2C=CC=CC2=N[C-](CSC3=CC=CC=C3)[N+]1=O
51342,0,OC1=C2C=NC(=NC2=C(O)N=N1)NC3=CC=C(Cl)C=C3
55721,0,NC1=NC(=C(N=O)C(=N1)O)NC2=CC(=C(Cl)C=C2)Cl
55917,0,O=C(N1CCCC[CH]1C2=CC=CN=C2)C3=CC=CC=C3
64054,2,CC1=C(SC[C-]2N=C3C=CC=CC3=C(C)[N+]2=O)C=CC=C1
64055,1,CC1=CC=CC(=C1)SC[C-]2N=C3C=CC=CC3=C(C)[N+]2=O
64057,2,CC1=C2C=CC=CC2=N[C-](CSC3=NC4=CC=CC=C4S3)[N+]1=O
66151,0,[O-][N+](=O)C1=CC2=C(C=NN=C2C=C1)N3CC3
...
identification number, activity (2: CA, 1: CM, 0: CI), molecule description in SMILES notation
Christian Borgelt Frequent Pattern Mining 296
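Records in this form can be read with a few lines of Python. This is a sketch under the assumption that each line has exactly the three comma-separated fields shown in the excerpt; the names `Record` and `parse_record` are hypothetical.

```python
from typing import NamedTuple

# activity codes as stated for the excerpt: 2: CA, 1: CM, 0: CI
ACTIVITY = {0: "CI", 1: "CM", 2: "CA"}

class Record(NamedTuple):
    ident: int       # identification number
    activity: str    # "CA", "CM" or "CI"
    smiles: str      # molecule description in SMILES notation

def parse_record(line):
    """Split one line into identification number, activity and SMILES string."""
    ident, activity, smiles = (field.strip() for field in line.split(",", 2))
    return Record(int(ident), ACTIVITY[int(activity)], smiles)

lines = [
    "737, 0,CN(C)C1=[S+][Zn]2(S1)SC(=[S+]2)N(C)C",
    "20625,2,NC(=N)NC1=C(SSC2=C(NC(N)=N)C=CC=C2)C=CC=C1.OS(O)(=O)=O",
    "64055,1,CC1=CC=CC(=C1)SC[C-]2N=C3C=CC=CC3=C(C)[N+]2=O",
]
records = [parse_record(line) for line in lines]
active = [r.ident for r in records if r.activity == "CA"]
print(active)
```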
Input Format: SMILES Notation and SLN
SMILES Notation: (e.g. Daylight, Inc.)
c1:c:c(-F):c:c2:c:1-C1-C(-C-C-2)-C2-C(-C)(-C-C-1)-C(-O)-C-C-2
SLN (SYBYL Line Notation): (Tripos, Inc.)
C[1]H:CH:C(F):CH:C[8]:C:@1-C[10]H-CH(-CH2-CH2-@8)-C[20]H-C(-CH3)
(-CH2-CH2-@10)-CH(-CH2-CH2-@20)-OH
Represented Molecule:
[Figure: the molecule in full representation (all C, H, F and O atoms drawn) and in simplified representation (only F and O labeled)]
Christian Borgelt Frequent Pattern Mining 297
Input Format: Grammar for SMILES and SLN
General grammar for (linear) molecule descriptions (SMILES and SLN):
Molecule ::= Atom Branch
Branch   ::= ε
           | Bond Atom Branch
           | Bond Label Branch
           | ( Branch ) Branch
Atom     ::= Element LabelDef
LabelDef ::= ε
           | Label LabelDef
black: non-terminal symbols
blue: terminal symbols
The definitions of the non-terminals "Element", "Bond", and "Label"
depend on the chosen description language. For SMILES it is:
Element ::= B | C | N | O | F | [H] | [He] | [Li] | [Be] | ...
Bond    ::= ε | - | = | # | : | .
Label   ::= Digit | % Digit Digit
Digit   ::= 0 | 1 | ... | 9
Christian Borgelt Frequent Pattern Mining 298
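A recognizer for this grammar can be sketched as a small recursive-descent parser. The element list is elided in the grammar, so the token set below (bracket atoms, a few one- and two-letter elements, lowercase aromatic atoms) is an assumption of this sketch, and it only checks the syntax, not chemical validity or whether ring labels are properly paired.

```python
import re

# Assumed token set: bracket atoms, some elements, labels, bonds, parentheses.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOFSPI]|[bcnosp]|%\d\d|\d|[-=#:.]|[()]")
BONDS = {"-", "=", "#", ":", "."}

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens

def is_element(tok):
    return tok[0] == "[" or tok[0].isalpha()

def is_label(tok):
    return tok[0] == "%" or tok[0].isdigit()

def parse_atom(tokens, pos):
    # Atom ::= Element LabelDef ; LabelDef ::= ε | Label LabelDef
    if pos >= len(tokens) or not is_element(tokens[pos]):
        raise ValueError("element expected")
    pos += 1
    while pos < len(tokens) and is_label(tokens[pos]):
        pos += 1
    return pos

def parse_branch(tokens, pos):
    # Branch ::= ε | Bond Atom Branch | Bond Label Branch | ( Branch ) Branch
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos] == "(":
            pos = parse_branch(tokens, pos + 1)
            if pos >= len(tokens):          # the inner branch must end at ")"
                raise ValueError("missing ')'")
            pos += 1
            continue
        if tokens[pos] in BONDS:            # the bond may also be empty (ε)
            pos += 1
            if pos >= len(tokens):
                raise ValueError("dangling bond")
        if is_label(tokens[pos]):
            pos += 1
        else:
            pos = parse_atom(tokens, pos)
    return pos

def is_valid(text):
    """True if the text can be derived from the non-terminal Molecule."""
    try:
        tokens = tokenize(text)
        return parse_branch(tokens, parse_atom(tokens, 0)) == len(tokens)
    except ValueError:
        return False

print(is_valid("N#CC(=CC1=CC=CC=C1)C2=CC=CC=C2"), is_valid("C(C"))
```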
Input Format: SDfile/Ctab
L-Alanine (13C)
user initials, program, date/time etc.
comment
6 5 0 0 1 0 3 V2000
-0.6622 0.5342 0.0000 C 0 0 2 0 0 0
0.6622 -0.3000 0.0000 C 0 0 0 0 0 0
-0.7207 2.0817 0.0000 C 1 0 0 0 0 0
-1.8622 -0.3695 0.0000 N 0 3 0 0 0 0
0.6220 -1.8037 0.0000 O 0 0 0 0 0 0
1.9464 0.4244 0.0000 O 0 5 0 0 0 0
1 2 1 0 0 0
1 3 1 1 0 0
1 4 1 0 0 0
2 5 2 0 0 0
2 6 1 0 0 0
M END
> <value>
0.2
$$$$
[Figure: structure drawing of L-Alanine with the numbered atoms C1, C2, C3, N4, O5, O6]
SDfile: Structure-data file
Ctab: Connection table (lines 4-16)
© Elsevier Science
Christian Borgelt Frequent Pattern Mining 299
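The connection table of the example can be turned into a vertex and an edge list as sketched below. The sketch splits the lines at whitespace, which works for the example; it is an assumption, since the actual file format is column-based and fields may abut in larger molecules. The function name `parse_ctab` is illustrative.

```python
def parse_ctab(lines):
    """Read atoms and bonds from a connection table (counts line first)."""
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, z, symbol = line.split()[:4]          # coordinates, then element
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, bond_type = (int(v) for v in line.split()[:3])
        bonds.append((first, second, bond_type))    # atom numbers are 1-based
    return atoms, bonds

ctab = """6 5 0 0 1 0 3 V2000
-0.6622 0.5342 0.0000 C 0 0 2 0 0 0
0.6622 -0.3000 0.0000 C 0 0 0 0 0 0
-0.7207 2.0817 0.0000 C 1 0 0 0 0 0
-1.8622 -0.3695 0.0000 N 0 3 0 0 0 0
0.6220 -1.8037 0.0000 O 0 0 0 0 0 0
1.9464 0.4244 0.0000 O 0 5 0 0 0 0
1 2 1 0 0 0
1 3 1 1 0 0
1 4 1 0 0 0
2 5 2 0 0 0
2 6 1 0 0 0""".splitlines()

atoms, bonds = parse_ctab(ctab)
print([symbol for symbol, *_ in atoms], bonds)
```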
Finding Common Molecular Substructures
[Figure: Some Molecules from the NCI HIV Database and their Common Fragment]
Christian Borgelt Frequent Pattern Mining 300
Finding Molecular Substructures
Common Molecular Substructures
Analyze only the active molecules.
Find molecular fragments that appear frequently in the molecules.
Discriminative Molecular Substructures
Analyze the active and the inactive molecules.
Find molecular fragments that appear frequently in the active molecules
and only rarely in the inactive molecules.
Rationale in both cases:
The found fragments can give hints which structural properties
are responsible for the activity of a molecule.
This can help to identify drug candidates (so-called pharmacophores)
and to guide future screening efforts.
Christian Borgelt Frequent Pattern Mining 301
Frequent (Sub)Graph Mining
Christian Borgelt Frequent Pattern Mining 302
Frequent (Sub)Graph Mining: General Approach
Finding frequent item sets means to find
sets of items that are contained in many transactions.
Finding frequent substructures means to find
graph fragments that are contained in many graphs
in a given database of attributed graphs (user specifies minimum support).
The graph structure of vertices and edges has to be taken into account.
Search a partially ordered set of graph structures instead of subsets.
Main problem: How can we avoid redundant search?
Usually the search is restricted to connected substructures.
Connected substructures suffice for most applications.
This restriction considerably narrows the search space.
Christian Borgelt Frequent Pattern Mining 303
Frequent (Sub)Graph Mining: Basic Notions
Let $A = \{a_1, \ldots, a_m\}$ be a set of attributes or labels.
A labeled or attributed graph is a triple $G = (V, E, \ell)$, where
$V$ is the set of vertices,
$E \subseteq V \times V - \{(v, v) \mid v \in V\}$ is the set of edges, and
$\ell: V \cup E \to A$ assigns labels from the set $A$ to vertices and edges.
Note that G is undirected and simple and contains no loops.
However, graphs without these restrictions could be handled as well.
Note also that several vertices and edges may have the same attribute/label.
Example: molecule representation
Atom attributes: atom type (chemical element), charge, aromatic ring flag
Bond attributes: bond type (single, double, triple, aromatic)
Christian Borgelt Frequent Pattern Mining 304
Frequent (Sub)Graph Mining: Basic Notions
Note that for labeled graphs the same notions can be used as for normal graphs.
Without formal definition, we will use, for example:
A vertex $v$ is incident to an edge $e$, and the edge is incident to the vertex $v$,
iff $e = (v, v')$ or $e = (v', v)$.
Two different vertices are adjacent or connected
if they are incident to the same edge.
A path is a sequence of edges connecting two vertices.
It is understood that no edge (and no vertex) occurs twice.
A graph is called connected if there exists a path between any two vertices.
A subgraph consists of a subset of the vertices and a subset of the edges.
If S is a (proper) subgraph of G we write $S \subseteq G$ or $S \subset G$, respectively.
A connected component of a graph is a subgraph that is connected and
maximal in the sense that any larger subgraph containing it is not connected.
Christian Borgelt Frequent Pattern Mining 305
Frequent (Sub)Graph Mining: Basic Notions
Note that for labeled graphs the same notions can be used as for normal graphs.
Without formal definition, we will use, for example:
A vertex of a graph is called isolated if it is not incident to any edge.
A vertex of a graph is called a leaf if it is incident to exactly one edge.
An edge of a graph is called a bridge if removing it
increases the number of connected components of the graph.
More intuitively: a bridge is the only connection between two vertices,
that is, there is no other path on which one can reach the one from the other.
An edge of a graph is called a proper bridge
if it is a bridge and not incident to a leaf.
In other words: a bridge is a proper bridge if removing it does not create an isolated vertex.
All other bridges are called leaf bridges
(because they are incident to at least one leaf).
Christian Borgelt Frequent Pattern Mining 306
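These notions can be made concrete with a short sketch that finds all bridges by removing each edge in turn and counting connected components. This follows the definition literally and is not an efficient bridge-finding algorithm; the function names and the edge-list representation are assumptions of the sketch.

```python
from collections import defaultdict

def count_components(vertices, edges):
    """Number of connected components of an undirected graph."""
    adjacent = defaultdict(set)
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, components = set(), 0
    for start in vertices:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:                        # depth-first traversal
            u = stack.pop()
            for w in adjacent[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return components

def classify_bridges(vertices, edges):
    """Map each bridge to 'leaf bridge' or 'proper bridge'."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    base = count_components(vertices, edges)
    result = {}
    for edge in edges:
        remaining = [f for f in edges if f != edge]
        if count_components(vertices, remaining) > base:      # edge is a bridge
            u, v = edge
            is_leaf_bridge = degree[u] == 1 or degree[v] == 1
            result[edge] = "leaf bridge" if is_leaf_bridge else "proper bridge"
    return result

# a triangle 1-2-3 with a chain 3-4-5 attached
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
print(classify_bridges(vertices, edges))
```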
Frequent (Sub)Graph Mining: Basic Notions
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs.
A subgraph isomorphism of S to G or an occurrence of S in G
is an injective function $f: V_S \to V_G$ with
$\forall v \in V_S: \ell_S(v) = \ell_G(f(v))$ and
$\forall (u, v) \in E_S: (f(u), f(v)) \in E_G \wedge \ell_S((u, v)) = \ell_G((f(u), f(v)))$.
That is, the mapping f preserves the connection structure and the labels.
If such a mapping f exists, we write $S \sqsubseteq G$.
Note that there may be several ways to map a labeled graph S to a labeled graph G
so that the connection structure and the vertex and edge labels are preserved.
For example, G may possess several subgraphs that are isomorphic to S.
It may even be that the graph S can be mapped in several different ways to the
same subgraph of G. This is the case if there exists a subgraph isomorphism of S
to itself (a so-called graph automorphism) that is not the identity.
Christian Borgelt Frequent Pattern Mining 307
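The definition translates into a simple backtracking search that enumerates all occurrences. The sketch below is a plain exhaustive search (exponential in the worst case), not an optimized matcher; the representation of a graph as a pair of label dictionaries and the helper names are assumptions made for illustration.

```python
def graph(vertex_labels, edges):
    """Build a labeled graph: (vertex -> label, {u, v} -> label)."""
    return vertex_labels, {frozenset((u, v)): lab for u, v, lab in edges}

def occurrences(S, G):
    """Yield all subgraph isomorphisms of S to G as dicts V_S -> V_G."""
    s_vlabels, s_elabels = S
    g_vlabels, g_elabels = G
    s_vertices = list(s_vlabels)

    def consistent(u, x, mapping):
        if s_vlabels[u] != g_vlabels[x]:            # vertex labels must agree
            return False
        for v, y in mapping.items():                # edges to mapped vertices
            edge = frozenset((u, v))
            if edge in s_elabels and \
               s_elabels[edge] != g_elabels.get(frozenset((x, y))):
                return False
        return True

    def extend(i, mapping):
        if i == len(s_vertices):
            yield dict(mapping)
            return
        u = s_vertices[i]
        for x in g_vlabels:
            # injective: no vertex of G may be used twice
            if x not in mapping.values() and consistent(u, x, mapping):
                mapping[u] = x
                yield from extend(i + 1, mapping)
                del mapping[u]

    yield from extend(0, {})

# G: a carbon atom with two single-bonded oxygen atoms and one nitrogen atom
G = graph({1: "C", 2: "O", 3: "O", 4: "N"},
          [(1, 2, "-"), (1, 3, "-"), (1, 4, "-")])
S1 = graph({"a": "C", "b": "O"}, [("a", "b", "-")])
S2 = graph({"a": "O", "b": "C", "c": "O"}, [("a", "b", "-"), ("b", "c", "-")])
print(len(list(occurrences(S1, G))), len(list(occurrences(S2, G))))
```

In the example, the two occurrences of `S2` cover the same subgraph of `G`: they differ only by the automorphism of `S2` that exchanges its two oxygen atoms.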
Frequent (Sub)Graph Mining: Basic Notions
Let S and G be two labeled graphs.
S and G are called isomorphic, written $S \equiv G$, iff $S \sqsubseteq G$ and $G \sqsubseteq S$.
In this case a function f mapping S to G is called a graph isomorphism.
A function f mapping S to itself is called a graph automorphism.
S is properly contained in G, written $S \sqsubset G$, iff $S \sqsubseteq G$ and $S \not\equiv G$.
If $S \sqsubseteq G$ or $S \sqsubset G$, then there exists a (proper) subgraph $G'$ of G,
such that S and $G'$ are isomorphic.
This explains the term "subgraph isomorphism".
The set of all connected subgraphs of G is denoted by $\mathcal{C}(G)$.
It is obvious that for all $S \in \mathcal{C}(G)$: $S \sqsubseteq G$.
However, there are (unconnected) graphs S with $S \sqsubseteq G$ that are not in $\mathcal{C}(G)$.
The set of all (connected) subgraphs is analogous to the power set of a set.
Christian Borgelt Frequent Pattern Mining 308
Subgraph Isomorphism: Examples
[Figure: a molecule $G$ and two fragments $S_1$ and $S_2$]
A molecule G that represents a graph in a database
and two graphs $S_1$ and $S_2$ that are contained in G.
The subgraph relationship is formally described by a mapping f
of the vertices of one graph to the vertices of another:
$G = (V_G, E_G)$, $S = (V_S, E_S)$, $f: V_S \to V_G$.
This mapping must preserve the connection structure and the labels.
Christian Borgelt Frequent Pattern Mining 309
Subgraph Isomorphism: Examples
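[Figure: the molecule $G$ with the mappings $f_1: V_{S_1} \to V_G$ and $f_2: V_{S_2} \to V_G$ of the fragments $S_1$ and $S_2$]
The mapping must preserve the connection structure:
$\forall (u, v) \in E_S: (f(u), f(v)) \in E_G$.
The mapping must preserve vertex and edge labels:
$\forall v \in V_S: \ell_S(v) = \ell_G(f(v))$, $\forall (u, v) \in E_S: \ell_S((u, v)) = \ell_G((f(u), f(v)))$.
Here: oxygen must be mapped to oxygen, single bonds to single bonds etc.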
Christian Borgelt Frequent Pattern Mining 310
Subgraph Isomorphism: Examples
[Figure: the molecule $G$ with the mappings $f_1: V_{S_1} \to V_G$, $f_2: V_{S_2} \to V_G$ and $g_2: V_{S_2} \to V_G$]
There may be more than one possible mapping / occurrence.
(There are even three more occurrences of $S_2$.)
However, we are currently only interested in whether there exists a mapping.
(The number of occurrences will become important
when we consider mining frequent (sub)graphs in a single graph.)
Testing whether a subgraph isomorphism exists between given graphs S and G
is NP-complete (that is, requires exponential time unless P = NP).
Christian Borgelt Frequent Pattern Mining 311
Subgraph Isomorphism: Examples
[Figure: the molecule $G$ with the mappings $f_1: V_{S_1} \to V_G$, $f_3: V_{S_3} \to V_G$ and $g_3: V_{S_3} \to V_G$]
A graph may be mapped to itself (automorphism).
Trivially, every graph possesses the identity as an automorphism.
(Every graph can be mapped to itself by mapping each node to itself.)
If a graph (fragment) possesses an automorphism that is not the identity
there is more than one occurrence at the same location in another graph.
The number of occurrences of a graph (fragment) in a graph can be huge.
Christian Borgelt Frequent Pattern Mining 312
Frequent (Sub)Graph Mining: Basic Notions
Let S be a labeled graph and $\mathcal{G}$ a vector of labeled graphs.
A labeled graph $G \in \mathcal{G}$ covers the labeled graph S or
the labeled graph S is contained in a labeled graph $G \in \mathcal{G}$ iff $S \sqsubseteq G$.
The set $K_{\mathcal{G}}(S) = \{k \in \{1, \ldots, n\} \mid S \sqsubseteq G_k\}$ is called the cover of S w.r.t. $\mathcal{G}$.
The cover of a graph is the index set of the database graphs that cover it.
It may also be defined as a vector of all labeled graphs that cover it
(which, however, is complicated to write in a formally correct way).
The value $s_{\mathcal{G}}(S) = |K_{\mathcal{G}}(S)|$ is called the (absolute) support of S w.r.t. $\mathcal{G}$.
The value $\sigma_{\mathcal{G}}(S) = \frac{1}{n} |K_{\mathcal{G}}(S)|$ is called the relative support of S w.r.t. $\mathcal{G}$.
The support of S is the number or fraction of labeled graphs that contain it.
Sometimes $\sigma_{\mathcal{G}}(S)$ is also called the (relative) frequency of S w.r.t. $\mathcal{G}$.
Christian Borgelt Frequent Pattern Mining 313
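Frequent (Sub)Graph Mining: Formal Definition
Given:
a set $A = \{a_1, \ldots, a_m\}$ of attributes or labels,
a vector $\mathcal{G} = (G_1, \ldots, G_n)$ of graphs with labels in A,
a number $s_{\min} \in \mathbb{N}$, $0 < s_{\min} \le n$, or (equivalently)
a number $\sigma_{\min} \in \mathbb{R}$, $0 < \sigma_{\min} \le 1$, the minimum support.
Desired:
the set of frequent (sub)graphs or frequent fragments, that is,
the set $F_{\mathcal{G}}(s_{\min}) = \{S \mid s_{\mathcal{G}}(S) \ge s_{\min}\}$ or (equivalently)
the set $\Phi_{\mathcal{G}}(\sigma_{\min}) = \{S \mid \sigma_{\mathcal{G}}(S) \ge \sigma_{\min}\}$.
Note that with the relations $s_{\min} = \lceil n \sigma_{\min} \rceil$ and $\sigma_{\min} = \frac{1}{n} s_{\min}$
the two versions can easily be transformed into each other.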
Christian Borgelt Frequent Pattern Mining 314
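Cover, support and the conversion between the two minimum support versions can be sketched generically. The containment test is passed in as a function, so the same code works for graphs (with a subgraph isomorphism test) and for item sets; the example uses item sets with the subset test as a stand-in, which is an assumption made to keep the sketch short. Exact fractions avoid rounding problems in the ceiling.

```python
from fractions import Fraction
from math import ceil

def cover(pattern, database, contains):
    """Index set (1-based) of the database entries that contain the pattern."""
    return {k for k, g in enumerate(database, start=1) if contains(pattern, g)}

def absolute_support(pattern, database, contains):
    return len(cover(pattern, database, contains))

def relative_support(pattern, database, contains):
    return Fraction(absolute_support(pattern, database, contains), len(database))

def to_absolute(sigma_min, n):
    """s_min = ceil(n * sigma_min), with sigma_min given as a Fraction."""
    return ceil(n * Fraction(sigma_min))

def to_relative(s_min, n):
    """sigma_min = s_min / n."""
    return Fraction(s_min, n)

# stand-in database: item sets instead of graphs, subset test as containment
db = [set("ade"), set("bcd"), set("ace"), set("acde"), set("ae"),
      set("acd"), set("bc"), set("acde"), set("bce"), set("ade")]
is_subset = lambda s, t: s <= t

print(sorted(cover(set("ae"), db, is_subset)),
      relative_support(set("ae"), db, is_subset),
      to_absolute(Fraction(3, 10), len(db)))
```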
Frequent (Sub)Graphs: Example
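example molecules
(graph database)
[Figure: three example molecules with the atoms S, C, N, C, O; O, S, C, N, F; and O, S, C, N, O]
The numbers
below the subgraphs
state their support.
frequent molecular fragments ($s_{\min} = 2$), listed by their atoms:
(empty graph): 3
S: 3, O: 3, C: 3, N: 3
O S: 2, S C: 3, C O: 2, C N: 3
O S C: 2, S C N: 3, S C O: 2, N C O: 2
O S C N: 2, S C N O: 2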
Christian Borgelt Frequent Pattern Mining 315
Properties of the Support of (Sub)Graphs
A brute force approach that enumerates all possible (sub)graphs, determines
their support, and discards infrequent (sub)graphs is usually infeasible:
The number of possible (connected) (sub)graphs
grows very quickly with the number of vertices and edges.
Idea: Consider the properties of the support, in particular:
$\forall S: \forall R \supseteq S: K_{\mathcal{G}}(R) \subseteq K_{\mathcal{G}}(S)$.
This property holds, because $\forall G: \forall S: \forall R \supseteq S: R \sqsubseteq G \Rightarrow S \sqsubseteq G$.
Each additional edge is another condition a database graph has to satisfy.
Graphs that do not satisfy this condition are removed from the cover.
It follows: $\forall S: \forall R \supseteq S: s_{\mathcal{G}}(R) \le s_{\mathcal{G}}(S)$.
That is: If a (sub)graph is extended, its support cannot increase.
One also says that support is anti-monotone or downward closed.
Christian Borgelt Frequent Pattern Mining 316
Properties of the Support of (Sub)Graphs
From $\forall S: \forall R \supseteq S: s_{\mathcal{G}}(R) \le s_{\mathcal{G}}(S)$ it follows
$\forall s_{\min}: \forall S: \forall R \supseteq S: s_{\mathcal{G}}(S) < s_{\min} \Rightarrow s_{\mathcal{G}}(R) < s_{\min}$.
That is: No supergraph of an infrequent (sub)graph can be frequent.
This property is often referred to as the Apriori Property.
Rationale: Sometimes we can know a priori, that is, before checking its support
by accessing the given graph database, that a (sub)graph cannot be frequent.
Of course, the contraposition of this implication also holds:
$\forall s_{\min}: \forall S: \forall R \subseteq S: s_{\mathcal{G}}(S) \ge s_{\min} \Rightarrow s_{\mathcal{G}}(R) \ge s_{\min}$.
That is: All subgraphs of a frequent (sub)graph are frequent.
This suggests a compressed representation of the set of frequent (sub)graphs.
Christian Borgelt Frequent Pattern Mining 317
Reminder: Partially Ordered Sets
A partial order is a binary relation $\le$ over a set S which satisfies $\forall a, b, c \in S$:
$a \le a$ (reflexivity)
$a \le b \wedge b \le a \Rightarrow a = b$ (anti-symmetry)
$a \le b \wedge b \le c \Rightarrow a \le c$ (transitivity)
A set with a partial order is called a partially ordered set (or poset for short).
Let a and b be two distinct elements of a partially ordered set $(S, \le)$.
If $a \le b$ or $b \le a$, then a and b are called comparable.
If neither $a \le b$ nor $b \le a$, then a and b are called incomparable.
If all pairs of elements of the underlying set S are comparable,
the order $\le$ is called a total order or a linear order.
In a total order the reflexivity axiom is replaced by the stronger axiom:
$a \le b \vee b \le a$ (totality)
Christian Borgelt Frequent Pattern Mining 318
Properties of the Support of (Sub)Graphs
Monotonicity in Calculus and Analysis
A function $f: \mathbb{R} \to \mathbb{R}$ is called monotonically non-decreasing
if $\forall x, y: x \le y \Rightarrow f(x) \le f(y)$.
A function $f: \mathbb{R} \to \mathbb{R}$ is called monotonically non-increasing
if $\forall x, y: x \le y \Rightarrow f(x) \ge f(y)$.
Monotonicity in Order Theory
Order theory is concerned with arbitrary partially ordered sets.
The terms increasing and decreasing are avoided, because they lose their pictorial
motivation as soon as sets are considered that are not totally ordered.
A function $f: S_1 \to S_2$, where $S_1$ and $S_2$ are two partially ordered sets, is called
monotone or order-preserving if $\forall x, y \in S_1: x \le y \Rightarrow f(x) \le f(y)$.
A function $f: S_1 \to S_2$ is called
anti-monotone or order-reversing if $\forall x, y \in S_1: x \le y \Rightarrow f(x) \ge f(y)$.
In this sense the support of a (sub)graph is anti-monotone.
Christian Borgelt Frequent Pattern Mining 319
Properties of Frequent (Sub)Graphs
A subset $R$ of a partially ordered set $(S, \le)$ is called downward closed
if for any element of the set all smaller elements are also in it:
$\forall x \in R: \forall y \in S: y \le x \Rightarrow y \in R$
In this case the subset $R$ is also called a lower set.
The notions of upward closed and upper set are defined analogously.
For every $s_{\min}$ the set of frequent (sub)graphs $F_{\mathcal{G}}(s_{\min})$
is downward closed w.r.t. the partial order $\sqsubseteq$:
$\forall S \in F_{\mathcal{G}}(s_{\min}): S \sqsupseteq R \Rightarrow R \in F_{\mathcal{G}}(s_{\min})$
Since the set of frequent (sub)graphs is induced by the support function,
the notions of up- or downward closed are transferred to the support function:
Any set of (sub)graphs induced by a support threshold $\theta$ is up- or downward closed.
$F_{\mathcal{G}}(\theta) = \{S \mid s_{\mathcal{G}}(S) \ge \theta\}$ is downward closed,
$I_{\mathcal{G}}(\theta) = \{S \mid s_{\mathcal{G}}(S) < \theta\}$ is upward closed.
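The closure properties can be verified by brute force on a tiny example. The sketch below uses item sets with the subset relation as a stand-in for (sub)graphs with the subgraph relation, because the subset test is trivial to code; the transaction database and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy transaction database (stand-in for a graph database).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
items = sorted(set().union(*transactions))

def support(item_set):
    """Number of transactions that contain the item set."""
    return sum(1 for t in transactions if item_set <= t)

def all_item_sets():
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def frequent(theta):
    return {s for s in all_item_sets() if support(s) >= theta}

def is_downward_closed(family):
    """Every subset of a member must be a member as well."""
    return all(frozenset(sub) in family
               for s in family
               for k in range(len(s) + 1)
               for sub in combinations(sorted(s), k))

def is_upward_closed(family, universe):
    """Every superset (within the universe) of a member must be a member."""
    return all(t in family for s in family for t in universe if s <= t)

universe = set(all_item_sets())
checks = [(is_downward_closed(frequent(t)),
           is_upward_closed(universe - frequent(t), universe))
          for t in range(1, 6)]
```

For every threshold the frequent sets form a lower set and the infrequent sets form an upper set, which is exactly what the anti-monotonicity of the support implies.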
Types of Frequent (Sub)Graphs
Maximal (Sub)Graphs
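Consider the set of maximal (frequent) (sub)graphs / fragments:
$M_{\mathcal{G}}(s_{\min}) = \{S \mid s_{\mathcal{G}}(S) \ge s_{\min} \wedge \forall R \sqsupset S: s_{\mathcal{G}}(R) < s_{\min}\}$.
That is: A (sub)graph is maximal if it is frequent,
but none of its proper supergraphs is frequent.
Since with this definition we know that
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}): S \in M_{\mathcal{G}}(s_{\min}) \vee \exists R \sqsupset S: s_{\mathcal{G}}(R) \ge s_{\min}$
it follows (can easily be proven by successively extending the graph $S$)
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}): \exists R \in M_{\mathcal{G}}(s_{\min}): S \sqsubseteq R$.
That is: Every frequent (sub)graph has a maximal supergraph.
Therefore: $\forall s_{\min}: F_{\mathcal{G}}(s_{\min}) = \bigcup_{S \in M_{\mathcal{G}}(s_{\min})} \mathcal{C}(S)$.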
Reminder: Maximal Elements
Let $R$ be a subset of a partially ordered set $(S, \le)$.
An element $x \in R$ is called maximal or a maximal element of $R$ if
$\forall y \in R: x \le y \Rightarrow x = y$.
The notions minimal and minimal element are defined analogously.
Maximal elements need not be unique,
because there may be elements $y \in R$ with neither $x \le y$ nor $y \le x$.
Infinite partially ordered sets need not possess a maximal element.
Here we consider the set $F_{\mathcal{G}}(s_{\min})$ together with the partial order $\sqsubseteq$:
The maximal (frequent) (sub)graphs are the maximal elements of $F_{\mathcal{G}}(s_{\min})$:
$M_{\mathcal{G}}(s_{\min}) = \{S \in F_{\mathcal{G}}(s_{\min}) \mid \forall R \in F_{\mathcal{G}}(s_{\min}): S \sqsubseteq R \Rightarrow S = R\}$.
That is, no proper supergraph of a maximal (frequent) (sub)graph is frequent.
Maximal (Sub)Graphs: Example
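[Figure: three example molecules (the graph database) and the frequent molecular fragments for $s_{\min} = 2$, arranged by size. The numbers below the subgraphs state their support.]
Support values recoverable from the figure (bond types are not shown here):
(empty graph): 3
S: 3, O: 3, C: 3, N: 3
O-S: 2, S-C: 3, C-O: 2, C-N: 3
O-S-C: 2, S-C-N: 3, S-C-O: 2, N-C-O: 2
O-S-C-N: 2, S-C-N with O attached (branched): 2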
Limits of Maximal (Sub)Graphs
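The set of maximal (sub)graphs captures the set of all frequent (sub)graphs,
but then we know only the support of the maximal (sub)graphs.
About the support of a non-maximal frequent (sub)graph we only know:
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}) - M_{\mathcal{G}}(s_{\min}): s_{\mathcal{G}}(S) \ge \max_{R \in M_{\mathcal{G}}(s_{\min}),\, R \sqsupset S} s_{\mathcal{G}}(R)$.
This relation follows immediately from $\forall S: \forall R \sqsupseteq S: s_{\mathcal{G}}(S) \ge s_{\mathcal{G}}(R)$,
that is, a (sub)graph cannot have a lower support than any of its supergraphs.
Note that we have generally
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}): s_{\mathcal{G}}(S) \ge \max_{R \in M_{\mathcal{G}}(s_{\min}),\, R \sqsupseteq S} s_{\mathcal{G}}(R)$.
Question: Can we find a subset of the set of all frequent (sub)graphs,
which also preserves knowledge of all support values?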
Closed (Sub)Graphs
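Consider the set of closed (frequent) (sub)graphs / fragments:
$C_{\mathcal{G}}(s_{\min}) = \{S \mid s_{\mathcal{G}}(S) \ge s_{\min} \wedge \forall R \sqsupset S: s_{\mathcal{G}}(R) < s_{\mathcal{G}}(S)\}$.
That is: A (sub)graph is closed if it is frequent,
but none of its proper supergraphs has the same support.
Since with this definition we know that
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}): S \in C_{\mathcal{G}}(s_{\min}) \vee \exists R \sqsupset S: s_{\mathcal{G}}(R) = s_{\mathcal{G}}(S)$
it follows (can easily be proven by successively extending the graph $S$)
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}): \exists R \in C_{\mathcal{G}}(s_{\min}): S \sqsubseteq R$.
That is: Every frequent (sub)graph has a closed supergraph.
Therefore: $\forall s_{\min}: F_{\mathcal{G}}(s_{\min}) = \bigcup_{S \in C_{\mathcal{G}}(s_{\min})} \mathcal{C}(S)$.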
Closed (Sub)Graphs
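However, not only has every frequent (sub)graph a closed supergraph,
but it has a closed supergraph with the same support:
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}): \exists R \sqsupseteq S: R \in C_{\mathcal{G}}(s_{\min}) \wedge s_{\mathcal{G}}(R) = s_{\mathcal{G}}(S)$.
(Proof: consider the closure operator that is defined on the following slides.)
Note, however, that the supergraph need not be unique (see below).
The set of all closed (sub)graphs preserves knowledge of all support values:
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}): s_{\mathcal{G}}(S) = \max_{R \in C_{\mathcal{G}}(s_{\min}),\, R \sqsupseteq S} s_{\mathcal{G}}(R)$.
Note that the weaker statement
$\forall s_{\min}: \forall S \in F_{\mathcal{G}}(s_{\min}): s_{\mathcal{G}}(S) \ge \max_{R \in C_{\mathcal{G}}(s_{\min}),\, R \sqsupseteq S} s_{\mathcal{G}}(R)$
follows immediately from $\forall S: \forall R \sqsupseteq S: s_{\mathcal{G}}(S) \ge s_{\mathcal{G}}(R)$, that is,
a (sub)graph cannot have a lower support than any of its supergraphs.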
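The difference between the maximal and the closed sets can be made concrete with a small computation. The sketch below again uses item sets as a stand-in for (sub)graphs (subset instead of subgraph test); the toy database and the helper names are assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical toy database; item sets stand in for (sub)graphs.
transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab"),
                frozenset("c"), frozenset("cd")]
items = sorted(frozenset().union(*transactions))

def support(item_set):
    return sum(1 for t in transactions if item_set <= t)

def frequent_item_sets(s_min):
    """Map each frequent item set to its support."""
    return {frozenset(c): support(frozenset(c))
            for k in range(len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= s_min}

def closed_item_sets(frequent):
    # Anti-monotone support: checking one-item extensions is sufficient.
    return {s: n for s, n in frequent.items()
            if all(support(s | {i}) < n for i in items if i not in s)}

def maximal_item_sets(frequent):
    return {s: n for s, n in frequent.items()
            if not any(s < r for r in frequent)}

def support_from(family, item_set):
    """Largest support among the members of family that contain item_set."""
    return max(n for r, n in family.items() if item_set <= r)

frequent = frequent_item_sets(2)
closed = closed_item_sets(frequent)
maximal = maximal_item_sets(frequent)
```

In this example `support_from(closed, s)` reproduces the support of every frequent set exactly, whereas `support_from(maximal, s)` only yields a lower bound (2 instead of 3 for the item set {a}).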
Reminder: Closure Operators
A closure operator on a set $S$ is a function $cl: 2^S \to 2^S$,
which satisfies the following conditions $\forall X, Y \subseteq S$:
$X \subseteq cl(X)$ ($cl$ is extensive)
$X \subseteq Y \Rightarrow cl(X) \subseteq cl(Y)$ ($cl$ is increasing or monotone)
$cl(cl(X)) = cl(X)$ ($cl$ is idempotent)
A set $R \subseteq S$ is called closed if it is equal to its closure:
$R$ is closed $\Leftrightarrow R = cl(R)$.
The closed (frequent) item sets are induced by the closure operator
$cl(I) = \bigcap_{k \in K_T(I)} t_k$,
restricted to the set of frequent item sets:
$C_T(s_{\min}) = \{I \in F_T(s_{\min}) \mid I = cl(I)\}$
Closed (Sub)Graphs
Question: Is there a closure operator that induces the closed (sub)graphs?
At first glance, it appears natural to transfer the operation
$cl(I) = \bigcap_{k \in K_T(I)} t_k$
by replacing the intersection with the greatest common subgraph.
Unfortunately, this is not possible, because the greatest common subgraph
of two (or more) graphs need not be uniquely defined.
Consider the two graphs (which are actually chains):
A-B-C and A-B-B-C.
There are two greatest common subgraphs:
A-B and B-C.
As a consequence, the intersection of a set of database graphs
can yield a set of graphs instead of a single common graph.
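The chain example can be reproduced with a few lines of code. This is a sketch under strong simplifying assumptions: graphs are restricted to chains, vertex labels are single characters, and a chain is written as a string, so that connected subgraphs are contiguous substrings (read in either direction). The function names are made up for this illustration.

```python
def chain_fragments(chain):
    """All connected subgraphs of a chain, ignoring the reading direction."""
    pieces = set()
    for i in range(len(chain)):
        for j in range(i + 1, len(chain) + 1):
            piece = chain[i:j]
            pieces.add(min(piece, piece[::-1]))  # one orientation per fragment
    return pieces

def greatest_common_fragments(chains):
    """Largest fragments that occur in every chain (assumes one exists)."""
    common = set.intersection(*(chain_fragments(c) for c in chains))
    longest = max(len(p) for p in common)
    return sorted(p for p in common if len(p) == longest)

result = greatest_common_fragments(["ABC", "ABBC"])
```

The result contains two fragments of equal size, AB and BC, so an "intersection" of the two chains is a set of graphs rather than a single graph.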
Reminder: Galois Connections
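Let $(X, \preceq_X)$ and $(Y, \preceq_Y)$ be two partially ordered sets.
A function pair $(f_1, f_2)$ with $f_1: X \to Y$ and $f_2: Y \to X$
is called a (monotone) Galois connection iff
$\forall A_1, A_2 \in X: A_1 \preceq_X A_2 \Rightarrow f_1(A_1) \preceq_Y f_1(A_2)$,
$\forall B_1, B_2 \in Y: B_1 \preceq_Y B_2 \Rightarrow f_2(B_1) \preceq_X f_2(B_2)$,
$\forall A \in X: \forall B \in Y: A \preceq_X f_2(B) \Leftrightarrow B \succeq_Y f_1(A)$.
A function pair $(f_1, f_2)$ with $f_1: X \to Y$ and $f_2: Y \to X$
is called an anti-monotone Galois connection iff
$\forall A_1, A_2 \in X: A_1 \preceq_X A_2 \Rightarrow f_1(A_1) \succeq_Y f_1(A_2)$,
$\forall B_1, B_2 \in Y: B_1 \preceq_Y B_2 \Rightarrow f_2(B_1) \succeq_X f_2(B_2)$,
$\forall A \in X: \forall B \in Y: A \preceq_X f_2(B) \Leftrightarrow B \preceq_Y f_1(A)$.
In a monotone Galois connection, both $f_1$ and $f_2$ are monotone,
in an anti-monotone Galois connection, both $f_1$ and $f_2$ are anti-monotone.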
Reminder: Galois Connections
Galois Connections and Closure Operators
Let the two sets $X$ and $Y$ be power sets of some sets $U$ and $V$, respectively,
and let the partial orders be the subset relations on these power sets, that is, let
$(X, \preceq_X) = (2^U, \subseteq)$ and $(Y, \preceq_Y) = (2^V, \subseteq)$.
Then the combination $f_1 \circ f_2: X \to X$ (first $f_1$, then $f_2$) of the functions of a Galois connection
is a closure operator (as well as the combination $f_2 \circ f_1: Y \to Y$).
Galois Connections in Frequent Item Set Mining
Consider the partially ordered sets $(2^B, \subseteq)$ and $(2^{\{1,\ldots,n\}}, \subseteq)$.
Let $f_1: 2^B \to 2^{\{1,\ldots,n\}},\ I \mapsto K_T(I) = \{k \in \{1,\ldots,n\} \mid I \subseteq t_k\}$
and $f_2: 2^{\{1,\ldots,n\}} \to 2^B,\ J \mapsto \bigcap_{j \in J} t_j = \{i \in B \mid \forall j \in J: i \in t_j\}$.
The function pair $(f_1, f_2)$ is an anti-monotone Galois connection.
Therefore the combination $f_1 \circ f_2: 2^B \to 2^B$ is a closure operator.
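The item set case is small enough to test exhaustively. The following sketch defines $f_1$ and $f_2$ for a hypothetical toy database (transaction identifiers are 0-based here, unlike the 1-based indices of the slides) and checks the defining equivalence of an anti-monotone Galois connection as well as the closure operator properties of "first $f_1$, then $f_2$".

```python
from itertools import combinations

# Hypothetical toy database; identifiers are list positions (0-based).
transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab"),
                frozenset("c"), frozenset("cd")]
item_base = frozenset().union(*transactions)
tids = frozenset(range(len(transactions)))

def f1(item_set):
    """Cover: identifiers of the transactions that contain the item set."""
    return frozenset(k for k in tids if item_set <= transactions[k])

def f2(tid_set):
    """Items common to all selected transactions (whole base if none selected)."""
    result = set(item_base)
    for j in tid_set:
        result &= transactions[j]
    return frozenset(result)

def closure(item_set):
    return f2(f1(item_set))

def powerset(base):
    base = sorted(base)
    return [frozenset(c) for k in range(len(base) + 1)
            for c in combinations(base, k)]

galois_holds = all((a <= f2(b)) == (b <= f1(a))
                   for a in powerset(item_base) for b in powerset(tids))
extensive = all(a <= closure(a) for a in powerset(item_base))
idempotent = all(closure(closure(a)) == closure(a) for a in powerset(item_base))
```

For example, `closure(frozenset("a"))` yields {a, b}, because every transaction that contains a also contains b.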
Galois Connections in Frequent (Sub)Graph Mining
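Let $\mathcal{G} = (G_1, \ldots, G_n)$ be a vector of database graphs.
Let $U$ be the set of all subgraphs of the database graphs in $\mathcal{G}$, that is,
$U = \{S \mid \exists i \in \{1, \ldots, n\}: S \sqsubseteq G_i\}$.
Let $V$ be the index set of the database graphs in $\mathcal{G}$, that is,
$V = \{1, \ldots, n\}$ (set of graph identifiers).
$(2^U, \subseteq)$ and $(2^V, \subseteq)$ are partially ordered sets. Consider the function pair
$f_1: 2^U \to 2^V,\ I \mapsto \{k \in V \mid \forall S \in I: S \sqsubseteq G_k\}$ and
$f_2: 2^V \to 2^U,\ J \mapsto \{S \in U \mid \forall k \in J: S \sqsubseteq G_k\}$.
The pair $(f_1, f_2)$ is a Galois connection of $X = (2^U, \subseteq)$ and $Y = (2^V, \subseteq)$:
$\forall A_1, A_2 \in 2^U: A_1 \subseteq A_2 \Rightarrow f_1(A_1) \supseteq f_1(A_2)$,
$\forall B_1, B_2 \in 2^V: B_1 \subseteq B_2 \Rightarrow f_2(B_1) \supseteq f_2(B_2)$,
$\forall A \in 2^U: \forall B \in 2^V: A \subseteq f_2(B) \Leftrightarrow B \subseteq f_1(A)$.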
Galois Connections in Frequent (Sub)Graph Mining
Since the function pair $(f_1, f_2)$ is an (anti-monotone) Galois connection,
$f_2 \circ f_1: 2^U \to 2^U$ is a closure operator.
This closure operator can be used to define the closed (sub)graphs:
A subgraph $S$ is closed w.r.t. a graph database $\mathcal{G}$ iff
$S \in (f_2 \circ f_1)(\{S\}) \wedge \neg\exists G \in (f_2 \circ f_1)(\{S\}): S \sqsubset G$.
The generalization to a Galois connection takes formally care of the problem
that the greatest common subgraph may not be uniquely determined.
Intuitively, the above definition simply says that a subgraph $S$ is closed iff
it is a common subgraph of all database graphs containing it and
no supergraph of it is also a common subgraph of these graphs.
That is, a subgraph $S$ is closed if it is one of the greatest common subgraphs
of all database graphs containing it.
The Galois connection is only needed to prove the closure operator property.
Closed (Sub)Graphs: Example
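[Figure: three example molecules (the graph database) and the frequent molecular fragments for $s_{\min} = 2$, arranged by size. The numbers below the subgraphs state their support.]
Support values recoverable from the figure (bond types are not shown here):
(empty graph): 3
S: 3, O: 3, C: 3, N: 3
O-S: 2, S-C: 3, C-O: 2, C-N: 3
O-S-C: 2, S-C-N: 3, S-C-O: 2, N-C-O: 2
O-S-C-N: 2, S-C-N with O attached (branched): 2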
Types of Frequent (Sub)Graphs
Frequent (Sub)Graph
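Any frequent (sub)graph (support is at least the minimum support):
$S$ frequent $\Leftrightarrow s_{\mathcal{G}}(S) \ge s_{\min}$
Closed (Sub)Graph
A frequent (sub)graph is called closed if no proper supergraph has the same support:
$S$ closed $\Leftrightarrow s_{\mathcal{G}}(S) \ge s_{\min} \wedge \forall R \sqsupset S: s_{\mathcal{G}}(R) < s_{\mathcal{G}}(S)$
Maximal (Sub)Graph
A frequent (sub)graph is called maximal if no proper supergraph is frequent:
$S$ maximal $\Leftrightarrow s_{\mathcal{G}}(S) \ge s_{\min} \wedge \forall R \sqsupset S: s_{\mathcal{G}}(R) < s_{\min}$
Obvious relations between these types of (sub)graphs:
All maximal and all closed (sub)graphs are frequent.
All maximal (sub)graphs are closed.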
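The three types can be computed for the molecular fragment example with $s_{\min} = 2$. In the sketch below the fragment names, their supports and the "immediate supergraph" relation are transcribed by hand from the example figure; this transcription (in particular the name `S-C(-O)-N` for the branched fragment) is an assumption, since the figure itself is not reproduced here.

```python
# Supports of the 14 non-empty frequent fragments (s_min = 2), transcribed by hand.
support = {
    "S": 3, "O": 3, "C": 3, "N": 3,
    "O-S": 2, "S-C": 3, "C-O": 2, "C-N": 3,
    "O-S-C": 2, "S-C-N": 3, "S-C-O": 2, "N-C-O": 2,
    "O-S-C-N": 2, "S-C(-O)-N": 2,
}

# Immediate frequent supergraphs (one more edge), also transcribed by hand.
covers = {
    "S": ["O-S", "S-C"], "O": ["O-S", "C-O"],
    "C": ["S-C", "C-O", "C-N"], "N": ["C-N"],
    "O-S": ["O-S-C"], "S-C": ["O-S-C", "S-C-N", "S-C-O"],
    "C-O": ["S-C-O", "N-C-O"], "C-N": ["S-C-N", "N-C-O"],
    "O-S-C": ["O-S-C-N"], "S-C-N": ["O-S-C-N", "S-C(-O)-N"],
    "S-C-O": ["S-C(-O)-N"], "N-C-O": ["S-C(-O)-N"],
    "O-S-C-N": [], "S-C(-O)-N": [],
}

def supergraphs(fragment):
    """All frequent proper supergraphs, found by following the cover relation."""
    found, stack = set(), list(covers[fragment])
    while stack:
        r = stack.pop()
        if r not in found:
            found.add(r)
            stack.extend(covers[r])
    return found

closed = {s for s in support
          if all(support[r] < support[s] for r in supergraphs(s))}
maximal = {s for s in support if not supergraphs(s)}
```

Under these assumptions the computation yields 4 closed and 2 maximal fragments among the 14 frequent ones, and the maximal fragments are a subset of the closed ones.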
Searching for Frequent (Sub)Graphs
Partially Ordered Set of Subgraphs
Hasse diagram ranging from the empty graph to the database graphs.
The subgraph (isomorphism) relationship defines a partial order on subgraphs.
The empty graph is (formally) contained in all subgraphs.
There is usually no (natural) unique largest graph.
[Figure: three example molecules and the Hasse diagram of all their connected subgraphs, ranging from the empty graph (*) at the top to the three database graphs at the bottom.]
Frequent (Sub)Graphs
The frequent (sub)graphs form a partially ordered subset at the top.
Therefore the partially ordered set should be searched top-down.
Standard search strategies: breadth-first and depth-first.
Depth-first search is usually preferable, since the search tree can be very wide.
[Figure: Hasse diagram of the subgraphs of the three example molecules, annotated with support values. For $s_{\min} = 2$ the frequent (sub)graphs (support 2 or 3) form the upper part of the diagram, the infrequent ones (support 1) the lower part.]
Closed and Maximal Frequent (Sub)Graphs
Partially ordered subset of frequent (sub)graphs.
Closed frequent (sub)graphs are encircled.
There are 14 frequent (sub)graphs, but only 4 closed (sub)graphs.
The two closed (sub)graphs at the bottom are also maximal.
[Figure: Hasse diagram of the frequent (sub)graphs of the three example molecules with their support values (3 or 2); the closed (sub)graphs are encircled.]
Basic Search Principle
Grow (sub)graphs into the graphs of the given database.
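Start with a single vertex (seed vertex).
Add an edge (and maybe a vertex) in each step.
Determine the support and prune infrequent (sub)graphs.
Main problem: A (sub)graph can be grown in several different ways.
Growth sequences shown for the branched fragment (C bonded to S, N and O; bond types not shown here):
S, S-C, S-C-O, full fragment
O, C-O, N-C-O, full fragment
C, C-N, S-C-N, full fragment
C, C-N, N-C-O, full fragment
etc. (8 more possibilities)
[Figure: the part of the Hasse diagram that leads from the empty graph (*) to this fragment.]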
Reminder: Searching for Frequent Item Sets
We have to search the partially ordered set $(2^B, \subseteq)$ / its Hasse diagram.
Assigning unique parents turns the Hasse diagram into a tree.
Traversing the resulting tree explores each item set exactly once.
Hasse diagram and a possible tree for five items (both figures contain the same item sets, level by level):
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
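One common way to assign unique parents to item sets is to fix an item order and to let the parent of an item set be the set without its last item. The sketch below is an assumed, minimal illustration of this idea (the function name is made up): a set is then only extended by items that follow its last item, and the depth-first traversal visits every non-empty item set exactly once.

```python
def enumerate_item_sets(items):
    """Depth-first traversal of the tree; parent = item set without its last item."""
    items = sorted(items)
    visited = []

    def recurse(prefix, start):
        # Extend only with items after the last item of the prefix.
        for idx in range(start, len(items)):
            child = prefix + (items[idx],)
            visited.append(child)
            recurse(child, idx + 1)

    recurse((), 0)
    return visited

visited = enumerate_item_sets("abcde")
```

For five items the traversal produces the 31 non-empty item sets without duplicates, starting with a, ab, abc, abcd, abcde.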
Searching for Frequent (Sub)Graphs
We have to search the partially ordered set of (connected) (sub)graphs
ranging from the empty graph to the database graphs.
Assigning unique parents turns the corresponding Hasse diagram into a tree.
Traversing the resulting tree explores each (sub)graph exactly once.
Subgraph Hasse diagram and a possible tree:
[Figure: the Hasse diagram of the subgraphs of the three example molecules (left) and a tree obtained from it by assigning unique parents (right); both contain the same (sub)graphs.]
Searching with Unique Parents
Principle of a Search Algorithm based on Unique Parents:
Base Loop:
Traverse all possible vertex attributes (their unique parent is the empty graph).
Recursively process all vertex attributes that are frequent.
Recursive Processing:
For a given frequent (sub)graph S:
Generate all extensions R of S by an edge or by an edge and a vertex
(if the vertex is not yet in S) for which S is the chosen unique parent.
For all R: if R is frequent, process R recursively, otherwise discard R.
Questions:
How can we formally assign unique parents?
(How) Can we make sure that we generate only those extensions
for which the (sub)graph that is extended is the chosen unique parent?
Assigning Unique Parents
Formally, the set of all possible parents of a (connected) (sub)graph $S$ is
$P(S) = \{R \in \mathcal{C}(S) \mid \neg\exists U \in \mathcal{C}(S): R \sqsubset U \sqsubset S\}$.
In other words, the possible parents of $S$ are its maximal proper subgraphs.
Each possible parent contains exactly one edge less than the (sub)graph $S$.
If we can define an order on the edges of the (sub)graph $S$,
we can easily single out a unique parent, the canonical parent $p_c(S)$:
Let $e^*$ be the last edge in the order that is not a proper bridge
(i.e. either a leaf bridge or no bridge).
The canonical parent $p_c(S)$ is the graph $S$ without the edge $e^*$.
If $e^*$ is a leaf bridge, we also have to remove the created isolated node.
If $e^*$ is the only edge of $S$, we also need an order of the nodes,
so that we can decide which isolated node to remove.
Note: if $S$ is connected, then $p_c(S)$ is connected, as $e^*$ is not a proper bridge.
Assigning Unique Parents
In order to define an order of the edges of a given (sub)graph,
we will rely on a canonical form of (sub)graphs.
Canonical forms for graphs are more complex than canonical forms for item sets
(reminder on next slide), because we have to code the connection structure.
A canonical form of a (sub)graph is a special representation of this (sub)graph.
Each (sub)graph is described by a code word.
It describes the graph structure and the vertex and edge labels
(and thus implicitly orders the edges and vertices).
The (sub)graph can be reconstructed from the code word.
There may be multiple code words that describe the same (sub)graph.
One of the code words is singled out as the canonical code word.
There are two main principles for canonical forms of graphs:
spanning trees and adjacency matrices.
Support Counting
Subgraph Isomorphism Tests
Generate extensions based on global information about edges:
Collect triples of source node label, edge label, and destination node label.
Traverse the (extendable) nodes of a given fragment
and attach edges based on the collected triples.
Traverse database graphs and test whether the generated extension occurs.
(The database graphs may be restricted to those containing the parent.)
Maintain List of Occurrences
Find and record all occurrences of single node graphs.
Check database graphs for extensions of known occurrences.
This immediately yields the occurrences of the extended fragments.
Disadvantage: considerable memory is needed for storing the occurrences.
Advantage: fewer extended fragments and faster support counting.
Canonical Forms of Graphs
Reminder: Canonical Form for Item Sets
An item set is represented by a code word; each letter represents an item.
The code word is a word over the alphabet $A$, the set of all items.
There are $k!$ possible code words for an item set of size $k$,
because the items may be listed in any order.
By introducing an (arbitrary, but fixed) order of the items,
and by comparing code words lexicographically,
we can define an order on these code words.
Example: abc < bac < bca < cab for the item set $\{a, b, c\}$ and $a < b < c$.
The lexicographically smallest code word for an item set
is the canonical code word.
Obviously the canonical code word lists the items in the chosen, fixed order.
In principle, the same general idea can be used for graphs.
However, a global order on the vertex and edge attributes is not enough.
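For item sets the canonical code word is easy to compute, as the following sketch shows. It assumes that items are single characters and that the fixed item order is the natural character order; both are assumptions of this illustration.

```python
from itertools import permutations

def code_words(item_set):
    """All k! code words of an item set, sorted lexicographically."""
    return sorted("".join(p) for p in permutations(item_set))

def canonical_code_word(item_set):
    """The lexicographically smallest code word."""
    return min(code_words(item_set))

words = code_words({"a", "b", "c"})
canonical = canonical_code_word({"a", "b", "c"})
```

Enumerating all permutations is only done here to mirror the definition; sorting the items once gives the same canonical code word directly.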
Canonical Forms of Graphs: General Idea
Construct a code word that uniquely identifies an (attributed or labeled) graph
up to automorphisms (that is, symmetries).
Basic idea: The characters of the code word describe the edges of the graph.
Core problem: Vertex and edge attributes can easily be incorporated into
a code word, but how to describe the connection structure is not so obvious.
The vertices of the graph must be numbered (endowed with unique labels),
because we need to specify the vertices that are incident to an edge.
(Note: vertex labels need not be unique; several nodes may have the same label.)
Each possible numbering of the vertices of the graph yields a code word,
which is the concatenation of the (sorted) edge descriptions ("characters").
(Note that the graph can be reconstructed from such a code word.)
The resulting list of code words is sorted lexicographically.
The lexicographically smallest code word is the canonical code word.
(Alternatively, one may choose the lexicographically greatest code word.)
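The general idea can be tried out by brute force on very small graphs. The sketch below enumerates all vertex numberings, builds the sorted list of edge descriptions for each, and takes the smallest one. The layout of an edge description (source index, destination index, bond, source label, destination label) and the use of plain string comparison for labels are assumptions of this illustration; the spanning tree based forms discussed later restrict the numberings that are considered.

```python
from itertools import permutations

def code_word(labels, edges, numbering):
    """Code word of a labeled graph for one vertex numbering.

    labels[v] is the label of vertex v, edges holds (u, v, bond) triples,
    numbering[v] is the new index assigned to vertex v.
    """
    descriptions = []
    for u, v, bond in edges:
        (s, s_label), (d, d_label) = sorted(
            [(numbering[u], labels[u]), (numbering[v], labels[v])])
        descriptions.append((s, d, bond, s_label, d_label))
    return tuple(sorted(descriptions))

def canonical_code_word(labels, edges):
    """Smallest code word over all n! numberings (feasible for tiny graphs only)."""
    vertices = range(len(labels))
    return min(code_word(labels, edges, perm) for perm in permutations(vertices))

# The chain O-S-C, stored with two different vertex orders, and the chain O-C-S.
g1 = (["O", "S", "C"], [(0, 1, "-"), (1, 2, "-")])
g2 = (["C", "O", "S"], [(1, 2, "-"), (2, 0, "-")])
g3 = (["O", "C", "S"], [(0, 1, "-"), (1, 2, "-")])
same = canonical_code_word(*g1) == canonical_code_word(*g2)
different = canonical_code_word(*g1) != canonical_code_word(*g3)
```

Two descriptions of the same graph receive the same canonical code word, while a structurally different graph receives another one.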
Searching with Canonical Forms
Let $S$ be a (sub)graph and $w_c(S)$ its canonical code word.
Let $e^*(S)$ be the last edge in the edge order induced by $w_c(S)$
(i.e. the order in which the edges are described) that is not a proper bridge.
General Recursive Processing with Canonical Forms:
For a given frequent (sub)graph $S$:
Generate all extensions $R$ of $S$ by a single edge or an edge and a vertex
(if one vertex incident to the edge is not yet part of $S$).
Form the canonical code word $w_c(R)$ of each extended (sub)graph $R$.
If the edge $e^*(R)$ as induced by $w_c(R)$ is the edge added to $S$ to form $R$
and $R$ is frequent, process $R$ recursively, otherwise discard $R$.
Questions:
How can we formally define canonical code words?
Do we have to generate all possible extensions of a frequent (sub)graph?
Canonical Forms: Prefix Property
Suppose the canonical form possesses the prefix property:
Every prefix of a canonical code word is a canonical code word itself.
Then the edge $e^*$ is always the last described edge.
The longest proper prefix of the canonical code word of a (sub)graph $S$
not only describes the canonical parent of $S$, but is its canonical code word.
The general recursive processing scheme with canonical forms requires
to construct the canonical code word of each created (sub)graph
in order to decide whether it has to be processed recursively or not.
We know the canonical code word of any (sub)graph that is processed.
With this code word we know, due to the prefix property, the canonical
code words of all child (sub)graphs that have to be explored in the recursion
with the exception of the last letter (that is, the description of the added edge).
We only have to check whether the code word that results from appending
the description of the added edge to the given canonical code word is canonical.
Searching with the Prefix Property
Principle of a Search Algorithm based on the Prefix Property:
Base Loop:
Traverse all possible vertex attributes, that is,
the canonical code words of single vertex (sub)graphs.
Recursively process each code word that describes a frequent (sub)graph.
Recursive Processing:
For a given (canonical) code word of a frequent (sub)graph:
Generate all possible extensions by an edge (and maybe a vertex).
This is done by appending the edge description to the code word.
Check whether the extended code word is the canonical code word
of the (sub)graph described by the extended code word
(and, of course, whether the described (sub)graph is frequent).
If it is, process the extended code word recursively, otherwise discard it.
The Prefix Property
Advantages of the Prefix Property:
Testing whether a given code word is canonical can be simpler/faster
than constructing a canonical code word from scratch.
The prefix property usually allows us to easily find simple rules
to restrict the extensions that need to be generated.
Disadvantages of the Prefix Property:
One has reduced freedom in the definition of a canonical form.
This can make it impossible to exploit certain properties of a graph
that can help to construct a canonical form quickly.
In the following we consider mainly canonical forms having the prefix property.
However, it will be discussed later how additional graph properties
can be exploited to improve the construction of a canonical form
if the prefix property is not made a requirement.
Canonical Forms based on Spanning Trees
Spanning Trees
A (labeled) graph $G$ is called a tree iff for any pair of vertices in $G$
there exists exactly one path connecting them in $G$.
A spanning tree of a (labeled) connected graph $G$ is a subgraph $S$ of $G$ that
is a tree and
comprises all vertices of $G$ (that is, $V_S = V_G$).
Examples of spanning trees:
[Figure: an example molecule with two rings (containing O, F and N atoms) and several of its spanning trees.]
There are $1 \cdot 9 + 5 \cdot 4 = 6 \cdot 5 - 1 = 29$ possible spanning trees for this example,
because both rings have to be cut open.
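The count of 29 can be reproduced by brute force. The sketch below assumes, based on the arithmetic above, that the ring skeleton consists of a six-membered and a five-membered ring sharing one edge (9 ring vertices, 10 ring edges); atoms attached outside the rings are left out, since their bonds belong to every spanning tree and do not change the count. The vertex numbers are arbitrary.

```python
from itertools import combinations

def count_spanning_trees(n_vertices, edges):
    """Count edge subsets of size n-1 that contain no cycle (i.e. spanning trees)."""
    count = 0
    for subset in combinations(edges, n_vertices - 1):
        parent = list(range(n_vertices))   # union-find forest

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        acyclic = True
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:                   # edge would close a cycle
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:
            count += 1
    return count

# Assumed skeleton: six-ring 0..5 and five-ring 0,1,6,7,8 sharing the edge (0, 1).
six_ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
five_ring_rest = [(1, 6), (6, 7), (7, 8), (8, 0)]
n_trees = count_spanning_trees(9, six_ring + five_ring_rest)
```

An acyclic set of n-1 edges on n vertices is necessarily connected, so the cycle test alone identifies the spanning trees.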
Canonical Forms based on Spanning Trees
A code word describing a graph can be constructed by
systematically constructing a spanning tree of the graph,
numbering the vertices in the order in which they are visited,
describing each edge by the numbers of the vertices it connects,
the edge label, and the labels of the incident vertices, and
listing the edge descriptions in the order in which the edges are visited.
(Edges closing cycles may need special treatment.)
The most common ways of constructing a spanning tree are:
depth-first search: gSpan [Yan and Han 2002]
breadth-first search: MoSS/MoFa [Borgelt and Berthold 2002]
An alternative way is to visit all children of a vertex before proceeding
in a depth-first manner (can be seen as a variant of depth-first search).
Other systematic search schemes are, in principle, also applicable.
Canonical Forms based on Spanning Trees
Each starting point (choice of a root) and each way to build a spanning tree
systematically from a given starting point yields a different code word.
[Figure: the example molecule with two rings and several of its spanning trees.]
There are 12 possible starting points and several branching points.
As a consequence, there are several hundred possible code words.
The lexicographically smallest code word is the canonical code word.
Since the edges are listed in the order in which they are visited during the
spanning tree construction, this canonical form has the prefix property:
If a prefix of a canonical code word were not canonical, there would be
a starting point and a spanning tree that yield a smaller code word.
(Use the canonical code word of the prefix graph and append the missing edge.)
Canonical Forms based on Spanning Trees
An edge description consists of
the indices of the source and the destination vertex
(definition: the source of an edge is the vertex with the smaller index),
the attributes of the source and the destination vertex,
the edge attribute.
Listing the edges in the order in which they are visited can often be characterized
by a precedence order on the describing elements of an edge.
Order of individual elements (conjectures, but supported by experiments):
Vertex and edge attributes should be sorted according to their frequency.
Ascending order seems to be recommendable for the vertex attributes.
Simplification: The source attribute is needed only for the first edge
and thus can be split off from the list of edge descriptions.
Canonical Forms: Edge Sorting Criteria
Precedence Order for Depth-first Search:
destination vertex index (ascending)
source vertex index (descending)
edge attribute (ascending)
destination vertex attribute (ascending)
Precedence Order for Breadth-first Search:
source vertex index (ascending)
edge attribute (ascending)
destination vertex attribute (ascending)
destination vertex index (ascending)
Edges Closing Cycles:
Edges closing cycles may be distinguished from spanning tree edges,
giving spanning tree edges absolute precedence over edges closing cycles.
Alternative: Sort between the other edges based on the precedence rules.
Canonical Forms: Code Words
From the described procedure the following code words result
(regular expressions with non-terminal symbols):
Depth-First Search: $a\,(i_d\,\underline{i_s}\,b\,a)^m$
Breadth-First Search: $a\,(i_s\,b\,a\,i_d)^m$ (or $a\,(i_s\,i_d\,b\,a)^m$)
where
$n$ is the number of vertices of the graph,
$m$ is the number of edges of the graph,
$i_s$ is the index of the source vertex of an edge, $i_s \in \{0, \ldots, n-1\}$,
$i_d$ is the index of the destination vertex of an edge, $i_d \in \{0, \ldots, n-1\}$,
$a$ is the attribute of a vertex,
$b$ is the attribute of an edge.
The order of the elements describing an edge reflects the precedence order.
That $i_s$ in the depth-first search expression is underlined is meant as a reminder
that the edge descriptions have to be sorted descendingly w.r.t. this value.
Canonical Forms: A Simple Example
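[Figure: an example molecule with its vertices numbered by a depth-first search (A) and by a breadth-first search (B).]
Vertex numbering A (depth-first): 0: S, 1: N, 2: O, 3: C, 4: C, 5: O, 6: O, 7: C, 8: C
Vertex numbering B (breadth-first): 0: S, 1: N, 2: C, 3: O, 4: C, 5: C, 6: C, 7: O, 8: O
Order of Elements: S ≺ N ≺ O ≺ C    Order of Bonds: - ≺ =
Code Words:
A: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C
B: S 0-N1 0-C2 1-O3 1-C4 2-C5 4-C5 4-C6 6-O7 6=O8
(Reminder: in A the edges are sorted descendingly w.r.t. the second entry.)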
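Given a vertex numbering, both code words follow mechanically from the precedence orders for depth-first and breadth-first search. The sketch below takes the two numberings of the example molecule as input; the edge lists are read off the code words A and B, and the function names, the rank tables and the compact string format (which only works for single-digit indices) are assumptions of this illustration.

```python
ELEMENT_RANK = {"S": 0, "N": 1, "O": 2, "C": 3}   # order of elements
BOND_RANK = {"-": 0, "=": 1}                      # order of bonds

def dfs_code_word(labels, edges):
    """Edges are (source, dest, bond) with source < dest; format: dest source bond label."""
    ordered = sorted(edges, key=lambda e: (e[1], -e[0], BOND_RANK[e[2]],
                                           ELEMENT_RANK[labels[e[1]]]))
    return " ".join([labels[0]] + [f"{d}{s}{b}{labels[d]}" for s, d, b in ordered])

def bfs_code_word(labels, edges):
    """Format of an edge description: source bond label dest."""
    ordered = sorted(edges, key=lambda e: (e[0], BOND_RANK[e[2]],
                                           ELEMENT_RANK[labels[e[1]]], e[1]))
    return " ".join([labels[0]] + [f"{s}{b}{labels[d]}{d}" for s, d, b in ordered])

# Numbering A (depth-first) and numbering B (breadth-first) of the same molecule.
labels_a = ["S", "N", "O", "C", "C", "O", "O", "C", "C"]
edges_a = [(0, 8, "-"), (0, 1, "-"), (1, 2, "-"), (1, 3, "-"), (3, 4, "-"),
           (4, 5, "-"), (4, 6, "="), (3, 7, "-"), (7, 8, "-")]
labels_b = ["S", "N", "C", "O", "C", "C", "C", "O", "O"]
edges_b = [(6, 8, "="), (0, 1, "-"), (0, 2, "-"), (1, 3, "-"), (1, 4, "-"),
           (2, 5, "-"), (4, 5, "-"), (4, 6, "-"), (6, 7, "-")]

word_a = dfs_code_word(labels_a, edges_a)
word_b = bfs_code_word(labels_b, edges_b)
```

Note how the descending sort on the source index places the edge 87-C before the cycle-closing edge 80-C in code word A.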
Checking for Canonical Form: Compare Prefixes
Base Loop:
Traverse all vertices with a label no less than the current root vertex
(first character of the code word; possible roots of spanning trees).
Recursive Processing:
The recursive processing constructs alternative spanning trees and
compares the code words resulting from them with the code word to check.
In each recursion step one edge is added to the spanning tree and its description
is compared to the corresponding one in the code word to check.
If the new edge description is larger, the edge can be skipped
(new code word is lexicographically larger).
If the new edge description is smaller, the code word is not canonical
(new code word is lexicographically smaller).
If the new edge description is equal, the rest of the code word
is processed recursively (code word prefixes are equal).
Checking for Canonical Form
function isCanonical (w: array of int, G: graph) : boolean;
var v : vertex;                        (* to traverse the vertices of the graph *)
    e : edge;                          (* to traverse the edges of the graph *)
    x : array of vertex;               (* to collect the numbered vertices *)
begin
  forall v in G.V do v.i := -1;        (* clear the vertex indices *)
  forall e in G.E do e.i := -1;        (* clear the edge markers *)
  forall v in G.V do begin             (* traverse the potential root vertices *)
    if v.a < w[0] then return false;   (* if v has a smaller label, abort *)
    if v.a = w[0] then begin           (* if v has the same label, check rest *)
      v.i := 0; x[0] := v;             (* number and record the root vertex *)
      if not rec(w, 1, x, 1, 0)        (* check the code word recursively and *)
      then return false;               (* abort if a smaller code word is found *)
      v.i := -1;                       (* clear the vertex index again *)
    end;
  end;
  return true;                         (* the code word is canonical *)
end (* isCanonical *)                  (* for a breadth-first search spanning tree *)
Checking for Canonical Form
function rec (w: array of int, k: int, x: array of vertex, n: int, i: int) : boolean;
                                       (* w: code word to be tested *)
                                       (* k: current position in code word *)
                                       (* x: array of already labeled/numbered vertices *)
                                       (* n: number of labeled/numbered vertices *)
                                       (* i: index of next extendable vertex to check; i < n *)
var d : vertex;                        (* vertex at the other end of an edge *)
    j : int;                           (* index of destination vertex *)
    u : boolean;                       (* flag for unnumbered destination vertex *)
    r : boolean;                       (* buffer for a recursion result *)
begin
  if k >= length(w) return true;       (* full code word has been generated *)
  while i < w[k] do begin              (* check whether there is an edge with *)
    forall e incident to x[i] do       (* a source vertex having a smaller index *)
      if e.i < 0 then return false;
    i := i + 1;                        (* if there is an unmarked edge, abort, *)
  end;                                 (* otherwise go to the next vertex *)
  ...
Checking for Canonical Form
  ...
  forall e incident to x[i] (in sorted order) do begin
    if e.i < 0 then begin                  (* traverse the unvisited incident edges *)
      if e.a < w[k+1] then return false;   (* check the *)
      if e.a > w[k+1] then return true;    (* edge attribute *)
      d := vertex incident to e other than x[i];
      if d.a < w[k+2] then return false;   (* check destination *)
      if d.a > w[k+2] then return true;    (* vertex attribute *)
      if d.i < 0 then j := n else j := d.i;
      if j < w[k+3] then return false;     (* check destination vertex index *)
      [...]                                (* check rest of code word recursively, *)
                                           (* because prefixes are equal *)
    end;
  end;
  return true;                             (* return that no smaller code word *)
end (* rec *)                              (* than w could be found *)
Checking for Canonical Form
  ...
  forall e incident to x[i] (in sorted order) do begin
    if e.i < 0 then begin                  (* traverse the unvisited incident edges *)
      [...]                                (* check the current edge *)
      if j = w[k+3] then begin             (* if edge descriptions are equal *)
        e.i := 1; u := d.i < 0;            (* mark edge and number vertex *)
        if u then begin d.i := j; x[n] := d; n := n + 1; end
        r := rec(w, k+4, x, n, i);         (* check recursively *)
        if u then begin d.i := -1; n := n - 1; end
        e.i := -1;                         (* unmark edge (and vertex) again *)
        if not r then return false;        (* evaluate the recursion result: *)
      end;                                 (* abort if a smaller code word was found *)
    end;
  end;
  return true;                             (* return that no smaller code word *)
end (* rec *)                              (* than w could be found *)
Restricted Extensions
Canonical Forms: Restricted Extensions
Principle of the Search Algorithm up to now:
Generate all possible extensions of a given canonical code word
by the description of an edge that extends the described (sub)graph.
Check whether the extended code word is canonical (and the (sub)graph frequent).
If it is, process the extended code word recursively, otherwise discard it.
Straightforward Improvement:
For some extensions of a given canonical code word it is easy to see
that they will not be canonical themselves.
The trick is to check whether a spanning tree rooted at the same vertex
yields a code word that is smaller than the created extended code word.
This immediately rules out edges attached to certain vertices in the (sub)graph
(only certain vertices are extendable, that is, can be incident to a new edge)
as well as certain edges closing cycles.
Canonical Forms: Restricted Extensions
Depth-First Search: Rightmost Path Extension
Extendable Vertices:
Only vertices on the rightmost path of the spanning tree may be extended.
If the source vertex of the new edge is not a leaf, the edge description
must not precede the description of the downward edge on the path.
(That is, the edge attribute must be no less than the edge attribute of the
downward edge, and if it is equal, the attribute of its destination vertex must
be no less than the attribute of the downward edge's destination vertex.)
Edges Closing Cycles:
Edges closing cycles must start at an extendable vertex.
They must lead to the rightmost leaf (vertex at end of rightmost path).
The index of the source vertex must precede the index of the source vertex
of any edge already incident to the rightmost leaf.
Canonical Forms: Restricted Extensions
Breadth-First Search: Maximum Source Extension
Extendable Vertices:
Only vertices having an index no less than the maximum source index
of edges that are already in the (sub)graph may be extended.
If the source of the new edge is the one having the maximum source index,
it may be extended only by edges whose descriptions do not precede
the description of any downward edge already incident to this vertex.
(That is, the edge attribute must be no less, and if it is equal,
the attribute of the destination vertex must be no less.)
Edges Closing Cycles:
Edges closing cycles must start at an extendable vertex.
They must lead forward,
that is, to a vertex having a larger index than the extended vertex.
Restricted Extensions: A Simple Example
[Figure: example molecule with two vertex numberings.
A (depth-first): S 0, N 1, O 2, C 3, C 4, O 5, O 6, C 7, C 8
B (breadth-first): S 0, N 1, C 2, O 3, C 4, C 5, C 6, O 7, O 8]
Extendable Vertices:
A: vertices on the rightmost path, that is, 0, 1, 3, 7, 8.
B: vertices with an index no smaller than the maximum source, that is, 6, 7, 8.
Edges Closing Cycles:
A: none, because the existing cycle edge has the smallest possible source.
B: the edge between the vertices 7 and 8.

Restricted Extensions: A Simple Example
[Figure: the example molecule with vertex numberings A (depth-first) and B (breadth-first), repeated from the previous slide]
If other vertices are extended, a tree with the same root yields a smaller code word.
Example: attach a single bond to a carbon atom at the leftmost oxygen atom.
A: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C 92-C
   S 10-N 21-O 32-C ...
B: S 0-N1 0-C2 1-O3 1-C4 2-C5 4-C5 4-C6 6-O7 6=O8 3-C9
   S 0-N1 0-C2 1-O3 1-C4 2-C5 3-C6 ...

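The two rules for extendable vertices can be illustrated with a minimal Python sketch. It assumes that a fragment is given only by the (source, destination) index pairs read off the code words of the example molecule; the function names and the data layout are hypothetical and not taken from any actual implementation.

```python
def rightmost_path(tree_edges):
    """Vertices on the rightmost path of a depth-first spanning tree.

    tree_edges: (source, destination) pairs of the spanning tree edges;
    every destination is a newly numbered vertex.
    """
    parent = {dst: src for src, dst in tree_edges}
    vertex = max(parent, default=0)   # vertex numbered last = rightmost leaf
    path = [vertex]
    while vertex in parent:           # climb up to the root
        vertex = parent[vertex]
        path.append(vertex)
    return sorted(path)


def max_source_vertices(edges, n_vertices):
    """Extendable vertices for maximum source extension (breadth-first form)."""
    max_source = max((src for src, _ in edges), default=0)
    return list(range(max_source, n_vertices))


# Numbering A: spanning tree edges of the depth-first code word
# (the cycle edge between vertices 8 and 0 is not a tree edge).
tree_a = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (4, 6), (3, 7), (7, 8)]
# Numbering B: all edges of the breadth-first code word.
edges_b = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (4, 5), (4, 6), (6, 7), (6, 8)]

print(rightmost_path(tree_a))            # [0, 1, 3, 7, 8]
print(max_source_vertices(edges_b, 9))   # [6, 7, 8]
```

The breadth-first variant needs only one number (the maximum source index), whereas the depth-first variant has to walk the parent links of the spanning tree.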
Canonical Forms: Restricted Extensions
The rules underlying restricted extensions provide a one-sided answer
to the question whether an extension yields a canonical code word.
Depth-first search canonical form
If the extension edge is not a rightmost path extension,
then the resulting code word is certainly not canonical.
If the extension edge is a rightmost path extension,
then the resulting code word may or may not be canonical.
Breadth-first search canonical form
If the extension edge is not a maximum source extension,
then the resulting code word is certainly not canonical.
If the extension edge is a maximum source extension,
then the resulting code word may or may not be canonical.
As a consequence, a canonical form test is still necessary.

Example Search Tree
Start with a single vertex (seed vertex).
Add an edge (and maybe a vertex) in each step (restricted extensions).
Determine the support and prune infrequent (sub)graphs.
Check for canonical form and prune (sub)graphs with non-canonical code words.
[Figure: three example molecules and the search tree for seed S, annotated with support values; breadth-first search canonical form with S ≺ F ≺ N ≺ C ≺ O and - ≺ =]

Searching without a Seed Atom
[Figure: search tree without a seed atom (root *) for the example molecules cyclin, cystein, and serin, annotated with support values; breadth-first search canonical form with S ≺ N ≺ O ≺ C and - ≺ =]
Chemical elements processed on the left are excluded on the right.

Comparison of Canonical Forms
(depth-first versus breadth-first spanning tree construction)

Canonical Forms: Comparison
Depth-First vs. Breadth-First Search Canonical Form
With breadth-first search canonical form the extendable vertices
are much easier to traverse, as they always have consecutive indices:
One only has to store and update one number, namely the index
of the maximum edge source, to describe the vertex range.
Also the check for canonical form is slightly more complex (to program)
for depth-first search canonical form.
The two canonical forms obviously lead to different branching factors,
widths and depths of the search tree.
However, it is not immediately clear which form leads to the better
(more efficient) structure of the search tree.
The experimental results reported in the following indicate that it may depend
on the data set which canonical form performs better.

Advantage for Maximum Source Extensions
Generate all substructures (that contain nitrogen) of the example molecule.
[Figure: example molecule in which two branches emanate from the nitrogen atom]
Problem: The two branches emanating from the nitrogen atom start identically.
Thus rightmost path extensions try the right branch over and over again.
Search Trees with N ≺ O ≺ C
[Figure: search trees for maximum source extension (non-canonical: 3) and rightmost path extension (non-canonical: 6)]

Advantage for Rightmost Path Extensions
Generate all substructures (that contain nitrogen) of the example molecule.
[Figure: example molecule, a ring that contains the nitrogen atom and carbon atoms]
Problem: The ring of carbon atoms can be closed between any two branches
(three ways of building the fragment, only one of which is canonical).
Search Trees with N ≺ C
[Figure: search trees for maximum source extension (non-canonical: 3) and rightmost path extension (non-canonical: 1)]

Experiments: Data Sets
Index Chemicus Subset of 1993
1293 molecules / 34431 atoms / 36594 bonds
Frequent fragments down to fairly low support values are trees (no rings).
Medium number of fragments and closed fragments.
Steroids
17 molecules / 401 atoms / 456 bonds
A large part of the frequent fragments contain one or more rings.
Huge number of fragments, still large number of closed fragments.

Steroids Data Set
[Figure: structural formulas of the molecules in the steroids data set]

Experiments: IC93 Data Set
[Figure: three plots over the minimal support (3 to 6 percent), each comparing breadth-first and depth-first: fragments/10^4 (generated and processed), occurrences/10^6, and time/seconds]
Experimental results on the IC93 data. The horizontal axis shows the minimal
support in percent. The curves show the number of generated and processed
fragments (top left), number of processed occurrences (top right), and the
execution time in seconds (bottom left) for the two canonical forms/extension strategies.

Experiments: Steroids Data Set
[Figure: three plots over the absolute minimal support (2 to 8), each comparing breadth-first and depth-first: fragments/10^5 (generated and processed), occurrences/10^6, and time/seconds]
Experimental results on the steroids data. The horizontal axis shows the absolute
minimal support. The curves show the number of generated and processed
fragments (top left), number of processed occurrences (top right), and the
execution time in seconds (bottom left) for the two canonical forms/extension strategies.

Equivalent Sibling Pruning
Alternative Test: Equivalent Siblings
Basic Idea:
If the (sub)graph to extend exhibits a certain symmetry, several extensions
may be equivalent (in the sense that they describe the same (sub)graph).
At most one of these sibling extensions can be in canonical form, namely
the one least restricting future extensions (lex. smallest code word).
Identify equivalent siblings and keep only the maximally extendable one.
Test Procedure for Equivalence:
Get any graph in which two sibling (sub)graphs to compare occur.
(If there is no such graph, the siblings are not equivalent.)
Mark any occurrence of the first (sub)graph in the graph.
Traverse all occurrences of the second (sub)graph in the graph
and check whether all edges of an occurrence are marked.
If there is such an occurrence, the two (sub)graphs are equivalent.

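The marking test can be sketched as follows. This is a minimal illustration under the assumption that the occurrences of both siblings in one common database graph are already available as lists of edges; the function names and the representation are hypothetical.

```python
def edge_key(u, v):
    """Orientation-independent identifier of an undirected edge."""
    return (u, v) if u <= v else (v, u)


def siblings_equivalent(occurrences_first, occurrences_second):
    """Marking test for two sibling (sub)graphs inside one database graph.

    Each occurrence is a list of edges (u, v) of the database graph.
    Returns True if some occurrence of the second sibling consists only
    of edges marked by one occurrence of the first sibling.
    """
    if not occurrences_first or not occurrences_second:
        return False                       # no common graph: not equivalent
    # mark any (here: the first) occurrence of the first sibling
    marked = {edge_key(u, v) for u, v in occurrences_first[0]}
    # traverse all occurrences of the second sibling
    return any(all(edge_key(u, v) in marked for u, v in occurrence)
               for occurrence in occurrences_second)


# Six-membered ring 0..5 with substituents at vertices 0 and 1.
ring = [(i, (i + 1) % 6) for i in range(6)]
first = [ring + [(0, 6)]]
second = [[(v, u) for u, v in ring] + [(6, 0)], ring + [(1, 7)]]
print(siblings_equivalent(first, second))              # True
print(siblings_equivalent(first, [ring + [(1, 7)]]))   # False
```

Since siblings have the same number of edges, an occurrence whose edges are all marked covers exactly the marked occurrence.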
Alternative Test: Equivalent Siblings
If siblings in the search tree are equivalent,
only the one with the least restrictions needs to be processed.
Example: Mining phenol, p-cresol, and catechol.
[Figure: structural formulas of phenol, p-cresol, and catechol]
Consider extensions of a 6-bond carbon ring (twelve possible occurrences):
[Figure: four occurrences of the carbon ring extended by an oxygen atom, with different numberings 0 to 5 of the ring atoms]
Only the (sub)graph that least restricts future extensions
(i.e., that has the lexicographically smallest code word) can be in canonical form.
Use depth-first canonical form (rightmost path extensions) and C ≺ O.

Alternative Test: Equivalent Siblings
Test for Equivalent Siblings before Test for Canonical Form
Traverse the sibling extensions and compare each pair.
Of two equivalent siblings remove the one
that restricts future extensions more.
Advantages:
Identifies some code words that are non-canonical in a simple way.
Test of two siblings is at most linear in the number of edges
and at most linear in the number of occurrences.
Disadvantages:
Does not identify all non-canonical code words,
therefore a subsequent canonical form test is still needed.
Compares two sibling (sub)graphs,
therefore it is quadratic in the number of siblings.

Alternative Test: Equivalent Siblings
The effectiveness of equivalent sibling pruning depends on the canonical form:

Mining the IC93 data with 4% minimal support
                               depth-first       breadth-first
equivalent sibling pruning       156 ( 1.9%)      4195 (83.7%)
canonical form pruning          7988 (98.1%)       815 (16.3%)
total pruning                   8144              5010
(closed) (sub)graphs found      2002              2002

Mining the steroids data with minimal support 6
                               depth-first       breadth-first
equivalent sibling pruning     15327 ( 7.2%)    152562 (54.6%)
canonical form pruning        197449 (92.8%)    127026 (45.4%)
total pruning                 212776            279588
(closed) (sub)graphs found      1420              1420

Alternative Test: Equivalent Siblings
Observations:
Depth-first form generates more duplicate (sub)graphs on the IC93 data
and fewer duplicate (sub)graphs on the steroids data (as seen before).
There are only very few equivalent siblings with depth-first form
on both the IC93 data and the steroids data.
(Conjecture: equivalent siblings result from "rotated" tree branches,
which are less likely to be siblings with depth-first form.)
With breadth-first search canonical form a large part of the (sub)graphs
that are not generated in canonical form (with a canonical code word)
can be filtered out with equivalent sibling pruning.
On the test IC93 data no difference in speed could be observed,
presumably because pruning takes only a small part of the total time.
On the steroids data, however, equivalent sibling pruning
yields a slight speed-up for breadth-first form (about 5%).

Canonical Forms based on Adjacency Matrices
Adjacency Matrices
A (normal, that is, unlabeled) graph can be described by an adjacency matrix:
A graph $G$ with $n$ vertices is described by an $n \times n$ matrix $A = (a_{ij})$.
Given a numbering of the vertices (from 1 to $n$), each vertex is associated
with the row and column corresponding to its number.
A matrix element $a_{ij}$ is 1 if there exists an edge between the vertices
with numbers $i$ and $j$ and 0 otherwise.
Adjacency matrices are not unique:
Different numberings of the vertices lead to different adjacency matrices.
[Figure: the same graph with two different numberings of its five vertices]
First numbering:
     1  2  3  4  5
1    0  1  0  1  0
2    1  0  1  1  0
3    0  1  0  1  1
4    1  1  1  0  0
5    0  0  1  0  0
Second numbering:
     1  2  3  4  5
1    0  1  0  0  0
2    1  0  1  1  0
3    0  1  0  1  1
4    0  1  1  0  1
5    0  0  1  1  0

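That the matrix depends on the numbering can be checked with a short Python sketch. The vertex names a to e are hypothetical; the edge list is read off the first matrix, and the second numbering is the relabeling shown in the figure.

```python
def adjacency_matrix(edges, numbering):
    """Adjacency matrix of an undirected graph for a given vertex numbering.

    edges: pairs of vertex names; numbering: vertex name -> number in 1..n.
    """
    n = len(numbering)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = numbering[u] - 1, numbering[v] - 1
        matrix[i][j] = matrix[j][i] = 1      # undirected graph: symmetric
    return matrix


edges = [("a", "b"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"), ("c", "e")]
first = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
second = {"a": 5, "b": 4, "c": 2, "d": 3, "e": 1}

for row in adjacency_matrix(edges, first):
    print(*row)
print()
for row in adjacency_matrix(edges, second):
    print(*row)
```

Both matrices describe the same graph, which is why a canonical form has to single out one numbering.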
Extended Adjacency Matrices
A labeled graph can be described by an extended adjacency matrix:
If there is an edge between the vertices with numbers $i$ and $j$,
the matrix element $a_{ij}$ contains the label of this edge
and a special label (the empty label) otherwise.
There is an additional column containing the vertex labels.
Of course, extended adjacency matrices are also not unique:
[Figure: example molecule with two different vertex numberings and the corresponding extended adjacency matrices; the vertex label columns are S N C O C C C O O (first numbering) and C N C C S C O O O (second numbering)]

From Adjacency Matrices to Code Words
An (extended) adjacency matrix can be turned into a code word
by simply listing its elements row by row.
Since for undirected graphs the adjacency matrix is necessarily symmetric,
it suffices to list the elements of the upper (or lower) triangle.
For sparse graphs (few edges) listing only column/label pairs can be advantageous,
because this reduces the code word length.
[Figure: example molecule with vertex numbering and its extended adjacency matrix; vertex labels 1 S, 2 N, 3 C, 4 O, 5 C, 6 C, 7 C, 8 O, 9 O]
Regular expression (non-terminals): $(a \, (i_c \, b)^{*})^{n}$
Code word (one row per vertex):
S 2 - 3 -
N 4 - 5 -
C 6 -
O
C 6 - 7 -
C
C 8 - 9 =
O
O

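The row-wise listing of the upper triangle with column/label pairs can be sketched in a few lines of Python. The data layout (a label list and a dictionary of edges) is an assumption made for this illustration; the edges are those encoded in the code word of the example molecule.

```python
def code_word(labels, edges):
    """Row-wise upper-triangle code word built from column/label pairs.

    labels: vertex labels, position k holds the label of vertex k + 1.
    edges:  dict mapping (i, j) with i < j (1-based) to the edge label.
    """
    tokens = []
    for i, label in enumerate(labels, start=1):
        tokens.append(label)                       # vertex label of row i
        for j in range(i + 1, len(labels) + 1):    # upper triangle only
            if (i, j) in edges:
                tokens += [str(j), edges[(i, j)]]  # column index, edge label
    return " ".join(tokens)


labels = ["S", "N", "C", "O", "C", "C", "C", "O", "O"]
edges = {(1, 2): "-", (1, 3): "-", (2, 4): "-", (2, 5): "-", (3, 6): "-",
         (5, 6): "-", (5, 7): "-", (7, 8): "-", (7, 9): "="}
print(code_word(labels, edges))
# S 2 - 3 - N 4 - 5 - C 6 - O C 6 - 7 - C C 8 - 9 = O O
```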
From Adjacency Matrices to Code Words
With an (arbitrary, but fixed) order on the label set $A$ (and defining that
integer numbers, which are ordered in the usual way, precede all labels),
code words can be compared lexicographically (S ≺ N ≺ O ≺ C; - ≺ =):
[Figure: the example molecule with two different vertex numberings]
S 2 - 3 - N 4 - 5 - C 6 - O C 6 - 7 - C C 8 - 9 = O O
<
C 2 - 3 - 4 - N 5 - 7 - C 8 - 9 = C 6 - S 6 - C O O O
As for canonical forms based on spanning trees, we then define the lexicographically
smallest (or largest) code word as the canonical code word.
Note that adjacency matrices allow for a much larger number of code words,
because any numbering of the vertices is acceptable.
For canonical forms based on spanning trees, the vertex numbering
must be compatible with a (specific) construction of a spanning tree.

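The lexicographic comparison can be made concrete with a small Python sketch. It assumes that code words are given as space-separated token strings and that vertex labels precede edge labels in the label order; the latter is an assumption of this sketch, since the slide only fixes the order within each group.

```python
LABEL_ORDER = ["S", "N", "O", "C", "-", "="]   # S ≺ N ≺ O ≺ C, - ≺ =


def token_key(token):
    """Integers precede all labels; labels are ranked by LABEL_ORDER."""
    if token.isdigit():
        return (0, int(token))
    return (1, LABEL_ORDER.index(token))


def code_key(code):
    """Sort key that realizes the lexicographic order on code words."""
    return [token_key(token) for token in code.split()]


first = "S 2 - 3 - N 4 - 5 - C 6 - O C 6 - 7 - C C 8 - 9 = O O"
second = "C 2 - 3 - 4 - N 5 - 7 - C 8 - 9 = C 6 - S 6 - C O O O"
print(code_key(first) < code_key(second))      # True
print(min([second, first], key=code_key))      # prints the first code word
```

Taking the minimum over the code words of all vertex numberings would yield the canonical code word, which is feasible only for very small graphs.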
From Adjacency Matrices to Code Words
There is a variety of other ways in which an adjacency matrix
may be turned into a code word:
[Figure: example molecule with vertex numbering and its extended adjacency matrix; vertex labels 1 S, 2 N, 3 C, 4 O, 5 C, 6 C, 7 C, 8 O, 9 O]
lower triangle:
S
N 1 -
C 1 -
O 2 -
C 2 -
C 3 - 5 -
C 5 -
O 7 -
O 7 =
columnwise:
S N C O C C C O O
| 1 -
| 1 -
| 2 -
| 2 -
| 3 - 5 -
| 5 -
| 7 -
| 7 =
(Note that the columnwise listing needs a separator character |.)
However, the rowwise listing restricted to the upper triangle (as used before)
has the advantage that it has a property analogous to the prefix property.
In contrast to this, the two forms shown above do not have this property.

Exploiting Vertex Signatures
Canonical Form and Vertex and Edge Labels
Vertex and edge labels help considerably to construct a canonical code word
or to check whether a given code word is canonical:
Canonical form check or construction are usually (much) slower/more difficult
for unlabeled graphs or graphs with few different vertex and edge labels.
The reason is that with vertex and edge labels constructed code word prefixes
may already allow us to make a decision between (sets of) code words.
Intuitive explanation with an extreme example:
Suppose that all vertices of a given (sub)graph have different labels. Then:
The root/first row vertex is uniquely determined:
it is the vertex with the smallest label (w.r.t. the chosen order).
The order of each vertex's neighbors in the canonical form is determined
at least by the vertex labels (but maybe also by the edge labels).
As a consequence, constructing the canonical code word is straightforward.

Canonical Form and Vertex and Edge Labels
The complexity of constructing a canonical code word is caused by equal edge and
vertex labels, which make it necessary to apply a backtracking algorithm.
Question: Can we exploit graph properties (that is, the connection structure)
to distinguish vertices/edges with the same label?
Idea: Describe how the (sub)graph under consideration "looks from a vertex".
This can be achieved by constructing a "local code word" (vertex signature):
Start with the label of the vertex.
If there is more than one vertex with a certain label,
add a (sorted) list of the labels of the incident edges.
If there is more than one vertex with the same list,
add a (sorted) list of the lists of the adjacent vertices.
Continue with the vertices that are two edges away and so on.

Constructing Vertex Signatures
The process of constructing vertex signatures is best described
as an iterative subdivision of equivalence classes:
The initial signature of each vertex is simply its label.
The vertex set is split into equivalence classes
based on the initial vertex signature (that is, the vertex labels).
Equivalence classes with more than one vertex are then processed
by appending the (sorted) labels of the incident edges to the vertex signature.
The vertex set is then repartitioned based on the extended vertex signature.
In a second step the (sorted) signatures of the adjacent vertices are appended.
In subsequent steps these signatures of adjacent vertices are replaced
by the updated vertex signatures.
The process stops when no replacement splits an equivalence class.

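The iterative subdivision can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: signatures are represented as nested tuples, the function name is hypothetical, and the example graph uses the edges encoded in the adjacency matrix code word of the example molecule.

```python
from collections import Counter


def vertex_signatures(labels, edges):
    """Refine vertex signatures until no equivalence class is split.

    labels: dict vertex -> vertex label
    edges:  dict (u, v) -> edge label (undirected, each edge listed once)
    """
    incident = {v: [] for v in labels}
    for (u, v), bond in edges.items():
        incident[u].append((bond, v))
        incident[v].append((bond, u))

    def ambiguous(sig):
        """Vertices that share their signature with another vertex."""
        counts = Counter(sig.values())
        return [v for v in sig if counts[sig[v]] > 1]

    # initial signature: the vertex label
    sig = {v: (labels[v],) for v in labels}
    # first extension: sorted labels of the incident edges
    for v in ambiguous(sig):
        sig[v] += (tuple(sorted(bond for bond, _ in incident[v])),)
    base = dict(sig)
    # further steps: append (and later replace) signatures of adjacent vertices
    while True:
        new = dict(sig)
        for v in ambiguous(sig):
            neighbor_sigs = sorted((sig[u] for _, u in incident[v]), key=repr)
            new[v] = base[v] + (tuple(neighbor_sigs),)
        if len(set(new.values())) == len(set(sig.values())):
            return sig                     # no class was split: stop
        sig = new


labels = {1: "S", 2: "N", 3: "C", 4: "O", 5: "C", 6: "C", 7: "C", 8: "O", 9: "O"}
edges = {(1, 2): "-", (1, 3): "-", (2, 4): "-", (2, 5): "-", (3, 6): "-",
         (5, 6): "-", (5, 7): "-", (7, 8): "-", (7, 9): "="}
signatures = vertex_signatures(labels, edges)
print(len(set(signatures.values())))       # 9: every vertex is unique
```

For a graph with non-trivial symmetry, such as an unlabeled triangle, the loop stops immediately and all vertices keep the same signature.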
Constructing Vertex Signatures
[Figure: example molecule with vertex numbering]
vertex  signature
1       S
2       N
4       O
8       O
9       O
3       C
6       C
5       C
7       C
Vertex Signatures, Step 1
The initial vertex signatures are simply the vertex labels.
There are four equivalence classes: S, N, O, and C.
The equivalence classes S and N need no further processing,
because they already contain only a single vertex.
However, the vertex signatures O and C need to be extended in order to split
the corresponding equivalence classes.

Constructing Vertex Signatures
[Figure: example molecule with vertex numbering]
vertex  signature
1       S
2       N
4       O -
8       O -
9       O =
3       C --
6       C --
5       C ---
7       C --=
Vertex Signatures, Step 2
The vertex signatures of the classes that contain more than one vertex are
extended by the sorted list of labels of the incident edges.
This partially distinguishes the three oxygen atoms,
because two are incident to a single bond, the other to a double bond.
It also distinguishes most carbon atoms,
because they have different sets of incident edges.
Only the signatures of carbons 3 and 6
and the signatures of oxygens 4 and 8
need to be extended further.

Constructing Vertex Signatures
[Figure: example molecule with vertex numbering]
vertex  signature
1       S
2       N
4       O - N
8       O - C --=
9       O =
3       C -- S C --
6       C -- C -- C ---
5       C ---
7       C --=
Vertex Signatures, Step 3
The vertex signatures of carbons 3 and 6 and of oxygens 4 and 8 are extended
by the sorted list of vertex signatures of the adjacent vertices.
This distinguishes the two pairs
(carbon 3 is adjacent to a sulfur atom,
oxygen 4 is adjacent to a nitrogen atom).
As a result, all equivalence classes contain only a single vertex and thus
we obtained a unique vertex labeling.
With this unique vertex labeling,
constructing a canonical code word
becomes very simple and efficient.

Elements of Vertex Signatures
Using only (sorted) lists of labels of incident edges and adjacent vertices
cannot always distinguish all vertices.
Example: For the following two (unlabeled) graphs such vertex signatures
cannot split the sole equivalence class:
[Figure: two unlabeled graphs]
The equivalence class can be split for the right graph, though, if the number
of adjacent vertices that are adjacent is incorporated into the vertex signature.
There is also a large variety of other graph properties that may be used.
However, for neither graph the equivalence classes can be reduced to single vertices.
For the left graph it is not even possible at all to split the equivalence class.
The reason is that both graphs possess automorphisms other than the identity.

Automorphism Groups
Let $F_{\mathrm{auto}}(G)$ be the set of all automorphisms of a (labeled) graph $G$.
The orbit of a vertex $v \in V_G$ w.r.t. $F_{\mathrm{auto}}(G)$ is the set
$o(v) = \{ u \in V_G \mid \exists f \in F_{\mathrm{auto}}(G) : u = f(v) \}$.
Note that we always have $v \in o(v)$, because the identity is always in $F_{\mathrm{auto}}(G)$.
The vertices in an orbit cannot possibly be distinguished by vertex signatures,
because the graph "looks the same" from all of them.
In order to deal with orbits, one can exploit that the automorphisms $F_{\mathrm{auto}}(G)$
of a graph $G$ form a group (the automorphism group of $G$):
During the construction of a canonical code word,
detect automorphisms (vertex numberings leading to the same code word).
From found automorphisms, generators of the group of automorphisms
can be derived. These generators can then be used to avoid exploring
implied automorphisms, thus speeding up the search. [McKay 1981]

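For very small graphs, the orbit definition can be evaluated directly by enumerating all vertex permutations. The following Python sketch does exactly this; it is a brute-force illustration with hypothetical function names and is unrelated to the generator-based technique cited above.

```python
from itertools import permutations


def automorphisms(labels, edges):
    """All automorphisms of a small labeled graph by brute force.

    labels: list of vertex labels (vertices are 0 .. n-1)
    edges:  dict frozenset({u, v}) -> edge label
    """
    n = len(labels)
    result = []
    for perm in permutations(range(n)):
        if any(labels[perm[v]] != labels[v] for v in range(n)):
            continue                       # vertex labels must be preserved
        mapped = {frozenset(perm[v] for v in edge): label
                  for edge, label in edges.items()}
        if mapped == edges:                # edges and edge labels preserved
            result.append(perm)
    return result


def orbits(labels, edges):
    """Set of orbits o(v) = { f(v) : f is an automorphism }."""
    autos = automorphisms(labels, edges)
    return {frozenset(f[v] for f in autos) for v in range(len(labels))}


# Chain of three carbon atoms: the two end atoms form one orbit.
chain_labels = ["C", "C", "C"]
chain_edges = {frozenset({0, 1}): "-", frozenset({1, 2}): "-"}
print(orbits(chain_labels, chain_edges))
```

The enumeration takes time factorial in the number of vertices, so it only serves to make the definition tangible.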
Canonical Form and Vertex Signatures
Advantages of Vertex Signatures:
Vertices with the same label can be distinguished in a preprocessing step.
Constructing canonical code words can thus become much easier/faster,
because the necessary backtracking can often be reduced considerably.
(The gains are usually particularly large for graphs with few/no labels.)
Disadvantages of Vertex Signatures:
Vertex signatures can refer to the graph as a whole
and thus may be different for subgraphs.
(Vertices with different signatures in a subgraph
may have the same signature in a supergraph and vice versa.)
As a consequence it can be difficult to ensure
that the resulting canonical form has the prefix property.
In such a case one may not be able to restrict (sub)graph extensions
or to use the simplified search scheme (only code word checks).

Repository of Processed Fragments
Repository of Processed Fragments
Canonical form pruning is the predominant method
to avoid redundant search in frequent (sub)graph mining.
The obvious alternative, a repository of processed (sub)graphs,
has received fairly little attention. [Borgelt and Fiedler 2007]
Whenever a new (sub)graph is created, the repository is accessed.
If it contains the (sub)graph, we know that it has already been processed
and therefore it can be discarded.
Only (sub)graphs that are not contained in the repository are extended
and, of course, inserted into the repository.
If the repository is laid out as a hash table with a carefully designed
hash function, it is competitive with canonical form pruning.
(In some experiments, the repository-based approach
could outperform canonical form pruning by 15%.)

Repository of Processed Fragments
Each (sub)graph should be stored using a minimal amount of memory
(since the number of processed (sub)graphs is usually huge).
Store a (sub)graph by listing the edges of one occurrence.
(Note that for connected (sub)graphs the edges also identify all vertices.)
The containment test has to be made as fast as possible
(since it will be carried out frequently).
Try to avoid a full isomorphism test with a hash table:
Employ a hash function that is computed from local graph properties.
(Basic idea: combine the vertex and edge attributes and the vertex degrees.)
If an isomorphism test is necessary, do quick checks first:
number of vertices, number of edges, first containing database graph etc.
Actual isomorphism test:
mark stored occurrence and check for fully marked new occurrence
(cf. the procedure of equivalent sibling pruning).

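A repository along these lines can be sketched in Python. The sketch makes several simplifying assumptions that are not part of the slides: graphs are stored as label lists and edge dictionaries rather than as occurrences, the bucket key is a tuple of local properties instead of a tuned hash function, and the isomorphism test is a brute-force search over permutations in place of the marking procedure. All names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def local_key(labels, edges):
    """Bucket key from local properties: labels, degrees, edge descriptions."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    vertex_part = sorted((labels[v], degree[v]) for v in range(len(labels)))
    edge_part = sorted(tuple(sorted((labels[u], labels[v]))) + (bond,)
                       for (u, v), bond in edges.items())
    return tuple(vertex_part), tuple(edge_part)


def isomorphic(labels_a, edges_a, labels_b, edges_b):
    """Brute-force isomorphism test (only sensible for tiny graphs)."""
    if sorted(labels_a) != sorted(labels_b) or len(edges_a) != len(edges_b):
        return False                       # quick checks first
    lookup = {frozenset(edge): bond for edge, bond in edges_b.items()}
    n = len(labels_a)
    for perm in permutations(range(n)):
        if all(labels_b[perm[v]] == labels_a[v] for v in range(n)) and \
           all(lookup.get(frozenset((perm[u], perm[v]))) == bond
               for (u, v), bond in edges_a.items()):
            return True
    return False


class Repository:
    """Hash table of processed (sub)graphs."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def add_if_new(self, labels, edges):
        """Store the graph and return True if it was not processed before."""
        bucket = self.buckets[local_key(labels, edges)]
        if any(isomorphic(labels, edges, l, e) for l, e in bucket):
            return False
        bucket.append((labels, edges))
        return True


repo = Repository()
# O-S-C-N in two different vertex numberings, then a different fragment
print(repo.add_if_new(["S", "C", "N", "O"], {(0, 1): "-", (1, 2): "-", (0, 3): "-"}))  # True
print(repo.add_if_new(["S", "C", "O", "N"], {(0, 1): "-", (0, 2): "-", (1, 3): "-"}))  # False
print(repo.add_if_new(["S", "C", "N", "O"], {(0, 1): "-", (1, 2): "-", (1, 3): "-"}))  # True
```

The third fragment lands in a different bucket because its vertex degrees differ, so no isomorphism test is needed for it.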
Canonical Form Pruning versus Repository
Advantage of Canonical Form Pruning
Only one test (for canonical form) is needed in order to determine
whether a (sub)graph needs to be processed or not.
Disadvantage of Canonical Form Pruning
It is most costly for the (sub)graphs that are created in canonical form.
(slowest for fragments that have to be processed)
Advantage of Repository-based Pruning
Often allows to decide very quickly that a (sub)graph has not been processed.
(fastest for fragments that have to be processed)
Disadvantages of Repository-based Pruning
Multiple isomorphism tests may be necessary for a processed fragment.
Needs far more memory than canonical form pruning.
A repository is very difficult to use in a parallel algorithm.

Canonical Form vs. Repository: Execution Times
[Figure: two plots of time/seconds over the minimum support (2 to 6 percent), each comparing canonical form pruning and repository]
Experimental results on the IC93 data set,
search time in seconds (vertical axis) versus
minimum support in percent (horizontal axis).
Left: maximum source extensions
Right: rightmost path extensions

Canonical Form vs. Repository: Numbers of (Sub)Graphs
[Figure: two plots of subgraphs/10 000 over the minimum support (2 to 6 percent) with the curves generated, dupl. tests, processed, and duplicates]
Experimental results on the IC93 data set,
numbers of subgraphs used in the search.
Left: maximum source extensions
Right: rightmost path extensions

Repository Performance
[Figure: two plots of subgraphs/10 000 over the minimum support (2 to 6 percent) with the curves generated, accesses, isom. tests, and duplicates]
Experimental results on the IC93 data set,
performance of repository-based pruning.
Left: maximum source extensions
Right: rightmost path extensions

Perfect Extension Pruning
Reminder: Perfect Extension Pruning for Item Sets
If only closed item sets or only maximal item sets are to be found,
additional pruning of the search tree becomes possible.
Suppose that during the search we discover that
$s_T(I \cup \{a\}) = s_T(I)$
for some item set $I$ and some item $a \notin I$. (That is, $I$ is not closed.)
We call the item $a$ a perfect extension of $I$. Then we know
$\forall J \supseteq I: \; s_T(J \cup \{a\}) = s_T(J)$.
This can most easily be seen by considering that $K_T(I) \subseteq K_T(\{a\})$
and hence $K_T(J) \subseteq K_T(\{a\})$, since $K_T(J) \subseteq K_T(I)$.
As a consequence, no superset $J \supseteq I$ with $a \notin J$ can be closed.
Hence $a$ can be added directly to the prefix of the conditional database.
The same basic idea can also be used for graphs, but needs modifications.

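The perfect extension condition for item sets can be checked with a short Python sketch. It assumes that the transaction database is a list of Python sets and that the cover $K_T$ is the set of indices of the containing transactions; the function names and the toy database are hypothetical.

```python
def cover(transactions, items):
    """Indices of the transactions that contain all given items (K_T)."""
    return {k for k, t in enumerate(transactions) if items <= t}


def support(transactions, items):
    """Absolute support s_T: the size of the cover."""
    return len(cover(transactions, items))


def perfect_extensions(transactions, item_set):
    """Items a not in item_set with s_T(item_set | {a}) == s_T(item_set)."""
    base = cover(transactions, item_set)
    candidates = set().union(*transactions) - item_set
    return {a for a in candidates
            if all(a in transactions[k] for k in base)}


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
print(perfect_extensions(transactions, {"a"}))      # {'b'}
# the property carries over to supersets J of I = {a}:
print(support(transactions, {"a", "c", "b"}) == support(transactions, {"a", "c"}))  # True
```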
Perfect Extensions
An extension of a graph (fragment) is called perfect,
if it can be applied to all of its occurrences in exactly the same way.
Attention: It may not be enough to compare the support
and the number of occurrences of the graph fragment.
(Even though perfect extensions must have the same support and
an integer multiple of the number of occurrences of the base fragment.)
[Figure: two example molecules and the fragments O-C-S-C, N-C-S-C, and O-C-S-C-N, annotated with their numbers of occurrences (2+2 embs., 1+1 embs., 1+3 embs.)]
Neither is a single bond to nitrogen a perfect extension of O-C-S-C
nor is a single bond to oxygen a perfect extension of N-C-S-C.
However, we need that a perfect extension of a graph fragment
is also a perfect extension of any supergraph of this fragment.
Consequence: It may be necessary to check whether all occurrences
of the base fragment lead to the same number of extended occurrences.

Partial Perfect Extension Pruning
Basic idea of perfect extension pruning:
First grow a fragment to the biggest common substructure.
Partial perfect extension pruning: If the children of a search tree vertex
are ordered lexicographically (w.r.t. their code word), no fragment in a subtree
to the right of a perfect extension branch can be closed. [Yan and Han 2003]
[Figure: three example molecules and the search tree for seed S with partial perfect extension pruning, annotated with support values; breadth-first search canonical form with S ≺ F ≺ N ≺ C ≺ O and - ≺ =]

Full Perfect Extension Pruning
Full perfect extension pruning: [Borgelt and Meinl 2006]
Also prune the branches to the left of the perfect extension branch.
Problem: This pruning method interferes with canonical form pruning,
because the extensions in the left siblings cannot be repeated in the perfect
extension branch (restricted extensions, "simple rules" for canonical form).
[Figure: three example molecules and the search tree for seed S with full perfect extension pruning, annotated with support values; breadth-first search canonical form with S ≺ F ≺ N ≺ C ≺ O and - ≺ =]

Code Word Reorganization
Restricted extensions:
Not all extensions of a fragment are allowed by the canonical form.
Some can be checked by simple rules (rightmost path/max. source extension).
Consequence: In order to make canonical form pruning and full perfect
extension pruning compatible, the restrictions on extensions must be mitigated.
Example:
The core problem of obtaining the search tree on the previous slide is
how we can avoid that the fragment O-S-C-N is pruned as non-canonical:
The breadth-first search canonical code word for this fragment is
S 0-C1 0-O2 1-N3
However, with the search tree on the previous slide it is assigned
S 0-C1 1-N2 0-O3
Solution: Deviate from appending the description of a new edge.
Allow for a (strictly limited) code word reorganization.

Code Word Reorganization
In order to obtain a proper code, it must be possible to shift descriptions
of new edges past descriptions of perfect extension edges in the code word.
The code word of a fragment consists of two parts:
a prefix ending with the last non-perfect extension edge and
a (possibly empty) suffix of perfect extension edges.
A new edge description is usually appended at the end of the code word.
This is still the standard procedure if the suffix is empty.
However, if the suffix is not empty, the description of the new edge
may be inserted into the suffix or even moved directly before the suffix.
(Whichever possibility yields the lexicographically smallest code word.)
Rather than actually shifting and modifying edge descriptions,
it is technically easier to rebuild the code word from the front.
(In particular, renumbering the vertices is easier.)

Code Word Reorganization: Example
Shift an extension to the proper place and renumber the vertices:
1. Base fragment: S-C-N; canonical code: S 0-C1 1-N2
2. Extension to O-S-C-N; (non-canonical!) code: S 0-C1 1-N2 0-O3
3. Shift extension; (invalid) code: S 0-C1 0-O3 1-N2
4. Renumber vertices; canonical code: S 0-C1 0-O2 1-N3
Rebuild the code word from the front:
The root vertex (here the sulfur atom) is always in the fixed part.
It receives the initial vertex index, that is, 0 (zero).
Compare two possible code word prefixes: S 0-O1 and S 0-C1.
Fix the latter, since it is lexicographically smaller.
Compare the code word prefixes S 0-C1 0-O2 and S 0-C1 1-N2.
Fix the former, since it is lexicographically smaller.
Append the remaining perfect extension edge: S 0-C1 0-O2 1-N3
(Breadth-first search canonical form with S ≺ N ≺ C ≺ O and - ≺ =.)

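Rebuilding a code word from the front can be illustrated for the example fragment with a simplified Python sketch. It handles only tree-shaped fragments and assumes that the edges leaving one vertex differ in their (bond, destination label) description, so that no backtracking is needed; it also ignores the prefix/suffix distinction and simply rebuilds the whole code word. The function name and the data layout are hypothetical.

```python
from collections import deque

# label order of the example: S ≺ N ≺ C ≺ O and - ≺ =
ORDER = {label: rank for rank, label in enumerate(["S", "N", "C", "O", "-", "="])}


def rebuild_code_word(labels, tree_edges, root):
    """Rebuild a breadth-first code word for a tree-shaped fragment.

    labels: dict vertex -> atom label; tree_edges: list of (u, v, bond).
    """
    neighbors = {v: [] for v in labels}
    for u, v, bond in tree_edges:
        neighbors[u].append((bond, v))
        neighbors[v].append((bond, u))
    index = {root: 0}                      # the root receives index 0
    parts = [labels[root]]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        children = [(b, v) for b, v in neighbors[u] if v not in index]
        # smallest edge description first: bond, then destination label
        children.sort(key=lambda bv: (ORDER[bv[0]], ORDER[labels[bv[1]]]))
        for bond, v in children:
            index[v] = len(index)          # renumber in order of discovery
            parts.append(f"{index[u]}{bond}{labels[v]}{index[v]}")
            queue.append(v)
    return " ".join(parts)


atoms = {"s": "S", "c": "C", "n": "N", "o": "O"}
bonds = [("s", "c", "-"), ("c", "n", "-"), ("s", "o", "-")]
print(rebuild_code_word(atoms, bonds, "s"))   # S 0-C1 0-O2 1-N3
```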
Perfect Extensions: Problems with Cycles/Rings
[Figure: two example molecules and the search tree for seed N]
Problem: Perfect extensions in cycles may not allow for pruning.
Consequence: Additional constraint [Borgelt and Meinl 2006]
Perfect extensions must be bridges or edges closing a cycle/ring.

Experiments: IC93 without Ring Mining
[Figure: three plots over the minimal support (2.5 to 6 percent), each comparing full, partial, and no perfect extension pruning: fragments/10^4, nodes/10^3, and occurrences/10^6]
Experimental results on the IC93 data, obtained without ring mining (single
bond extensions). The horizontal axis shows the minimal support in percent.
The curves show the number of generated fragments (top left), the number of
processed occurrences (bottom left), and the number of search tree nodes
(top right) for the three different methods.

Experiments: IC93 with Ring Mining
[Figure: three plots over the minimal support (2 to 4 percent), each comparing full, partial, and no perfect extension pruning: fragments/10^3, nodes/10^3, and occurrences/10^5]
Experimental results on the IC93 data, obtained with ring mining.
The horizontal axis shows the minimal support in percent.
The curves show the number of generated fragments (top left), the number of
processed occurrences (bottom left), and the number of search tree nodes
(top right) for the three different methods.

Extensions for Molecular Fragment Mining
Extensions of the Search Algorithm
Rings [Hofer, Borgelt, and Berthold 2004; Borgelt 2006]
Preprocessing: Find rings in the molecules and mark them.
In the search process: Add all atoms and bonds of a ring in one step.
Considerably improves efficiency and interpretability.
Carbon Chains [Meinl, Borgelt, and Berthold 2004]
Add a carbon chain in one step, ignoring its length.
Extensions by a carbon chain match regardless of the chain length.
Wildcard Atoms [Hofer, Borgelt, and Berthold 2004]
Define classes of atoms that can be seen as equivalent.
Combine fragment extensions with equivalent atoms.
Infrequent fragments that differ only in a few atoms
from frequent fragments can be found.

Ring Mining: Treat Rings as Units
General Idea of Ring Mining
A ring (cycle) is either contained in a fragment as a whole or not at all.
Filter Approaches
(Sub)graphs/fragments are grown edge by edge (as before).
Found frequent graph fragments are filtered:
Graph fragments with incomplete rings are discarded.
Additional search tree pruning:
Prune subtrees that yield only fragments with incomplete rings.
Reordering Approach
If an edge is added that is part of one or more rings, (one of) the containing ring(s) is added as a whole (all of its edges are added).
Incompatibilities with canonical form pruning are handled by reordering code words (similar to full perfect extension pruning).
Christian Borgelt Frequent Pattern Mining 426
Ring Mining: Preprocessing
Ring mining is simpler after preprocessing the rings in the graphs to analyze.
Basic Preprocessing: (for filter approaches)
Mark all edges of rings in a user-specified size range.
(molecular fragment mining: usually rings with 5 to 6 vertices/atoms)
Technically, there are two ring identification parts per edge:
A marker in the edge attribute, which fundamentally distinguishes ring edges from non-ring edges.
A set of flags identifying the different rings an edge is contained in.
(Note that an edge can be part of several rings.)
Extended Preprocessing: (for reordering approach)
[Figure: example molecule with connected and nested rings, root atom N, vertices numbered 0 to 9]
Mark pseudo-rings, that is, rings of smaller size than the user specified, but which consist only of edges that are part of rings within the user-specified size range.
Christian Borgelt Frequent Pattern Mining 427
Filter Approaches: Open Rings
Idea of Open Ring Filtering:
If we require the output to have only complete rings, we have to identify and remove fragments with ring edges that do not belong to any complete ring.
Ring edges have been marked in the preprocessing, so it is known which edges of a grown (sub)graph are ring edges (in the underlying graphs of the database).
Apply the preprocessing procedure to a grown (sub)graph, but
keep the marker in the edge attribute;
only set the flags that identify the rings an edge is contained in.
Check for edges that have a ring marker in the edge attribute, but did not receive any ring flag when the (sub)graph was reprocessed.
If such edges exist, the (sub)graph contains unclosed/open rings, so the (sub)graph must not be reported.
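The open ring check can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: the function names `on_ring` and `has_open_ring`, the edge-list representation, and the brute-force cycle search are assumptions made for this example. An edge "receives a ring flag" here if it lies on a simple cycle whose size is within the user-specified range.

```python
from collections import defaultdict

def on_ring(edges, u, v, lo, hi):
    """True if edge {u, v} lies on a simple cycle with lo..hi vertices."""
    adj = defaultdict(set)
    for a, b in edges:
        if {a, b} != {u, v}:              # leave out the edge under test
            adj[a].add(b)
            adj[b].add(a)

    def dfs(node, visited, length):
        # `length` edges lead from u to `node`; the edge {u, v} closes the cycle
        if node == v:
            return lo <= length + 1 <= hi
        if length + 1 >= hi:              # any longer path gives too large a cycle
            return False
        return any(dfs(n, visited | {n}, length + 1)
                   for n in adj[node] if n not in visited)

    return dfs(u, {u}, 0)

def has_open_ring(edges, ring_marked, lo=5, hi=6):
    """edges: list of (u, v) pairs of the grown (sub)graph.
    ring_marked: those edges that carry a ring marker from the database graphs.
    Returns True if some marked edge receives no ring flag in the (sub)graph."""
    return any(not on_ring(edges, u, v, lo, hi) for (u, v) in ring_marked)

ring6 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(has_open_ring(ring6, set(ring6)))          # complete 6-ring
print(has_open_ring(ring6[:5], set(ring6[:5])))  # one ring edge missing
```

A fragment for which `has_open_ring` returns True would not be reported, although it may still be extended.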
Christian Borgelt Frequent Pattern Mining 428
Filter Approaches: Unclosable Rings
Idea of Unclosable Ring Filtering:
Grown (sub)graphs with open rings that cannot be closed by future extensions can be pruned from the search.
Canonical form pruning makes it possible to restrict the possible extensions of a fragment.
Due to previous extensions certain vertices become unextendable.
Some rings cannot be closed by extending a (sub)graph.
Obviously, a necessary (though not sufficient) condition for all rings being closed is that every vertex has either zero or at least two incident ring edges.
If there is a vertex with only one incident ring edge, this edge must be part of an incomplete ring.
If an unextendable vertex of a grown (sub)graph has only one incident ring edge, this (sub)graph can be pruned from the search (because there is an open ring that can never be closed).
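The pruning condition above amounts to a degree count over ring edges. A minimal sketch, assuming that the set of unextendable vertices is supplied by the canonical form logic (the function name `unclosable` and the argument layout are hypothetical):

```python
from collections import Counter

def unclosable(ring_edges, unextendable):
    """True if some unextendable vertex has exactly one incident ring edge.

    ring_edges: (u, v) pairs of the fragment that carry a ring marker.
    unextendable: vertices that can no longer be extended (assumed given).
    """
    degree = Counter()
    for u, v in ring_edges:
        degree[u] += 1
        degree[v] += 1
    return any(degree[v] == 1 for v in unextendable)

# Open ring 0-1-2-3: the end vertices 0 and 3 have one incident ring edge each.
print(unclosable([(0, 1), (1, 2), (2, 3)], unextendable={0}))
```

Since the condition is only necessary for closed rings, a fragment that passes this check may still contain an open ring.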
Christian Borgelt Frequent Pattern Mining 429
Reminder: Restricted Extensions
[Figure: example molecule (atoms S, N, O and C) with two numbered spanning trees. A, depth-first: S0, N1, O2, C3, C4, O5, O6, C7, C8. B, breadth-first: S0, N1, C2, O3, C4, C5, C6, O7, O8]
Extendable Vertices:
A: vertices on the rightmost path, that is, 0, 1, 3, 7, 8.
B: vertices with an index no smaller than the maximum source, that is, 6, 7, 8.
Edges Closing Cycles:
A: none, because the existing cycle edge has the smallest possible source.
B: the edge between the vertices 7 and 8.
Christian Borgelt Frequent Pattern Mining 430
Filter Approaches: Merging Ring Extensions
Idea of Merging Ring Extensions:
The previous methods work on individual edges and hence cannot always detect if an extension only leads to fragments with complete rings that are infrequent.
Add all edges of a ring, thus distinguishing extensions that start with the same individual edge, but lead into rings of different size or different composition.
[Figure: two ring extensions (atoms N, O, C) that start with the same edge but lead into different rings]
Determine the support of the grown (sub)graphs and prune infrequent ones.
Trim and merge ring extensions that share the same initial edge.
Advantage of Merging Ring Extensions:
All extensions are removed that become infrequent when completed into rings.
All occurrences are removed that lead to infrequent (sub)graphs once rings are completed.
Christian Borgelt Frequent Pattern Mining 431
A Reordering Approach
Drawback of Filtering:
(Sub)graphs are still extended edge by edge. Fragments grow fairly slowly.
Better Approach:
Add all edges of a ring in one step. (When a ring edge is added, create one extended (sub)graph for each ring it is contained in.)
Reorder certain edges in order to comply with canonical form pruning.
Problems of a Reordering Approach:
One must allow for insertions between already added ring edges (because branches may precede ring edges in the canonical form).
One must not commit too early to an order of the edges (because branches may influence the order of the ring edges).
All possible orders of (locally) equivalent edges must be tried, because any of them may produce valid output.
Christian Borgelt Frequent Pattern Mining 432
Problems of Reordering Approaches
One must not commit too early to an order of the edges.
Illustration: effects of attaching a branch to an asymmetric ring (N ≺ O ≺ C, - ≺ =).
[Figure: four fragments with numbered vertices, each shown with its code word:]
N 0-C1 0-C2 1-C3 2-C4 3-C5 4=C5
N 0-C1 0-C2 1-C3 2-C4 3=C5 4-C5
N 0-C1 0-C2 1-C3 2-O4 2-C5 3=C6 5-C6
N 0-C1 0-C2 1-O3 1-C4 2-C5 3-C6 5=C6
W.r.t. a breadth-first search canonical form, the edges of the ring can be ordered in two different ways (upper two rows).
The upper/left is the canonical form of the pure ring.
With an attached branch (close to the root vertex), the other ordering of the ring edges (lower/right) is the canonical form.
Christian Borgelt Frequent Pattern Mining 433
Keeping Non-Canonical Fragments
Solution of the early commitment problem:
Maintain (and extend) both orderings of the ring edges and allow for deviations from the canonical form beyond fixed edges.
Principle: keep (and, consequently, also extend) fragments that are not in canonical form, but that could become canonical once branches are added.
Needed: a rule which non-canonical fragments to keep and which to discard.
Idea: adding a ring can be seen as adding its initial edge as in an edge-by-edge procedure, and some additional edges, the positions of which are not yet fixed.
As a consequence we can split the code word into two parts:
a fixed prefix, which is also built by an edge-by-edge procedure, and
a volatile suffix, which consists of the additional (ring) edges.
Christian Borgelt Frequent Pattern Mining 434
Keeping Non-Canonical Fragments
Fixed prefix of a code word:
The prefix of the code word up to (and including) the last edge added in an edge-by-edge manner.
Volatile suffix of a code word:
The suffix of the code word after the last edge added in an edge-by-edge manner (and excluding it).
Rule for keeping non-canonical fragments:
If the current code word deviates from the canonical code word in the fixed part, the fragment is pruned, otherwise it is kept.
Justification of this rule:
If the deviation is in the fixed part, no later addition of edges can have any effect on it, since the fixed part will never be changed.
If, however, the deviation is in the volatile part, a later extension edge may be inserted in such a way that the code word becomes canonical.
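The rule can be illustrated with a small sketch. It assumes that a code word is given as a list of edge descriptions (the root label is omitted) and that the canonical code word of the same fragment is already known; the function name `keep_fragment` and the parameter `fixed_len` (number of edges in the fixed prefix) are hypothetical.

```python
def keep_fragment(code_word, canonical, fixed_len):
    """Keep a fragment unless its code word deviates from the canonical
    code word inside the fixed prefix (the first `fixed_len` edges)."""
    for i, (current, best) in enumerate(zip(code_word, canonical)):
        if current != best:
            return i >= fixed_len     # deviation in the volatile suffix: keep
    return True                       # no deviation: the code word is canonical

canonical = "0-C1 0-C2 1-C3 2-C4 3-C5 4=C5".split()
other     = "0-C1 0-C2 1-C3 2-C4 3=C5 4-C5".split()

# Only the first ring bond is fixed; the deviation is at the 5th bond.
print(keep_fragment(other, canonical, fixed_len=1))
# If the first five bonds were fixed, the same fragment would be pruned.
print(keep_fragment(other, canonical, fixed_len=5))
```

The two code words are the two orderings of the asymmetric ring used as the running example of this part of the slides.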
Christian Borgelt Frequent Pattern Mining 435
Search Tree for an Asymmetric Ring with Branches
Maintain (and extend) both orderings of the ring edges and allow for deviations from the canonical form beyond fixed edges.
[Figure: search tree for an asymmetric ring (atoms N, O, O and carbons) with branches; each fragment is shown with its vertex numbering]
The edges of a grown subgraph are split into
fixed edges (edges that could have been added in an edge-by-edge manner),
volatile edges (edges that have been added with ring extensions and before/between which edges may be inserted).
Christian Borgelt Frequent Pattern Mining 436
Search Tree for an Asymmetric Ring with Branches
The search constructs the ring with both possible numberings of the vertices.
The form on the left is canonical, so it is kept.
In the fragment on the right only the first ring bond is fixed, all other bonds are volatile.
Since the code word for this fragment deviates from the canonical one only at the 5th bond, we may not discard it.
On the next level, there are two canonical and two non-canonical fragments.
The non-canonical fragments both differ in the fixed part, which now consists of the first three bonds, and thus are pruned.
On the third level, there is one canonical and one non-canonical fragment.
The non-canonical fragment differs in the volatile part (the first four bonds are fixed, but it deviates from the canonical code word only in the 7th bond) and thus may not be pruned from the search.
Christian Borgelt Frequent Pattern Mining 437
Connected and Nested Rings
Connected and nested rings can pose problems, because in the presence of equivalent edges the order of these edges cannot be determined locally.
[Figure: a fragment with connected and nested rings (root atom N, vertices numbered 0 to 9) and the sequence of ring extensions that builds it, with the vertex numbering at each step]
Edges are (locally) equivalent if they start from the same vertex, have the same edge attribute, and lead to vertices with the same vertex attribute.
Equivalent edges must be spliced in all ways in which the order of the edges already in the (sub)graph and the order of the newly added edges is preserved.
It is necessary to consider pseudo-rings for extensions, because otherwise not all orders of equivalent edges are generated.
Christian Borgelt Frequent Pattern Mining 438
Splicing Equivalent Edges
In principle, all possible orders of equivalent edges have to be considered, because any of them may in the end yield the canonical form.
We cannot (always) decide locally which is the right order, because this may depend on edges added later.
Nevertheless, we may not reorder equivalent edges freely, as this would interfere with keeping certain non-canonical fragments:
By keeping some non-canonical fragments we already consider some variants of orders of equivalent edges. These must not be generated again.
Splicing rule for equivalent edges: (breadth-first search canonical form)
The order of the equivalent edges already in the fragment must be maintained, and the order of the equivalent new edges must be maintained.
The two sequences of equivalent edges may be merged in a zipper-like manner, selecting the next edge from either list, but preserving the order in each list.
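The zipper-like merge can be sketched as an enumeration of all order-preserving interleavings of two sequences. The function name `splice` and the use of plain lists of edge names are assumptions for this illustration; how the resulting orders are tested against the canonical form is not shown.

```python
def splice(old, new):
    """Yield all merges of two sequences that preserve the order within each."""
    if not old or not new:
        yield list(old) + list(new)
        return
    for rest in splice(old[1:], new):    # take the next edge from the old list
        yield [old[0]] + rest
    for rest in splice(old, new[1:]):    # take the next edge from the new list
        yield [new[0]] + rest

for order in splice(["a", "b"], ["x"]):
    print(order)
```

For sequences of lengths m and n this yields (m+n)! / (m! n!) orders, which is why the restriction to order-preserving merges matters compared to trying all permutations.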
Christian Borgelt Frequent Pattern Mining 439
The Necessity of Pseudo-Rings
The splicing rule explains the necessity of pseudo-rings:
Without pseudo-rings it is impossible to achieve canonical form in some cases.
[Figure: a fragment with connected and nested rings (root atom N, vertices numbered 0 to 9) and the sequence of ring extensions that builds it, with the vertex numbering at each step]
If we could only add the 5-ring and the 6-ring, but not the 3-ring, the upward bond from the atom numbered 1 would always precede at least one of the other two bonds that are equivalent to it (since the order of existing bonds must be preserved).
However, in the canonical form the upward bond succeeds both other bonds, and this we can achieve only by adding the 3-bond ring first.
Christian Borgelt Frequent Pattern Mining 440
Splicing Equivalent Edges
The considered splicing rule is for a breadth-first search canonical form.
In this form equivalent edges are adjacent in the canonical code word.
In a depth-first search canonical form equivalent edges can be far apart from each other in the code word.
Nevertheless some splicing is necessary to properly treat equivalent edges in this canonical form, even though the rule is slightly simpler.
Splicing rule for equivalent edges: (depth-first search canonical form)
The first new ring edge has to be tried in all locations in the volatile part of the code word where equivalent edges can be found.
Since we cannot decide locally which of these edges should be followed first when building the spanning tree, we have to try all of these possibilities in order not to miss the canonical one.
Christian Borgelt Frequent Pattern Mining 441
Avoiding Duplicate Fragments
The splicing rules still allow that the same fragment can be reached in the same form in different ways, namely by adding (nested) rings in different orders.
Reason: we cannot always distinguish between two different orders in which two rings sharing a vertex are added.
Needed: an augmented canonical form test.
Ideas underlying such an augmented test:
The requirement of complete rings introduces dependences between edges:
The presence of certain edges enforces the presence of certain other edges.
The same code word of a fragment is created several times, but each time with a different fixed part:
The position of the first edge of a ring extension (after reordering) is the end of the fixed part of the (extended) code word.
Christian Borgelt Frequent Pattern Mining 442
Ring Key Pruning
Dependences between Edges
The requirement of complete rings introduces dependences between edges.
(Idea: consider forming sub-fragments with only complete rings.)
A ring edge $e_1$ of a fragment enforces the presence of another ring edge $e_2$ iff the set of rings containing $e_1$ is a subset of the set of rings containing $e_2$.
In order for a ring edge to be present in a sub-fragment, at least one of the rings containing it must be present.
If a ring edge $e_1$ enforces a ring edge $e_2$, it is not possible to form a sub-fragment with only complete rings that contains $e_1$, but not $e_2$.
Obviously, every ring edge enforces at least its own presence.
In order to capture also non-ring edges by such a definition, we define that a non-ring edge enforces only its own presence.
Christian Borgelt Frequent Pattern Mining 443
Ring Key Pruning
Example of Dependences between Edges
[Figure: four copies of a fragment with nested rings (a 3-ring, a 5-ring and a 6-ring; root atom N), shown with different vertex numberings]
(All edge descriptions refer to the vertex numbering in the fragment on the left.)
In the fragment on the left, any edge in the set {(0, 3), (1, 4), (3, 5), (4, 5)} enforces the presence of any other edge in this set, because all of these edges are contained exactly in the 5-ring and the 6-ring.
In the same way, the edges (0, 2) and (1, 2) enforce each other, because both are contained exactly in the 3-ring and the 6-ring.
The edge (0, 1), however, only enforces itself and is enforced only by itself.
There are no other enforcement relations between edges.
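The enforcement relation of this example can be checked with a few lines of code. The sketch assumes that the ring membership of each edge is already known; the ring names `R3`, `R5`, `R6`, the dictionary `RINGS` and the function `enforces` are hypothetical names chosen for this illustration.

```python
# Rings containing each edge of the example fragment (numbering of the left fragment).
RINGS = {
    (0, 1): {"R3", "R5"},
    (0, 2): {"R3", "R6"},
    (1, 2): {"R3", "R6"},
    (0, 3): {"R5", "R6"},
    (1, 4): {"R5", "R6"},
    (3, 5): {"R5", "R6"},
    (4, 5): {"R5", "R6"},
}

def enforces(e1, e2, rings=RINGS):
    """True if edge e1 enforces the presence of edge e2."""
    r1 = rings.get(e1, set())
    if not r1:                        # non-ring edge: enforces only itself
        return e1 == e2
    return r1 <= rings.get(e2, set())  # subset test on the ring sets

pairs = [(a, b) for a in RINGS for b in RINGS if a != b and enforces(a, b)]
print(len(pairs))   # ordered pairs of distinct edges in an enforcement relation
```

The only enforcement relations found are those within {(0, 3), (1, 4), (3, 5), (4, 5)} and between (0, 2) and (1, 2), in agreement with the statements above.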
Christian Borgelt Frequent Pattern Mining 444
Ring Key Pruning
(Shortest) Ring Keys
We consider prefixes of code words that contain $4k + 1$ characters, $k \in \{0, 1, \ldots, m\}$, where $m$ is the number of edges of the fragment.
A prefix $v$ of a code word $vw$ (whether canonical or not) is called a ring key iff each edge described in $w$ is enforced by at least one edge described in $v$.
The prefix $v$ is called a shortest ring key of $vw$ iff it is a ring key and there is no shorter prefix that is a ring key for $vw$.
Note: The shortest ring key of a code word is uniquely defined, but depends, of course, on the considered code word.
Idea of (Shortest) Ring Key Pruning:
Discard fragments that are formed with a code word, the fixed part of which is not a shortest ring key.
Christian Borgelt Frequent Pattern Mining 445
Ring Key Pruning
Example of (shortest) ring key(s)
[Figure: fragment with nested rings (a 3-ring, a 5-ring and a 6-ring), root atom N, vertices numbered 0 to 5]
Breadth-first search (canonical) code word:
N 0-C1 0-C2 0-C3 1-C2 1-C4 3-C5 4-C5
Edges: $e_1, e_2, \ldots, e_7$ (in the order of the code word)
"N" is obviously not a ring key, because it enforces no edges.
"N 0-C1" is not a ring key, because it does not enforce, for example, $e_2$ or $e_3$.
"N 0-C1 0-C2" is not a ring key, because it does not enforce, for example, $e_3$.
"N 0-C1 0-C2 0-C3" is the shortest ring key, because
$e_4 = (1, 2)$ is enforced by $e_2 = (0, 2)$ and
$e_5 = (1, 4)$, $e_6 = (3, 5)$ and $e_7 = (4, 5)$ are enforced by $e_3 = (0, 3)$.
Any longer prefix is a ring key, but not a shortest ring key.
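The example can be reproduced with a short sketch that searches for the shortest prefix that is a ring key. It assumes that the rings containing each edge are known (ring names `R3`, `R5`, `R6` are made up for this illustration) and counts prefix length in edges rather than characters; the function name `shortest_ring_key` is hypothetical.

```python
def shortest_ring_key(edges, rings):
    """Number of leading edges of the code word that form its shortest ring key.

    edges: edges in code word order; rings: edge -> set of rings containing it
    (missing or empty for non-ring edges, which enforce only themselves).
    """
    def ring_set(e):
        return rings.get(e, frozenset())

    def enforced(e, prefix):
        # some ring edge p of the prefix has all its rings among those of e
        return any(ring_set(p) and ring_set(p) <= ring_set(e) for p in prefix)

    for k in range(len(edges) + 1):
        if all(enforced(e, edges[:k]) for e in edges[k:]):
            return k

code_word_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4), (3, 5), (4, 5)]
rings = {
    (0, 1): {"R3", "R5"}, (0, 2): {"R3", "R6"}, (1, 2): {"R3", "R6"},
    (0, 3): {"R5", "R6"}, (1, 4): {"R5", "R6"},
    (3, 5): {"R5", "R6"}, (4, 5): {"R5", "R6"},
}
print(shortest_ring_key(code_word_edges, rings))   # edges e_1 to e_3
```

Because every longer prefix of a ring key is again a ring key, the first k that passes the test is the shortest one.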
Christian Borgelt Frequent Pattern Mining 446
Ring Key Pruning
If only code words with fixed parts that are shortest ring keys are extended, it suffices to check whether the fixed part is a ring key.
Anchor: If a fragment contains only one ring, the first ring edge enforces the other ring edges and thus the fixed part is a shortest ring key.
Induction step:
Let $vw$ be a code word with fixed part $v$ and volatile part $w$, for which the prefix $v$ is a shortest ring key.
Extending this code word generally transforms it into a code word $vuxw'$:
$u$ describes edges originally described by parts of $w$ ($u$ may be empty),
$x$ is the description of the first new edge and
$w'$ describes the remaining old and new edges.
The code word $vuxw'$ cannot have a shorter ring key than $vux$, because the edges described in $vu$ do not enforce the edge described by $x$.
Christian Borgelt Frequent Pattern Mining 447
Ring Key Pruning
Test Procedure of Ring Key Pruning
Check for each volatile edge whether it is enforced by at least one fixed edge:
Mark all rings in the considered fragment (set ring flags).
Remove all rings containing a given volatile edge $e$ (clear ring flags).
If by this procedure a fixed ring edge becomes flagless, the edge $e$ is enforced by it, otherwise the edge $e$ is not enforced.
Example:
[Figure: three fragments with nested rings (root atom N) with their vertex numberings; the non-enforced edges are drawn in grey]
Extending the 5-ring yields the fragment on the right in canonical form with the first two edges (that is, $e_1 = (0, 1)$ and $e_2 = (0, 2)$) fixed.
The prefix N 0-C1 0-C2 is not a ring key (the grey edges are not enforced) and hence the fragment is discarded, even though it is in canonical form.
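The flag-based test can be sketched with bit masks, one bit per ring. This is only an illustration under the assumption that the ring flags of all edges have already been set; the bit assignment (1 for the 3-ring, 2 for the 5-ring, 4 for the 6-ring) and the function name `is_ring_key` are choices made for this example.

```python
def is_ring_key(fixed, volatile, flags):
    """True if every volatile edge is enforced by at least one fixed ring edge.

    flags: edge -> bit mask of the rings that contain the edge (0 for non-ring edges).
    """
    for e in volatile:
        cleared = ~flags[e]               # clear the flags of all rings containing e
        # e is enforced if some fixed ring edge is left without any flag
        if not any(flags[f] != 0 and (flags[f] & cleared) == 0 for f in fixed):
            return False
    return True

R3, R5, R6 = 1, 2, 4
flags = {(0, 1): R3 | R5, (0, 2): R3 | R6, (1, 2): R3 | R6,
         (0, 3): R5 | R6, (1, 4): R5 | R6, (3, 5): R5 | R6, (4, 5): R5 | R6}
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4), (3, 5), (4, 5)]

print(is_ring_key(edges[:2], edges[2:], flags))   # two fixed edges
print(is_ring_key(edges[:3], edges[3:], flags))   # three fixed edges
```

Clearing the rings of a volatile edge leaves a fixed edge flagless exactly when all rings of the fixed edge also contain the volatile edge, which is the subset condition of the enforcement definition.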
Christian Borgelt Frequent Pattern Mining 448
Search Tree for Nested Rings
[Figure: search tree for a fragment with nested rings (root atom N); each node shows a fragment with its vertex numbering; one fragment is labeled "also in canonical form"]
(solid frame: extended and reported; dashed frame: extended, but not reported; no frame: pruned)
The full fragment is generated twice in each form (even the canonical).
Augmented Canonical Form Test:
The created code words have different fixed parts.
Check whether the fixed part is a shortest ring key.
Christian Borgelt Frequent Pattern Mining 449
Search Tree for Nested Rings
In all fragments in the bottom row of the search tree (fragments with frames) the first three edges are fixed, the rest is volatile.
The prefix N 0-C1 0-C2 0-C3 describing these edges is a shortest ring key.
Hence these fragments are kept and processed.
In the row above it (fragments without frames), only the first two edges are fixed, the rest is volatile.
The prefix N 0-C1 0-C2 describing these edges is not a ring key.
(The grey edges are not enforced.) Hence these fragments are discarded.
Note that for all single ring fragments two of their four children are kept, even though only the one at the left bottom is in canonical form.
The reason is that the deviation from the canonical form resides in the volatile part of the fragment.
By attaching additional rings any of these fragments may become canonical.
Christian Borgelt Frequent Pattern Mining 450
Experiments: IC93
[Figure: three plots over the minimal support in percent (2 to 5), each with the curves "reorder", "merge rings" and "close rings": time/seconds, fragments/10^4, and occurrences/10^6]
Experimental results on the IC93 data.
The horizontal axis shows the minimal support in percent.
The curves show the number of generated fragments (top left), the number of processed occurrences (top right), and the execution time in seconds (bottom left) for the three different strategies.
Christian Borgelt Frequent Pattern Mining 451
Experiments: NCI HIV Screening Database
[Figure: three plots over the minimal support in percent (0.5 to 4), each with the curves "reorder", "merge rings" and "close rings": time/seconds, fragments/10^4, and occurrences/10^7]
Experimental results on the HIV data.
The horizontal axis shows the minimal support in percent.
The curves show the number of generated fragments (top left), the number of processed occurrences (top right), and the execution time in seconds (bottom left) for the three different strategies.
Christian Borgelt Frequent Pattern Mining 452
Found Molecular Fragments
Christian Borgelt Frequent Pattern Mining 453
NCI DTP HIV Antiviral Screen: AZT
[Figure: structure drawings of several molecules (atoms N, O, P) and of the fragment they share]
Some Molecules from the NCI HIV Database
Common Fragment
Christian Borgelt Frequent Pattern Mining 454
NCI DTP HIV Antiviral Screen: Other Fragments
[Figure: structure drawings of six fragments (atoms N, S, O, P, Cl)]
Fragment    CA        CI/CM
1           5.23%     0.05%
2           4.92%     0.07%
3           5.23%     0.08%
4           9.85%     0.07%
5           10.15%    0.04%
6           9.85%     0.00%
Christian Borgelt Frequent Pattern Mining 455
Experiments: Ring Extensions
Improved Interpretability
[Figure: Fragment 1 and Fragment 2 (both containing an N atom) and two example compounds, each labeled with an NSC number]
Fragment 1: basic algorithm, freq. in CA: 22.77%
Fragment 2: with ring extensions, freq. in CA: 20.00%
Compounds from the NCI cancer data set that contain Fragment 1 but not 2.
Christian Borgelt Frequent Pattern Mining 456
Experiments: Carbon Chains
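Technically: Add a carbon chain in one step, ignoring its length.
Extension by a carbon chain match regardless of the chain length.
Advantage: Fragments can represent carbon chains of varying length.
Example from the NCI Cancer Dataset:
Fragment with Chain
[Figure: fragment with two N atoms connected via a carbon chain C*]
freq. CA: 1.48%
freq. CI: 0.13%
Actual Structures
[Figure: two structures with N atoms and carbon chains of different length]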
Christian Borgelt Frequent Pattern Mining 457
Experiments: Wildcard Atoms
Define classes of atoms that can be considered as equivalent.
Combine fragment extensions with equivalent atoms.
Advantage: Infrequent fragments that differ only in a few atoms from frequent fragments can be found.
Examples from the NCI HIV Dataset:
[Figure: fragment (atoms N, S, Cl) with wildcard atom A]
            A: O      A: N
CA          5.5%      3.7%
CI/CM       0.0%      0.0%
[Figure: fragment (atoms N, Cl, S) with wildcard atom B]
            B: O      B: S
CA          5.5%      0.01%
CI/CM       0.0%      0.0%
Christian Borgelt Frequent Pattern Mining 458
Summary Frequent (Sub)Graph Mining
Frequent (sub)graph mining is closely related to frequent item set mining:
Find frequent (sub)graphs instead of frequent subsets.
A core problem of frequent (sub)graph mining is how to avoid redundant search.
This problem is solved with the help of canonical forms of graphs.
Different canonical forms lead to different behavior of the search algorithm.
The restriction to closed fragments is a lossless reduction of the output.
All frequent fragments can be reconstructed from the closed ones.
A restriction to closed fragments allows for additional pruning strategies:
partial and full perfect extension pruning.
Extensions of the basic algorithm (particularly useful for molecules) include:
Ring Mining, (Carbon) Chain Mining, and Wildcard Vertices.
A Java implementation for molecular fragment mining is available at:
http://www.borgelt.net/moss.html
Christian Borgelt Frequent Pattern Mining 459
Mining a Single Graph
Christian Borgelt Frequent Pattern Mining 460
Reminder: Basic Notions
A labeled or attributed graph is a triple $G = (V, E, \ell)$, where
$V$ is the set of vertices,
$E \subseteq V \times V - \{(v, v) \mid v \in V\}$ is the set of edges, and
$\ell: V \cup E \to A$ assigns labels from the set $A$ to vertices and edges.
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs.
A subgraph isomorphism of $S$ to $G$ or an occurrence of $S$ in $G$ is an injective function $f: V_S \to V_G$ with
$\forall v \in V_S: \; \ell_S(v) = \ell_G(f(v))$ and
$\forall (u, v) \in E_S: \; (f(u), f(v)) \in E_G \;\wedge\; \ell_S((u, v)) = \ell_G((f(u), f(v)))$.
That is, the mapping $f$ preserves the connection structure and the labels.
Christian Borgelt Frequent Pattern Mining 461
Anti-Monotonicity of Subgraph Support
Most natural definition of subgraph support in a single graph setting:
number of occurrences (subgraph isomorphisms).
Problem: The number of occurrences of a subgraph is not anti-monotone.
Example:
input graph: B-A-B
subgraphs and their numbers of occurrences:
A: $s_G(\mathrm{A}) = 1$
A-B: $s_G(\mathrm{A{-}B}) = 2$
B-A-B: $s_G(\mathrm{B{-}A{-}B}) = 2$
[Figure: the occurrences of the three subgraphs in the input graph, with vertex numberings]
But: Anti-monotonicity is vital for the efficiency of frequent subgraph mining.
Question: How should we define subgraph support in a single graph?
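The counts in this example can be verified by brute force. The sketch below enumerates all injective mappings and keeps those that preserve vertex labels and edges; edge labels are ignored (an assumption that is harmless here, since the example has none), and the function name `occurrences` as well as the list-based graph encoding are hypothetical.

```python
from itertools import permutations

def occurrences(s_labels, s_edges, g_labels, g_edges):
    """All occurrences of S in G as tuples: image[v] is the image of vertex v of S.

    Vertices are numbered 0..n-1; edges are undirected (u, v) pairs.
    Brute force, only meant for tiny graphs.
    """
    g_edge_set = {frozenset(e) for e in g_edges}
    found = []
    for image in permutations(range(len(g_labels)), len(s_labels)):
        labels_ok = all(s_labels[v] == g_labels[image[v]]
                        for v in range(len(s_labels)))
        edges_ok = all(frozenset((image[u], image[v])) in g_edge_set
                       for u, v in s_edges)
        if labels_ok and edges_ok:
            found.append(image)
    return found

G = ("BAB", [(0, 1), (1, 2)])                      # input graph B-A-B
print(len(occurrences("A", [], *G)))               # subgraph A
print(len(occurrences("AB", [(0, 1)], *G)))        # subgraph A-B
print(len(occurrences("BAB", [(0, 1), (1, 2)], *G)))  # subgraph B-A-B
```

The single vertex A has fewer occurrences than its supergraph A-B, which is exactly the violation of anti-monotonicity stated above.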
Christian Borgelt Frequent Pattern Mining 463
Relations between Occurrences
Let $f_1$ and $f_2$ be two subgraph isomorphisms of $S$ to $G$ and
$V_1 = \{v \in V_G \mid \exists u \in V_S: v = f_1(u)\}$ and
$V_2 = \{v \in V_G \mid \exists u \in V_S: v = f_2(u)\}$.
The two subgraph isomorphisms $f_1$ and $f_2$ are called
overlapping, written $f_1 \circ\!\circ f_2$, iff $V_1 \cap V_2 \neq \emptyset$,
equivalent, written $f_1 \circ f_2$, iff $V_1 = V_2$,
identical, written $f_1 \equiv f_2$, iff $\forall v \in V_S: f_1(v) = f_2(v)$.
Note that identical subgraph isomorphisms are equivalent and that equivalent subgraph isomorphisms are overlapping.
There can be non-identical, but equivalent subgraph isomorphisms, namely if $S$ possesses an automorphism that is not the identity.
Christian Borgelt Frequent Pattern Mining 464
Overlap Graphs of Occurrences
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs and let $V_O$ be the set of all occurrences (subgraph isomorphisms) of $S$ in $G$.
The overlap graph of $S$ w.r.t. $G$ is the graph $O = (V_O, E_O)$, which has the set $V_O$ of occurrences of $S$ in $G$ as its vertex set and the edge set
$E_O = \{(f_1, f_2) \mid f_1, f_2 \in V_O \;\wedge\; f_1 \not\equiv f_2 \;\wedge\; f_1 \circ\!\circ f_2\}$.
Example:
[Figure: an input graph with vertices labeled B, A, B, A, B, the subgraph B-A-B, its four occurrences in the input graph (with vertex numberings), and the resulting overlap graph]
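Building an overlap graph from a list of occurrences is straightforward. In the sketch below each occurrence is a tuple of image vertices (one entry per vertex of S); the function name `overlap_graph` is hypothetical, and the example occurrences assume that the input graph is the path B-A-B-A-B with vertices 0 to 4, which is an assumption about the figure, not something stated on the slide.

```python
from itertools import combinations

def overlap_graph(occs):
    """Edge set of the overlap graph; vertices are indices into `occs`.

    Two occurrences are connected iff they are not identical
    and their image vertex sets intersect.
    """
    edges = set()
    for (i, f1), (j, f2) in combinations(enumerate(occs), 2):
        if f1 != f2 and set(f1) & set(f2):
            edges.add((i, j))
    return edges

# Occurrences of B-A-B in the path B-A-B-A-B (vertices 0..4).
occs = [(0, 1, 2), (2, 1, 0), (2, 3, 4), (4, 3, 2)]
print(sorted(overlap_graph(occs)))
```

In this example all four occurrences share vertex 2 or are equivalent, so the overlap graph is complete.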
Christian Borgelt Frequent Pattern Mining 465
Maximum Independent Set Support
Let $G = (V, E)$ be an (undirected) graph with vertex set $V$ and edge set $E \subseteq V \times V - \{(v, v) \mid v \in V\}$.
An independent vertex set of $G$ is a set $I \subseteq V$ with $\forall u, v \in I: (u, v) \notin E$.
$I$ is a maximum independent vertex set iff
it is an independent vertex set and
for all independent vertex sets $J$ of $G$ it is $|I| \geq |J|$.
Notes: Finding a maximum independent vertex set is an NP-complete problem.
However, a greedy algorithm usually gives very good approximations.
Let $O = (V_O, E_O)$ be the overlap graph of the occurrences of a labeled graph $S = (V_S, E_S, \ell_S)$ in a labeled graph $G = (V_G, E_G, \ell_G)$.
The maximum independent set support (or MIS-support for short) of $S$ w.r.t. $G$ is the size of a maximum independent vertex set of $O$.
Christian Borgelt Frequent Pattern Mining 466
Finding a Maximum Independent Set
Unmark all vertices of the overlap graph.
Exact Backtracking Algorithm
Find an unmarked vertex with maximum degree and try two possibilities:
Select it for the MIS, that is, mark it as selected and mark all of its neighbors as excluded.
Exclude it from the MIS, that is, mark it as excluded.
Process the rest recursively and record the best solution found.
Heuristic Greedy Algorithm
Select a vertex with the minimum number of unmarked neighbors and mark all of its neighbors as excluded.
Process the rest of the graph recursively.
In both algorithms vertices with less than two unmarked neighbors can be selected and all of their neighbors marked as excluded.
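Both procedures can be sketched compactly if the set of still unmarked vertices is passed along instead of explicit marks. This is an illustrative sketch, not the implementation referred to in the slides; the names `mis_exact` and `mis_greedy` and the adjacency-dictionary representation are assumptions.

```python
def mis_exact(adj):
    """Size of a maximum independent set (backtracking); adj: vertex -> set of neighbors."""
    def solve(free):
        if not free:
            return 0
        for v in sorted(free):
            nb = adj[v] & free
            if len(nb) < 2:               # fewer than two unmarked neighbors: select v
                return 1 + solve(free - nb - {v})
        v = max(sorted(free), key=lambda x: len(adj[x] & free))
        selected = 1 + solve(free - adj[v] - {v})   # v in the MIS, neighbors excluded
        excluded = solve(free - {v})                # v excluded from the MIS
        return max(selected, excluded)
    return solve(frozenset(adj))

def mis_greedy(adj):
    """Independent set found by repeatedly taking a vertex of minimum remaining degree."""
    free, chosen = set(adj), []
    while free:
        v = min(sorted(free), key=lambda x: len(adj[x] & free))
        chosen.append(v)
        free -= adj[v] | {v}              # exclude v and all of its neighbors
    return chosen

path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(mis_exact(path), mis_greedy(path))
```

The greedy variant gives no optimality guarantee; on the small path above it happens to find a maximum independent set.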
Christian Borgelt Frequent Pattern Mining 467
Anti-Monotonicity of MIS-Support: Preliminaries
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs.
Let $T = (V_T, E_T, \ell_T)$ be a (non-empty) proper subgraph of $S$
(that is, $V_T \subset V_S$, $E_T = (V_T \times V_T) \cap E_S$, and $\ell_T \equiv \ell_S|_{V_T \cup E_T}$).
Let $f$ be an occurrence of $S$ in $G$.
An occurrence $f'$ of the subgraph $T$ is called a T-ancestor of the occurrence $f$ iff $f' \equiv f|_{V_T}$, that is, if $f'$ coincides with $f$ on the vertex set $V_T$ of $T$.
Observations:
For given $G$, $S$, $T$ and $f$ the T-ancestor $f'$ of the occurrence $f$ is uniquely defined.
Let $f_1$ and $f_2$ be two (non-identical, but maybe equivalent) occurrences of $S$ in $G$.
$f_1$ and $f_2$ overlap if there exist overlapping T-ancestors $f'_1$ and $f'_2$ of the occurrences $f_1$ and $f_2$, respectively.
(Note: The inverse implication does not hold generally.)
Christian Borgelt Frequent Pattern Mining 468
Anti-Monotonicity of MIS-Support: Proof
Theorem: MIS-support is anti-monotone.
Proof: We have to show that the MIS-support of a subgraph $S$ w.r.t. a graph $G$ cannot exceed the MIS-support of any (non-empty) proper subgraph $T$ of $S$.
Let $I$ be an arbitrary independent vertex set of the overlap graph $O$ of $S$ w.r.t. $G$.
The set $I$ induces a subset $I'$ of the vertices of the overlap graph $O'$ of an (arbitrary, but fixed) subgraph $T$ of the considered subgraph $S$, which consists of the (uniquely defined) T-ancestors of the vertices in $I$.
It is $|I| = |I'|$, because no two elements of $I$ can have the same T-ancestor.
With a similar argument: $I'$ is an independent vertex set of the overlap graph $O'$.
As a consequence, since $I$ is arbitrary, every independent vertex set of $O$ induces an independent vertex set of $O'$ of the same size.
Hence the maximum independent vertex set of $O'$ must be at least as large as the maximum independent vertex set of $O$.
Christian Borgelt Frequent Pattern Mining 469
Harmful and Harmless Overlaps of Occurrences
Not all overlaps of occurrences are harmful:
input graph: A-B-C-A-B-C-A
subgraph: A-B-C-A
[Figure: the two occurrences of the subgraph in the input graph, which share only the middle vertex A]
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs and let $f_1$ and $f_2$ be two occurrences (subgraph isomorphisms) of $S$ to $G$.
$f_1$ and $f_2$ are called harmfully overlapping, written $f_1 \bullet\!\bullet f_2$, iff [Fiedler and Borgelt 2007]
they are equivalent or
there exists a (non-empty) proper subgraph $T$ of $S$, so that the T-ancestors $f'_1$ and $f'_2$ of $f_1$ and $f_2$, respectively, are equivalent.
Christian Borgelt Frequent Pattern Mining 470
Harmful Overlap Graphs and Subgraph Support
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs and let $V_H$ be the set of all occurrences (subgraph isomorphisms) of $S$ in $G$.
The harmful overlap graph of $S$ w.r.t. $G$ is the graph $H = (V_H, E_H)$, which has the set $V_H$ of occurrences of $S$ in $G$ as its vertex set and the edge set
$E_H = \{(f_1, f_2) \mid f_1, f_2 \in V_H \;\wedge\; f_1 \not\equiv f_2 \;\wedge\; f_1 \bullet\!\bullet f_2\}$.
Let $H = (V_H, E_H)$ be the harmful overlap graph of the occurrences of a labeled graph $S = (V_S, E_S, \ell_S)$ in a labeled graph $G = (V_G, E_G, \ell_G)$.
The harmful overlap support (or HO-support for short) of the graph $S$ w.r.t. $G$ is the size of a maximum independent vertex set of $H$.
Theorem: HO-support is anti-monotone.
Proof: Identical to the proof for MIS-support.
(The same two observations hold, which were all that was needed.)
Christian Borgelt Frequent Pattern Mining 471
Harmful Overlap Graphs and Ancestor Relations
[Figure: an input graph with vertices labeled B, A, B, A, B, the occurrences of subgraphs built from A and B vertices together with their ancestor relations, and the resulting harmful overlap graph]
Christian Borgelt Frequent Pattern Mining 472
Subgraph Support Computation
Checking whether two occurrences overlap is easy, but:
How do we check whether two occurrences overlap harmfully?
Core ideas of the harmful overlap test:
Try to construct a subgraph $S_E = (V_E, E_E, \ell_E)$ that yields equivalent ancestors of two given occurrences $f_1$ and $f_2$ of a graph $S = (V_S, E_S, \ell_S)$.
For such a subgraph $S_E$ the mapping $g: V_E \to V_E$ with $v \mapsto f_2^{-1}(f_1(v))$, where $f_2^{-1}$ is the inverse of $f_2$, must be a bijective mapping.
More generally, $g$ must be an automorphism of $S_E$, that is, a subgraph isomorphism of $S_E$ to itself.
Exploit the properties of automorphisms to exclude vertices from the graph $S$ that cannot be in $V_E$.
Christian Borgelt Frequent Pattern Mining 473


Subgraph Support Computation
Input: Two (different) occurrences $f_1$ and $f_2$ of a labeled graph $S = (V_S, E_S, \ell_S)$ in a labeled graph $G = (V_G, E_G, \ell_G)$.
Output: Whether $f_1$ and $f_2$ overlap harmfully.
1) Form the sets $V_1 = \{v \in V_G \mid \exists u \in V_S: v = f_1(u)\}$ and $V_2 = \{v \in V_G \mid \exists u \in V_S: v = f_2(u)\}$.
2) Form the sets $W_1 = \{v \in V_S \mid f_1(v) \in V_1 \cap V_2\}$ and $W_2 = \{v \in V_S \mid f_2(v) \in V_1 \cap V_2\}$.
3) If $V_E = W_1 \cap W_2 = \emptyset$, return false, otherwise return true.
$V_E$ is the vertex set of a subgraph $S_E$ that induces equivalent ancestors.
Any vertex $v \in V_S - V_E$ cannot contribute to such equivalent ancestors.
Hence $V_E$ is a maximal set of vertices for which $g$ is a bijection.
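The three steps translate almost literally into set operations. In the sketch below an occurrence is a dictionary from the vertices of S to the vertices of G; the function name `harmful_overlap_simple` and the vertex names in the two examples are hypothetical. The examples encode the graphs A-B-C-A-B-C-A (harmless overlap) and B-A-A-B (harmful overlap) that are used on other slides of this part.

```python
def harmful_overlap_simple(f1, f2):
    """Steps 1) to 3): is V_E = W_1 ∩ W_2 non-empty?

    f1, f2: dicts mapping each vertex of S to its image in G.
    """
    common = set(f1.values()) & set(f2.values())   # V_1 ∩ V_2
    w1 = {v for v in f1 if f1[v] in common}        # W_1
    w2 = {v for v in f2 if f2[v] in common}        # W_2
    return bool(w1 & w2)                           # V_E non-empty?

# Subgraph A-B-C-A in the path A-B-C-A-B-C-A (vertices 1..7):
harmless = harmful_overlap_simple(
    {"a1": 1, "b": 2, "c": 3, "a2": 4},
    {"a1": 4, "b": 5, "c": 6, "a2": 7})

# Subgraph A-A-B in the path B-A-A-B (vertices 1..4):
harmful = harmful_overlap_simple(
    {"a1": 2, "a2": 3, "b": 4},
    {"a1": 3, "a2": 2, "b": 1})

print(harmless, harmful)
```

In the first example the shared vertex 4 is the image of different subgraph vertices under the two occurrences, so the intersection of W_1 and W_2 is empty.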
Christian Borgelt Frequent Pattern Mining 474
Restriction to Connected Subgraphs
The search for frequent subgraphs is usually restricted to connected graphs.
We cannot conclude that no edge is needed if the subgraph $S_E$ is not connected:
there may be a connected subgraph of $S_E$ that induces equivalent ancestors of the occurrences $f_1$ and $f_2$.
Hence we have to consider subgraphs of $S_E$ in this case.
However, checking all possible subgraphs is prohibitively costly.
Computing the edge set $E_E$ of the subgraph $S_E$:
1) Let $E_1 = \{(v_1, v_2) \in E_G \mid \exists (u_1, u_2) \in E_S: (v_1, v_2) = (f_1(u_1), f_1(u_2))\}$
and $E_2 = \{(v_1, v_2) \in E_G \mid \exists (u_1, u_2) \in E_S: (v_1, v_2) = (f_2(u_1), f_2(u_2))\}$.
2) Let $F_1 = \{(v_1, v_2) \in E_S \mid (f_1(v_1), f_1(v_2)) \in E_1 \cap E_2\}$
and $F_2 = \{(v_1, v_2) \in E_S \mid (f_2(v_1), f_2(v_2)) \in E_1 \cap E_2\}$.
3) Let $E_E = F_1 \cap F_2$.
Christian Borgelt Frequent Pattern Mining 475


Restriction to Connected Subgraphs
Lemma: Let $S_C = (V_C, E_C, \ell_C)$ be an (arbitrary, but fixed) connected component of the subgraph $S_E$ and let $W = \{v \in V_C \mid g(v) \in V_C\}$
(reminder: $\forall v \in V_E: g(v) = f_2^{-1}(f_1(v))$, $g$ is an automorphism of $S_E$).
Then it is either $W = \emptyset$ or $W = V_C$.
Proof: (by contradiction)
Suppose that there is a connected component $S_C$ with $W \neq \emptyset$ and $W \neq V_C$.
Choose two vertices $v_1 \in W$ and $v_2 \in V_C - W$.
$v_1$ and $v_2$ are connected by a path in $S_C$, since $S_C$ is a connected component.
On this path there must be an edge $(v_a, v_b)$ with $v_a \in W$ and $v_b \in V_C - W$.
It is $(v_a, v_b) \in E_E$ and therefore $(g(v_a), g(v_b)) \in E_E$ ($g$ is an automorphism).
Since $g(v_a) \in V_C$, it follows $g(v_b) \in V_C$.
However, this implies $v_b \in W$, contradicting $v_b \in V_C - W$.
Christian Borgelt Frequent Pattern Mining 476
Further Optimization
The test can be further optimized by the following simple insight:
Two occurrences $f_1$ and $f_2$ overlap harmfully if $\exists v \in V_S: f_1(v) = f_2(v)$, because then such a vertex $v$ alone gives rise to equivalent ancestors.
This test can be performed very quickly, so it should be the first step.
Additional advantage: connected components consisting of isolated vertices can be neglected afterwards.
A simple example of harmful overlap without identical images:
[Figure: input graph B A A B, subgraph A A B, and the two occurrences of the subgraph in the input graph]
Note that the subgraph inducing equivalent ancestors can be arbitrarily complex even if $\forall v \in V_S: f_1(v) \neq f_2(v)$.
Final Procedure for Harmful Overlap Test
Input: Two (different) occurrences $f_1$ and $f_2$ of a labeled graph $S = (V_S, E_S, \ell_S)$ in a labeled graph $G = (V_G, E_G, \ell_G)$.
Output: Whether $f_1$ and $f_2$ overlap harmfully.
1) If $\exists v \in V_S: f_1(v) = f_2(v)$, return true.
2) Form the edge set $E_E$ of the subgraph $S_E$ (as described above) and form the (reduced) vertex set $V'_E = \{v \in V_S \mid \exists u \in V_S: (v, u) \in E_E\}$.
(Note that $V'_E$ does not contain isolated vertices.)
3) Let $S_C^i = (V_C^i, E_C^i)$, $1 \le i \le n$, be the connected components of $S'_E = (V'_E, E_E)$.
If $\exists i, 1 \le i \le n: \exists v \in V_C^i: g(v) = f_2^{-1}(f_1(v)) \in V_C^i$, return true, otherwise return false.
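The three steps of the procedure can be sketched in Python as follows. This is only an illustration under stated assumptions, not the author's implementation: the function name `overlaps_harmfully`, the representation of the edges of $S$ as vertex pairs, and the representation of occurrences as dictionaries from vertices of $S$ to vertices of $G$ are all hypothetical choices. Edges are treated as undirected, and both dictionaries are assumed to be valid (injective) occurrences, so the graph $G$ itself is not needed.

```python
def overlaps_harmfully(edges_s, f1, f2):
    """Sketch of the harmful overlap test for two occurrences f1, f2.

    edges_s: iterable of (u, v) pairs, the (undirected) edges of S
    f1, f2:  dicts mapping each vertex of S to its image in G
    """
    # Step 1: a vertex with identical images makes the overlap harmful.
    if any(f1[v] == f2[v] for v in f1):
        return True

    def image(f, edge):
        u, v = edge
        return frozenset((f[u], f[v]))

    # Step 2: edge set E_E = F_1 intersected with F_2.
    shared = {image(f1, e) for e in edges_s} & {image(f2, e) for e in edges_s}
    e_e = [e for e in edges_s
           if image(f1, e) in shared and image(f2, e) in shared]

    # Adjacency lists of the reduced subgraph (no isolated vertices).
    adj = {}
    for u, v in e_e:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Step 3: look for a connected component that g maps into itself,
    # where g(v) = f2^{-1}(f1(v)).
    inv2 = {g_vertex: s_vertex for s_vertex, g_vertex in f2.items()}
    seen = set()
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adj[v] - component)
        seen |= component
        if any(inv2.get(f1[v]) in component for v in component):
            return True
    return False


# Subgraph A A B (vertices x, y, z) in the input graph B A A B (vertices 1..4).
edges_s = [("x", "y"), ("y", "z")]
harmful = overlaps_harmfully(edges_s,
                             {"x": 3, "y": 2, "z": 1},
                             {"x": 2, "y": 3, "z": 4})
```

In the usage example no vertex has identical images, but the edge between the two A vertices is shared and $g$ swaps its end points, so the component is mapped to itself and the overlap is reported as harmful.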
Alternative: Minimum Number of Vertex Images
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs and let $F$ be the set of all subgraph isomorphisms of $S$ to $G$.
Then the minimum number of vertex images support (or MNI-support for short) of $S$ w.r.t. $G$ is defined as
$\min_{v \in V_S} |\{u \in V_G \mid \exists f \in F: f(v) = u\}|$.
[Bringmann and Nijssen 200]
Advantage:
Can be computed much more efficiently than MIS- or HO-support.
(No need to determine a maximum independent vertex set.)
Disadvantage:
Often counts both of two equivalent occurrences.
(Fairly unintuitive behavior.)
Example: B A A B
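The definition can be made concrete with a small brute-force sketch in Python. It is meant for tiny graphs only and rests on assumptions that are not part of the slides: the names `occurrences` and `mni_support` are hypothetical, graphs are given as a label dictionary plus a list of undirected edges, and a subgraph isomorphism is taken to be an injective, label-preserving mapping that preserves the edges of $S$ (the subgraph need not be induced).

```python
from itertools import permutations


def occurrences(s_labels, s_edges, g_labels, g_edges):
    """Enumerate all label- and edge-preserving injective mappings S -> G."""
    g_edge_set = {frozenset(e) for e in g_edges}
    s_vertices = list(s_labels)
    for images in permutations(g_labels, len(s_vertices)):
        f = dict(zip(s_vertices, images))
        labels_ok = all(s_labels[v] == g_labels[f[v]] for v in s_vertices)
        edges_ok = all(frozenset((f[u], f[v])) in g_edge_set
                       for u, v in s_edges)
        if labels_ok and edges_ok:
            yield f


def mni_support(s_labels, s_edges, g_labels, g_edges):
    """Minimum, over the vertices of S, of the number of distinct images."""
    occs = list(occurrences(s_labels, s_edges, g_labels, g_edges))
    return min(len({f[v] for f in occs}) for v in s_labels)


# Input graph B A A B (a path), subgraph A A B (a path).
g_labels = {1: "B", 2: "A", 3: "A", 4: "B"}
g_edges = [(1, 2), (2, 3), (3, 4)]
s_labels = {"x": "A", "y": "A", "z": "B"}
s_edges = [("x", "y"), ("y", "z")]
support = mni_support(s_labels, s_edges, g_labels, g_edges)
```

Here every vertex of the subgraph has two distinct images, so the MNI-support is 2, although the two occurrences overlap harmfully. This is the behavior named as a disadvantage above.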
Experimental Results
[Figure: two plots of the number of subgraphs for MNI-support, HO-support, and MIS-support; panels: Index Chemicus 1993 and Tic-Tac-Toe win]
Summary
Defining subgraph support in the single graph setting:
maximum independent vertex set of an overlap graph of the occurrences.
MIS-support is anti-monotone.
Proof: look at induced independent vertex sets for substructures.
Definition of harmful overlap support of a subgraph:
existence of equivalent ancestor occurrences.
Simple procedure for testing whether two occurrences overlap harmfully.
Harmful overlap support is anti-monotone.
Restriction to connected substructures and optimizations.
Alternative: minimum number of vertex images.
Software: http://www.borgelt.net/moss.html
Frequent Sequence Mining
Frequent Sequence Mining
Directed versus undirected sequences
Temporal sequences, for example, are always directed.
DNA sequences can be undirected (both directions can be relevant).
Multiple sequences versus a single sequence
Multiple sequences: purchases with rebate cards, web server access protocols.
Single sequence: alarms in telecommunication networks.
(Time) points versus time intervals
Points: DNA sequences, alarms in telecommunication networks.
Intervals: weather data, movement analysis (sports medicine).
Further distinction: one object per (time) point versus multiple objects.
Frequent Sequence Mining
Consecutive subsequences versus subsequences with gaps
a c b a b c b a always counts as a subsequence abc.
a c b a b c b c may not always count as a subsequence abc.
Existence of an occurrence versus counting occurrences
Combinatorial counting (all occurrences)
Maximal number of disjoint occurrences
Temporal support (number of time window positions)
Minimum occurrence (smallest interval)
Relation between the objects in a sequence
items: only precede and succeed
labeled time points: $t_1 < t_2$, $t_1 = t_2$, and $t_1 > t_2$
labeled time intervals: relations like before, starts, overlaps, contains etc.
Frequent Sequence Mining
Directed sequences are easier to handle:
The (sub)sequence itself can be used as a code word.
As there is only one possible code word per sequence (only one direction), this code word is necessarily canonical.
Consecutive subsequences are easier to handle:
There are fewer occurrences of a given subsequence.
For each occurrence there is exactly one possible extension.
This allows for specialized data structures (similar to an FP-tree).
Item sequences are easiest to handle:
There are only two possible relations and thus patterns are simple.
Other sequences are handled with state machines for containment tests.
A Canonical Form for Undirected Sequences
If the sequences to mine are not directed, a subsequence cannot be used as its own code word, because it does not have the prefix property.
The reason is that an undirected sequence can be read forward or backward, which gives rise to two possible code words, the smaller (or the larger) of which may then be defined as the canonical code word.
Examples (that the prefix property is violated):
Assume that the item order is a < b < c . . . and that the lexicographically smaller code word is the canonical one.
The sequence bab, which is canonical, has the prefix ba, but the canonical form of the sequence ba is rather ab.
The sequence cabd, which is canonical, has the prefix cab, but the canonical form of the sequence cab is rather bac.
As a consequence, we have to look for a different way of forming code words (at least if we want the code to have the prefix property).
A Canonical Form for Undirected Sequences
A (simple) possibility to form canonical code words having the prefix property is to handle (sub)sequences of even and odd length separately.
In addition, forming the code word is started in the middle.
Even length: The sequence $a_m a_{m-1} \ldots a_2 a_1 b_1 b_2 \ldots b_{m-1} b_m$
is described by the code word $a_1 b_1 a_2 b_2 \ldots a_{m-1} b_{m-1} a_m b_m$
or by the code word $b_1 a_1 b_2 a_2 \ldots b_{m-1} a_{m-1} b_m a_m$.
Odd length: The sequence $a_m a_{m-1} \ldots a_2 a_1 a_0 b_1 b_2 \ldots b_{m-1} b_m$
is described by the code word $a_0 a_1 b_1 a_2 b_2 \ldots a_{m-1} b_{m-1} a_m b_m$
or by the code word $a_0 b_1 a_1 b_2 a_2 \ldots b_{m-1} a_{m-1} b_m a_m$.
The lexicographically smaller of the two code words is the canonical code word.
Such sequences are extended by adding a pair $a_{m+1} b_{m+1}$ or $b_{m+1} a_{m+1}$, that is, by adding one item at the front and one item at the end.
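The construction of the two code words from the middle outwards can be sketched in Python as follows. The function names `code_words` and `canonical_code_word` are hypothetical, and sequences are assumed to be strings of single-character items whose order is the usual character order.

```python
def code_words(seq):
    """Return the two code words of an undirected sequence (a string)."""
    mid = len(seq) // 2
    left = seq[:mid][::-1]              # a_1, a_2, ..., a_m (read outwards)
    if len(seq) % 2 == 0:
        head, right = "", seq[mid:]     # even length: no middle item
    else:
        head, right = seq[mid], seq[mid + 1:]   # odd length: a_0 first
    forward = head + "".join(a + b for a, b in zip(left, right))
    backward = head + "".join(b + a for a, b in zip(left, right))
    return forward, backward


def canonical_code_word(seq):
    """The lexicographically smaller of the two code words."""
    return min(code_words(seq))


code_cabd = canonical_code_word("cabd")   # pairs (a, b), (c, d)
code_dbac = canonical_code_word("dbac")   # the same sequence read backward
code_bab = canonical_code_word("bab")     # odd length, middle item a
```

Both reading directions of a sequence yield the same canonical code word, and the canonical code word of cabd starts with the canonical code word of its middle part ab, which illustrates the prefix property.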
A Canonical Form for Undirected Sequences
The code words defined in this way clearly have the prefix property:
Suppose the prefix property would not hold.
Then there exists, without loss of generality, a canonical code word
$w_m = a_1 b_1 a_2 b_2 \ldots a_{m-1} b_{m-1} a_m b_m$,
the prefix $w_{m-1}$ of which is not canonical, where
$w_{m-1} = a_1 b_1 a_2 b_2 \ldots a_{m-1} b_{m-1}$.
As a consequence, we have $w_m < v_m$, where
$v_m = b_1 a_1 b_2 a_2 \ldots b_{m-1} a_{m-1} b_m a_m$,
and $v_{m-1} < w_{m-1}$, where
$v_{m-1} = b_1 a_1 b_2 a_2 \ldots b_{m-1} a_{m-1}$.
However, $v_{m-1} < w_{m-1}$ implies $v_m < w_m$,
because $v_{m-1}$ is a prefix of $v_m$ and $w_{m-1}$ is a prefix of $w_m$,
but $v_m < w_m$ contradicts $w_m < v_m$.


A Canonical Form for Undirected Sequences
Generating and comparing the two possible code words takes linear time.
However, this can be improved by maintaining an additional piece of information.
For each sequence a symmetry flag is computed:
$s_m = \bigwedge_{i=1}^{m} (a_i = b_i)$
The symmetry flag can be maintained in constant time with
$s_{m+1} = s_m \wedge (a_{m+1} = b_{m+1})$.
The permissible extensions depend on the symmetry flag:
if $s_m = \text{true}$, it must be $a_{m+1} \le b_{m+1}$;
if $s_m = \text{false}$, any relation between $a_{m+1}$ and $b_{m+1}$ is acceptable.
This rule guarantees that exactly the canonical extensions are created.
Applying this rule to check a candidate extension takes constant time.
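A minimal Python sketch of this extension rule is given below. The function name `extend` and the representation of a pattern as a pair of code word (a string) and symmetry flag are assumptions made for illustration. A pattern of even length starts from the empty code word, a pattern of odd length from its middle item; in both cases the initial flag is true, because the conjunction over zero pairs is true.

```python
def extend(code, symmetric, a, b):
    """Try to extend a canonical code word by the pair (a, b).

    code:      canonical code word built so far (a string)
    symmetric: symmetry flag s_m of the pattern
    a, b:      items added at the front and at the end of the sequence
    Returns (new_code, new_flag), or None if the result is not canonical.
    """
    if symmetric and a > b:
        return None                     # only a <= b is permissible
    return code + a + b, symmetric and a == b


rejected = extend("", True, "b", "a")       # mirror image of the pair (a, b)
first = extend("", True, "a", "b")          # sequence ab
second = extend("ab", False, "d", "c")      # flag is false: no restriction
odd = extend("a", True, "c", "c")           # sequence cac stays symmetric
```

The check uses one comparison and one flag update, which is the constant-time behavior stated above.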
Sequences of Time Intervals
A (labeled or attributed) time interval is a triple $I = (s, e, l)$, where $s$ is the start time, $e$ is the end time and $l$ is the associated label.
A time interval sequence is a set of (labeled) time intervals, of which we assume that they are maximal in the sense that for two intervals $I_1 = (s_1, e_1, l_1)$ and $I_2 = (s_2, e_2, l_2)$ with $l_1 = l_2$ we have either $e_1 < s_2$ or $e_2 < s_1$.
Otherwise they are merged into one interval $I = (\min\{s_1, s_2\}, \max\{e_1, e_2\}, l_1)$.
A time interval sequence database is a vector of time interval sequences.
Time intervals can easily be ordered as follows:
Let $I_1 = (s_1, e_1, l_1)$ and $I_2 = (s_2, e_2, l_2)$ be two time intervals. It is $I_1 \prec I_2$ iff
$s_1 < s_2$ or
$s_1 = s_2$ and $e_1 < e_2$ or
$s_1 = s_2$ and $e_1 = e_2$ and $l_1 < l_2$.
Due to the assumption made above, at least the third option must hold.
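The maximality assumption and the order on intervals can be sketched in Python. The function name `normalize` is hypothetical, and intervals are assumed to be tuples `(s, e, l)` with comparable labels. With this representation the order defined above coincides with Python's built-in tuple comparison.

```python
def normalize(intervals):
    """Merge same-label intervals that are not separated, then sort them.

    intervals: iterable of (start, end, label) tuples
    Returns a list of maximal intervals in increasing (s, e, l) order.
    """
    merged = []
    last_by_label = {}
    # Process the intervals of each label by increasing start time.
    for s, e, l in sorted(intervals, key=lambda i: (i[2], i[0], i[1])):
        last = last_by_label.get(l)
        if last is not None and s <= last[1]:
            # Neither e_1 < s_2 nor e_2 < s_1 holds: merge the two intervals.
            last[1] = max(last[1], e)
        else:
            last = [s, e, l]
            last_by_label[l] = last
            merged.append(last)
    # Tuple comparison realizes the order: start, then end, then label.
    return sorted(tuple(i) for i in merged)


sequence = normalize([(5, 8, "A"), (1, 3, "A"), (3, 4, "A"), (2, 6, "B")])
```

In the usage example the intervals (1, 3, A) and (3, 4, A) touch and are therefore merged into (1, 4, A), while (5, 8, A) stays separate.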
Allen's Interval Relations
Due to their temporal extension, time intervals allow for different relations.
A commonly used set of relations between time intervals is
Allen's interval relations [Allen 1983]:
A before B              B after A
A meets B               B is met by A
A overlaps B            B is overlapped by A
A is finished by B      B finishes A
A contains B            B during A
A is started by B       B starts A
A equals B              B equals A
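The thirteen relations can be determined from the start and end times alone. The following Python sketch assumes that an interval is a pair `(start, end)` with `start < end`; the names `allen_relation` and `INVERSE` are hypothetical.

```python
INVERSE = {
    "before": "after",
    "meets": "is met by",
    "overlaps": "is overlapped by",
    "is finished by": "finishes",
    "contains": "during",
}


def allen_relation(a, b):
    """Return the name of Allen's relation that holds between a and b."""
    (sa, ea), (sb, eb) = a, b
    if (sa, ea) == (sb, eb):
        return "equals"
    if ea < sb:
        return "before"
    if ea == sb:
        return "meets"
    if sa < sb:
        if ea < eb:
            return "overlaps"
        if ea == eb:
            return "is finished by"
        return "contains"
    if sa == sb:
        return "is started by" if eb < ea else "starts"
    # a starts after b: take the inverse of the relation of b to a.
    return INVERSE[allen_relation(b, a)]


A, B, C = (1, 4), (3, 6), (6, 8)
relations = {
    ("A", "B"): allen_relation(A, B),
    ("A", "C"): allen_relation(A, C),
    ("B", "C"): allen_relation(B, C),
    ("B", "A"): allen_relation(B, A),
}
```

For the three example intervals, A overlaps B, A is before C, and B meets C; swapping the arguments yields the inverse relation.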
Temporal Interval Patterns
A temporal pattern must specify the relations between all referenced intervals.
This can conveniently be done with a matrix:
[Figure: three time intervals A, B, and C]
     A    B    C
A    e    o    b
B    io   e    m
C    a    im   e
(e: equals, o: overlaps, b: before, m: meets, io: is overlapped by, im: is met by, a: after)
Such a temporal pattern matrix can also be interpreted as an adjacency matrix of a graph, which has the interval relationships as edge labels.
Generally, the input interval sequences may be represented as such graphs, thus mapping the problem to frequent (sub)graph mining.
However, the relationships between time intervals are constrained
(for example, "B after A" and "C after B" imply "C after A").
These constraints can be exploited to obtain a simpler canonical form.
In the canonical form, the intervals are assigned in increasing time order to the rows and columns of the temporal pattern matrix. [Kempe 2008]
Support of Temporal Patterns
The support of a temporal pattern w.r.t. a single sequence can be defined by:
Combinatorial counting (all occurrences)
Maximal number of disjoint occurrences
Temporal support (number of time window positions)
Minimum occurrence (smallest interval)
However, all of these definitions suffer from the fact that such support is not anti-monotone or downward closed:
[Figure: one interval A that contains two intervals B]
The support of "A contains B" is 2, but the support of "A" is only 1.
Nevertheless an exhaustive pattern search can be ensured, without having to abandon pruning with the Apriori property.
The reason is that with minimum occurrence counting the relationship "contains" is the only one that can lead to support anomalies like the one shown above.
Weakly Anti-Monotone / Downward Closed
Let $T$ be a pattern space with a subpattern relationship $<$ and let $s$ be a function from $T$ to the real numbers, $s: T \to \mathbb{R}$.
For a pattern $S \in T$ let $P(S) = \{R \mid R < S \wedge \neg\exists Q: R < Q < S\}$ be the set of all parent patterns of $S$.
The function $s$ on the pattern space $T$ is called
strongly anti-monotone or strongly downward closed iff
$\forall S \in T: \forall R \in P(S): s(R) \ge s(S)$,
weakly anti-monotone or weakly downward closed iff
$\forall S \in T: \exists R \in P(S): s(R) \ge s(S)$.
The support of temporal interval patterns is weakly anti-monotone (at least) if it is computed from minimal occurrences.
If temporal interval patterns are extended backwards in time, the Apriori property can safely be used for pruning. [Kempe 2008]
Summary Frequent Sequence Mining
Several different types of frequent sequence mining can be distinguished:
single and multiple sequences, directed and undirected sequences
items versus (labeled) intervals, single and multiple objects per position
relations between the objects, definition of pattern support
All common types of frequent sequence mining possess canonical forms for which canonical extension rules can be found.
With these rules it is possible to check in constant time whether a possible extension leads to a result in canonical form.
A weakly anti-monotone support function can be enough to allow pruning with the Apriori property.
However, in this case it must be made sure that the canonical form assigns an appropriate parent pattern in order to ensure an exhaustive search.
Frequent Tree Mining
Frequent Tree Mining: Basic Notions
Reminder: A path is a sequence of edges connecting two vertices in a graph.
Reminder: A (labeled) graph $G$ is called a tree iff for any pair of vertices in $G$ there exists exactly one path connecting them in $G$.
A tree is called rooted if it has a distinguished vertex, called the root.
Rooted trees are often seen as directed: all edges are directed away from the root.
If a tree is not rooted (that is, if there is no distinguished vertex), it is called free.
A tree is called ordered if for each vertex there exists an order on its incident edges.
If the tree is rooted, the order may be defined on the outgoing edges only.
Trees of whichever type are much easier to handle than frequent (sub)graphs, because it is mainly the cycles (which may be present in a general graph) that make it difficult to construct the canonical code word.
Frequent Tree Mining: Basic Notions
Reminder: A path is a sequence of edges connecting two vertices in a graph.
The length of a path is the number of its edges.
The distance between two vertices of a graph $G$ is the length of a shortest path connecting them.
Note that in a tree there is exactly one path connecting two vertices, which is then necessarily also the shortest path.
In a rooted tree the depth of a vertex is its distance from the root vertex.
The root vertex itself has depth 0.
The depth of a tree is the depth of its deepest vertex.
The diameter of a graph is the largest distance between any two vertices.
A diameter path of a graph is a path having a length that is the diameter of the graph.
Rooted Ordered Trees
For rooted ordered trees code words derived from spanning trees can directly be used: the spanning tree is simply the tree itself.
However, the root of the spanning tree is fixed: it is simply the root of the rooted ordered tree.
In addition, the order of the children of each vertex is fixed: it is simply the given order of the outgoing edges.
As a consequence, once a traversal order for the spanning tree is fixed (for example, a depth-first or a breadth-first traversal), there is only one possible code word, which is necessarily the canonical code word.
Therefore rightmost path extension (for a depth-first traversal) and maximum source extension (for a breadth-first traversal) obviously provide a canonical extension rule for rooted ordered trees.
There is no need for an explicit test for canonical form.
Rooted Unordered Trees
Rooted unordered trees can most conveniently be described by so-called preorder code words.
Preorder code words are closely related to spanning trees that are constructed with a depth-first search, because a preorder traversal is a depth-first traversal.
However, their special form makes it easier to compare code words for subtrees.
The preorder code words we consider here have the general form
$a \, (d \; b \; a)^m$,
where $m$ is the number of edges of the tree, $m = n - 1$,
$n$ is the number of vertices of the tree,
$a$ is a vertex attribute / label,
$b$ is an edge attribute / label, and
$d$ is the depth of the source vertex of an edge.
The source vertex of an edge is the vertex that is closer to the root (smaller depth).
The edges are listed in the order in which they are visited in a preorder traversal.
Rooted Unordered Trees
[Figure: example rooted unordered tree with vertex labels a, b, c, and d]
For simplicity we omit edge labels.
In rooted trees edge labels can always be combined with the destination vertex label (that is, the label of the vertex that is farther away from the root).
The above rooted unordered tree can be described by the code word
a 0b 1d 1b 2b 2c 1a 0b 1a 1b
Note that the code word consists of substrings that describe the subtrees.
The subtree strings are separated by a number stating the depth of the parent.
Rooted Unordered Trees
Exchanging code words on the same level exchanges branches/subtrees:
a 0b 1d 1b 2b 2c 1a 0b 1a 1b
For example, in this code word the children of the root are exchanged:
a 0b 1a 1b 0b 1d 1b 2b 2c 1a
[Figure: the example tree before and after exchanging the two subtrees of the root]
Rooted Unordered Trees
All possible preorder code words can be obtained from one preorder code word by exchanging substrings of the code word that describe sibling subtrees.
(This shows the advantage of using the vertex depth rather than the vertex index: no renumbering of the vertices is necessary in such an exchange.)
By defining an (arbitrary, but fixed) order on the vertex labels and using the standard order of the integer numbers, the code words can be compared lexicographically.
(Note that vertex labels are always compared to vertex labels and integers to integers, because these two elements alternate.)
Contrary to the common definition used in all earlier cases, we define the lexicographically greatest code word as the canonical code word.
The canonical code word for the tree on the previous slides is
a 0b 1d 1b 2c 2b 1a 0b 1b 1a
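The canonical (lexicographically greatest) preorder code word can be computed by recursively sorting sibling subtrees in descending order of their code words. The Python sketch below assumes a tree representation as nested pairs `(label, children)` and omits edge labels, as on the slides; the names `canonical_code` and `format_code` are hypothetical.

```python
def canonical_code(tree, depth=0):
    """Return the canonical preorder code word of a rooted unordered tree.

    tree:  pair (label, list of child trees)
    depth: depth of the root of this (sub)tree
    The result is a list that alternates labels and depths.
    """
    label, children = tree
    # Sibling subtrees are sorted descendingly w.r.t. their code words.
    child_codes = sorted((canonical_code(child, depth + 1)
                          for child in children), reverse=True)
    code = [label]
    for child_code in child_codes:
        code.append(depth)          # depth of the source vertex of the edge
        code.extend(child_code)
    return code


def format_code(code):
    """Render a code word in the notation 'a 0b 1d ...'."""
    pairs = zip(code[1::2], code[2::2])
    return code[0] + "".join(f" {d}{a}" for d, a in pairs)


# The example tree, with children listed in a non-canonical order.
example = ("a", [
    ("b", [("d", []), ("b", [("b", []), ("c", [])]), ("a", [])]),
    ("b", [("a", []), ("b", [])]),
])
canonical = format_code(canonical_code(example))
```

Because labels and depths alternate at the same positions in the code words of sibling subtrees, the list comparison only ever compares labels with labels and integers with integers.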
Rooted Unordered Trees
In order to understand the core problem of obtaining an extension rule for rooted unordered trees, consider the following tree:
[Figure: example tree with one vertex marked grey]
The canonical code word for this tree results from the shown order of the subtrees:
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d
Any exchange of subtrees leads to a lexicographically smaller code word.
How can this tree be extended by adding a child to the grey vertex?
That is, what label may the child vertex have if the result is to be canonical?
Rooted Unordered Trees
[Figure: example tree with one vertex marked grey]
In the first place, we observe that the child must not have a label succeeding "d", because otherwise exchanging the new vertex with the other child of the grey vertex would yield a lexicographically larger code word:
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2e
<
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2e 2d
Generally, the children of a vertex must be sorted descendingly w.r.t. their labels.
Rooted Unordered Trees
[Figure: example tree with one vertex marked grey]
Secondly, we observe that the child must not have a label succeeding "c", because otherwise exchanging the subtrees of the parent of the grey vertex would yield a lexicographically larger code word:
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2d
<
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2d 1c 2d 2c
The subtrees of any vertex must be sorted descendingly w.r.t. their code words.
Rooted Unordered Trees
[Figure: example tree with one vertex marked grey]
Thirdly, we observe that the child must not have a label succeeding "b", because otherwise exchanging the subtrees of the root vertex of the tree would yield a lexicographically larger code word:
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2c
<
a 0b 1c 2d 2c 1c 2d 2c 0b 1c 2d 2c 1c 2d 2b
The subtrees of any vertex must be sorted descendingly w.r.t. their code words.
Rooted Unordered Trees
That a possible exchange of subtrees at vertices closer to the root never yields looser restrictions is no accident.
Suppose a rooted tree is described by a canonical code word
$a \; 0 \; b \; 1 \; w_1 \; 1 \; w_2 \; 0 \; b \; 1 \; w_3 \; 1 \; w_4$.
Then we know the following relationships between subtree code words:
$w_1 \ge w_2$ and $w_3 \ge w_4$, because otherwise an exchange of subtrees at the nodes labeled with b would lead to a lexicographically larger code word.
$w_1 \ge w_3$, because otherwise an exchange of subtrees at the node labeled a would lead to a lexicographically larger code word.
Only if $w_1 = w_3$, the code words $w_1$ and $w_3$ do not already determine the order of the subtrees of the vertex labeled with a. In this case we have $w_2 \ge w_4$.
However, then we also have $w_3 = w_1 \ge w_2$, showing that $w_2$ provides no looser restriction of $w_4$ than $w_3$.


Rooted Unordered Trees
As a consequence, we obtain the following simple extension rule:
Let $w$ be the canonical code word of the rooted tree to extend and let $d$ be the depth of the rooted tree (that is, the depth of the deepest vertex).
In addition, let the considered extension be $xa$ with $x \in \mathbb{N}_0$ and $a$ a vertex label.
Let $y$ be the smallest integer for which $w$ has a suffix of the form $y \, w_1 \, w_2 \, y \, w_1$ with $y \in \mathbb{N}_0$ and $w_1$ and $w_2$ strings not containing any $y' \le y$ ($w_2$ may be empty).
If $w$ does not possess such a suffix, let $y = d$ (depth of the tree).
If $x > y$, the extension is canonical if and only if $xa \le w_2$.
If $x \le y$, check whether $w$ has a suffix $x \, w_3$, where $w_3$ is a string not containing any integer $x' \le x$.
If $w$ has such a suffix, the extended code word is canonical if and only if $a \le w_3$.
If $w$ does not have such a suffix, the extended code word is always canonical.
With this extension rule no subsequent canonical form test is needed.
Rooted Unordered Trees
The discussed extension rule is very efficient:
Comparing the elements of the extension takes constant time (at most one integer and one label need to be compared).
Knowledge of the strings $w_3$ for all possible values of $x$ ($0 \le x < d$) can be maintained in constant time:
It suffices to record the starting points of the substrings that describe the rightmost subtree on each tree level.
At most one of these starting points can change with an extension.
Knowledge of the value of $y$ and the two starting points of the string $w_1$ in $w$ can be maintained in constant time:
As long as no two sibling vertices carry the same label, it is $y = d$.
If a sibling with the same label is added, $y$ is set to the depth of the parent.
$w_1 = a$ occurs at the position of the $w_3$ for $y$ and at the extension vertex label.
If a future extension differs from $w_2$, it is $y = d$, otherwise $w_1$ is extended.
Free Trees
Free trees can be handled by combining the ideas of how to handle sequences and rooted unordered trees.
Similar to sequences, free trees of even and odd diameter are treated separately.
General ideas for a canonical form for free trees:
Even Diameter:
The vertex in the middle of a diameter path is uniquely determined.
This vertex can be used as the root of a rooted tree.
Odd Diameter:
The edge in the middle of a diameter path is uniquely determined.
Removing this edge splits the free tree into two rooted trees.
Procedure for growing free trees:
First grow a diameter path using the canonical form for sequences.
Extend the diameter path into a tree by adding branches.
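The middle vertex or middle edge of a diameter path can be found with two breadth-first searches. The Python sketch below only illustrates this idea and is not the mining procedure itself; the function names `farthest`, `diameter_path`, and `center`, and the adjacency-list representation of the free tree as a dictionary, are assumptions.

```python
from collections import deque


def farthest(adj, start):
    """Breadth-first search; return a farthest vertex and the parent map."""
    parent = {start: None}
    queue = deque([start])
    last = start
    while queue:
        last = queue.popleft()      # the last vertex dequeued is farthest
        for neighbor in adj[last]:
            if neighbor not in parent:
                parent[neighbor] = last
                queue.append(neighbor)
    return last, parent


def diameter_path(adj):
    """Return the vertices of one diameter path of a free tree."""
    end1, _ = farthest(adj, next(iter(adj)))
    end2, parent = farthest(adj, end1)
    path = [end2]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def center(adj):
    """Middle vertex (even diameter) or middle edge (odd diameter)."""
    path = diameter_path(adj)
    diameter = len(path) - 1
    mid = diameter // 2
    if diameter % 2 == 0:
        return (path[mid],)
    return (path[mid], path[mid + 1])


even_tree = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
odd_tree = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
root_vertex = center(even_tree)
middle_edge = center(odd_tree)
```

In the first example the diameter is 4 and vertex 3 serves as the root; in the second the diameter is 3 and removing the edge between vertices 2 and 3 splits the free tree into two rooted trees.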
Free Trees
Main problem of the procedure for growing free trees:
The initially grown diameter path must remain identifiable.
(Otherwise the prefix property cannot be guaranteed.)
In order to solve this problem it is exploited that in the canonical code word for a rooted unordered tree code words describing paths from the root to a leaf vertex are lexicographically increasing if the paths are listed from left to right.
Even Diameter:
The original diameter path represents two paths from the root to two leaves.
To keep them identifiable, these paths must be the lexicographically smallest and the lexicographically largest path leading to this depth.
Odd Diameter:
The original diameter path represents one path from the root to a leaf in each of the two rooted trees the free tree is split into.
These paths must be the lexicographically smallest paths leading to this depth.
Summary Frequent Tree Mining
Rooted ordered trees
The root is fixed and the order of the children of each vertex is fixed.
Both rightmost path extension and maximum source extension obviously provide a canonical extension rule for rooted ordered trees.
Rooted unordered trees
The root is fixed, but there is no order of the children.
There exists a canonical extension rule based on sorted preorder strings (constant time for finding allowed extensions). [Luccio et al. 2001, 2004]
Free trees
No node is fixed as the root, there is no order on adjacent vertices.
There exists a canonical extension rule based on depth sequences (constant time for finding allowed extensions). [Nijssen and Kok 2004]
Summary Frequent Pattern Mining
Summary Frequent Pattern Mining
Possible types of patterns: item sets, sequences, trees, and graphs.
A core ingredient of the search is a canonical form of the type of pattern.
Purpose: ensure that each possible pattern is processed at most once.
(Discard non-canonical code words, process only canonical ones.)
It is desirable that the canonical form possesses the prefix property.
Except for general graphs there exist canonical extension rules.
For general graphs, restricted extensions make it possible to reduce the number of actual canonical form tests considerably.
Frequent pattern mining algorithms prune with the Apriori property:
$\forall P: \forall S \supset P: s_T(P) < s_{\min} \rightarrow s_T(S) < s_{\min}$.
That is: No super-pattern of an infrequent pattern is frequent.
Additional filtering is important to single out the relevant patterns.
Software
Software for frequent pattern mining can be found at
my web site: http://www.borgelt.net/fpm.html
Apriori http://www.borgelt.net/apriori.html
Eclat http://www.borgelt.net/eclat.html
FP-Growth http://www.borgelt.net/fpgrowth.html
RElim http://www.borgelt.net/relim.html
SaM http://www.borgelt.net/sam.html
MoSS http://www.borgelt.net/moss.html
the Frequent Item Set Mining Implementations (FIMI) Repository
http://fimi.cs.helsinki.fi/
This repository was set up with the contributions to the FIMI workshops in 2003 and 2004, where each submission had to be accompanied by the source code of an implementation. The web site offers all source code, several data sets, and the results of the competition.