Christian Borgelt
Intelligent Data Analysis and Graphical Models Research Unit
European Centre for Soft Computing
c/ Gonzalo Gutiérrez Quirós s/n, 33600 Mieres, Spain
christian@borgelt.net
http://www.borgelt.net/
http://www.borgelt.net/teach/fpm/
http://www.softcomputing.es/
Christian Borgelt Frequent Pattern Mining 1
Overview
Frequent Pattern Mining comprises
  Frequent Item Set Mining and Association Rule Induction
  Frequent Sequence Mining
  Frequent Tree Mining
  Frequent Graph Mining
Application Areas of Frequent Pattern Mining include
  Market Basket Analysis
  Click Stream Analysis
  Web Link Analysis
  Genome Analysis
  Drug Design (Molecular Fragment Mining)
Frequent Item Set Mining
Frequent Item Set Mining: Motivation
Frequent item set mining is a method for market basket analysis.
It aims at finding regularities in the shopping behavior of customers
of supermarkets, mail-order companies, on-line shops etc.
More specifically:
Find sets of products that are frequently bought together.
Possible applications of found frequent item sets:
  Improve arrangement of products in shelves, on a catalog's pages etc.
  Support cross-selling (suggestion of other products), product bundling.
  Fraud detection, technical dependence analysis etc.
Often found patterns are expressed as association rules, for example:
  If a customer buys bread and wine,
  then she/he will probably also buy cheese.
Frequent Item Set Mining: Basic Notions
Let $B = \{i_1, \ldots, i_m\}$ be a set of items. This set is called the item base.
Items may be products, special equipment items, service options etc.
Any subset $I \subseteq B$ is called an item set.
An item set may be any set of products that can be bought (together).
Let $T = (t_1, \ldots, t_n)$ with $\forall k, 1 \le k \le n: t_k \subseteq B$ be a vector of
transactions over $B$. This vector is called the transaction database.
A transaction database can list, for example, the sets of products
bought by the customers of a supermarket in a given period of time.
Every transaction is an item set, but some item sets may not appear in $T$.
Transactions need not be pairwise different: it may be $t_j = t_k$ for $j \neq k$.
$T$ may also be defined as a bag or multiset of transactions.
The set $B$ may not be explicitly given, but only implicitly as $B = \bigcup_{k=1}^{n} t_k$.
... and $f_2 = \{i_1, \ldots, i_{k-1}, i'_k\}$
only if $i_k < i'_k$ and $\forall j, 1 \le j < k: i_j < i_{j+1}$.
If $s_{T_0}(i) \ge s_{\min}$ (where $s_{T_0}(i)$ is the support of the item $i$ in $T_0$):
  Report the item set $P_0 \cup \{i\}$ as frequent with the support $s_{T_0}(i)$.
  Form the subproblem $S_1 = (T_1, P_1)$ with $P_1 = P_0 \cup \{i\}$.
  $T_1$ comprises all transactions in $T_0$ that contain the item $i$,
  but with the item $i$ removed (and empty transactions removed).
  If $T_1$ is not empty, process $S_1$ recursively.
In any case (that is, regardless of whether $s_{T_0}(i) \ge s_{\min}$ or not):
  Form the subproblem $S_2 = (T_2, P_2)$, where $P_2 = P_0$.
  $T_2$ comprises all transactions in $T_0$ (whether they contain $i$ or not),
  but again with the item $i$ removed (and empty transactions removed).
  If $T_2$ is not empty, process $S_2$ recursively.
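The recursion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the lecture: transactions are modelled as frozensets, the split item is always the smallest remaining item, and the names `mine`, `DB` and `FOUND` are hypothetical. The ten transactions are example data chosen for the sketch.

```python
def mine(transactions, prefix, s_min, report):
    """Process the subproblem S0 = (T0, P0) = (transactions, prefix)."""
    items = set().union(*transactions)
    if not items:
        return
    i = min(items)  # assumption: always split on the smallest item
    support = sum(1 for t in transactions if i in t)
    if support >= s_min:
        report[frozenset(prefix | {i})] = support
        # S1: transactions that contain i, with i (and empty ones) removed
        t1 = [t - {i} for t in transactions if i in t and len(t) > 1]
        if t1:
            mine(t1, prefix | {i}, s_min, report)
    # S2: all transactions, with i (and empty ones) removed
    t2 = [t - {i} for t in transactions if t - {i}]
    if t2:
        mine(t2, prefix, s_min, report)


# Example data (assumption): ten transactions over the items a..e.
DB = [frozenset(t) for t in
      ("ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade")]
FOUND = {}
mine(DB, frozenset(), 3, FOUND)
```

Each frequent item set is reported exactly once, because every set is reachable by exactly one sequence of include/exclude decisions.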
Divide-and-Conquer Recursion
Subproblem Tree
$(T, \emptyset)$
  a included: $(T_a, \{a\})$
    b included: $(T_{ab}, \{a, b\})$
      c included: $(T_{abc}, \{a, b, c\})$
      c excluded: $(T_{ab\bar{c}}, \{a, b\})$
    b excluded: $(T_{a\bar{b}}, \{a\})$
      c included: $(T_{a\bar{b}c}, \{a, c\})$
      c excluded: $(T_{a\bar{b}\bar{c}}, \{a\})$
  a excluded: $(T_{\bar{a}}, \emptyset)$
    b included: $(T_{\bar{a}b}, \{b\})$
      c included: $(T_{\bar{a}bc}, \{b, c\})$
      c excluded: $(T_{\bar{a}b\bar{c}}, \{b\})$
    b excluded: $(T_{\bar{a}\bar{b}}, \emptyset)$
      c included: $(T_{\bar{a}\bar{b}c}, \{c\})$
      c excluded: $(T_{\bar{a}\bar{b}\bar{c}}, \emptyset)$
Branch to the left: include an item (first subproblem).
Branch to the right: exclude an item (second subproblem).
(Items in the indices of the conditional transaction databases $T$ have been removed from them.)
Perfect Extensions
The search can easily be improved with so-called perfect extension pruning.
Let $T$ be a transaction database over an item base $B$.
Given an item set $I$, an item $a \notin I$ is called a perfect extension of $I$ w.r.t. $T$,
iff the item sets $I$ and $I \cup \{a\}$ have the same support: $s_T(I) = s_T(I \cup \{a\})$
(that is, if all transactions containing the item set $I$ also contain the item $a$).
Perfect extensions have the following properties:
  If the item $a$ is a perfect extension of an item set $I$,
  then $a$ is also a perfect extension of any item set $J \supseteq I$ (as long as $a \notin J$).
  This can most easily be seen by considering that $K_T(I) \subseteq K_T(\{a\})$
  and hence $K_T(J) \subseteq K_T(\{a\})$, since $K_T(J) \subseteq K_T(I)$.
  If $X_T(I)$ is the set of all perfect extensions of an item set $I$ w.r.t. $T$
  (that is, if $X_T(I) = \{i \in B - I \mid s_T(I \cup \{i\}) = s_T(I)\}$),
  then all sets $I \cup J$ with $J \in 2^{X_T(I)}$ have the same support as $I$
  (where $2^M$ denotes the power set of a set $M$).
Perfect Extensions: Examples
transaction database
   1: a, d, e
   2: b, c, d
   3: a, c, e
   4: a, c, d, e
   5: a, e
   6: a, c, d
   7: b, c
   8: a, c, d, e
   9: b, c, e
  10: a, d, e
frequent item sets
  0 items    1 item     2 items       3 items
  ∅: 10      {a}: 7     {a, c}: 4     {a, c, d}: 3
             {b}: 3     {a, d}: 5     {a, c, e}: 3
             {c}: 7     {a, e}: 6     {a, d, e}: 4
             {d}: 6     {b, c}: 3
             {e}: 7     {c, d}: 4
                        {c, e}: 4
                        {d, e}: 4
c is a perfect extension of {b}, as {b} and {b, c} both have support 3.
a is a perfect extension of {d, e}, as {d, e} and {a, d, e} both have support 4.
There are no other perfect extensions in this example
for a minimum support of $s_{\min} = 3$.
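The two perfect extensions named in this example can be checked mechanically. The following Python sketch recomputes them directly from the definition $s_T(I) = s_T(I \cup \{a\})$; the helper names `support` and `perfect_extensions` are hypothetical and chosen only for this illustration.

```python
# The ten example transactions, as sets of single-letter items.
DB = [set(t) for t in
      ("ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade")]


def support(item_set, db):
    """Number of transactions that contain all items of item_set."""
    return sum(1 for t in db if item_set <= t)


def perfect_extensions(item_set, db):
    """All items a not in item_set with s(item_set) == s(item_set | {a})."""
    items = set().union(*db)
    s = support(item_set, db)
    return {a for a in items - item_set if support(item_set | {a}, db) == s}


PE_B = perfect_extensions({"b"}, DB)        # {"c"}
PE_DE = perfect_extensions({"d", "e"}, DB)  # {"a"}
PE_A = perfect_extensions({"a"}, DB)        # empty set
```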
Perfect Extension Pruning
Consider again the original divide-and-conquer scheme:
A subproblem $S_0 = (T_0, P_0)$ is split into
  a subproblem $S_1 = (T_1, P_1)$ to find all frequent item sets
  that contain an item $i \in B_0$,
  a subproblem $S_2 = (T_2, P_2)$ to find all frequent item sets
  that do not contain the item $i$.
Suppose the item $i$ is a perfect extension of the prefix $P_0$.
  Let $F_1$ and $F_2$ be the sets of frequent item sets
  that are reported when processing $S_1$ and $S_2$, respectively.
  It is $I \cup \{i\} \in F_1 \Leftrightarrow I \in F_2$.
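[Figure: subproblem tree with local item orders; the first split item is a]
  a included: $(T_a, \{a\})$, next split item b
    b included: $(T_{ab}, \{a, b\})$, next split item d
      d included: $(T_{abd}, \{a, b, d\})$
      d excluded: $(T_{ab\bar{d}}, \{a, b\})$
    b excluded: $(T_{a\bar{b}}, \{a\})$, next split item e
      e included: $(T_{a\bar{b}e}, \{a, e\})$
      e excluded: $(T_{a\bar{b}\bar{e}}, \{a\})$
  a excluded: $(T_{\bar{a}}, \emptyset)$, next split item c
    c included: $(T_{\bar{a}c}, \{c\})$, next split item f
      f included: $(T_{\bar{a}cf}, \{c, f\})$
      f excluded: $(T_{\bar{a}c\bar{f}}, \{c\})$
    c excluded: $(T_{\bar{a}\bar{c}}, \emptyset)$, next split item g
      g included: $(T_{\bar{a}\bar{c}g}, \{g\})$
      g excluded: $(T_{\bar{a}\bar{c}\bar{g}}, \emptyset)$
All local item orders start with a < ...
All subproblems on the left share a < b < ...,
all subproblems on the right share a < c < ...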
Global and Local Item Order
Local item orders have advantages and disadvantages:
Advantage
  In some data sets the order of the conditional item frequencies
  differs considerably from the global order.
  Such data sets can sometimes be processed significantly faster
  with local item orders (depending on the algorithm).
Disadvantage
  The data structure of the conditional databases must allow us
  to determine conditional item frequencies quickly.
  Not having a globally fixed item order can make it more difficult
  to determine conditional transaction databases w.r.t. split items
  (depending on the employed data structure).
  The gains from the better item order may be lost again
  due to the more complex processing / conditioning scheme.
Transaction Database Representation
Eclat, FP-growth and several other frequent item set mining algorithms
rely on the described basic divide-and-conquer scheme.
They differ mainly in how they represent the conditional transaction databases.
The main approaches are horizontal and vertical representations:
  In a horizontal representation, the database is stored as a list (or array)
  of transactions, each of which is a list (or array) of the items contained in it.
  In a vertical representation, a database is represented by first referring
  with a list (or array) to the different items. For each item a list (or array) of
  identifiers is stored, which indicate the transactions that contain the item.
However, this distinction is not pure, since there are many algorithms
that use a combination of the two forms of representing a database.
Frequent item set mining algorithms also differ in
how they construct new conditional databases from a given one.
The Apriori algorithm uses a horizontal transaction representation:
each transaction is an array of the contained items.
  Note that the alternative prefix tree organization
  is still an essentially horizontal representation.
The alternative is a vertical transaction representation:
  For each item a transaction list is created.
  The transaction list of item a indicates the transactions that contain it,
  that is, it represents its cover $K_T(\{a\})$.
  Advantage: the transaction list for a pair of items can be computed by
  intersecting the transaction lists of the individual items.
  Generally, a vertical transaction representation can exploit
  $\forall I, J \subseteq B: K_T(I \cup J) = K_T(I) \cap K_T(J)$.
A combined representation is the frequent pattern tree (to be discussed later).
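As a rough illustration of the vertical representation, the following Python sketch turns a horizontal database into one transaction list per item and computes the cover of an item set by intersection, using $K_T(I \cup J) = K_T(I) \cap K_T(J)$. The function names `to_vertical` and `cover` are assumptions made for this sketch; transaction identifiers start at 1.

```python
from collections import defaultdict

# Horizontal representation: one string of items per transaction.
DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]


def to_vertical(db):
    """Build one set of transaction identifiers (a cover) per item."""
    tid_lists = defaultdict(set)
    for tid, transaction in enumerate(db, start=1):
        for item in transaction:
            tid_lists[item].add(tid)
    return dict(tid_lists)


def cover(item_set, tid_lists, n):
    """Cover of an item set: intersection of the covers of its items."""
    result = set(range(1, n + 1))   # cover of the empty item set
    for item in item_set:
        result &= tid_lists.get(item, set())
    return result


TID_LISTS = to_vertical(DB)
COVER_AC = cover("ac", TID_LISTS, len(DB))   # {3, 4, 6, 8}
```

The support of an item set is then simply the size of its cover, for example `len(COVER_AC)`.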
Horizontal Representation: List items for each transaction
Vertical Representation: List transactions for each item
horizontal representation
   1: a, d, e
   2: b, c, d
   3: a, c, e
   4: a, c, d, e
   5: a, e
   6: a, c, d
   7: b, c
   8: a, c, d, e
   9: b, c, e
  10: a, d, e
vertical representation
  a: 1, 3, 4, 5, 6, 8, 10
  b: 2, 7, 9
  c: 2, 3, 4, 6, 7, 8, 9
  d: 1, 2, 4, 6, 8, 10
  e: 1, 3, 4, 5, 8, 9, 10
matrix representation
       a  b  c  d  e
   1   1  0  0  1  1
   2   0  1  1  1  0
   3   1  0  1  0  1
   4   1  0  1  1  1
   5   1  0  0  0  1
   6   1  0  1  1  0
   7   0  1  1  0  0
   8   1  0  1  1  1
   9   0  1  1  0  1
  10   1  0  0  1  1
transaction database
  a, d, e
  b, c, d
  a, c, e
  a, c, d, e
  a, e
  a, c, d
  b, c
  a, c, d, e
  b, c, e
  a, d, e
lexicographically sorted
  a, c, d
  a, c, d, e
  a, c, d, e
  a, c, e
  a, d, e
  a, d, e
  a, e
  b, c
  b, c, d
  b, c, e
prefix tree representation
  a: 7
    c: 4
      d: 3
        e: 2
      e: 1
    d: 2
      e: 2
    e: 1
  b: 3
    c: 3
      d: 1
      e: 1
Note that a prefix tree representation is a compressed horizontal representation.
Principle: equal prefixes of transactions are merged.
This is most effective if the items are sorted descendingly w.r.t. their support.
The Eclat Algorithm
[Zaki, Parthasarathy, Ogihara, and Li 1997]
Eclat: Basic Ideas
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
The search scheme is the same as the general scheme for searching
with canonical forms having the prefix property and possessing
a perfect extension rule (generate only canonical extensions).
Eclat generates more candidate item sets than Apriori,
because it (usually) does not store the support of all visited item sets.
Note that Eclat cannot fully exploit the Apriori property, because it does not store the support of all
explored item sets, not because it cannot know it. If all computed support values were stored, it could
be implemented in such a way that all support values needed for full a priori pruning were available.
Eclat: Subproblem Split
Transaction lists of the database (support in parentheses)
  a (7): 1, 3, 4, 5, 6, 8, 10
  b (3): 2, 7, 9
  c (7): 2, 3, 4, 6, 7, 8, 9
  d (6): 1, 2, 4, 6, 8, 10
  e (7): 1, 3, 4, 5, 8, 9, 10
Conditional database for prefix a (1st subproblem)
  b (0): (empty)
  c (4): 3, 4, 6, 8
  d (5): 1, 4, 6, 8, 10
  e (6): 1, 3, 4, 5, 8, 10
Conditional database with item a removed (2nd subproblem)
  b (3): 2, 7, 9
  c (7): 2, 3, 4, 6, 7, 8, 9
  d (6): 1, 2, 4, 6, 8, 10
  e (7): 1, 3, 4, 5, 8, 9, 10
Eclat: Depth-First Search
transaction database
   1: a, d, e
   2: b, c, d
   3: a, c, e
   4: a, c, d, e
   5: a, e
   6: a, c, d
   7: b, c
   8: a, c, d, e
   9: b, c, e
  10: a, d, e
Search tree, first level: a: 7, b: 3, c: 7, d: 6, e: 7
Form a transaction list for each item. Here: bit vector representation.
  grey: item is contained in transaction
  white: item is not contained in transaction
Transaction database is needed only once (for the single item transaction lists).
Intersect the transaction list for item a
with the transaction lists of all other items (conditional database for item a).
Count the number of bits that are set (number of containing transactions).
This yields the support of all item sets with the prefix a.
  prefix a: b: 0, c: 4, d: 5, e: 6

The item set {a, b} is infrequent and can be pruned.
All other item sets with the prefix a are frequent
and are therefore kept and processed recursively.

Intersect the transaction list for the item set {a, c}
with the transaction lists of the item sets {a, x}, x ∈ {d, e}.
Result: Transaction lists for the item sets {a, c, d} and {a, c, e}.
Count the number of bits that are set (number of containing transactions).
This yields the support of all item sets with the prefix ac.
  prefix a, c: d: 3, e: 3

Intersect the transaction lists for the item sets {a, c, d} and {a, c, e}.
Result: Transaction list for the item set {a, c, d, e}.
With Apriori this item set could be pruned before counting,
because it was known that {c, d, e} is infrequent.
  prefix a, c, d: e: 2

The item set {a, c, d, e} is not frequent (support 2 / 20%) and therefore pruned.
Since there is no transaction list left (and thus no intersection possible),
the recursion is terminated and the search backtracks.

The search backtracks to the second level of the search tree and
intersects the transaction lists for the item sets {a, d} and {a, e}.
Result: Transaction list for the item set {a, d, e}.
Since there is only one transaction list left (and thus no intersection possible),
the recursion is terminated and the search backtracks again.
  prefix a, d: e: 4

The search backtracks to the first level of the search tree and
intersects the transaction list for b with the transaction lists for c, d, and e.
Result: Transaction lists for the item sets {b, c}, {b, d}, and {b, e}.
  prefix b: c: 3, d: 1, e: 1

Only one item set has sufficient support: prune all subtrees.
Since there is only one transaction list left (and thus no intersection possible),
the recursion is terminated and the search backtracks again.

Backtrack to the first level of the search tree and
intersect the transaction list for c with the transaction lists for d and e.
Result: Transaction lists for the item sets {c, d} and {c, e}.
  prefix c: d: 4, e: 4

Intersect the transaction lists for the item sets {c, d} and {c, e}.
Result: Transaction list for the item set {c, d, e}.
  prefix c, d: e: 2

The item set {c, d, e} is not frequent (support 2 / 20%) and therefore pruned.
Since there is no transaction list left (and thus no intersection possible),
the recursion is terminated and the search backtracks.

The search backtracks to the first level of the search tree and
intersects the transaction list for d with the transaction list for e.
Result: Transaction list for the item set {d, e}.
With this step the search is finished.
  prefix d: e: 4
Final search tree
  a: 7, b: 3, c: 7, d: 6, e: 7
    prefix a: b: 0, c: 4, d: 5, e: 6
      prefix a, c: d: 3, e: 3
        prefix a, c, d: e: 2
      prefix a, d: e: 4
    prefix b: c: 3, d: 1, e: 1
    prefix c: d: 4, e: 4
      prefix c, d: e: 2
    prefix d: e: 4
The found frequent item sets coincide, of course,
with those found by the Apriori algorithm.
However, a fundamental difference is that
Eclat usually only writes found frequent item sets to an output file,
while Apriori keeps the whole search tree in main memory.
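The depth-first search walked through on these slides can be condensed into a short recursive function. The Python sketch below is a simplified illustration, not Borgelt's Eclat implementation: transaction lists are plain Python sets instead of bit vectors, items are processed in alphabetical order, and the names `eclat`, `TID_LISTS` and `FOUND` are hypothetical.

```python
DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
S_MIN = 3

# Vertical representation: one set of transaction identifiers per item.
TID_LISTS = {}
for tid, transaction in enumerate(DB, start=1):
    for item in transaction:
        TID_LISTS.setdefault(item, set()).add(tid)


def eclat(prefix, tid_lists, s_min, out):
    """tid_lists: (item, tid set) pairs of a conditional database,
    all of them already known to be frequent together with prefix."""
    for idx, (item, tids) in enumerate(tid_lists):
        item_set = prefix + (item,)
        out[item_set] = len(tids)
        # Conditional database for item_set: intersect with later lists
        # and keep only the intersections with sufficient support.
        cond = []
        for other, other_tids in tid_lists[idx + 1:]:
            common = tids & other_tids
            if len(common) >= s_min:
                cond.append((other, common))
        if cond:
            eclat(item_set, cond, s_min, out)


FOUND = {}
start = [(i, t) for i, t in sorted(TID_LISTS.items()) if len(t) >= S_MIN]
eclat((), start, S_MIN, FOUND)
```

As in the walkthrough, the sketch computes the support of {a, c, d, e} and {c, d, e} (2 each) before discarding them, since no stored support values are consulted.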
Note that the item set {a, c, d, e} could be pruned by Apriori without computing
its support, because the item set {c, d, e} is infrequent.
The same can be achieved with Eclat if the depth-first traversal of the prefix tree
is carried out from right to left and computed support values are stored.
It is debatable whether the potential gains justify the memory requirement.
Eclat: Bit Matrices and Item Coding
Bit Matrices
Represent transactions as a bit matrix:
  Each column corresponds to an item.
  Each row corresponds to a transaction.
Normal and sparse representation of bit matrices:
  Normal: one memory bit per matrix bit, zeros represented.
  Sparse: lists of row indices of set bits (transaction lists).
Which representation is preferable depends on
the ratio of set bits to cleared bits.
Item Coding
Sorting the items ascendingly w.r.t. their frequency (individual or
transaction size sum) leads to a better structure of the search tree.
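To make the difference between the two bit matrix representations concrete, here is a small Python sketch. It is an assumption-laden illustration: each matrix column is stored as one arbitrary-precision Python integer (bit k set means the item occurs in transaction k + 1), and the helper names `to_bit_vectors`, `popcount` and `to_tid_list` are invented for this example.

```python
DB = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]


def to_bit_vectors(db):
    """Normal representation: one bit per (transaction, item) cell."""
    vectors = {}
    for row, transaction in enumerate(db):
        for item in transaction:
            vectors[item] = vectors.get(item, 0) | (1 << row)
    return vectors


def popcount(bits):
    """Number of set bits, i.e. number of containing transactions."""
    return bin(bits).count("1")


def to_tid_list(bits):
    """Sparse representation: row indices (1-based) of the set bits."""
    return [row + 1 for row in range(bits.bit_length()) if (bits >> row) & 1]


VECTORS = to_bit_vectors(DB)
SUPPORT_AE = popcount(VECTORS["a"] & VECTORS["e"])   # 6
TIDS_AC = to_tid_list(VECTORS["a"] & VECTORS["c"])   # [3, 4, 6, 8]
```

With bit vectors an intersection is a single bitwise AND; with sparse lists it is a merge of two sorted lists.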
Eclat: Intersecting Transaction Lists
function isect (src1, src2 : tidlist)
begin                                  (* intersect two transaction id lists *)
  var dst : tidlist;                   (* created intersection *)
  while both src1 and src2 are not empty do begin
    if head(src1) < head(src2)         (* skip transaction identifiers that are *)
    then src1 := tail(src1);           (* unique to the first source list *)
    elseif head(src1) > head(src2)     (* skip transaction identifiers that are *)
    then src2 := tail(src2);           (* unique to the second source list *)
    else begin                         (* if transaction id is in both sources *)
      dst.append(head(src1));          (* append it to the output list *)
      src1 := tail(src1); src2 := tail(src2);
    end;                               (* remove the transferred transaction id *)
  end;                                 (* from both source lists *)
  return dst;                          (* return the created intersection *)
end;                                   (* function isect() *)
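The pseudo-code above translates almost line by line into Python. In this sketch the two source lists are assumed to be sorted ascendingly and free of duplicates; index variables replace the head/tail operations so that the inputs are left unchanged.

```python
def isect(src1, src2):
    """Intersect two ascendingly sorted transaction id lists."""
    dst = []
    i = j = 0
    while i < len(src1) and j < len(src2):
        if src1[i] < src2[j]:
            i += 1                 # id occurs only in the first list
        elif src1[i] > src2[j]:
            j += 1                 # id occurs only in the second list
        else:
            dst.append(src1[i])    # id occurs in both lists
            i += 1
            j += 1
    return dst


# Transaction lists of items a and c from the running example.
TIDS_AC = isect([1, 3, 4, 5, 6, 8, 10], [2, 3, 4, 6, 7, 8, 9])
```

The data-dependent three-way branch in the loop is what makes this sparse representation hard to predict for modern processors.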
Reminder (Apriori): Transactions as a Prefix Tree
transaction database
  a, d, e
  b, c, d
  a, c, e
  a, c, d, e
  a, e
  a, c, d
  b, c
  a, c, d, e
  b, c, e
  a, d, e
lexicographically sorted
  a, c, d
  a, c, d, e
  a, c, d, e
  a, c, e
  a, d, e
  a, d, e
  a, e
  b, c
  b, c, d
  b, c, e
prefix tree representation
  a: 7
    c: 4
      d: 3
        e: 2
      e: 1
    d: 2
      e: 2
    e: 1
  b: 3
    c: 3
      d: 1
      e: 1
Items in transactions are sorted w.r.t. some arbitrary order,
transactions are sorted lexicographically, then a prefix tree is constructed.
Advantage: identical transaction prefixes are processed only once.
Eclat: Transaction Ranges
transaction database
  a, d, e
  b, c, d
  a, c, e
  a, c, d, e
  a, e
  a, c, d
  b, c
  a, c, d, e
  b, c, e
  a, d, e
item frequencies
  a: 7, b: 3, c: 7, d: 6, e: 7
sorted by frequency
  a, e, d
  c, d, b
  a, c, e
  a, c, e, d
  a, e
  a, c, d
  c, b
  a, c, e, d
  c, e, b
  a, e, d
lexicographically sorted
   1: a, c, e
   2: a, c, e, d
   3: a, c, e, d
   4: a, c, d
   5: a, e
   6: a, e, d
   7: a, e, d
   8: c, e, b
   9: c, d, b
  10: c, b
transaction ranges
  a: 1-7
  c: 1-4, 8-10
  e: 1-3, 5-7, 8
  d: 2-3, 4, 6-7, 9
  b: 8, 9, 10
The transaction lists can be compressed by combining
consecutive transaction identifiers into ranges.
Exploit item frequencies and ensure subset relations between ranges
from lower to higher frequencies, so that intersecting the lists is easy.
Eclat: Difference Sets (Diffsets)
In a conditional database, all transaction lists are "filtered" by the prefix:
Only transactions contained in the transaction list for the prefix
can be in the transaction lists of the conditional database.
This suggests the idea to use diffsets to represent conditional databases:
  $\forall I: \forall a \notin I: D_T(a \mid I) = K_T(I) - K_T(I \cup \{a\})$
  $D_T(a \mid I)$ contains the indices of the transactions that contain $I$ but not $a$.
The support of direct supersets of $I$ can now be computed as
  $\forall I: \forall a \notin I: s_T(I \cup \{a\}) = s_T(I) - |D_T(a \mid I)|$.
The diffsets for the next level can be computed by
  $\forall I: \forall a, b \notin I, a \neq b: D_T(b \mid I \cup \{a\}) = D_T(b \mid I) - D_T(a \mid I)$
For some transaction databases, using diffsets speeds up the search considerably.
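The three diffset formulas can be verified on the running example with a few lines of Python. The sketch below is illustrative only: covers are recomputed from the horizontal database for clarity (a real implementation would never do that), and the names `cover`, `diffset` and the module-level variables are hypothetical.

```python
DB = [set(t) for t in
      ("ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade")]


def cover(item_set):
    """K_T(I): identifiers (1-based) of transactions containing I."""
    return {k for k, t in enumerate(DB, start=1) if item_set <= t}


def diffset(a, item_set):
    """D_T(a | I) = K_T(I) - K_T(I + {a})."""
    return cover(item_set) - cover(item_set | {a})


BASE = {"a"}
D_C = diffset("c", BASE)                       # {1, 5, 10}
D_D = diffset("d", BASE)                       # {3, 5}
SUPPORT_AC = len(cover(BASE)) - len(D_C)       # 7 - 3 = 4
# Next level: D_T(d | {a, c}) = D_T(d | {a}) - D_T(c | {a})
D_D_GIVEN_AC = D_D - D_C                       # {3}
SUPPORT_ACD = SUPPORT_AC - len(D_D_GIVEN_AC)   # 4 - 1 = 3
```

Note that the next-level diffset is obtained from two stored diffsets alone, without touching any cover.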
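Eclat: Diffsets
Proof of the Formula for the Next Level:
\begin{align*}
D_T(b \mid I \cup \{a\})
  &= K_T(I \cup \{a\}) - K_T(I \cup \{a, b\}) \\
  &= \{k \mid I \cup \{a\} \subseteq t_k\} - \{k \mid I \cup \{a, b\} \subseteq t_k\} \\
  &= \{k \mid I \subseteq t_k \wedge a \in t_k\} - \{k \mid I \subseteq t_k \wedge a \in t_k \wedge b \in t_k\} \\
  &= \{k \mid I \subseteq t_k \wedge a \in t_k \wedge b \notin t_k\} \\
  &= \{k \mid I \subseteq t_k \wedge b \notin t_k\} - \{k \mid I \subseteq t_k \wedge b \notin t_k \wedge a \notin t_k\} \\
  &= \{k \mid I \subseteq t_k \wedge b \notin t_k\} - \{k \mid I \subseteq t_k \wedge a \notin t_k\} \\
  &= (\{k \mid I \subseteq t_k\} - \{k \mid I \cup \{b\} \subseteq t_k\})
     - (\{k \mid I \subseteq t_k\} - \{k \mid I \cup \{a\} \subseteq t_k\}) \\
  &= (K_T(I) - K_T(I \cup \{b\})) - (K_T(I) - K_T(I \cup \{a\})) \\
  &= D_T(b \mid I) - D_T(a \mid I)
\end{align*}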
Summary Eclat
Basic Processing Scheme
  Depth-first traversal of the prefix tree.
  Data is represented as lists of transaction identifiers (one per item).
  Support counting is done by intersecting lists of transaction identifiers.
Advantages
  Depth-first search reduces memory requirements.
  Usually (considerably) faster than Apriori.
Disadvantages
  With a sparse transaction list representation (row indices)
  Eclat is difficult to execute for modern processors (branch prediction).
Software
  http://www.borgelt.net/eclat.html
The SaM Algorithm
Split and Merge Algorithm [Borgelt 2008]
SaM: Basic Ideas
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
Step by step elimination of items from the transaction database.
Recursive processing of the conditional transaction databases.
While Eclat uses a purely vertical transaction representation,
SaM uses a purely horizontal transaction representation.
This demonstrates that the traversal order for the prefix tree and
the representation form of the transaction database can be combined freely.
The data structure used is a simple array of transactions.
The two conditional databases for the two subproblems formed in each step
are created with a split step and a merge step.
Due to these steps the algorithm is called Split and Merge (SaM).
SaM: Preprocessing the Transaction Database
1. Original transaction database
     a d
     a c d e
     b d
     b c d g
     b c f
     a b d
     b d e
     b c d e
     b c
     a b d f
2. Frequency of individual items ($s_{\min} = 3$)
     g: 1, f: 2, e: 3, a: 4, c: 5, b: 8, d: 8
3. Items in transactions sorted ascendingly w.r.t. their frequency
     a d
     e a c d
     b d
     c b d
     c b
     a b d
     e b d
     e c b d
     c b
     a b d
4. Transactions sorted lexicographically in descending order
   (comparison of items inverted w.r.t. preceding step)
     e a c d
     e c b d
     e b d
     a b d
     a b d
     a d
     c b d
     c b
     c b
     b d
5. Data structure used by the algorithm (occurrence counter, transaction)
     1  e a c d
     1  e c b d
     1  e b d
     2  a b d
     1  a d
     1  c b d
     2  c b
     1  b d
SaM: Basic Operations
transaction array before the step
  1  e a c d
  1  e c b d
  1  e b d
  2  a b d
  1  a d
  1  c b d
  2  c b
  1  b d
split result (prefix e, leading item e removed)
  1  a c d
  1  c b d
  1  b d
rest of the transaction array
  2  a b d
  1  a d
  1  c b d
  2  c b
  1  b d
merge result (e removed)
  1  a c d
  2  a b d
  1  a d
  2  c b d
  2  c b
  2  b d
Split Step: (on the left; for first subproblem)
  Move all transactions starting with the same item to a new array.
  Remove the common leading item (advance pointer into transaction).
Merge Step: (on the right; for second subproblem)
  Merge the rest of the transaction array and the copied transactions.
  The merge operation is similar to a mergesort phase.
SaM: Pseudo-Code
function SaM (a: array of transactions,   (* conditional database to process *)
              p: set of items,            (* prefix of the conditional database a *)
              s_min: int)                 (* minimum support of an item set *)
var i: item;                              (* buffer for the split item *)
    b: array of transactions;             (* split result *)
begin                                     (* split and merge recursion *)
  while a is not empty do                 (* while the database is not empty *)
    i := a[0].items[0];                   (* get leading item of first transaction *)
    move transactions starting with i to b;   (* split step: first subproblem *)
    merge b and the rest of a into a;         (* merge step: second subproblem *)
    if s(i) ≥ s_min then                  (* if the split item is frequent: *)
      p := p ∪ {i};                       (* extend the prefix item set and *)
      report p with support s(i);         (* report the found frequent item set *)
      SaM(b, p, s_min);                   (* process the split result recursively, *)
      p := p − {i};                       (* then restore the original prefix *)
    end;
  end;
end;                                      (* function SaM() *)
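A compact Python rendering of the split and merge recursion may help to see how the pieces fit together. This is a sketch under several assumptions, not the reference implementation: items are recoded as integers (0 = most frequent, ties broken alphabetically, which may order equally frequent items differently from the slides), each transaction is a tuple of codes in descending order with an occurrence weight, the array is kept sorted in descending tuple order, and `preprocess`, `merge` and `sam` are names chosen for this example.

```python
from collections import Counter


def preprocess(db, s_min):
    """Return (array, order): weighted, coded, descendingly sorted transactions."""
    freq = Counter(item for t in db for item in set(t))
    # code 0 = most frequent item; infrequent items get no code (are dropped)
    order = sorted((i for i in freq if freq[i] >= s_min),
                   key=lambda i: (-freq[i], i))
    code = {item: c for c, item in enumerate(order)}
    weights = Counter()
    for t in db:
        key = tuple(sorted((code[i] for i in set(t) if i in code), reverse=True))
        if key:
            weights[key] += 1
    return [(w, key) for key, w in sorted(weights.items(), reverse=True)], order


def merge(b, c):
    """Merge two descendingly sorted arrays, summing weights of equal entries."""
    out, x, y = [], 0, 0
    while x < len(b) and y < len(c):
        if c[y][1] > b[x][1]:
            out.append(c[y]); y += 1
        elif c[y][1] < b[x][1]:
            out.append(b[x]); x += 1
        else:
            out.append((b[x][0] + c[y][0], b[x][1])); x += 1; y += 1
    return out + b[x:] + c[y:]


def sam(array, prefix, s_min, out):
    a = list(array)
    while a:
        i = a[0][1][0]                    # leading item of first transaction
        b, s = [], 0
        while a and a[0][1][0] == i:      # split step (also counts support)
            w, items = a.pop(0)
            s += w
            if len(items) > 1:            # drop transactions that become empty
                b.append((w, items[1:]))
        a = merge(b, a)                   # merge step: item i eliminated
        if s >= s_min:
            out[prefix + (i,)] = s
            sam(b, prefix + (i,), s_min, out)


DB = ["ad", "acde", "bd", "bcdg", "bcf", "abd", "bde", "bcde", "bc", "abdf"]
ARRAY, ORDER = preprocess(DB, 3)
CODED = {}
sam(ARRAY, (), 3, CODED)
FOUND = {frozenset(ORDER[c] for c in key): s for key, s in CODED.items()}
```

On the example database with a minimum support of 3 this yields ten frequent item sets, for instance {b, d} with support 6 and {d, e} with support 3.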
SaM: Pseudo-Code Split Step
var i: item;                              (* buffer for the split item *)
    s: int;                               (* support of the split item *)
    b: array of transactions;             (* split result *)
begin                                     (* split step *)
  b := empty; s := 0;                     (* initialize split result and item support *)
  i := a[0].items[0];                     (* get leading item of first transaction *)
  while a is not empty                    (* while database is not empty and *)
  and a[0].items[0] = i do                (* next transaction starts with same item *)
    s := s + a[0].wt;                     (* sum occurrences (compute support) *)
    remove i from a[0].items;             (* remove split item from transaction *)
    if a[0].items is not empty            (* if transaction has not become empty *)
    then remove a[0] from a and append it to b;
    else remove a[0] from a; end;         (* move it to the conditional database, *)
  end;                                    (* otherwise simply remove it: *)
end;                                      (* empty transactions are eliminated *)
Note that the split step also determines the support of the item i.
SaM: Pseudo-Code Merge Step
var c: array of transactions;             (* buffer for rest of source array *)
begin                                     (* merge step *)
  c := a; a := empty;                     (* initialize the output array *)
  while b and c are both not empty do     (* merge split and rest of database *)
    if c[0].items > b[0].items            (* copy lex. smaller transaction from c *)
    then remove c[0] from c and append it to a;
    else if c[0].items < b[0].items       (* copy lex. smaller transaction from b *)
    then remove b[0] from b and append it to a;
    else b[0].wt := b[0].wt + c[0].wt;    (* sum the occurrences/weights *)
      remove b[0] from b and append it to a;
      remove c[0] from c;                 (* move combined transaction and *)
    end;                                  (* delete the other, equal transaction: *)
  end;                                    (* keep only one copy per transaction *)
  while c is not empty do                 (* copy rest of transactions in c *)
    remove c[0] from c and append it to a; end;
  while b is not empty do                 (* copy rest of transactions in b *)
    remove b[0] from b and append it to a; end;
end;                                      (* second recursion: executed by loop *)
SaM: Optimization
If the transaction database is sparse,
the two transaction arrays to merge can differ substantially in size.
In this case SaM can become fairly slow,
because the merge step processes many more transactions than the split step.
Intuitive explanation (extreme case):
  Suppose mergesort always merged a single element
  with the recursively sorted rest of the array (or list).
  This version of mergesort would be equivalent to insertion sort.
  As a consequence the time complexity worsens from $O(n \log n)$ to $O(n^2)$.
Possible optimization:
  Modify the merge step if the arrays to merge differ significantly in size.
  Idea: use the same optimization as in binary search based insertion sort.
SaM: Pseudo-Code Binary Search Based Merge
function merge (a, b: array of transactions): array of transactions
var l, m, r: int;                         (* binary search variables *)
    c: array of transactions;             (* output transaction array *)
begin                                     (* binary search based merge *)
  c := empty;                             (* initialize the output array *)
  while a and b are both not empty do     (* merge the two transaction arrays *)
    l := 0; r := length(a);               (* initialize the binary search range *)
    while l < r do                        (* while the search range is not empty *)
      m := ⌊(l + r) / 2⌋;                 (* compute the middle index *)
      if a[m] < b[0]                      (* compare the transaction to insert *)
      then l := m + 1; else r := m;       (* and adapt the binary search range *)
    end;                                  (* according to the comparison result *)
    while l > 0 do                        (* while still before insertion position *)
      remove a[0] from a and append it to c;
      l := l − 1;                         (* copy lex. larger transaction and *)
    end;                                  (* decrement the transaction counter *)
    remove b[0] from b and append it to c;   (* copy the transaction to insert and *)
    i := length(c) − 1;                      (* get its index in the output array *)
    if a is not empty and a[0].items = c[i].items
    then c[i].wt := c[i].wt + a[0].wt;    (* if there is a transaction in the rest *)
      remove a[0] from a;                 (* that is equal to the one just copied, *)
    end;                                  (* then sum the transaction weights *)
  end;                                    (* and remove trans. from the rest *)
  while a is not empty do                 (* copy rest of transactions in a *)
    remove a[0] from a and append it to c; end;
  while b is not empty do                 (* copy rest of transactions in b *)
    remove b[0] from b and append it to c; end;
  return c;                               (* return the merge result *)
end;                                      (* function merge() *)
Applying this merge procedure if the length ratio of the transaction arrays
exceeds 16:1 accelerates the execution on sparse data sets.
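The binary search based merge can be sketched in Python as follows. The assumptions are the same as in any SaM-style array: both inputs hold (weight, items) pairs, items are tuples of integer item codes, and both arrays are sorted in descending tuple order without duplicates. The function name `bs_merge` and the integer coding d = 0, b = 1, c = 2, a = 3 used in the example data are choices made for this illustration.

```python
def bs_merge(a, b):
    """Merge the short array b into the long array a (both sorted descendingly)."""
    out = []
    pos = 0                          # start of the not yet copied part of a
    for w, items in b:
        lo, hi = pos, len(a)
        while lo < hi:               # find first entry of a that is <= items
            mid = (lo + hi) // 2
            if a[mid][1] > items:
                lo = mid + 1
            else:
                hi = mid
        out.extend(a[pos:lo])        # copy all lexicographically larger entries
        pos = lo
        if pos < len(a) and a[pos][1] == items:
            w += a[pos][0]           # equal transaction: sum the weights
            pos += 1
        out.append((w, items))
    out.extend(a[pos:])              # copy the rest of a
    return out


# Rest of the array and split result after splitting off item e.
REST = [(2, (3, 1, 0)), (1, (3, 0)), (1, (2, 1, 0)), (2, (2, 1)), (1, (1, 0))]
SPLIT = [(1, (3, 2, 0)), (1, (2, 1, 0)), (1, (1, 0))]
MERGED = bs_merge(REST, SPLIT)
```

Each insertion position is located with a logarithmic number of comparisons, so the cost is dominated by the length of the short array when the sizes differ strongly.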
SaM: Optimization and External Storage
Accepting a slightly more complicated processing scheme,
one may work with double source buffering:
  Initially, one source is the input database and the other source is empty.
  A split result, which has to be created by moving and merging transactions
  from both sources, is always merged to the smaller source.
  If both sources have become large,
  they may be merged in order to empty one source.
Note that SaM can easily be implemented to work on external storage:
  In principle, the transactions need not be loaded into main memory.
  Even the transaction array can easily be stored on external storage
  or as a relational database table.
  The fact that the transaction array is processed linearly
  is advantageous for external storage operations.
Summary SaM
Basic Processing Scheme
  Depth-first traversal of the prefix tree.
  Data is represented as an array of transactions (purely horizontal representation).
  Support counting is done implicitly in the split step.
Advantages
  Very simple data structure and processing scheme.
  Easy to implement for operation on external storage / relational databases.
Disadvantages
  Can be slow on sparse transaction databases due to the merge step.
Software
  http://www.borgelt.net/sam.html
The RElim Algorithm
Recursive Elimination Algorithm [Borgelt 2005]
Recursive Elimination: Basic Ideas
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
Step by step elimination of items from the transaction database.
Recursive processing of the conditional transaction databases.
Avoids the main problem of the SaM algorithm:
does not use a merge operation to group transactions with the same leading item.
RElim rather maintains one list of transactions per item,
thus employing the core idea of radix sort.
However, only transactions starting with an item are in the corresponding list.
After an item has been processed, transactions are reassigned to other lists
(based on the next item in the transaction).
RElim is in several respects similar to the LCM algorithm
and closely related to the H-mine algorithm (but simpler data structure).
RElim: Preprocessing the Transaction Database
Steps 1 to 3 are the same as for SaM:
1. Original transaction database.
2. Frequency of individual items.
3. Items in transactions sorted ascendingly w.r.t. their frequency.
4. Transactions sorted lexicographically in descending order
   (comparison of items inverted w.r.t. preceding step):
     e a c d
     e c b d
     e b d
     a b d
     a b d
     a d
     c b d
     c b
     c b
     b d
5. Data structure used by the algorithm (leading items implicit in list):
     item   weight   list elements (weight, remaining items)
     d      0        (none)
     b      1        1 d
     c      3        1 b d;  2 b
     a      3        2 b d;  1 d
     e      3        1 a c d;  1 c b d;  1 b d
RElim: Basic Operations
Initial database:
     item   weight   list elements (weight, remaining items)
     d      0        (none)
     b      1        1 d
     c      3        1 b d;  2 b
     a      3        2 b d;  1 d
     e      3        1 a c d;  1 c b d;  1 b d
Conditional database for prefix e:
     d      0        (none)
     b      1        1 d
     c      1        1 b d
     a      1        1 c d
Original database with item e eliminated:
     d      0        (none)
     b      2        1 d;  1 d
     c      4        1 b d;  1 b d;  2 b
     a      4        1 c d;  2 b d;  1 d
The basic operations of the RElim algorithm:
The rightmost list is traversed and reassigned:
once to an initially empty list array (conditional database for the prefix e)
and once to the original list array (eliminating item e).
These two databases are then both processed recursively.
Note that after a simple reassignment there may be duplicate list elements.
RElim: Pseudo-Code
function RElim (a: array of transaction lists,  (* cond. database to process *)
                p: set of items,                (* prefix of the conditional database a *)
                s_min: int) : int               (* minimum support of an item set *)
var i, k: item;                         (* buffer for the current item *)
    s: int;                             (* support of the current item *)
    n: int;                             (* number of found frequent item sets *)
    b: array of transaction lists;      (* conditional database for current item *)
    t, u: transaction list element;     (* to traverse the transaction lists *)
begin                                   (* --- recursive elimination --- *)
  n ← 0;                                (* initialize the number of found item sets *)
  while a is not empty do               (* while conditional database is not empty *)
    i ← last item of a; s ← a[i].wgt;   (* get the next item to process *)
    if s ≥ s_min then                   (* if the current item is frequent: *)
      p ← p ∪ {i};                      (* extend the prefix item set and *)
      report p with support s;          (* report the found frequent item set *)
      b ← array of transaction lists;   (* create an empty list array *)
      t ← a[i].head;                    (* get the list associated with the item *)
      while t ≠ nil do                  (* while not at the end of the list *)
        u ← copy of t; t ← t.succ;      (* copy the transaction list element, *)
        k ← u.items[0];                 (* go to the next list element, and *)
        remove k from u.items;          (* remove the leading item from the copy *)
        if u.items is not empty         (* add the copy to the conditional database *)
        then u.succ ← b[k].head; b[k].head ← u; end;
        b[k].wgt ← b[k].wgt + u.wgt;    (* sum the transaction weight *)
      end;                              (* in the list weight/transaction counter *)
      n ← n + 1 + RElim(b, p, s_min);   (* process the created database recursively *)
      p ← p − {i};                      (* and sum the found frequent item sets, *)
    end;                                (* then restore the original item set prefix *)
    t ← a[i].head;                      (* go on by reassigning the processed transactions: *)
    while t ≠ nil do                    (* while not at the end of the list *)
      u ← t; t ← t.succ;                (* note the current list element, *)
      k ← u.items[0];                   (* go to the next list element, and *)
      remove k from u.items;            (* remove the leading item from current *)
      if u.items is not empty           (* reassign the noted list element *)
      then u.succ ← a[k].head; a[k].head ← u; end;
      a[k].wgt ← a[k].wgt + u.wgt;      (* sum the transaction weight *)
    end;                                (* in the list weight/transaction counter *)
    remove a[i] from a;                 (* remove the processed list *)
  end;
  return n;                             (* return the number of frequent item sets *)
end;                                    (* function RElim() *)
In order to remove duplicate elements, it is usually advisable
to sort and compress the next transaction list before it is processed.
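The pseudo-code can be turned into a small runnable sketch. The following Python version is only an illustration under simplifying assumptions, not the author's implementation: the names `relim`, `_recurse`, and `_assign` are made up here, list elements are plain `(suffix, weight)` tuples instead of linked list nodes, infrequent items are dropped up front, and the recommended sort-and-compress step for duplicate elements is omitted.

```python
from collections import Counter

def relim(transactions, s_min):
    """Return {frozenset: support} for all non-empty frequent item sets."""
    freq = Counter(i for t in transactions for i in set(t))
    # Item codes: 0 = most frequent item; the rarest item gets the last list.
    items = sorted((i for i in freq if freq[i] >= s_min),
                   key=lambda i: (-freq[i], i))
    pos = {item: code for code, item in enumerate(items)}
    # One list per item: [weight, [(remaining item codes, weight), ...]].
    lists = [[0, []] for _ in items]
    for t in transactions:
        codes = sorted((pos[i] for i in set(t) if i in pos), reverse=True)
        if codes:
            _assign(lists, tuple(codes), 1)
    found = {}
    _recurse(lists, (), s_min, items, found)
    return found

def _assign(lists, codes, weight):
    """Count a transaction in the list of its leading item; keep its suffix."""
    k = codes[0]
    lists[k][0] += weight
    if len(codes) > 1:
        lists[k][1].append((codes[1:], weight))

def _recurse(a, prefix, s_min, items, found):
    while a:
        i = len(a) - 1                 # last item of a
        weight, elems = a.pop()        # also removes the processed list
        if weight >= s_min:
            p = prefix + (items[i],)
            found[frozenset(p)] = weight
            b = [[0, []] for _ in range(i)]   # conditional database for p
            for suffix, w in elems:
                _assign(b, suffix, w)
            _recurse(b, p, s_min, items, found)
        for suffix, w in elems:        # eliminate item i from the database
            _assign(a, suffix, w)

db = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
result = relim(db, 3)
```

The support of an item is read off as the weight of its list at the moment the list is processed, which mirrors the remark that support counting is implicit in the (re)assignment step.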
Summary RElim
Basic Processing Scheme
  Depth-first traversal of the prefix tree.
  Data is represented as lists of transactions (one per item).
  Support counting is implicit in the (re)assignment step.
Advantages
  Simple data structures and processing scheme.
  Competitive with the fastest algorithms despite this simplicity.
Disadvantages
  RElim is usually outperformed by FP-growth (discussed below).
Software
  http://www.borgelt.net/relim.html
The LCM Algorithm
Linear Closed Item Set Miner
[Uno, Asai, Uchida, and Arimura 2003] (version 1)
[Uno, Kiyomi, and Arimura 2004, 2005] (versions 2 & 3)
LCM: Basic Ideas
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
Step by step elimination of items from the transaction database;
recursive processing of the conditional transaction databases.
Closely related to the Eclat algorithm.
Maintains both a horizontal and a vertical representation
of the transaction database in parallel.
  Uses the vertical representation to filter the transactions
  with the chosen split item.
  Uses the horizontal representation to generate the vertical representation
  for the next recursion step (no intersection as in Eclat).
Usually traverses the search tree from right to left
in order to reuse the memory for the vertical representation
(fixed memory requirement, proportional to database size).
LCM: Occurrence Deliver
Transaction database (horizontal representation):
   1: a d e
   2: b c d
   3: a c e
   4: a c d e
   5: a e
   6: a c d
   7: b c
   8: a c d e
   9: b c e
  10: a d e
Vertical representation (item, support, transaction identifiers):
   a   7   1 3 4 5 6 8 10
   b   3   2 7 9
   c   7   2 3 4 6 7 8 9
   d   6   1 2 4 6 8 10
   e   7   1 3 4 5 8 9 10
Occurrence deliver for the split item e (list 1 3 4 5 8 9 10), first steps:
   after transaction 1 (a d e):     a: 1       b: (none)   c: (none)   d: 1
   after transaction 3 (a c e):     a: 1 3     b: (none)   c: 3        d: 1
   after transaction 4 (a c d e):   a: 1 3 4   b: (none)   c: 3 4      d: 1 4
   etc.
Occurrence deliver scheme used by LCM to find the conditional
transaction database for the first subproblem
(needs a horizontal representation in parallel).
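The occurrence deliver step can be sketched in a few lines of Python. This is a hedged illustration, not LCM's actual code: the function names `to_vertical` and `occurrence_deliver` and the dictionary-based representations are assumptions made for this example.

```python
def to_vertical(transactions):
    """Turn {tid: items} (horizontal) into {item: [tids]} (vertical)."""
    vertical = {}
    for tid, items in transactions.items():
        for item in items:
            vertical.setdefault(item, []).append(tid)
    return vertical

def occurrence_deliver(transactions, vertical, split_item, order):
    """Vertical representation of the conditional database for split_item.

    Each transaction in the tid list of split_item is looked up in the
    horizontal representation and its identifier is "delivered" to the
    lists of all items that precede split_item in the given order.
    """
    rank = {item: r for r, item in enumerate(order)}
    cond = {item: [] for item in order if rank[item] < rank[split_item]}
    for tid in vertical[split_item]:
        for item in transactions[tid]:
            if item in cond:
                cond[item].append(tid)
    return cond

db = dict(enumerate(
    ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"],
    start=1))
vertical = to_vertical(db)
cond_e = occurrence_deliver(db, vertical, "e", "abcde")
```

No tid list intersection is needed: the length of each delivered list is the support of the corresponding item in the conditional database.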
LCM: Left to Right Processing
[Figure: vertical representation of the example database (items a to e with their
transaction identifier lists) at successive stages of left-to-right processing.
Legend: black: unprocessed part, blue: split item, red: conditional database.]
The second subproblem (exclude split item) is solved
before the first subproblem (include split item).
The algorithm is executed on only the memory
that stores the initial vertical representation.
If the transaction database can be loaded, the frequent item sets can be found.
LCM: k-items Machine
Problem of LCM (as of Eclat): it is difficult to combine equal transaction suffixes.
Idea: If the number of items is small, a bucket/bin sort scheme
can be used to perfectly combine equal transaction suffixes.
This scheme leads to the k-items machine (for small k).
  All possible transaction suffixes are represented as bit patterns;
  one bucket/bin is created for each possible bit pattern.
  A RElim-like processing scheme is employed (on a fixed data structure).
  Leading items are extracted with a table that is indexed with the bit pattern.
  Items are eliminated with a bit mask.
Table of highest set bits for a 4-items machine:
highest items/set bits of transactions (constant)
  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
  ____ ___a __b_ __ba _c__ _c_a _cb_ _cba d___ d__a d_b_ d_ba dc__ dc_a dcb_ dcba
  *.*  a.0  b.1  b.1  c.2  c.2  c.2  c.2  d.3  d.3  d.3  d.3  d.3  d.3  d.3  d.3
LCM: k-items Machine
Transaction database:
   1: a, d, e
   2: b, c, d
   3: a, c, e
   4: a, c, d, e
   5: a, e
   6: a, c, d
   7: b, c
   8: a, c, d, e
   9: b, c, e
  10: a, d, e
Empty 4-items machine (no transactions)
  transaction weights/multiplicities: 0 for all 16 bit patterns 0000 to 1111
  transaction lists (one per item):   a.0: 0   b.1: 0   c.2: 0   d.3: 0   (all empty)
4-items machine after inserting the transactions
  transaction weights/multiplicities
  0    1    0    0    0    1    2    0    0    2    0    0    0    3    1    0
  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
  transaction lists (one per item)
  a.0   weight 1   0001
  b.1   weight 0   (empty)
  c.2   weight 3   0101 0110
  d.3   weight 6   1001 1110 1101
In this state the 4-items machine represents a special form
of the initial transaction database of the RElim algorithm.
LCM: k-items Machine
After propagating the transaction lists
  transaction weights/multiplicities
  0    7    3    0    0    4    3    0    0    2    0    0    0    3    1    0
  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
  transaction lists (one per item)
  a.0   weight 7   0001
  b.1   weight 3   0010
  c.2   weight 7   0101 0110
  d.3   weight 6   1001 1110 1101
Propagating the transaction lists is equivalent to occurrence deliver.
Conditional transaction databases are created as in RElim plus propagation.
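The bucket scheme of the k-items machine can be reproduced with plain integer arithmetic. The sketch below is an assumption-laden illustration (the function names `highest_bit_table`, `insert`, and `propagate` are invented here, and the per-item transaction lists are left implicit); it only tracks the bucket weights shown on the slides.

```python
def highest_bit_table(k):
    """table[p] = index of the highest set bit of pattern p (None for 0)."""
    return [p.bit_length() - 1 if p else None for p in range(1 << k)]

def insert(transactions, items):
    """Count every transaction, restricted to `items`, in its bucket."""
    bit = {item: 1 << j for j, item in enumerate(items)}
    weights = [0] * (1 << len(items))
    for t in transactions:
        pattern = sum(bit[i] for i in set(t) if i in bit)
        weights[pattern] += 1
    return weights

def propagate(weights):
    """Add each bucket's weight to the bucket of its suffix
    (the pattern with the highest item masked out)."""
    w = list(weights)
    for p in range(len(w) - 1, 0, -1):      # all sources of p are larger than p
        rest = p & ~(1 << (p.bit_length() - 1))
        if rest:
            w[rest] += w[p]
    return w

db = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
inserted = insert(db, "abcd")               # a = bit 0, ..., d = bit 3
propagated = propagate(inserted)
table = highest_bit_table(4)
# Support of an item = total weight of the buckets it leads.
support = {item: sum(w for p, w in enumerate(propagated) if table[p] == j)
           for j, item in enumerate("abcd")}
```

After propagation the item weights equal the item supports in the example database, which is the sense in which propagation corresponds to occurrence deliver.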
Summary LCM
Basic Processing Scheme
  Depth-first traversal of the prefix tree.
  Parallel horizontal and vertical transaction representation.
  Support counting is done during the occurrence deliver process.
Advantages
  Fairly simple data structure and processing scheme.
  Very fast if implemented properly (and with additional tricks).
Disadvantages
  Simple, straightforward implementation is relatively slow.
Software
  http://www.borgelt.net/eclat.html (option -ao)
The FP-Growth Algorithm
Frequent Pattern Growth Algorithm [Han, Pei, and Yin 2000]
FP-Growth: Basic Ideas
FP-Growth means Frequent Pattern Growth.
The item sets are checked in lexicographic order
(depth-first traversal of the prefix tree).
Step by step elimination of items from the transaction database.
Recursive processing of the conditional transaction databases.
The transaction database is represented as an FP-tree.
An FP-tree is basically a prefix tree with additional structure:
nodes of this tree that correspond to the same item are linked.
This combines a horizontal and a vertical database representation.
This data structure is used to compute conditional databases efficiently.
All transactions containing a given item can easily be found
by the links between the nodes corresponding to this item.
FP-Growth: Preprocessing the Transaction Database
1. Original transaction database:
     a d f
     a c d e
     b d
     b c d
     b c
     a b d
     b d e
     b c e g
     c d f
     a b d
2. Frequency of individual items ($s_{\min} = 3$):
     d: 8   b: 7   c: 5   a: 4   e: 3   f: 2   g: 1
3. Items in transactions sorted descendingly w.r.t. their frequency
   and infrequent items removed:
     d a
     d c a e
     d b
     d b c
     b c
     d b a
     d b e
     b c e
     d c
     d b a
4. Transactions sorted lexicographically in ascending order
   (comparison of items is the same as in preceding step):
     d b
     d b c
     d b a
     d b a
     d b e
     d c
     d c a e
     d a
     b c
     b c e
5. Data structure used by the algorithm: FP-tree (details on next slide).
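Steps 1 to 4 of the preprocessing can be written down directly. The function name `preprocess` and the tie-breaking by item name are assumptions of this sketch; the slides do not specify how ties between equally frequent items are resolved.

```python
from collections import Counter

def preprocess(transactions, s_min):
    """Return (item order, reduced and sorted transactions)."""
    # Step 2: frequency of individual items.
    freq = Counter(i for t in transactions for i in set(t))
    # Frequent items in descending order of frequency (ties: by name).
    order = sorted((i for i in freq if freq[i] >= s_min),
                   key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Step 3: drop infrequent items, sort items within each transaction.
    reduced = [sorted((i for i in set(t) if i in rank), key=rank.get)
               for t in transactions]
    reduced = [t for t in reduced if t]
    # Step 4: sort the transactions lexicographically w.r.t. the item order.
    reduced.sort(key=lambda t: [rank[i] for i in t])
    return order, reduced

db = ["adf", "acde", "bd", "bcd", "bc", "abd", "bde", "bceg", "cdf", "abd"]
order, reduced = preprocess(db, 3)
```

On the example database this yields the item order d, b, c, a, e and the transaction list shown in step 4.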
Transaction Representation: FP-Tree
Build a frequent pattern tree (FP-tree) from the transactions
(basically a prefix tree with links between the branches that link nodes
with the same item and a header table for the resulting item lists).
Frequent single item sets can be read directly from the FP-tree.
Simple Example Database
  1. Original transactions:
     a d f;  a c d e;  b d;  b c d;  b c;  a b d;  b d e;  b c e g;  c d f;  a b d
  4. Reduced and sorted transactions:
     d b;  d b c;  d b a;  d b a;  d b e;  d c;  d c a e;  d a;  b c;  b c e
Frequent pattern tree
  header table: 10 transactions;  d: 8   b: 7   c: 5   a: 4   e: 3
  root (10)
    d: 8
      b: 5
        c: 1
        a: 2
        e: 1
      c: 2
        a: 1
          e: 1
      a: 1
    b: 2
      c: 2
        e: 1
Transaction Representation: FP-Tree
An FP-tree combines a horizontal and a vertical transaction representation.
Horizontal Representation: prefix tree of transactions
Vertical Representation: links between the prefix tree branches
Note: the prefix tree is inverted,
i.e. there are only parent pointers.
Child pointers are not needed
due to the processing scheme
(to be discussed).
In principle, all nodes referring
to the same item can be stored
in an array rather than a list.
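A minimal sketch of such an inverted FP-tree in Python is given below. It is an illustration under assumptions, not the author's data structure: the names `Node` and `build_fptree` are invented, the per-item node lists are stored as Python lists in a header dictionary (the "array rather than a list" variant), and a temporary dictionary replaces child pointers during construction.

```python
class Node:
    """FP-tree node with a parent pointer only (inverted prefix tree)."""
    __slots__ = ("item", "count", "parent")

    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent

def build_fptree(transactions):
    """Build an FP-tree from reduced transactions with sorted items.

    Returns (root, header), where header maps each item to the list
    of all nodes carrying that item (the vertical representation).
    """
    root = Node(None, None)
    children = {}            # (id(parent), item) -> node; build-time only
    header = {}
    for t in transactions:
        root.count += 1
        node = root
        for item in t:
            key = (id(node), item)
            child = children.get(key)
            if child is None:
                child = children[key] = Node(item, node)
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

sorted_db = ["db", "dbc", "dba", "dba", "dbe", "dc", "dcae", "da", "bc", "bce"]
root, header = build_fptree(sorted_db)
supports = {item: sum(n.count for n in nodes) for item, nodes in header.items()}
```

Summing the counters along an item's node list gives the item's support, which is how frequent single item sets are read directly from the tree.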
Recursive Processing
The initial FP-tree is projected w.r.t. the item corresponding to
the rightmost level in the tree (let this item be i).
This yields an FP-tree of the conditional database
(database of transactions containing the item i, but with this item removed;
it is implicit in the FP-tree and recorded as a common prefix).
From the projected FP-tree the frequent item sets
containing item i can be read directly.
The rightmost level of the original (unprojected) FP-tree is removed
(the item i is removed from the database).
The projected FP-tree is processed recursively; the item i is noted as a prefix
that is to be added in deeper levels of the recursion.
Afterwards the reduced original FP-tree is further processed
by working on the next level leftwards.
Projecting an FP-Tree
[Figure: FP-tree of the example database with attached projection for item e,
and the detached projection.]
Detached projection
  header table: 3 transactions;  d: 2   b: 2   c: 2   a: 1
  root (3)
    d: 2
      b: 1
      c: 1
        a: 1
    b: 1
      c: 1
By traversing the node list for the rightmost item,
all transactions containing this item can be found.
The FP-tree of the conditional database for this item is created
by copying the nodes on the paths to the root.
Projecting an FP-Tree
A simpler, but usually equally efficient projection scheme
is to extract a path to the root as a (reduced) transaction
and to insert this transaction into a new FP-tree.
For the insertion into the new tree there are two approaches:
  Apart from a parent pointer (which is needed for the path extraction),
  each node possesses a pointer to its first child and its right sibling.
  These pointers allow inserting a new transaction top-down.
  If the initial FP-tree has been built from a lexicographically sorted
  transaction database, the traversal of the item lists yields the
  (reduced) transactions in lexicographical order.
  This can be exploited to insert a transaction using only the header table.
By processing an FP-tree from left to right (or from top to bottom
w.r.t. the prefix tree), the projection may even reuse the already present nodes
and the already processed part of the header table (top-down FP-growth).
In this way the algorithm can be executed on a fixed amount of memory.
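The extract-and-insert projection scheme leads to a very compact recursive miner. The following Python sketch is one possible reading of that scheme, written for illustration only: the function name `fpgrowth`, the list-based node layout `[item, count, parent]`, and the build-time child dictionary are assumptions, and none of the optimizations discussed on the slides (pruning, chain handling, memory reuse) is included.

```python
from collections import defaultdict

def fpgrowth(weighted, s_min, prefix=(), out=None):
    """Mine frequent item sets from (items, weight) pairs.

    All item sequences must be sorted w.r.t. one fixed global item order.
    Returns {frozenset: support}.
    """
    out = {} if out is None else out
    # Build the FP-tree; a node is the list [item, count, parent].
    root, children, header = [None, 0, None], {}, defaultdict(list)
    for items, weight in weighted:
        node = root
        for item in items:
            key = (id(node), item)
            if key not in children:
                children[key] = [item, 0, node]
                header[item].append(children[key])
            node = children[key]
            node[1] += weight
    for item, nodes in header.items():
        support = sum(n[1] for n in nodes)
        if support < s_min:
            continue
        out[frozenset(prefix + (item,))] = support
        # Projection: extract each path to the root as a reduced transaction.
        cond = []
        for n in nodes:
            path, p = [], n[2]
            while p is not root:
                path.append(p[0])
                p = p[2]
            if path:
                cond.append((path[::-1], n[1]))
        # Insertion into a new FP-tree happens in the recursive call.
        fpgrowth(cond, s_min, prefix + (item,), out)
    return out

sorted_db = ["db", "dbc", "dba", "dba", "dbe", "dc", "dcae", "da", "bc", "bce"]
frequent = fpgrowth([(t, 1) for t in sorted_db], 3)
```

Because a path to the root only contains items that precede the projected item in the global order, every frequent item set is reported exactly once, with its last item as the split item.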
Reducing the Original FP-Tree
[Figure: FP-tree of the example database before and after removing the level of item e.]
Reduced FP-tree
  header table: 10 transactions;  d: 8   b: 7   c: 5   a: 4
  root (10)
    d: 8
      b: 5
        c: 1
        a: 2
      c: 2
        a: 1
      a: 1
    b: 2
      c: 2
The original FP-tree is reduced by removing the rightmost level.
This yields the conditional database for item sets not containing the item
corresponding to the rightmost level.
FP-growth: Divide-and-Conquer
[Figure: the FP-tree of the example database is split into two subproblems.]
Conditional database for prefix e (first subproblem):
  projected FP-tree; header table: 3 transactions;  d: 2   b: 2   c: 2   a: 1
Conditional database with item e removed (second subproblem):
  reduced FP-tree; header table: 10 transactions;  d: 8   b: 7   c: 5   a: 4
Pruning a Projected FP-Tree
Trivial case: If the item corresponding to the rightmost level is infrequent,
the item and the FP-tree level are removed without projection.
More interesting case: An item corresponding to a middle level
is infrequent, but an item on a level further to the right is frequent.
Example FP-Tree with an infrequent item on a middle level:
  before pruning (header table: a: 6   b: 1   c: 4   d: 3):
    a: 6
      b: 1
        c: 1
          d: 1
      c: 3
        d: 2
  after pruning (header table: a: 6   b: 1   c: 4   d: 3):
    a: 6
      c: 4
        d: 3
So-called α-pruning or Bonsai pruning of a (projected) FP-tree.
Implemented by left-to-right levelwise merging of nodes with same parents.
Not needed if projection works by extraction and insertion.
FP-growth: Implementation Issues
Chains:
  If an FP-tree has been reduced to a chain, no projections are computed anymore.
  Rather all subsets of the set of items in the chain are formed and reported.
Rebuilding the FP-tree:
  An FP-tree may be projected by extracting the (reduced) transactions described
  by the paths to the root and inserting them into a new FP-tree (see above).
  This makes it possible to change the item order, with the following advantages:
    No need for α- or Bonsai pruning, since the items can be reordered
    so that all conditionally frequent items appear on the left.
    No need for perfect extension pruning, because the perfect extensions can be
    moved to the left and are processed at the end with the chain optimization.
  However, there are also disadvantages:
    Either the FP-tree has to be traversed twice or pair frequencies have to be
    determined to reorder the items according to their conditional frequency.
FP-growth: Implementation Issues
The initial FP-tree is built from an array-based main memory representation
of the transaction database (eliminates the need for child pointers).
  This has the disadvantage that the memory savings often resulting
  from an FP-tree representation cannot be fully exploited.
  However, it has the advantage that no child and sibling pointers are needed
  and the transactions can be inserted in lexicographic order.
Each FP-tree node has a constant size of 16 bytes (2 pointers, 2 integers).
Allocating these through the standard memory management is wasteful.
(Allocating many small memory objects is highly inefficient.)
  Solution: The nodes are allocated in one large array per FP-tree.
  As a consequence, each FP-tree resides in a single memory block.
  There is no allocation and deallocation of individual nodes.
  (This may waste some memory, but is highly efficient.)
FP-growth: Implementation Issues
An FP-tree can be implemented with only two integer arrays [Rasz 2004]:
  one array contains the transaction counters (support values) and
  one array contains the parent pointers (as the indices of array elements).
This reduces the memory requirements to 8 bytes per node.
Such a memory structure has advantages
due to the way in which modern processors access the main memory:
Linear memory accesses are faster than random accesses.
  Main memory is organized as a table with rows and columns.
  First the row is addressed and then, after some delay, the column.
  Accesses to different columns in the same row can skip the row addressing.
However, there are also disadvantages:
  Programming projection and α- or Bonsai pruning becomes more complex,
  because less structure is available.
  Reordering the items is virtually ruled out.
Summary FP-Growth
Basic Processing Scheme
  Transaction database is represented as a frequent pattern tree.
  An FP-tree is projected to obtain a conditional database.
  Recursive processing of the conditional database.
Advantages
  Often the fastest algorithm or among the fastest algorithms.
Disadvantages
  More difficult to implement than other approaches, complex data structure.
  An FP-tree can need more memory than a list or array of transactions.
Software
  http://www.borgelt.net/fpgrowth.html
Experimental Comparison
Experiments: Data Sets
Chess
  A data set listing chess end game positions for king vs. king and rook.
  This data set is part of the UCI machine learning repository.
  75 items, 3196 transactions
  (average) transaction size: 37, density: approx. 0.5
Census
  A data set derived from an extract of the US census bureau data of 1994,
  which was preprocessed by discretizing numeric attributes.
  This data set is part of the UCI machine learning repository.
  135 items, 48842 transactions
  (average) transaction size: 14, density: approx. 0.1
The density of a transaction database is the average fraction of all items occurring
per transaction: density = average transaction size / number of items.
Experiments: Data Sets
T10I4D100K
  An artificial data set generated with IBM's data generator.
  The name is formed from the parameters given to the generator
  (for example: 100K = 100000 transactions).
  870 items, 100000 transactions
  average transaction size: approx. 10.1, density: approx. 0.012
BMS-Webview-1
  A web click stream from a leg-care company that no longer exists.
  It has been used in the KDD cup 2000 and is a popular benchmark.
  497 items, 59602 transactions
  average transaction size: approx. 2.5, density: approx. 0.005
Experiments: Programs and Test System
All programs are my own implementations.
All use the same code for reading the transaction database
and for writing the found frequent item sets.
Therefore differences in speed can only be the effect of the processing schemes.
These programs and their source code can be found on my web site:
http://www.borgelt.net/fpm.html
  Apriori     http://www.borgelt.net/apriori.html
  Eclat       http://www.borgelt.net/eclat.html
  FP-Growth   http://www.borgelt.net/fpgrowth.html
  RElim       http://www.borgelt.net/relim.html
  SaM         http://www.borgelt.net/sam.html
The test system was an IBM/Lenovo X60s laptop
(Intel Centrino Duo L2400, 1.6 GHz, 1 GB main memory)
running SuSE Linux 10.3; programs were compiled with gcc 4.2.1.
Experiments: Execution Times
[Figure: four panels (chess, T10I4D100K, census, webview1) comparing
apriori, eclat, fpgrowth, relim, and sam; the T10I4D100K panel also shows relim -h.]
Decimal logarithm of execution time in seconds over absolute minimum support.
Reminder: Perfect Extensions
The search can be improved with so-called perfect extension pruning.
Given an item set $I$, an item $a \notin I$ is called a perfect extension of $I$,
iff $I$ and $I \cup \{a\}$ have the same support (all transactions containing $I$ contain $a$).
Perfect extensions have the following properties:
  If the item $a$ is a perfect extension of an item set $I$,
  then $a$ is also a perfect extension of any item set $J \supseteq I$ (as long as $a \notin J$).
  If $I$ is a frequent item set and $X$ is the set of all perfect extensions of $I$,
  then all sets $I \cup J$ with $J \in 2^X$ (where $2^X$ denotes the power set of $X$)
  are also frequent and have the same support as $I$.
This can be exploited by collecting perfect extension items in the recursion,
in a third element of a subproblem description: $S = (D, P, X)$.
Once identified, perfect extension items are no longer processed in the recursion,
but are only used to generate all supersets of the prefix having the same support.
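The two properties can be checked on a small example. The sketch below uses the ten-transaction example database with items a to e that appears elsewhere in these slides; the helper names `perfect_extensions` and `expand` are hypothetical and only serve this illustration.

```python
from itertools import combinations

def perfect_extensions(item_set, db, item_base):
    """Items outside item_set that occur in every transaction containing it."""
    covering = [t for t in db if item_set <= t]
    return {a for a in item_base - item_set if all(a in t for t in covering)}

def expand(item_set, extensions):
    """All sets item_set | J with J a subset of the perfect extensions."""
    ext = sorted(extensions)
    return [item_set | frozenset(c)
            for r in range(len(ext) + 1) for c in combinations(ext, r)]

DB = [frozenset(t) for t in
      ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]]
B = frozenset("abcde")

ext_de = perfect_extensions(frozenset("de"), DB, B)   # every d-e transaction has a
ext_b = perfect_extensions(frozenset("b"), DB, B)     # every b transaction has c
same_support_as_b = expand(frozenset("b"), ext_b)
```

All sets produced by `expand` share the support of the original item set, so a miner only needs to count the original set once.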
Experiments: Perfect Extension Pruning
[Figure: four panels (chess, T10I4D100K, census, webview1) comparing
apriori, eclat, and fpgrowth with and without perfect extension pruning (w/o pep).]
Decimal logarithm of execution time in seconds over absolute minimum support.
Reducing the Output:
Closed and Maximal Item Sets
Maximal Item Sets
Consider the set of maximal (frequent) item sets:
  $M_T(s_{\min}) = \{ I \subseteq B \mid s_T(I) \ge s_{\min} \wedge \forall J \supset I: s_T(J) < s_{\min} \}.$
That is: An item set is maximal if it is frequent,
but none of its proper supersets is frequent.
Since with this definition we know that
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; I \in M_T(s_{\min}) \;\vee\; \exists J \supset I: s_T(J) \ge s_{\min}$
it follows (can easily be proven by successively extending the item set $I$)
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; \exists J \in M_T(s_{\min}): I \subseteq J.$
That is: Every frequent item set has a maximal superset.
Therefore:
  $\forall s_{\min}: \; F_T(s_{\min}) = \bigcup_{I \in M_T(s_{\min})} 2^I$
Mathematical Excursion: Maximal Elements
Let $R$ be a subset of a partially ordered set $(S, \le)$.
An element $x \in R$ is called maximal or a maximal element of $R$ if
  $\forall y \in R: \; x \le y \Rightarrow x = y.$
The notions minimal and minimal element are defined analogously.
Maximal elements need not be unique,
because there may be elements $x, y \in R$ with neither $x \le y$ nor $y \le x$.
Infinite partially ordered sets need not possess a maximal/minimal element.
Here we consider the set $F_T(s_{\min})$ as a subset of the partially ordered set $(2^B, \subseteq)$:
The maximal (frequent) item sets are the maximal elements of $F_T(s_{\min})$:
  $M_T(s_{\min}) = \{ I \in F_T(s_{\min}) \mid \forall J \in F_T(s_{\min}): I \subseteq J \Rightarrow I = J \}.$
That is, no proper superset of a maximal (frequent) item set is frequent.
Maximal Item Sets: Example
transaction database
   1: a, d, e
   2: b, c, d
   3: a, c, e
   4: a, c, d, e
   5: a, e
   6: a, c, d
   7: b, c
   8: a, c, d, e
   9: b, c, e
  10: a, d, e
frequent item sets (with their support)
  0 items:  ∅: 10
  1 item:   {a}: 7   {b}: 3   {c}: 7   {d}: 6   {e}: 7
  2 items:  {a, c}: 4   {a, d}: 5   {a, e}: 6   {b, c}: 3   {c, d}: 4   {c, e}: 4   {d, e}: 4
  3 items:  {a, c, d}: 3   {a, c, e}: 3   {a, d, e}: 4
The maximal item sets are:
  {b, c}, {a, c, d}, {a, c, e}, {a, d, e}.
Every frequent item set is a subset of at least one of these sets.
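Given the table of frequent item sets, the maximal ones can be filtered by a direct application of the definition. The dictionary `FREQUENT` below simply transcribes the table; the function name `maximal` is an assumption of this sketch, and the quadratic filter is meant for illustration rather than for large collections.

```python
# Frequent item sets of the example with their support (empty set included).
FREQUENT = {frozenset(k): v for k, v in {
    "": 10,
    "a": 7, "b": 3, "c": 7, "d": 6, "e": 7,
    "ac": 4, "ad": 5, "ae": 6, "bc": 3, "cd": 4, "ce": 4, "de": 4,
    "acd": 3, "ace": 3, "ade": 4,
}.items()}

def maximal(frequent):
    """Frequent item sets without a frequent proper superset."""
    return {i for i in frequent if not any(i < j for j in frequent)}

max_sets = maximal(FREQUENT)
# Every frequent item set is a subset of at least one maximal item set.
covered = all(any(i <= m for m in max_sets) for i in FREQUENT)
```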
Hasse Diagram and Maximal Item Sets
Hasse diagram with maximal item sets ($s_{\min} = 3$), for the transaction database
of the preceding example:
  a b c d e
  ab ac ad ae bc bd be cd ce de
  abc abd abe acd ace ade bcd bce bde cde
  abcd abce abde acde bcde
  abcde
[In the original figure, red boxes are maximal item sets
and white boxes are infrequent item sets.]
Limits of Maximal Item Sets
The set of maximal item sets captures the set of all frequent item sets,
but then we know at most the support of the maximal item sets exactly.
About the support of a non-maximal frequent item set we only know:
  $\forall s_{\min}: \forall I \in F_T(s_{\min}) - M_T(s_{\min}): \; s_T(I) \ge \max_{J \in M_T(s_{\min}),\, J \supset I} s_T(J).$
This relation follows immediately from $\forall I: \forall J \supseteq I: s_T(I) \ge s_T(J)$,
that is, an item set cannot have a lower support than any of its supersets.
Note that we have generally
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; s_T(I) \ge \max_{J \in M_T(s_{\min}),\, J \supseteq I} s_T(J).$
Question: Can we find a subset of the set of all frequent item sets,
which also preserves knowledge of all support values?
Closed Item Sets
Consider the set of closed (frequent) item sets:
  $C_T(s_{\min}) = \{ I \subseteq B \mid s_T(I) \ge s_{\min} \wedge \forall J \supset I: s_T(J) < s_T(I) \}.$
That is: An item set is closed if it is frequent,
but none of its proper supersets has the same support.
Since with this definition we know that
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; I \in C_T(s_{\min}) \;\vee\; \exists J \supset I: s_T(J) = s_T(I)$
it follows (can easily be proven by successively extending the item set $I$)
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; \exists J \in C_T(s_{\min}): I \subseteq J.$
That is: Every frequent item set has a closed superset.
Therefore:
  $\forall s_{\min}: \; F_T(s_{\min}) = \bigcup_{I \in C_T(s_{\min})} 2^I$
Closed Item Sets
However, not only has every frequent item set a closed superset,
but it has a closed superset with the same support:
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; \exists J \supseteq I: \; J \in C_T(s_{\min}) \wedge s_T(J) = s_T(I).$
(Proof: see (also) the considerations on the next slide)
The set of all closed item sets preserves knowledge of all support values:
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; s_T(I) = \max_{J \in C_T(s_{\min}),\, J \supseteq I} s_T(J).$
Note that the weaker statement
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; s_T(I) \ge \max_{J \in C_T(s_{\min}),\, J \supseteq I} s_T(J)$
follows immediately from $\forall I: \forall J \supseteq I: s_T(I) \ge s_T(J)$, that is,
an item set cannot have a lower support than any of its supersets.
Closed Item Sets
Alternative characterization of closed (frequent) item sets:
  $I \text{ is closed} \;\Leftrightarrow\; s_T(I) \ge s_{\min} \;\wedge\; I = \bigcap_{k \in K_T(I)} t_k.$
Reminder: $K_T(I) = \{ k \in \{1, \ldots, n\} \mid I \subseteq t_k \}$ is the cover of $I$ w.r.t. $T$.
This is derived as follows: since $\forall k \in K_T(I): I \subseteq t_k$, it is obvious that
  $\forall s_{\min}: \forall I \in F_T(s_{\min}): \; I \subseteq \bigcap_{k \in K_T(I)} t_k.$
If $I \subset \bigcap_{k \in K_T(I)} t_k$, it is not closed, since $\bigcap_{k \in K_T(I)} t_k$ has the same support.
On the other hand, no proper superset of $\bigcap_{k \in K_T(I)} t_k$ has the cover $K_T(I)$.
Note that the above characterization allows us to construct for any item set
the (uniquely determined) closed superset that has the same support.
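The construction of the closed superset can be sketched as follows, using the ten-transaction example database from the maximal item set example. The function names `cover` and `closure` are illustrative, and returning the whole item base for an empty cover is a convention assumed here (the intersection over an empty family), not something the slides state.

```python
DB = {tid: frozenset(t) for tid, t in enumerate(
    ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"],
    start=1)}
B = frozenset("abcde")

def cover(item_set):
    """K_T(I): identifiers of the transactions that contain item_set."""
    return {k for k, t in DB.items() if item_set <= t}

def closure(item_set):
    """Intersection of all transactions in the cover of item_set."""
    tids = cover(item_set)
    if not tids:                       # assumed convention for an empty cover
        return B
    return frozenset.intersection(*(DB[k] for k in tids))

closed_b = closure(frozenset("b"))     # closed superset of {b}
closed_de = closure(frozenset("de"))   # closed superset of {d, e}
```

The closure has the same cover, and hence the same support, as the item set it was computed from.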
Mathematical Excursion: Closure Operators
A closure operator on a set $S$ is a function $cl: 2^S \to 2^S$,
which satisfies the following conditions $\forall X, Y \subseteq S$:
  $X \subseteq cl(X)$   ($cl$ is extensive)
  $X \subseteq Y \Rightarrow cl(X) \subseteq cl(Y)$   ($cl$ is increasing or monotone)
  $cl(cl(X)) = cl(X)$   ($cl$ is idempotent)
A set $R \subseteq S$ is called closed if it is equal to its closure:
  $R \text{ is closed} \;\Leftrightarrow\; R = cl(R).$
The closed (frequent) item sets are induced by the closure operator
  $cl(I) = \bigcap_{k \in K_T(I)} t_k$
restricted to the set of frequent item sets:
  $C_T(s_{\min}) = \{ I \in F_T(s_{\min}) \mid I = cl(I) \}$
Mathematical Excursion: Galois Connections
Let $(X, \preceq_X)$ and $(Y, \preceq_Y)$ be two partially ordered sets.
A function pair $(f_1, f_2)$ with $f_1: X \to Y$ and $f_2: Y \to X$
is called a (monotone) Galois connection iff
  $\forall A_1, A_2 \in X: \; A_1 \preceq_X A_2 \Rightarrow f_1(A_1) \preceq_Y f_1(A_2),$
  $\forall B_1, B_2 \in Y: \; B_1 \preceq_Y B_2 \Rightarrow f_2(B_1) \preceq_X f_2(B_2),$
  $\forall A \in X: \forall B \in Y: \; A \preceq_X f_2(B) \Leftrightarrow B \succeq_Y f_1(A).$
A function pair $(f_1, f_2)$ with $f_1: X \to Y$ and $f_2: Y \to X$
is called an anti-monotone Galois connection iff
  $\forall A_1, A_2 \in X: \; A_1 \preceq_X A_2 \Rightarrow f_1(A_1) \succeq_Y f_1(A_2),$
  $\forall B_1, B_2 \in Y: \; B_1 \preceq_Y B_2 \Rightarrow f_2(B_1) \succeq_X f_2(B_2),$
  $\forall A \in X: \forall B \in Y: \; A \preceq_X f_2(B) \Leftrightarrow B \preceq_Y f_1(A).$
In a monotone Galois connection, both $f_1$ and $f_2$ are monotone,
in an anti-monotone Galois connection, both $f_1$ and $f_2$ are anti-monotone.
Mathematical Excursion: Galois Connections
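Let the two sets $X$ and $Y$ be power sets of some sets $U$ and $V$, respectively,
and let the partial orders be the subset relations on these power sets, that is, let
  $(X, \preceq_X) = (2^U, \subseteq)$ and $(Y, \preceq_Y) = (2^V, \subseteq).$
Then the combination $f_1 \circ f_2: X \to X$ of the functions of a Galois connection
is a closure operator (as well as the combination $f_2 \circ f_1: Y \to Y$).
(Here $f_1 \circ f_2$ denotes applying $f_1$ first, that is, $A \mapsto f_2(f_1(A))$.)
(i) $\forall A \subseteq U: \; A \subseteq f_2(f_1(A))$   (a closure operator is extensive):
  Since $(f_1, f_2)$ is a Galois connection, we know
    $\forall A \subseteq U: \forall B \subseteq V: \; A \subseteq f_2(B) \Leftrightarrow B \subseteq f_1(A).$
  Choose $B = f_1(A)$:
    $\forall A \subseteq U: \; A \subseteq f_2(f_1(A)) \Leftrightarrow \underbrace{f_1(A) \subseteq f_1(A)}_{\text{true}}.$
  Choose $A = f_2(B)$:
    $\forall B \subseteq V: \; \underbrace{f_2(B) \subseteq f_2(B)}_{\text{true}} \Leftrightarrow B \subseteq f_1(f_2(B)).$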
Mathematical Excursion: Galois Connections
(ii) $\forall A_1, A_2 \subseteq U: \; A_1 \subseteq A_2 \Rightarrow f_2(f_1(A_1)) \subseteq f_2(f_1(A_2))$
  (a closure operator is increasing or monotone):
  This property follows immediately from the fact that
  the functions $f_1$ and $f_2$ are both (anti-)monotone.
  If $f_1$ and $f_2$ are both monotone, we have
    $\forall A_1, A_2 \subseteq U: \; A_1 \subseteq A_2 \;\Rightarrow\; f_1(A_1) \subseteq f_1(A_2) \;\Rightarrow\; f_2(f_1(A_1)) \subseteq f_2(f_1(A_2)).$
  If $f_1$ and $f_2$ are both anti-monotone, we have
    $\forall A_1, A_2 \subseteq U: \; A_1 \subseteq A_2 \;\Rightarrow\; f_1(A_1) \supseteq f_1(A_2) \;\Rightarrow\; f_2(f_1(A_1)) \subseteq f_2(f_1(A_2)).$
Mathematical Excursion: Galois Connections
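(iii) $\forall A \subseteq U: \; f_2(f_1(f_2(f_1(A)))) = f_2(f_1(A))$   (a closure operator is idempotent):
  Since both $f_1 \circ f_2$ and $f_2 \circ f_1$ are extensive (see above), we know
    $\forall A \subseteq U: \; A \subseteq f_2(f_1(A)) \subseteq f_2(f_1(f_2(f_1(A))))$
    $\forall B \subseteq V: \; B \subseteq f_1(f_2(B)) \subseteq f_1(f_2(f_1(f_2(B))))$
  Choosing $B = f_1(A')$ with $A' \subseteq U$, we obtain
    $\forall A' \subseteq U: \; f_1(A') \subseteq f_1(f_2(f_1(f_2(f_1(A'))))).$
  Since $(f_1, f_2)$ is a Galois connection, we know
    $\forall A \subseteq U: \forall B \subseteq V: \; A \subseteq f_2(B) \Leftrightarrow B \subseteq f_1(A).$
  Choosing $A = f_2(f_1(f_2(f_1(A'))))$ and $B = f_1(A')$, we obtain
    $\forall A' \subseteq U: \; f_2(f_1(f_2(f_1(A')))) \subseteq f_2(f_1(A')) \;\Leftrightarrow\; \underbrace{f_1(A') \subseteq f_1(f_2(f_1(f_2(f_1(A')))))}_{\text{true (see above)}}.$
  Together with the first inclusion chain above, this yields the claimed equality.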
Galois Connections in Frequent Item Set Mining
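Consider the partially ordered sets $(2^B, \subseteq)$ and $(2^{\{1,\ldots,n\}}, \subseteq)$.
Let $f_1: 2^B \to 2^{\{1,\ldots,n\}}, \; I \mapsto K_T(I) = \{ k \in \{1, \ldots, n\} \mid I \subseteq t_k \}$
and $f_2: 2^{\{1,\ldots,n\}} \to 2^B, \; J \mapsto \bigcap_{j \in J} t_j = \{ i \in B \mid \forall j \in J: i \in t_j \}.$
  $\forall I_1, I_2 \in 2^B: \; I_1 \subseteq I_2 \Rightarrow f_1(I_1) = K_T(I_1) \supseteq K_T(I_2) = f_1(I_2),$
  $\forall J_1, J_2 \in 2^{\{1,\ldots,n\}}: \; J_1 \subseteq J_2 \Rightarrow f_2(J_1) = \bigcap_{k \in J_1} t_k \supseteq \bigcap_{k \in J_2} t_k = f_2(J_2),$
  $\forall I \in 2^B: \forall J \in 2^{\{1,\ldots,n\}}: \; I \subseteq f_2(J) = \bigcap_{j \in J} t_j \;\Leftrightarrow\; J \subseteq f_1(I) = K_T(I).$
As a consequence $f_1 \circ f_2: 2^B \to 2^B, \; I \mapsto \bigcap_{k \in K_T(I)} t_k$ is a closure operator.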
Galois Connections in Frequent Item Set Mining
Likewise $f_2 \circ f_1: 2^{\{1,\ldots,n\}} \to 2^{\{1,\ldots,n\}}, \; J \mapsto K_T(\bigcap_{j \in J} t_j)$
is also a closure operator.
Furthermore, if we restrict our considerations to the respective sets
of closed sets in both domains, that is, to the sets
  $\mathcal{C}_B = \{ I \subseteq B \mid I = f_2(f_1(I)) = \bigcap_{k \in K_T(I)} t_k \}$ and
  $\mathcal{C}_T = \{ J \subseteq \{1, \ldots, n\} \mid J = f_1(f_2(J)) = K_T(\bigcap_{j \in J} t_j) \},$
there exists a 1-to-1 relationship between these two sets,
which is described by the Galois connection:
  $f_1' = f_1|_{\mathcal{C}_B}$ is a bijection with $f_1'^{-1} = f_2' = f_2|_{\mathcal{C}_T}.$
(This follows immediately from the facts that the Galois connection
describes closure operators and that a closure operator is idempotent.)
Therefore finding closed item sets with a given minimum support is equivalent
to finding closed sets of transaction identifiers of a given minimum size.
Closed Item Sets: Example
transaction database
 1: {a, d, e}
 2: {b, c, d}
 3: {a, c, e}
 4: {a, c, d, e}
 5: {a, e}
 6: {a, c, d}
 7: {b, c}
 8: {a, c, d, e}
 9: {b, c, e}
10: {a, d, e}
frequent item sets
0 items   1 item    2 items      3 items
∅: 10     {a}: 7    {a, c}: 4    {a, c, d}: 3
          {b}: 3    {a, d}: 5    {a, c, e}: 3
          {c}: 7    {a, e}: 6    {a, d, e}: 4
          {d}: 6    {b, c}: 3
          {e}: 7    {c, d}: 4
                    {c, e}: 4
                    {d, e}: 4
All frequent item sets are closed with the exception of {b} and {d, e}.
{b} is a subset of {b, c}, both have a support of 3 = 30%.
{d, e} is a subset of {a, d, e}, both have a support of 4 = 40%.
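The two exceptions in this example can be checked with the closure operator of the Galois connection. The following is a minimal sketch, assuming that item sets and transaction index sets fit into bit masks; the names `cover`, `intersect`, `closure` and `EXAMPLE_DB` are illustrative and not taken from the slides.

```c
#include <assert.h>

typedef unsigned int ItemSet;  /* bit i set <=> item i is in the set */
typedef unsigned int TidSet;   /* bit k set <=> transaction k+1 is in the set */

enum { A = 1, B = 2, C = 4, D = 8, E = 16 };  /* items as bit masks */
#define ITEM_BASE (A | B | C | D | E)

/* the example transaction database (transactions 1 to 10) */
const ItemSet EXAMPLE_DB[10] = {
    A|D|E, B|C|D, A|C|E, A|C|D|E, A|E, A|C|D, B|C, A|C|D|E, B|C|E, A|D|E
};

/* f1: item set -> cover K_T(I), the transactions that contain I */
TidSet cover(const ItemSet *t, int n, ItemSet I)
{
    TidSet K = 0;
    assert(n <= 32);                 /* sketch only: indices must fit in a mask */
    for (int k = 0; k < n; k++)
        if ((t[k] & I) == I) K |= 1u << k;
    return K;
}

/* f2: transaction index set -> intersection of these transactions
   (the empty index set yields the whole item base) */
ItemSet intersect(const ItemSet *t, int n, TidSet J)
{
    ItemSet I = ITEM_BASE;
    for (int k = 0; k < n; k++)
        if ((J >> k) & 1u) I &= t[k];
    return I;
}

/* closure operator: apply f1, then f2 */
ItemSet closure(const ItemSet *t, int n, ItemSet I)
{
    return intersect(t, n, cover(t, n, I));
}
```

An item set is closed exactly if `closure` returns it unchanged; for {b} the closure is {b, c}, and for {d, e} it is {a, d, e}.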
Hasse diagram and Closed Item Sets
transaction database
 1: {a, d, e}
 2: {b, c, d}
 3: {a, c, e}
 4: {a, c, d, e}
 5: {a, e}
 6: {a, c, d}
 7: {b, c}
 8: {a, c, d, e}
 9: {b, c, e}
10: {a, d, e}
Red boxes are closed item sets, white boxes infrequent item sets.
Hasse diagram with closed item sets ($s_{\min} = 3$):
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Reminder: Perfect Extensions
The search can be improved with so-called perfect extension pruning.
Given an item set $I$, an item $a \notin I$ is called a perfect extension of $I$,
iff $I$ and $I \cup \{a\}$ have the same support (all transactions containing $I$ contain $a$).
Perfect extensions have the following properties:
If the item $a$ is a perfect extension of an item set $I$,
then $a$ is also a perfect extension of any item set $J \supseteq I$ (as long as $a \notin J$).
If $I$ is a frequent item set and $X$ is the set of all perfect extensions of $I$,
then all sets $I \cup J$ with $J \in 2^X$ (where $2^X$ denotes the power set of $X$)
are also frequent and have the same support as $I$.
This can be exploited by collecting perfect extension items in the recursion,
in a third element of a subproblem description: $S = (D, P, X)$.
Once identified, perfect extension items are no longer processed in the recursion,
but are only used to generate all supersets of the prefix having the same support.
Closed Item Sets and Perfect Extensions
transaction database
 1: {a, d, e}
 2: {b, c, d}
 3: {a, c, e}
 4: {a, c, d, e}
 5: {a, e}
 6: {a, c, d}
 7: {b, c}
 8: {a, c, d, e}
 9: {b, c, e}
10: {a, d, e}
frequent item sets
0 items   1 item    2 items      3 items
∅: 10     {a}: 7    {a, c}: 4    {a, c, d}: 3
          {b}: 3    {a, d}: 5    {a, c, e}: 3
          {c}: 7    {a, e}: 6    {a, d, e}: 4
          {d}: 6    {b, c}: 3
          {e}: 7    {c, d}: 4
                    {c, e}: 4
                    {d, e}: 4
c is a perfect extension of {b} as {b} and {b, c} both have support 3.
a is a perfect extension of {d, e} as {d, e} and {a, d, e} both have support 4.
Non-closed item sets possess at least one perfect extension,
closed item sets do not possess any perfect extensions.
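The relation between perfect extensions and closedness can be tried out on the example database. This is a small sketch, assuming a bit-mask encoding of item sets; `support`, `perfect_extensions` and `EXAMPLE_DB` are hypothetical names chosen for illustration.

```c
#include <assert.h>

typedef unsigned int ItemSet;  /* bit i set <=> item i is in the set */

enum { A = 1, B = 2, C = 4, D = 8, E = 16 };  /* items as bit masks */
#define NUM_ITEMS 5

/* the example transaction database (transactions 1 to 10) */
const ItemSet EXAMPLE_DB[10] = {
    A|D|E, B|C|D, A|C|E, A|C|D|E, A|E, A|C|D, B|C, A|C|D|E, B|C|E, A|D|E
};

/* absolute support s_T(I): number of transactions that contain I */
int support(const ItemSet *t, int n, ItemSet I)
{
    int s = 0;
    for (int k = 0; k < n; k++)
        if ((t[k] & I) == I) s++;
    return s;
}

/* set of all items a not in I with s_T(I + a) == s_T(I) */
ItemSet perfect_extensions(const ItemSet *t, int n, ItemSet I)
{
    ItemSet X = 0;
    int s = support(t, n, I);
    for (int i = 0; i < NUM_ITEMS; i++) {
        ItemSet a = 1u << i;
        if (!(I & a) && support(t, n, I | a) == s)
            X |= a;                  /* a occurs in every transaction with I */
    }
    return X;
}
```

An empty result means that the item set has no perfect extension and is therefore closed.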
Relation of Maximal and Closed Item Sets
[Figure: two sketches, each spanning from the empty set to the item base, one showing the maximal (frequent) item sets and one showing the closed (frequent) item sets]
The set of closed item sets is the union of the sets of maximal item sets
for all minimum support values at least as large as $s_{\min}$:
$C_T(s_{\min}) = \bigcup_{s \in \{s_{\min}, s_{\min}+1, \ldots, n-1, n\}} M_T(s)$
Types of Frequent Item Sets: Summary
Frequent Item Set
Any frequent item set (support is at least the minimum support):
$I$ frequent $\Leftrightarrow s_T(I) \ge s_{\min}$
Closed (Frequent) Item Set
A frequent item set is called closed if no superset has the same support:
$I$ closed $\Leftrightarrow s_T(I) \ge s_{\min} \wedge \forall J \supset I: s_T(J) < s_T(I)$
Maximal (Frequent) Item Set
A frequent item set is called maximal if no superset is frequent:
$I$ maximal $\Leftrightarrow s_T(I) \ge s_{\min} \wedge \forall J \supset I: s_T(J) < s_{\min}$
Obvious relations between these types of item sets:
All maximal item sets and all closed item sets are frequent.
All maximal item sets are closed.
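The three definitions can be turned into direct checks. The sketch below assumes bit-mask item sets and relies on the fact that support never increases when items are added, so it is enough to look at one-item extensions; all function names are illustrative.

```c
#include <assert.h>

typedef unsigned int ItemSet;  /* bit i set <=> item i is in the set */

enum { A = 1, B = 2, C = 4, D = 8, E = 16 };  /* items as bit masks */
#define NUM_ITEMS 5

/* the example transaction database (transactions 1 to 10) */
const ItemSet EXAMPLE_DB[10] = {
    A|D|E, B|C|D, A|C|E, A|C|D|E, A|E, A|C|D, B|C, A|C|D|E, B|C|E, A|D|E
};

/* absolute support s_T(I) */
int support(const ItemSet *t, int n, ItemSet I)
{
    int s = 0;
    for (int k = 0; k < n; k++)
        if ((t[k] & I) == I) s++;
    return s;
}

/* largest support among all one-item extensions of I (0 if there is none) */
int best_extension_support(const ItemSet *t, int n, ItemSet I)
{
    int best = 0;
    for (int i = 0; i < NUM_ITEMS; i++) {
        ItemSet a = 1u << i;
        if (I & a) continue;         /* a is already contained in I */
        int s = support(t, n, I | a);
        if (s > best) best = s;
    }
    return best;
}

int is_frequent(const ItemSet *t, int n, ItemSet I, int smin)
{
    return support(t, n, I) >= smin;
}

/* closed: frequent and every proper superset has a smaller support */
int is_closed(const ItemSet *t, int n, ItemSet I, int smin)
{
    return is_frequent(t, n, I, smin)
        && best_extension_support(t, n, I) < support(t, n, I);
}

/* maximal: frequent and no proper superset is frequent */
int is_maximal(const ItemSet *t, int n, ItemSet I, int smin)
{
    return is_frequent(t, n, I, smin)
        && best_extension_support(t, n, I) < smin;
}
```

With a minimum support of 3 on the example database, {b} is frequent but not closed, {a} is closed but not maximal, and {a, d, e} is maximal.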
Types of Frequent Item Sets: Summary
0 items   1 item     2 items       3 items
∅+: 10    {a}+: 7    {a, c}+: 4    {a, c, d}+*: 3
          {b}: 3     {a, d}+: 5    {a, c, e}+*: 3
          {c}+: 7    {a, e}+: 6    {a, d, e}+*: 4
          {d}+: 6    {b, c}+*: 3
          {e}+: 7    {c, d}+: 4
                     {c, e}+: 4
                     {d, e}: 4
Frequent Item Set
Any frequent item set (support is at least the minimum support).
Closed (Frequent) Item Set (marked with +)
A frequent item set is called closed if no superset has the same support.
Maximal (Frequent) Item Set (marked with *)
A frequent item set is called maximal if no superset is frequent.
Experiments: Data Sets (Reminder)
Chess
A data set listing chess end game positions for king vs. king and rook.
This data set is part of the UCI machine learning repository.
75 items, 3196 transactions
(average transaction size: 37, density: ≈ 0.5)
Census
A data set derived from an extract of the US census bureau data of 1994,
which was preprocessed by discretizing numeric attributes.
This data set is part of the UCI machine learning repository.
135 items, 48842 transactions
(average transaction size: 14, density: ≈ 0.1)
The density of a transaction database is the average fraction of all items occurring
per transaction: density = average transaction size / number of items.
Experiments: Data Sets (Reminder)
T10I4D100K
An artificial data set generated with IBM's data generator.
The name is formed from the parameters given to the generator
(for example: 100K = 100000 transactions).
870 items, 100000 transactions
(average transaction size: ≈ 10.1, density: ≈ 0.012)
BMS-Webview-1
A web click stream from a leg-care company that no longer exists.
It has been used in the KDD cup 2000 and is a popular benchmark.
497 items, 59602 transactions
(average transaction size: ≈ 2.5, density: ≈ 0.005)
The density of a transaction database is the average fraction of all items occurring
per transaction: density = average transaction size / number of items.
Types of Frequent Item Sets: Experiments
[Figure: four panels (chess, T10I4D100K, census, webview1), each with curves for frequent, closed and maximal item sets]
Decimal logarithm of the number of item sets over absolute minimum support.
Reminder: Perfect Extension Pruning
[Figure: four panels (chess, T10I4D100K, census, webview1), each with curves for apriori, eclat and fpgrowth and a reference "w/o pep"]
Decimal logarithm of execution time in seconds over absolute minimum support.
Searching for Closed and Maximal Item Sets
Searching for Closed Frequent Item Sets
We know that it suffices to find the closed item sets together with their support:
from them all frequent item sets and their support can be retrieved.
The characterization of closed item sets by
$I$ closed $\Leftrightarrow s_T(I) \ge s_{\min} \wedge I = \bigcap_{k \in K_T(I)} t_k$
suggests to find them by forming all possible intersections of the transactions
(with at least $s_{\min}$ transactions) and checking their support.
However, on standard data sets, approaches using this idea
are rarely competitive with other methods.
Special cases in which they are competitive are domains
with few transactions and very many items.
Examples of such domains are gene expression analysis
and the analysis of document collections.
Carpenter
[Pan, Cong, Tung, Yang, and Zaki 2003]
Carpenter: Enumerating Transaction Sets
The Carpenter algorithm implements the intersection approach by enumerating
sets of transactions (or, equivalently, sets of transaction indices), intersecting them,
and removing/pruning possible duplicates.
This is done with basically the same divide-and-conquer scheme as for the
item set enumeration approaches, only that it is applied to transactions (that is,
items and transactions exchange their meaning [Rioult et al. 2003]).
The task to enumerate all transaction index sets is split into two sub-tasks:
enumerate all transaction index sets that contain the index 1
enumerate all transaction index sets that do not contain the index 1.
These sub-tasks are then further divided w.r.t. the transaction index 2:
enumerate all transaction index sets containing
both indices 1 and 2, index 2, but not index 1,
index 1, but not index 2, neither index 1 nor index 2,
and so on recursively.
Carpenter: Enumerating Transaction Sets
Formally, all subproblems in the recursion can be described by triples $S = (I, K, k)$.
$K \subseteq \{1, \ldots, n\}$ is a set of transaction indices,
$I = \bigcap_{k \in K} t_k$ is their intersection, and
$k$ is a transaction index, namely the index of the next transaction to consider.
The initial problem, with which the recursion is started, is $S = (B, \emptyset, 1)$,
where $B$ is the item base and no transactions have been intersected yet.
A subproblem $S_0 = (I_0, K_0, k_0)$ is processed as follows:
Let $K_1 = K_0 \cup \{k_0\}$ and form the intersection $I_1 = I_0 \cap t_{k_0}$.
If $I_1 = \emptyset$, do nothing (return from recursion).
If $|K_1| \ge s_{\min}$, and there is no transaction $t_j$ with $j \in \{1, \ldots, n\} - K_1$
such that $I_1 \subseteq t_j$, report $I_1$ with support $s_T(I_1) = |K_1|$.
Let $k_1 = k_0 + 1$. If $k_1 \le n$, then form the subproblems
$S_1 = (I_1, K_1, k_1)$ and $S_2 = (I_0, K_0, k_1)$ and process them recursively.
Carpenter: List-based Implementation
Transaction identifier lists are used to represent the current item set $I$
(vertical transaction representation, as in the Eclat algorithm).
The intersection consists in collecting all lists with the next transaction index $k$.
Example: transaction database
t1: a b c
t2: a d e
t3: b c d
t4: a b c d
t5: b c
t6: a b d
t7: d e
t8: c d e
transaction identifier lists
a: 1 2 4 6
b: 1 3 4 5 6
c: 1 3 4 5 8
d: 2 3 4 6 7 8
e: 2 7 8
collection for K = {1}
a: 2 4 6
b: 3 4 5 6
c: 3 4 5 8
collection for K = {1, 2}, {1, 3}
a: 4 6
b: 4 5 6
c: 4 5 8
Carpenter: Table-/Matrix-based Implementation
Represent the data set by an $n \times |B|$ matrix $M$ as follows [Borgelt et al. 2011]:
$m_{ki} = 0$, if item $i \notin t_k$,
$m_{ki} = |\{j \in \{k, \ldots, n\} \mid i \in t_j\}|$, otherwise.
Example: transaction database
t1: a b c
t2: a d e
t3: b c d
t4: a b c d
t5: b c
t6: a b d
t7: d e
t8: c d e
matrix representation
     a  b  c  d  e
t1   4  5  5  0  0
t2   3  0  0  6  3
t3   0  4  4  5  0
t4   2  3  3  4  0
t5   0  2  2  0  0
t6   1  1  0  3  0
t7   0  0  0  2  2
t8   0  0  1  1  1
The current item set $I$ is simply represented by the contained items.
An intersection collects all items $i \in I$ with $m_{ki} > \max\{0, s_{\min} - |K| - 1\}$.
Carpenter: Duplicate Removal
The intersection of several transaction index sets can yield the same item set.
The support of the item set is the size of the largest transaction index set
that yields the item set; smaller transaction index sets can be skipped/ignored.
This is the reason for the check whether there exists a transaction $t_j$
with $j \in \{1, \ldots, n\} - K_1$ such that $I_1 \subseteq t_j$.
This check is split into the two checks whether there exists such a transaction $t_j$
with $j > k_0$ and
with $j \in \{1, \ldots, k_0 - 1\} - K_0$.
$\mathcal{C}_\emptyset(1) = \emptyset$,
$\mathcal{C}_{T \cup \{t\}}(1) = \mathcal{C}_T(1) \cup \{t\} \cup \{I \mid \exists s \in \mathcal{C}_T(1): I = s \cap t\}$.
As a consequence, we can start the procedure with an empty set of closed item
sets and then process the transactions one by one.
In each step update the set of closed item sets by adding the new transaction $t$ itself
and the additional closed item sets that result from intersecting it with $\mathcal{C}_T(1)$.
In addition, the support of already known closed item sets may have to be updated.
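The cumulative scheme can be illustrated with a plain array as repository. This is a simplified sketch and deliberately not the prefix tree used by Ista: item sets are bit masks, lookups are linear, and `Repo`, `ista_add` and `repo_support` are hypothetical names. The support of an intersection is taken as one plus the largest support of a stored set that produces it.

```c
#include <assert.h>

typedef unsigned int ItemSet;  /* bit i set <=> item i is in the set */

enum { A = 1, B = 2, C = 4, D = 8, E = 16 };  /* items as bit masks */
#define REPO_CAP 256

typedef struct { ItemSet items; int supp; } Closed;
typedef struct { Closed set[REPO_CAP]; int count; } Repo;

/* support of a stored closed item set, 0 if it is not in the repository */
int repo_support(const Repo *r, ItemSet items)
{
    for (int j = 0; j < r->count; j++)
        if (r->set[j].items == items) return r->set[j].supp;
    return 0;
}

/* insert an item set or raise its support to supp, whichever applies */
void repo_upsert(Repo *r, ItemSet items, int supp)
{
    for (int j = 0; j < r->count; j++)
        if (r->set[j].items == items) {
            if (r->set[j].supp < supp) r->set[j].supp = supp;
            return;
        }
    assert(r->count < REPO_CAP);
    r->set[r->count].items = items;
    r->set[r->count].supp = supp;
    r->count++;
}

/* process one new transaction t */
void ista_add(Repo *r, ItemSet t)
{
    int old = r->count;
    int base[REPO_CAP];                  /* supports before this step */
    if (t == 0) return;                  /* ignore empty transactions */
    for (int j = 0; j < old; j++) base[j] = r->set[j].supp;
    repo_upsert(r, t, 1);                /* the transaction itself */
    for (int s = 0; s < old; s++) {      /* intersections with stored sets */
        ItemSet I = r->set[s].items & t;
        if (I != 0) repo_upsert(r, I, base[s] + 1);
    }
}
```

Saving the old supports in `base` plays the role that the step marker plays in the prefix tree: a support must not be incremented twice for the same transaction.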
Ista: Cumulative Transaction Intersections
The core implementation problem is to find a data structure for storing the
closed item sets that allows to quickly compute the intersections with a new
transaction and to merge the result with the already stored closed item sets.
For this we rely on a prefix tree, each node of which represents an item set.
The algorithm works on the prefix tree as follows:
At the beginning an empty tree is created (dummy root node);
then the transactions are processed one by one.
Each new transaction is first simply added to the prefix tree.
Any new nodes created in this step are initialized with a support of zero.
In the next step we compute the intersections of the new transaction
with all item sets represented by the current prefix tree.
A recursive procedure traverses the prefix tree selectively (depth-first) and
matches the items in the tree nodes with the items of the transaction.
Intersecting with and inserting into the tree can be combined.
Ista: Cumulative Transaction Intersections
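transaction database
t1: e c a
t2: e d b
t3: d c b a
[Figure: prefix tree after each processing step; nodes are written as item:support, the number after the step label is the value shown at the root]
0: 0
1: 1; e:1 - c:1 - a:1
2: 2; e:2 with branches d:1 - b:1 and c:1 - a:1
3.1: 2; e:2 (as in step 2); d:0 - c:0 - b:0 - a:0
3.2: 2; e:2 (as in step 2); d:2 with branches c:1 - b:1 - a:1 and b:2
3.3: 2; e:2 (as in step 2); d:2 (as in step 3.2); c:2 - a:2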
Ista: Data Structure
typedef struct node {         /* a prefix tree node */
  int step;                   /* most recent update step */
  int item;                   /* assoc. item (last in set) */
  int supp;                   /* support of item set */
  struct node *sibling;       /* successor in sibling list */
  struct node *children;      /* list of child nodes */
} NODE;
Standard first child / right sibling node structure.
Fixed size of each node allows for optimized allocation.
Flexible structure that can easily be extended.
The "step" field indicates whether the support field was already updated.
The step field is an "incremental marker", so that it need not be cleared
in a separate traversal of the prefix tree.
Ista: Pseudo-Code
#include <stdlib.h>             /* malloc, abort */
/* State that the slide leaves implicit (assumed declarations): */
static int *trans;              /* trans[i] != 0 iff item i is in the current transaction */
static int  step;               /* index of the transaction being processed */
static int  imin;               /* smallest item of the current transaction */

void isect (NODE *node, NODE **ins)
{                               /* intersect with transaction */
  int  i;                       /* buffer for current item */
  NODE *d;                      /* to allocate new nodes */
  while (node) {                /* traverse the sibling list */
    i = node->item;             /* get the current item */
    if (trans[i]) {             /* if item is in intersection */
      while ((d = *ins) && (d->item > i))
        ins = &d->sibling;      /* find the insertion position */
      if (d                     /* if an intersection node with */
      &&  (d->item == i)) {     /* the item already exists */
        if (d->step >= step) d->supp--;
        if (d->supp < node->supp)
          d->supp = node->supp;
        d->supp++;              /* update intersection support */
        d->step = step; }       /* and set current update step */
      else {                    /* if there is no corresp. node */
        d = malloc(sizeof(NODE));
        if (!d) abort();        /* out of memory (check added) */
        d->step = step;         /* create a new node and */
        d->item = i;            /* set item and support */
        d->supp = node->supp+1;
        d->sibling = *ins; *ins = d;
        d->children = NULL;
      }                         /* insert node into the tree */
      if (i <= imin) return;    /* if beyond last item, abort */
      isect(node->children, &d->children); }
    else {                      /* if item is not in intersection */
      if (i <= imin) return;    /* if beyond last item, abort */
      isect(node->children, ins);
    }                           /* intersect with subtree */
    node = node->sibling;       /* go to the next sibling */
  }                             /* end of while (node) */
}  /* isect() */
Ista: Keeping the Repository Small
In practice we will not work with a minimum support $s_{\min} = 1$.
Removing intersections early, because they do not reach the minimum support,
is difficult: in principle, enough of the transactions to be processed in the future
could contain the item set under consideration.
Improved processing with item occurrence counters:
In an initial pass the frequency of the individual items is determined.
The obtained counters are updated with each processed transaction.
They always represent the item occurrences in the unprocessed transactions.
Based on these counters, we can apply the following pruning scheme:
Suppose that after having processed $k$ of a total of $n$ transactions
the support of a closed item set $I$ is $s_{T_k}(I) = x$.
Let $y$ be the minimum of the counter values for the items contained in $I$.
If $x + y < s_{\min}$, then $I$ can be discarded, because it cannot reach $s_{\min}$.
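The pruning test itself is a one-line comparison once the counters are available. A minimal sketch, assuming bit-mask item sets and an array `remaining` that holds, for each item, its number of occurrences in the not yet processed transactions (both names are illustrative):

```c
#include <assert.h>

typedef unsigned int ItemSet;  /* bit i set <=> item i is in the set */

enum { A = 1, B = 2, C = 4, D = 8, E = 16 };  /* items as bit masks */

/* returns 1 if the closed item set I with current support x can be
   discarded, because even in the best case it cannot reach smin */
int can_discard(ItemSet I, int x, const int *remaining, int nitems, int smin)
{
    int y = -1;                          /* minimum counter over items in I */
    for (int i = 0; i < nitems; i++)
        if (((I >> i) & 1u) && (y < 0 || remaining[i] < y))
            y = remaining[i];
    if (y < 0) return 0;                 /* empty item set: never discard */
    return x + y < smin;
}
```

The minimum counter is an upper bound on how often the whole item set can still occur, since every future occurrence needs all of its items.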
$\bigcap_{k \in K_T(H)} (t_k - H) \neq \emptyset$
If either is the case, $H$ is not closed, otherwise it is.
Note that with the latter condition, the intersection can be computed transaction
by transaction. It can be concluded that $H$ is closed as soon as the intersection
becomes empty.
Maximal Item Sets:
Check whether $\exists a \in E: s_T(H \cup \{a\}) \ge s_{\min}$.
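Let $i^*$ be the item satisfying
$s_T(\{i \in I \mid i < i^*\}) > s_T(I)$ and $s_T(\{i \in I \mid i \le i^*\}) = s_T(I)$.
Intuitively, the item $i^*$ [...]
Let $I^* = \{i \in I \mid i < i^*\}$ and $X_T(I) = \{i \in B - I \mid s_T(I \cup \{i\}) = s_T(I)\}$.
Then the canonical parent $p_C(I)$ of $I$ is the item set
$p_C(I) = I^* \cup \{i \in X_T(I^*) \mid i > i^*\}$.
Intuitively, to find the canonical parent of the item set $I$, the reduced item set $I^*$
[...] $\{i \in X_T(I^*) \mid i > i^*\}$,
because a perfect extension of $I^*$ [...],
since $K_T(I^*$ [...]$)$ [...] $K_T(I^*$ [...]$)$.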
For the recursive search, the following formulation is useful:
Let $I \subseteq B$ be a closed item set. The canonical children of $I$ (that is,
the closed item sets that have $I$ as their canonical parent) are the item sets
$J = I \cup \{i\} \cup \{j \in X_T(I \cup \{i\}) \mid j > i\}$
with $\forall j \in I: i > j$ and $\{j \in X_T(I \cup \{i\}) \mid j < i\} = X_T(J) = \emptyset$.
The union with $\{j \in X_T(I \cup \{i\}) \mid j > i\}$
represents perfect extension or parent equivalence pruning:
all perfect extensions in the tail of $I \cup \{i\}$ are immediately added.
The condition $\{j \in X_T(I \cup \{i\}) \mid j < i\} = \emptyset$ expresses
that there must not be any perfect extensions among the eliminated items.
Additional Frequent Item Set Filtering
Additional Frequent Item Set Filtering
General problem of frequent item set mining:
The number of frequent item sets, even the number of closed or maximal item
sets, can exceed the number of transactions in the database by far.
Therefore: Additional filtering is necessary to find
the "relevant" or "interesting" frequent item sets.
General idea: Compare support to expectation.
Item sets consisting of items that appear frequently
are likely to have a high support.
However, this is not surprising:
we expect this even if the occurrence of the items is independent.
Additional filtering should remove item sets with a support
close to the support expected from an independent occurrence.
Additional Frequent Item Set Filtering
Full Independence
Evaluate item sets with
$\varrho_{\mathrm{fi}}(I) = \frac{s_T(I) \cdot n^{|I|-1}}{\prod_{a \in I} s_T(a)} = \frac{\hat{p}_T(I)}{\prod_{a \in I} \hat{p}_T(a)}$
and require a minimum value for this measure.
($\hat{p}_T$ is the probability estimate based on $T$.)
Assumes full independence of the items in order
to form an expectation about the support of an item set.
Advantage: Can be computed from only the support of the item set
and the support values of the individual items.
Disadvantage: If some item set $I$ scores high on this measure,
then all $J \supset I$ are also likely to score high,
even if the items in $J - I$ are independent of $I$.
Additional Frequent Item Set Filtering
Incremental Independence
Evaluate item sets with
$\varrho_{\mathrm{ii}}(I) = \min_{a \in I} \frac{n \, s_T(I)}{s_T(I - \{a\}) \cdot s_T(a)} = \min_{a \in I} \frac{\hat{p}_T(I)}{\hat{p}_T(I - \{a\}) \cdot \hat{p}_T(a)}$
and require a minimum value for this measure.
($\hat{p}_T$ is the probability estimate based on $T$.)
Advantage: If $I$ contains independent items,
the minimum ensures a low value.
Disadvantages: We need to know the support values of all subsets $I - \{a\}$.
If there exist high scoring independent subsets $I_1$ and $I_2$
with $|I_1| > 1$, $|I_2| > 1$, $I_1 \cap I_2 = \emptyset$ and $I_1 \cup I_2 = I$,
the item set $I$ still receives a high evaluation.
Additional Frequent Item Set Filtering
Subset Independence
Evaluate item sets with
$\varrho_{\mathrm{si}}(I) = \min_{J \subset I, J \neq \emptyset} \frac{n \, s_T(I)}{s_T(I - J) \cdot s_T(J)} = \min_{J \subset I, J \neq \emptyset} \frac{\hat{p}_T(I)}{\hat{p}_T(I - J) \cdot \hat{p}_T(J)}$
and require a minimum value for this measure.
($\hat{p}_T$ is the probability estimate based on $T$.)
Advantage: Detects all cases where a decomposition is possible
and evaluates them with a low value.
Disadvantages: We need to know the support values of all proper subsets $J$.
Improvement: Use incremental independence and in the minimum consider
only items $\{a\}$ for which $I - \{a\}$ has been evaluated high.
This captures subset independence "incrementally".
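The three measures differ only in which decompositions of the item set enter the minimum. The sketch below computes them by brute force on the small example database from the closed item set slides; it assumes bit-mask item sets, recomputes supports by scanning the database, and uses illustrative function names (`rho_fi`, `rho_ii`, `rho_si`).

```c
#include <assert.h>

typedef unsigned int ItemSet;  /* bit i set <=> item i is in the set */

enum { A = 1, B = 2, C = 4, D = 8, E = 16 };  /* items as bit masks */
#define NUM_ITEMS 5

/* the example transaction database (transactions 1 to 10) */
const ItemSet EXAMPLE_DB[10] = {
    A|D|E, B|C|D, A|C|E, A|C|D|E, A|E, A|C|D, B|C, A|C|D|E, B|C|E, A|D|E
};

/* absolute support s_T(I) */
int support(const ItemSet *t, int n, ItemSet I)
{
    int s = 0;
    for (int k = 0; k < n; k++)
        if ((t[k] & I) == I) s++;
    return s;
}

/* n * s(I) / (s(I - J) * s(J)) for a non-empty subset J of I */
double split_ratio(const ItemSet *t, int n, ItemSet I, ItemSet J)
{
    return (double)n * support(t, n, I)
         / ((double)support(t, n, I & ~J) * support(t, n, J));
}

/* full independence: p(I) divided by the product of the item probabilities */
double rho_fi(const ItemSet *t, int n, ItemSet I)
{
    double r = (double)support(t, n, I) / n;
    assert(support(t, n, I) > 0);        /* all item supports are then > 0 */
    for (int i = 0; i < NUM_ITEMS; i++)
        if ((I >> i) & 1u) r /= (double)support(t, n, 1u << i) / n;
    return r;
}

/* incremental independence: minimum over splitting off a single item */
double rho_ii(const ItemSet *t, int n, ItemSet I)
{
    double best = -1.0;
    assert(I != 0 && support(t, n, I) > 0);
    for (int i = 0; i < NUM_ITEMS; i++) {
        ItemSet a = 1u << i;
        if (!(I & a)) continue;
        double r = split_ratio(t, n, I, a);
        if (best < 0.0 || r < best) best = r;
    }
    return best;
}

/* subset independence: minimum over all non-empty proper subsets J */
double rho_si(const ItemSet *t, int n, ItemSet I)
{
    double best = -1.0;
    assert((I & (I - 1)) != 0);          /* needs at least two items */
    assert(support(t, n, I) > 0);
    for (ItemSet J = (I - 1) & I; J != 0; J = (J - 1) & I) {
        double r = split_ratio(t, n, I, J);
        if (best < 0.0 || r < best) best = r;
    }
    return best;
}
```

For item sets with at most three items the last two measures coincide, because every split then has a single item on one side; differences only show up from four items on.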
Summary Frequent Item Set Mining
With a canonical form of an item set the Hasse diagram
can be turned into a much simpler prefix tree
(divide-and-conquer scheme using conditional databases).
Item set enumeration algorithms differ in:
the traversal order of the prefix tree:
(breadth-first/levelwise versus depth-first traversal)
the transaction representation:
horizontal (item arrays) versus vertical (transaction lists)
versus specialized data structures like FP-trees
the types of frequent item sets found:
frequent versus closed versus maximal item sets
(additional pruning methods for closed and maximal item sets)
An alternative are transaction set enumeration or intersection algorithms.
Additional filtering is necessary to reduce the size of the output.
Example Application:
Finding Neuron Assemblies in Neural Spike Data
Biological Background
Structure of a prototypical neuron
[Figure: neuron with the labels cell core, axon, myelin sheath, cell body (soma), terminal button, synapsis, dendrites]
Biological Background
$c_T(R) = \frac{\sigma_T(X \cup Y)}{\sigma_T(X)} = \frac{s_T(X \cup Y)}{s_T(X)} = \frac{s_T(I)}{s_T(X)}$
The confidence can be seen as an estimate of $P(Y \mid X)$.
Association Rules: Formal Denition
Given:
a set $B = \{i_1, \ldots, i_m\}$ of items,
a vector $T = (t_1, \ldots, t_n)$ of transactions over $B$,
a real number $\sigma_{\min}$, $0 < \sigma_{\min} \le 1$, the minimum support,
a real number $c_{\min}$, $0 < c_{\min} \le 1$, the minimum confidence.
Desired:
the set of all association rules, that is, the set
$\mathcal{R} = \{R: X \to Y \mid \sigma_T(R) \ge \sigma_{\min} \wedge c_T(R) \ge c_{\min}\}$.
General Procedure:
Find the frequent item sets.
Construct rules and filter them w.r.t. $\sigma_{\min}$ and $c_{\min}$.
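The filter step for a single candidate rule X -> Y can be sketched as follows. The sketch assumes bit-mask item sets and takes the rule support to be the relative support of X together with Y; note that the census examples later in these slides report the support of the antecedent instead. The names `confidence` and `rule_ok` are illustrative.

```c
#include <assert.h>

typedef unsigned int ItemSet;  /* bit i set <=> item i is in the set */

enum { A = 1, B = 2, C = 4, D = 8, E = 16 };  /* items as bit masks */

/* the example transaction database (transactions 1 to 10) */
const ItemSet EXAMPLE_DB[10] = {
    A|D|E, B|C|D, A|C|E, A|C|D|E, A|E, A|C|D, B|C, A|C|D|E, B|C|E, A|D|E
};

/* absolute support s_T(I) */
int support(const ItemSet *t, int n, ItemSet I)
{
    int s = 0;
    for (int k = 0; k < n; k++)
        if ((t[k] & I) == I) s++;
    return s;
}

/* confidence of the rule X -> Y: s(X and Y together) / s(X) */
double confidence(const ItemSet *t, int n, ItemSet X, ItemSet Y)
{
    int sx = support(t, n, X);
    assert(sx > 0);                      /* antecedent must occur at all */
    return (double)support(t, n, X | Y) / sx;
}

/* 1 if the rule X -> Y passes both thresholds
   (assumption: rule support = relative support of X | Y) */
int rule_ok(const ItemSet *t, int n, ItemSet X, ItemSet Y,
            double sigma_min, double c_min)
{
    double sigma = (double)support(t, n, X | Y) / n;
    return sigma >= sigma_min && confidence(t, n, X, Y) >= c_min;
}
```

On the example database the rule {b} -> {c} has confidence 1, because every transaction with b also contains c.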
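[...] $\sigma_T(Y)$
(Absolute) difference of lift value to 1:
$q_T(R) = \left| \frac{c_T(X \to Y)}{\sigma_T(Y)} - 1 \right|$
$1 - \min\left\{ \frac{c_T(X \to Y)}{\sigma_T(Y)}, \frac{\sigma_T(Y)}{c_T(X \to Y)} \right\}$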
$H = -\sum_{i=1}^{n} p_i \log_2 p_i$ (Shannon 1948)
$I_{\mathrm{gain}}(X, Y) = H(Y) - H(Y|X) = -\sum_{i=1}^{k_Y} p_{i.} \log_2 p_{i.} - \sum_{j=1}^{k_X} p_{.j} \left( -\sum_{i=1}^{k_Y} p_{i|j} \log_2 p_{i|j} \right)$
$H(Y)$: Entropy of the distribution of $Y$
$H(Y|X)$: Expected entropy of the distribution of $Y$
if the value of $X$ becomes known
$H(Y) - H(Y|X)$: Expected entropy reduction or information gain
Interpretation of Shannon Entropy
Let $S = \{s_1, \ldots, s_n\}$ be a finite set of alternatives
having positive probabilities $P(s_i)$, $i = 1, \ldots, n$, satisfying $\sum_{i=1}^{n} P(s_i) = 1$.
Shannon Entropy:
$H(S) = -\sum_{i=1}^{n} P(s_i) \log_2 P(s_i)$
Intuitively: Expected number of yes/no questions that have
to be asked in order to determine the obtaining alternative.
Suppose there is an oracle, which knows the obtaining alternative,
but responds only if the question can be answered with "yes" or "no".
A better question scheme than asking for one alternative after the other
can easily be found: Divide the set into two subsets of about equal size.
Ask for containment in an arbitrarily chosen subset.
Apply this scheme recursively: number of questions bounded by $\lceil \log_2 n \rceil$.
Question/Coding Schemes
$P(s_1) = 0.10$, $P(s_2) = 0.15$, $P(s_3) = 0.16$, $P(s_4) = 0.19$, $P(s_5) = 0.40$
Shannon entropy: $-\sum_i P(s_i) \log_2 P(s_i) = 2.15$ bit/symbol
Linear Traversal
[Figure: question tree that splits off one alternative at a time: {s1, ..., s5}, {s2, ..., s5}, {s3, s4, s5}, {s4, s5}]
alternative:   s1    s2    s3    s4    s5
probability:   0.10  0.15  0.16  0.19  0.40
questions:     1     2     3     4     4
Code length: 3.24 bit/symbol
Code efficiency: 0.664
Equal Size Subsets
[Figure: question tree {s1, ..., s5} split into {s1, s2} (0.25) and {s3, s4, s5} (0.75), the latter containing {s4, s5} (0.59)]
alternative:   s1    s2    s3    s4    s5
probability:   0.10  0.15  0.16  0.19  0.40
questions:     2     2     2     3     3
Code length: 2.59 bit/symbol
Code efficiency: 0.830
Question/Coding Schemes
Splitting into subsets of about equal size can lead to a bad arrangement
of the alternatives into subsets, resulting in a high expected number of questions.
Good question schemes take the probability of the alternatives into account.
Shannon-Fano Coding (1948)
Build the question/coding scheme top-down.
Sort the alternatives w.r.t. their probabilities.
Split the set so that the subsets have about equal probability
(splits must respect the probability order of the alternatives).
Huffman Coding (1952)
Build the question/coding scheme bottom-up.
Start with one element sets.
Always combine those two sets that have the smallest probabilities.
Question/Coding Schemes
$P(s_1) = 0.10$, $P(s_2) = 0.15$, $P(s_3) = 0.16$, $P(s_4) = 0.19$, $P(s_5) = 0.40$
Shannon entropy: $-\sum_i P(s_i) \log_2 P(s_i) = 2.15$ bit/symbol
Shannon-Fano Coding (1948)
[Figure: question tree {s1, ..., s5} split into {s1, s2, s3} (0.41) and {s4, s5} (0.59), the former containing {s1, s2} (0.25)]
alternative:   s1    s2    s3    s4    s5
probability:   0.10  0.15  0.16  0.19  0.40
questions:     3     3     2     2     2
Code length: 2.25 bit/symbol
Code efficiency: 0.955
Huffman Coding (1952)
[Figure: question tree {s1, ..., s5} containing {s1, s2, s3, s4} (0.60), which is split into {s1, s2} (0.25) and {s3, s4} (0.35)]
alternative:   s1    s2    s3    s4    s5
probability:   0.10  0.15  0.16  0.19  0.40
questions:     3     3     3     3     1
Code length: 2.20 bit/symbol
Code efficiency: 0.977
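The bottom-up construction can be traced with a short routine that only tracks how deep each alternative ends up. This is a sketch under simple assumptions (at most 32 alternatives, quadratic search for the two smallest groups); `huffman_lengths` is an illustrative name.

```c
#include <assert.h>

#define MAX_SYM 32

/* Huffman construction: repeatedly merge the two groups with the smallest
   probabilities.  Writes the number of questions (code length) of each
   alternative to len[] and returns the expected code length. */
double huffman_lengths(const double *p, int n, int *len)
{
    double w[MAX_SYM];         /* probability of each group */
    int group[MAX_SYM];        /* group that each alternative belongs to */
    int active[MAX_SYM];       /* 1 while the group has not been merged away */
    double expected = 0.0;

    assert(n >= 1 && n <= MAX_SYM);
    for (int i = 0; i < n; i++) {
        w[i] = p[i]; group[i] = i; active[i] = 1; len[i] = 0;
    }
    for (int merges = 0; merges < n - 1; merges++) {
        int a = -1, b = -1;    /* the smallest and second smallest group */
        for (int i = 0; i < n; i++) {
            if (!active[i]) continue;
            if (a < 0 || w[i] < w[a]) { b = a; a = i; }
            else if (b < 0 || w[i] < w[b]) { b = i; }
        }
        for (int i = 0; i < n; i++) {
            if (group[i] != a && group[i] != b) continue;
            len[i]++;          /* one more question for these alternatives */
            group[i] = a;      /* members of b now belong to a */
        }
        w[a] += w[b];
        active[b] = 0;
    }
    for (int i = 0; i < n; i++) expected += p[i] * len[i];
    return expected;
}
```

For the five probabilities above this yields the lengths 3, 3, 3, 3, 1 and an expected code length of 2.20 bit/symbol, matching the Huffman tree.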
Question/Coding Schemes
It can be shown that Huffman coding is optimal
if we have to determine the obtaining alternative in a single instance.
(No question/coding scheme has a smaller expected number of questions.)
Only if the obtaining alternative has to be determined in a sequence
of (independent) situations, this scheme can be improved upon.
Idea: Process the sequence not instance by instance,
but combine two, three or more consecutive instances and
ask directly for the obtaining combination of alternatives.
Although this enlarges the question/coding scheme, the expected number
of questions per identification is reduced (because each interrogation
identifies the obtaining alternative for several situations).
However, the expected number of questions per identification
of an obtaining alternative cannot be made arbitrarily small.
Shannon showed that there is a lower bound, namely the Shannon entropy.
Interpretation of Shannon Entropy
$P(s_1) = \frac{1}{2}$, $P(s_2) = \frac{1}{4}$, $P(s_3) = \frac{1}{8}$, $P(s_4) = \frac{1}{16}$, $P(s_5) = \frac{1}{16}$
Shannon entropy: $-\sum_i P(s_i) \log_2 P(s_i) = 1.875$ bit/symbol
If the probability distribution allows for a
perfect Huffman code (code efficiency 1),
the Shannon entropy can easily be interpreted as follows:
$-\sum_i P(s_i) \log_2 P(s_i) = \sum_i P(s_i) \cdot \log_2 \frac{1}{P(s_i)}$,
where $P(s_i)$ is the occurrence probability and $\log_2 \frac{1}{P(s_i)}$ is the path length in the tree.
In other words, it is the expected number
of needed yes/no questions.
Perfect Question Scheme
[Figure: question tree that splits off one alternative at a time: {s1, ..., s5}, {s2, ..., s5}, {s3, s4, s5}, {s4, s5}]
alternative:   s1    s2    s3    s4    s5
probability:   1/2   1/4   1/8   1/16  1/16
questions:     1     2     3     4     4
Code length: 1.875 bit/symbol
Code efficiency: 1
A Statistical Evaluation Measure
$\chi^2$ Measure
Compares the actual joint distribution
with a hypothetical independent distribution.
Uses absolute comparison.
Can be interpreted as a difference measure.
$\chi^2(X, Y) = \sum_{i=1}^{k_X} \sum_{j=1}^{k_Y} n_{..} \frac{(p_{i.} p_{.j} - p_{ij})^2}{p_{i.} p_{.j}}$
Side remark: Information gain can also be interpreted as a difference measure.
$I_{\mathrm{gain}}(X, Y) = \sum_{j=1}^{k_X} \sum_{i=1}^{k_Y} p_{ij} \log_2 \frac{p_{ij}}{p_{i.} p_{.j}}$
A Statistical Evaluation Measure
$\chi^2$ Measure
Compares the actual joint distribution
with a hypothetical independent distribution.
Uses absolute comparison.
Can be interpreted as a difference measure.
$\chi^2(X, Y) = \sum_{i=1}^{k_X} \sum_{j=1}^{k_Y} n_{..} \frac{(p_{i.} p_{.j} - p_{ij})^2}{p_{i.} p_{.j}}$
For $k_X = k_Y = 2$ (as for rule evaluation) the $\chi^2$ measure simplifies to
$\chi^2(X, Y) = n_{..} \frac{(p_{1.} p_{.1} - p_{11})^2}{p_{1.} (1 - p_{1.}) \, p_{.1} (1 - p_{.1})} = n_{..} \frac{(n_{1.} n_{.1} - n_{..} n_{11})^2}{n_{1.} (n_{..} - n_{1.}) \, n_{.1} (n_{..} - n_{.1})}$.
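The count form of the simplified measure needs only four numbers from the 2x2 contingency table. A minimal sketch, with the hypothetical function name `chi2_2x2` and the counts passed as doubles to avoid integer overflow in the products:

```c
#include <assert.h>

/* chi^2 measure for a 2x2 contingency table, given
   n   = total number of cases (n..),
   n1_ = number of cases in the first row (n1.),
   n_1 = number of cases in the first column (n.1),
   n11 = number of cases in both */
double chi2_2x2(double n, double n1_, double n_1, double n11)
{
    double num = n1_ * n_1 - n * n11;
    double den = n1_ * (n - n1_) * n_1 * (n - n_1);
    assert(den > 0.0);         /* no row or column may be empty */
    return n * num * num / den;
}
```

For a rule X -> Y one would use the number of transactions for n, the support of X for n1., the support of Y for n.1 and the support of X and Y together for n11. The value is 0 when the joint count matches the independence expectation exactly.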
Examples from the Census Data
All rules are stated as
consequent <- antecedent (support%, confidence%, lift)
where the support of a rule is the support of the antecedent.
Trivial/Obvious Rules
edu_num=13 <- education=Bachelors (16.4, 100.0, 6.09)
sex=Male <- relationship=Husband (40.4, 99.99, 1.50)
sex=Female <- relationship=Wife (4.8, 99.9, 3.01)
Interesting Comparisons
marital=Never-married <- age=young sex=Female (12.3, 80.8, 2.45)
marital=Never-married <- age=young sex=Male (17.4, 69.9, 2.12)
salary>50K <- occupation=Exec-managerial sex=Male (8.9, 57.3, 2.40)
salary>50K <- occupation=Exec-managerial (12.5, 47.8, 2.00)
salary>50K <- education=Masters (5.4, 54.9, 2.29)
hours=overtime <- education=Masters (5.4, 41.0, 1.58)
Examples from the Census Data
salary>50K <- education=Masters (5.4, 54.9, 2.29)
salary>50K <- occupation=Exec-managerial (12.5, 47.8, 2.00)
salary>50K <- relationship=Wife (4.8, 46.9, 1.96)
salary>50K <- occupation=Prof-specialty (12.6, 45.1, 1.89)
salary>50K <- relationship=Husband (40.4, 44.9, 1.88)
salary>50K <- marital=Married-civ-spouse (45.8, 44.6, 1.86)
salary>50K <- education=Bachelors (16.4, 41.3, 1.73)
salary>50K <- hours=overtime (26.0, 40.6, 1.70)
salary>50K <- occupation=Exec-managerial hours=overtime
(5.5, 60.1, 2.51)
salary>50K <- occupation=Prof-specialty hours=overtime
(4.4, 57.3, 2.39)
salary>50K <- education=Bachelors hours=overtime
(6.0, 54.8, 2.29)
Examples from the Census Data
salary>50K <- occupation=Prof-specialty marital=Married-civ-spouse
(6.5, 70.8, 2.96)
salary>50K <- occupation=Exec-managerial marital=Married-civ-spouse
(7.4, 68.1, 2.85)
salary>50K <- education=Bachelors marital=Married-civ-spouse
(8.5, 67.2, 2.81)
salary>50K <- hours=overtime marital=Married-civ-spouse
(15.6, 56.4, 2.36)
marital=Married-civ-spouse <- salary>50K (23.9, 85.4, 1.86)
Examples from the Census Data
hours=half-time <- occupation=Other-service age=young
(4.4, 37.2, 3.08)
hours=overtime <- salary>50K (23.9, 44.0, 1.70)
hours=overtime <- occupation=Exec-managerial (12.5, 43.8, 1.69)
hours=overtime <- occupation=Exec-managerial salary>50K
(6.0, 55.1, 2.12)
hours=overtime <- education=Masters (5.4, 40.9, 1.58)
education=Bachelors <- occupation=Prof-specialty (12.6, 36.2, 2.20)
education=Bachelors <- occupation=Exec-managerial (12.5, 33.3, 2.03)
education=HS-grad <- occupation=Transport-moving (4.8, 51.9, 1.61)
education=HS-grad <- occupation=Machine-op-inspct (6.2, 50.7, 1.6)
Examples from the Census Data
occupation=Prof-specialty <- education=Masters (5.4, 49.0, 3.88)
occupation=Prof-specialty <- education=Bachelors sex=Female
(5.1, 34.7, 2.74)
occupation=Adm-clerical <- education=Some-college sex=Female
(8.6, 31.1, 2.71)
sex=Female <- occupation=Adm-clerical (11.5, 67.2, 2.03)
sex=Female <- occupation=Other-service (10.1, 54.8, 1.65)
sex=Female <- hours=half-time (12.1, 53.7, 1.62)
age=young <- hours=half-time (12.1, 53.3, 1.79)
age=young <- occupation=Handlers-cleaners (4.2, 50.6, 1.70)
age=senior <- workclass=Self-emp-not-inc (7.9, 31.1, 1.57)
Summary Association Rules
Association Rule Induction is a Two Step Process
Find the frequent item sets (minimum support).
Form the relevant association rules (minimum confidence).
Generating the Association Rules
Form all possible association rules from the frequent item sets.
Filter "interesting" association rules
based on minimum support and minimum confidence.
Filtering the Association Rules
Compare rule confidence and consequent support.
Information gain, $\chi^2$ measure
In principle: other measures used for decision tree induction.
Mining More Complex Patterns
Mining More Complex Patterns
The search scheme in Frequent Graph/Tree/Sequence mining is the same,
namely the general scheme of searching with a canonical form.
Frequent (Sub)Graph Mining comprises the other areas:
Trees are special graphs, namely graphs that are singly connected.
Sequences can be seen as special trees, namely chains
(only one or two branches, depending on the choice of the root).
Frequent Sequence Mining and Frequent Tree Mining can exploit:
Specialized canonical forms that allow for more efficient checks.
Special data structures to represent the database to mine,
so that support counting becomes more efficient.
We will treat Frequent Graph Mining first and
will discuss optimizations for the other areas later.
Motivation:
Molecular Fragment Mining
Molecular Fragment Mining
Motivation: Accelerating Drug Development
Phases of drug development: pre-clinical and clinical
Data gathering by high-throughput screening:
building molecular databases with activity information
Acceleration potential by intelligent data analysis:
(quantitative) structure-activity relationship discovery
Mining Molecular Databases
Example data: NCI DTP HIV Antiviral Screen data set
Description languages for molecules:
SMILES, SLN, SDfile/Ctab etc.
Finding common molecular substructures
Finding discriminative molecular substructures
Accelerating Drug Development
Developing a new drug can take 10 to 12 years
(from the choice of the target to the introduction into the market).
In recent years the duration of the drug development processes increased
continuously; at the same time the number of substances under development
has gone down drastically.
Due to high investments pharmaceutical companies must secure their market
position and competitiveness by only a few, highly successful drugs.
As a consequence the chances for the development
of drugs for target groups
with rare diseases or
with special diseases in developing countries
are considerably reduced.
A significant reduction of the development time could mitigate this trend
or even reverse it.
(Source: Bundesministerium für Bildung und Forschung, Germany)
Phases of Drug Development
Discovery and Optimization of Candidate Substances
High-Throughput Screening
Lead Discovery and Lead Optimization
Pre-clinical Test Series (tests with animals, ca. 3 years)
Fundamental test w.r.t. effectiveness and side effects
Clinical Test Series (tests with humans, ca. 4-6 years)
Phase 1: ca. 30-80 healthy humans
Check for side effects
Phase 2: ca. 100-300 humans exhibiting the symptoms of the target disease
Check for effectiveness
Phase 3: up to 3000 healthy and ill humans, at least 3 years
Detailed check of effectiveness and side effects
Official Acceptance as a Drug
Drug Development: Acceleration Potential
The length of the pre-clinical and clinical test series can hardly be reduced,
since they serve the purpose of ensuring the safety of the patients.
Therefore approaches to speed up the development process
usually target the pre-clinical phase before the animal tests.
In particular, one tries to improve the search for new drug candidates
(lead discovery) and their optimization (lead optimization).
Here Intelligent Data Analysis and Frequent Pattern Mining can help.
One possible approach:
With high-throughput screening a very large number of substances
is tested automatically and their activity is determined.
The resulting molecular databases are analyzed by trying
to find common substructures of active substances.
High-Throughput Screening
On so-called micro-plates proteins/cells are automatically combined with a large
variety of chemical compounds.
© www.matrixtechcorp.com, www.elisa-tek.com, www.thermo.com, www.arrayit.com
High-Throughput Screening
The filled micro-plates are then evaluated in spectrometers
(w.r.t. absorption, fluorescence, luminescence, polarization etc.).
© www.moleculardevices.com, www.biotek.com
High-Throughput Screening
After the measurement the substances are classified as active or inactive.
Figure © Christof Fattinger, Hoffmann-LaRoche, Basel
By analyzing the results one tries
to understand the dependence
between molecular structure and
activity.
QSAR:
Quantitative Structure-Activity
Relationship Modeling
In this area a large
number of data mining
algorithms are used:
feature selection methods
decision trees
neural networks etc.
Example: NCI DTP HIV Antiviral Screen
Among other data sets, the National Cancer Institute (NCI) has made
the DTP HIV Antiviral Screen Data Set publicly available.
A large number of chemical compounds were tested
whether they protect human CEM cells against an HIV-1 infection.
Substances that provided 50% protection were retested.
Substances that reproducibly provided 100% protection
are listed as confirmed active (CA).
Substances that reproducibly provided at least 50% protection
are listed as moderately active (CM).
All other substances
are listed as confirmed inactive (CI).
325 CA, 877 CM, 35 969 CI (total: 37 171 substances)
Form of the Input Data
Excerpt from the NCI DTP HIV Antiviral Screen data set (SMILES format):
737, 0,CN(C)C1=[S+][Zn]2(S1)SC(=[S+]2)N(C)C
2018, 0,N#CC(=CC1=CC=CC=C1)C2=CC=CC=C2
19110,0,OC1=C2N=C(NC3=CC=CC=C3)SC2=NC=N1
20625,2,NC(=N)NC1=C(SSC2=C(NC(N)=N)C=CC=C2)C=CC=C1.OS(O)(=O)=O
22318,0,CCCCN(CCCC)C1=[S+][Cu]2(S1)SC(=[S+]2)N(CCCC)CCCC
24479,0,C[N+](C)(C)C1=CC2=C(NC3=CC=CC=C3S2)N=N1
50848,2,CC1=C2C=CC=CC2=N[C-](CSC3=CC=CC=C3)[N+]1=O
51342,0,OC1=C2C=NC(=NC2=C(O)N=N1)NC3=CC=C(Cl)C=C3
55721,0,NC1=NC(=C(N=O)C(=N1)O)NC2=CC(=C(Cl)C=C2)Cl
55917,0,O=C(N1CCCC[CH]1C2=CC=CN=C2)C3=CC=CC=C3
64054,2,CC1=C(SC[C-]2N=C3C=CC=CC3=C(C)[N+]2=O)C=CC=C1
64055,1,CC1=CC=CC(=C1)SC[C-]2N=C3C=CC=CC3=C(C)[N+]2=O
64057,2,CC1=C2C=CC=CC2=N[C-](CSC3=NC4=CC=CC=C4S3)[N+]1=O
66151,0,[O-][N+](=O)C1=CC2=C(C=NN=C2C=C1)N3CC3
...
identification number, activity (2: CA, 1: CM, 0: CI), molecule description in SMILES notation
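Records in this three-field format can be read with a few lines of code. The sketch below is only an illustration: it assumes one record per line with the fields separated by commas, and the names `parse_line` and `ACTIVITY` are made up for this example. The three sample lines are copied from the excerpt.

```python
from collections import Counter

# Activity codes as given in the legend of the excerpt.
ACTIVITY = {0: "CI", 1: "CM", 2: "CA"}


def parse_line(line):
    """Split 'id,activity,SMILES' into (int id, activity class, SMILES string)."""
    ident, activity, smiles = (field.strip() for field in line.split(",", 2))
    return int(ident), ACTIVITY[int(activity)], smiles


lines = [
    "737, 0,CN(C)C1=[S+][Zn]2(S1)SC(=[S+]2)N(C)C",
    "20625,2,NC(=N)NC1=C(SSC2=C(NC(N)=N)C=CC=C2)C=CC=C1.OS(O)(=O)=O",
    "64055,1,CC1=CC=CC(=C1)SC[C-]2N=C3C=CC=CC3=C(C)[N+]2=O",
]
records = [parse_line(line) for line in lines]
counts = Counter(activity for _, activity, _ in records)
print(counts)
```

Splitting at most twice keeps the SMILES string intact even if a description language were to use commas.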
Input Format: SMILES Notation and SLN
SMILES Notation: (e.g. Daylight, Inc.)
c1:c:c(-F):c:c2:c:1-C1-C(-C-C-2)-C2-C(-C)(-C-C-1)-C(-O)-C-C-2
SLN (SYBYL Line Notation): (Tripos, Inc.)
C[1]H:CH:C(F):CH:C[8]:C:@1-C[10]H-CH(-CH2-CH2-@8)-C[20]H-C(-CH3)
(-CH2-CH2-@10)-CH(-CH2-CH2-@20)-OH
Represented Molecule:
[Figure: full representation (all carbon and hydrogen atoms drawn) and simplified representation (only the F and O atoms labeled) of the molecule]
Input Format: Grammar for SMILES and SLN
General grammar for (linear) molecule descriptions (SMILES and SLN):
Molecule ::= Atom Branch
Branch   ::= ε
           | Bond Atom Branch
           | Bond Label Branch
           | ( Branch ) Branch
Atom     ::= Element LabelDef
LabelDef ::= ε
           | Label LabelDef
black: non-terminal symbols
blue: terminal symbols
The definitions of the non-terminals "Element", "Bond", and "Label"
depend on the chosen description language. For SMILES it is:
Element ::= B | C | N | O | F | [H] | [He] | [Li] | [Be] | ...
Bond    ::= ε | - | = | # | : | .
Label   ::= Digit | % Digit Digit
Digit   ::= 0 | 1 | ... | 9
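The grammar can be turned into a small recognizer. The following sketch only checks whether a string can be derived from the grammar; it builds no molecule and does not check that ring labels are paired. The token set (a handful of element symbols, bracket atoms, labels, bonds, parentheses) is an assumption chosen for illustration, not the full SMILES language, and all function names are hypothetical.

```python
import re

# Assumed token set: bracket atoms, a few element symbols, labels, bonds, parentheses.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOFPSI]|[bcnops]|%\d\d|\d|[-=#:.()]")
BONDS = {"-", "=", "#", ":", "."}


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def is_atom(tok):
    return tok[0] == "[" or tok[0].isalpha()


def is_label(tok):
    return tok[0] == "%" or tok.isdigit()


def parse_branch(tokens, pos):
    """Branch ::= eps | Bond Atom Branch | Bond Label Branch | ( Branch ) Branch.

    Labels that directly follow an atom (LabelDef) are accepted here as
    'empty bond + label', which recognizes the same strings.
    """
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos] == "(":
            pos = parse_branch(tokens, pos + 1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            continue
        if tokens[pos] in BONDS:  # the bond may also be empty
            pos += 1
        if pos >= len(tokens) or not (is_atom(tokens[pos]) or is_label(tokens[pos])):
            raise ValueError("atom or label expected")
        pos += 1
    return pos


def is_molecule(text):
    """Molecule ::= Atom Branch."""
    try:
        tokens = tokenize(text)
        if not tokens or not is_atom(tokens[0]):
            return False
        return parse_branch(tokens, 1) == len(tokens)
    except ValueError:
        return False


print(is_molecule("c1:c:c(-F):c:c2:c:1-C1-C(-C-C-2)-C2-C(-C)(-C-C-1)-C(-O)-C-C-2"))
print(is_molecule("C(C"))
```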
Input Format: SDfile/Ctab
L-Alanine (13C)
user initials, program, date/time etc.
comment
6 5 0 0 1 0 3 V2000
-0.6622 0.5342 0.0000 C 0 0 2 0 0 0
0.6622 -0.3000 0.0000 C 0 0 0 0 0 0
-0.7207 2.0817 0.0000 C 1 0 0 0 0 0
-1.8622 -0.3695 0.0000 N 0 3 0 0 0 0
0.6220 -1.8037 0.0000 O 0 0 0 0 0 0
1.9464 0.4244 0.0000 O 0 5 0 0 0 0
1 2 1 0 0 0
1 3 1 1 0 0
1 4 1 0 0 0
2 5 2 0 0 0
2 6 1 0 0 0
M END
> <value>
0.2
$$$$
[Figure: structure of L-alanine with the atoms numbered as in the connection table: C 1, C 2, C 3, N 4, O 5, O 6]
SDfile: Structure-data file
Ctab: Connection table (lines 4-16)
© Elsevier Science
Finding Common Molecular Substructures
[Figure: Some Molecules from the NCI HIV Database; Common Fragment]
Finding Molecular Substructures
Common Molecular Substructures
Analyze only the active molecules.
Find molecular fragments that appear frequently in the molecules.
Discriminative Molecular Substructures
Analyze the active and the inactive molecules.
Find molecular fragments that appear frequently in the active molecules
and only rarely in the inactive molecules.
Rationale in both cases:
The found fragments can give hints which structural properties
are responsible for the activity of a molecule.
This can help to identify drug candidates (so-called pharmacophores)
and to guide future screening efforts.
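The selection of discriminative fragments can be sketched as a filter on two support values per fragment. The fragment names, the counts and the two thresholds below are made up for illustration; only the class sizes (325 active, 35 969 inactive) are taken from the data set description.

```python
def discriminative_fragments(support, n_active, n_inactive, min_active, max_inactive):
    """Keep fragments that are frequent in the active and rare in the inactive molecules.

    support maps a fragment name to (count in actives, count in inactives).
    """
    selected = []
    for fragment, (in_active, in_inactive) in support.items():
        frequent = in_active / n_active >= min_active
        rare = in_inactive / n_inactive <= max_inactive
        if frequent and rare:
            selected.append(fragment)
    return selected


# Hypothetical fragment counts.
support = {"frag_1": (200, 120), "frag_2": (250, 30000), "frag_3": (10, 5)}
print(discriminative_fragments(support, 325, 35969, min_active=0.5, max_inactive=0.01))
```

Here `frag_2` is frequent in both classes and `frag_3` is rare in both, so only `frag_1` separates active from inactive molecules.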
Frequent (Sub)Graph Mining
Frequent (Sub)Graph Mining: General Approach
Finding frequent item sets means to find
sets of items that are contained in many transactions.
Finding frequent substructures means to find
graph fragments that are contained in many graphs
in a given database of attributed graphs (user specifies minimum support).
Graph structure of vertices and edges has to be taken into account.
Search a partially ordered set of graph structures instead of subsets.
Main problem: How can we avoid redundant search?
Usually the search is restricted to connected substructures.
Connected substructures suffice for most applications.
This restriction considerably narrows the search space.
Frequent (Sub)Graph Mining: Basic Notions
Let $A = \{a_1, \ldots, a_m\}$ be a set of attributes or labels.
A labeled or attributed graph is a triple $G = (V, E, \ell)$, where
$V$ is the set of vertices,
$E \subseteq V \times V - \{(v, v) \mid v \in V\}$ is the set of edges, and
$\ell : V \cup E \to A$ assigns labels from the set $A$ to vertices and edges.
Note that $G$ is undirected and simple and contains no loops.
However, graphs without these restrictions could be handled as well.
Note also that several vertices and edges may have the same attribute/label.
Example: molecule representation
Atom attributes: atom type (chemical element), charge, aromatic ring flag.
Bond attributes: bond type (single, double, triple, aromatic).
Frequent (Sub)Graph Mining: Basic Notions
Note that for labeled graphs the same notions can be used as for normal graphs.
Without formal definition, we will use, for example:
A vertex $v$ is incident to an edge $e$, and the edge is incident to the vertex $v$,
iff $e = (v, v')$ or $e = (v', v)$.
Two different vertices are adjacent or connected
if they are incident to the same edge.
A path is a sequence of edges connecting two vertices.
It is understood that no edge (and no vertex) occurs twice.
A graph is called connected if there exists a path between any two vertices.
A subgraph consists of a subset of the vertices and a subset of the edges.
If $S$ is a (proper) subgraph of $G$ we write $S \subseteq G$ or $S \subset G$, respectively.
A connected component of a graph is a subgraph that is connected and
maximal in the sense that any larger subgraph containing it is not connected.
Frequent (Sub)Graph Mining: Basic Notions
A vertex of a graph is called isolated if it is not incident to any edge.
A vertex of a graph is called a leaf if it is incident to exactly one edge.
An edge of a graph is called a bridge if removing it
increases the number of connected components of the graph.
More intuitively: a bridge is the only connection between two vertices,
that is, there is no other path on which one can reach the one from the other.
An edge of a graph is called a proper bridge
if it is a bridge and not incident to a leaf.
In other words: an edge is a proper bridge if removing it disconnects the graph
without creating an isolated vertex.
All other bridges are called leaf bridges
(because they are incident to at least one leaf).
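The bridge definitions can be checked directly by removing each edge in turn and counting connected components. This brute-force sketch is meant for small graphs only; the function names and the example graph (two triangles joined by an edge, plus one pendant vertex) are assumptions made for illustration.

```python
def count_components(vertices, edges):
    """Number of connected components of an undirected graph."""
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, count = set(), 0
    for start in vertices:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            x = stack.pop()
            for y in adjacent[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return count


def classify_bridges(vertices, edges):
    """Map every bridge to 'leaf bridge' or 'proper bridge'."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    base = count_components(vertices, edges)
    result = {}
    for edge in edges:
        remaining = [f for f in edges if f != edge]
        if count_components(vertices, remaining) > base:
            u, v = edge
            is_leaf_bridge = degree[u] == 1 or degree[v] == 1
            result[edge] = "leaf bridge" if is_leaf_bridge else "proper bridge"
    return result


vertices = range(7)
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
print(classify_bridges(vertices, edges))
```

Removing the edge (2, 3) splits the graph into two triangles (one with the pendant vertex attached), so it is a proper bridge; removing (5, 6) isolates vertex 6, so it is a leaf bridge.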
Frequent (Sub)Graph Mining: Basic Notions
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs.
A subgraph isomorphism of $S$ to $G$ or an occurrence of $S$ in $G$
is an injective function $f : V_S \to V_G$ with
$\forall v \in V_S : \ell_S(v) = \ell_G(f(v))$ and
$\forall (u, v) \in E_S : (f(u), f(v)) \in E_G \wedge \ell_S((u, v)) = \ell_G((f(u), f(v)))$.
That is, the mapping $f$ preserves the connection structure and the labels.
If such a mapping $f$ exists, we write $S \sqsubseteq G$.
Note that there may be several ways to map a labeled graph $S$ to a labeled graph $G$
so that the connection structure and the vertex and edge labels are preserved.
For example, $G$ may possess several subgraphs that are isomorphic to $S$.
It may even be that the graph $S$ can be mapped in several different ways to the
same subgraph of $G$. This is the case if there exists a subgraph isomorphism of $S$
to itself (a so-called graph automorphism) that is not the identity.
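The definition translates into a simple backtracking search that enumerates all occurrences. This is a minimal sketch, assuming a graph is stored as a pair of dictionaries (vertex labels, and edge labels keyed by a `frozenset` of the two end vertices); the representation, the function names and the tiny example graphs are illustrative assumptions, and no attempt is made at efficiency.

```python
def make_graph(vertex_labels, edge_labels):
    """Store edges as frozensets so that (u, v) and (v, u) denote the same edge."""
    return dict(vertex_labels), {frozenset(e): lab for e, lab in edge_labels.items()}


def find_occurrences(S, G):
    """Return all injective, label- and edge-preserving mappings of S into G."""
    s_vlab, s_elab = S
    g_vlab, g_elab = G
    s_vertices = list(s_vlab)
    occurrences = []

    def extend(mapping):
        if len(mapping) == len(s_vertices):
            occurrences.append(dict(mapping))
            return
        u = s_vertices[len(mapping)]
        for cand in g_vlab:
            if cand in mapping.values() or g_vlab[cand] != s_vlab[u]:
                continue  # injectivity or vertex label violated
            ok = True
            for w, image in mapping.items():
                edge = frozenset((u, w))
                if edge in s_elab and g_elab.get(frozenset((cand, image))) != s_elab[edge]:
                    ok = False  # edge missing in G or edge label differs
                    break
            if ok:
                mapping[u] = cand
                extend(mapping)
                del mapping[u]

    extend({})
    return occurrences


# G: a carbon atom with two singly bonded oxygen atoms; S: the chain O - C - O.
G = make_graph({0: "C", 1: "O", 2: "O"}, {(0, 1): "-", (0, 2): "-"})
S = make_graph({"x": "O", "y": "C", "z": "O"}, {("x", "y"): "-", ("y", "z"): "-"})
print(len(find_occurrences(S, G)))
```

The chain O - C - O has a non-identical automorphism (swapping the two oxygen atoms), which is why it is found twice at the same location in `G`.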
Frequent (Sub)Graph Mining: Basic Notions
Let $S$ and $G$ be two labeled graphs.
$S$ and $G$ are called isomorphic, written $S \equiv G$, iff $S \sqsubseteq G$ and $G \sqsubseteq S$.
In this case a function $f$ mapping $S$ to $G$ is called a graph isomorphism.
A function $f$ mapping $S$ to itself is called a graph automorphism.
$S$ is properly contained in $G$, written $S \sqsubset G$, iff $S \sqsubseteq G$ and $S \not\equiv G$.
If $S \sqsubseteq G$ or $S \sqsubset G$, then there exists a (proper) subgraph $G'$ of $G$,
such that $S$ and $G'$ are isomorphic.
This explains the term "subgraph isomorphism".
The set of all connected subgraphs of $G$ is denoted by $\mathcal{C}(G)$.
It is obvious that for all $S \in \mathcal{C}(G)$ : $S \sqsubseteq G$.
However, there are (unconnected) graphs $S$ with $S \sqsubseteq G$ that are not in $\mathcal{C}(G)$.
The set of all (connected) subgraphs is analogous to the power set of a set.
Subgraph Isomorphism: Examples
[Figure: a molecule $G$ and two graphs $S_1$ and $S_2$]
A molecule $G$ that represents a graph in a database
and two graphs $S_1$ and $S_2$ that are contained in $G$.
The subgraph relationship is formally described by a mapping $f$
of the vertices of one graph to the vertices of another:
$G = (V_G, E_G)$, $S = (V_S, E_S)$, $f : V_S \to V_G$.
This mapping must preserve the connection structure and the labels.
Subgraph Isomorphism: Examples
[Figure: the molecule $G$ with the mappings $f_1 : V_{S_1} \to V_G$ and $f_2 : V_{S_2} \to V_G$]
The mapping must preserve the connection structure:
$\forall (u, v) \in E_S : (f(u), f(v)) \in E_G$.
The mapping must preserve vertex and edge labels:
$\forall v \in V_S : \ell_S(v) = \ell_G(f(v))$, $\forall (u, v) \in E_S : \ell_S((u, v)) = \ell_G((f(u), f(v)))$.
Here: oxygen must be mapped to oxygen, single bonds to single bonds etc.
Subgraph Isomorphism: Examples
[Figure: the molecule $G$ with the mappings $f_1 : V_{S_1} \to V_G$, $f_2 : V_{S_2} \to V_G$ and $g_2 : V_{S_2} \to V_G$]
There may be more than one possible mapping / occurrence.
(There are even three more occurrences of $S_2$.)
However, we are currently only interested in whether there exists a mapping.
(The number of occurrences will become important
when we consider mining frequent (sub)graphs in a single graph.)
Testing whether a subgraph isomorphism exists between given graphs $S$ and $G$
is NP-complete (that is, requires exponential time unless P = NP).
Subgraph Isomorphism: Examples
[Figure: the molecule $G$ with the mappings $f_1 : V_{S_1} \to V_G$, $f_3 : V_{S_3} \to V_G$ and $g_3 : V_{S_3} \to V_G$]
A graph may be mapped to itself (automorphism).
Trivially, every graph possesses the identity as an automorphism.
(Every graph can be mapped to itself by mapping each node to itself.)
If a graph (fragment) possesses an automorphism that is not the identity
there is more than one occurrence at the same location in another graph.
The number of occurrences of a graph (fragment) in a graph can be huge.
Frequent (Sub)Graph Mining: Basic Notions
Let $S$ be a labeled graph and $\mathcal{G}$ a vector of labeled graphs.
A labeled graph $G \in \mathcal{G}$ covers the labeled graph $S$ or
the labeled graph $S$ is contained in a labeled graph $G \in \mathcal{G}$ iff $S \sqsubseteq G$.
The set $K_{\mathcal{G}}(S) = \{k \in \{1, \ldots, n\} \mid S \sqsubseteq G_k\}$ is called the cover of $S$ w.r.t. $\mathcal{G}$.
The cover of a graph is the index set of the database graphs that cover it.
It may also be defined as a vector of all labeled graphs that cover it
(which, however, is complicated to write in a formally correct way).
The value $s_{\mathcal{G}}(S) = |K_{\mathcal{G}}(S)|$ is called the (absolute) support of $S$ w.r.t. $\mathcal{G}$.
The value $\sigma_{\mathcal{G}}(S) = \frac{1}{n} |K_{\mathcal{G}}(S)|$ is called the relative support of $S$ w.r.t. $\mathcal{G}$.
The support of $S$ is the number or fraction of labeled graphs that contain it.
Sometimes $\sigma_{\mathcal{G}}(S)$ is also called the (relative) frequency of $S$ w.r.t. $\mathcal{G}$.
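Cover and support can be computed by testing every database graph for containment. The sketch below uses a deliberately naive containment test that tries all injective vertex assignments, so it is only suitable for tiny graphs. The graph representation, the function names and the three-graph database are assumptions made for illustration.

```python
from itertools import permutations


def make_graph(vertex_labels, edge_labels):
    """Vertex labels as dict; edge labels keyed by frozenset of the end vertices."""
    return dict(vertex_labels), {frozenset(e): lab for e, lab in edge_labels.items()}


def contains(S, G):
    """True if some injective mapping of S into G preserves labels and edges."""
    s_vlab, s_elab = S
    g_vlab, g_elab = G
    s_vertices = list(s_vlab)
    for image in permutations(g_vlab, len(s_vertices)):
        f = dict(zip(s_vertices, image))
        labels_ok = all(s_vlab[v] == g_vlab[f[v]] for v in s_vertices)
        edges_ok = all(
            g_elab.get(frozenset(f[v] for v in edge)) == lab
            for edge, lab in s_elab.items()
        )
        if labels_ok and edges_ok:
            return True
    return False


def cover(S, database):
    """Index set (1-based, as in the definition) of the graphs that cover S."""
    return {k for k, G in enumerate(database, start=1) if contains(S, G)}


database = [
    make_graph({0: "C", 1: "O"}, {(0, 1): "-"}),
    make_graph({0: "C", 1: "O"}, {(0, 1): "="}),
    make_graph({0: "C", 1: "O", 2: "N"}, {(0, 1): "-", (0, 2): "-"}),
]
S = make_graph({"a": "C", "b": "O"}, {("a", "b"): "-"})

K = cover(S, database)
support = len(K)                          # absolute support
relative_support = support / len(database)
print(K, support, relative_support)
```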
Frequent (Sub)Graph Mining: Formal Definition
Given:
a set $A = \{a_1, \ldots, a_m\}$ of attributes or labels,
a vector $\mathcal{G} = (G_1, \ldots, G_n)$ of graphs with labels in $A$,
a number $s_{\min} \in \mathbb{N}$, $0 < s_{\min} \le n$, or (equivalently)
a number $\sigma_{\min} \in \mathbb{R}$, $0 < \sigma_{\min} \le 1$, the minimum support.
Desired:
the set of frequent (sub)graphs or frequent fragments, that is,
the set $F_{\mathcal{G}}(s_{\min}) = \{S \mid s_{\mathcal{G}}(S) \ge s_{\min}\}$ or (equivalently)
the set $\Phi_{\mathcal{G}}(\sigma_{\min}) = \{S \mid \sigma_{\mathcal{G}}(S) \ge \sigma_{\min}\}$.
[...]
$cl(I) = \bigcap_{k \in K_T(I)} t_k$
restricted to the set of frequent item sets:
$C_T(s_{\min}) = \{I \in F_T(s_{\min}) \mid I = cl(I)\}$
Closed (Sub)Graphs
Question: Is there a closure operator that induces the closed (sub)graphs?
At first glance, it appears natural to transfer the operation
$cl(I) = \bigcap_{k \in K_T(I)} t_k$
by replacing the intersection with the greatest common subgraph.
Unfortunately, this is not possible, because the greatest common subgraph
of two (or more) graphs need not be uniquely defined.
Consider the two graphs (which are actually chains):
A - B - C and A - B - B - C.
There are two greatest common subgraphs:
A - B and B - C.
As a consequence, the intersection of a set of database graphs
can yield a set of graphs instead of a single common graph.
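The counterexample can be reproduced for chains, which are easy to handle as strings of vertex labels. The sketch below lists all longest common connected pieces of two chains; it assumes single-letter vertex labels and identical (unlabeled) edges, and the function name is hypothetical. General graphs would require a much more involved common-subgraph search.

```python
def greatest_common_chains(a, b):
    """All longest contiguous pieces of chain a that also occur in chain b.

    Chains are undirected, so a piece may also match b in reversed direction.
    """
    best, found = 0, set()
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            piece = a[i:j]
            if piece in b or piece[::-1] in b:
                if len(piece) > best:
                    best, found = len(piece), {piece}
                elif len(piece) == best:
                    found.add(piece)
    return found


print(greatest_common_chains("ABC", "ABBC"))
```

The result contains both A - B and B - C, so there is no single graph that could serve as "the" intersection of the two chains.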
Reminder: Galois Connections
Let $(X, \preceq_X)$ and $(Y, \preceq_Y)$ be two partially ordered sets.
A function pair $(f_1, f_2)$ with $f_1 : X \to Y$ and $f_2 : Y \to X$
is called a (monotone) Galois connection iff
$\forall A_1, A_2 \in X : A_1 \preceq_X A_2 \Rightarrow f_1(A_1) \preceq_Y f_1(A_2)$,
$\forall B_1, B_2 \in Y : B_1 \preceq_Y B_2 \Rightarrow f_2(B_1) \preceq_X f_2(B_2)$,
$\forall A \in X : \forall B \in Y : A \preceq_X f_2(B) \Leftrightarrow B \succeq_Y f_1(A)$.
A function pair $(f_1, f_2)$ with $f_1 : X \to Y$ and $f_2 : Y \to X$
is called an anti-monotone Galois connection iff
$\forall A_1, A_2 \in X : A_1 \preceq_X A_2 \Rightarrow f_1(A_1) \succeq_Y f_1(A_2)$,
$\forall B_1, B_2 \in Y : B_1 \preceq_Y B_2 \Rightarrow f_2(B_1) \succeq_X f_2(B_2)$,
$\forall A \in X : \forall B \in Y : A \preceq_X f_2(B) \Leftrightarrow B \preceq_Y f_1(A)$.
In a monotone Galois connection, both $f_1$ and $f_2$ are monotone,
in an anti-monotone Galois connection, both $f_1$ and $f_2$ are anti-monotone.
Reminder: Galois Connections
Galois Connections and Closure Operators
Let the two sets $X$ and $Y$ be power sets of some sets $U$ and $V$, respectively,
and let the partial orders be the subset relations on these power sets, that is, let
$(X, \preceq_X) = (2^U, \subseteq)$ and $(Y, \preceq_Y) = (2^V, \subseteq)$.
Then the combination $f_1 \circ f_2 : X \to X$ of the functions of a Galois connection
(first $f_1$, then $f_2$) is a closure operator (as well as the combination $f_2 \circ f_1 : Y \to Y$).
Galois Connections in Frequent Item Set Mining
Consider the partially ordered sets $(2^B, \subseteq)$ and $(2^{\{1,\ldots,n\}}, \subseteq)$.
Let $f_1 : 2^B \to 2^{\{1,\ldots,n\}}$, $I \mapsto K_T(I) = \{k \in \{1, \ldots, n\} \mid I \subseteq t_k\}$
and $f_2 : 2^{\{1,\ldots,n\}} \to 2^B$, $J \mapsto \bigcap_{j \in J} t_j = \{i \in B \mid \forall j \in J : i \in t_j\}$.
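The two functions are easy to write down for a concrete transaction database. The sketch below uses a made-up database of four transactions and applies first $f_1$ (the cover) and then $f_2$ (the intersection of the covered transactions) to obtain the closure of an item set. The variable and function names are assumptions made for illustration.

```python
# Hypothetical transaction database T = (t_1, ..., t_4) over the item base B.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
B = set().union(*T)


def f1(I):
    """Cover K_T(I): indices (1-based) of the transactions that contain I."""
    return {k for k, t in enumerate(T, start=1) if I <= t}


def f2(J):
    """Items contained in all transactions with index in J (all of B if J is empty)."""
    items = set(B)
    for j in J:
        items &= T[j - 1]
    return items


def closure(I):
    """Apply f1 first, then f2."""
    return f2(f1(I))


print(f1({"a"}), closure({"a"}))
```

Every transaction that contains `a` also contains `b`, so the closure of {a} is {a, b}; applying the closure a second time changes nothing.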
[Figure: (sub)graphs over the atoms S, C, N and O, etc. (8 more possibilities)]
Reminder: Searching for Frequent Item Sets
We have to search the partially ordered set $(2^B, \subseteq)$ / its Hasse diagram.
Assigning unique parents turns the Hasse diagram into a tree.
Traversing the resulting tree explores each item set exactly once.
Hasse diagram and a possible tree for five items (both have the same nodes, listed by level):
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
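One common way to assign unique parents is to fix an order of the items and let the parent of an item set be the set without its last item. The sketch below traverses the resulting tree for the five items and records every visited item set; it illustrates the principle only, without any support counting, and the function name is hypothetical.

```python
def enumerate_item_sets(items):
    """Visit every subset of items exactly once via a tree with unique parents.

    The parent of an item set is the set without its last item (w.r.t. the
    sorted item order), so a set is only extended by items after its last one.
    """
    order = sorted(items)
    visited = []

    def expand(prefix, start):
        visited.append(prefix)
        for idx in range(start, len(order)):
            expand(prefix + order[idx], idx + 1)

    expand("", 0)
    return visited


item_sets = enumerate_item_sets("abcde")
print(len(item_sets), item_sets[:6])
```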
Searching for Frequent (Sub)Graphs
We have to search the partially ordered set of (connected) (sub)graphs
ranging from the empty graph to the database graphs.
Assigning unique parents turns the corresponding Hasse diagram into a tree.
Traversing the resulting tree explores each (sub)graph exactly once.
Subgraph Hasse diagram and a possible tree:
[Figure: Hasse diagram of the (sub)graphs over the atoms F, S, O, C and N, starting from the empty graph (*), and a tree obtained from it by assigning unique parents]
Searching with Unique Parents
Principle of a Search Algorithm based on Unique Parents:
Base Loop:
Traverse all possible vertex attributes (their unique parent is the empty graph).
Recursively process all vertex attributes that are frequent.
Recursive Processing:
For a given frequent (sub)graph $S$:
Generate all extensions $R$ of $S$ by an edge or by an edge and a vertex
(if the vertex is not yet in $S$) for which $S$ is the chosen unique parent.
For all $R$: if $R$ is frequent, process $R$ recursively, otherwise discard $R$.
Questions:
How can we formally assign unique parents?
(How) Can we make sure that we generate only those extensions
for which the (sub)graph that is extended is the chosen unique parent?
Assigning Unique Parents
Formally, the set of all possible parents of a (connected) (sub)graph $S$ is
$P(S) = \{R \in \mathcal{C}(S) \mid \neg\exists\, U \in \mathcal{C}(S) : R \subset U \subset S\}$.
In other words, the possible parents of $S$ are its maximal proper subgraphs.
Each possible parent contains exactly one edge less than the (sub)graph $S$.
If we can define an order on the edges of the (sub)graph $S$,
we can easily single out a unique parent, the canonical parent $p_c(S)$:
Let $e$ [...]
[...] If $e(R)$, as induced by $w_c(R)$, is the edge added to $S$ to form $R$
and $R$ is frequent, process $R$ recursively, otherwise discard $R$.
Questions:
How can we formally define canonical code words?
Do we have to generate all possible extensions of a frequent (sub)graph?
Canonical Forms: Prefix Property
Suppose the canonical form possesses the prefix property:
Every prefix of a canonical code word is a canonical code word itself.
The edge $e$ [...]
[Figure: example molecule with its depth-first spanning tree A (vertex numbering S 0, N 1, O 2, C 3, C 4, O 5, O 6, C 7, C 8) and its breadth-first spanning tree B (vertex numbering S 0, N 1, C 2, O 3, C 4, C 5, C 6, O 7, O 8)]
Order of Elements: S $\prec$ N $\prec$ O $\prec$ C    Order of Bonds: - $\prec$ =
Code Words:
A: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C
B: S 0-N1 0-C2 1-O3 1-C4 2-C5 4-C5 4-C6 6-O7 6=O8
(Reminder: in A the edges are sorted descendingly w.r.t. the second entry.)
Checking for Canonical Form: Compare Prefixes
Base Loop:
Traverse all vertices with a label no less than the current root vertex
(first character of the code word; possible roots of spanning trees).
Recursive Processing:
The recursive processing constructs alternative spanning trees and
compares the code words resulting from them with the code word to check.
In each recursion step one edge is added to the spanning tree and its description
is compared to the corresponding one in the code word to check.
If the new edge description is larger, the edge can be skipped
(new code word is lexicographically larger).
If the new edge description is smaller, the code word is not canonical
(new code word is lexicographically smaller).
If the new edge description is equal, the rest of the code word
is processed recursively (code word prefixes are equal).
Checking for Canonical Form
function isCanonical (w: array of int, G: graph) : boolean;
var v : vertex;                        (* to traverse the vertices of the graph *)
    e : edge;                          (* to traverse the edges of the graph *)
    x : array of vertex;               (* to collect the numbered vertices *)
begin
  forall v ∈ G.V do v.i := -1;         (* clear the vertex indices *)
  forall e ∈ G.E do e.i := -1;         (* clear the edge markers *)
  forall v ∈ G.V do begin              (* traverse the potential root vertices *)
    if v.a < w[0] then return false;   (* if v has a smaller label, abort *)
    if v.a = w[0] then begin           (* if v has the same label, check rest *)
      v.i := 0; x[0] := v;             (* number and record the root vertex *)
      if not rec(w, 1, x, 1, 0)        (* check the code word recursively and *)
      then return false;               (* abort if a smaller code word is found *)
      v.i := -1;                       (* clear the vertex index again *)
    end;
  end;
  return true;                         (* the code word is canonical *)
end (* isCanonical *)                  (* for a breadth-first search spanning tree *)
Checking for Canonical Form
function rec (w: array of int, k: int, x: array of vertex, n: int, i: int) : boolean;
                                       (* w: code word to be tested *)
                                       (* k: current position in code word *)
                                       (* x: array of already labeled/numbered vertices *)
                                       (* n: number of labeled/numbered vertices *)
                                       (* i: index of next extendable vertex to check; i < n *)
var d : vertex;                        (* vertex at the other end of an edge *)
    j : int;                           (* index of destination vertex *)
    u : boolean;                       (* flag for unnumbered destination vertex *)
    r : boolean;                       (* buffer for a recursion result *)
begin
  if k ≥ length(w) return true;        (* full code word has been generated *)
  while i < w[k] do begin              (* check whether there is an edge with *)
    forall e incident to x[i] do       (* a source vertex having a smaller index *)
      if e.i < 0 then return false;
    i := i + 1;                        (* if there is an unmarked edge, abort, *)
  end;                                 (* otherwise go to the next vertex *)
Checking for Canonical Form
  forall e incident to x[i] (in sorted order) do begin
    if e.i < 0 then begin              (* traverse the unvisited incident edges *)
      if e.a < w[k+1] then return false;   (* check the *)
      if e.a > w[k+1] then return true;    (* edge attribute *)
      d := vertex incident to e other than x[i];
      if d.a < w[k+2] then return false;   (* check destination *)
      if d.a > w[k+2] then return true;    (* vertex attribute *)
      if d.i < 0 then j := n else j := d.i;
      if j < w[k+3] then return false; (* check destination vertex index *)
      [...]                            (* check rest of code word recursively, *)
                                       (* because prefixes are equal *)
    end;
  end;
  return true;                         (* return that no smaller code word *)
end (* rec *)                          (* than w could be found *)
Checking for Canonical Form
  forall e incident to x[i] (in sorted order) do begin
    if e.i < 0 then begin              (* traverse the unvisited incident edges *)
      [...]                            (* check the current edge *)
      if j = w[k+3] then begin         (* if edge descriptions are equal *)
        e.i := 1; u := d.i < 0;        (* mark edge and number vertex *)
        if u then begin d.i := j; x[n] := d; n := n + 1; end
        r := rec(w, k+4, x, n, i);     (* check recursively *)
        if u then begin d.i := -1; n := n - 1; end
        e.i := -1;                     (* unmark edge (and vertex) again *)
        if not r then return false;
      end;                             (* evaluate the recursion result *)
    end;                               (* abort if a smaller code word was found *)
  end;
  return true;                         (* return that no smaller code word *)
end (* rec *)                          (* than w could be found *)
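For very small graphs the idea of a canonical code word can be illustrated by brute force: build a code word for every vertex numbering and keep the lexicographically smallest one. Note that this is a simplification and not the recursive test shown in the pseudocode: it considers all numberings rather than only those induced by breadth-first spanning trees, and its running time grows factorially. The edge descriptions follow the breadth-first layout (source index, bond, destination element, destination index); the element and bond orders are the ones from the example, and all names are hypothetical.

```python
from itertools import permutations

ELEMENT_ORDER = {"S": 0, "N": 1, "O": 2, "C": 3}
BOND_ORDER = {"-": 0, "=": 1}


def code_word(vlabels, edges, numbering):
    """Code word of a graph for one vertex numbering (dict vertex -> index)."""
    root = next(v for v, idx in numbering.items() if idx == 0)
    descriptions = []
    for (u, v), bond in edges.items():
        src, dst = sorted((u, v), key=numbering.get)  # source has the smaller index
        descriptions.append(
            (numbering[src], BOND_ORDER[bond], ELEMENT_ORDER[vlabels[dst]], numbering[dst])
        )
    return ELEMENT_ORDER[vlabels[root]], tuple(sorted(descriptions))


def canonical_code_word(vlabels, edges):
    """Lexicographically smallest code word over all vertex numberings."""
    vertices = list(vlabels)
    return min(
        code_word(vlabels, edges, dict(zip(perm, range(len(vertices)))))
        for perm in permutations(vertices)
    )


# The chain S - C - N, described twice with different vertex names and orders.
g1 = ({"x": "S", "y": "C", "z": "N"}, {("x", "y"): "-", ("y", "z"): "-"})
g2 = ({1: "N", 2: "C", 3: "S"}, {(1, 2): "-", (3, 2): "-"})
# A different graph: S - C = N.
g3 = ({1: "S", 2: "C", 3: "N"}, {(1, 2): "-", (2, 3): "="})

print(canonical_code_word(*g1) == canonical_code_word(*g2))
print(canonical_code_word(*g1) == canonical_code_word(*g3))
```

Two descriptions of the same graph yield the same smallest code word, which is what allows a search to recognize and discard duplicates.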
Restricted Extensions
Canonical Forms: Restricted Extensions
Principle of the Search Algorithm up to now:
Generate all possible extensions of a given canonical code word
by the description of an edge that extends the described (sub)graph.
Check whether the extended code word is canonical (and the (sub)graph frequent).
If it is, process the extended code word recursively, otherwise discard it.
Straightforward Improvement:
For some extensions of a given canonical code word it is easy to see
that they will not be canonical themselves.
The trick is to check whether a spanning tree rooted at the same vertex
yields a code word that is smaller than the created extended code word.
This immediately rules out edges attached to certain vertices in the (sub)graph
(only certain vertices are extendable, that is, can be incident to a new edge)
as well as certain edges closing cycles.
Canonical Forms: Restricted Extensions
Depth-First Search: Rightmost Path Extension
Extendable Vertices:
Only vertices on the rightmost path of the spanning tree may be extended.
If the source vertex of the new edge is not a leaf, the edge description
must not precede the description of the downward edge on the path.
(That is, the edge attribute must be no less than the edge attribute of the
downward edge, and if it is equal, the attribute of its destination vertex must
be no less than the attribute of the downward edge's destination vertex.)
Edges Closing Cycles:
Edges closing cycles must start at an extendable vertex.
They must lead to the rightmost leaf (vertex at end of rightmost path).
The index of the source vertex must precede the index of the source vertex
of any edge already incident to the rightmost leaf.
Canonical Forms: Restricted Extensions
Breadth-First Search: Maximum Source Extension
Extendable Vertices:
Only vertices having an index no less than the maximum source index
of edges that are already in the (sub)graph may be extended.
If the source of the new edge is the one having the maximum source index,
it may be extended only by edges whose descriptions do not precede
the description of any downward edge already incident to this vertex.
(That is, the edge attribute must be no less, and if it is equal,
the attribute of the destination vertex must be no less.)
Edges Closing Cycles:
Edges closing cycles must start at an extendable vertex.
They must lead forward,
that is, to a vertex having a larger index than the extended vertex.
Restricted Extensions: A Simple Example
[Figure: example molecule with its depth-first spanning tree A (vertex numbering S 0, N 1, O 2, C 3, C 4, O 5, O 6, C 7, C 8) and its breadth-first spanning tree B (vertex numbering S 0, N 1, C 2, O 3, C 4, C 5, C 6, O 7, O 8)]
Extendable Vertices:
A: vertices on the rightmost path, that is, 0, 1, 3, 7, 8.
B: vertices with an index no smaller than the maximum source, that is, 6, 7, 8.
Edges Closing Cycles:
A: none, because the existing cycle edge has the smallest possible source.
B: the edge between the vertices 7 and 8.
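The extendable vertices of the example can be derived directly from the two code words. In the sketch below the code words are entered by hand as tuples: for A in the order (destination index, source index, bond, element), for B in the order (source index, bond, element, destination index). The tuple encoding and the function names are assumptions made for illustration.

```python
# Code word A (depth-first): S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C
A = [(1, 0, "-", "N"), (2, 1, "-", "O"), (3, 1, "-", "C"), (4, 3, "-", "C"),
     (5, 4, "-", "O"), (6, 4, "=", "O"), (7, 3, "-", "C"), (8, 7, "-", "C"),
     (8, 0, "-", "C")]
# Code word B (breadth-first): S 0-N1 0-C2 1-O3 1-C4 2-C5 4-C5 4-C6 6-O7 6=O8
B = [(0, "-", "N", 1), (0, "-", "C", 2), (1, "-", "O", 3), (1, "-", "C", 4),
     (2, "-", "C", 5), (4, "-", "C", 5), (4, "-", "C", 6), (6, "-", "O", 7),
     (6, "=", "O", 8)]


def rightmost_path(code_word):
    """Vertices on the path from the root to the last numbered vertex."""
    parent = {}
    for dest, src, _bond, _elem in code_word:
        # The first edge reaching a vertex is its tree edge; later edges to
        # the same destination close cycles and are ignored here.
        parent.setdefault(dest, src)
    vertex = max(parent)
    path = [vertex]
    while vertex in parent:
        vertex = parent[vertex]
        path.append(vertex)
    return path[::-1]


def maximum_source_extendable(code_word):
    """Vertices with an index no less than the maximum source index."""
    max_source = max(src for src, _bond, _elem, _dest in code_word)
    last = max(dest for _src, _bond, _elem, dest in code_word)
    return list(range(max_source, last + 1))


print(rightmost_path(A))
print(maximum_source_extendable(B))
```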
Restricted Extensions: A Simple Example
[Figure: the example molecule with its depth-first spanning tree A and its breadth-first spanning tree B, as on the previous slide]
If other vertices are extended, a tree with the same root yields a smaller code word.
Example: attach a single bond to a carbon atom at the leftmost oxygen atom
A: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C 92-C
   S 10-N 21-O 32-C
B: S 0-N1 0-C2 1-O3 1-C4 2-C5 4-C5 4-C6 6-O7 6=O8 3-C9
   S 0-N1 0-C2 1-O3 1-C4 2-C5 3-C6
Canonical Forms: Restricted Extensions
The rules underlying restricted extensions provide a one-sided answer
to the question whether an extension yields a canonical code word.
Depth-first search canonical form
If the extension edge is not a rightmost path extension,
then the resulting code word is certainly not canonical.
If the extension edge is a rightmost path extension,
then the resulting code word may or may not be canonical.
Breadth-first search canonical form
If the extension edge is not a maximum source extension,
then the resulting code word is certainly not canonical.
If the extension edge is a maximum source extension,
then the resulting code word may or may not be canonical.
As a consequence, a canonical form test is still necessary.
Example Search Tree
Start with a single vertex (seed vertex).
Add an edge (and maybe a vertex) in each step (restricted extensions).
Determine the support and prune infrequent (sub)graphs.
Check for canonical form and prune (sub)graphs with non-canonical code words.
[Figure: example molecules and the search tree for seed S]
S $\prec$ F $\prec$ N $\prec$ C $\prec$ O, - $\prec$ =
breadth-first search canonical form
Searching without a Seed Atom
[Figure: search tree starting from the empty graph (*) for the example molecules glycine, cysteine and serine]
breadth-first search canonical form, S $\prec$ N $\prec$ O $\prec$ C, - $\prec$ =
Chemical elements processed on the left are excluded on the right.
Comparison of Canonical Forms
(depth-first versus breadth-first spanning tree construction)
Canonical Forms: Comparison
Depth-First vs. Breadth-First Search Canonical Form
With breadth-first search canonical form the extendable vertices
are much easier to traverse, as they always have consecutive indices:
One only has to store and update one number, namely the index
of the maximum edge source, to describe the vertex range.
Also the check for canonical form is slightly more complex (to program)
for depth-first search canonical form.
The two canonical forms obviously lead to different branching factors,
widths and depths of the search tree.
However, it is not immediately clear which form leads to the "better"
(more efficient) structure of the search tree.
The experimental results reported in the following indicate that it may depend
on the data set which canonical form performs better.
Advantage for Maximum Source Extensions
Generate all substructures
(that contain nitrogen)
of the example molecule:
[Figure: example molecule]
Problem: The two branches emanating
from the nitrogen atom start identically.
Thus rightmost path extensions try
the right branch over and over again.
Search Trees with N $\prec$ O $\prec$ C
[Figure: search trees for Maximum Source Extension (non-canonical: 3) and Rightmost Path Extension (non-canonical: 6)]
Advantage for Rightmost Path Extensions
Generate all substructures
(that contain nitrogen)
of the example molecule:
[Figure: example molecule, a ring of one nitrogen and four carbon atoms]
Problem: The ring of carbon atoms
can be closed between any two branches
(three ways of building the fragment,
only one of which is canonical).
Search Trees with N $\prec$ C
[Figure: search trees for Maximum Source Extension (non-canonical: 3) and Rightmost Path Extension (non-canonical: 1)]
Experiments: Data Sets
Index Chemicus, Subset of 1993
1293 molecules / 34431 atoms / 36594 bonds
Frequent fragments down to fairly low support values are trees (no rings).
Medium number of fragments and closed fragments.
Steroids
17 molecules / 401 atoms / 456 bonds
A large part of the frequent fragments contain one or more rings.
Huge number of fragments, still large number of closed fragments.
Steroids Data Set
[Figure: structural formulas of the molecules in the steroids data set]
Experiments: IC93 Data Set
[Figure: three plots (fragments/10^4, occurrences/10^6, time/seconds) with curves for breadth-first and depth-first search]
Experimental results on the IC93 data.
The horizontal axis shows the minimal support in percent.
The curves show the number of generated and processed fragments (top left),
the number of processed occurrences (top right),
and the execution time in seconds (bottom left)
for the two canonical forms/extension strategies.
Experiments: Steroids Data Set
[Figure: three plots (fragments/10^5, occurrences/10^6, time/seconds) with curves for breadth-first and depth-first search]
Experimental results on the steroids data.
The horizontal axis shows the absolute minimal support.
The curves show the number of generated and processed fragments (top left),
the number of processed occurrences (top right),
and the execution time in seconds (bottom left)
for the two canonical forms/extension strategies.
Equivalent Sibling Pruning
Alternative Test: Equivalent Siblings
Basic Idea:
If the (sub)graph to extend exhibits a certain symmetry, several extensions
may be equivalent (in the sense that they describe the same (sub)graph).
At most one of these sibling extensions can be in canonical form, namely
the one least restricting future extensions (lexicographically smallest code word).
Identify equivalent siblings and keep only the maximally extendable one.
Test Procedure for Equivalence:
Get any graph in which the two sibling (sub)graphs to compare occur.
(If there is no such graph, the siblings are not equivalent.)
Mark any occurrence of the first (sub)graph in the graph.
Traverse all occurrences of the second (sub)graph in the graph
and check whether all edges of an occurrence are marked.
If there is such an occurrence, the two (sub)graphs are equivalent.
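The test procedure for equivalence can be sketched in a few lines. This is a minimal sketch, not the implementation behind the slides: it assumes that an occurrence is represented as a frozen set of edge identifiers of one database graph, and that both occurrence lists refer to the same database graph. The function name `siblings_equivalent` and the example edge numbers are hypothetical.

```python
from typing import FrozenSet, Iterable

Occurrence = FrozenSet[int]  # edge identifiers of one occurrence in one database graph


def siblings_equivalent(first: Iterable[Occurrence],
                        second: Iterable[Occurrence]) -> bool:
    """Marking test for two sibling (sub)graphs within a single database graph."""
    first = list(first)
    if not first:
        return False          # no graph contains both siblings: not equivalent
    marked = first[0]         # mark any occurrence of the first (sub)graph
    # equivalent if some occurrence of the second (sub)graph is fully marked
    return any(occurrence <= marked for occurrence in second)


# Hypothetical graph: ring bonds 0..5, a C-O bond 6, and a C-C branch bond 7.
ring_plus_o_a = [frozenset({0, 1, 2, 3, 4, 5, 6})]
ring_plus_o_b = [frozenset({0, 1, 2, 3, 4, 5, 6})]
ring_plus_c = [frozenset({0, 1, 2, 3, 4, 5, 7})]

print(siblings_equivalent(ring_plus_o_a, ring_plus_o_b))
print(siblings_equivalent(ring_plus_o_a, ring_plus_c))
```

Because siblings have the same number of edges, a fully marked occurrence of the second (sub)graph covers exactly the marked edge set.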
Alternative Test: Equivalent Siblings
If siblings in the search tree are equivalent,
only the one with the least restrictions needs to be processed.
Example: Mining phenol, p-cresol, and catechol.
[Figure: structural formulas of phenol, p-cresol, and catechol]
Consider extensions of a 6-bond carbon ring (twelve possible occurrences):
[Figure: four of the occurrences of the carbon ring with an attached oxygen atom, each with a different numbering 0 to 5 of the ring atoms]
Only the (sub)graph that least restricts future extensions
(i.e., that has the lexicographically smallest code word) can be in canonical form.
Use depth-first canonical form (rightmost path extensions) and C ≺ O.
Alternative Test: Equivalent Siblings
Test for Equivalent Siblings before Test for Canonical Form
Traverse the sibling extensions and compare each pair.
Of two equivalent siblings remove the one
that restricts future extensions more.
Advantages:
Identifies some code words that are non-canonical in a simple way.
Test of two siblings is at most linear in the number of edges
and at most linear in the number of occurrences.
Disadvantages:
Does not identify all non-canonical code words,
therefore a subsequent canonical form test is still needed.
Compares two sibling (sub)graphs,
therefore it is quadratic in the number of siblings.
Alternative Test: Equivalent Siblings
The effectiveness of equivalent sibling pruning depends on the canonical form:
Mining the IC93 data with 4% minimal support
                              depth-first        breadth-first
equivalent sibling pruning       156 ( 1.9%)       4195 (83.7%)
canonical form pruning          7988 (98.1%)        815 (16.3%)
total pruning                   8144               5010
(closed) (sub)graphs found      2002               2002
Mining the steroids data with minimal support 6
                              depth-first        breadth-first
equivalent sibling pruning     15327 ( 7.2%)     152562 (54.6%)
canonical form pruning        197449 (92.8%)     127026 (45.4%)
total pruning                 212776             279588
(closed) (sub)graphs found      1420               1420
Alternative Test: Equivalent Siblings
Observations:
Depth-first form generates more duplicate (sub)graphs on the IC93 data
and fewer duplicate (sub)graphs on the steroids data (as seen before).
There are only very few equivalent siblings with depth-first form
on both the IC93 data and the steroids data.
(Conjecture: equivalent siblings result from rotated tree branches,
which are less likely to be siblings with depth-first form.)
With breadth-first search canonical form a large part of the (sub)graphs
that are not generated in canonical form (with a canonical code word)
can be filtered out with equivalent sibling pruning.
On the test IC93 data no difference in speed could be observed,
presumably because pruning takes only a small part of the total time.
On the steroids data, however, equivalent sibling pruning
yields a slight speed-up for breadth-first form (≈ 5%).
Canonical Forms based on Adjacency Matrices
Adjacency Matrices
A (normal, that is, unlabeled) graph can be described by an adjacency matrix:
A graph G with n vertices is described by an $n \times n$ matrix $A = (a_{ij})$.
Given a numbering of the vertices (from 1 to n), each vertex is associated
with the row and column corresponding to its number.
A matrix element $a_{ij}$ is 1 if there exists an edge between the vertices
with numbers i and j and 0 otherwise.
Adjacency matrices are not unique:
Different numberings of the vertices lead to different adjacency matrices.
[Figure: the same graph with five vertices under two different vertex numberings]
Adjacency matrix for the first numbering:
    1 2 3 4 5
1   0 1 0 1 0
2   1 0 1 1 0
3   0 1 0 1 1
4   1 1 1 0 0
5   0 0 1 0 0
Adjacency matrix for the second numbering:
    1 2 3 4 5
1   0 1 0 0 0
2   1 0 1 1 0
3   0 1 0 1 1
4   0 1 1 0 1
5   0 0 1 1 0
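That adjacency matrices are not unique can be checked directly. The sketch below rebuilds both matrices of this slide from one edge list. The edge list is read off the first matrix; the renumbering `renumber` is an assumption (one vertex renumbering that reproduces the second matrix), since the drawing itself is not available here.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric n x n adjacency matrix for 1-based vertex numbers."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i - 1][j - 1] = 1
        a[j - 1][i - 1] = 1
    return a


# Edges of the graph under the first numbering (read off the first matrix).
edges = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 4), (3, 5)]

# Assumed renumbering old -> new that yields the second matrix.
renumber = {1: 5, 2: 3, 3: 2, 4: 4, 5: 1}
renumbered = [(renumber[i], renumber[j]) for i, j in edges]

first = adjacency_matrix(5, edges)
second = adjacency_matrix(5, renumbered)

for row in first:
    print(*row)
print()
for row in second:
    print(*row)
```

Both matrices describe the same graph, yet they differ as matrices, which is why a canonical choice among the numberings is needed.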
Extended Adjacency Matrices
A labeled graph can be described by an extended adjacency matrix:
If there is an edge between the vertices with numbers i and j,
the matrix element $a_{ij}$ contains the label of this edge
and a special label (the empty label) otherwise.
There is an additional column containing the vertex labels.
Of course, extended adjacency matrices are also not unique:
[Figure: the example molecule under two different vertex numberings (1 to 9) with the corresponding extended adjacency matrices]
Vertex label column for the first numbering: S N C O C C C O O
Vertex label column for the second numbering: C N C C S C O O O
From Adjacency Matrices to Code Words
An (extended) adjacency matrix can be turned into a code word
by simply listing its elements row by row.
Since for undirected graphs the adjacency matrix is necessarily symmetric,
it suffices to list the elements of the upper (or lower) triangle.
For sparse graphs (few edges) listing only column/label pairs can be advantageous,
because this reduces the code word length.
[Figure: example molecule with vertex numbers 1 to 9 and its extended adjacency matrix]
Regular expression (non-terminals): $(a\,(i_c\,b)^{*})^{n}$
Code word (one row per vertex):
S 2 - 3 -
N 4 - 5 -
C 6 -
O
C 6 - 7 -
C
C 8 - 9 =
O
O
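The rowwise listing of the upper triangle as column/label pairs can be reproduced with a short sketch. The vertex labels and bonds below are read off the code word on this slide; the dictionary layout and the function name `code_word` are illustrative assumptions.

```python
# Example molecule: vertex number -> atom label, (i, j) with i < j -> bond label.
labels = {1: "S", 2: "N", 3: "C", 4: "O", 5: "C", 6: "C", 7: "C", 8: "O", 9: "O"}
bonds = {(1, 2): "-", (1, 3): "-", (2, 4): "-", (2, 5): "-", (3, 6): "-",
         (5, 6): "-", (5, 7): "-", (7, 8): "-", (7, 9): "="}


def code_word(labels, bonds):
    """List each row as vertex label followed by (column, edge label) pairs."""
    tokens = []
    for i in sorted(labels):
        tokens.append(labels[i])
        for j in sorted(labels):
            if j > i and (i, j) in bonds:   # upper triangle only
                tokens += [str(j), bonds[(i, j)]]
    return " ".join(tokens)


print(code_word(labels, bonds))
```

Each row contributes its vertex label and then zero or more column/label pairs, which matches the regular expression given above.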
From Adjacency Matrices to Code Words
With an (arbitrary, but fixed) order on the label set A (and defining that
integer numbers, which are ordered in the usual way, precede all labels),
code words can be compared lexicographically: (S ≺ N ≺ O ≺ C; - ≺ =)
[Figure: the example molecule under two different vertex numberings (1 to 9)]
Code word for the first numbering:
S 2 - 3 - N 4 - 5 - C 6 - O C 6 - 7 - C C 8 - 9 = O O
Code word for the second numbering:
C 2 - 3 - 4 - N 5 - 7 - C 8 - 9 = C 6 - S 6 - C O O O
The first code word is lexicographically smaller than the second.
As for canonical forms based on spanning trees, we then define the lexicographically
smallest (or largest) code word as the canonical code word.
Note that adjacency matrices allow for a much larger number of code words,
because any numbering of the vertices is acceptable.
For canonical forms based on spanning trees, the vertex numbering
must be compatible with a (specific) construction of a spanning tree.
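The lexicographic comparison can be illustrated as follows. This sketch assumes code words are given as space-separated tokens and uses the label order S ≺ N ≺ O ≺ C and - ≺ = from this slide; the name `sort_key` and the tuple encoding (integers first, then labels) are illustrative choices.

```python
VERTEX_ORDER = ["S", "N", "O", "C"]
EDGE_ORDER = ["-", "="]
RANK = {label: r for r, label in enumerate(VERTEX_ORDER + EDGE_ORDER)}


def sort_key(code_word):
    """Map tokens to tuples so that integers precede all labels."""
    return [(0, int(t)) if t.isdigit() else (1, RANK[t])
            for t in code_word.split()]


first = "S 2 - 3 - N 4 - 5 - C 6 - O C 6 - 7 - C C 8 - 9 = O O"
second = "C 2 - 3 - 4 - N 5 - 7 - C 8 - 9 = C 6 - S 6 - C O O O"

print(sort_key(first) < sort_key(second))
print(min([second, first], key=sort_key) == first)
```

Here the decision already falls at the first token, because S precedes C in the chosen order.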
From Adjacency Matrices to Code Words
There is a variety of other ways in which an adjacency matrix
may be turned into a code word:
[Figure: example molecule with vertex numbers 1 to 9 and its extended adjacency matrix]
lower triangle:
S
N 1 -
C 1 -
O 2 -
C 2 -
C 3 - 5 -
C 5 -
O 7 -
O 7 =
columnwise:
S N C O C C C O O
| 1 -
| 1 -
| 2 -
| 2 -
| 3 - 5 -
| 5 -
| 7 -
| 7 =
(Note that the columnwise listing needs a separator character |.)
However, the rowwise listing restricted to the upper triangle (as used before)
has the advantage that it has a property analogous to the prefix property.
In contrast to this, the two forms shown above do not have this property.
Exploiting Vertex Signatures
Canonical Form and Vertex and Edge Labels
Vertex and edge labels help considerably to construct a canonical code word
or to check whether a given code word is canonical:
Canonical form check or construction are usually (much) slower/more difficult
for unlabeled graphs or graphs with few different vertex and edge labels.
The reason is that with vertex and edge labels constructed code word prefixes
may already allow us to make a decision between (sets of) code words.
Intuitive explanation with an extreme example:
Suppose that all vertices of a given (sub)graph have different labels. Then:
The root/first row vertex is uniquely determined:
it is the vertex with the smallest label (w.r.t. the chosen order).
The order of each vertex's neighbors in the canonical form is determined
at least by the vertex labels (but maybe also by the edge labels).
As a consequence, constructing the canonical code word is straightforward.
Canonical Form and Vertex and Edge Labels
The complexity of constructing a canonical code word is caused by equal edge and
vertex labels, which make it necessary to apply a backtracking algorithm.
Question: Can we exploit graph properties (that is, the connection structure)
to distinguish vertices/edges with the same label?
Idea: Describe how the (sub)graph under consideration looks from a vertex.
This can be achieved by constructing a local code word (vertex signature):
Start with the label of the vertex.
If there is more than one vertex with a certain label,
add a (sorted) list of the labels of the incident edges.
If there is more than one vertex with the same list,
add a (sorted) list of the lists of the adjacent vertices.
Continue with the vertices that are two edges away and so on.
Constructing Vertex Signatures
The process of constructing vertex signatures is best described
as an iterative subdivision of equivalence classes:
The initial signature of each vertex is simply its label.
The vertex set is split into equivalence classes
based on the initial vertex signature (that is, the vertex labels).
Equivalence classes with more than one vertex are then processed
by appending the (sorted) labels of the incident edges to the vertex signature.
The vertex set is then repartitioned based on the extended vertex signature.
In a second step the (sorted) signatures of the adjacent vertices are appended.
In subsequent steps these signatures of adjacent vertices are replaced
by the updated vertex signatures.
The process stops when no replacement splits an equivalence class.
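The iterative subdivision can be sketched as a refinement loop. This is a simplified sketch and differs from the procedure above in one respect: every vertex is refined in every round (not only vertices in classes with more than one member), which gives the same partitions for the example molecule but builds longer signatures. The molecule is the one used in the code word examples; the function name `refine` is hypothetical.

```python
labels = {1: "S", 2: "N", 3: "C", 4: "O", 5: "C", 6: "C", 7: "C", 8: "O", 9: "O"}
bonds = {(1, 2): "-", (1, 3): "-", (2, 4): "-", (2, 5): "-", (3, 6): "-",
         (5, 6): "-", (5, 7): "-", (7, 8): "-", (7, 9): "="}


def refine(labels, bonds):
    """Return the sequence of partitions produced by signature refinement."""
    adj = {v: [] for v in labels}
    for (u, v), b in bonds.items():
        adj[u].append((v, b))
        adj[v].append((u, b))

    def classes(sig):
        groups = {}
        for v, s in sig.items():
            groups.setdefault(s, []).append(v)
        return sorted(sorted(g) for g in groups.values())

    sig = dict(labels)                       # step 1: the vertex label
    history = [classes(sig)]
    # step 2: append the sorted labels of the incident edges
    sig = {v: (labels[v], tuple(sorted(b for _, b in adj[v]))) for v in labels}
    history.append(classes(sig))
    while True:
        # further steps: append the sorted signatures of the adjacent vertices
        new = {v: (sig[v], tuple(sorted(sig[u] for u, _ in adj[v])))
               for v in labels}
        if len(classes(new)) == len(history[-1]):
            return history                   # no class was split: stop
        sig = new
        history.append(classes(sig))


for step, partition in enumerate(refine(labels, bonds), start=1):
    print(step, partition)
```

For the example molecule the loop ends with nine singleton classes, that is, a unique vertex labeling.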
Constructing Vertex Signatures
[Figure: example molecule with vertex numbers 1 to 9]
vertex  signature
1       S
2       N
4       O
8       O
9       O
3       C
6       C
5       C
7       C
Vertex Signatures, Step 1
The initial vertex signatures are simply the vertex labels.
There are four equivalence classes: S, N, O, and C.
The equivalence classes S and N need no further processing,
because they already contain only a single vertex.
However, the vertex signatures O and C need to be extended in order to split
the corresponding equivalence classes.
Constructing Vertex Signatures
[Figure: example molecule with vertex numbers 1 to 9]
vertex  signature
1       S
2       N
4       O -
8       O -
9       O =
3       C --
6       C --
5       C ---
7       C --=
Vertex Signatures, Step 2
The vertex signatures of the classes that contain more than one vertex are
extended by the sorted list of labels of the incident edges.
This distinguishes the three oxygen atoms only partially,
because two are incident to a single bond, the other to a double bond.
It also distinguishes most carbon atoms,
because they have different sets of incident edges.
Only the signatures of carbons 3 and 6
and the signatures of oxygens 4 and 8
need to be extended further.
Constructing Vertex Signatures
[Figure: example molecule with vertex numbers 1 to 9]
vertex  signature
1       S
2       N
4       O - N
8       O - C --=
9       O =
3       C -- S C --
6       C -- C -- C ---
5       C ---
7       C --=
Vertex Signatures, Step 3
The vertex signatures of carbons 3 and 6 and of oxygens 4 and 8 are extended
by the sorted list of vertex signatures of the adjacent vertices.
This distinguishes the two pairs
(carbon 3 is adjacent to a sulfur atom,
oxygen 4 is adjacent to a nitrogen atom).
As a result, all equivalence classes contain only a single vertex
and thus we obtained a unique vertex labeling.
With this unique vertex labeling, constructing a canonical code word
becomes very simple and efficient.
Elements of Vertex Signatures
Using only (sorted) lists of labels of incident edges and adjacent vertices
cannot always distinguish all vertices.
Example: For the following two (unlabeled) graphs such vertex signatures
cannot split the sole equivalence class:
[Figure: two unlabeled example graphs]
The equivalence class can be split for the right graph, though, if the number
of adjacent vertices that are adjacent is incorporated into the vertex signature.
There is also a large variety of other graph properties that may be used.
However, for neither graph can the equivalence classes be reduced to single vertices.
For the left graph it is not even possible at all to split the equivalence class.
The reason is that both graphs possess automorphisms other than the identity.
Automorphism Groups
Let $F_{\mathrm{auto}}(G)$ be the set of all automorphisms of a (labeled) graph G.
The orbit of a vertex $v \in V_G$ w.r.t. $F_{\mathrm{auto}}(G)$ is the set
$o(v) = \{ u \in V_G \mid \exists f \in F_{\mathrm{auto}}(G): u = f(v) \}$.
Note that we always have $v \in o(v)$, because the identity is always in $F_{\mathrm{auto}}(G)$.
The vertices in an orbit cannot possibly be distinguished by vertex signatures,
because the graph looks the same from all of them.
In order to deal with orbits, one can exploit that the automorphisms $F_{\mathrm{auto}}(G)$
of a graph G form a group (the automorphism group of G):
During the construction of a canonical code word,
detect automorphisms (vertex numberings leading to the same code word).
From found automorphisms, generators of the group of automorphisms
can be derived. These generators can then be used to avoid exploring
implied automorphisms, thus speeding up the search. [McKay 1981]
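For very small graphs the automorphisms and the orbits can be enumerated by brute force, which makes the definition concrete. The sketch below is not the generator-based search referred to above; it simply tries all vertex permutations of an unlabeled graph. The example graph is the five-vertex graph from the adjacency matrix slide; the function names are hypothetical.

```python
from itertools import permutations


def automorphisms(vertices, edges):
    """All vertex permutations that map the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(vertices):
        f = dict(zip(vertices, perm))
        if {frozenset((f[u], f[v])) for u, v in edges} == edge_set:
            result.append(f)
    return result


def orbits(vertices, autos):
    """Orbit of v: all images f(v) under the automorphisms f."""
    return sorted({tuple(sorted({f[v] for f in autos})) for v in vertices})


vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 4), (3, 5)]

autos = automorphisms(vertices, edges)
print(len(autos))
print(orbits(vertices, autos))
```

In this graph the vertices 2 and 4 form one orbit, so no vertex signature can tell them apart.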
Canonical Form and Vertex Signatures
Advantages of Vertex Signatures:
Vertices with the same label can be distinguished in a preprocessing step.
Constructing canonical code words can thus become much easier/faster,
because the necessary backtracking can often be reduced considerably.
(The gains are usually particularly large for graphs with few/no labels.)
Disadvantages of Vertex Signatures:
Vertex signatures can refer to the graph as a whole
and thus may be different for subgraphs.
(Vertices with different signatures in a subgraph
may have the same signature in a supergraph and vice versa.)
As a consequence it can be difficult to ensure
that the resulting canonical form has the prefix property.
In such a case one may not be able to restrict (sub)graph extensions
or to use the simplified search scheme (only code word checks).
Repository of Processed Fragments
Repository of Processed Fragments
Canonical form pruning is the predominant method
to avoid redundant search in frequent (sub)graph mining.
The obvious alternative, a repository of processed (sub)graphs,
has received fairly little attention. [Borgelt and Fiedler 2007]
Whenever a new (sub)graph is created, the repository is accessed.
If it contains the (sub)graph, we know that it has already been processed
and therefore it can be discarded.
Only (sub)graphs that are not contained in the repository are extended
and, of course, inserted into the repository.
If the repository is laid out as a hash table with a carefully designed
hash function, it is competitive with canonical form pruning.
(In some experiments, the repository-based approach
could outperform canonical form pruning by 15%.)
Repository of Processed Fragments
Each (sub)graph should be stored using a minimal amount of memory
(since the number of processed (sub)graphs is usually huge).
Store a (sub)graph by listing the edges of one occurrence.
(Note that for connected (sub)graphs the edges also identify all vertices.)
The containment test has to be made as fast as possible
(since it will be carried out frequently).
Try to avoid a full isomorphism test with a hash table:
Employ a hash function that is computed from local graph properties.
(Basic idea: combine the vertex and edge attributes and the vertex degrees.)
If an isomorphism test is necessary, do quick checks first:
number of vertices, number of edges, first containing database graph etc.
Actual isomorphism test:
mark stored occurrence and check for fully marked new occurrence
(cf. the procedure of equivalent sibling pruning).
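A repository of this kind can be sketched with a dictionary as the hash table. This is an illustrative sketch under two assumptions: fragments are stored as explicit label and bond dictionaries (not as edge lists of an occurrence), and the isomorphism test is a brute-force search over vertex permutations rather than the marking procedure named above. The names `local_key`, `isomorphic`, and `Repository` are hypothetical.

```python
from itertools import permutations


def local_key(labels, bonds):
    """Hash key from local properties: vertex labels, degrees, edge attributes."""
    degree = {v: 0 for v in labels}
    for u, v in bonds:
        degree[u] += 1
        degree[v] += 1
    vertex_part = tuple(sorted((labels[v], degree[v]) for v in labels))
    edge_part = tuple(sorted((b,) + tuple(sorted((labels[u], labels[v])))
                             for (u, v), b in bonds.items()))
    return vertex_part, edge_part


def isomorphic(g, h):
    """Brute-force isomorphism test for small labeled graphs."""
    (gl, gb), (hl, hb) = g, h
    if len(gl) != len(hl) or len(gb) != len(hb):     # quick checks first
        return False
    gv, hv = sorted(gl), sorted(hl)
    target = {(frozenset(e), b) for e, b in hb.items()}
    for perm in permutations(hv):
        f = dict(zip(gv, perm))
        if all(gl[v] == hl[f[v]] for v in gv) and \
           {(frozenset((f[u], f[v])), b) for (u, v), b in gb.items()} == target:
            return True
    return False


class Repository:
    def __init__(self):
        self.buckets = {}

    def add_if_new(self, labels, bonds):
        """Return True (and store) if the fragment was not processed before."""
        bucket = self.buckets.setdefault(local_key(labels, bonds), [])
        if any(isomorphic((labels, bonds), old) for old in bucket):
            return False
        bucket.append((labels, bonds))
        return True


repo = Repository()
a = ({0: "S", 1: "C", 2: "O", 3: "N"}, {(0, 1): "-", (0, 2): "-", (1, 3): "-"})
b = ({0: "S", 1: "C", 2: "N", 3: "O"}, {(0, 1): "-", (1, 2): "-", (0, 3): "-"})
c = ({0: "S", 1: "C", 2: "N", 3: "O"}, {(0, 1): "-", (1, 2): "-", (2, 3): "-"})
print(repo.add_if_new(*a), repo.add_if_new(*b), repo.add_if_new(*c))
```

Fragments `a` and `b` are the same fragment O-S-C-N under two numberings, so only the first is stored; `c` is a different chain and lands in another bucket without any isomorphism test.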
Canonical Form Pruning versus Repository
Advantage of Canonical Form Pruning
Only one test (for canonical form) is needed in order to determine
whether a (sub)graph needs to be processed or not.
Disadvantage of Canonical Form Pruning
It is most costly for the (sub)graphs that are created in canonical form.
(slowest for fragments that have to be processed)
Advantage of Repository-based Pruning
Often allows to decide very quickly that a (sub)graph has not been processed.
(fastest for fragments that have to be processed)
Disadvantages of Repository-based Pruning
Multiple isomorphism tests may be necessary for a processed fragment.
Needs far more memory than canonical form pruning.
A repository is very difficult to use in a parallel algorithm.
Canonical Form vs. Repository: Execution Times
[Figure: two plots of time/seconds with curves for canonical form pruning and repository-based pruning]
Experimental results on the IC93 data set,
search time in seconds (vertical axis) versus
minimum support in percent (horizontal axis).
Left: maximum source extensions.
Right: rightmost path extensions.
Canonical Form vs. Repository: Numbers of (Sub)Graphs
[Figure: two plots of subgraphs/10 000 with curves for generated, duplicate tests, processed, and duplicates; horizontal axis: minimum support in percent]
Experimental results on the IC93 data set,
numbers of subgraphs used in the search.
Left: maximum source extensions.
Right: rightmost path extensions.
Repository Performance
[Figure: two plots of subgraphs/10 000 with curves for generated, accesses, isomorphism tests, and duplicates; horizontal axis: minimum support in percent]
Experimental results on the IC93 data set,
performance of repository-based pruning.
Left: maximum source extensions.
Right: rightmost path extensions.
Perfect Extension Pruning
Reminder: Perfect Extension Pruning for Item Sets
If only closed item sets or only maximal item sets are to be found,
additional pruning of the search tree becomes possible.
Suppose that during the search we discover that
$s_T(I \cup \{a\}) = s_T(I)$
for some item set I and some item $a \notin I$. (That is, I is not closed.)
We call the item a a perfect extension of I. Then we know
$\forall J \supseteq I: \; s_T(J \cup \{a\}) = s_T(J)$.
This can most easily be seen by considering that $K_T(I) \subseteq K_T(\{a\})$
and hence $K_T(J) \subseteq K_T(\{a\})$, since $K_T(J) \subseteq K_T(I)$.
As a consequence, no superset $J \supseteq I$ with $a \notin J$ can be closed.
Hence a can be added directly to the prefix of the conditional database.
The same basic idea can also be used for graphs, but needs modifications.
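The perfect extension property for item sets can be checked on a toy example. The transaction database below is made up for illustration; `cover` corresponds to $K_T$ and `support` to $s_T$, and the function names are hypothetical.

```python
def cover(database, items):
    """Indices of the transactions that contain all given items (K_T)."""
    return {k for k, t in enumerate(database) if items <= t}


def support(database, items):
    """Number of transactions that contain all given items (s_T)."""
    return len(cover(database, items))


def is_perfect_extension(database, items, a):
    """Item a is a perfect extension of the item set if the support is unchanged."""
    return a not in items and \
        support(database, items | {a}) == support(database, items)


# Made-up transaction database.
database = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]

print(is_perfect_extension(database, {"a"}, "b"))
# The property carries over to every superset J of I = {a}:
for J in ({"a", "c"}, {"a", "d"}):
    print(sorted(J), support(database, J | {"b"}) == support(database, J))
```

Since every transaction containing a also contains b, adding b never changes the support of a superset of {a}.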
Perfect Extensions
An extension of a graph (fragment) is called perfect,
if it can be applied to all of its occurrences in exactly the same way.
Attention: It may not be enough to compare the support
and the number of occurrences of the graph fragment.
(Even though perfect extensions must have the same support and
an integer multiple of the number of occurrences of the base fragment.)
[Figure: example molecules and fragments built from O, C, S, and N atoms, annotated with the embedding counts 2+2, 1+1, and 1+3]
Neither is a single bond to nitrogen a perfect extension of O-C-S-C
nor is a single bond to oxygen a perfect extension of N-C-S-C.
However, we need that a perfect extension of a graph fragment
is also a perfect extension of any supergraph of this fragment.
Consequence: It may be necessary to check whether all occurrences
of the base fragment lead to the same number of extended occurrences.
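The consequence stated above amounts to counting extended occurrences per base occurrence. A minimal sketch, assuming each extended occurrence records the identifier of the base occurrence it was grown from; the identifiers and the function name are made up.

```python
from collections import Counter


def is_perfect_graph_extension(base_occurrences, extended_parents):
    """True if every base occurrence yields the same positive number
    of extended occurrences."""
    counts = Counter(extended_parents)
    per_base = {counts.get(occ, 0) for occ in base_occurrences}
    return len(per_base) == 1 and 0 not in per_base


# Made-up example: 2+2 base occurrences in two database graphs.
base = ["g1/a", "g1/b", "g2/a", "g2/b"]

# 1+3 extended occurrences: same support, 4 = 1 * 4 occurrences, yet not perfect.
uneven = ["g1/a", "g2/a", "g2/a", "g2/b"]
# One extended occurrence per base occurrence: perfect.
even = ["g1/a", "g1/b", "g2/a", "g2/b"]

print(is_perfect_graph_extension(base, uneven))
print(is_perfect_graph_extension(base, even))
```

The uneven case shows why comparing only the support and the total number of occurrences is not sufficient.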
Partial Perfect Extension Pruning
Basic idea of perfect extension pruning:
First grow a fragment to the biggest common substructure.
Partial perfect extension pruning: If the children of a search tree vertex
are ordered lexicographically (w.r.t. their code word), no fragment in a subtree
to the right of a perfect extension branch can be closed. [Yan and Han 2003]
[Figure: three example molecules and the search tree for seed S with partial perfect extension pruning; nodes are annotated with their support; breadth-first search canonical form with S ≺ F ≺ N ≺ C ≺ O; - ≺ =]
Full Perfect Extension Pruning
Full perfect extension pruning: [Borgelt and Meinl 2006]
Also prune the branches to the left of the perfect extension branch.
Problem: This pruning method interferes with canonical form pruning,
because the extensions in the left siblings cannot be repeated in the perfect
extension branch (restricted extensions, simple rules for canonical form).
[Figure: three example molecules and the search tree for seed S with full perfect extension pruning; nodes are annotated with their support; breadth-first search canonical form with S ≺ F ≺ N ≺ C ≺ O; - ≺ =]
Code Word Reorganization
Restricted extensions:
Not all extensions of a fragment are allowed by the canonical form.
Some can be checked by simple rules (rightmost path/max. source extension).
Consequence: In order to make canonical form pruning and full perfect
extension pruning compatible, the restrictions on extensions must be mitigated.
Example:
The core problem of obtaining the search tree on the previous slide is
how we can avoid that the fragment O-S-C-N is pruned as non-canonical:
The breadth-first search canonical code word for this fragment is
S 0-C1 0-O2 1-N3
However, with the search tree on the previous slide it is assigned
S 0-C1 1-N2 0-O3
Solution: Deviate from appending the description of a new edge.
Allow for a (strictly limited) code word reorganization.
Code Word Reorganization
In order to obtain a proper code, it must be possible to shift descriptions
of new edges past descriptions of perfect extension edges in the code word.
The code word of a fragment consists of two parts:
a prefix ending with the last non-perfect extension edge and
a (possibly empty) suffix of perfect extension edges.
A new edge description is usually appended at the end of the code word.
This is still the standard procedure if the suffix is empty.
However, if the suffix is not empty, the description of the new edge
may be inserted into the suffix or even moved directly before the suffix.
(Whichever possibility yields the lexicographically smallest code word.)
Rather than to actually shift and modify edge descriptions,
it is technically easier to rebuild the code word from the front.
(In particular, renumbering the vertices is easier.)
Code Word Reorganization: Example
Shift an extension to the proper place and renumber the vertices:
1. Base fragment: S-C-N            canonical code:       S 0-C1 1-N2
2. Extension to O-S-C-N            (non-canonical!) code: S 0-C1 1-N2 0-O3
3. Shift extension                 (invalid) code:       S 0-C1 0-O3 1-N2
4. Renumber vertices               canonical code:       S 0-C1 0-O2 1-N3
Rebuild the code word from the front:
The root vertex (here the sulfur atom) is always in the fixed part.
It receives the initial vertex index, that is, 0 (zero).
Compare two possible code word prefixes: S 0-O1 and S 0-C1.
Fix the latter, since it is lexicographically smaller.
Compare the code word prefixes S 0-C1 0-O2 and S 0-C1 1-N2.
Fix the former, since it is lexicographically smaller.
Append the remaining perfect extension edge: S 0-C1 0-O2 1-N3
(Breadth-first search canonical form; S ≺ N ≺ C ≺ O; - ≺ =)
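The four steps of the shift-and-renumber example can be traced in code. This is a deliberately narrow sketch: it handles tree-shaped fragments like the one in the example, sorts edge descriptions by source index, bond label, and atom label, and renumbers destination vertices in order of appearance. It does not search for the lexicographically smallest code word in general (renumbering can change source indices, which a full implementation would have to account for). All names are hypothetical.

```python
ATOM_RANK = {"S": 0, "N": 1, "C": 2, "O": 3}   # S < N < C < O
BOND_RANK = {"-": 0, "=": 1}                   # - < =


def edge_key(edge):
    """Sort key of an edge description (source, bond, atom, destination)."""
    src, bond, atom, dst = edge
    return (src, BOND_RANK[bond], ATOM_RANK[atom])


def shift_and_renumber(code, extension):
    """Append the extension, shift it to its place, renumber the vertices."""
    edges = sorted(code + [extension], key=edge_key)   # steps 2 and 3
    renumber = {0: 0}                                  # the root keeps index 0
    for _, _, _, dst in edges:                         # step 4
        renumber.setdefault(dst, len(renumber))
    return [(renumber[s], b, a, renumber[d]) for s, b, a, d in edges]


def show(root, edges):
    return root + " " + " ".join(f"{s}{b}{a}{d}" for s, b, a, d in edges)


base = [(0, "-", "C", 1), (1, "-", "N", 2)]            # S 0-C1 1-N2
result = shift_and_renumber(base, (0, "-", "O", 3))    # add 0-O3
print(show("S", result))
```

The printed code word is the canonical code word of O-S-C-N from step 4 of the example.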
Perfect Extensions: Problems with Cycles/Rings
[Figure: two example molecules containing N, O, and a ring of carbon atoms, and the search tree for seed N]
Problem: Perfect extensions in cycles may not allow for pruning.
Consequence: Additional constraint [Borgelt and Meinl 2006]
Perfect extensions must be bridges or edges closing a cycle/ring.
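The additional constraint relies on two simple graph tests, which can be sketched as follows. An edge is a bridge if removing it disconnects its end points; an extension edge closes a cycle if both of its end points already belong to the fragment. Which graph the bridge test is applied to (database graph or extended fragment) is not stated here, so the sketch just takes a graph as input. The example is the molecule used for the code words earlier; the function names are hypothetical.

```python
def is_bridge(vertices, edges, edge):
    """True if removing the edge disconnects its two end points."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if (u, v) != edge:
            adj[u].add(v)
            adj[v].add(u)
    seen, stack = {edge[0]}, [edge[0]]
    while stack:                       # depth-first search without the edge
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return edge[1] not in seen


def closes_cycle(fragment_vertices, edge):
    """True if both end points of the new edge are already in the fragment."""
    return edge[0] in fragment_vertices and edge[1] in fragment_vertices


vertices = range(1, 10)
edges = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (5, 6), (5, 7), (7, 8), (7, 9)]

print([e for e in edges if is_bridge(vertices, edges, e)])
print(closes_cycle({1, 2, 3, 5, 6}, (5, 6)))
```

In the example, the bonds of the five-membered ring are not bridges, while the bonds leading to the side chains are.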
Experiments: IC93 without Ring Mining
[Figure: three plots (fragments/10^4, occurrences/10^6, nodes/10^3) with curves for full, partial, and no perfect extension pruning]
Experimental results on the IC93 data, obtained without ring mining
(single bond extensions). The horizontal axis shows the minimal support in percent.
The curves show the number of generated fragments (top left),
the number of processed occurrences (bottom left),
and the number of search tree nodes (top right)
for the three different methods.
Experiments: IC93 with Ring Mining
[Figure: three plots (fragments/10^3, occurrences/10^5, nodes/10^3) with curves for full, partial, and no perfect extension pruning]
Experimental results on the IC93 data, obtained with ring mining.
The horizontal axis shows the minimal support in percent.
The curves show the number of generated fragments (top left),
the number of processed occurrences (bottom left),
and the number of search tree nodes (top right)
for the three different methods.
Extensions for Molecular Fragment Mining
Extensions of the Search Algorithm
Rings [Hofer, Borgelt, and Berthold 2004; Borgelt 2006]
Preprocessing: Find rings in the molecules and mark them.
In the search process: Add all atoms and bonds of a ring in one step.
Considerably improves efficiency and interpretability.
Carbon Chains [Meinl, Borgelt, and Berthold 2004]
Add a carbon chain in one step, ignoring its length.
Extensions by a carbon chain match regardless of the chain length.
Wildcard Atoms [Hofer, Borgelt, and Berthold 2004]
Define classes of atoms that can be seen as equivalent.
Combine fragment extensions with equivalent atoms.
Infrequent fragments that differ only in a few atoms
from frequent fragments can be found.
Ring Mining: Treat Rings as Units
General Idea of Ring Mining
A ring (cycle) is either contained in a fragment as a whole or not at all.
Filter Approaches
(Sub)graphs/fragments are grown edge by edge (as before).
Found frequent graph fragments are filtered:
Graph fragments with incomplete rings are discarded.
Additional search tree pruning:
Prune subtrees that yield only fragments with incomplete rings.
Reordering Approach
If an edge is added that is part of one or more rings,
(one of) the containing ring(s) is added as a whole (all of its edges are added).
Incompatibilities with canonical form pruning are handled
by reordering code words (similar to full perfect extension pruning).
Ring Mining: Preprocessing
Ring mining is simpler after preprocessing the rings in the graphs to analyze.
Basic Preprocessing: (for filter approaches)
Mark all edges of rings in a user-specified size range.
(molecular fragment mining: usually rings with 5 to 6 vertices/atoms)
Technically, there are two ring identification parts per edge:
A marker in the edge attribute,
which fundamentally distinguishes ring edges from non-ring edges.
A set of flags identifying the different rings an edge is contained in.
(Note that an edge can be part of several rings.)
Extended Preprocessing: (for reordering approach)
[Figure: example ring system containing a nitrogen atom, with vertices numbered 0 to 9]
Mark pseudo-rings, that is, rings of smaller size than the user specified, but which
consist only of edges that are part of rings within the user-specified size range.
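The basic preprocessing (ring marker plus ring flags per edge) can be sketched with a simple cycle enumeration. This sketch assumes an unlabeled edge list, enumerates simple cycles in the given size range by depth-first search, and does not handle pseudo-rings. The example graph is the molecule used for the code words earlier; all function names are hypothetical.

```python
def find_rings(vertices, edges, min_size, max_size):
    """Enumerate simple cycles with min_size..max_size vertices."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    rings = set()

    def extend(path):
        for w in adj[path[-1]]:
            if w == path[0]:
                if len(path) >= max(3, min_size):
                    closed = zip(path, path[1:] + path[:1])
                    rings.add(tuple(sorted(tuple(sorted(e)) for e in closed)))
            elif w > path[0] and w not in path and len(path) < max_size:
                extend(path + [w])       # only visit vertices above the start

    for start in sorted(vertices):
        extend([start])
    return sorted(rings)


def mark_ring_edges(vertices, edges, min_size=5, max_size=6):
    """Per edge: a ring marker and the set of flags of the containing rings."""
    flags = {tuple(sorted(e)): set() for e in edges}
    for index, ring in enumerate(find_rings(vertices, edges, min_size, max_size)):
        for e in ring:
            flags[e].add(index)
    return {e: {"ring": bool(f), "flags": f} for e, f in flags.items()}


vertices = range(1, 10)
edges = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (5, 6), (5, 7), (7, 8), (7, 9)]
marks = mark_ring_edges(vertices, edges)
print(sorted(e for e, m in marks.items() if m["ring"]))
```

Each ring is found in both directions by the search; storing it as a sorted tuple of edges removes the duplicate.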
Filter Approaches: Open Rings
Idea of Open Ring Filtering:
If we require the output to have only complete rings, we have to identify and
remove fragments with ring edges that do not belong to any complete ring.
Ring edges have been marked in the preprocessing.
It is known which edges of a grown (sub)graph are ring edges
(in the underlying graphs of the database).
Apply the preprocessing procedure to a grown (sub)graph, but
keep the marker in the edge attribute;
only set the flags that identify the rings an edge is contained in.
Check for edges that have a ring marker in the edge attribute,
but did not receive any ring flag when the (sub)graph was reprocessed.
If such edges exist, the (sub)graph contains unclosed/open rings,
so the (sub)graph must not be reported.
Filter Approaches: Unclosable Rings
Idea of Unclosable Ring Filtering:
Grown (sub)graphs with open rings that cannot be closed by future extensions
can be pruned from the search.
Canonical form pruning allows to restrict the possible extensions of a fragment.
Due to previous extensions certain vertices become unextendable.
Some rings cannot be closed by extending a (sub)graph.
Obviously, a necessary (though not sufficient) condition for all rings being closed
is that every vertex has either zero or at least two incident ring edges.
If there is a vertex with only one incident ring edge,
this edge must be part of an incomplete ring.
If an unextendable vertex of a grown (sub)graph has only one incident ring edge,
this (sub)graph can be pruned from the search
(because there is an open ring that can never be closed).
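The degree condition for unclosable rings is easy to state in code. A minimal sketch, assuming the ring edges of the grown (sub)graph and the set of unextendable vertices are given; the function names and the example fragment are made up.

```python
from collections import Counter


def ring_edge_counts(ring_edges):
    """Number of incident ring edges per vertex."""
    counts = Counter()
    for u, v in ring_edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def has_open_ring(ring_edges):
    """Some vertex has exactly one incident ring edge: a ring is incomplete."""
    return any(c == 1 for c in ring_edge_counts(ring_edges).values())


def can_be_pruned(ring_edges, unextendable):
    """An unextendable vertex with one incident ring edge can never be fixed."""
    counts = ring_edge_counts(ring_edges)
    return any(counts[v] == 1 for v in unextendable)


# Made-up fragment: two ring edges 0-1 and 1-2 of a not yet completed ring.
ring_edges = [(0, 1), (1, 2)]
print(has_open_ring(ring_edges))
print(can_be_pruned(ring_edges, unextendable={0}))
print(can_be_pruned(ring_edges, unextendable={1}))
```

Vertex 1 already has two incident ring edges, so its being unextendable does not rule out closing the ring via the vertices 0 and 2.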
Reminder: Restricted Extensions
[Figure: example molecule and its spanning trees with vertex numberings for depth-first search (A) and breadth-first search (B)]
A (depth-first): S 0, N 1, O 2, C 3, C 4, O 5, O 6, C 7, C 8
B (breadth-first): S 0, N 1, C 2, O 3, C 4, C 5, C 6, O 7, O 8
Extendable Vertices:
A: vertices on the rightmost path, that is, 0, 1, 3, 7, 8.
B: vertices with an index no smaller than the maximum source, that is, 6, 7, 8.
Edges Closing Cycles:
A: none, because the existing cycle edge has the smallest possible source.
B: the edge between the vertices 7 and 8.
Filter Approaches: Merging Ring Extensions
Idea of Merging Ring Extensions:
The previous methods work on individual edges and hence cannot always detect
if an extension only leads to fragments with complete rings that are infrequent.
Add all edges of a ring, thus distinguishing extensions that
start with the same individual edge, but
lead into rings of different size or different composition.
[Figure: two fragments with N, O, and C atoms whose ring extensions start with the same edge but lead into different rings]
Determine the support of the grown (sub)graphs and prune infrequent ones.
Trim and merge ring extensions that share the same initial edge.
Advantage of Merging Ring Extensions:
All extensions are removed that become infrequent when completed into rings.
All occurrences are removed that lead to infrequent (sub)graphs
once rings are completed.
A Reordering Approach
Drawback of Filtering:
(Sub)graphs are still extended edge by edge. Fragments grow fairly slowly.
Better Approach:
Add all edges of a ring in one step. (When a ring edge is added,
create one extended (sub)graph for each ring it is contained in.)
Reorder certain edges in order to comply with canonical form pruning.
Problems of a Reordering Approach:
One must allow for insertions between already added ring edges
(because branches may precede ring edges in the canonical form).
One must not commit too early to an order of the edges
(because branches may influence the order of the ring edges).
All possible orders of (locally) equivalent edges must be tried,
because any of them may produce valid output.
Problems of Reordering Approaches
One must not commit too early to an order of the edges.
lustration cccts ot attachin a ranch to an asymmctric rin N O C, - =
N O O
0
2
4
5
3
1
N 0-C1 0-C2 1-C3 2-C4 3-C5 4=C5
N O O
0
1
3
5
4
2
N 0-C1 0-C2 1-C3 2-C4 3=C5 4-C5
N O O
0
2
5
6
3
1
4
N 0-C1 0-C2 1-C3 2-O4 2-C5 3=C6 5-C6
N O O
0
1
4
6
5
2
3
N 0-C1 0-C2 1-O3 1-C4 2-C5 3-C6 5=C6
\rt a rcadth-rst scarch canonica torm, thc cdcs ot thc rin
can c ordcrcd in two dicrcnt ways (ujjcr two rows)
Thc ujjcr,ctt is thc canonica torm ot thc jurc rin
\ith an attachcd ranch (cosc to thc root vcrtcx),
thc othcr ordcrin ot thc rin cdcs (owcr,riht) is thc canonica torm
Keeping Non-Canonical Fragments
Solution of the early commitment problem:
Maintain (and extend) both orderings of the ring edges and
allow for deviations from the canonical form beyond fixed edges.
Principle: keep (and, consequently, also extend) fragments that are not in
canonical form, but that could become canonical once branches are added.
Needed: a rule which non-canonical fragments to keep and which to discard.
Idea: adding a ring can be seen as adding its initial edge as in an edge-by-edge
procedure, and some additional edges, the positions of which are not yet fixed.
As a consequence we can split the code word into two parts:
a fixed prefix, which is also built by an edge-by-edge procedure, and
a volatile suffix, which consists of the additional (ring) edges.
Keeping Non-Canonical Fragments
Fixed prefix of a code word:
The prefix of the code word up to (and including)
the last edge added in an edge-by-edge manner.
Volatile suffix of a code word:
The suffix of the code word after the last edge
added in an edge-by-edge manner (and excluding it).
Rule for keeping non-canonical fragments:
If the current code word deviates from the canonical code word
in the fixed part, the fragment is pruned, otherwise it is kept.
Justification of this rule:
If the deviation is in the fixed part, no later addition of edges
can have any effect on it, since the fixed part will never be changed.
If, however, the deviation is in the volatile part, a later extension edge
may be inserted in such a way that the code word becomes canonical.
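The keeping rule can be illustrated with a small Python sketch. It assumes that a code word is given as a list of edge descriptions, that the canonical code word of the same fragment is already known, and that the length of the fixed prefix is tracked by the caller. The function name `keep_fragment` and this representation are illustrative assumptions, not part of the original material.

```python
from typing import Sequence


def keep_fragment(code: Sequence[str], canonical: Sequence[str], fixed_len: int) -> bool:
    """Return True if a fragment with code word `code` should be kept.

    `canonical` is the canonical code word of the same fragment and
    `fixed_len` is the number of leading edges that form the fixed prefix.
    """
    for pos, (cur, can) in enumerate(zip(code, canonical)):
        if cur != can:
            # First deviation found: prune only if it lies in the fixed prefix.
            return pos >= fixed_len
    return True  # no deviation at all: the code word is already canonical


# Two orderings of the edges of an asymmetric six-membered ring.
canonical = ["0-C1", "0-C2", "1-C3", "2-C4", "3-C5", "4=C5"]
other = ["0-C1", "0-C2", "1-C3", "2-C4", "3=C5", "4-C5"]

print(keep_fragment(other, canonical, fixed_len=1))  # deviation is in the volatile suffix
print(keep_fragment(other, canonical, fixed_len=5))  # deviation is in the fixed prefix
```

With only the first ring edge fixed, the deviation at the fifth edge lies in the volatile suffix, so the fragment is kept. Once five edges are fixed, the same deviation leads to pruning.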
Search Tree for an Asymmetric Ring with Branches
Maintain (and extend) both orderings of the ring edges and
allow for deviations from the canonical form beyond fixed edges.
[Figure: search tree for an asymmetric ring with branches; root N, then the ring with both vertex numberings, then the fragments with one and two attached O branches]
The edges of a grown subgraph are split into
fixed edges (edges that could have been added in an edge-by-edge manner),
volatile edges (edges that have been added with ring extensions
and before/between which edges may be inserted).
Search Tree for an Asymmetric Ring with Branches
The search constructs the ring with both possible numberings of the vertices.
The form on the left is canonical, so it is kept.
In the fragment on the right only the first ring bond is fixed,
all other bonds are volatile.
Since the code word for this fragment deviates from the canonical one
only at the 5th bond, we may not discard it.
On the next level, there are two canonical and two non-canonical fragments.
The non-canonical fragments both differ in the fixed part,
which now consists of the first three bonds, and thus are pruned.
On the third level, there is one canonical and one non-canonical fragment.
The non-canonical fragment differs in the volatile part (the first four bonds
are fixed, but it deviates from the canonical code word only in the 7th bond)
and thus may not be pruned from the search.
Connected and Nested Rings
Connected and nested rings can pose problems, because in the presence of
equivalent edges the order of these edges cannot be determined locally.
[Figure: fragments with connected and nested rings (root N) and their vertex numberings]
Edges are (locally) equivalent if they start from the same vertex, have the same
edge attribute, and lead to vertices with the same vertex attribute.
Equivalent edges must be spliced in all ways, in which the order of the edges
already in the (sub)graph and the order of the newly added edges is preserved.
It is necessary to consider pseudo-rings for extensions,
because otherwise not all orders of equivalent edges are generated.
Splicing Equivalent Edges
In principle, all possible orders of equivalent edges have to be considered,
because any of them may in the end yield the canonical form.
We cannot (always) decide locally which is the right order,
because this may depend on edges added later.
Nevertheless, we may not reorder equivalent edges freely,
as this would interfere with keeping certain non-canonical fragments:
By keeping some non-canonical fragments we already consider some variants
of orders of equivalent edges. These must not be generated again.
Splicing rule for equivalent edges: (breadth-first search canonical form)
The order of the equivalent edges already in the fragment must be maintained,
and the order of the equivalent new edges must be maintained.
The two sequences of equivalent edges may be merged in a zipper-like manner,
selecting the next edge from either list, but preserving the order in each list.
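The zipper-like merging described by the splicing rule amounts to enumerating all interleavings of two lists that preserve the internal order of each list. The following Python sketch shows this enumeration; the function name `splice` and the edge names are hypothetical and chosen only for illustration.

```python
from typing import Iterator, List, Sequence


def splice(old: Sequence[str], new: Sequence[str]) -> Iterator[List[str]]:
    """Yield all merges of `old` and `new` that keep the order within each list."""
    if not old:
        yield list(new)
        return
    if not new:
        yield list(old)
        return
    # Either the next edge comes from the edges already in the fragment ...
    for rest in splice(old[1:], new):
        yield [old[0]] + rest
    # ... or it comes from the newly added (ring) edges.
    for rest in splice(old, new[1:]):
        yield [new[0]] + rest


for order in splice(["old1", "old2"], ["new1"]):
    print(order)
```

For lists of lengths p and q this yields "p + q choose p" orders, which is why the number of spliced variants grows quickly when many equivalent edges are present.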
The Necessity of Pseudo-Rings
The splicing rule explains the necessity of pseudo-rings:
Without pseudo-rings it is impossible to achieve canonical form in some cases.
[Figure: fragments with connected and nested rings (root N) and their vertex numberings, as on the slide "Connected and Nested Rings"]
If we could only add the 5-ring and the 6-ring, but not the 3-ring,
the upward bond from the atom numbered 1 would always precede
at least one of the other two bonds that are equivalent to it
(since the order of existing bonds must be preserved).
However, in the canonical form the upward bond succeeds both other bonds,
and this we can achieve only by adding the 3-bond ring first.
Splicing Equivalent Edges
The considered splicing rule is for a breadth-first search canonical form.
In this form equivalent edges are adjacent in the canonical code word.
In a depth-first search canonical form equivalent edges
can be far apart from each other in the code word.
Nevertheless some splicing is necessary to properly treat equivalent edges
in this canonical form, even though the rule is slightly simpler.
Splicing rule for equivalent edges: (depth-first search canonical form)
The first new ring edge has to be tried in all locations in the volatile part
of the code word, where equivalent edges can be found.
Since we cannot decide locally which of these edges should be followed first
when building the spanning tree, we have to try all of these possibilities
in order not to miss the canonical one.
Avoiding Duplicate Fragments
The splicing rules still allow that the same fragment can be reached in the same
form in different ways, namely by adding (nested) rings in different orders.
Reason: we cannot always distinguish between two different orders
in which two rings sharing a vertex are added.
Needed: an augmented canonical form test.
Ideas underlying such an augmented test:
The requirement of complete rings introduces dependences between edges:
The presence of certain edges enforces the presence of certain other edges.
The same code word of a fragment is created several times,
but each time with a different fixed part:
The position of the first edge of a ring extension (after reordering)
is the end of the fixed part of the (extended) code word.
Ring Key Pruning
Dependences between Edges
The requirement of complete rings introduces dependences between edges.
(Idea: consider forming sub-fragments with only complete rings.)
A ring edge $e_1$ of a fragment enforces the presence of another ring edge $e_2$
iff the set of rings containing $e_1$ is a subset of the set of rings containing $e_2$.
[...]
N 0-C1 0-C2 is not a ring key, because it does not enforce, for example, $e_3$.
Example:
[Figure: input graph with vertex labels B A B A B, subgraph with vertex labels B A B, and the occurrences of the subgraph with their vertex numberings]
Maximum Independent Set Support
Let $G = (V, E)$ be an (undirected) graph with vertex set $V$
and edge set $E \subseteq V \times V - \{(v, v) \mid v \in V\}$.
An independent vertex set of $G$ is a set $I \subseteq V$ with $\forall u, v \in I: (u, v) \notin E$.
$I$ is a maximum independent vertex set iff
it is an independent vertex set and
for all independent vertex sets $J$ of $G$ it is $|I| \geq |J|$.
Notes: Finding a maximum independent vertex set is an NP-complete problem.
However, a greedy algorithm usually gives very good approximations.
Let $O = (V_O, E_O)$ be the overlap graph of the occurrences
of a labeled graph $S = (V_S, E_S, \ell_S)$ in a labeled graph $G = (V_G, E_G, \ell_G)$.
The maximum independent set support (or MIS-support for short)
of $S$ w.r.t. $G$ is the size of a maximum independent vertex set of $O$.
Finding a Maximum Independent Set
Unmark all vertices of the overlap graph.
Exact Backtracking Algorithm
Find an unmarked vertex with maximum degree and try two possibilities:
Select it for the MIS, that is, mark it as selected and
mark all of its neighbors as excluded.
Exclude it from the MIS, that is, mark it as excluded.
Process the rest recursively and record the best solution found.
Heuristic Greedy Algorithm
Select a vertex with the minimum number of unmarked neighbors and
mark all of its neighbors as excluded.
Process the rest of the graph recursively.
In both algorithms vertices with less than two unmarked neighbors
can be selected and all of their neighbors marked as excluded.
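Both procedures can be sketched in a few lines of Python. The sketch assumes that the overlap graph is given as a dictionary that maps each vertex to the set of its neighbors; instead of explicit marks it keeps a set of still unmarked vertices. The names `greedy_mis` and `exact_mis` are illustrative, and of the shortcut for vertices with fewer than two unmarked neighbors only the case of zero neighbors is implemented.

```python
from typing import Dict, Hashable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]


def greedy_mis(adj: Graph) -> Set[Hashable]:
    """Heuristic: repeatedly select a vertex with the fewest unmarked neighbors."""
    unmarked = set(adj)
    selected: Set[Hashable] = set()
    while unmarked:
        v = min(unmarked, key=lambda u: len(adj[u] & unmarked))
        selected.add(v)
        unmarked -= adj[v] | {v}  # the neighbors of v are excluded
    return selected


def exact_mis(adj: Graph, unmarked: Optional[Set[Hashable]] = None) -> Set[Hashable]:
    """Exact backtracking: branch on an unmarked vertex of maximum degree."""
    if unmarked is None:
        unmarked = set(adj)
    if not unmarked:
        return set()
    v = max(unmarked, key=lambda u: len(adj[u] & unmarked))
    # Possibility 1: select v and exclude all of its neighbors.
    with_v = {v} | exact_mis(adj, unmarked - adj[v] - {v})
    if not adj[v] & unmarked:
        return with_v  # v has no unmarked neighbors, so selecting it is always safe
    # Possibility 2: exclude v.
    without_v = exact_mis(adj, unmarked - {v})
    return with_v if len(with_v) >= len(without_v) else without_v


# Hypothetical overlap graph: a path of five occurrences 1 - 2 - 3 - 4 - 5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(len(greedy_mis(path)), len(exact_mis(path)))
```

The size of the returned set is the MIS-support when the exact variant is used and a lower bound on it when the greedy variant is used.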
Anti-Monotonicity of MIS-Support: Preliminaries
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs.
Let $T = (V_T, E_T, \ell_T)$ be a (non-empty) proper subgraph of $S$
(that is, $V_T \subset V_S$, $E_T = (V_T \times V_T) \cap E_S$, and $\ell_T \equiv \ell_S|_{V_T \cup E_T}$).
Let $f$ be an occurrence of $S$ in $G$.
An occurrence $f'$ of the subgraph $T$ is called a T-ancestor of the occurrence $f$
iff $f' \equiv f|_{V_T}$, that is, if $f'$ coincides with $f$ on the vertex set $V_T$ of $T$.
Observations:
For given $G$, $S$, $T$ and $f$ the T-ancestor $f'$ of the occurrence $f$ is uniquely defined.
Let $f_1$ and $f_2$ be two (non-identical, but maybe equivalent) occurrences of $S$ in $G$.
$f_1$ and $f_2$ overlap if there exist overlapping T-ancestors $f'_1$ and $f'_2$
of the occurrences $f_1$ and $f_2$, respectively.
(Note: The inverse implication does not hold generally.)
Anti-Monotonicity of MIS-Support: Proof
Theorem: MIS-support is anti-monotone.
Proof: We have to show that the MIS-support of a subgraph $S$ w.r.t. a graph $G$
cannot exceed the MIS-support of any (non-empty) proper subgraph $T$ of $S$.
Let $I$ be an arbitrary independent vertex set of the overlap graph $O$ of $S$ w.r.t. $G$.
The set $I$ induces a subset $I'$ of the vertices of the overlap graph $O'$
of an (arbitrary, but fixed) subgraph $T$ of the considered subgraph $S$,
which consists of the (uniquely defined) T-ancestors of the vertices in $I$.
It is $|I| = |I'|$, because no two elements of $I$ can have the same T-ancestor.
With similar argument: $I'$ is an independent vertex set of the overlap graph $O'$.
[...]
Let $H = (V_H, E_H)$ be the harmful overlap graph of the occurrences
of a labeled graph $S = (V_S, E_S, \ell_S)$ in a labeled graph $G = (V_G, E_G, \ell_G)$.
The harmful overlap support (or HO-support for short) of the graph $S$ w.r.t. $G$
is the size of a maximum independent vertex set of $H$.
Theorem: HO-support is anti-monotone.
Proof: Identical to the proof for MIS-support.
(The same two observations hold, which were all that was needed.)
Harmful Overlap Graphs and Ancestor Relations
[Figure: input graph with vertex labels B A B A B, the occurrences of several subgraphs over the labels A and B, and their ancestor relations]
Subgraph Support Computation
Checking whether two occurrences overlap is easy, but:
How do we check whether two occurrences overlap harmfully?
Core ideas of the harmful overlap test:
Try to construct a subgraph $S_E = (V_E, E_E, \ell_E)$ that yields equivalent ancestors
of two given occurrences $f_1$ and $f_2$ of a graph $S = (V_S, E_S, \ell_S)$.
For such a subgraph $S_E$ the mapping $g: V_E \to V_E$ with $v \mapsto f_2^{-1}(f_1(v))$,
where $f_2^{-1}$ is the inverse of $f_2$, must be a bijective mapping.
More generally, $g$ must be an automorphism of $S_E$,
that is, a subgraph isomorphism of $S_E$ to itself.
Exploit the properties of automorphism
to exclude vertices from the graph $S$ that cannot be in $V_E$.
[...]
and $W_2 = \{v \in V_S \mid f_2(v) \in V_1 \cap V_2\}$.
3) If $V_E = W_1 \cap W_2 = \emptyset$, return false, otherwise return true.
$V_E$ is the vertex set of a subgraph $S_E$ that induces equivalent ancestors.
Any vertex $v \in V_S - V_E$ cannot contribute to such equivalent ancestors.
Hence $V_E$ is a maximal set of vertices for which $g$ is a bijection.
Restriction to Connected Subgraphs
The search for frequent subgraphs is usually restricted to connected graphs.
We cannot conclude that no edge is needed if the subgraph $S_E$ is not connected:
there may be a connected subgraph of $S_E$ that induces equivalent ancestors
of the occurrences $f_1$ and $f_2$.
[...]
1) Let $E_1 = \{(v_1, v_2) \in E_G \mid \exists (u_1, u_2) \in E_S: (v_1, v_2) = (f_1(u_1), f_1(u_2))\}$
and $E_2 = \{(v_1, v_2) \in E_G \mid \exists (u_1, u_2) \in E_S: (v_1, v_2) = (f_2(u_1), f_2(u_2))\}$.
2) Let $F_1 = \{(v_1, v_2) \in E_S \mid (f_1(v_1), f_1(v_2)) \in E_1 \cap E_2\}$
and $F_2 = \{(v_1, v_2) \in E_S \mid (f_2(v_1), f_2(v_2)) \in E_1 \cap E_2\}$.
3) Let $E_E = F_1 \cap F_2$.
[...]
(reminder: $\forall v \in V_E: g(v) = f_2^{-1}(f_1(v))$, $g$ is an automorphism of $S_E$)
Then it is either $W = \emptyset$ or $W = V_C$.
Proof: (by contradiction)
Suppose that there is a connected component $S_C$ with $W \neq \emptyset$ and $W \neq V_C$.
[...]
(Note that $V'_E$ does not contain isolated vertices.)
3) Let $S_C^i = (V_C^i, E_C^i)$, $1 \leq i \leq n$,
be the connected components of $S'_E = (V'_E, E_E)$.
If $\exists i; 1 \leq i \leq n: \exists v \in V_C^i: g(v) = f_2^{-1}(f_1(v)) \in V_C^i$,
return true, otherwise return false.
Alternative: Minimum Number of Vertex Images
Let $G = (V_G, E_G, \ell_G)$ and $S = (V_S, E_S, \ell_S)$ be two labeled graphs
and let $F$ be the set of all subgraph isomorphisms of $S$ to $G$.
Then the minimum number of vertex images support
(or MNI-support for short) of $S$ w.r.t. $G$ is defined as
$$\min_{v \in V_S} |\{u \in V_G \mid \exists f \in F: f(v) = u\}|.$$
[Bringmann and Nijssen 2007]
Advantage:
Can be computed much more efficiently than MIS- or HO-support.
(No need to determine a maximum independent vertex set.)
Disadvantage:
Often counts both of two equivalent occurrences.
(Fairly unintuitive behavior.)
Example: B A A B
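The definition of MNI-support translates almost directly into code. The following Python sketch assumes that the set of subgraph isomorphisms has already been enumerated and is given as a list of dictionaries, each mapping pattern vertices to graph vertices. The function name `mni_support`, the vertex names and the small example are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def mni_support(pattern_vertices: Iterable[Hashable],
                occurrences: List[Dict[Hashable, Hashable]]) -> int:
    """Minimum, over all pattern vertices, of the number of distinct images."""
    if not occurrences:
        return 0
    return min(len({f[v] for f in occurrences}) for v in pattern_vertices)


# Hypothetical example: pattern B-A-B (vertices l, m, r) in a graph B0-A1-B2.
# The two occurrences are mirror images of each other.
occurrences = [
    {"l": 0, "m": 1, "r": 2},
    {"l": 2, "m": 1, "r": 0},
]
print(mni_support(["l", "m", "r"], occurrences))
```

In this example the middle vertex has only one image, so the MNI-support is 1 even though there are two occurrences. No independent vertex set has to be determined, which is where the efficiency advantage comes from.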
Experimental Results
[Figure: number of subgraphs found with MNI-support, HO-support and MIS-support; two panels: Index Chemicus 1993 and Tic-Tac-Toe win; axis labels: number of subgraphs, # graphs]
Summary
Defining subgraph support in the single graph setting:
maximum independent vertex set of an overlap graph of the occurrences.
MIS-support is anti-monotone.
Proof: look at induced independent vertex sets for substructures.
Definition of harmful overlap support of a subgraph:
existence of equivalent ancestor occurrences.
Simple procedure for testing whether two occurrences overlap harmfully.
Harmful overlap support is anti-monotone.
Restriction to connected substructures and optimizations.
Alternative: minimum number of vertex images.
Software: http://www.borgelt.net/moss.html
Frequent Sequence Mining
Frequent Sequence Mining
Directed versus undirected sequences
Temporal sequences, for example, are always directed.
DNA sequences can be undirected (both directions can be relevant).
Multiple sequences versus a single sequence
Multiple sequences: purchases with rebate cards, web server access protocols.
Single sequence: alarms in telecommunication networks.
(Time) points versus time intervals
Points: DNA sequences, alarms in telecommunication networks.
Intervals: weather data, movement analysis (sports medicine).
Further distinction: one object per (time) point versus multiple objects.
Frequent Sequence Mining
Consecutive subsequences versus subsequences with gaps
a c b a b c b a always counts as a subsequence abc (consecutive occurrence).
a c b a b c b c may not always count as a subsequence abc (occurrence with gaps).
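The difference between the two notions of containment can be made concrete with two small Python functions. This is only a sketch of existence tests (it does not count occurrences), and the function names are chosen for illustration.

```python
from typing import Sequence


def occurs_consecutively(seq: Sequence, pattern: Sequence) -> bool:
    """True if `pattern` occurs in `seq` without gaps."""
    k = len(pattern)
    return any(list(seq[i:i + k]) == list(pattern)
               for i in range(len(seq) - k + 1))


def occurs_with_gaps(seq: Sequence, pattern: Sequence) -> bool:
    """True if the items of `pattern` occur in `seq` in order, gaps allowed."""
    rest = iter(seq)
    # `item in rest` advances the iterator, so the order is respected.
    return all(item in rest for item in pattern)


print(occurs_consecutively("acbabcba", "abc"))  # consecutive occurrence exists
print(occurs_consecutively("acbacb", "abc"))    # only occurrences with gaps
print(occurs_with_gaps("acbacb", "abc"))
```

Every consecutive occurrence is also an occurrence with gaps, but not vice versa, so the choice between the two notions changes the support of a pattern.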
Existence of an occurrence versus counting occurrences
Combinatorial counting (all occurrences)
Maximal number of disjoint occurrences
Temporal support (number of time window positions)
Minimum occurrence (smallest interval)
Relation between the objects in a sequence
items: only precede and succeed
labeled time points: $t_1 < t_2$, $t_1 = t_2$, and $t_1 > t_2$
labeled time intervals: relations like before, starts, overlaps, contains etc.
Frequent Sequence Mining
Directed sequences are easier to handle:
The (sub)sequence itself can be used as a code word.
As there is only one possible code word per sequence (only one direction),
this code word is necessarily canonical.
Consecutive subsequences are easier to handle:
There are fewer occurrences of a given subsequence.
For each occurrence there is exactly one possible extension.
This allows for specialized data structures (similar to an FP-tree).
Item sequences are easiest to handle:
There are only two possible relations and thus patterns are simple.
Other sequences are handled with state machines for containment tests.
A Canonical Form for Undirected Sequences
If the sequences to mine are not directed, a subsequence cannot be used
as its own code word, because it does not have the prefix property.
The reason is that an undirected sequence can be read forward or backward,
which gives rise to two possible code words, the smaller (or the larger) of which
may then be defined as the canonical code word.
Examples (that the prefix property is violated):
Assume that the item order is a < b < c . . . and
that the lexicographically smaller code word is the canonical one.
The sequence bab, which is canonical, has the prefix ba,
but the canonical form of the sequence ba is rather ab.
The sequence cabd, which is canonical, has the prefix cab,
but the canonical form of the sequence cab is rather bac.
As a consequence, we have to look for a different way of forming code words
(at least if we want the code to have the prefix property).
A Canonical Form for Undirected Sequences
A (simple) possibility to form canonical code words having the prefix property
is to handle (sub)sequences of even and odd length separately.
In addition, forming the code word is started in the middle.
Even length: The sequence $a_m a_{m-1} \ldots a_2 a_1 b_1 b_2 \ldots b_{m-1} b_m$
is described by the code word $a_1 b_1 a_2 b_2 \ldots a_{m-1} b_{m-1} a_m b_m$
or by the code word $b_1 a_1 b_2 a_2 \ldots b_{m-1} a_{m-1} b_m a_m$.
The lexicographically smaller of the two code words is the canonical code word.
Such sequences are extended by adding a pair $a_{m+1} b_{m+1}$ or $b_{m+1} a_{m+1}$,
that is, by adding one item at the front and one item at the end.
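As a minimal sketch of the even-length case, the following Python functions build both code words by reading the sequence from the middle outwards and select the lexicographically smaller one. Only even lengths are covered, and the function names are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def code_words_even(seq: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return the two code words of an even-length undirected sequence."""
    if len(seq) % 2 != 0:
        raise ValueError("only even-length sequences are handled here")
    m = len(seq) // 2
    a = list(seq[:m])[::-1]  # a_1, ..., a_m (from the middle towards the front)
    b = list(seq[m:])        # b_1, ..., b_m (from the middle towards the end)
    w = [x for pair in zip(a, b) for x in pair]  # a_1 b_1 a_2 b_2 ...
    v = [x for pair in zip(b, a) for x in pair]  # b_1 a_1 b_2 a_2 ...
    return w, v


def canonical_code_word_even(seq: Sequence[str]) -> List[str]:
    """The lexicographically smaller code word is the canonical one."""
    return min(code_words_even(seq))


print(canonical_code_word_even("cabd"))
print(canonical_code_word_even("dbac"))  # same sequence read backward
```

Both reading directions of the same undirected sequence yield the same canonical code word, and dropping the last pair of the code word gives the code word of the sequence without its first and last item.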
A Canonical Form for Undirected Sequences
The code words defined in this way clearly have the prefix property:
Suppose the prefix property would not hold.
Then there exists, without loss of generality, a canonical code word
$w_m = a_1 b_1 a_2 b_2 \ldots a_{m-1} b_{m-1} a_m b_m$,
the prefix $w_{m-1}$ of which is not canonical, where
$w_{m-1} = a_1 b_1 a_2 b_2 \ldots a_{m-1} b_{m-1}$.
As a consequence, we have $w_m < v_m$, where
$v_m = b_1 a_1 b_2 a_2 \ldots b_{m-1} a_{m-1} b_m a_m$,
and $v_{m-1} < w_{m-1}$, where
$v_{m-1} = b_1 a_1 b_2 a_2 \ldots b_{m-1} a_{m-1}$.
However, $v_{m-1} < w_{m-1}$ implies $v_m < w_m$,
because $v_{m-1}$ is a prefix of $v_m$ and $w_{m-1}$ is a prefix of $w_m$,
but $v_m < w_m$ contradicts $w_m < v_m$.
[...]
$s_m = \bigwedge_{i=1}^{m} (a_i = b_i)$
The symmetry flag can be maintained in constant time with
$s_{m+1} = s_m \wedge (a_{m+1} = b_{m+1})$.
The permissible extensions depend on the symmetry flag:
if $s_m$ = true, it must be $a_{m+1} \leq b_{m+1}$;
if $s_m$ = false, any relation between $a_{m+1}$ and $b_{m+1}$ is acceptable.
This rule guarantees that exactly the canonical extensions are created.
Applying this rule to check a candidate extension takes constant time.
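The extension rule with the symmetry flag can be sketched as follows. The code assumes the even-length code words a_1 b_1 a_2 b_2 ... described for undirected sequences and stores the flag next to the code word; the function name `extend_code_word` and this representation are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def extend_code_word(code: List[str], symmetric: bool,
                     a: str, b: str) -> Optional[Tuple[List[str], bool]]:
    """Append the pair (a, b) if the extension is permissible.

    Returns the extended code word together with the updated symmetry flag,
    or None if the extension would not be canonical.
    """
    if symmetric and a > b:
        return None  # with a symmetric code word the pair must satisfy a <= b
    # The flag stays true only if the new pair is symmetric as well.
    return code + [a, b], symmetric and a == b


print(extend_code_word([], True, "a", "b"))            # permissible, flag becomes False
print(extend_code_word([], True, "b", "a"))            # rejected: a > b while symmetric
print(extend_code_word(["a", "b"], False, "c", "a"))   # any pair is fine once asymmetric
```

Both the check and the flag update look only at the two new items, which is why a candidate extension can be tested in constant time.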
Sequences of Time Intervals
A (labeled or attributed) time interval is a triple $I = (s, e, l)$,
where $s$ is the start time, $e$ is the end time and $l$ is the associated label.
A time interval sequence is a set of (labeled) time intervals,
of which we assume that they are maximal in the sense that for two intervals
$I_1 = (s_1, e_1, l_1)$ and $I_2 = (s_2, e_2, l_2)$ with $l_1 = l_2$ we have either $e_1 < s_2$ or $e_2 < s_1$.
[...]
Due to the assumption made above, at least the third option must hold.
Allen's Interval Relations
Due to their temporal extension, time intervals allow for different relations.
A commonly used set of relations between time intervals are
Allen's interval relations. [Allen 1983]
A before B            B after A
A meets B             B is met by A
A overlaps B          B is overlapped by A
A is finished by B    B finishes A
A contains B          B during A
A is started by B     B starts A
A equals B            B equals A
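The relations listed above can be determined from the start and end times alone. The following Python sketch classifies the relation of an interval A to an interval B; it assumes proper intervals (start strictly before end) given as (start, end) pairs, and the function name `allen_relation` is an illustrative choice.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) with start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation that holds between A and B (read "A <relation> B")."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "is met by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "is started by"
    if e1 == e2:
        return "is finished by" if s1 < s2 else "finishes"
    if s1 < s2:
        return "contains" if e2 < e1 else "overlaps"
    return "during" if e1 < e2 else "is overlapped by"


print(allen_relation((0, 4), (2, 6)))
print(allen_relation((2, 6), (0, 4)))
```

Exactly one of the thirteen relations holds for any pair of proper intervals, and swapping the arguments yields the inverse relation.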
Temporal Interval Patterns
A temporal pattern must specify the relations between all referenced intervals.
This can conveniently be done with a matrix:
[Figure: three intervals A, B and C]
      A    B    C
A     e    o    b
B     io   e    m
C     a    im   e
Such a temporal pattern matrix can also be interpreted as an adjacency matrix
of a graph, which has the interval relationships as edge labels.
Generally, the input interval sequences may be represented as such graphs,
thus mapping the problem to frequent (sub)graph mining.
However, the relationships between time intervals are constrained
(for example, "B after A" and "C after B" imply "C after A").
These constraints can be exploited to obtain a simpler canonical form.
In the canonical form, the intervals are assigned in increasing time order
to the rows and columns of the temporal pattern matrix. [Kempe 2008]
Support of Temporal Patterns
The support of a temporal pattern w.r.t. a single sequence can be defined by:
Combinatorial counting (all occurrences)
Maximal number of disjoint occurrences
Temporal support (number of time window positions)
Minimum occurrence (smallest interval)
However, all of these definitions suffer from the fact that such support
is not anti-monotone or downward closed:
[Figure: an interval A that contains two intervals B]
The support of "A contains B" is 2,
but the support of "A" is only 1.
Nevertheless an exhaustive pattern search can be ensured,
without having to abandon pruning with the Apriori property.
The reason is that with minimum occurrence counting the relationship "contains"
is the only one that can lead to support anomalies like the one shown above.
Weakly Anti-Monotone / Downward Closed
Let $\mathcal{P}$ be a pattern space with a subpattern relationship $<$ and
let $s$ be a function from $\mathcal{P}$ to the real numbers, $s: \mathcal{P} \to \mathbb{R}$.
For a pattern $S \in \mathcal{P}$ let $P(S) = \{R \mid R < S \wedge \neg\exists Q: R < Q < S\}$
be the set of all parent patterns of $S$.
The function $s$ on the pattern space $\mathcal{P}$ is called
strongly anti-monotone or strongly downward closed iff
$\forall S \in \mathcal{P}: \forall R \in P(S): s(R) \geq s(S)$,
weakly anti-monotone or weakly downward closed iff
$\forall S \in \mathcal{P}: \exists R \in P(S): s(R) \geq s(S)$.
The support of temporal interval patterns is weakly anti-monotone
(at least) if it is computed from minimal occurrences.
If temporal interval patterns are extended backwards in time,
the Apriori property can safely be used for pruning. [Kempe 2008]
Summary Frequent Sequence Mining
Several different types of frequent sequence mining can be distinguished:
single and multiple sequences, directed and undirected sequences,
items versus (labeled) intervals, single and multiple objects per position,
relations between the objects, definition of pattern support.
All common types of frequent sequence mining possess canonical forms
for which canonical extension rules can be found.
With these rules it is possible to check in constant time
whether a possible extension leads to a result in canonical form.
A weakly anti-monotone support function can be enough
to allow pruning with the Apriori property.
However, in this case it must be made sure that the canonical form
assigns an appropriate parent pattern in order to ensure an exhaustive search.
Frequent Tree Mining
Frequent Tree Mining: Basic Notions
Reminder: A path is a sequence of edges connecting two vertices in a graph.
Reminder: A (labeled) graph G is called a tree iff for any pair of vertices in G
there exists exactly one path connecting them in G.
A tree is called rooted if it has a distinguished vertex, called the root.
Rooted trees are often seen as directed: all edges are directed away from the root.
If a tree is not rooted (that is, if there is no distinguished vertex), it is called free.
A tree is called ordered if for each vertex
there exists an order on its incident edges.
If the tree is rooted, the order may be defined on the outgoing edges only.
Trees of whichever type are much easier to handle than frequent (sub)graphs,
because it is mainly the cycles (which may be present in a general graph)
that make it difficult to construct the canonical code word.
Frequent Tree Mining: Basic Notions
Reminder: A path is a sequence of edges connecting two vertices in a graph.
The length of a path is the number of its edges.
The distance between two vertices of a graph G
is the length of a shortest path connecting them.
Note that in a tree there is exactly one path connecting two vertices,
which is then necessarily also the shortest path.
In a rooted tree the depth of a vertex is its distance from the root vertex.
The root vertex itself has depth 0.
The depth of a tree is the depth of its deepest vertex.
The diameter of a graph is the largest distance between any two vertices.
A diameter path of a graph is a path having a length
that is the diameter of the graph.
Rooted Ordered Trees
For rooted ordered trees code words derived from spanning trees
can directly be used: the spanning tree is simply the tree itself.
However, the root of the spanning tree is fixed:
it is simply the root of the rooted ordered tree.
In addition, the order of the children of each vertex is fixed:
it is simply the given order of the outgoing edges.
As a consequence, once a traversal order for the spanning tree is fixed
(for example, a depth-first or a breadth-first traversal), there is only
one possible code word, which is necessarily the canonical code word.
Therefore rightmost path extension (for a depth-first traversal)
and maximum source extension (for a breadth-first traversal)
obviously provide a canonical extension rule for rooted ordered trees.
There is no need for an explicit test for canonical form.
Rooted Unordered Trees
Rooted unordered trees can most conveniently be described by
so-called preorder code words.
Preorder code words are closely related to spanning trees that are constructed
with a depth-first search, because a preorder traversal is a depth-first traversal.
However, their special form makes it easier to compare code words for subtrees.
The preorder code words we consider here have the general form
$a \, (d \; b \; a)^m$,
where m is the number of edges of the tree, m = n - 1,
n is the number of vertices of the tree,
a is a vertex attribute / label,
b is an edge attribute / label, and
d is the depth of the source vertex of an edge.
The source vertex of an edge is the vertex that is closer to the root (smaller depth).
The edges are listed in the order in which they are visited in a preorder traversal.
Rooted Unordered Trees
[Figure: rooted unordered tree with root a, described by the code word given below]
For simplicity we omit edge labels.
In rooted trees edge labels can always be combined with the destination
vertex label (that is, the label of the vertex that is farther away from the root).
The above rooted unordered tree can be described by the code word
a 0b 1d 1b 2b 2c 1a 0b 1a 1b
Note that the code word consists of substrings that describe the subtrees
(marked here by brackets):
a 0 [b 1 [d] 1 [b 2 [b] 2 [c]] 1 [a]] 0 [b 1 [a] 1 [b]]
The subtree strings are separated by a number stating the depth of the parent.
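A preorder code word without edge labels can be generated by a simple recursive traversal. The Python sketch below assumes that a tree is represented as a nested pair (label, list of child trees); this representation and the function name `preorder_code` are choices made for illustration. The example tree is the one described by the code word a 0b 1d 1b 2b 2c 1a 0b 1a 1b.

```python
from typing import List, Tuple, Union

Tree = Tuple[str, list]  # (vertex label, list of child trees)


def preorder_code(tree: Tree, depth: int = 0) -> List[Union[str, int]]:
    """Return the code word as a list: label, then (source depth, subtree code)*."""
    label, children = tree
    code: List[Union[str, int]] = [label]
    for child in children:
        code.append(depth)  # depth of the source vertex of the edge to the child
        code.extend(preorder_code(child, depth + 1))
    return code


tree = ("a", [
    ("b", [("d", []), ("b", [("b", []), ("c", [])]), ("a", [])]),
    ("b", [("a", []), ("b", [])]),
])

print(" ".join(str(x) for x in preorder_code(tree)))
```

Each recursive call contributes exactly the substring that describes one subtree, which is the property that makes exchanging sibling subtrees a simple exchange of substrings.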
Rooted Unordered Trees
Exchanging code words on the same level exchanges branches/subtrees.
a 0 [b 1 [d] 1 [b 2 [b] 2 [c]] 1 [a]] 0 [b 1 [a] 1 [b]]
For example, in this code word the children of the root are exchanged:
a 0 [b 1 [a] 1 [b]] 0 [b 1 [d] 1 [b 2 [b] 2 [c]] 1 [a]]
[Figure: the original tree and the tree in which the two subtrees of the root are exchanged]
Rooted Unordered Trees
All possible preorder code words can be obtained from one preorder code word
by exchanging substrings of the code word that describe sibling subtrees.
(This shows the advantage of using the vertex depth rather than the vertex index:
no renumbering of the vertices is necessary in such an exchange.)
By defining an (arbitrary, but fixed) order on the vertex labels
and using the standard order of the integer numbers,
the code words can be compared lexicographically.
(Note that vertex labels are always compared to vertex labels
and integers to integers, because these two elements alternate.)
Contrary to the common definition used in all earlier cases, we define
the lexicographically greatest code word as the canonical code word.
The canonical code word for the tree on the previous slides is
a 0b 1d 1b 2c 2b 1a 0b 1b 1a
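One straightforward way to obtain the lexicographically greatest preorder code word is to sort the subtrees of every vertex in descending order of their own code words, from the leaves upwards. The Python sketch below does this for trees given as nested pairs (label, list of child trees); the representation and the function names are illustrative assumptions, and the approach re-sorts a complete tree rather than using an incremental extension rule.

```python
from typing import List, Tuple, Union

Tree = Tuple[str, list]  # (vertex label, list of child trees)


def preorder_code(tree: Tree, depth: int = 0) -> List[Union[str, int]]:
    """Preorder code word: label, then (source depth, subtree code)*."""
    label, children = tree
    code: List[Union[str, int]] = [label]
    for child in children:
        code.append(depth)
        code.extend(preorder_code(child, depth + 1))
    return code


def canonical_tree(tree: Tree) -> Tree:
    """Sort the subtrees of every vertex descendingly w.r.t. their code words."""
    label, children = tree
    ordered = [canonical_tree(child) for child in children]
    # Labels are compared with labels and depths with depths, because the two
    # kinds of elements alternate in every code word.
    ordered.sort(key=lambda t: tuple(preorder_code(t)), reverse=True)
    return (label, ordered)


tree = ("a", [
    ("b", [("d", []), ("b", [("b", []), ("c", [])]), ("a", [])]),
    ("b", [("a", []), ("b", [])]),
])

print(" ".join(str(x) for x in preorder_code(canonical_tree(tree))))
```

For the example tree the result is the canonical code word a 0b 1d 1b 2c 2b 1a 0b 1b 1a stated above.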
Rooted Unordered Trees
In order to understand the core problem of obtaining an extension rule
for rooted unordered trees, consider the following tree:
[Figure: rooted unordered tree described by the code word below; the last vertex labeled c (the grey vertex) is the one to be extended]
The canonical code word for this tree results from the shown order of the subtrees:
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d
Any exchange of subtrees leads to a lexicographically smaller code word.
How can this tree be extended by adding a child to the grey vertex?
That is, what label may the child vertex have if the result is to be canonical?
Rooted Unordered Trees
[Figure: the example tree with the grey vertex to be extended]
In the first place, we observe that the child must not have a label succeeding "d",
because otherwise exchanging the new vertex with the other child
of the grey vertex would yield a lexicographically larger code word:
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2e
<
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2e 2d
Generally, the children of a vertex must be sorted descendingly w.r.t. their labels.
Rooted Unordered Trees
[Figure: the example tree with the grey vertex to be extended]
Secondly, we observe that the child must not have a label succeeding "c",
because otherwise exchanging the subtrees of the parent of the grey vertex
would yield a lexicographically larger code word:
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2d
<
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2d 1c 2d 2c
The subtrees of any vertex must be sorted descendingly w.r.t. their code words.
Rooted Unordered Trees
[Figure: the example tree with the grey vertex to be extended]
Thirdly, we observe that the child must not have a label succeeding "b",
because otherwise exchanging the subtrees of the root vertex of the tree
would yield a lexicographically larger code word:
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2c
<
a 0b 1c 2d 2c 1c 2d 2c 0b 1c 2d 2c 1c 2d 2b
The subtrees of any vertex must be sorted descendingly w.r.t. their code words.
Rooted Unordered Trees
That a possible exchange of subtrees at vertices closer to the root
never yields looser restrictions is no accident.
Suppose a rooted tree is described by a canonical code word
$a \; 0 \; b \; 1 \; w_1 \; 1 \; w_2 \; 0 \; b \; 1 \; w_3 \; 1 \; w_4$.
Then we know the following relationships between subtree code words:
$w_1 \geq w_2$ and $w_3 \geq w_4$, because otherwise an exchange of subtrees at the
nodes labeled with b would lead to a lexicographically larger code word.
$w_1 \geq w_3$, because otherwise an exchange of subtrees at the node labeled a
would lead to a lexicographically larger code word.
Only if $w_1 = w_3$, the code words $w_1$ and $w_3$ do not already determine the order
of the subtrees of the vertex labeled with a. In this case we have $w_2 \geq w_4$.
[...]
If w does not have such a suffix, the extended code word is always canonical.
With this extension rule no subsequent canonical form test is needed.
Rooted Unordered Trees
The discussed extension rule is very efficient:
Comparing the elements of the extension takes constant time
(at most one integer and one label need to be compared).
Knowledge of the strings $w_3$ for all possible values of x ($0 \leq x < d$)
can be maintained in constant time:
It suffices to record the starting points of the substrings
that describe the rightmost subtree on each tree level.
At most one of these starting points can change with an extension.
Knowledge of the value of y and the two starting points of the string $w_1$
in w can be maintained in constant time:
As long as no two sibling vertices carry the same label, it is y = d.
If a sibling with the same label is added, y is set to the depth of the parent.
$w_1 = a$ occurs at the position of the $w_3$ for y and at the extension vertex label.
If a future extension differs from $w_2$, it is y = d, otherwise $w_1$ is extended.
Christian Borgelt Frequent Pattern Mining 510
Free Trees
Free trees can be handled by combining the ideas of
how to handle sequences and rooted unordered trees.
Similar to sequences, free trees of even and odd diameter are treated separately.
General ideas for a canonical form for free trees:
Even Diameter:
The vertex in the middle of a diameter path is uniquely determined.
This vertex can be used as the root of a rooted tree.
Odd Diameter:
The edge in the middle of a diameter path is uniquely determined.
Removing this edge splits the free tree into two rooted trees.
Procedure for growing free trees:
First grow a diameter path using the canonical form for sequences.
Extend the diameter path into a tree by adding branches.
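The uniquely determined middle vertex (even diameter) or middle edge (odd diameter) can be found by repeatedly stripping leaves. The sketch below uses this standard technique for illustration; it is not taken from the slides, and the function name `tree_center` and the adjacency-dictionary input format are assumptions.

```python
def tree_center(adj):
    """Return the center of a free tree as a sorted list of one or two vertices.

    adj maps each vertex to the set of its neighbours (connected, acyclic graph).
    One vertex  -> even diameter, usable as the root.
    Two vertices -> odd diameter, they are the end points of the middle edge.
    """
    degree = {v: len(nbs) for v, nbs in adj.items()}
    remaining = len(adj)
    leaves = [v for v, deg in degree.items() if deg <= 1]
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            degree[leaf] = 0                 # mark the leaf as removed
            for nb in adj[leaf]:
                if degree[nb] > 0:           # skip vertices removed earlier
                    degree[nb] -= 1
                    if degree[nb] == 1:      # neighbour just became a leaf
                        next_leaves.append(nb)
        leaves = next_leaves
    return sorted(leaves)


def path(vertices):
    """Build the adjacency dictionary of a simple path (helper for the demo)."""
    adj = {v: set() for v in vertices}
    for u, v in zip(vertices, vertices[1:]):
        adj[u].add(v)
        adj[v].add(u)
    return adj


print(tree_center(path("abcde")))   # diameter 4 (even): one middle vertex
print(tree_center(path("abcd")))    # diameter 3 (odd): end points of the middle edge
```

For an even diameter the single returned vertex can serve as the root; for an odd diameter, removing the edge between the two returned vertices yields the two rooted trees.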
Free Trees
Main problem of the procedure for growing free trees:
The initially grown diameter path must remain identifiable.
(Otherwise the prefix property cannot be guaranteed.)
In order to solve this problem it is exploited that in the canonical code word for a
rooted unordered tree code words describing paths from the root to a leaf vertex
are lexicographically increasing if the paths are listed from left to right.
Even Diameter:
The original diameter path represents two paths from the root to two leaves.
To keep them identifiable, these paths must be the lexicographically smallest
and the lexicographically largest path leading to this depth.
Odd Diameter:
The original diameter path represents one path from the root to a leaf
in each of the two rooted trees the free tree is split into.
These paths must be the lexicographically smallest paths leading to this depth.
Summary Frequent Tree Mining
Rooted ordered trees
The root is fixed and the order of the children of each vertex is fixed.
Both rightmost path extension and maximum source extension
obviously provide a canonical extension rule for rooted ordered trees.
Rooted unordered trees
The root is fixed, but there is no order of the children.
There exists a canonical extension rule based on sorted preorder strings
(constant time for finding allowed extensions) [Luccio et al. 2001, 2004].
Free trees
No node is fixed as the root, there is no order on adjacent vertices.
There exists a canonical extension rule based on depth sequences
(constant time for finding allowed extensions) [Nijssen and Kok 2004].
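For the simplest of the three cases, rooted ordered trees, rightmost path extension can be sketched as follows. The sketch assumes a code word given as a list of `(parent_depth, label)` pairs in preorder (root with placeholder depth -1); the function name `rightmost_path_extensions` is hypothetical.

```python
def rightmost_path_extensions(word, labels):
    """Return all code words obtained by one rightmost path extension.

    A new vertex may only be attached as the new rightmost child of a vertex
    on the rightmost path, i.e. to a parent at depth 0 .. d, where d is the
    depth of the rightmost leaf (the last vertex of the preorder code word).
    """
    d = word[-1][0] + 1                      # depth of the rightmost leaf
    return [word + [(x, lab)] for x in range(d + 1) for lab in labels]


# Tree a -> b -> c (a chain): the rightmost leaf c has depth 2.
chain = [(-1, "a"), (0, "b"), (1, "c")]
for ext in rightmost_path_extensions(chain, "ab"):
    print(ext)
```

Because every ordered tree has exactly one preorder code word, each tree is generated exactly once this way and no separate canonical form test is required.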
Summary Frequent Pattern Mining
Possible types of patterns: item sets, sequences, trees, and graphs.
A core ingredient of the search is a canonical form of the type of pattern.
Purpose: ensure that each possible pattern is processed at most once.
(Discard non-canonical code words, process only canonical ones.)
It is desirable that the canonical form possesses the prefix property.
Except for general graphs there exist canonical extension rules.
For general graphs, restricted extensions make it possible to reduce
the number of actual canonical form tests considerably.
Frequent pattern mining algorithms prune with the Apriori property:
$\forall P: \forall S \supset P: \quad s_T(P) < s_{\min} \;\rightarrow\; s_T(S) < s_{\min}.$
That is: No super-pattern of an infrequent pattern is frequent.
Additional filtering is important to single out the relevant patterns.
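For item sets, pruning with the Apriori property can be sketched as a level-wise search. This is a deliberately simple illustration, not one of the optimised implementations; the function names `support` and `frequent_item_sets` are assumptions, and support is counted as the absolute number of containing transactions.

```python
from itertools import combinations


def support(pattern, transactions):
    """Absolute support: number of transactions that contain the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def frequent_item_sets(transactions, s_min):
    """Level-wise search that prunes with the Apriori property."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    # Level 1: frequent single items.
    level = {}
    for i in items:
        s = support(frozenset([i]), transactions)
        if s >= s_min:
            level[frozenset([i])] = s
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Candidates of size k are unions of two frequent sets of size k-1.
        candidates = {p | q for p in level for q in level if len(p | q) == k}
        next_level = {}
        for c in candidates:
            # Apriori pruning: a candidate with an infrequent subset of size
            # k-1 cannot be frequent, so its support is never counted.
            if all(frozenset(sub) in level for sub in combinations(c, k - 1)):
                s = support(c, transactions)
                if s >= s_min:
                    next_level[c] = s
        level = next_level
    return frequent


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = frequent_item_sets(db, s_min=3)
for item_set, s in sorted(result.items(), key=lambda e: (len(e[0]), sorted(e[0]))):
    print(sorted(item_set), s)
```

With this toy database and a minimum support of 3, all single items and all pairs are frequent, while the full set of three items is contained in only two transactions and is therefore not reported.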
Software
Software for frequent pattern mining can be found at
my web site: http://www.borgelt.net/fpm.html
Apriori http://www.borgelt.net/apriori.html
Eclat http://www.borgelt.net/eclat.html
FP-Growth http://www.borgelt.net/fpgrowth.html
RElim http://www.borgelt.net/relim.html
SaM http://www.borgelt.net/sam.html
MoSS http://www.borgelt.net/moss.html
the Frequent Item Set Mining Implementations (FIMI) Repository:
http://fimi.cs.helsinki.fi/
This repository was set up with the contributions to the FIMI workshops in 2003
and 2004, where each submission had to be accompanied by the source code of
an implementation. The web site offers all source code, several data sets, and the
results of the competition.