
The FP-Growth/Apriori Debate

Jeffrey R. Ellis, CSE 300-01, April 11, 2002

Presentation Overview
- FP-Growth Algorithm
  - Refresher, motivation & example
- FP-Growth Complexity vs. Apriori Complexity
  - Saving calculation, or hiding work?
- Real World Application
  - Datasets are not created equal
  - Results of real-world implementation

FP-Growth Algorithm
- Association rule mining has two steps:
  1. Generate frequent itemsets
     - Apriori generates candidate sets; FP-Growth uses specialized data structures (no candidate sets).
  2. Find association rules
     - Outside the scope of both FP-Growth & Apriori.
- Therefore, FP-Growth is a competitor to Apriori.

FP-Growth Example
Transactions (min_support = 3):

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Header table: f:4, c:4, a:3, b:3, m:3, p:3

(Diagram: the FP-tree built from these transactions. From the root {}, one branch runs f:4 – c:3 – a:3 – m:2 – p:2, with b:1 – m:1 splitting off under a:3 and b:1 directly under f:4; a second branch runs c:1 – b:1 – p:1.)

Conditional pattern bases:
c: f:3
a: fc:3
b: fca:1, f:1, c:1
m: fca:2, fcab:1
p: fcam:2, cb:1

FP-Growth Example
Item by item, the conditional pattern base and the conditional FP-tree built from it:

p: pattern base {(fcam:2), (cb:1)} → conditional FP-tree {(c:3)} | p
m: pattern base {(fca:2), (fcab:1)} → conditional FP-tree {(f:3, c:3, a:3)} | m
b: pattern base {(fca:1), (f:1), (c:1)} → conditional FP-tree: empty
a: pattern base {(fc:3)} → conditional FP-tree {(f:3, c:3)} | a
c: pattern base {(f:3)} → conditional FP-tree {(f:3)} | c

FP-Growth Algorithm
FP-CreateTree
Input: DB, min_support
Output: FP-Tree

1. Scan DB & count all frequent items.
2. Create the null root & set it as the current node.
3. For each transaction T:
   - Sort T's items.
   - For each sorted item I:
     - Insert I into the tree as a child of the current node.
     - Connect the new tree node to the header list.

Notes: two passes through DB; tree creation is driven by the number of items in DB; the complexity of CreateTree is O(|DB|).
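Below is a minimal Python sketch of FP-CreateTree as described above. The names FPNode and build_fptree are illustrative (this is not Han & Pei's code), and frequency ties are broken alphabetically here, so equal-count items may be ordered differently than on the example slides.

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: item label, count, parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(db, min_support):
    # Pass 1: scan DB and count all items, keeping only frequent ones.
    counts = defaultdict(int)
    for transaction in db:
        for item in transaction:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)      # null root
    header = defaultdict(list)     # item -> its tree nodes (header list)

    # Pass 2: insert each transaction with items sorted by frequency.
    for transaction in db:
        items = sorted((i for i in transaction if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1   # shared prefix
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)       # connect to header list
            node = node.children[item]
    return root, header, frequent
```

Both passes visit every transaction exactly once, which is where the O(|DB|) bound for CreateTree comes from.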

FP-Growth Example
(Diagram: the transaction table, header table, and FP-tree from the earlier FP-Growth Example slide, repeated for reference.)

FP-Growth Algorithm
FP-Growth (recursive)
Input: FP-Tree, f_is (frequent itemset)
Output: all freq_patterns

if FP-Tree contains a single path P then
  for each combination of nodes in P:
    generate pattern f_is ∪ combination
else
  for each item i in the header:
    generate pattern f_is ∪ i
    construct the pattern's conditional cFP-Tree
    if cFP-Tree ≠ null then
      call FP-Growth(cFP-Tree, pattern)

Notes: the recursive algorithm builds FP-Tree structures and calls FP-Growth on them; it claims no candidate generation.
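A hedged Python sketch of that recursion, reusing build_fptree from the sketch above. For brevity the single-path shortcut is folded into the general branch (recursing over a single-path tree enumerates the same combinations); all names are illustrative.

```python
def prefix_path(node):
    """Climb parent links to collect the path above a node (its prefix)."""
    path, node = [], node.parent
    while node.parent is not None:
        path.append(node.item)
        node = node.parent
    return path

def fp_growth(root, header, min_support, f_is=(), patterns=None):
    if patterns is None:
        patterns = {}
    for item in header:
        support = sum(n.count for n in header[item])
        pattern = tuple(sorted(f_is + (item,)))
        patterns[pattern] = support            # generate pattern f_is + i
        # Conditional pattern base: prefix paths, weighted by node counts.
        cond_db = []
        for node in header[item]:
            cond_db.extend([prefix_path(node)] * node.count)
        croot, cheader, _ = build_fptree(cond_db, min_support)
        if cheader:                            # cFP-Tree is not null
            fp_growth(croot, cheader, min_support, pattern, patterns)
    return patterns
```

Calling root, header, _ = build_fptree(db, 3) and then fp_growth(root, header, 3) on the five-transaction example should reproduce the patterns mined above; note that the patterns dict silently deduplicates the redundant regenerations discussed on later slides.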

Two-Part Algorithm
Part A: the single-path case.

if FP-Tree contains a single path P then
  for each combination of nodes in P:
    generate pattern f_is ∪ combination

(Diagram: a single path A – B – C – D.)

e.g., P = { A B C D }, p = |pattern| = 4
Generated patterns: AD, BD, CD, ABD, ACD, BCD, ABCD
Count: Σ (n = 1 to p−1) C(p−1, n) = 2^(p−1) − 1 = 7 (see the check below)
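A quick check of that count with itertools, using the node names from the example above:

```python
from itertools import combinations

path = ['A', 'B', 'C']        # the p-1 nodes above the path's last item, D
patterns = [''.join(combo) + 'D'
            for n in range(1, len(path) + 1)
            for combo in combinations(path, n)]
print(patterns)               # AD, BD, CD, ABD, ACD, BCD, ABCD
print(len(patterns), 2 ** len(path) - 1)   # 7 7
```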

Two-Part Algorithm
Part B: the multi-path case.

else
  for each item i in the header:
    generate pattern f_is ∪ i
    construct the pattern's conditional cFP-Tree
    if cFP-Tree ≠ null then
      call FP-Growth(cFP-Tree, pattern)

(Diagram: a tree over { A B C D E } with paths A–B–C–D–E, A–C–D–E, and A–D–E.)

e.g., { A B C D E }, for f_is = D:
  i = A: p_base = (ABCD), (ACD), (AD)
  i = B: p_base = (ABCD)
  i = C: p_base = (ABCD), (ACD)
  i = E: p_base = null

Pattern bases are generated, for each header item, by following the header link to every tree node holding that item and climbing from the node up to the tree root (see the sketch below).
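A small sketch of that climb-to-root step on the tree just described. Prefix sharing and counts are omitted for brevity and all names are illustrative; climbing from each node labeled D yields the (ABCD), (ACD), (AD) base shown above.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent

root, header = Node(None, None), {}

def add_path(items):
    """Insert one root-to-leaf path (no prefix sharing, for brevity)."""
    node = root
    for item in items:
        node = Node(item, node)
        header.setdefault(item, []).append(node)

for p in ('ABCDE', 'ACDE', 'ADE'):      # the three paths drawn above
    add_path(p)

def pattern_bases(item):
    """Follow header links, then climb parent links up to the root."""
    bases = []
    for node in header[item]:
        path, cur = [], node
        while cur.parent is not None:
            path.append(cur.item)
            cur = cur.parent
        bases.append(''.join(reversed(path)))
    return bases

print(pattern_bases('D'))   # ['ABCD', 'ACD', 'AD'], as in the example
```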

FP-Growth Complexity
Each path in the tree is at least partially traversed once per item in that tree path (i.e., the depth of the path), for each item in the header. The complexity of searching through all paths is therefore bounded by O(header_count² × tree depth): for example, 6 header items over paths of depth 4 allows on the order of 6² × 4 = 144 node visits. Creation of a new cFP-Tree occurs as well.

Sample Data FP-Tree


(Diagram: a sample FP-tree rooted at null, with branches such as A–B–C–D, C–J–K–L–M, and K–L–M; items like K, L, and M recur on many paths, so most paths must be traversed for several header items.)

Algorithm Results (in English)


- Candidate generation sets are exchanged for FP-Trees.
- You MUST take into account all paths that contain an itemset with a test item.
- You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.
- Trivial examples hide these costs, leading to the belief that the FP-Tree operates more efficiently.

FP-Growth Mining Example


Transactions: 49 × {A}, 44 × {B}, 39 × {F}, 34 × {G}, 30 × {H}, 20 × {I}, plus ABGM, BGIM, FGIM, GHIM.

Header: A, B, F, G, H, I, M with counts A:50, B:45, F:40, G:35, H:30, I:20.

(Diagram: the FP-tree, rooted at null. Each four-item transaction forms a branch ending in M:1, namely A–B–G–M, B–G–I–M, F–G–I–M, and G–H–I–M, while the single-item transactions account for the large first-level counts.)

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = A: pattern = { A M }, support = 1; pattern base → cFP-Tree = null.

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = B: pattern = { B M }, support = 2; pattern base → cFP-Tree (next slide).

FP-Growth Mining Example


(Diagram: the conditional FP-tree for pattern { B M }: header G, B, M; nodes G:2, B:2, M:2.)

Patterns mined: BM, GM, BGM.
freq_itemset = { B M }
All patterns so far: BM, GM, BGM.

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = F: pattern = { F M }, support = 1; pattern base → cFP-Tree = null.

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = G: pattern = { G M }, support = 4; pattern base → cFP-Tree (next slides).

FP-Growth Mining Example


(Diagram: the conditional FP-tree for { G M }, rooted at null, with header I, B, G, M and node counts I:3, B:1, B:1, G:1, G:2, G:1, M:2, M:1, M:1.)

freq_itemset = { G M }, i = I: pattern = { I G M }, support = 3; pattern base → cFP-Tree.

FP-Growth Mining Example


(Diagram: the conditional FP-tree for pattern { I G M }: header I, G, M; nodes I:3, G:3, M:3.)

Patterns mined: IM, GM, IGM.
freq_itemset = { I G M }
All patterns so far: BM, GM, BGM, IM, GM, IGM.

FP-Growth Mining Example


(Diagram: the conditional FP-tree for { G M }, as above.)

freq_itemset = { G M }, i = B: pattern = { B G M }, support = 2; pattern base → cFP-Tree.

FP-Growth Mining Example


(Diagram: the conditional FP-tree for pattern { B G M }: header B, G, M; nodes B:2, G:2, M:2.)

Patterns mined: BM, GM, BGM.
freq_itemset = { B G M }
All patterns so far: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM.

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = H: pattern = { H M }, support = 1; pattern base → cFP-Tree = null.

FP-Growth Mining Example


- Complete pass for { M }; move on to { I }, { H }, etc.
- Final frequent sets:
  L2: { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
  L3: { BGM, GIM, BGM, GIM }
  L4: none

FP-Growth Redundancy v. Apriori Candidate Sets


- FP-Growth (support = 2) generates:
  L2: 10 sets (5 distinct)
  L3: 4 sets (2 distinct)
  Total: 14 sets
- Apriori (support = 2) generates:
  C2: 21 sets
  C3: 2 sets
  Total: 23 sets
- What about support = 1? Apriori: 23 sets; FP-Growth: 28 sets in { M } with i = A, B, F alone!

FP-Growth vs. Apriori


- Apriori visits each transaction when generating new candidate sets; FP-Growth does not.
  - Data structures can be used to reduce the transaction list.
- FP-Growth traces the set of concurrent items; Apriori generates candidate sets (a textbook sketch of that generate-and-test loop follows).
- FP-Growth uses more complicated data structures & mining techniques.
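For contrast, a textbook-style Apriori sketch in Python (a simplification for illustration, not Borgelt's C implementation): each level joins frequent (k−1)-itemsets into k-candidates, prunes them, and then rescans every transaction to count them. That per-level rescan is the repeated work FP-Growth trades for tree construction.

```python
from itertools import combinations

def apriori(db, min_support):
    """Generate-and-test: one full DB scan per candidate level."""
    db = [frozenset(t) for t in db]
    # Level 1: count single items.
    counts = {}
    for t in db:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join: union pairs of frequent (k-1)-sets into k-candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count: visit each transaction again for this level's candidates.
        frequent = {c for c in candidates
                    if sum(1 for t in db if c <= t) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```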

Algorithm Analysis Results


- FP-Growth IS NOT inherently faster than Apriori.
  - Intuitively, it appears to condense data.
  - Its mining scheme requires new work to replace candidate set generation.
  - Recursion obscures the additional effort.
- FP-Growth may run faster than Apriori in some circumstances.
- Complexity analysis alone gives no guarantee of which algorithm is more efficient.

Improvements to FP-Growth
- None currently reported, apart from MLFPT (Multiple Local Frequent Pattern Tree):
  - A new algorithm based on FP-Growth.
  - Distributes FP-Trees among processors.
- No reports of complexity analysis, or of accuracy relative to FP-Growth.

Real World Applications


- Zheng, Kohavi, and Mason, "Real World Performance of Association Rule Algorithms"
  - Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, and MagnumOpus.
  - Tested the implementations against 1 artificial and 3 real data sets.
  - Generated time-based comparisons.

Apriori & FP-Growth


- Apriori
  - Implementation from creator Christian Borgelt (GNU Public License).
  - C implementation; entire dataset loaded into memory.
- FP-Growth
  - Implementation from creators Han & Pei.
  - Version of February 5, 2001.

Other Algorithms
- CHARM
  - Based on the concept of the closed itemset, e.g., { A, B, C } standing in for ABC, AC, BC, ACB, AB, CB, etc. (a small illustration of closed itemsets follows this list).
- CLOSET
  - Han & Pei's implementation of closed-itemset mining.
- MagnumOpus
  - Generates rules directly through a search-and-prune technique.
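A brief, hypothetical illustration of the closed-itemset idea behind CHARM and CLOSET: an itemset is closed when no proper superset has the same support, so the closed sets summarize all frequent sets without losing any support counts. The tiny database here is invented for the demo.

```python
from itertools import combinations

db = [{'A', 'B', 'C'}, {'A', 'B', 'C'}, {'A', 'B'}, {'B', 'C'}]

def support(itemset):
    return sum(1 for t in db if itemset <= t)

items = sorted(set().union(*db))
itemsets = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(frozenset(c)) > 0]

# Closed: no proper superset shares the itemset's support.
closed = [s for s in itemsets
          if not any(s < t and support(s) == support(t) for t in itemsets)]
for s in sorted(closed, key=sorted):
    print(sorted(s), support(s))
# ['A', 'B'] 3 / ['A', 'B', 'C'] 2 / ['B'] 4 / ['B', 'C'] 3
```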

Datasets
- IBM-Artificial
  - Generated at IBM Almaden (T10I4D100K).
  - Often used in association rule mining studies.
- BMS-POS
  - Years of point-of-sale data from a retailer.
- BMS-WebView-1 & BMS-WebView-2
  - Months of clickstream traffic from e-commerce web sites.

Dataset Characteristics

Experimental Considerations
- Hardware specifications
  - Dual 550 MHz Pentium III Xeon processors.
  - 1 GB memory.
- Support ∈ { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
- Confidence = 0%
- No other applications running (the second processor handles system processes).

(Charts: runtime vs. support for IBM-Artificial, BMS-POS, BMS-WebView-1, and BMS-WebView-2.)

Study Results Real Data


- At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth.
- At support < 0.20%, Apriori completes whenever FP-Growth completes.
  - Exception: BMS-WebView-2 @ 0.01%.
- When 2 million rules are generated, Apriori finishes in 10 minutes or less.
- Proposed: the bottleneck is NOT the rule algorithm, but rule analysis.

Real Data Results


BMS-POS:
Support | Apriori | FP-Growth | Rules
0.01    | 186m    | 120m      | 214,300,568
0.04    | 16m 9s  | 10m 41s   | 5,061,105
0.06    | 8m 35s  | 6m 7s     | 1,837,824
0.10    | 3m 58s  | 3m 12s    | 530,353
0.20    | 1m 14s  | 1m 35s    | 103,449

BMS-WebView-1:
Support | Apriori | FP-Growth | Rules
0.01    | Failed  | Failed    | Failed
0.04    | Failed  | Failed    | Failed
0.06    | 1m 50s  | 52s       | 3,011,836
0.10    | 1.2s    | 1.2s      | 10,360
0.20    | 0.4s    | 0.7s      | 1,516

BMS-WebView-2:
Support | Apriori | FP-Growth | Rules
0.01    | Failed  | 13m 12s   | Failed
0.04    | 58s     | 29s       | 1,096,720
0.06    | 28s     | 16s       | 510,233
0.10    | 9.1s    | 5.9s      | 119,335
0.20    | 2.4s    | 2.3s      | 12,665

Study Results Artificial Data


- At support < 0.40%, FP-Growth performs MUCH faster than Apriori.
- At support ≥ 0.40%, FP-Growth and Apriori are comparable.

IBM-Artificial (T10I4D100K):
Support | Apriori | FP-Growth | Rules
0.01    | 4m 4s   | 20s       | 1,376,684
0.04    | 1m 1s   | 9.2s      | 56,962
0.06    | 44s     | 8.2s      | 41,215
0.10    | 34s     | 7.1s      | 26,962
0.20    | 20s     | 5.8s      | 13,151
0.40    | 5.7s    | 4.3s      | 1,997

Real-World Study Conclusions


- FP-Growth (and the other non-Apriori algorithms) perform better on artificial data.
- On all data sets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets.
- FP-Growth may be suitable when low support, a large result count, and fast generation are needed.
- Future research may best be directed toward analyzing association rules.

Research Conclusions
- FP-Growth does not have better complexity than Apriori.
  - Common sense indicates it will run faster.
- FP-Growth does not always have a better running time than Apriori.
  - Support and dataset appear more influential.
- FP-Trees are very complex structures (Apriori's are simple).
- The location of the data (memory vs. disk) is a non-factor in comparing the algorithms.

To Use or Not To Use?


- Question: should FP-Growth be used in favor of Apriori? Considerations:
  - Difficulty to code.
  - High performance in extreme cases.
  - Personal preference.
- More relevant questions:
  - What kind of data is it?
  - What kind of results do I want?
  - How will I analyze the resulting rules?

References

Han, Jiawei, Pei, Jian, and Yin, Yiwen. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.

Orlando, Salvatore. High Performance Mining of Short and Long Patterns. 2001.

Pei, Jian, Han, Jiawei, and Mao, Runying. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.

Webb, Geoffrey I. Efficient Search for Association Rules. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.

Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul. Fast Parallel Association Rule Mining Without Candidacy Generation. In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM 2001), San Jose, CA, USA, November 29 - December 2, 2001.

Zaki, Mohammed J. Generating Non-Redundant Association Rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.

Zheng, Zijian, Kohavi, Ron, and Mason, Llew. Real World Performance of Association Rule Algorithms. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.
