
The FP-Growth/Apriori Debate

Jeffrey R. Ellis, CSE 300-01, April 11, 2002

Presentation Overview
- FP-Growth Algorithm
  - Refresher, motivation & example
- FP-Growth Complexity vs. Apriori Complexity
  - Saving calculation, or hiding work?
- Real World Application
  - Datasets are not created equal
  - Results of real-world implementation

FP-Growth Algorithm
- Association rule mining has two steps:
  1. Generate frequent itemsets
     - Apriori generates candidate sets; FP-Growth uses specialized data structures (no candidate sets).
  2. Find association rules
     - Outside the scope of both FP-Growth & Apriori.
- Therefore, FP-Growth is a competitor to Apriori.

FP-Growth Example
Transactions (min_support = 3):

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Header table: f:4, c:4, a:3, b:3, m:3, p:3

(Diagram: the FP-tree built from these transactions. From the root {}, one branch runs f:4 – c:3 – a:3 – m:2 – p:2, with b:1 – m:1 splitting off under a:3 and b:1 directly under f:4; a second branch runs c:1 – b:1 – p:1.)

Conditional pattern bases:
c: f:3
a: fc:3
b: fca:1, f:1, c:1
m: fca:2, fcab:1
p: fcam:2, cb:1

FP-Growth Example
Item by item, the conditional pattern base and the conditional FP-tree built from it:

p: pattern base {(fcam:2), (cb:1)} → conditional FP-tree {(c:3)} | p
m: pattern base {(fca:2), (fcab:1)} → conditional FP-tree {(f:3, c:3, a:3)} | m
b: pattern base {(fca:1), (f:1), (c:1)} → conditional FP-tree: empty
a: pattern base {(fc:3)} → conditional FP-tree {(f:3, c:3)} | a
c: pattern base {(f:3)} → conditional FP-tree {(f:3)} | c

FP-Growth Algorithm
FP-CreateTree
Input: DB, min_support
Output: FP-Tree

1. Scan DB & count all frequent items.
2. Create the null root & set it as the current node.
3. For each transaction T:
   - Sort T's items.
   - For each sorted item I:
     - Insert I into the tree as a child of the current node.
     - Connect the new tree node to the header list.

Notes: two passes through DB; tree creation is driven by the number of items in DB; the complexity of CreateTree is O(|DB|).
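Below is a minimal Python sketch of FP-CreateTree as described above. The names FPNode and build_fptree are illustrative (this is not Han & Pei's code), and frequency ties are broken alphabetically here, so equal-count items may be ordered differently than on the example slides.

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: item label, count, parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(db, min_support):
    # Pass 1: scan DB and count all items, keeping only frequent ones.
    counts = defaultdict(int)
    for transaction in db:
        for item in transaction:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)      # null root
    header = defaultdict(list)     # item -> its tree nodes (header list)

    # Pass 2: insert each transaction with items sorted by frequency.
    for transaction in db:
        items = sorted((i for i in transaction if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1   # shared prefix
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)       # connect to header list
            node = node.children[item]
    return root, header, frequent
```

Both passes visit every transaction exactly once, which is where the O(|DB|) bound for CreateTree comes from.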

FP-Growth Example
(Diagram: the transaction table, header table, and FP-tree from the earlier FP-Growth Example slide, repeated for reference.)

FP-Growth Algorithm
FP-Growth (recursive)
Input: FP-Tree, f_is (frequent itemset)
Output: all freq_patterns

if FP-Tree contains a single path P then
  for each combination of nodes in P:
    generate pattern f_is ∪ combination
else
  for each item i in the header:
    generate pattern f_is ∪ i
    construct the pattern's conditional cFP-Tree
    if cFP-Tree ≠ null then
      call FP-Growth(cFP-Tree, pattern)

Notes: the recursive algorithm builds FP-Tree structures and calls FP-Growth on them; it claims no candidate generation.
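A hedged Python sketch of that recursion, reusing build_fptree from the sketch above. For brevity the single-path shortcut is folded into the general branch (recursing over a single-path tree enumerates the same combinations); all names are illustrative.

```python
def prefix_path(node):
    """Climb parent links to collect the path above a node (its prefix)."""
    path, node = [], node.parent
    while node.parent is not None:
        path.append(node.item)
        node = node.parent
    return path

def fp_growth(root, header, min_support, f_is=(), patterns=None):
    if patterns is None:
        patterns = {}
    for item in header:
        support = sum(n.count for n in header[item])
        pattern = tuple(sorted(f_is + (item,)))
        patterns[pattern] = support            # generate pattern f_is + i
        # Conditional pattern base: prefix paths, weighted by node counts.
        cond_db = []
        for node in header[item]:
            cond_db.extend([prefix_path(node)] * node.count)
        croot, cheader, _ = build_fptree(cond_db, min_support)
        if cheader:                            # cFP-Tree is not null
            fp_growth(croot, cheader, min_support, pattern, patterns)
    return patterns
```

Calling root, header, _ = build_fptree(db, 3) and then fp_growth(root, header, 3) on the five-transaction example should reproduce the patterns mined above; note that the patterns dict silently deduplicates the redundant regenerations discussed on later slides.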

Two-Part Algorithm
Part A: the single-path case.

if FP-Tree contains a single path P then
  for each combination of nodes in P:
    generate pattern f_is ∪ combination

(Diagram: a single path A – B – C – D.)

e.g., P = { A B C D }, p = |pattern| = 4
Generated patterns: AD, BD, CD, ABD, ACD, BCD, ABCD
Count: Σ (n = 1 to p−1) C(p−1, n) = 2^(p−1) − 1 = 7 (see the check below)
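A quick check of that count with itertools, using the node names from the example above:

```python
from itertools import combinations

path = ['A', 'B', 'C']        # the p-1 nodes above the path's last item, D
patterns = [''.join(combo) + 'D'
            for n in range(1, len(path) + 1)
            for combo in combinations(path, n)]
print(patterns)               # AD, BD, CD, ABD, ACD, BCD, ABCD
print(len(patterns), 2 ** len(path) - 1)   # 7 7
```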

Two-Part Algorithm
Part B: the multi-path case.

else
  for each item i in the header:
    generate pattern f_is ∪ i
    construct the pattern's conditional cFP-Tree
    if cFP-Tree ≠ null then
      call FP-Growth(cFP-Tree, pattern)

(Diagram: a tree over { A B C D E } with paths A–B–C–D–E, A–C–D–E, and A–D–E.)

e.g., { A B C D E }, for f_is = D:
  i = A: p_base = (ABCD), (ACD), (AD)
  i = B: p_base = (ABCD)
  i = C: p_base = (ABCD), (ACD)
  i = E: p_base = null

Pattern bases are generated, for each header item, by following the header link to every tree node holding that item and climbing from the node up to the tree root (see the sketch below).
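A small sketch of that climb-to-root step on the tree just described. Prefix sharing and counts are omitted for brevity and all names are illustrative; climbing from each node labeled D yields the (ABCD), (ACD), (AD) base shown above.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent

root, header = Node(None, None), {}

def add_path(items):
    """Insert one root-to-leaf path (no prefix sharing, for brevity)."""
    node = root
    for item in items:
        node = Node(item, node)
        header.setdefault(item, []).append(node)

for p in ('ABCDE', 'ACDE', 'ADE'):      # the three paths drawn above
    add_path(p)

def pattern_bases(item):
    """Follow header links, then climb parent links up to the root."""
    bases = []
    for node in header[item]:
        path, cur = [], node
        while cur.parent is not None:
            path.append(cur.item)
            cur = cur.parent
        bases.append(''.join(reversed(path)))
    return bases

print(pattern_bases('D'))   # ['ABCD', 'ACD', 'AD'], as in the example
```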

FP-Growth Complexity
Each path in the tree is at least partially traversed once per item in that tree path (i.e., the depth of the path), for each item in the header. The complexity of searching through all paths is therefore bounded by O(header_count² × tree depth): for example, 6 header items over paths of depth 4 allows on the order of 6² × 4 = 144 node visits. Creation of a new cFP-Tree occurs as well.

Sample Data FP-Tree


(Diagram: a sample FP-tree rooted at null, with branches such as A–B–C–D, C–J–K–L–M, and K–L–M; items like K, L, and M recur on many paths, so most paths must be traversed for several header items.)

Algorithm Results (in English)


- Candidate generation sets are exchanged for FP-Trees.
- You MUST take into account all paths that contain an itemset with a test item.
- You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.
- Trivial examples hide these costs, leading to the belief that the FP-Tree operates more efficiently.

FP-Growth Mining Example


Transactions: 49 × {A}, 44 × {B}, 39 × {F}, 34 × {G}, 30 × {H}, 20 × {I}, plus ABGM, BGIM, FGIM, GHIM.

Header: A, B, F, G, H, I, M with counts A:50, B:45, F:40, G:35, H:30, I:20.

(Diagram: the FP-tree, rooted at null. Each four-item transaction forms a branch ending in M:1, namely A–B–G–M, B–G–I–M, F–G–I–M, and G–H–I–M, while the single-item transactions account for the large first-level counts.)

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = A: pattern = { A M }, support = 1; pattern base → cFP-Tree = null.

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = B: pattern = { B M }, support = 2; pattern base → cFP-Tree (next slide).

FP-Growth Mining Example


(Diagram: the conditional FP-tree for pattern { B M }: header G, B, M; nodes G:2, B:2, M:2.)

Patterns mined: BM, GM, BGM.
freq_itemset = { B M }
All patterns so far: BM, GM, BGM.

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = F: pattern = { F M }, support = 1; pattern base → cFP-Tree = null.

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = G: pattern = { G M }, support = 4; pattern base → cFP-Tree (next slides).

FP-Growth Mining Example


(Diagram: the conditional FP-tree for { G M }, rooted at null, with header I, B, G, M and node counts I:3, B:1, B:1, G:1, G:2, G:1, M:2, M:1, M:1.)

freq_itemset = { G M }, i = I: pattern = { I G M }, support = 3; pattern base → cFP-Tree.

FP-Growth Mining Example


(Diagram: the conditional FP-tree for pattern { I G M }: header I, G, M; nodes I:3, G:3, M:3.)

Patterns mined: IM, GM, IGM.
freq_itemset = { I G M }
All patterns so far: BM, GM, BGM, IM, GM, IGM.

FP-Growth Mining Example


(Diagram: the conditional FP-tree for { G M }, as above.)

freq_itemset = { G M }, i = B: pattern = { B G M }, support = 2; pattern base → cFP-Tree.

FP-Growth Mining Example


(Diagram: the conditional FP-tree for pattern { B G M }: header B, G, M; nodes B:2, G:2, M:2.)

Patterns mined: BM, GM, BGM.
freq_itemset = { B G M }
All patterns so far: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM.

FP-Growth Mining Example


(Diagram: the FP-tree above.)

freq_itemset = { M }, i = H: pattern = { H M }, support = 1; pattern base → cFP-Tree = null.

FP-Growth Mining Example


- Complete pass for { M }; move on to { I }, { H }, etc.
- Final frequent sets:
  L2: { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
  L3: { BGM, GIM, BGM, GIM }
  L4: none

FP-Growth Redundancy v. Apriori Candidate Sets


- FP-Growth (support = 2) generates:
  L2: 10 sets (5 distinct)
  L3: 4 sets (2 distinct)
  Total: 14 sets
- Apriori (support = 2) generates:
  C2: 21 sets
  C3: 2 sets
  Total: 23 sets
- What about support = 1? Apriori: 23 sets; FP-Growth: 28 sets in { M } with i = A, B, F alone!

FP-Growth vs. Apriori


- Apriori visits each transaction when generating new candidate sets; FP-Growth does not.
  - Data structures can be used to reduce the transaction list.
- FP-Growth traces the set of concurrent items; Apriori generates candidate sets (a textbook sketch of that generate-and-test loop follows).
- FP-Growth uses more complicated data structures & mining techniques.
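For contrast, a textbook-style Apriori sketch in Python (a simplification for illustration, not Borgelt's C implementation): each level joins frequent (k−1)-itemsets into k-candidates, prunes them, and then rescans every transaction to count them. That per-level rescan is the repeated work FP-Growth trades for tree construction.

```python
from itertools import combinations

def apriori(db, min_support):
    """Generate-and-test: one full DB scan per candidate level."""
    db = [frozenset(t) for t in db]
    # Level 1: count single items.
    counts = {}
    for t in db:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join: union pairs of frequent (k-1)-sets into k-candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count: visit each transaction again for this level's candidates.
        frequent = {c for c in candidates
                    if sum(1 for t in db if c <= t) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```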

Algorithm Analysis Results


- FP-Growth IS NOT inherently faster than Apriori.
  - Intuitively, it appears to condense data.
  - Its mining scheme requires new work to replace candidate set generation.
  - Recursion obscures the additional effort.
- FP-Growth may run faster than Apriori in some circumstances.
- Complexity analysis alone gives no guarantee of which algorithm is more efficient.

Improvements to FP-Growth
- None currently reported, apart from MLFPT (Multiple Local Frequent Pattern Tree):
  - A new algorithm based on FP-Growth.
  - Distributes FP-Trees among processors.
- No reports of complexity analysis, or of accuracy relative to FP-Growth.

Real World Applications


- Zheng, Kohavi, and Mason, "Real World Performance of Association Rule Algorithms"
  - Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, and MagnumOpus.
  - Tested the implementations against 1 artificial and 3 real data sets.
  - Generated time-based comparisons.

Apriori & FP-Growth


- Apriori
  - Implementation from creator Christian Borgelt (GNU Public License).
  - C implementation; entire dataset loaded into memory.
- FP-Growth
  - Implementation from creators Han & Pei.
  - Version of February 5, 2001.

Other Algorithms
- CHARM
  - Based on the concept of the closed itemset, e.g., { A, B, C } standing in for ABC, AC, BC, ACB, AB, CB, etc. (a small illustration of closed itemsets follows this list).
- CLOSET
  - Han & Pei's implementation of closed-itemset mining.
- MagnumOpus
  - Generates rules directly through a search-and-prune technique.
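A brief, hypothetical illustration of the closed-itemset idea behind CHARM and CLOSET: an itemset is closed when no proper superset has the same support, so the closed sets summarize all frequent sets without losing any support counts. The tiny database here is invented for the demo.

```python
from itertools import combinations

db = [{'A', 'B', 'C'}, {'A', 'B', 'C'}, {'A', 'B'}, {'B', 'C'}]

def support(itemset):
    return sum(1 for t in db if itemset <= t)

items = sorted(set().union(*db))
itemsets = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(frozenset(c)) > 0]

# Closed: no proper superset shares the itemset's support.
closed = [s for s in itemsets
          if not any(s < t and support(s) == support(t) for t in itemsets)]
for s in sorted(closed, key=sorted):
    print(sorted(s), support(s))
# ['A', 'B'] 3 / ['A', 'B', 'C'] 2 / ['B'] 4 / ['B', 'C'] 3
```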

Datasets
- IBM-Artificial
  - Generated at IBM Almaden (T10I4D100K).
  - Often used in association rule mining studies.
- BMS-POS
  - Years of point-of-sale data from a retailer.
- BMS-WebView-1 & BMS-WebView-2
  - Months of clickstream traffic from e-commerce web sites.

Dataset Characteristics

Experimental Considerations
- Hardware specifications
  - Dual 550 MHz Pentium III Xeon processors.
  - 1 GB memory.
- Support ∈ { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
- Confidence = 0%
- No other applications running (the second processor handles system processes).

(Charts: runtime vs. support for IBM-Artificial, BMS-POS, BMS-WebView-1, and BMS-WebView-2.)

Study Results Real Data


- At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth.
- At support < 0.20%, Apriori completes whenever FP-Growth completes.
  - Exception: BMS-WebView-2 @ 0.01%.
- When 2 million rules are generated, Apriori finishes in 10 minutes or less.
- Proposed: the bottleneck is NOT the rule algorithm, but rule analysis.

Real Data Results


BMS-POS:
Support | Apriori | FP-Growth | Rules
0.01    | 186m    | 120m      | 214,300,568
0.04    | 16m 9s  | 10m 41s   | 5,061,105
0.06    | 8m 35s  | 6m 7s     | 1,837,824
0.10    | 3m 58s  | 3m 12s    | 530,353
0.20    | 1m 14s  | 1m 35s    | 103,449

BMS-WebView-1:
Support | Apriori | FP-Growth | Rules
0.01    | Failed  | Failed    | Failed
0.04    | Failed  | Failed    | Failed
0.06    | 1m 50s  | 52s       | 3,011,836
0.10    | 1.2s    | 1.2s      | 10,360
0.20    | 0.4s    | 0.7s      | 1,516

BMS-WebView-2:
Support | Apriori | FP-Growth | Rules
0.01    | Failed  | 13m 12s   | Failed
0.04    | 58s     | 29s       | 1,096,720
0.06    | 28s     | 16s       | 510,233
0.10    | 9.1s    | 5.9s      | 119,335
0.20    | 2.4s    | 2.3s      | 12,665

Study Results Artificial Data


- At support < 0.40%, FP-Growth performs MUCH faster than Apriori.
- At support ≥ 0.40%, FP-Growth and Apriori are comparable.

IBM-Artificial (T10I4D100K):
Support | Apriori | FP-Growth | Rules
0.01    | 4m 4s   | 20s       | 1,376,684
0.04    | 1m 1s   | 9.2s      | 56,962
0.06    | 44s     | 8.2s      | 41,215
0.10    | 34s     | 7.1s      | 26,962
0.20    | 20s     | 5.8s      | 13,151
0.40    | 5.7s    | 4.3s      | 1,997

Real-World Study Conclusions


- FP-Growth (and the other non-Apriori algorithms) perform better on artificial data.
- On all data sets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets.
- FP-Growth may be suitable when low support, a large result count, and fast generation are needed.
- Future research may best be directed toward analyzing association rules.

Research Conclusions
- FP-Growth does not have better complexity than Apriori.
  - Common sense indicates it will run faster.
- FP-Growth does not always have a better running time than Apriori.
  - Support and dataset appear more influential.
- FP-Trees are very complex structures (Apriori's are simple).
- The location of the data (memory vs. disk) is a non-factor in comparing the algorithms.

To Use or Not To Use?


- Question: should FP-Growth be used in favor of Apriori? Considerations:
  - Difficulty to code.
  - High performance in extreme cases.
  - Personal preference.
- More relevant questions:
  - What kind of data is it?
  - What kind of results do I want?
  - How will I analyze the resulting rules?

References

Han, Jiawei, Pei, Jian, and Yin, Yiwen. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.

Orlando, Salvatore. High Performance Mining of Short and Long Patterns. 2001.

Pei, Jian, Han, Jiawei, and Mao, Runying. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.

Webb, Geoffrey I. Efficient Search for Association Rules. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.

Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul. Fast Parallel Association Rule Mining Without Candidacy Generation. In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM 2001), San Jose, CA, USA, November 29 - December 2, 2001.

Zaki, Mohammed J. Generating Non-Redundant Association Rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.

Zheng, Zijian, Kohavi, Ron, and Mason, Llew. Real World Performance of Association Rule Algorithms. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.
