Presentation Overview
- Association Rule Mining refresher
- FP-Growth Algorithm: motivation & example
- FP-Growth complexity
- FP-Growth vs. Apriori

Association Rule Mining
1. Generate frequent itemsets
   - Apriori generates candidate sets; FP-Growth uses specialized data structures (no candidate sets)
2. Find association rules
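For contrast with FP-Growth, the candidate-set approach can be sketched as follows. This is a minimal Python illustration of level-wise Apriori under my own naming, not any particular implementation:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise Apriori sketch: generate candidate k-itemsets by joining
    frequent (k-1)-itemsets, prune, then scan the DB to count candidates."""
    transactions = [frozenset(t) for t in transactions]
    # Count 1-itemsets with one pass over the DB.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # Candidate generation: union of two frequent (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count the surviving candidates with another full pass over the DB.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

Run on the five transactions of the example slides with min_support = 3, this finds, e.g., {f, c, a, m} with support 3. The repeated full passes over the DB to count each candidate generation are exactly what FP-Growth claims to avoid.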
FP-Growth Example
TID   Items bought                (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o}             {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

[FP-Tree figure: null root {} with branches f:4 - c:3 - a:3 - m:2 - p:2, a:3 - b:1 - m:1, f:4 - b:1, and c:1 - b:1 - p:1]

Conditional pattern bases:
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
FP-Growth Example
Item   Conditional pattern-base        Conditional FP-tree
p      {(fcam:2), (cb:1)}              {(c:3)}|p
m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}         Empty
a      {(fc:3)}                        {(f:3, c:3)}|a
c      {(f:3)}                         {(f:3)}|c
f      Empty                           Empty
FP-Growth Algorithm
FP-CreateTree
Input: DB, min_support
Output: FP-Tree
1. Scan DB & count all frequent items.
2. Create null root & set as current node.
3. For each transaction T, for each frequent item I in T:
   insert I into the tree as a child of the current node, and connect the new tree node to the header list.

Two passes through DB; tree creation is based on the number of items in DB. Complexity of CreateTree is O(|DB|).
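The two-pass construction can be sketched in Python. This is a minimal in-memory illustration; the class and function names are my own, and ties in the frequency ordering are broken alphabetically here (any fixed order works):

```python
from collections import defaultdict

class FPNode:
    """One FP-Tree node: item, count, parent/children links, and a `next`
    link chaining nodes of the same item into the header list."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}   # item -> FPNode
        self.next = None     # next node holding the same item

def create_tree(transactions, min_support):
    # Pass 1: scan DB and count items; keep only the frequent ones.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)                # null root
    header = {i: None for i in frequent}     # item -> head of its node chain

    # Pass 2: insert each transaction with items in descending frequency.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                child.next = header[item]    # push onto the header list
                header[item] = child
            node = node.children[item]
    return root, header, frequent
```

Summing the counts along an item's header chain recovers its support, which is how the header table on the example slides is used.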
FP-Growth Example
[Repeats the transaction table, header table, and FP-Tree of the earlier FP-Growth Example slide.]
FP-Growth Algorithm
FP-Growth
Input: FP-Tree, f_is (frequent itemset)
Output: all freq_patterns

if FP-Tree contains a single path P then
    for each combination of nodes in P
        generate pattern f_is ∪ combination
else
    for each item i in header
        generate pattern f_is ∪ i
        construct pattern's conditional cFP-Tree
        if cFP-Tree ≠ null then
            call FP-Growth(cFP-Tree, pattern)

The recursive algorithm creates FP-Tree structures and calls FP-Growth on them; it claims no candidate combination generation.
Two-Part Algorithm
Part A:
if FP-Tree contains single path P then
    for each combination of nodes in P
        generate pattern f_is ∪ combination

[Figure: a single path A - B - C - D; e.g., every combination of { A, B, C, D } is joined with the suffix p]
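Part A can be sketched directly. This is a hypothetical helper of my own naming; the real algorithm also assigns each combination a support equal to the minimum node count along the path:

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """When the (conditional) FP-Tree is a single path, every non-empty
    combination of its nodes, joined with the current suffix f_is,
    is generated as a pattern."""
    patterns = []
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            patterns.append(set(combo) | set(suffix))
    return patterns

# The slide's single path A-B-C-D with suffix {p}:
# 2^4 - 1 = 15 patterns, from {A, p} up to {A, B, C, D, p}.
pats = single_path_patterns(['A', 'B', 'C', 'D'], ['p'])
```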
Two-Part Algorithm
Part B:
else for each item i in header
    generate pattern f_is ∪ i
    construct pattern's conditional cFP-Tree
    if cFP-Tree ≠ null then call FP-Growth(cFP-Tree, pattern)

[Figure: a tree over { A, B, C, D, E } with branches B-C-D-E, C-D-E, and D-E under A]
e.g., for f_is = D:
    i = A, p_base = (ABCD), (ACD), (AD)
    i = B, p_base = (ABCD)
    i = C, p_base = (ABCD)
    i = E, p_base = null
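The conditional pattern bases in the examples can be read off the frequency-ordered transactions. A small sketch (my own naming) that is equivalent to following an item's header list and ascending to the root, just over a flat transaction list instead of a shared tree:

```python
def conditional_pattern_base(ordered_transactions, item):
    """Return the prefix path preceding `item` in each transaction that
    contains it. In the tree, the same prefixes are found by walking the
    header list for `item` and ascending each node's parents to the root;
    there, shared prefixes carry node counts instead of repeating."""
    base = []
    for t in ordered_transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:
                base.append(prefix)
    return base
```

With the ordered transactions of the example slides, the base for m comes out as (f, c, a) twice plus (f, c, a, b), matching fca:2, fcab:1.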
FP-Growth Complexity
For each header item, FP-Growth follows the path from the header list to each tree node containing that item, and from that node up to the tree root. Therefore, each path in the tree is at least partially traversed up to (the number of items in that tree path, i.e., the depth of the path) * (the number of items in the header) times. The complexity of searching through all paths is then bounded by O(header_count^2 * tree depth). Creation of a new cFP-Tree also occurs at each step.
[Figure: example FP-Trees over items A through M, illustrating the traversals performed for each header item.]
All paths that contain an itemset with the test item MUST be taken into account, and you CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur. Trivial examples hide these costs, leading to a belief that FP-Growth operates more efficiently than it does.
Transactions: A (49), B (44), F (39), G (34), H (30), I (20), ABGM, BGIM, FGIM, GHIM
[FP-Tree figure: the four multi-item transactions contribute the branches B:1 - G:1 - M:1 under A, G:1 - I:1 - M:1 under B, G:1 - I:1 - M:1 under F, and H:1 - I:1 - M:1 under G]
[The FP-Tree is repeated across animation slides as each path containing M is visited.]
freq_itemset = { M }
freq_itemset = { B M }
[Conditional FP-Tree figure: B:1, G:1; G:2, M:2; G:1, M:1, M:1]
freq_itemset = { G M }
freq_itemset = { B G M }
All patterns: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM
freq_itemset = { M }
L2 : { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
L3 : { BGM, GIM, BGM, GIM }
L4 : None
Apriori generates candidate sets; FP-Growth uses more complicated data structures & mining techniques than Apriori.
Intuitively, FP-Growth appears to condense data, but its mining scheme requires new work to replace candidate set generation, and recursion obscures the additional effort.
FP-Growth may run faster than Apriori.
Improvements to FP-Growth
- None currently reported
MLFPT
- Multiple Local Frequent Pattern Tree: a new algorithm based on FP-Growth that distributes FP-Trees among processors
- No reports of complexity analysis or accuracy of FP-Growth
Tested implementations of Apriori, FP-Growth, CLOSET, CHARM, and MagnumOpus against 1 artificial and 3 real data sets; time-based comparisons were generated.
FP-Growth Implementation
- From creator Christian Borgelt (GNU Public License)
- C implementation
- Entire dataset loaded into memory
Other Algorithms
CHARM
Based on the concept of the Closed Itemset: a closed itemset such as { A, B, C } subsumes its sub-itemsets ABC, AC, BC, ACB, AB, CB, etc.
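The closed-itemset idea can be illustrated with a brute-force sketch: an itemset is closed when no proper superset has the same support. This is for intuition only (CHARM itself searches far more efficiently); the names are my own:

```python
from itertools import combinations

def closed_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of closed frequent itemsets: count every
    itemset over the item universe, then keep those whose support is not
    matched by any frequent proper superset. Exponential; demo-sized only."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    support = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = frozenset(combo)
            count = sum(1 for t in transactions if s <= t)
            if count >= min_support:
                support[s] = count
    # Closed: no frequent proper superset shares the same support.
    return {s: c for s, c in support.items()
            if not any(s < t and c == support[t] for t in support)}
```

For example, over the transactions ABC, ABC, AB, the closed itemsets are AB (support 3) and ABC (support 2); A, B, C, AC, and BC are all subsumed by one of those two.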
CLOSET
From Pei, Han, and Mao: an efficient algorithm for mining frequent closed itemsets.
MagnumOpus
Generates association rules directly via efficient search (Webb).
Datasets
IBM-Artificial (generated)
Dataset Characteristics
Experimental Considerations
Hardware Specifications: dual-processor machine
Support levels: { 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
Confidence = 0%
No other applications running (second processor handles system processes)
IBM-Artificial
BMS-POS
BMS-WebView-1
BMS-WebView-2
Apriori performs as fast as or better than FP-Growth; at support < 0.20%, Apriori completes whenever FP-Growth completes.
Exception: BMS-WebView-2 @ 0.01%
Support    [column labels not recovered]
0.04       5,061,105    Failed       1,096,720
0.06       1,837,824    3,011,836    510,233
0.10       530,353      10,360       119,335
0.20       103,449      1,516        12,665
FP-Growth performs MUCH faster than Apriori; at support 0.40%, FP-Growth and Apriori are comparable.

Support   Rules       Apriori   FP-Growth
0.01      1,376,684   4m 4s     20s
0.04      56,962      1m 1s     9.2s
0.06      41,215      44s       8.2s
0.10      26,962      34s       7.1s
0.20      13,151      20s       5.8s
0.40      1,997       5.7s      4.3s
Both algorithms perform better on artificial data. On all data sets, Apriori performs sufficiently well, in reasonable time periods, for reasonable result sets. FP-Growth may be suitable when low support, a large result count, and fast generation are needed. Future research may best be directed toward analyzing association rules.
Research Conclusions
FP-Growth does not have a better complexity than Apriori (and Apriori is simple). Location of data (memory vs. disk) is a non-factor in the comparison of the algorithms.
What kind of data is it? What kind of results do I want? How will I analyze the resulting rules?
References
Han, Jiawei, Pei, Jian, and Yin, Yiwen. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
Orlando, Salvatore. High Performance Mining of Short and Long Patterns. 2001.
Pei, Jian, Han, Jiawei, and Mao, Runying. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
Webb, Geoffrey L. Efficient Search for Association Rules. In Proc. of the Sixth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 99-107, 2000.
Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul. Fast Parallel Association Rule Mining Without Candidacy Generation. In Proc. of the IEEE 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, USA, November 29-December 2, 2001.
Zaki, Mohammed J. Generating Non-Redundant Association Rules. In Proc. of the Int. Conf. on Knowledge Discovery and Data Mining, 2000.
Zheng, Zijian, Kohavi, Ron, and Mason, Llew. Real World Performance of Association Rule Algorithms. In Proc. of the Seventh ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.