Data Mining

Dynamic Itemset Counting
References:
S. Brin, R. Motwani, J.D. Ullman, S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD Record, Volume 26, Number 2, New York, June 1997, pp. 255-264.
Su, Yibin, "Dynamic Itemset Counting and Implication Rules for Market Basket Data: Project Final Report", CS831, April 2000.
Introduction

Alternative to Apriori Itemset Generation.
Itemsets are dynamically added and deleted as transactions are read.
Relies on the fact that for an itemset to be frequent, all of its subsets must also be frequent, so we only examine those itemsets whose subsets are all frequent.
Algorithm stops after every M transactions to add more itemsets.
Train analogy: There are stations every M transactions. The passengers are itemsets. Itemsets can get on at any stop as long as they get off at the same stop in the next pass around the database. Only itemsets on the train are counted when they occur in transactions. At the very beginning we can start counting 1-itemsets, at the first station we can start counting some of the 2-itemsets. At the second station we can start counting 3-itemsets as well as any more 2-itemsets that can be counted and so on.
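The subset property used above can be sketched in a few lines of Java. This is an illustration only: the class and method names are hypothetical and do not come from dic.java, and items are assumed to be represented as strings.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not taken from dic.java.
class SubsetCheck {
    // Returns true when every subset obtained by dropping exactly one item
    // from the candidate is present in the collection of frequent itemsets.
    static boolean allImmediateSubsetsFrequent(Set<String> candidate,
                                               Set<Set<String>> frequent) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequent.contains(subset)) {
                return false;   // one infrequent subset rules the candidate out
            }
        }
        return true;
    }
}
```

Checking only the immediate subsets is enough here, because those subsets were themselves admitted under the same rule.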
Itemsets are marked in four different ways as they are counted:

Solid box: confirmed frequent itemset - an itemset we have finished counting and exceeds the support threshold minsupp
Solid circle: confirmed infrequent itemset - we have finished counting and it is below minsupp
Dashed box: suspected frequent itemset - an itemset we are still counting that exceeds minsupp
Dashed circle: suspected infrequent itemset - an itemset we are still counting that is below minsupp
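One possible way to model the four markings in Java is an enum with the two transitions the notes describe (circle to box, dashed to solid). The names below are hypothetical; the source only describes the markings themselves.

```java
// Hypothetical names; a sketch of the four markings and their transitions.
enum Mark {
    DASHED_CIRCLE, DASHED_BOX, SOLID_CIRCLE, SOLID_BOX;

    // Dashed itemsets are still being counted.
    boolean isDashed() {
        return this == DASHED_CIRCLE || this == DASHED_BOX;
    }

    // Boxes are (suspected or confirmed) frequent itemsets.
    boolean isBox() {
        return this == DASHED_BOX || this == SOLID_BOX;
    }

    // A dashed circle whose counter reaches the threshold becomes a dashed box.
    Mark onThresholdReached() {
        return this == DASHED_CIRCLE ? DASHED_BOX : this;
    }

    // A dashed itemset counted through all transactions becomes solid.
    Mark onFullPass() {
        switch (this) {
            case DASHED_CIRCLE: return SOLID_CIRCLE;
            case DASHED_BOX:    return SOLID_BOX;
            default:            return this;
        }
    }
}
```

Solid itemsets never change again, which is why both transition methods return the current value for them.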
DIC Algorithm
Algorithm:
1. Mark the empty itemset with a solid square. Mark all the 1-itemsets with dashed circles. Leave all other itemsets unmarked.
2. While any dashed itemsets remain:
   1. Read M transactions (if we reach the end of the transaction file, continue from the beginning). For each transaction, increment the respective counters for the itemsets that appear in the transaction and are marked with dashes.
   2. If a dashed circle's count exceeds minsupp, turn it into a dashed square. If any immediate superset of it has all of its subsets as solid or dashed squares, add a new counter for it and make it a dashed circle.
   3. Once a dashed itemset has been counted through all the transactions, make it solid and stop counting it.
Itemset lattices: An itemset lattice contains all of the possible itemsets for a transaction database. Each itemset in the lattice points to all of its supersets. When represented graphically, an itemset lattice can help us to understand the concepts behind the DIC algorithm.
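As a small illustration of the lattice structure, the sketch below lists the immediate supersets of an itemset, i.e. the nodes one level up that it points to. The class and method names are hypothetical, and items are assumed to be strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for walking one level up the itemset lattice.
class Lattice {
    // Returns every itemset formed by adding exactly one missing item,
    // in sorted order of the added item.
    static List<Set<String>> immediateSupersets(Set<String> itemset,
                                                Set<String> allItems) {
        List<Set<String>> result = new ArrayList<>();
        for (String item : new TreeSet<>(allItems)) {
            if (!itemset.contains(item)) {
                Set<String> superset = new TreeSet<>(itemset);
                superset.add(item);
                result.add(superset);
            }
        }
        return result;
    }
}
```

For the items A, B and C, the itemset {A} points to {A, B} and {A, C}, and the empty itemset points to the three 1-itemsets.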
Example: minsupp = 25% and M = 2.

TID  A  B  C
T1   1  1  0
T2   1  0  0
T3   0  1  1
T4   0  0  0

Transaction Database
Itemset lattice for the above transaction database:
Itemset lattice before any transactions are read:
Counters: A = 0, B = 0, C = 0 Empty itemset is marked with a solid box. All 1-itemsets are marked with dashed circles.
After M transactions are read:
Counters: A = 2, B = 1, C = 0, AB = 0
We change A and B to dashed boxes because their counters have reached minsup (1), and add a counter for AB because both of its subsets are boxes.
After 2M transactions are read:
Counters: A = 2, B = 2, C = 1, AB = 0, AC = 0, BC = 0
C changes to a box because its counter has reached minsup. A, B and C have been counted all the way through, so we stop counting them and make their boxes solid. Add counters for AC and BC because their subsets are all boxes.
After 3M transactions read:
Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 0
AB has been counted all the way through and its counter satisfies minsup so we change it to a solid box.
After 4M transactions read:
Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 1
BC's counter reaches minsup, so it changes to a box. AC and BC are counted all the way through, so BC becomes a solid box and AC a solid circle. We do not count ABC because one of its subsets (AC) is a circle. There are no dashed itemsets left so the algorithm is done.
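The final counters can be cross-checked with a direct count over the four transactions. The sketch below assumes the database reads T1 = {A, B}, T2 = {A}, T3 = {B, C}, T4 = {} (which is what the counters in the walkthrough imply); the class and method names are hypothetical.

```java
// Hypothetical brute-force check of the worked example, not part of dic.java.
class ExampleCheck {
    // Rows are T1..T4; columns are items A, B, C (assumed layout).
    static final int[][] DB = {
        {1, 1, 0},
        {1, 0, 0},
        {0, 1, 1},
        {0, 0, 0}
    };

    // Counts the transactions that contain every item whose column index is given.
    static int support(int... columns) {
        int count = 0;
        for (int[] row : DB) {
            boolean containsAll = true;
            for (int col : columns) {
                if (row[col] == 0) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                count++;
            }
        }
        return count;
    }
}
```

With columns 0, 1, 2 standing for A, B, C, `support(0, 1)` gives the count for AB and `support(0, 2)` the count for AC, matching AB = 1 and AC = 0 above.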
Implementation
Go to the DIC Implementation page to see a working implementation in Java.
Operations:
1. add new itemsets
2. maintain a counter for every itemset
3. manage itemset states from dashed to solid and from circle to square when itemsets become large
4. determine which new itemsets should be added because they could potentially be large
Pseudocode Algorithm:

    SS = ∅ ;                    // solid square (frequent)
    SC = ∅ ;                    // solid circle (infrequent)
    DS = ∅ ;                    // dashed square (suspected frequent)
    DC = { all 1-itemsets } ;   // dashed circle (suspected infrequent)
    while (DS != ∅) or (DC != ∅) do begin
        read M transactions from database into T
        forall transactions t ∈ T do begin
            // increment the respective counters of the itemsets marked with dash
            for each itemset c in DS or DC do begin
                if ( c ⊆ t ) then
                    c.counter++ ;
            end
        end
        for each itemset c in DC
            if ( c.counter ≥ threshold ) then begin
                move c from DC to DS ;
                if ( any immediate superset sc of c has all of its subsets in SS or DS ) then
                    add a new itemset sc in DC ;
            end
        for each itemset c in DS
            if ( c has been counted through all transactions ) then
                move it into SS ;
        for each itemset c in DC
            if ( c has been counted through all transactions ) then
                move it into SC ;
    end
    Answer = { c ∈ SS } ;
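The pseudocode above could be turned into Java roughly as follows. This is a minimal sketch under stated assumptions, not the code in dic.java: all names are hypothetical, itemsets and transactions are stored as int bitmasks (so at most 31 items), the threshold is an absolute count, and an itemset is treated as frequent once its count reaches that threshold, as in the worked example where B qualifies with a count of 1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: all names are hypothetical, none come from dic.java.
// Bit i of a mask is set when item i is present.
class DicSketch {
    static final class Counter {
        int count;              // transactions containing the itemset so far
        int seen;               // transactions read since counting began
        boolean box;            // square/box = (suspected) frequent
        boolean dashed = true;  // dashed = still being counted
    }

    // Returns itemset bitmask -> support count for every solid box.
    static Map<Integer, Integer> frequentItemsets(int[] db, int numItems,
                                                  int minCount, int m) {
        if (db.length == 0 || m < 1) {
            throw new IllegalArgumentException("need transactions and M >= 1");
        }
        Map<Integer, Counter> table = new LinkedHashMap<>();
        Counter empty = new Counter();          // empty itemset: solid box
        empty.box = true;
        empty.dashed = false;
        table.put(0, empty);
        for (int i = 0; i < numItems; i++) {
            table.put(1 << i, new Counter());   // 1-itemsets: dashed circles
        }
        int pos = 0;
        while (anyDashed(table)) {
            // Read M transactions, wrapping around at the end of the database.
            for (int k = 0; k < m; k++) {
                int t = db[pos];
                pos = (pos + 1) % db.length;
                for (Map.Entry<Integer, Counter> e : table.entrySet()) {
                    Counter c = e.getValue();
                    if (!c.dashed || c.seen >= db.length) continue;
                    int mask = e.getKey();
                    if ((mask & t) == mask) c.count++;
                    c.seen++;
                }
            }
            // Station: circles that reached the threshold become boxes.
            for (Counter c : table.values()) {
                if (c.dashed && c.count >= minCount) c.box = true;
            }
            // Start counting supersets whose immediate subsets are all boxes.
            List<Integer> added = new ArrayList<>();
            for (Map.Entry<Integer, Counter> e : table.entrySet()) {
                if (!e.getValue().box) continue;
                for (int i = 0; i < numItems; i++) {
                    int sup = e.getKey() | (1 << i);
                    if (!table.containsKey(sup) && !added.contains(sup)
                            && allSubsetsAreBoxes(sup, numItems, table)) {
                        added.add(sup);
                    }
                }
            }
            for (int sup : added) table.put(sup, new Counter());
            // Itemsets counted through the whole database become solid.
            for (Counter c : table.values()) {
                if (c.dashed && c.seen >= db.length) c.dashed = false;
            }
        }
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, Counter> e : table.entrySet()) {
            if (e.getKey() != 0 && e.getValue().box) {
                result.put(e.getKey(), e.getValue().count);
            }
        }
        return result;
    }

    static boolean anyDashed(Map<Integer, Counter> table) {
        for (Counter c : table.values()) {
            if (c.dashed) return true;
        }
        return false;
    }

    // True when dropping any single item from sup yields a box in the table.
    static boolean allSubsetsAreBoxes(int sup, int numItems,
                                      Map<Integer, Counter> table) {
        for (int i = 0; i < numItems; i++) {
            if ((sup & (1 << i)) == 0) continue;
            Counter sub = table.get(sup & ~(1 << i));
            if (sub == null || !sub.box) return false;
        }
        return true;
    }
}
```

With A, B, C as bits 0, 1, 2, the four-transaction example (minsupp = 25%, M = 2) becomes `frequentItemsets(new int[]{3, 1, 6, 0}, 3, 1, 2)`, and the returned map holds A, B, C, AB and BC with their counts. The `seen` field is what lets an itemset "get off at the same stop" one pass later, even when M does not divide the number of transactions.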
DIC Implementation
The DIC algorithm has been implemented as dic.java.
Note: The DIC implementation given here may not produce accurate output for small databases (fewer than 100 transactions). To get accurate output for these databases we need to choose step M > 4.
Download the following files:
1. dic.java: The DIC algorithm.
2. config.txt: Consists of four lines.
   1. Number of items
   2. Number of transactions
   3. Minimum support, e.g. 20 represents 20% minsupp
   4. Size of step M for the DIC algorithm. This line is ignored by the Apriori algorithm
3. transa.txt: Contains the transaction database as an n x m table, with n rows and m columns. Each row represents a transaction. Columns are separated by a space and represent items. A 1 indicates that an item is present in the transaction and a 0 indicates that it is not. The sample file has 10000 lines (transactions) with values for 8 items on each line.
Compile the .java file:
hercules[1]% javac -deprecation dic.java
Any warning messages about deprecated APIs can be ignored. If you get the following message instead, you forgot the -deprecation flag:
Note: dic.java uses a deprecated API. Recompile with "-deprecation" for details.
Change config.txt and transa.txt to represent the database and criteria to be tested.
Run the program:
hercules[2]% java dic
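To make the two input formats concrete, here is a hedged sketch of how the four config.txt values and one transa.txt row might be parsed. It is not how dic.java reads its input: the class and method names are hypothetical, the rounding in `minCount` is an assumption, and the row parser ignores whitespace so that it accepts rows with or without spaces between columns.

```java
import java.util.List;

// Hypothetical parsing helpers; dic.java may read its files differently.
class InputFormats {
    // The four config.txt values, in the order the notes list them.
    static final class Config {
        final int numItems;
        final int numTransactions;
        final int minSupPercent;
        final int stepM;

        Config(List<String> lines) {
            numItems = Integer.parseInt(lines.get(0).trim());
            numTransactions = Integer.parseInt(lines.get(1).trim());
            minSupPercent = Integer.parseInt(lines.get(2).trim());
            stepM = Integer.parseInt(lines.get(3).trim());
        }

        // Assumption: the percentage is rounded up to a whole number of transactions.
        int minCount() {
            return (int) Math.ceil(minSupPercent * numTransactions / 100.0);
        }
    }

    // Parses one row of 0/1 flags into a bitmask (bit i set = item i present).
    static int parseRow(String row, int numItems) {
        String flags = row.replaceAll("\\s+", "");
        if (flags.length() != numItems) {
            throw new IllegalArgumentException("expected " + numItems + " columns");
        }
        int mask = 0;
        for (int i = 0; i < numItems; i++) {
            char ch = flags.charAt(i);
            if (ch == '1') {
                mask |= 1 << i;
            } else if (ch != '0') {
                throw new IllegalArgumentException("unexpected flag: " + ch);
            }
        }
        return mask;
    }
}
```

For example, a configuration of 5 items, 5 transactions and 40% minimum support gives a `minCount()` of 2 transactions.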
Example
We use the database example from Apriori Itemset Generation. The minsupp is 40%.
TID  A  B  C  D  E
T1   1  1  1  0  0
T2   1  1  1  1  1
T3   1  0  1  1  0
T4   1  0  1  1  1
T5   1  1  1  1  0
Transa.txt contains a row for each of the five transactions and a column for each of the five items.
11100
11111
10110
10111
11110
transa.txt
Config.txt: Here we use 5 as the size of step M for the DIC algorithm
5
5
40
5
Output:
hercules[67]% java apriori
Algorithm apriori starting now.....
Press 'C' to change the default configuration and transaction files or any other key to continue.
Input configuration: 5 items, 5 transactions, minsup = 40%
Frequent 1-itemsets: [1, 2, 3, 4, 5]
Frequent 2-itemsets: [1 2, 1 3, 1 4, 1 5, 2 3, 2 4, 3 4, 3 5, 4 5]
Frequent 3-itemsets: [1 2 3, 1 2 4, 1 3 4, 1 3 5, 1 4 5, 2 3 4, 3 4 5]
Frequent 4-itemsets: [1 2 3 4, 1 3 4 5]
Execution time is: 0 seconds.
hercules[68]%
Execution of dic.java
We get the same results as we did earlier when we did the Apriori algorithm by hand.
