
1. INTRODUCTION
1.1 Synopsis

This project, entitled "Study on Value-Added Service in Mobile Telecom Based on Association Rules", is extracted from the proceedings of the 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing. With the continuous development of information technology, it is being applied to data storage and management in more and more fields. However, in the face of ever-growing volumes of data, building a good database management system still needs further exploration. Data mining not only makes data storage more efficient, it can also extract hidden, useful information. The current Internet technology and its growing demand necessitate the development of more advanced data mining technologies to interpret the information and knowledge in data distributed all over the world, and in the 21st century this demand continues to grow. Data mining can discover interesting patterns or relationships describing the data, and predict or classify behavior based on a model built from the available data. In other words, it is an interdisciplinary field with the general goal of predicting outcomes and uncovering relationships in data. It uses automated tools that employ sophisticated algorithms to discover hidden patterns, associations, anomalies, and/or structure in large amounts of data stored in data warehouses or other information repositories, and to filter the necessary information from these large datasets. Association rule mining refers to discovering association relationships among different attributes. In telecommunications sales, data mining can analyze which offerings are optimal and rational to sell together: association rule mining can reveal relationships between commodities, such as items that are often bought together at the same time.

The telecommunications industry is a typical data-intensive industry, and with the deepening of telecom reform, competition is becoming increasingly fierce. Compared with other industries, the telecommunications industry holds far more user data, which can be analyzed accurately to obtain useful knowledge. To win the competition, operators should find more business opportunities and provide users with better service. As a result, data warehousing and data mining have important value in the telecommunications industry.

Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field, drawing on areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields such as business, economics, and bioinformatics. The knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation. Since different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, including data characterization, discrimination, association, classification, clustering, trend and deviation analysis, and similarity analysis. These tasks may use the same database in different ways and require the development of numerous data mining techniques. Data mining functionalities include the discovery of concept or class descriptions, association, classification, prediction, clustering, trend analysis, deviation analysis, and similarity analysis; characterization and discrimination are forms of data summarization. Data mining systems can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications adapted, and data mining itself can be classified into descriptive data mining and predictive data mining. Concept description is the most basic form of descriptive data mining: it describes a given set of task-relevant data in a concise and summarative manner, presenting the general properties of the data. Efficient and effective data mining in large databases poses numerous requirements and great challenges to researchers and developers. The issues involved include data mining methodology, user interaction, performance and scalability, and the processing of a large variety of data types. Other issues include the exploration of data mining applications and their social impacts.

1.2 Apriori Algorithm For Frequent Itemsets

Finding frequent itemsets in transaction databases has been demonstrated to be useful in several business applications. Many algorithms have been proposed to find frequent itemsets in very large databases. However, there is no published implementation that outperforms every other implementation on every database at every support threshold. In general, most implementations are based on two main algorithms: Apriori and frequent pattern growth (FP-growth). The Apriori algorithm discovers the frequent itemsets in a very large database through a series of iterations; in each iteration it must generate candidate itemsets, compute their support, and prune the candidate itemsets down to the frequent itemsets. The FP-growth algorithm discovers frequent itemsets without the time-consuming candidate generation process that is critical for the Apriori algorithm. Although FP-growth generally outperforms Apriori in most cases, several refinements of the Apriori algorithm have been made to speed up frequent itemset mining. This project implements a parallel Apriori algorithm based on Bodon's work and analyzes its performance on a parallel computer. Bodon's implementation was adopted for parallel computing because its trie data structure outperforms the other implementations that use a hash tree. The rest of this report introduces related work on frequent itemset mining, presents our implementation for parallel frequent itemset mining, and presents the experimental results of the implementation on a symmetric multiprocessing computer.

1.3 Key Issues Of Apriori Algorithm


The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.

1. Key Concepts
Frequent itemsets: the sets of items that have minimum support (denoted by Li for the itemsets of size i).
Apriori property: any subset of a frequent itemset must itself be frequent.
Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself (a code sketch of the join and prune steps follows).
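The join and prune steps can be sketched in a few lines of Java. This is only an illustration of the technique (the class and method names are ours, not taken from Bodon's or the project's code); itemsets are represented as sorted lists of item names.

import java.util.*;

// Minimal sketch of Apriori candidate generation (join + prune).
public class CandidateGeneration {

    // Join L(k-1) with itself: two (k-1)-itemsets that agree on their
    // first k-2 items are merged into one candidate k-itemset.
    public static Set<List<String>> aprioriGen(Set<List<String>> lPrev) {
        Set<List<String>> candidates = new HashSet<>();
        for (List<String> a : lPrev) {
            for (List<String> b : lPrev) {
                int k1 = a.size();
                if (a.subList(0, k1 - 1).equals(b.subList(0, k1 - 1))
                        && a.get(k1 - 1).compareTo(b.get(k1 - 1)) < 0) {
                    List<String> c = new ArrayList<>(a);
                    c.add(b.get(k1 - 1));
                    // Prune step: every (k-1)-subset of c must be frequent.
                    if (allSubsetsFrequent(c, lPrev)) {
                        candidates.add(c);
                    }
                }
            }
        }
        return candidates;
    }

    // Apriori property: drop c if any of its (k-1)-subsets is not in L(k-1).
    private static boolean allSubsetsFrequent(List<String> c, Set<List<String>> lPrev) {
        for (int i = 0; i < c.size(); i++) {
            List<String> subset = new ArrayList<>(c);
            subset.remove(i);
            if (!lPrev.contains(subset)) {
                return false;
            }
        }
        return true;
    }
}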

2. Methods to Improve Apriori's Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent.
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the sketch after this list).
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
Sampling: mine on a subset of the given data with a lower support threshold, plus a method to determine the completeness.
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
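As a rough illustration of the transaction-reduction idea only (a hedged sketch with our own helper types, not code from the project), a reduction pass simply drops transactions that support none of the current frequent k-itemsets before the next scan:

import java.util.*;

// Sketch: transaction reduction for Apriori.
// A transaction containing no frequent k-itemset cannot contribute to any
// (k+1)-itemset count, so it is removed before the next database scan.
public class TransactionReduction {
    public static List<Set<String>> reduce(List<Set<String>> transactions,
                                           Set<List<String>> frequentK) {
        List<Set<String>> kept = new ArrayList<>();
        for (Set<String> t : transactions) {
            for (List<String> itemset : frequentK) {
                if (t.containsAll(itemset)) {   // t supports at least one frequent k-itemset
                    kept.add(t);
                    break;
                }
            }
        }
        return kept;
    }
}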

1.4 Data Mining Concepts And Techniques


Mining Different Kinds of Knowledge in Databases
Since different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, including data characterization, discrimination, association, classification, clustering, trend and deviation analysis, and similarity analysis. These tasks may use the same database in different ways and require the development of numerous data mining techniques.

Data Mining Query Languages and Ad Hoc Data Mining
Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a similar manner, high-level data mining query languages are needed so that users can describe ad hoc data mining tasks by facilitating the specification of the relevant sets of data for analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered patterns. Such a language should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.

Database Technology
Database technology has evolved from primitive file processing to the development of database management systems with query and transaction processing. Further progress has led to an increasing demand for efficient and effective data analysis and data understanding tools. This need is a result of the explosive growth in data collected from applications including business and management, government administration, science and engineering, and environmental control.

Data Mining
Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field, drawing on areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields such as business, economics, and bioinformatics.

Figure 1.4.1 Architecture of a Typical Data Mining System.

A Knowledge Discovery Process

A knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

Data Patterns
Data patterns can be mined from many different kinds of databases, such as relational databases, data warehouses, and transactional, object-relational, and object-oriented databases. Interesting data patterns can also be extracted from other kinds of information repositories, including spatial, time-related, text, multimedia, and legacy databases, and the World Wide Web (WWW).

A Data Warehouse
A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making. The data are stored under a unified schema and are typically summarized. Data warehouse systems provide some data analysis capabilities, collectively referred to as OLAP (On-Line Analytical Processing). Suppose that All Electronics is a successful international company with branches around the world, and each branch has its own set of databases. The president of All Electronics has asked you to provide an analysis of the company's sales per item type per branch for the third quarter. This is a difficult task, particularly since the relevant data are spread out over several databases physically located at numerous sites. If All Electronics had a data warehouse, this task would be easy. A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. Figure 1.4.2 shows the typical framework for construction and use of a data warehouse for All Electronics.

Figure 1.4.2 Typical Framework Of A Data Warehouse.

Data Mining Functionalities
Data mining functionalities include the discovery of concept/class descriptions, association, classification, prediction, clustering, trend analysis, deviation analysis, and similarity analysis. Characterization and discrimination are forms of data summarization. We have observed various types of databases and information repositories on which data mining can be performed. Let us now examine the kinds of data patterns that can be mined. Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions. In some cases, users may have no idea what kinds of patterns in their data may be interesting, and hence may like to search for several different kinds of patterns in parallel. Thus it is important to have a data mining system that can mine multiple kinds of patterns to accommodate different user expectations or applications. Furthermore, data mining systems should be able to discover patterns at various granularities (i.e., different levels of abstraction). Data mining systems should also allow users to specify hints to guide or focus the search for interesting patterns. Because some patterns may not hold for all of the data in the database, a measure of certainty or trustworthiness is usually associated with each discovered pattern. Data mining functionalities, and the kinds of patterns they can discover, are described below.

Data Mining Systems
Data mining systems can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications adapted. Data mining can be classified into descriptive data mining and predictive data mining. Concept description is the most basic form of descriptive data mining; it describes a given set of task-relevant data in a concise and summarative manner, presenting the general properties of the data. Efficient and effective data mining in large databases poses numerous requirements and great challenges to researchers and developers. The issues involved include data mining methodology, user interaction, performance and scalability, and the processing of a large variety of data types. Other issues include the exploration of data mining applications and their social impacts.

Classification of Data Mining Systems


Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization, and information science (Figure 1.4.3). Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology. Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems, which may help potential users distinguish between such systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, such as the kinds of databases mined, the kinds of knowledge mined, the techniques used, and the applications adapted.

Figure 1.4.3 Data Mining As A Confluence Of Multiple Disciplines.

1.5 Mining Frequent Itemsets


1. What is a frequent pattern? A pattern (a set of items, a sequence, etc.) that occurs frequently in a database.
2. Frequent pattern: an important form of regularity, e.g. what products were often purchased together?
3. Frequent pattern mining: the foundation for several essential data mining tasks, such as association, correlation, causality, and sequential pattern mining.
4. Applications: market data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, Web log sequence analysis, DNA analysis, etc.
5. Two key elements in frequent pattern mining:
Candidate pattern set: the smaller, the better; it must include all frequent patterns.
Data set: the more compact, the better; it must include all information related to frequent pattern mining.

Figure 1.5.1 Frequent Pattern Mining

6. What Should a Good Frequent Pattern Mining Algorithm Have?
Good candidate pattern generation method:

o Generate as few candidate patterns as possible.
o The best method is the Apriori algorithm (AS 1994).
Good database processing method:
o Sorting, aggregation, and classification of the data set according to the intrinsic principles of frequent pattern mining.
o Some unsuccessful methods: sampling (Toivonen 1996), partition (SON 1995).
o Successful methods: tree projection (AAP 2000) and FP-growth (HPY 2000).

7. A Famous Frequent Pattern Mining Method: the Apriori Algorithm
Major idea:
o A subset of a frequent itemset must be frequent.
o A powerful candidate set pruning technique: it reduces candidate itemsets dramatically.
Core of the Apriori algorithm:
o Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets.
o Use database scans and pattern matching to collect counts for the candidate itemsets.

1.6 Methodology
Apriori Algorithm in Value-Added Service
The Apriori algorithm discovers the frequent itemsets in a very large database through a series of iterations. In each iteration the Apriori algorithm must generate candidate itemsets, compute their support, and prune the candidate itemsets down to the frequent itemsets. The FP-growth algorithm discovers frequent itemsets without the time-consuming candidate generation process that is critical for the Apriori algorithm. Although FP-growth generally outperforms Apriori in most cases, several refinements of the Apriori algorithm have been made to speed up frequent itemset mining.

This algorithm computes frequent itemsets from a transactions database over multiple iterations. Each iteration involves candidate generation followed by candidate counting and selection. Utilizing the knowledge about infrequent itemsets obtained from previous iterations, the algorithm prunes a priori those candidate itemsets that cannot become frequent; after discarding every candidate itemset that has an infrequent subset, the algorithm enters the candidate counting step. Value-added services in Mobile Telecom include SMS, surf on-line, CRBT, MMS, Java, IVR and so on. In the following tables, the Apriori algorithm is applied to discover association relationships among different value-added services in Mobile Telecom in China. The table depicts the value-added service items in four transactions. For a minimum support of 50% and a minimum confidence of 50%, we have the following rules: (1) SMS => CRBT with 50% support and 66% confidence; (2) CRBT => SMS with 50% support and 100% confidence.

Table 1.6.1 Application in Telecom of Association Rule Mining

The objective is to generate confident rules having at least the minimum confidence. The problem decomposition proceeds as follows:


1. Find all sets of items that have minimum support, typically using the Apriori algorithm. This is the most expensive phase of the search, and much research has gone into reducing its complexity; given m items there can potentially be 2^m frequent itemsets.
2. Use the frequent itemsets to generate the desired rules.

Consider Table 1.6.1. For the rule SMS => CRBT, we have Support = Support({SMS, CRBT}) = 50% and Confidence = Support({SMS, CRBT}) / Support({SMS}) = 66%.

Table 1.6.2 Computation of Frequent Item Sets

The Apriori algorithm is outlined as follows. Let Fk be the set of frequent itemsets of size k, let Ck be the set of candidate itemsets of size k, and let F1 be the set of large (frequent) single items. We start from k = 1.
(1) While the set of frequent itemsets Fk is non-empty, repeat steps 2-4.
(2) Generate new candidates Ck+1 from Fk.
(3) For each transaction T in the database, increment the count of all candidates in Ck+1 that are contained in T.
(4) Generate the frequent itemsets Fk+1 of size k+1 from the candidates in Ck+1 that have minimum support.
A key observation is that every subset of a frequent itemset is also frequent. This implies that a candidate itemset in Ck+1 can be pruned if even one of its subsets is not contained in Fk. A minimal code sketch of this loop is given below.
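The four steps above can be sketched as a single loop. This is a simplified, illustrative Java version only (the class and helper names are ours, and it reuses the CandidateGeneration.aprioriGen sketch from Section 1.3); it is not the parallel trie-based implementation discussed in this report.

import java.util.*;

// Simplified Apriori loop: F1 -> C2 -> F2 -> C3 -> ... until no candidates survive.
public class AprioriSketch {

    public static Map<List<String>, Integer> run(List<Set<String>> transactions,
                                                 double minSupport) {
        int minCount = (int) Math.ceil(minSupport * transactions.size());
        Map<List<String>, Integer> allFrequent = new LinkedHashMap<>();

        // Step 0: frequent 1-itemsets (F1).
        Set<List<String>> fk = frequentOneItemsets(transactions, minCount, allFrequent);

        while (!fk.isEmpty()) {
            // (2) Generate candidates C(k+1) from F(k) (join + prune, as sketched earlier).
            Set<List<String>> candidates = CandidateGeneration.aprioriGen(fk);

            // (3) Count candidate occurrences with one scan of the database.
            Map<List<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : transactions) {
                for (List<String> c : candidates) {
                    if (t.containsAll(c)) {
                        counts.merge(c, 1, Integer::sum);
                    }
                }
            }

            // (4) Keep only candidates that reach the minimum support.
            Set<List<String>> next = new HashSet<>();
            for (Map.Entry<List<String>, Integer> e : counts.entrySet()) {
                if (e.getValue() >= minCount) {
                    next.add(e.getKey());
                    allFrequent.put(e.getKey(), e.getValue());
                }
            }
            fk = next;   // (1) repeat while F(k) is non-empty
        }
        return allFrequent;
    }

    private static Set<List<String>> frequentOneItemsets(List<Set<String>> transactions,
                                                         int minCount,
                                                         Map<List<String>, Integer> out) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        Set<List<String>> f1 = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                List<String> single = Collections.singletonList(e.getKey());
                f1.add(single);
                out.put(single, e.getValue());
            }
        }
        return f1;
    }
}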


Table 1.6.3 Transactions Database for Frequent Item Set Generation

The Apriori algorithm is explained with the example database of transactions provided in Table 1.6.3. Consider this table with a minimum support > 50%. After the first scan of the database, we have the candidate itemsets C1 along with their corresponding supports: {SMS}: 50%, {surf on-line}: 75%, {CRBT}: 75%, {MMS}: 25%, and {Java}: 75%. The frequent itemsets F1 consist of {surf on-line}, {CRBT}, and {Java}, each with a support of 75%. Now the candidate itemsets C2 are {surf on-line, CRBT}, {surf on-line, Java}, and {CRBT, Java}, with supports of 50%, 75%, and 50%, respectively. The corresponding frequent itemset F2 becomes {surf on-line, Java}, with a support of 75%. The rules generated are

Surf on-line => Java with Confidence = support({surf on-line, Java}) / support({surf on-line}) = 100%, and Java => Surf on-line with Confidence = support({surf on-line, Java}) / support({Java}) = 100%.

Table 1.6.4 Stages of the Apriori Algorithm Demonstrating Frequent Item Set Generation

However, in this method, multiple passes have to be made over the database for each different value of minimum support and confidence, and the number of passes can be as large as the longest frequent itemset. For very large databases of transactions, this may involve considerable input-output (I/O) and lead to unacceptable response times for online queries. Moreover, the potential number of frequent itemsets is exponential in the number of different items, although the actual number of frequent itemsets can be much smaller. From the tables we can see that surf on-line and Java have a strong association relationship among the different value-added services in Mobile Telecom in China. Therefore telecom enterprises can bundle these two kinds of services and provide a more favorable package service.

Applications
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.: mining for associations among items in a large database of sales transactions is an important database mining function. For example, the information that a customer who purchases a keyboard also tends to buy a mouse at the same time is represented in the association rule below:

Keyboard => Mouse [support = 6%, confidence = 70%]

Based on the types of values, association rules can be classified into two categories: Boolean association rules and quantitative association rules.
Boolean association rule: Keyboard => Mouse [support = 6%, confidence = 70%]
Quantitative association rule: (Age = 26...30) => (Cars = 1, 2) [support = 3%, confidence = 36%]

Minimum Support Threshold: the support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true.
Minimum Confidence Threshold: confidence is defined as the measure of certainty or trustworthiness associated with each discovered pattern.

Association Rule Mining Process
1. Find all frequent itemsets. The support S of each of these frequent itemsets must be at least equal to a pre-determined min_sup (an itemset is a subset of the items in I, such as A).
2. Generate strong association rules from the frequent itemsets. These rules must come from the frequent itemsets and must satisfy min_sup and min_conf. A sketch of this rule-generation step is given below.
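To make the rule-generation step concrete, here is a rough Java sketch (the class, method, and parameter names are ours, for illustration only): for every frequent itemset and every non-empty proper subset A, the rule A => (itemset minus A) is kept if its confidence reaches the minimum confidence. It assumes the support counts of all frequent itemsets are available, with itemsets stored as sorted lists.

import java.util.*;

// Sketch: generate strong association rules A => B from frequent itemsets.
// supports maps each frequent itemset (sorted list of items) to its support count.
public class RuleGeneration {

    public static List<String> generateRules(Map<List<String>, Integer> supports,
                                             int totalTransactions,
                                             double minConfidence) {
        List<String> rules = new ArrayList<>();
        for (Map.Entry<List<String>, Integer> e : supports.entrySet()) {
            List<String> itemset = e.getKey();
            if (itemset.size() < 2) continue;          // rules need both sides non-empty
            int itemsetCount = e.getValue();
            for (List<String> antecedent : properSubsets(itemset)) {
                Integer antecedentCount = supports.get(antecedent);
                if (antecedentCount == null) continue; // every subset of a frequent itemset is frequent
                double confidence = (double) itemsetCount / antecedentCount;
                if (confidence >= minConfidence) {
                    List<String> consequent = new ArrayList<>(itemset);
                    consequent.removeAll(antecedent);
                    double support = (double) itemsetCount / totalTransactions;
                    rules.add(antecedent + " => " + consequent
                            + " [support=" + support + ", confidence=" + confidence + "]");
                }
            }
        }
        return rules;
    }

    // All non-empty proper subsets of a small itemset (fine for short itemsets).
    private static List<List<String>> properSubsets(List<String> itemset) {
        List<List<String>> subsets = new ArrayList<>();
        int n = itemset.size();
        for (int mask = 1; mask < (1 << n) - 1; mask++) {
            List<String> s = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) s.add(itemset.get(i));
            }
            subsets.add(s);
        }
        return subsets;
    }
}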

2. OVERVIEW OF THE PROBLEM


2.1 Literature Survey
Yanbin Ye and Chia-Chu Chiang note that finding frequent itemsets is one of the most investigated fields of data mining. The Apriori algorithm is the most established algorithm for frequent itemset mining (FIM). Several implementations of the Apriori algorithm have been reported and evaluated. One of these implementations, which optimizes the data structure with a trie, is by Bodon and catches the attention. The results of Bodon's implementation for finding frequent itemsets appear to be faster than the ones by Borgelt and Goethals. They therefore revised Bodon's implementation into a parallel one where the input transactions are read by a parallel computer, and the effect of a parallel computer on this modified implementation is presented.

F. Bodon states that the efficiency of frequent itemset mining algorithms is determined mainly by three factors: the way candidates are generated, the data structure that is used, and the implementation details. Most papers focus on the first factor, some describe the underlying data structures, but implementation details are almost always neglected. His paper shows that the effect of implementation can be more important than the selection of the algorithm: ideas that seem quite promising may turn out to be ineffective when we descend to the implementation level. He analyzes, theoretically and experimentally, Apriori, which is the most established algorithm for frequent itemset mining. Several implementations of the algorithm have been put forward in the last decade; although they are implementations of the very same algorithm, they display large differences in running time and memory need. The paper describes an implementation of Apriori that outperforms all implementations known to the author, and analyzes, theoretically and experimentally, the principal data structure of the solution, which is the main factor in the efficiency of the implementation. Moreover, it presents a simple modification of Apriori that appears to be faster than the original algorithm.

3. SYSTEM STUDY
Finding frequent itemsets in transaction databases has been demonstrated to be useful in several business applications. Many algorithms have been proposed to find frequent itemsets in very large databases. However, there is no published implementation that outperforms every other implementation on every database at every support threshold. In general, most implementations are based on two main algorithms: Apriori and frequent pattern growth (FP-growth). The Apriori algorithm discovers the frequent itemsets in a very large database through a series of iterations, and in each iteration it must generate candidate itemsets, compute their support, and prune the candidate itemsets down to the frequent itemsets. The FP-growth algorithm discovers frequent itemsets without the time-consuming candidate generation process that is critical for the Apriori algorithm. Although FP-growth generally outperforms Apriori in most cases, several refinements of the Apriori algorithm have been made to speed up frequent itemset mining.

3.1 Existing System


Finding frequent itemsets is one of the most investigated fields of data mining. Many algorithms have been proposed to find frequent itemsets in a very large database. The transaction mapping algorithm is also an algorithm for frequent itemset mining, and it is not a time-consuming one. However, there is no implementation that performs well on every database with every support threshold. Earlier, decision tree and FP (Frequent Pattern) algorithms were the established algorithms for frequent itemset mining, but both of these algorithms fail to satisfy the two constraints of time and accuracy: although the time taken is low, the result is not at all accurate.

3.2 Problem Definition


Finding frequent itemsets is one of the most investigated fields of data mining. The Apriori algorithm is the most established algorithm for frequent itemset mining (FIM). Several implementations of the Apriori algorithm have been reported and evaluated. One of these implementations, which optimizes the data structure with a trie, is by Bodon and catches our attention. The results of Bodon's implementation for finding frequent itemsets appear to be faster than the ones by Borgelt and Goethals. In this project, Bodon's implementation is revised into a parallel one where the input transactions are read by a parallel computer, and the effect of a parallel computer on this modified implementation is presented.

3.3 Proposed System


The Apriori algorithm is the most established algorithm for frequent itemset mining (FIM), and it outperforms all implementations known to us. Several implementations of the Apriori algorithm have been reported and evaluated. The Apriori algorithm finds frequent itemsets using association rule mining. One of these implementations, which optimizes the data structure with a trie, is by Bodon and catches the attention. The results of Bodon's implementation for finding frequent itemsets appear to be faster than the ones by Borgelt and Goethals. Bodon's implementation was chosen because its trie data structure outperforms the other implementations that use a hash tree. In this work, Bodon's implementation is turned into a parallel one in which the input transactions are read by a parallel computer, and the effect of a parallel computer on this modified implementation is presented. The proposed system satisfies both the time and accuracy constraints.

4. SYSTEM DESCRIPTION
4.1 Modules
a) Input Data Module
b) Calculation for Classification based on Multiple Association Rules (CMAR)
c) Calculation for Apriori Algorithm to find Frequent Item Sets

d) GUI Designing.

a) Input Data Module
Here we create the itemsets. We take the transaction list from .dat files. In this module we read an input text file, convert each line into string tokens, and convert the tokens into sets; a hedged sketch of this step follows the example below.

Example:
sms, mms, Surf on-line, crbt
sms, Surf on-line, Java, Ivr, crbt
Surf on-line, sms, mms
Java, Ivr, Surf on-line
Java, Sms, Mms
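As a rough sketch of this module (the class name, file name, and normalization choices are ours, for illustration; the project's actual classes are listed later in this section), each line of the .dat file can be split on commas and collected into a set of items:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: read a .dat transaction file, one comma-separated transaction per line,
// and convert each line into a set of item tokens.
public class InputDataModule {
    public static List<Set<String>> readTransactions(String path) throws IOException {
        List<Set<String>> transactions = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                if (line.endsWith(".")) line = line.substring(0, line.length() - 1);
                Set<String> itemset = new LinkedHashSet<>();
                for (String token : line.split(",")) {        // tokenize on commas
                    String item = token.trim().toLowerCase(); // normalize case, e.g. "Sms" -> "sms"
                    if (!item.isEmpty()) itemset.add(item);
                }
                transactions.add(itemset);
            }
        }
        return transactions;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name, for illustration only.
        List<Set<String>> transactions = readTransactions("transactions.dat");
        System.out.println(transactions);
    }
}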

b) Calculation for Classification based on Multiple Association Rules (CMAR)
CMAR is Classification based on Multiple Association Rules. All our calculations are based upon the association rules. The module provides:
i. data input and input error checking,
ii. data preprocessing,
iii. manipulation of records (e.g. operations such as subset, member, union, etc.), and
iv. data and parameter output.

The CMAR algorithm (as described in Li et al. 2001) uses an FP-growth algorithm (Han et al. 2000) to produce a set of CARs, which are stored in a data structure referred to as a CR tree. CARs are inserted into the CR tree if:
1. The chi-squared value is above a user-specified critical threshold (Li et al. suggest a 5% significance level; assuming one degree of freedom, this equates to a threshold of 3.8415), and
2. The CR tree does not contain a more general rule with a higher priority. Given two CARs, R1 and R2, R1 is said to be a more general rule than R2 if the antecedent of R1 is a subset of the antecedent of R2.

In CMAR, CARs are prioritised using the following ordering:
Confidence: a rule r1 has priority over a rule r2 if confidence(r1) > confidence(r2).
Support: a rule r1 has priority over a rule r2 if confidence(r1) == confidence(r2) && support(r1) > support(r2).
Size of antecedent: a rule r1 has priority over a rule r2 if confidence(r1) == confidence(r2) && support(r1) == support(r2) && |Ar1| < |Ar2|.

Once a complete CR tree has been produced, the generated rules are placed into a list R, ordered according to the above prioritisation, which is then pruned. The pruning algorithm (using the cover principle) is presented below.

T = set of training records
C = array of length |T| with the value of every element set to 0
R = the prioritised rule list
R' = empty rule list
For each r in R
    if (T = null) break
    coverFlag <-- false
    Loop over the records ti in T (1 <= i <= |T|)
        if (r.antecedent is a subset of ti)      // rule satisfies record
            C[i] <-- C[i] + 1
            coverFlag <-- true
    End loop
    if (coverFlag = true)                        // rule satisfies at least one record
        R' <-- R' union {r}
    Loop from i <-- 1 to i <-- |C| in steps of 1
        if (C[i] > MIN_COVER) remove ti from T
    End loop
R' is then the resulting classifier.

In their experiments, Li et al. used a MIN_COVER value of 3, i.e. each record had to be satisfied (covered) by at least three rules before it is no longer considered in the generation process and removed from the training set. To test the resulting classifier, Li et al. propose the following process. Given a record r in the test set:

1. Collect all rules that satisfy r.
2. If the consequents of all the rules are identical, or there is only one rule, classify the record according to the consequent.
3. Otherwise, group the rules according to their classifier and determine the combined effect of the rules in each group; the classifier associated with the "strongest" group is then selected.

The strength of a group is calculated using a Weighted Chi-Squared (WCS) measure. This is done by first defining a Maximum Chi-Squared (MCS) value for each rule A -> c:

MCS = (min(sup(A), sup(c)) - sup(A) * sup(c) / N)^2 * N * e

where sup(A) is the support for the antecedent, sup(c) is the support for the consequent, N is the number of records in the test set, and e is calculated as follows:

e = 1/(sup(A) * sup(c)) + 1/(sup(A) * (N - sup(c))) + 1/((N - sup(A)) * sup(c)) + 1/((N - sup(A)) * (N - sup(c)))

For each group of rules, the Weighted Chi-Squared value is defined as:

WCS = the sum over the rules in the group of (Chi-Squared * Chi-Squared) / MCS

A small code sketch of this computation follows.
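As a rough illustration of the weighting described above (a sketch under the formulas given; the class and variable names are ours, and in a real run the support values and chi-squared values would come from the rule and record counts):

// Sketch: Maximum Chi-Squared (MCS) and Weighted Chi-Squared (WCS) as described above.
public class WeightedChiSquared {

    // MCS for a rule A -> c, given the antecedent support, the consequent support
    // and the number of records N.
    public static double maxChiSquared(double supA, double supC, double n) {
        double e = 1.0 / (supA * supC)
                 + 1.0 / (supA * (n - supC))
                 + 1.0 / ((n - supA) * supC)
                 + 1.0 / ((n - supA) * (n - supC));
        double diff = Math.min(supA, supC) - (supA * supC) / n;
        return diff * diff * n * e;
    }

    // WCS for a group of rules: sum of chiSquared^2 / MCS over the rules in the group.
    public static double weightedChiSquared(double[] chiSquared, double[] mcs) {
        double wcs = 0.0;
        for (int i = 0; i < chiSquared.length; i++) {
            wcs += (chiSquared[i] * chiSquared[i]) / mcs[i];
        }
        return wcs;
    }
}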

AssocRuleMining.java: a set of general ARM utility methods to allow (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union, etc.), and (iv) data and parameter output.
PartialSupportTree.java: methods to implement the "Apriori-TFP" algorithm using both the "partial support" and "total support" tree data structures (P-tree and T-tree).
PtreeNode.java: methods concerned with the structure of P-tree nodes.
PtreeNodeTop.java: methods concerned with the structure of the top level of the P-tree, which comprises an array of "top P-tree nodes" to allow direct indexing for reasons of efficiency.
RuleList.java: a set of methods that allow the creation and manipulation (e.g. ordering) of a list of ARs.
TotalSupportTree.java: methods to implement the "Apriori-T" algorithm using the "total support" tree data structure (T-tree).
TtreeNode.java: methods concerned with the structure of T-tree nodes.
AprioriTFPclass.java: parent class for the classification rule generator.
AprioriTFP_CMAR.java: the Apriori-TFPC CMAR algorithm.
ClassCMAR_App.java: fundamental LUCS-KDD CMAR application using a 50:50 training/test set split.

The P-tree Node, P-tree Node Top and T-tree Node classes are separate from the remaining classes, which are arranged in a class hierarchy of the form illustrated below.

c) Calculation for Apriori Algorithm to find Frequent Item Sets
Here we find the associativity of the itemsets: we take the support and the confidence of each itemset and work out the probability of the itemsets, as in the sketch below.
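A minimal sketch of these two measures, implementing the support and confidence formulas defined below (the class and method names are ours; transactions are assumed to be sets of item strings, as in the input-module sketch above):

import java.util.List;
import java.util.Set;

// Sketch: support and confidence of a rule A => B over a list of transactions.
public class SupportConfidence {

    // Support = (# of transactions containing both A and B) / (total # of transactions).
    public static double support(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        int both = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(a) && t.containsAll(b)) both++;
        }
        return (double) both / transactions.size();
    }

    // Confidence = (# of transactions containing both A and B) / (# of transactions containing A).
    public static double confidence(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        int both = 0;
        int withA = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(a)) {
                withA++;
                if (t.containsAll(b)) both++;
            }
        }
        return withA == 0 ? 0.0 : (double) both / withA;
    }
}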

Support / Confidence
Support shows the frequency of the patterns in the rule; it is the percentage of transactions that contain both A and B, i.e.
Support = Probability(A and B)
Support = (# of transactions involving A and B) / (total number of transactions).
Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B if they contain A, i.e.
Confidence = Probability(B if A) = P(B|A)
Confidence = (# of transactions involving A and B) / (total number of transactions that have A).

Example:
Customer    Item purchased    Item purchased
1           pizza             beer
2           salad             soda
3           pizza             soda
4           salad             tea

Table 4.1.1 Data Set Table

If A is "purchased pizza" and B is "purchased soda", then Support = P(A and B) = 25% and Confidence = P(B|A) = 50%. Confidence does not measure whether the association between A and B is random or not. For example, if milk occurs in 30% of all baskets, the information that milk occurs in 30% of all baskets with bread is useless. But if milk is present in 50% of all baskets that contain coffee, that is significant information. Support allows us to weed out most infrequent combinations, but sometimes we should not ignore them, for example if the transaction is valuable and generates a large revenue, or if the products repel each other. For example, we measure the following: P(Coke in a basket) = 50%, P(Pepsi in a basket) = 50%, P(Coke and Pepsi in a basket) = 0.001%. What does this mean? If Coke and Pepsi were independent, we would expect P(Coke and Pepsi in a basket) = 0.5 * 0.5 = 0.25. The fact that the joint probability is much smaller says that the products are dependent and that they repel each other.

d) GUI Designing
The GUI design is done using Swing.

What Is Swing?


Swing is a toolkit in Java. It is part of Sun Microsystems' JFC, the Java Foundation Classes, an API for providing a GUI for Java programs. Swing includes GUI widgets such as text boxes, buttons, split panes, and tables. Swing components are called lightweight components, and you can develop your own look and feel using Swing. If you poke around the Java home page (http://java.sun.com/), you'll find Swing advertised as a set of customizable graphical components whose look-and-feel can be dictated at runtime. In reality, however, Swing is much more than this. Swing is the next-generation GUI toolkit that Sun Microsystems developed to enable enterprise development in Java. By enterprise development, we mean that programmers can use Swing to create large-scale Java applications with a wide array of powerful components. In addition, you can easily extend or modify these components to control their appearance and behavior. Swing is not an acronym. The name represents the collaborative choice of its designers when the project was kicked off in late 1996. Swing is actually part of a larger family of Java products known as the Java Foundation Classes (JFC), which incorporate many of the features of Netscape's Internet Foundation Classes (IFC), as well as design aspects from IBM's Taligent division and Lighthouse Design. Swing has been in active development since the beta period of the Java Development Kit (JDK) 1.1, circa spring of 1997. The Swing APIs entered beta in the latter half of 1997 and their initial release was in March of 1998. When released, the Swing 1.0 libraries contained nearly 250 classes and 80 interfaces. Although Swing was developed separately from the core Java Development Kit, it does require at least JDK 1.1.5 to run. Swing builds on the event model introduced in the 1.1 series of JDKs; you cannot use the Swing libraries with the older JDK 1.0.2. In addition, you must have a Java 1.1-enabled browser to support Swing applets.
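As a minimal, illustrative example of the Swing widgets mentioned above (this is not taken from the project's GUI code; the window title, label text, and button action are ours):

import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.SwingUtilities;

// Minimal Swing window with a label and a button.
public class SwingHello {
    public static void main(String[] args) {
        // Swing components should be created on the Event Dispatch Thread.
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Frequent Itemset Miner");  // hypothetical title
            JLabel label = new JLabel("Load a .dat transaction file to begin");
            JButton button = new JButton("Run Apriori");
            button.addActionListener(e -> label.setText("Mining started..."));

            frame.getContentPane().add(label, java.awt.BorderLayout.CENTER);
            frame.getContentPane().add(button, java.awt.BorderLayout.SOUTH);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setSize(360, 120);
            frame.setVisible(true);
        });
    }
}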

Is Swing a Replacement for AWT?


No. Swing is actually built on top of the core 1.1 and 1.2 AWT libraries. Because Swing does not contain any platform-specific (native) code, you can deploy the Swing distribution on any platform that implements the Java 1.1.5 virtual machine or above. In fact, if you have JDK 1.2 on your platform, then the Swing classes will already be available and there's nothing further to download. If you do not have JDK 1.2, you can download the entire set of Swing libraries as a set of Java Archive (JAR) files from the Swing home page. In either case, it is generally a good idea to visit this URL for any extra packages or look-and-feels that may be distributed separately from the core Swing libraries. Figure 4.1.1 shows the relationship between Swing, AWT, and the Java Development Kit in both the 1.1 and 1.2 JDKs. In JDK 1.1, the Swing classes must be downloaded separately and included as an archive file on the classpath (swingall.jar); JDK 1.2 comes with a Swing distribution, although the relationship between Swing and the rest of the JDK has shifted during the beta process. Nevertheless, if you have installed JDK 1.2, you should have Swing. The standalone Swing distributions contain several other JAR files. swingall.jar is everything (except the contents of multi.jar) wrapped into one lump, and is all you normally need to know about. For completeness, the other JAR files are: swing.jar, which contains everything but the individual look-and-feel packages; motif.jar, which contains the Motif (Unix) look-and-feel; windows.jar, which contains the Windows look-and-feel; multi.jar, which contains a special look-and-feel that allows additional (often non-visual) L&Fs to be used in conjunction with the primary L&F; and beaninfo.jar, which contains special classes used by GUI development tools.

Figure 4.1.1 Relationships between Swing, AWT, and the JDK in the 1.1 and 1.2 JDKs

Swing contains nearly twice the number of graphical components as its immediate predecessor, AWT 1.1. Many are components that have been scribbled on programmer wish-lists since Java first debuted, including tables, trees, internal frames, and a plethora of advanced text components. In addition, Swing contains many design advances over AWT. For example, Swing introduces a new Action class that makes it easier to coordinate GUI components with the functionality they perform. You'll also find that a much cleaner design prevails throughout Swing; this cuts down on the number of unexpected surprises that you're likely to face while coding.

Swing depends extensively on the event handling mechanism of AWT 1.1, although it does not define a comparatively large number of events for itself. Each Swing component also contains a variable number of exportable properties. This combination of properties and events in the design was no accident. Each of the Swing components, like the AWT 1.1 components before them, adheres to the popular JavaBeans specification. As you might have guessed, this means that you can import all of the Swing components into various GUI-builder tools, which is useful for powerful visual programming.

Swing Features
Swing provides many new features for those planning to write large-scale applications in Java. Here is an overview of some of the more popular features.

Pluggable Look-and-Feels
One of the most exciting aspects of the Swing classes is the ability to dictate the look-and-feel (L&F) of each of the components, even resetting the look-and-feel at runtime. Look-and-feels have become an important issue in GUI development over the past five years. Most users are familiar with the Motif style of user interface, which was common in Windows 3.1 and is still in wide use on Unix platforms. Microsoft has since deviated from that standard with a much more optimized look-and-feel in their Windows 95/98 and NT 4.0 operating systems. In addition, the Macintosh computer system has its own branded look-and-feel, which most Apple users feel comfortable with. Swing is capable of emulating several look-and-feels and currently includes support for Windows 98 and Unix Motif. This comes in handy when a user would like to work in the L&F environment with which he or she is most comfortable. In addition, Swing allows the user to switch look-and-feels at runtime without having to close the current application. This way, a user can experiment to see which L&F is best for them, with instantaneous feedback. And, if you're feeling really ambitious as a developer (perhaps a game developer), you can create your own look-and-feel for each one of the Swing components! Swing comes with a default look-and-feel called "Metal," which was developed while the Swing classes were in the beta-release phase. This look-and-feel combines some of the best graphical elements in today's L&Fs and even adds a few surprises of its own. Figure 4.1.2 shows an example of several look-and-feels that you can use with Swing, including the new Metal look-and-feel. All Swing L&Fs are built from a set of base classes called the Basic L&F. However, though we may refer to the Basic L&F from time to time, you can't use it on its own.

Figure 4.1.2 Various look-and-feels in the Java Swing environment
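As a small, hedged sketch of the runtime look-and-feel switching described above (the class name and demo frame are ours; real code would pass the application's own top-level window):

import javax.swing.JFrame;
import javax.swing.SwingUtilities;
import javax.swing.UIManager;
import javax.swing.UnsupportedLookAndFeelException;

// Sketch: switch the Swing look-and-feel at runtime and refresh an existing window.
public class LookAndFeelSwitcher {
    public static void switchLookAndFeel(String lafClassName, JFrame frame) {
        try {
            UIManager.setLookAndFeel(lafClassName);
            // Re-style the components that are already on screen.
            SwingUtilities.updateComponentTreeUI(frame);
        } catch (ClassNotFoundException | InstantiationException
                | IllegalAccessException | UnsupportedLookAndFeelException e) {
            e.printStackTrace();  // fall back to the current look-and-feel
        }
    }

    public static void main(String[] args) {
        JFrame frame = new JFrame("Look-and-feel demo");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setSize(300, 100);
        frame.setVisible(true);
        // The cross-platform ("Metal") look-and-feel is always available.
        switchLookAndFeel(UIManager.getCrossPlatformLookAndFeelClassName(), frame);
    }
}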

Lightweight Components
Most Swing components are lightweight. In the purest sense, this means that components are not dependent on native peers to render themselves. Instead, they use simplified graphics primitives to paint themselves on the screen and can even allow portions to be transparent. The ability to create lightweight components first emerged in JDK 1.1, although the majority of AWT components did not take advantage of it. Prior to that, Java programmers had no choice but to subclass java.awt.Canvas or java.awt.Panel if they wished to create their own components. With both classes, Java allocated an opaque peer object from the underlying operating system to represent the component, forcing each component to behave as if it were its own window, taking on a rectangular, solid shape. Hence, these components earned the name "heavyweight," because they frequently held extra baggage at the native level that Java did not use. Heavyweight components were unwieldy for two reasons:
Equivalent components on different platforms don't necessarily act alike. A list component on one platform, for example, may work differently than a list component on another. Trying to coordinate and manage the differences between components was a formidable task.
The look-and-feel of each component was tied to the host operating system and could not be changed.

Additional Features
Several other features distinguish Swing from the older AWT components:
A wide variety of new components, such as tables, trees, sliders, progress bars, internal frames, and text components.
Swing components contain support for replacing their insets with an arbitrary number of concentric borders.
Swing components can have tooltips placed over them. A tooltip is a textual popup that momentarily appears when the mouse cursor rests inside the component's painting region. Tooltips can be used to give more information about the component in question.
You can arbitrarily bind keyboard events to components, defining how they will react to various keystrokes under given conditions.
There is additional debugging support for the rendering of your own lightweight Swing components.
We will discuss each of these features in greater detail as we move on.

Advantages of Swing
Swing widgets provide more sophisticated GUI components.
Swing components are not implemented by platform-specific code; they are written purely in Java and are therefore platform independent. The advantage is uniform behavior on all platforms.
Swing supports a pluggable look and feel. This means you can get any supported look and feel on any platform, or use the current platform's graphics interface to achieve consistency through modifications made by additional API calls. In other words, the application can use any supported look and feel on any platform.

Disadvantage of Swing
The disadvantage of lightweight components is slower execution.

Flow Diagram
CMAR=Calculation for Classification based on Multiple Association Rules


(Diagram blocks: Input (num files), Processing (CMAR, FI set), Utilities, Outputs (number of CMAR rules, number of frequent item sets))

Figure 4.1.3 Dataflow Diagram

4.2 Models for Data Mining


In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements. One such model, CRISP (Cross-Industry Standard Process for data mining), was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. This general approach postulates a (perhaps not particularly controversial) general sequence of steps for data mining projects.


Another approach, the Six Sigma methodology, is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control) that grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries). Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute called SEMMA (Sample, Explore, Modify, Model, Assess), which focuses more on the technical activities typically involved in a data mining project. All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stakeholders, and how to disseminate the information in a form that can easily be converted by stakeholders into resources for strategic decision making. Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks.


The general underlying philosophy of StatSoft's STATISTICA Data Miner is to provide a flexible data mining workbench that can be integrated into any organization, industry, or organizational culture, regardless of the general data mining process model that the organization chooses to adopt. For example, STATISTICA Data Miner can include the complete set of (specific) necessary tools for ongoing company-wide Six Sigma quality control efforts, and users can take advantage of its (still optional) DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either the CRISP or SEMMA approach; it fits both of them perfectly well without favoring either one. STATISTICA Data Miner also offers all the advantages of a general data mining oriented "development kit" that includes easy-to-use tools for incorporating into your projects not only components such as custom database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems of access privileges, workgroup management, and other collaborative work tools that allow you to design large-scale, enterprise-wide systems (e.g., following the CRISP, SEMMA, or a combination of both models) that involve your entire organization.

Predictive Data Mining
The term predictive data mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining to derive a (trained) model or set of models (e.g., neural networks, a meta-learner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to identify clusters or segments of customers), in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks).

4.3. Java Programming Language


The Java language was created by James Gosling in June 1991 for use in a set-top box project. The language was initially called Oak, after an oak tree that stood outside Gosling's office (it also went by the name Green), and ended up later being renamed Java, from a list of random words. Gosling's goals were to implement a virtual machine and a language that had a familiar C/C++ style of notation.[6] The first public implementation was Java 1.0 in 1995. It promised "Write Once, Run Anywhere" (WORA), providing no-cost runtimes on popular platforms. It was fairly secure and its security was configurable, allowing network and file access to be restricted. Major web browsers soon incorporated the ability to run secure Java applets within web pages, and Java quickly became popular. With the advent of Java 2, new versions had multiple configurations built for different types of platforms. For example, J2EE was for enterprise applications and the greatly stripped-down version J2ME was for mobile applications, while J2SE was the designation for the Standard Edition. In 2006, for marketing purposes, new J2 versions were renamed Java EE, Java ME, and Java SE, respectively. In 1997, Sun Microsystems approached the ISO/IEC JTC1 standards body and later Ecma International to formalize Java, but it soon withdrew from the process. Java remains a de facto standard that is controlled through the Java Community Process. At one time, Sun made most of its Java implementations available without charge, although they were proprietary software. Sun's revenue from Java was generated by the selling of licenses for specialized products such as the Java Enterprise System. Sun distinguishes between its Software Development Kit (SDK) and Runtime Environment (JRE), which is a subset of the SDK; the primary distinction is that in the JRE the compiler, utility programs, and many necessary header files are not present. On 13 November 2006, Sun released much of Java as free software under the terms of the GNU General Public License (GPL). On 8 May 2007, Sun finished the process, making all of Java's core code open source, aside from a small portion of code to which Sun did not hold the copyright.

Primary Goals
There were five primary goals in the creation of the Java language:
1. It should use the object-oriented programming methodology.
2. It should allow the same program to be executed on multiple operating systems.
3. It should contain built-in support for using computer networks.
4. It should be designed to execute code from remote sources securely.
5. It should be easy to use, by selecting what were considered the good parts of other object-oriented languages.


The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords: simple, object-oriented, distributed, multithreaded, dynamic, architecture-neutral, portable, high-performance, robust, and secure.

Table 4.3.1 Properties of the Java Programming Language

Each of the preceding buzzwords is explained in "The Java Language Environment," a white paper written by James Gosling and Henry McGilton. In the Java programming language, all source code is first written in plain text files ending with the .java extension. Those source files are then compiled into .class files by the javac compiler. A .class file does not contain code that is native to your processor; it instead contains bytecodes, the machine language of the Java Virtual Machine (Java VM). The java launcher tool then runs your application with an instance of the Java Virtual Machine.

Figure 4.3.1 An Overview of the Software Development Process

Because the Java VM is available on many different operating systems, the same .class files are capable of running on Microsoft Windows, the Solaris(TM) Operating System (Solaris OS), Linux, or Mac OS. Some virtual machines, such as the Java HotSpot virtual machine, perform additional steps at runtime to give your application a performance boost. This includes tasks such as finding performance bottlenecks and recompiling (to native code) frequently used sections of code.
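A tiny, illustrative example of this compile-and-run cycle (the class name is ours):

// HelloApriori.java -- compiled with:  javac HelloApriori.java
// which produces HelloApriori.class, then run on any JVM with:  java HelloApriori
public class HelloApriori {
    public static void main(String[] args) {
        System.out.println("Same .class file, any platform with a Java VM.");
    }
}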


Figure 4.3.2 Through The Java Vm, The Same Application Is Capable Of Running On Multiple Platforms.

The Java Platform


A platform is the hardware or software environment in which a program runs. We've already mentioned some of the most popular platforms like Microsoft Windows, Linux, Solaris OS, and Mac OS. Most platforms can be described as a combination of the operating system and underlying hardware. The Java platform differs from most other platforms in that it's a softwareonly platform that runs on top of other hardware-based platforms. The Java platform has two components:

The Java Virtual Machine
The Java Application Programming Interface (API)

You've already been introduced to the Java Virtual Machine; it's the base for the Java platform and is ported onto various hardware-based platforms. The API is a large collection of ready-made software components that provide many useful capabilities. It is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights some of the functionality provided by the API.

Figure 4.3.3 The API and Java Virtual Machine insulate the program from the underlying hardware.

As a platform-independent environment, the Java platform can be a bit slower than native code. However, advances in compiler and virtual machine technologies are bringing performance close to that of native code without threatening portability.

Java Runtime Environment


The Java Runtime Environment, or JRE, is the software required to run any application deployed on the Java platform. End users commonly obtain the JRE in software packages and Web browser plug-ins. Sun also distributes a superset of the JRE called the Java 2 SDK (more commonly known as the JDK), which includes development tools such as the Java compiler, Javadoc, Jar and the debugger. One of the unique advantages of the concept of a runtime engine is that errors (exceptions) should not 'crash' the system. Moreover, in runtime-engine environments such as Java there exist tools that attach to the runtime engine and, every time an exception of interest occurs, record the debugging information that existed in memory at the time the exception was thrown (stack and heap values). These automated exception handling tools provide root-cause information for exceptions in Java programs that run in production, testing or development environments.

Uses of Java
Blue is a smart card enabled with the secure, cross-platform, object-oriented Java Card API and technology. Blue contains an actual on-card processing chip, allowing for enhanceable and multiple functionality within a single card. Applets that comply with the Java Card API specification can run on any third-party vendor card that provides the necessary Java Card Application Environment (JCAE). Not only can multiple applet programs run on a single card, but new applets and functionality can be added after the card is issued to the customer. Java can also be used in chemistry.

Java is also used at NASA, in 2D and 3D applications, in graphics programming, in animations, and in online and web applications.

5. IMPLEMENTATION

Concept Of Association Rules And Current Related Work


Association rules are used to show the relationships between data items. These uncovered relationships are not inherent in the data, as functional dependencies are, and they do not represent any sort of causality or correlation. Instead, association rules detect common usage of items. The strength of an association rule is measured using support and confidence. Rule mining constitutes a major function of data mining. Consider a database of telecom customer transactions T, where each transaction is a set of items. The objective is to find all rules of the form A => B, which correlate the presence of one set of items A with another set of items B. An example of such a rule is: 80% of high-usage users use value-added services such as SMS, CRBT (color ring back tone) and mobile Internet access at the same time. For this purpose, one needs to ensure that the support of A and B is greater than a user threshold s, and that the conditional probability (confidence) of B given A is greater than a user threshold c. A rule must have some minimum user-specified confidence. A rule 1 & 2 => 3 is defined to have 90% confidence if, when a customer chooses services 1 and 2, in 90% of those cases the customer also chooses service 3. A rule must also have some minimum user-specified support. This implies that the rule 1 & 2 => 3 should hold in some minimum percentage of transactions in order to have business value. There can be any number of services in the consequent and antecedent parts of a rule. It is also possible to specify constraints on rules. An association rule can generally be viewed as being defined over attributes of a relation and has the form A => B. Formally, let I = {i1, i2, ..., im} be the set of items, and let the task-relevant data D be a set of transactions, where each transaction T is a set of items such that T is a subset of I. A transaction T contains A if and only if A is a subset of T. An association rule is an implication of the form A => B, where A and B are subsets of I and A and B have no items in common. The rule A => B holds in D with support s, where s is the percentage of transactions in D that contain both A and B. The rule A => B has confidence c in D, where c is the conditional probability P(B | A), that is, the percentage of transactions containing A that also contain B. In short, support(A => B) = P(A and B) and confidence(A => B) = P(B | A). Association rule mining thus seeks expressions of the form A => B, where A and B are subsets of all attributes and the implication holds with a confidence greater than c, where c is a user-defined threshold; this reads as IF A THEN B, with at least confidence c. In other words, given huge volumes of heterogeneous data, the objective is to efficiently extract meaningful patterns that can be of interest and hence useful to the user. The role of interestingness is to threshold or filter the large number of discovered patterns, and report only those which may be of some use. There are two approaches to designing a

measure of interestingness of a pattern, namely objective and subjective. The former uses the structure of the pattern and is generally quantitative; it often fails to capture all the complexities of the pattern discovery process. The subjective approach, on the other hand, depends additionally on the user who examines the pattern. Although association rules were initially developed in the context of market-basket analysis, they have proved useful in a wide range of applications. However, association rules are not directly applicable to numeric data. The standard approach to incorporating numeric data is discretization, a process by which the values of numeric fields in a database are divided into sub-ranges. Each such sub-range is treated as an item in the association rule analysis. Applications of association rule analysis are not restricted to such analyses of purchasing behavior, however. The manner in which association rule approaches identify all rules that satisfy some set of measures of interestingness makes them useful for a wide variety of applications in which the user seeks insight into a set of data. The user can specify the quantifiable measures of what makes a rule interesting. Further factors may apply that are difficult to quantify, but the user can apply these manually to select the most useful of the rules discovered by the automated system. Another application of association rules has been classification. Rather than the classical approach of learning classification rules, a large set of high-confidence association rules is discovered; to classify a new object, all rules that cover the object are identified, and then the relative support and confidence of each rule are used to identify the most probable class for the object. Early approaches to association rule discovery sought all rules that satisfy constraints on support and confidence. More recent techniques utilize additional measures of how interesting a rule is, including lift and leverage. The current related work on this topic is still limited. The innovation of this work lies in applying the association rule algorithm to mobile value-added services. The conclusions of the research are valuable for telecom operators in China in developing value-added services.

Support

Support shows the frequency of the patterns in the rule; it is the percentage of transactions that contain both A and B, i.e.
Support = Probability(A and B)
Support = (# of transactions involving A and B) / (total number of transactions).

Confidence

Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B if they contain A, i.e.
Confidence = Probability(B if A) = P(B | A)
Confidence = (# of transactions involving A and B) / (total number of transactions that have A).
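Both measures can be computed by a direct scan of the transaction list. The following is a minimal sketch, assuming transactions are modelled as sets of service names; the class and method names are illustrative only and do not claim to reproduce the project's AprioriImpl code.

import java.util.List;
import java.util.Set;

// Minimal sketch: support and confidence over an in-memory transaction list.
public class RuleMeasures {

    // support(A => B) = |transactions containing A and B| / |all transactions|
    public static double support(List<Set<String>> transactions,
                                 Set<String> a, Set<String> b) {
        long hits = transactions.stream()
                .filter(t -> t.containsAll(a) && t.containsAll(b))
                .count();
        return (double) hits / transactions.size();
    }

    // confidence(A => B) = |transactions containing A and B| / |transactions containing A|
    public static double confidence(List<Set<String>> transactions,
                                    Set<String> a, Set<String> b) {
        long hasA = transactions.stream().filter(t -> t.containsAll(a)).count();
        long hasBoth = transactions.stream()
                .filter(t -> t.containsAll(a) && t.containsAll(b))
                .count();
        return hasA == 0 ? 0.0 : (double) hasBoth / hasA;
    }
}

For the earlier example, A would be the set {SMS, CRBT} and B the set containing mobile Internet access.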

Constraint-Based Association Rule Mining


Many data mining techniques consist of discovering patterns that occur frequently in the source dataset. Typically, the goal is to discover all the patterns whose frequency in the dataset exceeds a user-specified threshold. However, users very often want to restrict the set of patterns to be discovered by adding extra constraints on the structure of patterns, and data mining systems should be able to exploit such constraints to speed up the mining process. Techniques applicable to constraint-driven pattern discovery can be classified into the following groups: post-processing (filtering out patterns that do not satisfy user-specified pattern constraints after the actual discovery process); pattern filtering (integration of pattern constraints into the actual mining process in order to generate only patterns satisfying the constraints); and dataset filtering (restricting the source dataset to objects that can possibly contain patterns that satisfy pattern constraints). Wojciechowski and Zakrzewicz [39] focus on improving the efficiency of constraint-based frequent pattern mining by using dataset filtering techniques. Dataset filtering conceptually transforms a given data mining task into an equivalent one operating on a smaller dataset. Tien Dung Do et al [14] proposed a specific type of constraint called category-based constraints, as well as the associated algorithm for constrained rule mining based on Apriori. The category-based Apriori algorithm reduces the computational complexity of the mining process by bypassing most of the subsets of the final itemsets; an experiment has been conducted to show the efficiency of the proposed technique. Rapid Association Rule Mining (RARM) [13] is an association rule mining method that uses a tree structure to represent the original database and avoids the candidate generation process. In order to improve the efficiency of existing mining algorithms, constraints are applied during the mining process to generate only those association rules that are interesting to users instead of all the association rules.
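As a rough sketch of the dataset-filtering idea only (not the specific algorithm of Wojciechowski and Zakrzewicz), transactions that cannot contain any pattern satisfying an item constraint can be discarded before mining:

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of dataset filtering: keep only transactions that contain all required items,
// since only those transactions can support patterns satisfying the item constraint.
public class DatasetFilter {
    public static List<Set<String>> filter(List<Set<String>> transactions,
                                           Set<String> requiredItems) {
        return transactions.stream()
                .filter(t -> t.containsAll(requiredItems))
                .collect(Collectors.toList());
    }
}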

Categories Of Databases In Which Association Rules Are Applied


A transactional database is a collection of transaction records, in most cases sales records. With the popularity of computers and e-commerce, massive transactional databases are available now. Data mining on transactional databases focuses on the mining of association rules, finding the correlation between items in the transaction records. One data mining technique, generalized association rule mining with taxonomy, has the potential to discover more useful knowledge than ordinary flat association rule mining by taking application-specific information into account [27]. In retail, in particular, one might consider as items particular brands or whole groups such as milk, drinks or food; the more general the items chosen, the higher one can expect the support to be. Thus one might be interested in discovering frequent itemsets composed of items which themselves form a taxonomy. Earlier work on mining generalized association rules ignores the fact that the taxonomy of items cannot be kept static while new transactions are continuously added to the original database. How to effectively update the discovered generalized association rules to reflect database change together with taxonomy evolution and transaction update is a crucial task. Tseng et al [34] examine this problem and propose a novel algorithm, called IDTE, which can incrementally update the discovered generalized association rules when the taxonomy of items evolves as new transactions are inserted into the database. Empirical evaluations show that this algorithm maintains its performance even with large numbers of incremental transactions and a high degree of taxonomy evolution, and is more than an order of magnitude faster than applying the best generalized association mining algorithms to the whole updated database. Spatial databases usually contain not only traditional data but also the location or geographic information about the corresponding data. Spatial association rules describe the relationship between one set of features and another set of features in a spatial database, for example "most business centers in Greece are around City Hall"; the spatial operations used to describe the correlation can be within, near, next to, etc. The form of a spatial association rule is also X => Y, where X and Y are sets of predicates, some of which are spatial predicates, and at least one must be a spatial predicate. Temporal association rules can be more useful and informative than basic association rules.

Recent Advances In Association Rule Discovery

A serious problem in association rule discovery is that the set of association rules can grow to be unwieldy as the number of transactions increases, especially if the support and confidence thresholds are small. As the number of frequent itemsets increases, the number of rules presented to the user typically increases proportionately. Many of these rules may be redundant.

Redundant Association Rules

To address the problem of rule redundancy, four types of research on mining association rules have been performed. First, rules have been extracted based on user-defined templates or item constraints. Secondly, researchers have developed interestingness measures to select only interesting rules. Thirdly, researchers have proposed inference rules or inference systems to prune redundant rules and thus present smaller and usually more understandable sets of association rules to the user. Finally, new frameworks for mining association rules have been proposed that find association rules with different formats or properties. Ashrafi et al presented several methods to eliminate redundant rules and to produce a small number of rules from any given frequent or frequent closed itemsets generated. Ashrafi et al also present additional redundant rule elimination methods that first identify the rules that have similar meaning and then eliminate those rules; furthermore, their methods eliminate redundant rules in such a way that they never drop any higher-confidence or interesting rules from the resultant rule set. Jaroszewicz and Simovici presented another solution to the problem using the maximum entropy approach. The problem of the efficiency of maximum entropy computations is addressed by using closed-form solutions for the most frequent cases. Analytical and experimental evaluation of their proposed technique indicates that it efficiently produces small sets of interesting association rules. Moreover, there is a need for human intervention in mining interesting association rules. Such intervention is most effective if the human analyst has a robust visualization tool for mining and visualizing association rules. Techapichetvanich and Datta presented a three-step visualization method for mining market basket association rules. These steps include discovering frequent itemsets, mining association rules and finally visualizing the mined association rules.

Negative Association Rules

Typical association rules consider only items enumerated in transactions. Such rules are referred to as positive association rules. Negative association rules also consider the same items, but in addition consider negated items (i.e. items absent from transactions). Negative association rules are useful in market-basket analysis to identify products that conflict with each other or products that complement each other. Mining negative association rules is a difficult task, due to the fact that there are essential differences between positive and negative association rule mining. Researchers attack two key problems in negative association rule mining: (i) how to effectively

search for interesting itemsets, and (ii) how to effectively identify negative association rules of interest. Brin et al. [8] mentioned for the first time in the literature the notion of negative relationships. Their model is chi-square based: they use the statistical test to verify the independence between two variables, and a correlation metric is used to determine the nature (positive or negative) of the relationship. In [28] the authors present a new idea to mine strong negative rules. They combine positive frequent itemsets with domain knowledge in the form of a taxonomy to mine negative associations. However, their algorithm is hard to generalize since it is domain dependent and requires a predefined taxonomy. A similar approach is described in [37]. Wu et al [40] derived a new algorithm for generating both positive and negative association rules. They add, on top of the support-confidence framework, another measure called mininterest for better pruning of the frequent itemsets generated. In [32] the authors use only negative associations of the type X => ¬Y to substitute items in market basket analysis.
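For intuition, the measures of a rule with a negated consequent can be derived from positive supports alone; the short sketch below illustrates this generic identity and is not the algorithm of any of the cited papers.

// Sketch: measures for the negative rule A => not-B, derived from positive supports.
// supp(A and not-B) = supp(A) - supp(A and B)
// conf(A => not-B)  = supp(A and not-B) / supp(A)
public class NegativeRule {
    public static double supportANotB(double suppA, double suppAB) {
        return suppA - suppAB;
    }
    public static double confidenceANotB(double suppA, double suppAB) {
        return suppA == 0 ? 0.0 : (suppA - suppAB) / suppA;
    }
}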

6. UML REPRESENTATION

The Unified Modeling Language (UML) is a graphical visualization language. It consists of a series of symbols and connectors that can be used to create process diagrams, and it is often used to model computer programs and workflows. UML is used to specify, visualize, modify, construct and document the artifacts of an object-oriented, software-intensive system under development. It offers a standard way to visualize a system's architectural blueprints. UML combines techniques from data modeling (entity relationship diagrams), business modeling (workflows), object modeling, and component modeling. It can be used with all processes, throughout the software development life cycle, and across different implementation technologies. UML has synthesized the notations of the Booch method, the Object Modeling Technique (OMT) and Object-Oriented Software Engineering (OOSE) by fusing them into a single, common and widely usable modeling language. UML aims to be a standard modeling language which can model concurrent and distributed systems. UML is a de facto industry standard, and it evolves under the auspices of the Object Management Group (OMG). OMG initially called for information on object-oriented methodologies that might create a rigorous software modeling language, and many industry leaders responded in earnest to help create the UML standard. UML models may be automatically transformed to other representations (e.g. Java) by means of QVT-like transformation languages supported by the OMG. UML is extensible, offering profiles and stereotypes as mechanisms for customization; the semantics of extension by profiles were improved in the UML 2.0 major revision.

6.1 Use Case Diagram


A use case is a description of a set of sequences of actions that a system performs and that yields an observable result of value to a particular actor. A use case is used to structure the behavioral things in a model. A use case is realized by collaborations.

The use case diagram consists of a single actor, the user, and four use cases: data file, Apriori rule, support value and confidence value.

Figure 6.1 Use Case Diagram

6.2 Class Diagram



A class diagram is a collection of classes and objects and their relationships. A class describes a set of objects that share the same attributes, operations, relationships and semantics, and a class may implement one or more interfaces. Graphically, a class is rendered as a rectangle, usually including its name, attributes and operations.

Main: f_output, p_output, p_main, mb_main, aprioriImpl; add(), setJMenuBar(), setSize(), setLocationRelativeTo(), setVisible()
AprioriImpl: transactionList (the transaction list); dumpItemsets(), getNumTransactions(), computeSupportForItemset(), computeConfidenceForAssociationRule()
Itemset: items; addItem(), intersectWith(), unionWith(), minusAllIn(), toString()
PropertiesDialog: confidence, support; actionPerformed(), PropertiesDialog()
Item: name; equals(), hashCode()

Figure 6.2 Class Diagram
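A possible Java skeleton of the Item and Itemset classes, reconstructed from the names shown in the diagram, is given below; the field types and method bodies are assumptions for illustration only, and the minusAllIn() operation is omitted.

import java.util.HashSet;
import java.util.Set;

// Hypothetical reconstruction of two classes named in the class diagram.
class Item {
    private final String name;
    Item(String name) { this.name = name; }
    @Override public boolean equals(Object o) {
        return o instanceof Item && ((Item) o).name.equals(name);
    }
    @Override public int hashCode() { return name.hashCode(); }
    @Override public String toString() { return name; }
}

class Itemset {
    private final Set<Item> items = new HashSet<>();
    void addItem(Item i) { items.add(i); }
    Itemset intersectWith(Itemset other) {        // items common to both itemsets
        Itemset r = new Itemset();
        for (Item i : items) if (other.items.contains(i)) r.addItem(i);
        return r;
    }
    Itemset unionWith(Itemset other) {            // all items from both itemsets
        Itemset r = new Itemset();
        r.items.addAll(items);
        r.items.addAll(other.items);
        return r;
    }
    @Override public String toString() { return items.toString(); }
}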

6.3 Sequence Diagram



A sequence diagram shows the time-ordered flow of interactions among the objects involved with each other.

The sequence diagram involves four participants: GUI Design, Item sets, Apriori and Frequent sets. The interaction proceeds as follows: the user interface is designed, the transaction .dat file is selected, Apriori is run, the associativity is checked, the support values are calculated, and the confidence values are calculated.

Figure 6.3 Sequence Diagram


6.4 Collaboration Diagram


A collaboration defines an interaction and is a society of roles and other elements that work together to provide some cooperative behavior that is bigger than the sum of all the elements. Collaborations have structural as well as behavioral dimensions. A given class might participate in several collaborations. Collaborations represent the implementation of patterns that make up a system.

The collaboration diagram shows the same participants and messages as the sequence diagram: 1: user interface is designed; 2: transaction .dat file is selected (GUI Design to Item sets); 3: Run Apriori; 4: check the associativity; 5: support values are calculated; 6: confidence values are calculated (Apriori to Frequent sets).

Figure 6.4 Collaboration Diagram

7. SCREEN SHOTS

The frequent itemsets are evaluated using a num file: itemsets are converted into num files and given as input. To generate frequent itemsets, support and confidence are given as thresholds, and the support count of an itemset is the aggregate of all local support counts of that itemset. To generate frequent itemsets, a threshold value has to be specified, known as the minimum support count. An itemset is said to be frequent if it satisfies support >= minimum support. Candidate itemsets are then generated in iterative passes using the support count. When no more itemsets satisfy the minimum support count, the process stops and the frequent itemsets have been generated.
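A condensed sketch of this level-wise process is shown below; it assumes transactions are sets of integer item codes (as in the num files) and is illustrative rather than the project's actual AprioriImpl implementation.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the level-wise Apriori loop: generate candidates, count support, prune.
public class AprioriSketch {

    public static List<Set<Integer>> frequentItemsets(List<Set<Integer>> transactions,
                                                      double minSupport) {
        List<Set<Integer>> frequent = new ArrayList<>();

        // Level 1: frequent single items.
        Set<Integer> allItems = new HashSet<>();
        transactions.forEach(allItems::addAll);
        List<Set<Integer>> current = new ArrayList<>();
        for (Integer item : allItems) {
            Set<Integer> c = new HashSet<>();
            c.add(item);
            if (support(transactions, c) >= minSupport) current.add(c);
        }

        // Level k+1: join frequent k-itemsets, keep candidates meeting minSupport.
        while (!current.isEmpty()) {
            frequent.addAll(current);
            List<Set<Integer>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i++) {
                for (int j = i + 1; j < current.size(); j++) {
                    Set<Integer> candidate = new HashSet<>(current.get(i));
                    candidate.addAll(current.get(j));
                    if (candidate.size() == current.get(i).size() + 1
                            && !next.contains(candidate)
                            && support(transactions, candidate) >= minSupport) {
                        next.add(candidate);
                    }
                }
            }
            current = next;
        }
        return frequent;
    }

    static double support(List<Set<Integer>> transactions, Set<Integer> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }
}

Association rules can then be derived from the frequent itemsets by keeping only those rules whose confidence meets the user-specified threshold.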

1) Main Window

Figure 7.1 Main Window

2) Select File

Figure 7.2 Select File


3) File Selected

Figure 7.3 File Selected

4) Item sets

Figure 7.4 Item Sets

5) Confidence & Support

Figure 7.5 Confidence & Support

6) Assignment of Support and confidence

Figure 7.6 Assignment of Support and confidence

7) Associativity Result

Figure 7.7 Associativity Result

8. TESTING

Testing is a process of executing a program with the intent of finding an error. A good test has a high probability of finding an as yet undiscovered error, and a successful test is one that uncovers such an error.

The objective is to design tests that systematically uncover different classes of errors, and to do so with a minimum amount of time and effort. Testing cannot show the absence of defects; it can only show that software defects are present.

8.1 Unit Testing


Interface: the number of input parameters should equal the number of arguments; parameter and argument attributes must match; parameters passed should be in the correct order; global variable definitions should be consistent across modules. If the module does I/O: file attributes should be correct; open/close statements must be correct; format specifications should match I/O statements; buffer size should match record size; files should be opened before use; end-of-file conditions should be handled; I/O errors should be handled; and any textual errors in output information must be checked.

Local data structures (a common source of errors): improper or inconsistent typing; erroneous initialization or default values; incorrect variable names; inconsistent data types; overflow, underflow and address exceptions.

Boundary conditions and independent paths should also be exercised.

Error handling: error description unintelligible; error noted does not correspond to the error encountered; error condition handled by the system run-time before the error handler gets control; exception condition processing incorrect.
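As an example of how such checks become executable unit tests for this project, the sketch below verifies a support computation against a tiny hand-built transaction list; it uses plain assertions rather than any particular test framework, and the helper method is illustrative, not the project's actual code.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal unit-test sketch: verifies the support computation on a small transaction list.
public class SupportUnitTest {

    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
                new HashSet<>(Arrays.asList("SMS", "CRBT")),
                new HashSet<>(Arrays.asList("SMS", "Internet")),
                new HashSet<>(Arrays.asList("CRBT")),
                new HashSet<>(Arrays.asList("SMS", "CRBT", "Internet")));

        double s = support(transactions, new HashSet<>(Arrays.asList("SMS", "CRBT")));

        // Expected: 2 of 4 transactions contain both SMS and CRBT.
        if (Math.abs(s - 0.5) > 1e-9) {
            throw new AssertionError("support(SMS, CRBT) expected 0.5 but was " + s);
        }
        System.out.println("support test passed: " + s);
    }
}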

8.2 Integration Testing


Top-Down Integration
Modules are integrated by moving down the program design hierarchy, using either depth-first or breadth-first top-down integration. This verifies major control and decision points early in the design process, and the top-level structure is tested most. A depth-first implementation allows a complete function to be implemented, tested and demonstrated, so critical functions can be implemented and tested early. Top-down integration is forced (to some extent) by some development tools in programs with graphical user interfaces.

Bottom-Up Integration
Bottom-up integration testing, as its name implies, begins construction and testing with atomic modules (the lowest-level modules). Because modules are integrated from the bottom up, the processing required for modules subordinate to a given level is always available and the need for stubs is eliminated.

8.3 Validation Testing


Validation testing aims to demonstrate that the software functions in a manner that can reasonably be expected by the customer. It tests the conformance of the software to the Software Requirements Specification.

8.4 System Testing


Software is only one component of a system. The software is incorporated with other system components, and system integration and validation tests are performed.

8.5 Recovery Testing


Many systems need to be fault tolerant: processing faults must not cause overall system failure. Other systems require recovery within a specified time after a failure. Recovery testing is the forced failure of the software in a variety of ways to verify that recovery is properly performed.

8.6 Security Testing


Systems with sensitive information, or which have the potential to harm individuals, can be a target for improper or illegal use. This can include:

Attempted penetration of the system by outside individuals for fun or personal gain.
Disgruntled or dishonest employees.

During security testing the tester plays the role of the individual trying to penetrate the system. A large range of methods may be used: attempting to acquire passwords through external clerical means; using custom software to attack the system; overwhelming the system with requests; causing system errors and attempting to penetrate the system during recovery; and browsing through insecure data.

Given time and resources, the security of most systems can be breached.

8.7 Performance Testing


For real-time and embedded systems, the functional requirements may be satisfied but performance problems can make the system unacceptable. Performance testing checks the run-time performance in the context of the integrated system; it can be coupled with stress testing and may require special software instrumentation.

Testing Under Various Software Development Stages

Requirements Stage: The requirements documents are tested by disciplined inspection and review, and the test plan is prepared. The test plan should include: 1. Specification 2. Description of test procedures 3. Test milestones 4. Test schedule 5. Test data reduction 6. Evaluation criteria.

Design Stage: Design products are tested by analysis, simulation, walkthrough and inspection. Test data for functions are generated, and test cases based on the structure of the system are generated.


Construction Stage: This stage includes the actual execution of code with test data. Code walkthroughs and inspections are conducted. Static analysis, dynamic analysis, and the construction of test drivers, harnesses and stubs are done. Control and management of the test process is critical; all test sets, test results and test reports should be catalogued and stored.

Operation and Maintenance Stage: Modifications made to the software require retesting; this is termed regression testing. Changes at a given level will necessitate retesting at all levels below it.

Approaches: There are two basic approaches: 1. Black box or "functional" analysis 2. White box or "structural" analysis.

Boundary Value Analysis (Stress Testing): In this method the input data is partitioned, and data inside and at the boundary of each partition is tested.

Design-Based Functional Testing: A functional hierarchy is constructed. For each function at each level, extremal, non-extremal and special-value test data are identified. Test data is identified such that it will generate extremal, non-extremal and special output values.

Cause-Effect Graphing: In this method the characteristic input stimuli (causes) and characteristic output classes (effects) are identified. The dependencies are identified using the specification and presented as a directed graph. Test cases are chosen to test the dependencies.

Coverage-Based Testing


The program is represented as a control-flow graph and its paths are identified. Test data are chosen to maximize the number of paths executed under test conditions. For paths that are not finite in number or are infeasible, coverage metrics can be applied.

Complexity-Based Testing: The cyclomatic complexity is measured. The paths actually executed by the program running on the test data are identified and the actual complexity is set. A test set is devised which will drive the actual complexity closer to the cyclomatic complexity.

Test Data Analysis: During test data analysis, the goodness of the test data set is the major consideration.

Statistical Analysis and Error Seeding: Known errors are seeded into the code so that their placement is statistically similar to that of actual errors.

Mutation Analysis: It is assumed that a set of test data that can uncover all simple faults in a program is capable of detecting more complex faults. In mutation analysis a large number of simple faults, called mutations, are introduced into a program one at a time. The resulting changed versions of the test program are called mutants. Test data is then constructed to cause these mutants to fail, and the effectiveness of the test data set is measured by the percentage of mutants killed.

Test Results: The listed tests were conducted on the software at the various development stages. Unit testing was conducted, the errors were debugged and regression testing was performed. Integration testing will be performed once the system is integrated with other related systems such as Inventory, Budget, etc. Once the design stage was over, black box and white box testing was performed on the entire application. The results were analyzed and the appropriate alterations were made. The test results proved to be positive, and henceforth the application is feasible and test approved.

The Sample Test Case

Figure 8.7.1 The Sample Test Case


9. CONCLUSION
Implementations of the Apriori-based algorithm focus on the way candidate itemsets are generated and on the optimization of the data structures used for storing itemsets. Bodon presented an implementation that solved the frequent itemset mining problem in most cases faster than other well-known implementations, and Bodon's implementation is used here for parallel computing. The time efficiency and the quality of the results are improved with the help of the Apriori algorithm. Data mining is a rich area of scientific study holding ample promise for the research community. A lot of progress has been reported for large databases, specifically involving association rules, classification, clustering, similar time sequences, similar text document retrieval, similar image retrieval, outlier discovery, etc., and many papers have been published in major conferences and leading journals. However, it still remains a promising and rich field with many challenging research issues. With the fierce competition in the telecommunications industry, business managers have become more and more aware of the importance of marketing. It is believed that more extensive use of data mining technology in the telecommunications industry will enable enterprises to control the loss of customers at the source. Such preventive measures can avoid many unnecessary losses to a large extent, so that enterprises will be in an invincible position in the increasingly fierce market competition.


10. REFERENCES
[1] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[2] Alex Berson, Stephen Smith, Kurt Thearling. Building Data Mining Applications for CRM. McGraw-Hill Companies, 2000.
[3] Sushmita Mitra, Tinku Acharya. Data Mining: Multimedia, Soft Computing, and Bioinformatics. Wiley Publishing, 2003.
[4] Lin T. Y., Cercone N. Rough Sets and Data Mining Analysis. USA: Kluwer Academic Publishers, 1997.
[5] M. Kamber, R. Shinghal. Evaluating the Interestingness of Characteristic Rules. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, AAAI, 1996, pp. 263-266.
[6] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD '97), Tucson, AZ, pp. 255-264, May 1997.
[7] C. C. Aggarwal, C. Procopiuc, and P. S. Yu. Finding localized associations in market basket data. IEEE Transactions on Knowledge and Data Engineering, vol. 14, pp. 51-62, 2002.
[8] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. Proceedings of the 1995 International Conference on Very Large Data Bases (VLDB '95), Zurich, Switzerland, pp. 432-443, September 1995.
[9] Han J., Kamber M. Data Mining: Concepts and Techniques. Beijing: Higher Education Press, 2001.
[10] Pawlak Z., Wong S. K. M., Ziarko W. Rough sets: probabilistic versus deterministic approach. International Journal of Man-Machine Studies, 1988, 29: 81-95.
[11] Ziarko W. Variable precision rough set model. Journal of Computer and System Sciences, 1993, 46: 39-59.
