International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com

Volume 2, Issue 2, March – April 2013

ISSN 2278-6856

Informative Pattern Discovery Using Combined Mining

T. Ramanjaneyulu 1, B. Naresh Kumar Reddy 2 and Jogi Suresh 3

1 Dept. of IT, JNTUK UCEV, Vizianagaram, A.P., 535003, India

2 Dept. of Electronics and Communication Engineering, LBRCE, Mylavaram,

3 Dept. of CSE, KLU, A.P., 535003,

Abstract: Actionable patterns are identified by applying different data mining methods. Traditional data mining techniques identify only homogeneous features of patterns in a data source. Enterprise applications hold large volumes of data, and mining such data incurs high space and time complexity. In this paper the authors implement a combined mining approach, i.e. a Location based search mining algorithm together with the Cluster kinship search technique, to generate incremental cluster patterns. The informative patterns obtained from these techniques make efficient actionable decision making possible and allow user interestingness to be evaluated.

1. INTRODUCTION

Real-time complex data contains vast amounts of information, and mining the required information from it with a single traditional data mining method is not possible. To acquire knowledgeable information from a data source, several data mining methods should be integrated. The combined mining approach is implemented with different data mining algorithms on a mobile transaction data set. This is time series data that represents users, locations, services, and transaction path sequences. Users may access different services from various locations, and sometimes a few users access the same service from the same location. Similar patterns [11] are identified with the LBSA technique [5] by observing the behavior of the user across various transactions. In order to provide effective and efficient location management technologies in mobile communication systems, researchers have recently tended to focus on characterizing the moving or calling characteristics of mobile users, such as the Call to Mobility Ratio (Markov model). Users' moving behaviors can help to allocate personal data, pre-fetch useful information, pre-allocate wireless resources, and design personal paging areas.

Moving sequential patterns are one kind of moving behavior, and we first describe the problem of mining them systematically. It can be viewed as a special case of mining sequential patterns with an extended notion of support, which leads to more reasonable pattern discovery. There are two major differences between mining conventional sequential patterns and mining moving sequential patterns. Firstly, if two items are consecutive in a moving sequence α, which is a subsequence of β, those two items must also be consecutive in β, because what matters in a mobile computing environment is the next move of a mobile user. Secondly, in mining moving sequential patterns the support counts the number of occurrences within a moving sequence, so the support of a moving sequence is the sum of its numbers of occurrences in all the moving sequences of the moving sequence database.
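The two differences above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy database, and the choice to count overlapping occurrences are assumptions made only for illustration.

```python
from typing import List, Sequence


def count_contiguous(pattern: Sequence[str], sequence: Sequence[str]) -> int:
    """Count occurrences of `pattern` as a run of consecutive items in `sequence`.

    Overlapping occurrences are counted (an assumption; the text does not say).
    """
    k = len(pattern)
    if k == 0 or k > len(sequence):
        return 0
    target = tuple(pattern)
    return sum(
        1
        for i in range(len(sequence) - k + 1)
        if tuple(sequence[i:i + k]) == target
    )


def moving_support(pattern: Sequence[str], database: List[Sequence[str]]) -> int:
    """Support of a moving sequence: occurrences summed over the whole database."""
    return sum(count_contiguous(pattern, seq) for seq in database)


# Hypothetical moving sequences; each letter stands for a visited location.
database = [
    ["A", "B", "C", "A", "B"],  # A -> B occurs twice
    ["B", "A", "B", "D"],       # A -> B occurs once
    ["A", "C", "B"],            # A and B are not consecutive: no occurrence
]

print(moving_support(["A", "B"], database))  # prints 3
```

The third sequence contributes nothing because A and B are separated by C, which is the consecutiveness requirement; the first sequence contributes two, which is the occurrence-based support.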

The frequent patterns of all users from the various locations are maintained as a similarity matrix. Users are dynamic in nature and access services from various locations, so the level of similarity varies when compared with users at other locations. All users accessing the same services have the same similarity value. Location based services have recently attracted a lot of interest from both industry and research. When using these services, many users may be concerned about giving up one more piece of private information by revealing their exact location or by releasing the fact that they used a particular service. More generally, the association between the real identity of the user issuing a location based service request and the request itself, as it reaches the service provider, can be considered a privacy threat. The similarity of clusters is defined by the number of intersecting time points. The resulting clusters represent hyper-rectangular approximations of the true subspace clusters. In an optional post-processing step, these approximations can be refined by again applying any clustering algorithm to the points included in the approximation, projected onto the corresponding subspace.

The cluster kinship technique [10] is applied to obtain efficient informative patterns of users from similar patterns. It gives incremental cluster patterns as its result. Affinity threshold values are the input to the clustering technique, which generates cluster patterns [7]. Altering the threshold values produces a variety of cluster patterns: the number of cluster patterns increases, or sometimes the order of the cluster patterns changes. These informative patterns are used, according to user preference, to take actionable decisions. The technical and business interestingness of users is evaluated, and prediction of user behavior becomes possible.

This paper is organized as follows. Section 1 introduces the modules implemented in this work. Section 2 presents the background and introduces the area of actionable pattern discovery, which finds informative patterns in complex data in order to take actionable decisions. Section 3 gives a general overview of the combined mining approach, that is, the Location based search alignment, similarity matrix, and Cluster kinship search technique algorithms that were implemented. Section 4 presents the experimental work, and Sections 5 and 6 give the conclusion and future work.

2. BACKGROUND

This section introduces combined mining and its relationship to actionable pattern discovery. The principles of combined mining are as follows.

Multiple data mining methods are combined for the discovery of actionable patterns. Combined cluster patterns are generated easily, with no need for in-depth mining. The papers [2], [3], and [4] describe how to create cluster patterns and cluster rules. The most common way to measure the proximity between categorical data is the simple matching coefficient, which is the number of matching attributes divided by the total number of attributes. The Jaccard coefficient has also been used as a similarity measure between transactions in transactional data. These two non-metric proximity measures, although they reflect mutual proximity between all pairs of data points, do not give a global measure of the topology of the data set. Relying on such non-metric proximity measures in the presence of categorical attributes limits the choice of clustering algorithms. Minimum spanning tree hierarchical clustering and hierarchical clustering with group averages are widely used in such situations. The minimum spanning tree algorithm is known to be very sensitive to outliers, while the group average algorithm has a tendency to split large clusters. With the objective of enabling distance based clustering methods in data sets with mixed attributes, Tuv and Runger suggested a procedure for mapping categorical variables to numeric scores. The scoring approach explores mutual relationships between variables in the data set and attempts to preserve the mutual information between all the variables. It uses a supervised contrasting independence clustering method relying on CKST as a supervised learning tool to discover contrasts between the original data and artificially generated data. Our work uses a similar approach but exploits class association rules to form the clusters directly, without relying on any scoring scheme.
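The two proximity measures named above can be sketched in a few lines of Python. The record layout (location, place, time of day) and the service names are hypothetical and serve only to make the arithmetic concrete.

```python
from typing import Hashable, Iterable, Sequence


def simple_matching(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of attribute positions on which two categorical records agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("records must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard(s: Iterable[Hashable], t: Iterable[Hashable]) -> float:
    """Size of the intersection divided by size of the union of two item sets."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 1.0  # convention chosen here: two empty transactions are identical
    return len(s & t) / len(union)


# Two hypothetical categorical records: (service, place, time of day).
u1 = ("Bank", "Home", "Morning")
u2 = ("Bank", "Office", "Morning")
print(simple_matching(u1, u2))  # 2 of 3 attributes match

# Two hypothetical transactions as sets of accessed services.
print(jaccard({"news", "mail", "maps"}, {"mail", "maps", "music"}))  # prints 0.5
```

Both functions compare one pair of records at a time, which is why, as noted above, they describe pairwise proximity rather than the overall topology of the data set.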

Association rule mining [3] was developed as a technique for finding interesting rules in transactional databases. An association rule is an expression of the form A → C, where A and C are disjoint subsets of the set of items; A is referred to as the set of antecedents and C as the set of consequents. The importance of a rule is evaluated by its support and confidence. The support of a rule is the fraction of all transactions in which the antecedents A and the consequents C apply simultaneously; it measures the importance of the rule in terms of the number of transactions, so the higher the support, the more important the rule. The confidence of a rule is the fraction of the transactions containing A that also contain C. Minimum support and confidence thresholds are usually specified before mining for association rules. Association rule mining [3] is also used in classification; the integration of classification and association rule mining is known as associative classification. Let R = {r1, r2, ..., rm} represent the set of m resulting rules. These rules are candidates for finding clusters. Here, association rules refer to rules with a single antecedent and a single consequent. The resulting meta rules provide a summary of the containment and overlap between the rules and hence can be used to organize the discovered rules. To find the meta rules from the set R, each rule in R is mapped to the data rows that support its set of antecedents. This results in a new set of transactions Q = {q1, q2, ..., qm}, m ≤ n, such that every element qj of Q is a subset of the rules in R. Meta rules between the rules of R are then learned from the transactions of Q.
Since the rules from R are used to find the potential clusters, and the discovered meta rules express associations between the rules, the meta rules also reveal the similarities between the clusters and can hence be used to merge similar clusters when the desired number of clusters is specified a priori.

Classifiers have been quite successful in a variety of domains, ranging from the identification of fraud and credit risks in financial transactions to medical diagnosis and intrusion detection. A hybrid recommender system mixes collaborative and content filtering using an induction learning classifier. One study applied feature vector classification [9] to movies and compared the classification with nearest-neighbor recommendation; it found that the classifiers did not perform as well as nearest neighbor, but that combining the two added value over nearest neighbor alone. Association rules have been used for many years in merchandising, both to analyze patterns of preference across products and to recommend products to consumers based on other products they have selected. An association rule expresses the relationship that one product is often purchased along with other products. The number of possible association rules grows exponentially with the number of products in a rule, but constraints on confidence and support, combined with algorithms that build association rules [3] over item sets of n items from rules over item sets of n-1 items, reduce the effective search space. Association rules can form a very compact representation of preference data that may improve efficiency of storage as well as performance. They are more commonly used for larger populations than for individual consumers and, like other learning methods that first build and then apply models, they are less suitable for applications where knowledge of preferences changes rapidly. Association rules [3] have been particularly successful in broad applications such as shelf layout in retail stores. By contrast, recommender systems based on nearest-neighbor techniques are easier to implement for personal recommendation in a domain where consumer opinions are frequently added, such as online retail.

Horting is a graph-based technique in which nodes are consumers and edges between nodes indicate the degree of similarity between two consumers. Predictions are produced by walking the graph to nearby nodes and combining the opinions of the nearby consumers. Horting differs from nearest neighbor in that the graph may be walked through other consumers who have not rated the product in question, thus exploring transitive relationships that nearest-neighbor algorithms do not consider. In one study using synthetic data, Horting produced better predictions than a nearest-neighbor algorithm.
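The support and confidence of a rule A → C, as defined earlier in this section, can be computed directly from a list of transactions. The following Python sketch assumes that each transaction is a set of items; the service names are hypothetical.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]],
            antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Fraction of all transactions that contain both A and C."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    items = set(antecedent) | set(consequent)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Fraction of the transactions containing A that also contain C."""
    a = set(antecedent)
    both = a | set(consequent)
    with_a = [t for t in transactions if a <= t]
    if not with_a:
        return 0.0  # convention chosen here: a rule whose antecedent never occurs
    return sum(1 for t in with_a if both <= t) / len(with_a)


# Hypothetical transactions: services accessed together by a user.
transactions = [
    {"maps", "news"},
    {"maps", "news", "mail"},
    {"maps", "mail"},
    {"news"},
]

print(support(transactions, {"maps"}, {"news"}))     # prints 0.5
print(confidence(transactions, {"maps"}, {"news"}))  # 2 of the 3 "maps" transactions
```

A rule is kept only when both values reach the pre-specified minimum thresholds, which is what prunes the exponentially large space of candidate rules.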

Cluster patterns [12] are generated from similar patterns and consist of heterogeneous item sets from the transaction data set. These cluster patterns are not produced by traditional algorithms [8]. Combined mining is a general method for directly identifying actionable patterns [7] from the mobile transaction data set. This is a multi-feature data set that contains a huge amount of information about a number of users accessing services from various locations [5]. User pattern information changes frequently, yet prediction of user movements is still possible.

The main objective of this paper is to use existing work to simplify the concept of combined mining so that it can be extended and instantiated into many specific approaches and models for mining industrial data sources and obtaining informative knowledge.

3. RELATED WORK

Analyzing a multidimensional mobile transaction data set by clustering involves the following steps.

1) Determination of transaction path sequence data: The data can be represented by a real-valued expression matrix I, where $I_{ij}$ is the measured transaction pattern of user i in experiment condition j. The i-th row of the matrix is the vector forming the transaction pattern of user i.

2) Calculation of similarity matrix S: In this matrix, entry $S_{ij}$ represents the similarity of the transaction sequence patterns of users i and j. The time distance between users can be used to calculate the similarity of patterns, and from it the similarity matrix [11] of the users is calculated.

3) Clustering the users based on location or sequence patterns: Users who belong to the same location or have similar sequence patterns fall into the same cluster; otherwise they belong to different clusters. In this fashion we obtain a number of cluster patterns.
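Steps 1 and 2 above can be sketched as follows. The paper does not give the exact formula that turns a time distance into a similarity, so the mapping used here, 1 / (1 + mean absolute time difference), is purely an assumption chosen because it is symmetric and lies in (0, 1]; the access times are hypothetical.

```python
from typing import List, Sequence


def similarity_matrix(I: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build a symmetric similarity matrix S from the rows of matrix I.

    Row i of I is the transaction pattern of user i (here: access times in
    hours). S[i][j] = 1 / (1 + mean |I[i][k] - I[j][k]|) is an assumed mapping
    from time distance to similarity, not the paper's formula.
    """
    n = len(I)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if len(I[i]) != len(I[j]) or len(I[i]) == 0:
                raise ValueError("all rows must be non-empty and of equal length")
            distance = sum(abs(a - b) for a, b in zip(I[i], I[j])) / len(I[i])
            S[i][j] = S[j][i] = 1.0 / (1.0 + distance)
    return S


# Hypothetical access times (hour of day) of three users under three conditions.
I = [
    [9.0, 12.0, 18.0],
    [9.0, 12.0, 18.0],   # identical to user 0: similarity 1.0
    [10.0, 14.0, 21.0],  # mean time distance 2.0 from user 0: similarity 1/3
]

for row in similarity_matrix(I):
    print([round(v, 3) for v in row])
```

The resulting matrix has ones on the diagonal and values in (0, 1] elsewhere, so it can serve as the input of step 3.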

Clustering high-dimensional data sets: The similarity data matrix is defined in the original data space. This leads to the surprising observation that, among general subspace clustering approaches, many approaches in this field tackle rather simple, specialized, or complex problems. For a more exhaustive coverage of biclustering algorithms we refer to surveys covering biclustering in biological and medical applications. Pattern based clustering [12] is especially popular in the bioinformatics community, which focuses on the application of biclustering to microarray data. It is important to understand the difference between unsupervised classification and supervised classification. In supervised classification, we are provided with a collection of labeled patterns, and the problem is to label a newly encountered unlabeled pattern. Typically the given labeled training patterns are used to learn descriptions of the classes, which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters as well, but these labels are data driven; that is, they are obtained solely from the data. Clustering [12] is useful in several areas of pattern analysis, grouping, decision making, and machine learning, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems there is little prior information, such as statistical models, available about the data, and the decision maker must make as few assumptions about the data as possible.

Hubert's statistics [8] are followed when the dimensionality of the matrices has to be reduced because of a mismatch in dimensions. Projection techniques [8] reduce the data dimensionality by combining the original variables into a smaller number of new dimensions, in a linear or nonlinear manner. Projection preserves the inherent relationships and structure of the data set. Hubert's statistics [8] evaluate the fit between the distance matrices of the original data and the projected data.

The Cluster kinship search technique [10] uses the average similarity (affinity) between an unassigned element and the current core cluster to make its next decision. It differs from the theoretical algorithm, which repeats the same process many times, and instead uses cleaning steps to remove spurious elements from the core cluster. CKST adds elements to the current cluster one at a time, which improves the decision base for the next decision. It handles general real-valued input similarities, and a user-specified affinity threshold parameter determines what affinity level is significant. This parameter influences the number and size of the clusters.
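To make the notion of fit between two distance matrices concrete, the sketch below computes one common form of Hubert's statistic, the normalized Γ, as the Pearson correlation between corresponding off-diagonal entries of the two matrices. Whether this is the exact variant used in [8] or in this work is an assumption, and the matrices are invented for illustration.

```python
from math import sqrt
from typing import List, Sequence


def upper_triangle(M: Sequence[Sequence[float]]) -> List[float]:
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(M)
    return [M[i][j] for i in range(n) for j in range(i + 1, n)]


def normalized_hubert_gamma(D1: Sequence[Sequence[float]],
                            D2: Sequence[Sequence[float]]) -> float:
    """Pearson correlation between the pairwise distances in D1 and in D2.

    Values near 1 mean the projection preserved the distance structure.
    """
    x, y = upper_triangle(D1), upper_triangle(D2)
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("matrices must have the same size, at least 3 x 3")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("distances must not all be equal")
    return cov / (sd_x * sd_y)


# Hypothetical distance matrices of the original and the projected data.
original = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
projected = [[0, 2, 8], [2, 0, 4], [8, 4, 0]]  # same structure, different scale

print(round(normalized_hubert_gamma(original, projected), 6))  # prints 1.0
```

Because the statistic is a correlation, a projection that merely rescales all distances still scores 1, while one that reorders which pairs are near or far scores lower.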

The input to the Cluster kinship search technique algorithm [10] is a pair (S, t), where S is a real symmetric n-by-n similarity matrix with $S(i, j) \in [0, 1]$ and t is the affinity threshold. The clusters are constructed one at a time. The current cluster is denoted $C_{open}$. The affinity of an element x with respect to the current cluster is defined as $a(x) = \sum_{y \in C_{open}} S(x, y)$. We say that element x has high affinity if $a(x) \geq t\,|C_{open}|$; otherwise x has low affinity. The cluster kinship technique alternates between adding a high-affinity element to $C_{open}$ and removing a low-affinity element from it. This process halts when the maximum number of clusters has been formed; these are the incremental clusters.
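The add/remove loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' tool: the function name `kinship_clusters`, the choice of the lowest unassigned index as seed, the tie-breaking order, and the `max_steps` guard against add/remove cycles are all introduced here.

```python
from typing import List, Sequence


def kinship_clusters(S: Sequence[Sequence[float]], t: float,
                     max_steps: int = 10_000) -> List[List[int]]:
    """Build clusters one at a time from similarity matrix S and threshold t."""
    n = len(S)
    unassigned = set(range(n))
    clusters: List[List[int]] = []
    while unassigned:
        seed = min(unassigned)          # assumption: lowest index seeds C_open
        unassigned.discard(seed)
        c_open = [seed]
        # affinity[x] holds a(x), the sum of S[x][y] over y in C_open.
        affinity = [S[x][seed] for x in range(n)]
        for _ in range(max_steps):      # assumption: cap on add/remove steps
            limit = t * len(c_open)
            best = max(sorted(unassigned), key=lambda x: affinity[x], default=None)
            if best is not None and affinity[best] >= limit:
                # ADD: the unassigned element with the highest affinity is high.
                unassigned.discard(best)
                c_open.append(best)
                for x in range(n):
                    affinity[x] += S[x][best]
                continue
            worst = min(c_open, key=lambda x: affinity[x])
            if len(c_open) > 1 and affinity[worst] < limit:
                # REMOVE: the member with the lowest affinity is low.
                c_open.remove(worst)
                unassigned.add(worst)
                for x in range(n):
                    affinity[x] -= S[x][worst]
                continue
            break                       # nothing to add or remove: close C_open
        clusters.append(sorted(c_open))
    return clusters


# Hypothetical similarity matrix: users 0-2 are alike, users 3-4 are alike.
S = [
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.10],
    [0.10, 0.15, 0.20, 1.00, 0.90],
    [0.20, 0.10, 0.10, 0.90, 1.00],
]

print(kinship_clusters(S, 0.5))        # prints [[0, 1, 2], [3, 4]]
print(len(kinship_clusters(S, 0.95)))  # prints 5
```

Raising the threshold from 0.5 to 0.95 turns two clusters into five singletons on this toy matrix, which illustrates how the affinity threshold influences the number and size of the clusters.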

4. EXPERIMENTAL WORK

Mobile path sequence data consists of time series and categorical data. A user can access different services from various locations in alternating fashion, which forms multiple patterns. The similarity value for multiple users is calculated from the similarity matrix of the various patterns. As the number of users increases, evaluating the level of similarity takes more time.

Because similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We focus on the well-known distance measures used for patterns whose features are all continuous.

Figure 1: Level of similarity

In this synthetic data set, different users with similar patterns are further evaluated by giving different affinity threshold values in order to observe the number of cluster patterns formed; beyond a certain range of threshold values the same cluster pattern is repeated. From this, prediction of users' cluster patterns is possible. As the number of users grows, the number of cluster patterns increases.

[Plot: incremental clusters (vertical axis, 0 to 2) against threshold values (horizontal axis, 1 to 4)]

Figure 2: Prediction of cluster patterns

The temporal mobile transaction data set consists of a number of users and the different locations they access. Similarity levels are computed for the services available to users in different locations. Threshold values are varied to observe the number of incremental clusters for each value.

Figure 3: Transactions versus number of users

Every user creates more than one transaction at different times. A transaction created by one user at a particular point in time may differ from other transactions in service, user identity, and location. The incremental cluster patterns depend on the number of user transactions and on the threshold values given as input to the clustering algorithm.

Figure 4: Number of users and their similarity of services

A few users may access a similar service at different times. Identifying this is useful for predicting the particular service a user prefers and the service the user is currently working with, and it also makes cost estimation possible. Analysis of service tariffs and of the services most preferred by users is also possible.

Table 1: Mobile transactions dataset analysis

Users   Location   Services   Similarity levels   Threshold values
5       3          4          0.1                 0.3
10      4          5          0.2                 0.5
15      5          7          0.3                 0.6
20      6          8          0.4                 0.9
25      8          9          0.5                 1
30      10         10         0.7                 1.5

5. CONCLUSION

Combined mining is better than individual data mining methods for retrieving knowledgeable information from a data set with heterogeneous features and for measuring the technical and business interestingness of users. We developed a tool, Actionable Pattern Discovery, which uses the combined mining process to identify incremental cluster patterns [7] as informative patterns. The similarity matrix is designed with the LBSA algorithm [5], based on the time distance between similar patterns in the mobile sequence data set, and the similarity value is measured. The Cluster kinship search technique (CKST) is applied to cluster similar patterns, with minimum and maximum affinity threshold values, to obtain incremental cluster patterns. Our experimental results show that the level of similarity and the cluster patterns depend on the number of user transactions. Evaluation of users' pattern interestingness and prediction of user behavior are possible. These patterns are used to take efficient actionable decisions.

6. FUTURE WORK

This process will be extended to other kinds of real-time data sets, such as categorical and temporal data sets, to mine informative patterns and to study the prediction, merging, and interestingness of patterns.

REFERENCES

[1] L. Cao, Y. Zhao, H. Zhang, D. Luo, and C. Zhang, “Flexible frameworks for actionable knowledge discovery”, IEEE Trans., Sep. 2010.

[2] B. Lent, A. N. Swami, and J. Widom, “Clustering association rules”, in Proc. ICDE, 1997.

[3] H. Zhang, Y. Zhao, L. Cao, and C. Zhang, “Combined association rule mining”, in Proc. PAKDD, 2008.

[4] H. Zhang, L. Cao, H. Bohlscheid, “Combined pattern mining from learned rules to actionable knowledge”, AI, 2008.

[5] Eric Hsueh-Chan Lu, Vincent S. Tseng, Philip S. Yu, “Mining Cluster-Based Temporal Mobile Sequential Patterns in Location-Based Service Environments”, IEEE, 2011.

[6] L. Cao, P. S. Yu, C. Zhang, Y. Zhao, “Domain Driven Data Mining”, Springer, 2009.

[7] Haixun Wang, Wei Wang, Jiong Wang, “Clustering by pattern similarity in large data sets”.

[8] Dorina Marghescu, “Evaluation of projection techniques using Hubert's statistics”, 2007.

[9] J. Wang, G. Karypis, “HARMONY: Efficiently mining the best rules for classification”, SDM, 2005.

[10] Amir Ben-Dor, Ron Shamir, Zohar Yakhini, “Clustering gene expression patterns”, 1999.

[11] Hua Yuan, Junjie Wu, “Mining Maximal Frequent Patterns with Similarity Matrices of Data Records”
