
Survey of Distributed Clustering Techniques

Horatiu Mocian Imperial College London

1st term ISO report

Supervisor: Dr Moustafa Ghanem

Abstract

Research in distributed clustering has been very active in the last decade. As part of the broader distributed data mining paradigm, clustering algorithms have been employed in a variety of distributed environments, from computer clusters, to P2P networks with thousands of nodes, to wireless sensor networks. When examining different distributed clustering algorithms, many aspects need to be considered: the type of data on which the algorithm is applied (genes, text, medical records, etc.), the partitioning of the data (homogeneous or heterogeneous), the environment in which it has to run (LAN, WAN, P2P network), the privacy requirements, and many others. Accordingly, all of this information needs to be known about the algorithms in order to evaluate them properly. Although there are plenty of data clustering surveys, none focuses exclusively on distributed clustering. The aim of this report is to present the most influential algorithms that have appeared in this field. Its most important contribution is a taxonomy for classifying and comparing the existing algorithms, as well as for placing future distributed clustering techniques into perspective. The second major contribution is the proposal of a new parallel clustering algorithm, Parallel QT.

Contents
1 Introduction
  1.1 Data Mining
  1.2 Distributed Clustering
  1.3 Motivation
  1.4 Organization of the Report

2 Clustering
  2.1 Clustering Algorithms
  2.2 Similarity Measures
  2.3 Clustering Evaluation Measures
  2.4 Clustering High Dimensional Data Sets
      2.4.1 Feature Selection
      2.4.2 Coclustering
  2.5 Examples of Clustering Applications
      2.5.1 Document Clustering
      2.5.2 Gene Expression Clustering
  2.6 Summary

3 Distributed Clustering
  3.1 Distributed Data Mining
  3.2 Key Aspects of Distributed Clustering
      3.2.1 Distributed Data Scenarios
      3.2.2 Types of Distributed Environments
      3.2.3 Communication Constraints
      3.2.4 Facilitator/Worker vs P2P
      3.2.5 Types of Data Transmissions
      3.2.6 Scope of Clustering
      3.2.7 Ensemble Clustering
      3.2.8 Privacy Issues
  3.3 Distributed Clustering Performance Analysis
      3.3.1 Measuring Scalability
      3.3.2 Evaluating Quality
      3.3.3 Determining Complexity
  3.4 Summary

4 Survey of Distributed Clustering Techniques
  4.1 Initial Taxonomy
  4.2 Review of Algorithms
  4.3 Refined Taxonomy

5 Parallel QT: A New Distributed Clustering Algorithm
  5.1 Motivation
  5.2 Original QT Clustering
      5.2.1 Advantages of QT Clustering
  5.3 Parallel QT Clustering
      5.3.1 Challenges of the algorithm

6 Conclusion and Future Work
  6.1 Emerging Areas of Distributed Data Mining
      6.1.1 Sensor Networks
      6.1.2 Data Stream Mining
  6.2 Future Work

1 Introduction

Clustering, or unsupervised learning, is one of the most basic, and at the same time important, fields of machine learning. It splits the data into groups of similar objects, helping to extract new information by summarizing the data or by discovering new patterns. Clustering is used in a variety of fields, including statistics, pattern recognition and data mining. This paper focuses on clustering from the data mining perspective.

1.1 Data Mining

Data mining, also called knowledge discovery from databases (KDD), is a field that has gained a lot of attention in the last decade and a half [22]. It tries to discover meaningful information from large datasets, which are otherwise unmanageable. Knowledge can be extracted either by aggregating the data (the best example is OLAP, but clustering also achieves that), or by discovering new patterns and connections between data. As the amount of data available in digital format has increased exponentially in recent years, so has the employment of data mining techniques across a varied range of fields: business intelligence, information retrieval, e-commerce, bioinformatics, counter-terrorism, and so on.

Two related subfields of data mining that have developed rapidly, aided by an ever-higher Internet penetration rate, are text mining and web mining. The former tries to discover new information by analyzing sets of documents. In addition to common techniques, text mining has specific tasks like entity extraction, sentiment analysis, or generation of taxonomies [38]. Web mining derives new knowledge by exploring the Internet. Kosala and Blockeel [55] split this field into three subcategories: web content mining, web structure mining and web usage mining.

The rise of the Internet also brought two major problems to data mining: first, the amounts of data became so large that even high-performance supercomputers could not process them. Second, the data was stored at multiple sites and it became increasingly infeasible to centralize it in one place. Bandwidth limitation and privacy concerns were among the factors that hindered centralization. To solve these problems, distributed data mining has emerged as a hot research area. Distributed Data Mining [48, 37] assumes that either the computation or the data itself is distributed. It can be used in environments ranging from parallel supercomputers to P2P networks, and it can be applied in areas like distributed information retrieval and sensor networks. It caters for communication and privacy constraints, or for resource constraints like battery power. Consequently, new frameworks and new algorithms needed to be developed to work under these conditions.

1.2 Distributed Clustering

As clustering is an essential technique for data mining, distributed clustering algorithms were developed as part of the distributed data mining research. The first of them were modified versions of existing algorithms: parallel k-means [20] or parallel DBSCAN [72]. Gradually, other algorithms have surfaced, especially for P2P systems. Distributed clustering algorithms usually work in the following way (a minimal sketch of this pattern is given below):

1. A local model is computed at each node.
2. The local models are aggregated by a central node (or a super-peer in P2P clustering systems).
3. Either a global model is computed, or the aggregated models are sent back to all the nodes to produce locally optimized clusters.
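The following is a minimal sketch of this three-step pattern, assuming a facilitator/worker setup in which the workers exchange size-weighted centroids; the function names and the merging strategy are illustrative and not taken from any specific algorithm surveyed in this report.

```python
import numpy as np
from sklearn.cluster import KMeans

def local_model(data, k):
    """Step 1: each worker clusters its own partition and returns a compact
    prototype (here: centroids weighted by cluster size)."""
    km = KMeans(n_clusters=k, n_init=10).fit(data)
    sizes = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_, sizes

def facilitator_merge(prototypes, k):
    """Step 2: the facilitator aggregates the local prototypes by clustering
    the weighted centroids into a global model."""
    centers = np.vstack([c for c, _ in prototypes])
    weights = np.concatenate([s for _, s in prototypes])
    km = KMeans(n_clusters=k, n_init=10).fit(centers, sample_weight=weights)
    return km.cluster_centers_

def assign_locally(data, global_centers):
    """Step 3: the global model is sent back and each worker labels its own points."""
    d = np.linalg.norm(data[:, None, :] - global_centers[None, :, :], axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
partitions = [rng.normal(loc=i, size=(100, 2)) for i in range(3)]  # three workers
prototypes = [local_model(p, k=2) for p in partitions]
global_centers = facilitator_merge(prototypes, k=3)
local_labels = [assign_locally(p, global_centers) for p in partitions]
```

Real systems differ mainly in what the local model is (centroids, dendrograms, density estimates) and in how many rounds of communication are required.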

1.3 Motivation

I have decided to do in-depth research in distributed data clustering for several reasons:

- Working in the field of data mining, I had also touched upon distributed clustering, which seemed a very interesting topic.
- Distributed clustering has been a very active field, but no surveys of this topic exist.
- I am proposing a new distributed clustering algorithm, and I needed to know how it stands in comparison to other algorithms. In order to achieve that, I created a taxonomy for classifying distributed clustering algorithms.

The contributions of this paper are a new taxonomy for distributed clustering algorithms and the proposal of a new algorithm for distributed clustering. This is the first survey that concentrates exclusively on distributed clustering, although the PhD theses of Hammouda [31] and Kashef [52] provide good reviews of the field.

1.4 Organization of the Report

The rest of the paper is organized as follows: Section 2 contains a brief overview of clustering; Section 3 presents aspects of distributed clustering, followed by a survey of the most important papers in the field and a new taxonomy in Section 4. A new distributed algorithm is presented in Section 5, while the conclusions and future research directions are given in Section 6.

2 Clustering

Clustering, or unsupervised learning, is the task of grouping together related data objects [39]. Unlike supervised learning, there is no predefined set of discrete classes to assign the objects to. Instead, new classes, in this case called clusters, have to be found. There are many possible definitions of what a cluster is, but most of them are based on two properties: objects in the same cluster should be related to each other, while objects in different clusters should be different. Clustering is a very natural way for humans to discover new patterns, and has been studied since ancient times [35]. Regarded as unsupervised learning, clustering is one of the basic tasks of machine learning, but it is used in a whole range of other domains: data mining, statistics, pattern recognition, bioinformatics or image processing. The focus of this survey is the application of clustering algorithms in data mining.

2.1 Clustering Algorithms

There are many methods and algorithms for clustering. Some of them will be presented later on, but this section is far from being a complete review of the clustering algorithms available. All the algorithms that are relevant to the subject of this paper, distributed clustering, and are not presented in this section will be detailed in the following sections.

The most widespread clustering algorithms fall into two categories: hierarchical and partitional clustering. Hierarchical clustering builds a tree of clusters, known as a dendrogram. A hierarchical clustering algorithm can either be top-down (divisive) or bottom-up (agglomerative). The top-down approach starts with one cluster containing all the points from a set, and then splits it iteratively until a stopping condition has been fulfilled. Agglomerative clustering begins with singleton clusters, and then, at each step, two clusters are merged. The clusters to be joined are chosen using a predefined measure. Again, a stopping criterion should be specified. The advantage of hierarchical clustering is that the clusters can be easily visualized using the dendrogram, and it is very suitable for data that inherently contains hierarchies. Its disadvantages are the difficulty of deciding upon a termination criterion and its O(n^2) complexity in both running time and data storage. Hierarchical clustering stores the distances between data points in an N x N matrix, where N is the number of data points that need to be processed. Although in theory this matrix may be too large to be stored in memory, in practice its sparsity is increased using different methods, so that it occupies much less space. These methods include omitting entries below a certain threshold, or storing only the distances to the nearest neighbours of each data point [61]. To compute the distance between two clusters, as opposed to individual points, a measure called a linkage metric must be used. The most used linkage metrics are:

- single link: the minimum distance between any two points from the two clusters
- complete link: the maximum distance between any two points from the two clusters
- average link: the mean of the distances between points from the two clusters.

A complete survey of linkage metrics can be found in [60].

Partitional clustering splits the data into a number of subsets. Because it is infeasible to compute all the possible combinations of subsets, a number of relocation schemes are used to iteratively optimize the clusters. This means that the clusters are revisited at each step and can be refined, which constitutes an advantage over hierarchical clustering. Another difference between hierarchical and partitional methods is the way the distances between clusters are calculated. The former approach computes the distances between the individual points of a group, while the latter uses a single representative for each cluster. There are two techniques for constructing a representative of a cluster: taking the data point that represents the cluster best (k-medoids) or computing a centroid, usually by averaging all the points in the group (k-means).

K-means [23] is the most widely used clustering algorithm, offering a number of advantages: it is straightforward, has lower complexity than other algorithms, and is easy to parallelize. The algorithm works as follows: first, k points are generated randomly as centres for the k clusters. Then, the data points are assigned to the closest clusters and the centroids of the clusters are recomputed. At each subsequent step the points are re-assigned to their closest cluster and the centroids are recomputed. The algorithm stops when there are no more movements of data points between clusters. Although highly popular, the k-means algorithm has a multitude of issues [9]. First, the choice of k is not a trivial one. Second, the accuracy of the algorithm depends heavily on the initial cluster centres that are chosen; if this is done randomly, the results are hard to control. Third, the algorithm is sensitive to outliers. Finally, it is not scalable.
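The iterative procedure just described can be condensed into a short sketch. The following is a minimal, illustrative NumPy implementation; initialising the centres by sampling k data points is one common choice and not tied to any particular paper cited here.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: assign points to the nearest centre, recompute centroids,
    stop when no point changes cluster."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # no movement: converged
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers
```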

2.2 Similarity Measures

The concepts of similarity and distance are fundamental in data clustering, and they are a basic requirement for any algorithm [52]. A similarity measure assesses how close two data points are to each other. A distance measure, or dissimilarity, does exactly the opposite. Usually, either one or the other measure is used; in any case, each measure can easily be derived from the other. This section reviews the most widely used similarity measures. Because these measures are used in the local computation of cluster models or in the aggregation step, there are no measures that are specific to distributed algorithms.

The simplest similarity measure is also the most widely used. If we visualize the data points as points in an N-dimensional space, then the Minkowski distance between them can be computed. Accordingly, objects with the smallest distance have the highest similarity [14]. Particular cases of the Minkowski distance are r = 1 (Manhattan distance) and r = 2 (Euclidean distance). Another measure that is widely used is the cosine similarity measure. Again, data points are considered in a coordinate system with N axes (dimensions). Then, for each point a representative vector starting at the origin of the coordinate system can be drawn. The cosine similarity between two points is the cosine of the angle between the representative vectors of the respective points. This measure is heavily employed in document clustering. The Jaccard measure [14, 56] divides the number of features that are common to both points by the number of their total features (in its binary form). Formulas for these metrics are given in Table 1.

Metric               Formula
Minkowski            $\|x - y\|_r = \left( \sum_{i=1}^{d} |x_i - y_i|^r \right)^{1/r}$
Cosine Similarity    $cosSim(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
Jaccard Measure      $JaccardSim(x, y) = \frac{\sum_{i=1}^{d} \min(x_i, y_i)}{\sum_{i=1}^{d} \max(x_i, y_i)}$

Table 1: Similarity Metrics
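As a concrete illustration, the three metrics of Table 1 can be computed directly; this is a small NumPy sketch, using the generalised (non-binary) form of the Jaccard measure.

```python
import numpy as np

def minkowski(x, y, r=2):
    """Minkowski distance; r=1 gives the Manhattan and r=2 the Euclidean distance."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_similarity(x, y):
    """Generalised form; on 0/1 feature vectors it reduces to the classic Jaccard coefficient."""
    return np.sum(np.minimum(x, y)) / np.sum(np.maximum(x, y))

x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 0.0, 3.0])
print(minkowski(x, y, r=1), cosine_similarity(x, y), jaccard_similarity(x, y))
```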

2.3 Clustering Evaluation Measures

In order to assess the performance, or quality, of different clustering algorithms, objective measures need to be established. There are 3 types of quality measures: external, when there is a priori knowledge about the clusters, internal, which assumes no knowledge about the clusters, and relative, which evaluates the differences between different clusters. Among the quality measures used in clustering are [52]:

1. External Quality Measures

   - Precision: the number of correct assignments out of the total number of assignments made by the system.
   - Recall: the number of correct assignments made by the system, out of the number of all possible assignments.
   - F-measure: a combination of the precision and recall measures used in machine learning.
   - Entropy: indicates the homogeneity of a cluster; a low entropy indicates a high homogeneity, and vice versa.
   - Purity: the average precision of the clusters relative to their best matching classes (purity and entropy are illustrated in the worked example after this list).

2. Internal Quality Measures

   - Intracluster Similarity: the most common quality measure; it represents the average of the similarities between any two points of a cluster.
   - Partition Index: reflects the ratio of the sum of compactness and separation of the clusters.
   - Dunn Index: tries to identify compact and well separated clusters. The Dunn Index has O(n^2) complexity, which can make it infeasible for large n; also, the diameter of the largest cluster can be influenced by outliers.
   - Separation Index: measures the inter-cluster dissimilarity and the intra-cluster similarity by using cluster centroids. It is computationally more efficient than the Dunn Index.

3. Relative Quality Measures

   - Cluster Distance Index: measures the difference between two clusters. For centroid-based clusters, the distance is just $\|c_i - c_j\|$. If the clusters are not represented by centroids, then the linkage paradigm from hierarchical clustering can be applied: the cluster distance can refer to the minimum, maximum or average distance between any two points of the two clusters.
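Below is a small worked example of two of the external measures, assuming the ground-truth classes are available: purity sums, over the clusters, the size of each cluster's majority class, while entropy is the size-weighted average of the class entropies. The toy labels are illustrative.

```python
import numpy as np
from collections import Counter

def purity(labels, classes):
    """Average precision of the clusters relative to their best matching classes."""
    correct = 0
    for c in set(labels):
        members = [cls for lbl, cls in zip(labels, classes) if lbl == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels)

def entropy(labels, classes):
    """Size-weighted average class entropy of the clusters (0 = perfectly homogeneous)."""
    h = 0.0
    for c in set(labels):
        members = [cls for lbl, cls in zip(labels, classes) if lbl == c]
        p = np.array(list(Counter(members).values())) / len(members)
        h += (len(members) / len(labels)) * float(-(p * np.log2(p)).sum())
    return h

labels  = [0, 0, 0, 1, 1, 2, 2, 2]                   # cluster assignments
classes = ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'a']   # ground-truth classes
print(purity(labels, classes), entropy(labels, classes))
```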

2.4 Clustering High Dimensional Data Sets

The majority of clustering algorithms work optimally with data sets where each record has no more than a few tens of attributes (features or dimensions). However, there are many data sets for which this number is in the hundreds or thousands, like gene expressions or text corpora. The high number of features poses a problem known as the dimensionality curse: in high-dimensional space, there is a lack of data separation. More specifically, the difference in distance between the nearest neighbours of a point and the other points in the space becomes very small, and thus insignificant [4].
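A quick numerical illustration of this effect (a toy experiment, not taken from [4]): as the dimension grows, the relative contrast between the nearest and the farthest neighbour of a query point shrinks.

```python
import numpy as np

# Relative contrast (d_max - d_min) / d_min between a query point and a random
# sample shrinks as the number of dimensions grows, which is why distance-based
# clustering degrades in high-dimensional spaces.
rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    print(f"dim={dim:5d}  relative contrast={(d.max() - d.min()) / d.min():.3f}")
```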

2.4.1 Feature Selection

To reduce the number of dimensions, techniques called dimensionality reduction, or feature selection, are used. The most representative algorithms for feature selection are Principal Component Analysis (PCA) [44] and Singular Value Decomposition (SVD) [10].
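As an illustration, PCA can be obtained from the SVD of the mean-centred data matrix; the sketch below is a minimal NumPy version, whereas real pipelines would normally rely on a library implementation.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components, obtained from
    the SVD of the mean-centred data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T   # coordinates in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))               # 200 records with 50 features each
print(pca_reduce(X, n_components=5).shape)   # (200, 5)
```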

2.4.2 Coclustering

Coclustering is another technique for dimensionality reduction, involving the clustering of the set of attributes as well as the clustering of the data points. Coclustering is an old concept [5], and exists under many names: simultaneous clustering, block clustering or biclustering. One of the most common uses of this technique is gene expression analysis, where both samples and genes have to be clustered in order to obtain compelling results. According to [58], there are four types of cluster blocks (biclusters): biclusters with constant values, biclusters with constant values on rows or columns, biclusters with coherent values, and biclusters with coherent evolutions. If we imagine the rows and the attributes in a matrix, the first three analyze the numeric values of the matrix, while the last one looks at behaviours, considering the elements of the data matrix as symbols.

2.5 Examples of Clustering Applications

Clustering is applied in a variety of fields. In fact, because clustering is an essential component of data mining, it is applied wherever DM is used. In this section, we will take a closer look at two applications of clustering: in text mining and in gene expression analysis.

2.5.1 Document Clustering

Document clustering [67] is a subfield of text mining. It partitions documents into groups based on their textual meaning. It is applied in information retrieval (Vivisimo), automated creation of taxonomies (Yahoo!, DMOZ) and news aggregation systems (Google News) [31]. A feature that sets document clustering apart is its high dimensionality. A corpus with several hundred documents can have thousands of different words. Nowadays, a set of 1 million documents is not unusual. If the corpus is a general one, and not a specialized one containing articles from a single field, it can have more than 20,000 words.

There are several ways in which the number of attributes (i.e. words) is reduced when working with document data sets. The first one is text preprocessing: removal of stopwords (common words like it, has, an), stemming the words (reducing them to their lexical roots; for example, both employee and employer can be stemmed to employ) and pruning (deleting the words that appear only a couple of times in the entire corpus). Except for stemming, these techniques can be applied very easily to other types of clustering: stopword removal and pruning basically mean eliminating the attributes that appear too often and too seldom, respectively, in the data set. After preprocessing, feature selection/extraction can be applied. There are many sophisticated algorithms for doing that, but in [73] Yang shows that TF (term frequency) is as good as any. The third way of reducing dimensionality is to use geometric embedding algorithms for clustering. LSI (Latent Semantic Indexing) is the best example of its kind: applying it on text results in a representation with fewer words, but with the same semantic meaning.

Another aspect of document clustering is represented by the text representation models, which are specific to text documents: vector space models, N-grams, suffix trees and document index graphs. The most used document data model, the Vector Space Model, was introduced by Salton [63]. Each document is represented by a vector $d = (tf_1, tf_2, \ldots, tf_n)$, where $tf_i$ is the frequency of each term (TF) in the document. In order to represent the documents in the same term space, all the terms from all the documents have to be extracted first. This results in a term space of thousands of dimensions. Because each document usually contains at most several hundred words, this representation leads to a high degree of sparsity. Usually, another factor is added to the term frequency: the inverse document frequency (IDF). Thus, each component of the vector d becomes $tf_i \cdot idf_i$. The intuition behind this formula is that the more a term appears in other documents, the less relevant it is for a particular document.

N-grams represent each document as a sequence of characters [69]. From this, n-character sub-sequences are obtained. The similarity between two documents is equal to the number of n-grams that they have in common. The major advantage of this approach is its tolerance to spelling errors. The last two approaches are phrase-based representation models. The first one builds a suffix tree where each node represents a part of a phrase and is linked to all the documents containing the respective suffix [75]. The Document Index Graph model was proposed by Hammouda and Kamel in [32]. Using this model, each document is represented as a graph, with words represented as nodes and sequences of words (i.e. phrases) as edges. This model can be used for keyphrase extraction and cluster summarization. The advantage of phrase-based models is that they capture the proximity between words. There have been distributed clustering algorithms developed specifically for document clustering, like the ones from Kargupta et al. [49] and Hammouda [31].
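A toy sketch of the Vector Space Model with TF-IDF weighting and cosine similarity follows; the three example documents and the exact IDF variant (log of the collection size over the document frequency) are illustrative choices, not taken from [63] or [73].

```python
import math
from collections import Counter

docs = [
    "distributed clustering of text documents",
    "clustering gene expression data",
    "distributed data mining in sensor networks",
]

# Term frequencies per document and document frequencies per term.
tf = [Counter(doc.split()) for doc in docs]
vocab = sorted({term for doc in tf for term in doc})
df = {term: sum(1 for doc in tf if term in doc) for term in vocab}

# TF-IDF vectors in a shared term space.
vectors = [[doc[t] * math.log(len(docs) / df[t]) for t in vocab] for doc in tf]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```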

2.5.2 Gene Expression Clustering

Gene clustering is another field that is very suitable for employing clustering techniques. The development of DNA microarray technology has made it possible to monitor the activity of thousands of genes simultaneously, across collections of samples. But analyzing the measurements obtained requires large amounts of computing power, as they can easily reach the order of millions of data points. Consequently, DDM algorithms need to be employed for gene expression analysis. A good survey of clustering techniques for gene data can be found in [42], which goes through all the steps involved in gene expression analysis.

There are three basic procedures for studying genes with microarrays: first, a small chip is manufactured, onto which tens of thousands of DNA probes are attached in fixed grids. Each cell in the grid corresponds to a DNA sequence. Second, targets are prepared, labeled and hybridized. A target is a sample of cDNA (complementary DNA) obtained from the reverse transcription of two mRNA (messenger RNA) samples, one for control and one for testing. Third, chips are scanned to read the signal intensity that is emitted from the targets. The data obtained from a microarray experiment is expressed using a gene expression matrix: rows represent genes and columns represent samples. Cell (i, j) contains the expression of gene i in sample j. With this matrix representation, we end our foray into the field of genetics, and return to the more familiar clustering.

From the gene expression matrix, one property can be observed: the high dimensionality of the data. As stated before, tens of thousands of genes are observed simultaneously in a microarray experiment. Another characteristic that sets gene expression data apart from other types of data is the fact that clustering both samples and genes is relevant. Samples can be partitioned into homogeneous groups, while coexpressed genes can be grouped together. Accordingly, there are three types of gene expression clustering (the first two are illustrated in the sketch at the end of this subsection):

- gene-based clustering: samples are considered features; genes, considered objects, are clustered
- sample-based clustering: genes are considered features; samples are clustered
- subspace clustering: genes and samples are treated symmetrically; clusters are formed by a subset of genes across a subset of samples.

Other problems that influence clustering in this area are the high amount of noise introduced by the microarray measurements, and the high connectivity between clusters: they can have significant overlap or be included in one another. Most clustering techniques have been applied to gene expression data: k-means [66], hierarchical clustering [62], self-organizing maps. Other clustering algorithms have been developed specifically for gene expression analysis: Quality Threshold [36], or DHC (density-based hierarchical clustering) [41]. Traditional clustering techniques cannot be used for subspace clustering. Thus, new techniques have been developed, like Coupled Two-Way Clustering [27] or [57]. These are coclustering algorithms, also covered in Section 2.4.2.
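The toy sketch below illustrates the difference between gene-based and sample-based clustering on a random expression matrix; k-means is used purely for illustration, and subspace clustering would require the coclustering techniques discussed in Section 2.4.2.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy gene expression matrix: rows = genes, columns = samples.
expression = rng.normal(size=(500, 12))

# Gene-based clustering: samples act as features, genes are the clustered objects.
gene_clusters = KMeans(n_clusters=5, n_init=10).fit_predict(expression)

# Sample-based clustering: genes act as features, so the matrix is transposed.
sample_clusters = KMeans(n_clusters=3, n_init=10).fit_predict(expression.T)

print(gene_clusters.shape, sample_clusters.shape)   # (500,), (12,)
```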

This section is not meant to be an exhaustive review of gene expression clustering. More information about this field can be found in the survey by Jiang et al. [42]. The clustering survey by Xu et al. [71] covers gene clustering algorithms in section I. For a review of coclustering techniques, the reader is referred to [58]. Other information about gene clustering can be found in [6] and [12].

Although they come from totally different areas, document clustering and gene clustering have one thing in common: high dimensionality. Both of them can have well over one thousand features, and the number can easily exceed ten thousand. Moreover, each data row has only a small number of features: a text document cannot contain all the words in the dictionary, while genes do not express in all the samples. This common feature should encourage researchers working in these fields to draw inspiration from each other. There are many clustering algorithms applied in both fields, like k-means or hierarchical clustering. An example of cooperation between these two fields is found in [16], where document clustering is performed using an algorithm that was originally developed for gene expression analysis.

2.6 Summary

This section gives an overview of data clustering, an essential task in data mining. The most important clustering techniques, hierarchical and partitional, are detailed with an example of each: hierarchical agglomerative clustering and k-means, respectively. Next, the similarity measures used for clustering are presented: Minkowski metrics, cosine similarity and the Jaccard measure. The three categories of clustering evaluation measures introduced in this section are: external, internal and relative. High dimensionality and feature selection are also touched upon, before illustrating popular applications of data clustering: document clustering and gene expression analysis.

3 Distributed Clustering

3.1 Distributed Data Mining

Distributed Data Mining (DDM), sometimes called Distributed Knowledge Discovery (DKD), is a research field that has emerged in the last decade. The reasons for the rapid growth of activity in DDM are not hard to find: the development of the Internet, which made communication over long distances possible; the exponential increase of the digital data available (and the necessity to process it), which outstripped the increase in computing power; and the need for companies and organizations to work together and share some of their data in order to address common tasks, such as detecting financial fraud or finding connections between diseases and symptoms.

An example of the quantities of data that need to be processed by a single system can be found in astronomy [37]: the NASA Earth Observing System (EOS) collects data from a number of satellites and holds 1450 data sets that are stored and managed by different EOS Data and Information System (EOSDIS) sites, located at different geographic locations across the USA. A single pair of satellites produces 350 GB of data per day, making it extremely difficult or impossible to process all the EOS data centrally: even if the communication and storage issues were solved, the computing power requirements could not be met. Another scenario in which DDM can be employed is clustering medical records. This could be done in order to find similarities between patients suffering from the same disease, which could be further used to discover new information about the respective disease. In this case, there are two major problems: first, there is no central entity storing all the patient records; they are kept by GPs or by hospitals, so the data is inherently distributed. Second, privacy is essential in this case, so a central site should receive local models of the data, not the data itself. Privacy has been researched intensively lately and is covered later in this section.

The main goal of distributed data mining research is to provide general frameworks on top of which specific data mining algorithms can be built. Examples of such frameworks can be found in the papers by [29], Cannataro [13] or Kargupta [15]. From the point of view of the partitioning of data, DDM techniques can be classified as homogeneous or heterogeneous. If we imagine a database table, splitting data into homogeneous blocks is equivalent to horizontal partitioning, where each block contains the same columns (attributes) but different records. Heterogeneous data is obtained when vertical partitioning is applied, where each block contains different attributes of the same records. A unique identifier is used to keep track of a particular record across all sites. The same analogy can be applied to data stored in formats other than a database table.

A related field that has significant overlap with DDM is parallel data mining (PDM). When the computers in a network are tightly coupled (e.g. in a LAN, web farm or cluster), parallel computing techniques can be used for DDM. Parallel data mining was developed more than a decade ago in the attempt to run data mining algorithms on supercomputers. This is what is called a fine grain approach [46]. The coarse grain approach in PDM is when several algorithms are run in parallel and the results are combined. This approach is heavily used in distributed data mining. In this paper, both techniques will be referred to as DDM.

In [48], the most important types of DDM algorithms are presented: distributed clustering, distributed classifier learning, distributed association rules and computing statistical aggregates. The focus of this survey is on distributed clustering. Some of the new environments in which distributed data mining has been applied are also detailed in [48]. Among them are sensor networks, which consist of numerous small devices connected wirelessly. A main characteristic of the sensors is that the power required for wireless communication is much greater than the power required for local computations. Of course, in this environment severe energy and bandwidth constraints apply. Other recent applications of DDM are in grid mining applications in science and business, in fields like astronomy, weather analysis, customer buying patterns or fraud detection.

There is a wide range of distributed data mining surveys in the literature. Paper [37] has already been quoted above. The survey by Zaki [74] discusses a large variety of aspects of DDM. A range of DDM techniques are discussed in the book [47] by Kargupta and Chan. A more recent overview of the field by Datta et al. can be found in [17].

3.2 Key Aspects of Distributed Clustering

Distributed clustering is the partitioning of data into groups in a distributed environment. The latter means that either the data itself or the clustering process is distributed [citation]. Although relatively new, this field has been researched intensively in recent years, as the need to employ distributed algorithms has grown significantly. The evolution of networking and storage equipment fostered the development of very large databases held in different physical locations. It is infeasible to centralize these data stores in order to analyze them. Additionally, even if they could be centralized, no single computer could meet the storage and processing requirements of such a task.

3.2.1 Distributed Data Scenarios

Distributed clustering is applied when either the data that needs to be processed is distributed, or the computation is distributed, or both. If neither of the two is distributed, we are talking about centralized clustering. The possible distribution scenarios are included in Table 2 (from [31]).

                          Centralized Data    Distributed Data
Centralized Clustering    CD-CC               DD-CC
Distributed Clustering    CD-DC               DD-DC

Table 2: Scenarios of Data and Computation Distribution

3.2.2 Types of Distributed Environments

The environments in which distributed clustering runs can be classified, by taking into account their hardware setup and network topology, into the following types:

- Computers with parallel processors and shared storage and/or memory
- Tightly-coupled computers in a high-speed LAN network (clusters)
- Computers connected through WAN networks
- Peer-to-peer networks over the Internet

As will be shown, the environment has a great influence on the choice of one or another clustering algorithm.

3.2.3 Communication Constraints

Depending on the data distribution, the nature of the data, and the network environment in which it is run, communication in distributed clustering algorithms is constrained by a number of factors:

- Bandwidth limitations: in distributed networks, computers may have different connection speeds, which might not allow them to transfer large quantities of data in an acceptable time. This is an issue for wireless ad-hoc networks, as well as sensor networks.
- Communication costs: these become a problem especially in sensor networks, where the energy consumption of communication is much higher than that required for computation [48].
- Privacy concerns: when clustering is computed on data from multiple sources, each of the sources might want to keep sensitive data hidden, by revealing only a summary of it, or a model.

3.2.4 Facilitator/Worker vs P2P

There are two main architectures for distributed clustering: facilitator/worker and P2P [31]. The former assumes that there is a central process (the facilitator) which coordinates the workers. While this approach benefits from splitting the work across multiple workers, it has a single point of failure and is not suitable in P2P networks, raising privacy and communication concerns. To overcome these issues, P2P distributed clustering algorithms were developed, resembling the structure of P2P networks: a large number of nodes connected in an ad-hoc way (node failures and joins are very frequent), each node communicating only with its neighbours, and the failure of a node being handled gracefully. P2P clustering algorithms can in turn be split into structured and unstructured ones. The structured ones use superpeers, each controlling a set of common peers. The nodes can communicate only with other nodes that have the same superpeer, while the superpeers communicate among themselves. This is a recurrent theme in the research of Hammouda et al.

3.2.5 Types of Data Transmissions

There are several options for the type of data transmitted between two distributed processes [52] throughout the clustering process:

- Whole data set: the simplest way of communicating is for the peers to exchange the entire data that needs to be clustered. At the same time, it is the most inefficient method: it offers solutions for none of the constraints listed above. Accordingly, the possibility of using it is severely limited.
- Representatives: a way to mitigate the costs of communication and bandwidth constraints is to exchange only the most representative data samples between processes. However, choosing the data sampling algorithm is not trivial. Additionally, the problem of privacy remains unsolved.
- Cluster prototypes: these are computed locally prior to being sent over. The prototypes themselves depend on what algorithm is used locally; they could be centroids, dendrograms or generative models. This approach caters for all the communication constraints and is currently the most widely used.

[15] takes an in-depth look at the number of rounds of communication required by various distributed clustering algorithms. In this regard, there are two types of models: those that require one round of communication, and those requiring multiple rounds and, consequently, more frequent synchronization. Algorithms which use only one round of communication are generally based on ensemble clustering and exchange cluster prototypes. Therefore, a local computation needs to be performed and its results sent to the central node or super-peer, where the global model is computed. Locally optimized P2P systems need at least two rounds of communication, as the information obtained from the global computation is sent back to the other nodes to help refine the local clusters.

3.2.6 Scope of Clustering

Looking at the model that is obtained, distributed clustering can be divided into local clustering and global clustering. Local clustering means that only a clustering of the local data and, possibly, some data from neighbours is computed. In this case, the algorithm uses global data, obtained by aggregating local models, to optimize the local clustering results. The computation of local clusterings is specific to some peer-to-peer systems, where each node has its own data and wants to use aggregate information in order to improve it, or to exchange documents which are of interest to neighbouring nodes. For example, if a node has all its data grouped in clusters, but has some outliers, these could be moved to another node where they fit into the clusters that were already created there. Global clustering creates a global model of the data. Here, the property of exact or approximate clustering can be introduced. A distributed clustering is said to be exact if its results are equivalent to the results that would be obtained if the data were first centralized and then a clustering algorithm applied to it.

3.2.7 Ensemble Clustering

A technique which is often used in distributed clustering is called ensemble clustering. Ensemble clustering is the application of more than one clustering algorithm to a set of data. In contrast, hybrid clustering uses only one of many algorithms at any given time, while the others are idle. Both ensemble clustering and hybrid clustering are types of cooperative clustering. Many distributed clustering algorithms perform a local computation first, using any of the available clustering algorithms, and then the results are sent to another node, which clusters them, possibly using another clustering algorithm. Some of the distributed clustering frameworks do not care which algorithm is used locally; instead, they focus on the communication between nodes and on how the global clustering is performed.

3.2.8 Privacy Issues

The issue of privacy is as old as the Internet itself, but in the last years it has become the most important problem in any discussion about the Internet and web systems. This has happened for several reasons: first, the amount of digital data stored in distributed networks has increased exponentially, and not just ordinary data; highly sensitive data from government agencies, banks or other companies can be found on computers. Second, as connections to the Internet have improved, these data stores are more prone to attack from malicious users. In data mining, privacy concerns may arise from different situations, illustrated by two examples. In the first one, suppose that a credit rating agency wants to create credit maps, where for each neighbourhood of a certain city a map reflecting the credit scores in that area is plotted. Of course, the agency would need to collaborate with banks to obtain credit-related data. In this case, the banks can supply the required information in a summarized way by computing an average over all the customers in each area. Revealing the aggregated data would not compromise any information about individual customers. In the second example, two or more Internet users want to perform a local clustering of their documents and produce a common taxonomy by summarizing them. Of course, none of the users would like the others to see their personal files. Fortunately, data mining has an intrinsic feature that makes privacy preservation easier to implement: it summarizes data, looking for trends or patterns. Consequently, individual data points need not be revealed. This property is transferred to distributed clustering, too. For this to be enforced, only cluster prototypes must be exchanged between participating nodes. More details about privacy preservation in distributed clustering can be found in [15].

3.3 Distributed Clustering Performance Analysis

The performance of a distributed clustering algorithm can be evaluated in different ways. Clustering performance criteria can be classified into three types, depending on what they measure: scalability, quality and complexity.

Among the most important measurements are the ones that concern scalability: speedup and scaleup. The first one measures the reduction in computation time when the same amount of data is used but more processes are added, while the second one captures the variation of the computation time when the amount of data per process is constant but more processes are added. These criteria are arguably the most important when evaluating distributed clustering algorithms.

For evaluating the quality of a distributed clustering algorithm, several methods are used. If the correct clustering results are known, then an external measure like the F-measure can be applied. Otherwise, internal similarity measures can be used: intracluster similarity, partition index, Dunn index, etc. Finally, where possible, the results of the distributed algorithm can be compared with the results obtained when the same data is processed by a traditional algorithm. When a global distributed clustering is said to be exact, its quality will be equal to that of its centralized counterpart.

Finally, the complexity of distributed clustering approaches is obtained by analyzing the algorithm. Algorithms that evolved from centralized approaches can use the knowledge about the complexity of the centralized algorithm. For some distributed clustering approaches it is impossible to determine a global complexity, because they allow any algorithm to be used at the local processing step. The best examples are the algorithms that fall under the ensemble clustering category [45, 68]. In addition to calculating the complexity of the computation, for some algorithms the complexity of the communication is also evaluated. For distributed algorithms, the cost of communication is essential.

3.3.1 Measuring Scalability

In general, experimental tests of distributed algorithms measure the speedup of the algorithms. This is very important because distribution incurs communication costs; if these are too high, then the employment of a certain algorithm in distributed environments becomes infeasible. The speedup of clustering algorithms can be almost linear: the parallel k-means algorithm by Dhillon and Modha [20] achieves a speedup of 15.62 when using 16 nodes for processing. This was achieved because the parallelization of k-means is inherently natural, so communication is reduced. Other clustering algorithms have more modest results. For example, the cooperative clustering algorithm proposed in [52] achieves a speedup of between 30 and 40 on a setup of 50 nodes. These results are explained by the fact that the algorithm works in P2P networks, where communication is more intensive.

It is difficult to compare the speedup results obtained by various algorithms, for a number of reasons:

- The environment in which they run has a significant impact on the costs. There is a big difference in communication speed between a cluster of computers tightly connected by a gigabit LAN and the same number of computers connected through the Internet.
- Experiments are performed using different data sets. Although there are some standard corpora (Reuters RCV1, 20-newsgroups), not all the algorithms have been tested on them. A number of algorithms are tested on synthetic data, like Gaussian distributions. The nature and amount of data influence speedup significantly, so testing algorithms on the same dataset is the only way to compare them.
- Speedup has to be viewed in the context of quality. There is a trade-off between the two, so some algorithms might reduce speedup in order to improve quality, or vice versa.

Another measure of scalability is called scaleup. This measures whether the running time of a distributed algorithm remains the same when more data is added, proportionally with the addition of new processing nodes. This measure is not as widely used as speedup. Scaleup tests have been performed in [20].
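In symbols, writing $T_1$ for the running time on a single node and $T_p$ for the running time on $p$ nodes, the two scalability measures discussed above are commonly expressed as follows (a standard formulation rather than one taken from a specific paper cited here):

\[
\mathrm{speedup}(p) = \frac{T_1(n)}{T_p(n)}, \qquad
\mathrm{scaleup}(p) = \frac{T_1(n)}{T_p(p \cdot n)},
\]

where $n$ is the amount of data processed. Values close to $p$ for speedup and close to 1 for scaleup indicate near-ideal scaling; for instance, the reported speedup of 15.62 on 16 nodes for parallel k-means is very close to the ideal value of 16.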

3.3.2 Evaluating Quality

Quality, or accuracy, of clustering algorithms can be measured externally, when the correct assignments are known beforehand, or internally, where measures like intracluster similarity can be used. The reader is referred to Section 2.3 for more about clustering quality evaluation measures for centralized clustering. There are very few measures that have been developed specifically for distributed clustering. Two of them can be found in [40]:

- Discrete Object Quality: returns a boolean result assessing whether an object X was clustered correctly by the distributed clustering algorithm (as compared to a centralized algorithm)
- Continuous Object Quality: the same as the previous one, but returns a real value between 0 and 1 determining how close the distributed algorithm was to the centralized one.

Not all distributed algorithms have been tested for their quality. The reason for this is that they have emerged from centralized ones, which have been tested thoroughly before. So, assuming that a distributed clustering algorithm is exact, or almost exact, no large difference in accuracy is expected between it and the centralized algorithm that it originated from. As with speedup, it is impossible to compare the quality of distributed clustering algorithms across the board. The reasons remain the same: the algorithms have been tested using a wide range of data sets that have nothing in common, and the measures of quality and speedup need to be considered together, because they are tightly coupled: modifying one influences the other.

3.3.3 Determining Complexity

The complexity of distributed clustering algorithms is the only measure that can be used to compare them objectively, because complexity is determined by analyzing an algorithm mathematically. Although determining complexity does not depend on any other parameters, algorithms with the same complexity can differ greatly in speed. The difference can come from the usual factors: nature of the data, quality of the algorithm, optimizations, etc. When measuring the complexity of distributed algorithms, two processes need to be taken into consideration: the actual computation on the data, and the communication of results (in one or more rounds). Their corresponding measures are the computation complexity and the communication complexity, respectively. Put together, they form the overall complexity of the algorithm. Communication complexity depends on the environment that a certain algorithm has been designed for.

3.4 Summary

This section aims to place distributed clustering in the context of distributed data mining. After doing so, the key aspects of distributed clustering are discussed, beginning with the possible scenarios for distributed data and computation. Next, the types of distributed environments are listed: parallel computers, LAN/WAN and P2P networks. Each of these environments imposes different constraints on communication. The two possible architectures of distributed clustering algorithms are facilitator/worker and P2P. The data transmitted by these techniques falls into three categories: whole data, representatives or prototypes. Ensemble clustering and privacy issues are also important aspects of the field. Finally, three attributes that need to be analyzed in order to evaluate the performance of distributed clustering algorithms are presented: scalability, quality and complexity.

4 Survey of Distributed Clustering Techniques

4.1 Initial Taxonomy

In order to find a structured way of surveying distributed clustering algorithms, I tried to establish a taxonomy that includes all the concepts that define the clustering process. Classifying the algorithms according to this taxonomy will give a broad picture of the field and will make it easier to position new algorithms in relation to the existing ones. Each step of the distributed clustering process involves design decisions that are usually influenced by what needs to be achieved. For example, if an algorithm has to run in a P2P environment, it cannot use the facilitator/worker architecture. Also, algorithms requiring multiple rounds of communication would not be suitable for this environment. If privacy preservation is a priority, then algorithms which send the entire data, or samples of it, certainly cannot be used. These decisions ultimately define a distributed clustering algorithm. All the key aspects that need to be taken into consideration have been grouped by where they appear in the clustering process. First, let us review the steps of a general distributed clustering algorithm:

1. A local model is computed
2. The local models are aggregated by a central node (or a super-peer in P2P clustering systems)
3. Either a global model is computed, or aggregated models are sent back to all the nodes to produce locally optimized clusters.

Next, the elements of the taxonomy that can be found at each step are presented (a sketch of how they can be recorded per algorithm follows the list).

1. Computation of a local model. If taken out of the distributed process context, this step can be viewed as classical, or local, clustering. Therefore, all the traditional aspects of clustering apply here:
   - Type of the data on which the clustering algorithm is applied: general, synthetic, documents, gene expressions
   - Local clustering algorithm: partitional, hierarchical, density-based, geometric embedding, none (if the entire data is passed to the central node/other peers)
   - Feature selection, used when working with high dimensional data sets

2. Aggregation of the local models
   - Underlying distributed environment: parallel computer, cluster, WAN, P2P network
   - The number of rounds of communication required by the algorithm
   - What is communicated: entire data, representatives, prototypes

3. Optimization of local models
   - Is the clustering global or local (if it is global, this step is unnecessary)?
   - Does it output exactly the same results as the centralized version of the algorithm, or approximate results?

4. General aspects
   - Is the data centralized or distributed prior to the start of the algorithm?
   - Is the algorithm incremental or non-incremental?
   - Is privacy required?
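To make the taxonomy concrete, each surveyed algorithm can be recorded as one structured entry. The sketch below uses a Python dataclass whose field names are my own shorthand for the dimensions above; the example instance follows the Parallel k-means entry reviewed in the next subsection, with the fields not stated there (incremental behaviour, privacy) filled in as assumptions.

```python
from dataclasses import dataclass

@dataclass
class DistributedClusteringEntry:
    """One record per surveyed algorithm; field names are illustrative shorthand."""
    name: str
    data_type: str             # general, synthetic, documents, gene expressions
    local_algorithm: str       # partitional, hierarchical, density-based, ...
    feature_selection: bool
    environment: str           # parallel computer, cluster, WAN, P2P network
    communication_rounds: str  # "one" or "multiple"
    transmitted: str           # entire data, representatives, prototypes
    scope: str                 # "global" or "local"
    exact: bool                # exact vs. approximate w.r.t. the centralized result
    data_distributed: bool
    incremental: bool
    privacy_preserving: bool

parallel_kmeans = DistributedClusteringEntry(
    name="Parallel k-means (Dhillon and Modha [20])",
    data_type="general",
    local_algorithm="partitional (k-means)",
    feature_selection=False,
    environment="parallel computer",
    communication_rounds="multiple",
    transmitted="prototypes (centroids)",
    scope="global",
    exact=True,
    data_distributed=False,   # data is centralized prior to the algorithm
    incremental=False,        # assumption: not stated in the review
    privacy_preserving=False, # assumption: not stated in the review
)
```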


4.2 Review of Algorithms

[1.PADMA] One of the first efforts in distributed clustering was made by Kargupta et al. in 1997. A parallel hierarchical clustering algorithm is presented in [49], as part of the broader mining framework PADMA (PArallel Data Mining Agents). The framework consists of a user interface, a facilitator and independent agents with their own storage. It is implemented on distributed memory machines using MPI. The clustering component has been tested with documents, which are represented using N-grams. The parallelization of the hierarchical agglomerative clustering is straightforward: each agent performs a local clustering on its data, until only a few clusters are left. The results are sent to a client, which executes clustering on the very condensed data. PADMA was tested on a 128-node IBM SP2 supercomputer, on a corpus containing 25273 text documents. Because there is no interprocess communication, linear speedup was achieved.

- Computation of local model: documents, hierarchical clustering, N-grams
- Aggregation of local models: distributed memory machine environment, prototypes (dendrograms) are sent, one round of communication
- Optimization of local models: global clustering, no optimization needed
- General aspects: data is distributed, exact algorithm

[2.RACHET] Another distributed hierarchical clustering algorithm is proposed by Samatova et al. in [64]. It is quite similar to the one contained in PADMA, but with a major difference: after local dendrograms are created at each site, the clusters are not transmitted as a whole to the merger site. Instead, only descriptive statistics of the centroid are sent for each cluster: a 6-tuple composed of the number of data points in the cluster, the square norm of the centroid, the radius of the cluster, and the sum of the components together with their minimum and maximum values. This representation accounts for a significant reduction of the communication costs. Dendrograms are merged at the central site using just these statistics. Tests on synthetic data with up to 16 dimensions, as well as on real data, show that the quality of the algorithm is comparable with that of the centralized hierarchical clustering approach.

- Computation of local model: general, hierarchical clustering
- Aggregation of local models: WAN, prototypes (statistics) are sent, one round of communication
- Optimization of local models: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm

[3.CHC] In [43], Johnson and Kargupta present a distributed hierarchical algorithm, called CHC (Collective Hierarchical Clustering). It works on data that is heterogeneously distributed, with each site having only a subset of all the features. First, a local hierarchical clustering is performed at each site. Afterwards, the obtained dendrograms are sent to a facilitator which computes the global model, using statistical bounds. The aggregated results are similar to centralized clustering results, making CHC an exact algorithm. An implementation of the algorithm for single link clustering is also introduced in the paper. Empirical tests are made on the Boston Housing Data Set, where CHC is compared to monolithic clustering.

- Computation of local model: general, hierarchical clustering
- Aggregation of local models: WAN, prototypes (dendrograms) are sent, one round of communication
- Optimization of local models: global clustering, no optimization needed
- General aspects: data is distributed, exact algorithm
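The three algorithms above share the same local-cluster-then-merge pattern. The sketch below is a rough, generic illustration of that pattern only, not a reimplementation of PADMA, RACHET or CHC (which ship richer summaries such as full dendrograms or RACHET's 6-tuple statistics).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def local_summary(X, n_local):
    """Each site condenses its data by hierarchical clustering and keeps only
    one centroid per local cluster as a compact prototype."""
    labels = AgglomerativeClustering(n_clusters=n_local).fit_predict(X)
    return np.array([X[labels == c].mean(axis=0) for c in range(n_local)])

def merge_at_facilitator(summaries, n_global):
    """The facilitator clusters the collected local centroids into the global model."""
    all_centroids = np.vstack(summaries)
    return AgglomerativeClustering(n_clusters=n_global).fit_predict(all_centroids)

rng = np.random.default_rng(0)
sites = [rng.normal(loc=i, size=(200, 4)) for i in range(4)]   # four local sites
summaries = [local_summary(X, n_local=10) for X in sites]
global_labels = merge_at_facilitator(summaries, n_global=3)    # labels the 40 centroids
```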


[4. CPCA] Collective Principal Component Analysis clustering (CPCA) is proposed by Kargupta et al. in [50]. It is one of the early distributed clustering algorithms and it works on heterogeneous data. It consists of the following steps: first, PCA is performed locally at each site. Next, a sample of the projected data, as well as the eigenvectors, are sent to a central processing node, which combines the projected data from all the sites. Finally, PCA is performed on the global data set to identify the dominant eigenvectors. These are sent back to the sites, which perform local clustering upon receiving them. Models of the obtained clusters are sent once more to the central node, which combines them; in the paper this is achieved using a nearest-neighbour technique. CPCA is an approximate algorithm: experiments comparing it with centralized PCA show differences between the two approaches.

- Computation of local model: general, PCA clustering
- Aggregation of local model: WAN, representatives are sent, multiple rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm

[5. Parallel K-means] Dhillon and Modha [20] developed a version of k-means that runs on a parallel supercomputer, making use of the inherent parallelism of k-means. Each processor, or node, receives only a segment of the data that needs to be clustered. One of the nodes selects the initial cluster centroids before sending them to the others. Distances between centroids and data points are computed independently, but after each iteration of the algorithm the independent results must be aggregated, or reduced. This is done using MPI (the Message Passing Interface). The reduced centroids obtained after the last iteration represent the final result of the clustering process. The paper shows, both analytically and empirically, that the speedup and scaleup of the algorithm should be closer to optimal as the number of nodes increases, because the communication cost has a smaller impact. On a 16-node machine, a speedup of 15.62 was obtained. (A sketch of this reduction scheme is given after the next entry.)

- Computation of local model: general, k-means clustering
- Aggregation of local model: parallel computer, prototypes (centroids) are sent, multiple rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is centralized, exact algorithm

[6. Parallel K-harmonic] Zhang et al. [76] propose a method for parallel clustering that can be applied to iterative center-based clustering algorithms: k-means, k-harmonic means or EM (Expectation-Maximization). The algorithm is similar to the previous one, but more general: first, the data to be clustered is split into a number of partitions equal to the number of computing units. A computing unit can be a CPU in a multiprocessor computer or a workstation in a LAN. Then, at each step, Sufficient Statistics (SS) are computed locally and summed up by an integrator, which broadcasts them to all the computing units for the next iteration, if required. The implemented algorithm achieved almost linear speedup. In [24], the parallel clustering approach is extended to work with inherently distributed data.

- Computation of local model: general data, k-means/k-harmonic means/EM
- Aggregation of local model: multiprocessor computer/LAN environment, prototypes (SS), multiple rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, exact algorithm
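To illustrate the reduction step that both parallel k-means [20] and the sufficient-statistics scheme [76] rely on, here is a minimal mpi4py sketch of one distributed k-means iteration: each process assigns its local points to the nearest centroid, and the per-cluster sums and counts are combined with an all-reduce. It is a simplified illustration of the general idea, not the code from either paper.

```python
# One k-means iteration with an MPI all-reduce, in the spirit of [20]/[76].
# Run with e.g.:  mpiexec -n 4 python parallel_kmeans_step.py
import numpy as np
from mpi4py import MPI

def kmeans_step(local_points, centroids):
    """local_points: (n_local, d) array held by this process;
    centroids: (k, d) array, identical on every process."""
    comm = MPI.COMM_WORLD
    k, d = centroids.shape
    # Assign each local point to its nearest centroid.
    dists = np.linalg.norm(local_points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Local sufficient statistics: per-cluster sums and counts.
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j] = local_points[mask].sum(axis=0)
        counts[j] = mask.sum()
    # Reduce the statistics across all processes (the only communication step).
    global_sums = np.empty_like(sums)
    global_counts = np.empty_like(counts)
    comm.Allreduce(sums, global_sums, op=MPI.SUM)
    comm.Allreduce(counts, global_counts, op=MPI.SUM)
    # New centroids; keep the old one if a cluster received no points.
    nonempty = global_counts > 0
    new_centroids = centroids.copy()
    new_centroids[nonempty] = global_sums[nonempty] / global_counts[nonempty, None]
    return new_centroids

if __name__ == "__main__":
    rng = np.random.default_rng(MPI.COMM_WORLD.Get_rank())
    points = rng.normal(size=(100, 2))            # this process's data segment
    cents = np.array([[0.0, 0.0], [1.0, 1.0]])    # same initial centroids everywhere
    for _ in range(10):
        cents = kmeans_step(points, cents)
    if MPI.COMM_WORLD.Get_rank() == 0:
        print(cents)
```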


[7. DBDC] The paper by Januzaj et al. [40] introduces the Density-Based Distributed Clustering algorithm (DBDC). It can be used when the data to be clustered is distributed and infeasible to centralize. DBDC works by first computing a local clustering at each of the nodes. For this, a density-based algorithm, DBSCAN [21], is used. The representatives of the local clusters are created by using k-means to discover the most important data points in each cluster. These are passed to a central node that aggregates them and sends the results back for local clustering optimization. This algorithm requires two rounds of communication between the facilitator and the workers. The paper also introduces two distributed clustering quality measures: discrete object quality and continuous object quality. The algorithm was tested on up to 200,000 two-dimensional data points and run on one machine.

- Computation of local model: general, density-based clustering
- Aggregation of local model: WAN environment, representatives are sent, one round of communication
- Optimization of local model: local clusters are optimized, another round of communication
- General aspects: data is distributed, approximate algorithm

[8. PDBSCAN] Another algorithm that uses DBSCAN was proposed by Xu et al. in [72]. PDBSCAN adapts the original algorithm to work in parallel on a cluster of interconnected computers. It introduces a new storage structure, called the dR*-tree. It differs from a normal R*-tree by storing the pages on different computers, while the index is replicated to all the computers in the cluster. The data placement function uses Hilbert curves to distribute spatially related areas to the same computer. After local clustering is performed on each computer, results are sent to a master, which merges clusters where needed. The merging step ensures that the final results are similar to the results obtained with centralized clustering; thus, PDBSCAN is an exact algorithm. It was tested on 8 interconnected computers and on up to 1 million synthetic 2D data points, achieving near-linear speedup and scaleup.

- Computation of local model: spatial data, DBSCAN clustering
- Aggregation of local model: cluster of computers, prototypes are sent, multiple rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, exact algorithm

[9. KDEC] The KDEC scheme [54, 53] proposed by Klusch et al. performs kernel-density-based clustering over homogeneously distributed data. It uses the multi-agent systems paradigm, where each site has an intelligent agent that performs local computation, and a helper agent sums the estimates from all the site agents. Density estimates at each site are computed locally, using a global kernel function. The global density estimate is calculated by adding the local estimates. This value is sent back to the local sites, which construct the clusters: points that can be connected by a continuous uphill path to a local maximum are considered to be in the same cluster. To address privacy concerns, a sampling technique is used: a cube and a grid are defined globally and each site computes the densities at the corners of the grid. In the end, data is clustered locally at each site; there is no global clustering. (A sketch of the density-summing step is given after this entry.)

- Computation of local model: general, kernel-density-based clustering
- Aggregation of local model: WAN, prototypes (density estimates) are sent, two rounds of communication
- Optimization of local model: global densities are sent back for local clustering optimization
- General aspects: data is distributed, approximate algorithm, privacy preserving
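As a rough illustration of the KDEC idea of exchanging only sampled density estimates, the sketch below has each site evaluate a Gaussian kernel density estimate on a globally agreed grid, after which the helper simply sums the per-site estimates. The kernel, bandwidth and grid are arbitrary choices for the example, and the hill-climbing cluster construction from [54, 53] is not shown.

```python
import numpy as np

def local_density_on_grid(points, grid, bandwidth=0.5):
    """Gaussian kernel density estimate of one site's data, sampled on a
    shared grid (only these sampled values would leave the site)."""
    points = np.asarray(points, dtype=float)   # (n, d)
    grid = np.asarray(grid, dtype=float)       # (g, d)
    diffs = grid[:, None, :] - points[None, :, :]
    sq = (diffs ** 2).sum(axis=2) / (2 * bandwidth ** 2)
    return np.exp(-sq).sum(axis=1)             # unnormalized KDE per grid point

# A shared 2D grid agreed on by all sites (corners of a regular grid).
xs, ys = np.meshgrid(np.linspace(-2, 2, 9), np.linspace(-2, 2, 9))
grid = np.column_stack([xs.ravel(), ys.ravel()])

site_data = [np.random.randn(50, 2) - 1.0,      # site 1
             np.random.randn(80, 2) + 1.0]      # site 2

# Helper agent: the global estimate is just the sum of the local ones.
global_density = sum(local_density_on_grid(d, grid) for d in site_data)
print(global_density.reshape(9, 9).round(1))
```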


[10. KDEC-S] In [15], da Silva et al. focus on the security of the previous kernel-density-based clustering algorithm and introduce KDEC-S. This algorithm has the advantage of better protection from malicious attacks; specifically, it addresses the possibility of a malicious agent using inference to obtain data from other peers. Testing on synthetic data shows that the accuracy of clustering is not significantly affected. Its characteristics are similar to those of KDEC, but it is more secure.

[11. Strehl Ensemble] Strehl and Ghosh propose a method for ensemble clustering in [68]. Ensemble clustering combines the results of different clustering algorithms to obtain an optimal combined clustering. While this approach can also be employed for knowledge reuse and other scenarios, it can obviously be applied to distributed clustering, where it combines the local clustering results obtained at different sites. Three methods are proposed for combining clustering results. The Cluster-based Similarity Partitioning Algorithm (CSPA) reclusters objects using pairwise similarities obtained from the initial clusters. The HyperGraph Partitioning Algorithm (HGPA) approximates the maximum mutual information objective with a minimum cut objective. The Meta-CLustering Algorithm (MCLA) tries to find groups of clusters, or meta-clusters. It is argued that, because of their low complexity, it is feasible to run all three functions on a data set and choose the best result. For large data sets, sending all the clusters to a central site might pose a communication problem that makes scalability difficult to achieve. Because the merging functions do not need any information about an object except the cluster that it belongs to, this approach also accounts for privacy preservation.

- Computation of local model: general, any algorithm can be used
- Aggregation of local model: although no environment is specified in the paper, it is more suitable for WAN than others; prototypes (clustering results) are sent, one round of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm

[12. Fred Ensemble] Another ensemble clustering approach is presented by Fred and Jain in [25]. It uses evidence accumulation from multiple runs of the k-means algorithm to construct an n x n matrix representing a new similarity measure. The clusters are then built by using a minimum spanning tree (MST) algorithm to cut weak links that fall beneath a certain threshold. The results are similar to the ones obtained by the single-link algorithm. While this approach was not designed specifically for distributed clustering and the subject is not covered in the paper, an adaptation is possible: it can be assumed that the different runs of the k-means algorithm come from different sites and a global co-association matrix is built. However, constructing this matrix in a communication-efficient way needs to be addressed in order to achieve a successful implementation of this ensemble clustering algorithm in a distributed environment. (A sketch of the evidence-accumulation step is given after this entry.)

- Computation of local model: k-means
- Aggregation of local model: any (most suitable for WAN), prototypes (clustering results) are sent, one round of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm
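To make the evidence-accumulation idea concrete, the sketch below builds a co-association matrix from several labelings of the same n objects and then cuts links below a threshold, taking connected components of the resulting graph (which corresponds to a single-link cut at that similarity level). It is a simplified reading of [25], not the authors' implementation.

```python
import numpy as np

def co_association(labelings):
    """Fraction of runs in which each pair of objects shares a cluster."""
    labelings = np.asarray(labelings)            # shape (n_runs, n_objects)
    n_runs, n = labelings.shape
    co = np.zeros((n, n))
    for labels in labelings:
        co += (labels[:, None] == labels[None, :]).astype(float)
    return co / n_runs

def cut_weak_links(co, threshold=0.5):
    """Connected components of the graph keeping only links >= threshold
    (equivalent to a single-link cut at that similarity)."""
    n = co.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = current
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three k-means runs (e.g. coming from three different sites) over 6 objects.
runs = [[0, 0, 1, 1, 2, 2],
        [0, 0, 0, 1, 1, 1],
        [1, 1, 2, 2, 0, 0]]
print(cut_weak_links(co_association(runs), threshold=0.6))
```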
[13. Jouve Ensemble] Jouve and Nicoloyannis take a different approach to ensemble clustering in [45]. They develop a technique for combining partitions that can also be applied to distributed clustering: a data set can be split into multiple partitions, clustering is run in parallel on each of them, and the results are combined. This technique can be readily used for distributed clustering, with each process, site or peer receiving a partition of the data set. An adequacy measure is used for determining the optimal unique combination of the structures, and the paper proposes a new heuristic method to compute this measure. Some tests have been made, but the efficiency of the parallel clustering cannot be determined from them.

- Computation of local model: any algorithm can be used
- Aggregation of local model: WAN, prototypes (clustering results) are sent, one round of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data can be distributed, approximate algorithm

[14. Merugu Privacy] Merugu and Ghosh [59] perform distributed clustering by using generative models, which have the advantages of privacy preservation and low communication costs. Instead of sending the actual data, generative models are built at each site and then sent to a central location. The paper proves that a certain mean model is able to represent all the data at one site. This model can be approximated using Markov Chain Monte Carlo techniques. The local models obtained this way are fed to a central Expectation-Maximization (EM) algorithm which computes a global model by trying to minimize the KL (Kullback-Leibler) distance. Special attention is given to privacy, which is defined as the inverse of the probability of generating the data from a specific model. Better clustering can be achieved by decreasing privacy, although the authors argue that high-quality clustering can be obtained with little privacy loss.

- Computation of local model: general data, generative models determined using Markov Chain Monte Carlo techniques
- Aggregation of local model: WAN environment, prototypes (generative models) are sent, one round of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm, privacy preserving

[15. Vaidya Privacy] Another algorithm that considers privacy essential is the k-means clustering algorithm introduced by Vaidya and Clifton in [70]. It tries to solve the problem of protecting privacy when distributed clustering is performed on vertically partitioned (heterogeneous) data. The algorithm guarantees that each site, or party, knows only the part of each centroid relevant to the attributes that it holds, and the cluster assignment of each point. Two issues are addressed by the algorithm: first, the assignment of points to their corresponding cluster at each iteration and, second, knowing when to end the iterations. Permutations and combinatorics are used for solving these issues while maintaining the privacy goal. Additionally, functions from the field of Secure Multiparty Computation are used.

- Computation of local model: k-means clustering
- Aggregation of local model: WAN, prototypes (centroids) are sent, multiple rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed heterogeneously, approximate algorithm, privacy preserving


[16. DIB] In [19], Deb and Angryk propose an algorithm that uses word-clusters for distributed document clustering, called DIB (Distributed Information Bottleneck). The algorithm is based on the earlier work of Slonim and Tishby [65], who developed the aIB (Agglomerative Information Bottleneck) algorithm. It is argued that word-clusters are better than the classic co-occurrence matrix of documents for the following reasons: higher accuracy, a structure that is easier to navigate, and reduced dimensionality. The last feature is important for distributed clustering: documents have very high dimensionality, with thousands of attributes, but one document contains only a small portion of them, typically several hundred. Word-clusters reduce the dimensionality and thus help lower the communication costs. Another advantage of this algorithm is that the user does not have to specify the number of clusters to be created at each site. The Distributed Information Bottleneck algorithm works in the following way: first, each site generates local word-clusters. Second, these are sent to a central site which generates the global word-clusters, which are then sent to all the participating sites. Third, documents are clustered locally using aIB and the word-cluster representation. Finally, the local models are sent to the central site, which computes the global clustering model. Experiments show that the accuracy of DIB is close to that of the centralized approach.

- Computation of local model: agglomerative Information Bottleneck, using word-clusters
- Aggregation of local model: WAN, prototypes are sent; clustering is done in two steps (first word-clusters, then actual clusters), three rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm

[17. HP2PC] HP2PC (Hierarchically-Distributed P2P Clustering) is an algorithm recently introduced by Hammouda and Kamel in [34]. It is suitable for large P2P systems. The approach is to first create a logical structure over a P2P network, consisting of layers of neighbourhoods. Each neighbourhood is a group of peers that can communicate only among themselves. Additionally, there is a supernode that aggregates all the data from the group and represents the only point of communication between neighbourhoods. In turn, each supernode is a peer of a higher-level neighbourhood, where the same rules apply. This way, multiple layers can be created; at the highest level there is the root supernode. In each neighbourhood, a set of centroids is created using an algorithm similar to P2P k-means, but with slight differences. The centroids are passed on to the supernode, which computes another set of centroids together with its neighbours. Finally, the root supernode contains the centroids for the whole data set. Tests were run on several document corpora, such as the 20-newsgroups data set, using the F-measure as the quality criterion.

- Computation of local model: k-means
- Aggregation of local model: P2P networks, prototypes (centroids) are sent, multiple rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm

[18. P2P K-means] The P2P K-means algorithm, developed by Datta et al. [18], is one of the first algorithms developed for P2P systems. Each node requires synchronization only with the nodes that it is directly connected to, i.e. its neighbourhood.
Only one node initializes the centroids used for k-means, which are then spread through the entire network. The centroids are updated iteratively: before computing them at step i, a node must receive the centroids obtained at step i-1 from all of its neighbours. When the new centroids of a particular node do not change significantly, the node enters a terminated state, in which it does not request any centroids but can still respond to requests from its neighbours. Node or edge failures and additions are also accounted for by P2P k-means, making it suitable for dynamic networks. Experiments were conducted on generated 10-dimensional data points, and high accuracy and good scalability were observed.

- Computation of local model: k-means
- Aggregation of local model: P2P network, prototypes (centroids) are sent, multiple rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm
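The sketch below illustrates the neighbourhood synchronization idea behind P2P k-means: a node assigns its local points to the previous-round centroids, then forms its new centroids by pooling its own per-cluster sums and counts with those received from its immediate neighbours. This is a simplified illustration of the general scheme, not the exact update rule or termination protocol of [18].

```python
import numpy as np

def local_statistics(points, centroids):
    """Per-cluster sums and counts of this node's points, w.r.t. the
    centroids of the previous round."""
    k, d = centroids.shape
    labels = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for j in range(k):
        sums[j] = points[labels == j].sum(axis=0)
        counts[j] = (labels == j).sum()
    return sums, counts

def p2p_kmeans_step(own_stats, neighbour_stats):
    """New centroids for this node, pooling its statistics with the
    statistics received from its direct neighbours."""
    sums = own_stats[0] + sum(s for s, _ in neighbour_stats)
    counts = own_stats[1] + sum(c for _, c in neighbour_stats)
    counts = np.maximum(counts, 1)      # avoid division by zero for empty clusters
    return sums / counts[:, None]

# Toy example: one node with two neighbours, k = 2 centroids in 2D.
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])
node_points = np.random.randn(30, 2)
neighbour_points = [np.random.randn(20, 2) + 3.0, np.random.randn(25, 2)]

own = local_statistics(node_points, centroids)
from_neighbours = [local_statistics(p, centroids) for p in neighbour_points]
print(p2p_kmeans_step(own, from_neighbours))
```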

[19. Collaborative] The collaborative clustering algorithm proposed by Hammouda and Kamel in [30, 33] performs local document clustering in P2P environments. First, a local model is computed using an incremental clustering algorithm based on similarity histograms. Clusters are summarized and the resulting models are exchanged between peers. The second step allows the refinement of the clusters obtained initially: each node sends relevant documents to its neighbours, and upon receiving new documents a node merges them into its existing clusters, which the incremental clustering allows. Here, the distinction between collaborative and cooperative clustering is made: while the former implies a local solution, the latter aims for a global solution. The algorithm has been tested in 3-peer, 5-peer and 7-peer environments on the 20-newsgroups data set. The experiments show that both the entropy and the F-measure have increased after refining the clusters obtained in the initial step.

- Computation of local model: incremental clustering using Similarity Histograms is performed locally
- Aggregation of local model: P2P, prototypes (cluster summaries) are sent, multiple rounds of communication
- Optimization of local model: local clustering is optimized after communicating with peers
- General aspects: data is distributed, exact algorithm

[20. DCCP2P] Another algorithm, related to the previous ones, is proposed by Kashef in the PhD thesis [52]: Distributed Cooperative Clustering in super-peer P2P networks (DCCP2P). A centralized cooperative clustering algorithm is introduced first, which combines the results obtained from three clustering algorithms: k-means, bisecting k-means and partitioning around medoids (PAM). The algorithm is then adapted for structured P2P networks: there are two tiers, one containing ordinary nodes grouped into neighbourhoods, and another containing super-nodes, one for each neighbourhood. Each node communicates only with others in the same neighbourhood, or with its super-node. The algorithm is composed of three steps: first, local models are computed on each node. Second, each super-node merges the clusters obtained on its local nodes. Third, the root super-peer merges the clusters obtained by all the super-peers, and thus the global model is obtained. The algorithm has been tested on gene expressions and documents.

- Computation of local model: tested for genes and documents; local algorithms: k-means/bisecting k-means/PAM
- Aggregation of local model: structured P2P, prototypes (clustering results) are sent, multiple rounds of communication
- Optimization of local model: global clustering, no optimization needed
- General aspects: data is distributed, approximate algorithm

4.3 Refined Taxonomy

While reviewing the papers, I realized that the taxonomy I had devised initially was not the most suitable for all the algorithms that I encountered. For example, only one of the papers in the review tackled the topic of feature selection; the others did not consider it an important aspect with regard to distributed clustering. Almost all the algorithms assume that data is distributed prior to running them. The few algorithms that work with data that is initially centralized, like parallel k-means [20], consider the distribution of data a preprocessing step and not a part of the algorithm itself. Thus, it made no sense to keep this element in the taxonomy. The possibility of performing incremental clustering using distributed algorithms was also tackled by just a few papers, so that column was excluded as well.

The structure of the taxonomy was also changed. The steps of the clustering process are not very relevant for the elements of the taxonomy, because many of them cannot be attributed to a single step. The new hierarchy groups the concepts under three categories: requirements, design choices and communication aspects. Additionally, a column for the global clustering algorithm has been added, for the situations in which it is different from the local clustering algorithm. A short description of the refined taxonomy follows; the table on the next page classifies the reviewed distributed clustering algorithms according to these concepts.

1. Requirements: attributes that come from the underlying problem being solved and cannot be modified by the designer of the algorithm
- Data Type: the type of data that needs to be clustered: general, spatial, documents, gene expressions
- Scope: local or global
- Accuracy: exact or approximate
- Privacy Preserving: data from one node cannot be accessed by other nodes
- Environment: the underlying structure on which the algorithm is running; it can be a parallel computer or a cluster of computers, a Wide Area Network, or a Peer-to-Peer network

2. Design: choices that are made regarding the algorithm so that it can work with the given requirements
- Architecture: P2P or facilitator/worker
- Local Algorithm: the choice for the local clustering, if one is used
- Global Algorithm: the technique for global clustering; it can be the same algorithm as the local one (e.g. PADMA), a different algorithm (e.g. Merugu privacy), or a single algorithm that produces global results without local clustering (parallel k-means, P2P k-means)

3. Communication: aspects regarding communication between the nodes
- What Is Communicated: data, representatives or prototypes (see Section 3.2.5)
- Rounds of Communication: one (for facilitator/worker architectures), two (when the local model is optimized after global aggregation of models), or multiple (parallel and P2P algorithms)

4. Other: elements that do not fall into any of the given categories
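As an informal illustration of how an algorithm maps onto this refined taxonomy, the record type below encodes one row of the classification table; the field values for PADMA are taken from the review entry above, while the structure itself is only an illustrative encoding and not part of the original report.

```python
from dataclasses import dataclass

@dataclass
class TaxonomyEntry:
    # Requirements
    data_type: str            # general, spatial, documents, gene expressions
    scope: str                # local or global
    accuracy: str             # exact or approximate
    privacy_preserving: bool
    environment: str          # parallel computer/cluster, WAN, P2P network
    # Design
    architecture: str         # P2P or facilitator/worker
    local_algorithm: str
    global_algorithm: str
    # Communication
    communicated: str         # data, representatives or prototypes
    rounds: str               # one, two or multiple

padma = TaxonomyEntry(
    data_type="documents", scope="global", accuracy="exact",
    privacy_preserving=False, environment="distributed memory machine",
    architecture="facilitator/worker",
    local_algorithm="hierarchical agglomerative",
    global_algorithm="same as local",
    communicated="prototypes (dendrograms)", rounds="one")
print(padma)
```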


5 Parallel QT: A New Distributed Clustering Algorithm

In this section, a new distributed clustering algorithm is introduced, called Parallel QT (Quality Threshold) Clustering.

5.1 Motivation

Most of the distributed clustering algorithms developed in the last several years are based on P2P or WAN environments, where a model is computed locally first. These work very well on the Internet and can be applied in situations like clustering for file-sharing applications, or data mining at multiple sites of a corporation or agency. But these algorithms cannot be used by companies storing huge amounts of data in grids formed of thousands of computers. If Google and Amazon tried to work on a common project that implies sharing data, they could apply the same principles used throughout distributed data mining: compute a local model and then aggregate it. Their problem, however, is how to compute the local model. They have thousands of servers which can store and process data, so they need highly parallelized algorithms. It is this area of necessity that the new algorithm addresses. Thus, it would be a lower-level algorithm than the distributed algorithms developed today, and it could be used in conjunction with them (for the local clustering step). Parallel QT is an extension of the QT clustering algorithm that I have used in my previous research. The next section gives an overview of the algorithm and its advantages.

5.2 Original QT Clustering

The original QT algorithm was presented by Heyer et al. in [36] for gene clustering. The idea of the algorithm is to build a candidate cluster for each data point, and then choose the best candidate. In the paper, it is shown that considering the largest cluster as the best one works as well as other, more sophisticated selection criteria. These are the steps of the QT clustering algorithm (a sketch of them is given after the list):

1. For a candidate document D, compute the cosine similarity between D and all the other documents.
2. All the documents that are closer than a certain threshold to D are added to a candidate cluster with its center in D.
3. Repeat steps 1 and 2 for all the documents in the dataset. This results in a set of candidate clusters. The number of candidate clusters is equal to the number of documents in the dataset. Each document is the center of one and only one candidate cluster, but a document can be a member (though not the center) of multiple candidate clusters.
4. Choose the cluster with the highest number of documents as an actual cluster, store it, and remove its documents from the dataset that the algorithm is working on.
5. Repeat steps 1-4 on the remaining dataset until there are no more documents to cluster, or until all the remaining candidate clusters contain a single document.
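A minimal sketch of these steps follows, using cosine similarity over simple document vectors; the threshold value and the toy vector representation are arbitrary choices for the example, not those used in [36] or in Newistic.

```python
import numpy as np

def qt_cluster(vectors, threshold=0.3):
    """QT clustering sketch: repeatedly build one candidate cluster per
    remaining document (all documents within `threshold` cosine similarity
    of the candidate's center) and keep the largest candidate."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    sims = X @ X.T / np.outer(norms, norms)      # cosine similarity matrix
    remaining = set(range(len(X)))
    clusters = []
    while remaining:
        # Build a candidate cluster centred on every remaining document.
        candidates = [{j for j in remaining if sims[i, j] >= threshold} | {i}
                      for i in remaining]
        best = max(candidates, key=len)
        if len(best) <= 1:                       # only singletons left: stop
            break
        clusters.append(sorted(best))
        remaining -= best
    return clusters, sorted(remaining)           # clusters + leftover singletons

docs = [[1, 1, 0, 0], [1, 0.9, 0.1, 0], [0, 0, 1, 1], [0, 0.1, 1, 0.9], [1, 0, 0, 1]]
print(qt_cluster(docs, threshold=0.8))
```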

5.2.1 Advantages of QT Clustering

QT clustering has the following advantages when compared to other clustering techniques, in particular k-means:

- It does not need to know the number of clusters in advance
- It does not make any random decisions, so the results are the same for each run of the algorithm

QT is also regarded as a quality clustering algorithm, and its results support that. K-means, on the other hand, tends to produce lower-quality clusters, although it is much faster.


The main problem with QT is its O(n²) complexity. For Newistic, a system that I have co-built previously [16], heavy optimization managed to reduce the number of similarities computed between documents, and the execution time was reduced by a factor of 4. One implementation detail was that we retained the similarity matrix in memory after clustering was completed, so for the next round we only needed to compute cosines for new documents. With these optimizations, running on a quad-core computer with 4 GB of RAM, our clustering implementation could work with up to 200,000 documents. However, the algorithm is currently not scalable beyond that point. Thus the need to parallelize this algorithm has appeared.

5.3 Parallel QT Clustering

This section describes the new Parallel QT Clustering algorithm. Suppose there are n documents and N parallel processes.

Step 1, Computing the similarities:

- There will be n(n+1)/2 similarity computations, and we want to split these computations equally among the N processes.
- We split the n documents into M blocks. To compute the similarity between two blocks means to compute the similarity between each document of the first block and each document of the second block.
- A block similarity has to be computed for each pair of blocks: B1B1, B1B2, B1B3, ..., B1BM, B2B2, B2B3, ..., B2BM, ..., B(M-1)B(M-1), B(M-1)BM, BMBM. In total, there will be M(M+1)/2 computations between the blocks.
- We want to assign each block computation to one process. From this it results that N = M(M+1)/2, and from here we can calculate the value of M: M = ⌈√(2N)⌉ - 1.
- By using the modulo function we can assign each BiBj block pair to one of the N processes. The respective process receives the documents included in both blocks Bi and Bj.
- Each process computes the similarities for the blocks it receives. If the similarity between two documents is higher than a certain threshold, it is written into a similarity matrix (either as 1 or as the similarity itself).

After step 1 is complete, a global n x n similarity matrix is available, but it is split into M(M+1)/2 different blocks.

Step 2, Computing the clusters. This step is iterative:

1. Choose the best cluster from the remaining candidates.
2. Delete the documents of the best cluster from the similarity matrix, and go to step 1. (This continues until only one-document candidate clusters are left.)

The algorithm (which needs a facilitator) is the following:

1. Calculate the number of non-zero elements for each row in the similarity matrix.
2. Out of these totals, take the largest. If the largest total is greater than 1, send the index of the corresponding row (iB) back to each of the processes; else STOP.
3. When a process receives the row index computed at step 2 (iB) and it is contained in one of its two blocks, it puts all the non-zero elements corresponding to that row in another structure (e.g. a hash map: iB -> NonZeroColumn1, NonZeroColumn2, ...) and deletes them from the similarity matrix (both row iB and column iB).

4. Repeat from step 1.
5. If the STOP command is given, each process sends its cluster lists to the facilitator, which aggregates them.

At the end of step 2, the facilitator should store all the computed clusters. (A sketch of the block-assignment scheme from step 1 is given below.)
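The following is a minimal sketch of the block-partitioning and assignment scheme from step 1, under the reading that the M(M+1)/2 block pairs are distributed over the N processes by the modulo function; the helper names are illustrative and not part of the report.

```python
import math

def choose_block_count(num_processes):
    """Number of blocks M, following M = ceil(sqrt(2N)) - 1; this gives
    exactly M(M+1)/2 = N block pairs when N is a triangular number
    (3, 6, 10, ...); for other N, some processes stay idle."""
    return math.ceil(math.sqrt(2 * num_processes)) - 1

def split_into_blocks(n_documents, m_blocks):
    """Partition document indices 0..n-1 into M roughly equal blocks."""
    return [list(range(b, n_documents, m_blocks)) for b in range(m_blocks)]

def assign_block_pairs(m_blocks, num_processes):
    """Map every block pair (i, j), i <= j, to a process via modulo."""
    pairs = [(i, j) for i in range(m_blocks) for j in range(i, m_blocks)]
    return {pair: idx % num_processes for idx, pair in enumerate(pairs)}

N = 6                                   # number of parallel processes
M = choose_block_count(N)               # here M = 3, giving 6 block pairs
blocks = split_into_blocks(25, M)       # 25 documents split into 3 blocks
for (i, j), proc in assign_block_pairs(M, N).items():
    # Process `proc` receives the documents of blocks i and j and computes
    # all pairwise similarities between (and within, when i == j) them.
    print(f"process {proc}: blocks {i} and {j} "
          f"({len(blocks[i])}+{len(blocks[j])} documents)")
```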

5.3.1 Challenges of the Algorithm

The biggest challenge regarding this algorithm is estimating the communication cost accurately. It might be that, because of this cost (there is quite a lot of communication between the peers), the scaleup and speedup obtained are not good enough to make the algorithm worth using. This question will be answered once the experiments are set up.

6 Conclusion and Future Work

Although distributed clustering is a young field, it has seen a lot of research activity. The first distributed clustering techniques tried to mimic the exact behaviour of traditional clustering algorithms (hierarchical, k-means) in a parallel environment. Later on, new algorithms were developed using the technique called ensemble clustering, where several sets of clustering results (usually obtained with different algorithms) are combined to provide an optimal model. Afterwards, research in distributed clustering techniques went in two main directions: privacy preservation, and algorithms for P2P environments. Indeed, the most recent papers on this topic focus on distributed clustering in peer-to-peer networks.

Although a lot of issues have been solved by the existing algorithms, new challenges continue to appear. As computing technology and broadband networks have evolved rapidly, there is a huge amount of digital data available that needs to be mined. As one of the most important tasks in data mining, clustering is needed everywhere. New clustering algorithms need to be developed for the communication environments of today: on the one hand, algorithms that can run on millions of computing units connected via the Internet and, on the other hand, high-throughput algorithms for grid and cloud computing. While research in P2P clustering addresses the first type of environment, there has been little development in recent years of algorithms that can work on clusters of computers and process massive amounts of data. The algorithm proposed in this paper, Parallel QT Clustering, is best suited for running in grid computing environments, tackling an area that has not been researched recently. Its utility can be evaluated only after thorough tests of the algorithm have been made; this is the subject of my future work.

The necessity of new algorithms for distributed clustering also comes from the expansion of distributed data mining into new fields, like data stream mining and wireless sensor networks. These require specific algorithms to deal with the intrinsic constraints that they possess. The following subsections present these two emerging fields in more detail.

6.1 Emerging Areas of Distributed Data Mining

6.1.1 Sensor Networks

Sensor networks are used more and more often, in many domains including medicine, warfare and autonomous buildings. Lightweight, sometimes mobile, sensors with reduced power consumption form these networks by establishing wireless connections to each other or to a central node. Often, wireless sensor networks are situated in harsh environments where resources, like power, are limited and cost is a constraint. Examples of wireless sensor networks where DDM is applied can be found in [28] (pollution monitoring) and [51] (stock market watch). In addition to the limited power supply, information processing algorithms in wireless sensor networks have other aspects to take into account: the reduced computing power of the sensors, the inherently dynamic state of the network, where sensors often fail or move into the range of other nodes, and the asynchronous nature of wireless networks. Data mining systems that have been successfully used for wireless sensor networks employ multi-agent systems [15] or P2P architectures [11]. The book by Zhao and Guibas [77] contains in-depth information about data processing in sensor networks; the reader is referred to it for more information on this field.

6.1.2 Data Stream Mining

A novel application of DDM is in mining data streams. In this scenario, data flows continuously from a producer. A producer is an application that outputs data, such as web pages, sensor measurements or voice call records, in a continuous manner. The flow of data can be rapid, fluctuating, unpredictable and unlimited in size. Thus, the data is viewed as a stream, and not as a collection of individual blocks. Because traditional Database Management Systems (DBMS) are not able to cope with this new source of data, new algorithms were required for basic functions like viewing, filtering or querying data streams. In recent years, more advanced research in data stream mining has been undertaken, on subjects like distributed clustering [8] and frequent pattern mining. In [7], a data stream model is introduced, possessing the following characteristics:

- Data elements arrive online
- There is no order among the elements that can be imposed by the system
- Data streams can have unlimited size
- Once an element is processed, getting access to it again is difficult or even impossible

A Data Stream Management System [2] has been developed at Stanford University to process data streams based on the previously described model. Examples of applications over distributed data streams are TraderBot [3], a financial search engine over streams of stock tickers and news feeds, and [1], which provides intrusion detection over gigabit network packet streams. More details about data stream mining can be found in the paper by Babcock et al. [7], as well as in the survey by Gaber et al. [26].

6.2 Future Work

The algorithm proposed in this paper, Parallel QT Clustering, is best suited for running in grid computing environments, tackling an area that has not been researched recently. Its utility can be evaluated only after thorough testing of the algorithm has been performed. This constitutes the subject of my future work in the field of distributed clustering.

References
[1] iPolicy Networks web page. http://www.ipolicynetworks.com.
[2] Stanford stream data management project. http://www-db.stanford.edu/stream.
[3] TraderBot web page. http://www.traderbot.com.
[4] Charu C. Aggarwal and Philip S. Yu. Finding generalized projected clusters in high dimensional spaces. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 70-81, New York, NY, USA, 2000. ACM.
[5] M. R. Anderberg. Cluster Analysis for Applications. Probability and Mathematical Statistics, New York: Academic Press, 1973.


[6] Francisco Azuaje. Clustering-based approaches to discovering and visualising microarray data patterns. Brief Bioinform, 4(1):31-42, 2003.
[7] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Phokion G. Kolaitis, editor, Proceedings of the 21st Symposium on Principles of Database Systems, pages 1-16. ACM Press, 2002.
[8] Sanghamitra Bandyopadhyay, Chris Giannella, Ujjwal Maulik, Hillol Kargupta, Kun Liu, and Souptik Datta. Clustering distributed data streams in peer-to-peer environments. Information Sciences, 176(14):1952-1985, 2006. Streaming Data Mining.
[9] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, 2002.
[10] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Rev., 37(4):573-595, 1995.
[11] Kanishka Bhaduri. Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments: A Deterministic Approach. PhD thesis, University of Maryland Baltimore County, 2008.
[12] Paul C. Boutros and Allan B. Okey. Unsupervised pattern recognition: An introduction to the whys and wherefores of clustering microarray data. Brief Bioinform, 6(4):331-343, 2005.
[13] M. Cannataro, A. Congiusta, A. Pugliese, D. Talia, and P. Trunfio. Distributed data mining on grids: services, tools, and applications. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 34(6):2451-2465, Dec. 2004.
[14] Krzysztof Cios, Witold Pedrycz, and Roman W. Swiniarski. Data Mining Methods for Knowledge Discovery. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[15] Josenildo C. da Silva, Chris Giannella, Ruchita Bhargava, Hillol Kargupta, and Matthias Klusch. Distributed data mining and agents. Engineering Applications of Artificial Intelligence, 18:791-807, 2005.
[16] Ovidiu Dan and Horatiu Mocian. Scalable web mining with Newistic. In Proceedings of PAKDD '09, 2009.
[17] Souptik Datta, Kanishka Bhaduri, Chris Giannella, Ran Wolff, and Hillol Kargupta. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, 10(4):18-26, 2006.
[18] Souptik Datta, Chris Giannella, and Hillol Kargupta. K-means clustering over a large, dynamic network. In Joydeep Ghosh, Diane Lambert, David B. Skillicorn, and Jaideep Srivastava, editors, SDM. SIAM, 2006.
[19] D. Deb and R. A. Angryk. Distributed document clustering using word-clusters. In Computational Intelligence and Data Mining, 2007 (CIDM 2007), IEEE Symposium on, pages 376-383, April 2007.
[20] Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 245-260, London, UK, 2000. Springer-Verlag.
[21] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), pages 226-231. AAAI Press, 1996.
[22] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery: an overview. Pages 1-34, 1996.
[23] E. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768-780, 1965.

[24] George Forman and Bin Zhang. Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl., 2(2):34-38, 2000.
[25] A. L. N. Fred and A. K. Jain. Data clustering using evidence accumulation. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 4, pages 276-280, 2002.
[26] Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy. Mining data streams: a review. SIGMOD Rec., 34(2):18-26, 2005.
[27] G. Getz, E. Levine, and E. Domany. Coupled two-way clustering analysis of gene microarray data. Proc. Natl Acad. Sci. USA, 97:12079-12084, 2000.
[28] M. Ghanem, Y. Guo, J. Hassard, M. Osmond, and M. Richards. Sensor grids for air pollution monitoring. In Proc. 3rd UK e-Science All Hands Meeting, 2004.
[29] Yike Guo and Janjao Sutiwaraphun. Probing knowledge in distributed data mining. In PAKDD '99: Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining, pages 443-452, London, UK, 1999. Springer-Verlag.
[30] Khaled Hammouda and Mohamed Kamel. Distributed collaborative web document clustering using cluster keyphrase summaries. Information Fusion, 9(4):465-480, 2008. Special Issue on Web Information Fusion.
[31] Khaled M. Hammouda. Distributed Document Clustering and Cluster Summarization in Peer-to-Peer Environments. PhD thesis, University of Waterloo, Department of Electrical and Computer Engineering, 2007.
[32] Khaled M. Hammouda and Mohamed S. Kamel. Incremental document clustering using cluster similarity histograms. In WI '03: Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, page 597, Washington, DC, USA, 2003. IEEE Computer Society.
[33] Khaled M. Hammouda and Mohamed S. Kamel. Collaborative document clustering. In Joydeep Ghosh, Diane Lambert, David B. Skillicorn, and Jaideep Srivastava, editors, SDM. SIAM, 2006.
[34] Khaled M. Hammouda and Mohamed S. Kamel. HP2PC: Scalable hierarchically-distributed peer-to-peer clustering. In SDM. SIAM, 2007.
[35] Pierre Hansen and Brigitte Jaumard. Cluster analysis and mathematical programming. Math. Program., 79(1-3):191-215, 1997.
[36] L. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9:1106-1115, 1999.
[37] Byung-Hoon Park and Hillol Kargupta. Distributed data mining: Algorithms, systems, and applications. Pages 341-358, 2002.
[38] Ah-Hwee Tan. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, pages 65-70, 1999.
[39] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
[40] Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. DBDC: Density based distributed clustering. In Elisa Bertino, Stavros Christodoulakis, Dimitris Plexousakis, Vassilis Christophides, Manolis Koubarakis, Klemens Böhm, and Elena Ferrari, editors, EDBT, volume 2992 of Lecture Notes in Computer Science, pages 88-105. Springer, 2004.


[41] Daxin Jiang, Jian Pei, and Aidong Zhang. DHC: A density-based hierarchical clustering method for time series gene expression data. In BIBE '03: Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering, page 393, Washington, DC, USA, 2003. IEEE Computer Society.
[42] Daxin Jiang and Aidong Zhang. Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16:1370-1386, 2004.
[43] Erik L. Johnson and Hillol Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 221-244, London, UK, 2000. Springer-Verlag.
[44] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics, Berlin: Springer, 1986.
[45] Pierre-Emmanuel Jouve and Nicolas Nicoloyannis. A method for aggregating partitions, applications in KDD. In Kyu-Young Whang, Jongwoo Jeon, Kyuseok Shim, and Jaideep Srivastava, editors, PAKDD, volume 2637 of Lecture Notes in Computer Science, pages 411-422. Springer, 2003.
[46] H. Kargupta and P. Chan. Advances in Distributed and Parallel Knowledge Discovery, chapter Distributed and Parallel Data Mining: A Brief Introduction. AAAI/MIT Press, 2000.
[47] H. Kargupta and P. Chan. Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press, Cambridge, MA, USA, 2000.
[48] H. Kargupta and K. Sivakumar. Data Mining: Next Generation Challenges and Future Directions, chapter Existential Pleasures of Distributed Data Mining. AAAI/MIT Press, 2004.
[49] Hillol Kargupta, Ilker Hamzaoglu, and Brian Stafford. Scalable, distributed data mining using an agent based architecture. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 211-214. AAAI Press, Menlo Park, California, 1997.
[50] Hillol Kargupta, Weiyun Huang, Krishnamoorthy Sivakumar, and Erik Johnson. Distributed clustering using collective principal component analysis. Knowl. Inf. Syst., 3(4):422-448, 2001.
[51] Hillol Kargupta, Byung-Hoon Park, Sweta Pittie, Lei Liu, Deepali Kushraj, and Kakali Sarkar. MobiMine: monitoring the stock market from a PDA. SIGKDD Explor. Newsl., 3(2):37-46, 2002.
[52] Rasha Kashef. Cooperative Clustering Model and Its Applications. PhD thesis, University of Waterloo, Department of Electrical and Computer Engineering, 2008.
[53] Matthias Klusch, Stefano Lodi, and Gianluca Moro. Agent-based distributed data mining: The KDEC scheme. In Matthias Klusch, Sonia Bergamaschi, Peter Edwards, and Paolo Petta, editors, Intelligent Information Agents: The AgentLink Perspective, volume 2586 of Lecture Notes in Computer Science. Springer, 2003.
[54] Matthias Klusch, Stefano Lodi, and Gianluca Moro. Distributed clustering based on sampling local density estimates. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, August 2003.
[55] Raymond Kosala and Hendrik Blockeel. Web mining research: A survey. CoRR, cs.LG/0011033, 2000.
[56] R. Krishnapuram, A. Joshi, and Liyu Yi. A fuzzy relative of the k-medoids algorithm with application to web document and snippet clustering. In Fuzzy Systems Conference Proceedings, 1999 (FUZZ-IEEE '99), 1999 IEEE International, volume 3, pages 1281-1286, 1999.
[57] Laura Lazzeroni and Art Owen. Plaid models for gene expression data. Statistica Sinica, 12:61-86, 2002.


[58] Sara C. Madeira and Arlindo L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 1(1):24-45, 2004.
[59] Srujana Merugu and Joydeep Ghosh. Privacy-preserving distributed clustering using generative models. In ICDM, pages 211-218. IEEE Computer Society, 2003.
[60] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354-359, November 1983.
[61] Clark F. Olson. Parallel algorithms for hierarchical clustering. Parallel Comput., 21(8):1313-1325, 1995.
[62] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, Nov 1998.
[63] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613-620, 1975.
[64] Nagiza F. Samatova, George Ostrouchov, Al Geist, and Anatoli V. Melechko. RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets. Distrib. Parallel Databases, 11(2):157-180, 2002.
[65] Noam Slonim and Naftali Tishby. Document clustering using word clusters via the information bottleneck method. In SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 208-215, New York, NY, USA, 2000. ACM.
[66] F. De Smet, J. Mathys, K. Marchal, G. Thijs, B. De Moor, and Y. Moreau. Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18(5):735-746, 2002.
[67] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. In KDD-2000 Workshop on Text Mining, 2000.
[68] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res., 3:583-617, 2003.
[69] C. Y. Suen. N-gram statistics for natural language understanding and text processing. In IEEE Trans. on Pattern Analysis and Machine Intelligence, volume 2, pages 164-172, 1979.
[70] Jaideep Vaidya and Chris Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Lise Getoor, Ted E. Senator, Pedro Domingos, and Christos Faloutsos, editors, KDD, pages 206-215. ACM, 2003.
[71] Rui Xu and Donald Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645-678, May 2005.
[72] Xiaowei Xu, Jochen Jäger, and Hans-Peter Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Min. Knowl. Discov., 3(3):263-290, 1999.
[73] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning, pages 412-420, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[74] Mohammed Javeed Zaki. Parallel and distributed data mining: An introduction. In Mohammed Javeed Zaki and Ching-Tien Ho, editors, Large-Scale Parallel Data Mining, volume 1759 of Lecture Notes in Computer Science, pages 1-23. Springer, 1999.
[75] Oren Zamir and Oren Etzioni. Web document clustering: a feasibility demonstration. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 46-54, New York, NY, USA, 1998. ACM.

[76] Bin Zhang, Meichun Hsu, and George Forman. Accurate recasting of parameter estimation algorithms using sufficient statistics for efficient parallel speed-up: Demonstrated for center-based data clustering algorithms. In PKDD '00: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 243-254, London, UK, 2000. Springer-Verlag.
[77] Feng Zhao and Leonidas Guibas. Wireless Sensor Networks: An Information Processing Approach. Morgan Kaufmann, 2004.

