
ENHANCED DOCUMENT CLUSTERING USING FUSION OF MULTISCALE WAVELET DECOMPOSITION

Mahmoud F. Hussin*, Ibrahim El Rube*, and Mohamed S. Kamel**


* Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt
mfarouk@pami.uwaterloo.ca

** Dept. of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada
mkamel@pami.uwaterloo.ca

ABSTRACT

Most term weighting schemes for text document clustering depend on term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of terms in a document, is that in most cases they have limited ability to filter out noise. In this paper, we propose a novel weighting approach using a fusion technique combined with wavelet-based estimation to achieve consistent improvements in clustering. Our approach involves three steps: (1) Term Frequency (TF) weighting, (2) multiple wavelet estimation, and (3) data fusion. Specifically, we apply the wavelet transform with different scales to produce different estimates of the original TF values, and use the fusion of these estimates as new features for clustering the documents. Experiments on clustering documents from the REUTERS corpus verify that our weighting scheme using wavelet and fusion techniques effectively reduces noise and improves clustering performance, evaluated using entropy and F-measure.

1. INTRODUCTION
The growth of the Internet has brought an explosion in the amount of available information, and document clustering plays an important role in helping people organize this vast amount of data. It attempts to organize documents into groups such that documents within a group are more similar to each other than to documents belonging to different groups. Document clustering has many applications, such as providing a summarized view of the information space by grouping documents by topic, clustering search engine results to present organized and understandable results to the user (e.g., Vivisimo), clustering documents in a collection (e.g., digital libraries), and improving the efficiency of information retrieval by focusing on relevant groups (clusters) rather than the whole collection.

The first stage in any document clustering technique is the document representation model. The most common approaches in use for this model are based on the bag-of-words paradigm. In this paradigm, the basic idea is to extract unique words from the set of documents, treat these words as features, and then represent each document as a vector of features. Each feature in this vector is assigned a weight computed in one of several ways: binary, term frequency, or term frequency/inverse document frequency (tf/idf), depending on the application. We therefore believe that improving the weighting scheme will improve the accuracy of the clustering process. These schemes work well for clustering text documents, since text document collections usually contain focused and mostly relevant vocabulary. However, the feature space usually contains various kinds of noise features, due to the nature of text.

During the last decade, the Wavelet Transform (WT) has become an important and powerful tool in many fields, including signal processing, analysis, denoising, and data mining. The WT is capable of providing a simultaneous time-frequency representation of the signal. Furthermore, the WT hierarchically decomposes the signal into detail and approximation information. This multiscale coarse-to-fine structure retains different features of the signal [1][2]. The approximation coefficients hold the global information, whereas the detail coefficients contain the local features. Fusion, in turn, is a technique that combines results achieved by different systems to form enhanced results.

The document representation used in this work builds on our work in [3] and [4], where we considered phrase-based document clustering with a flat Self Organizing Map (SOM) [5] and HSOM [6], and proposed a combination of SOM and Adaptive Resonance Theory (ART) [7] in HSOMART [8]. In [9] we presented the Two-Level SOMART (TLSOMART) clustering method, which uses multiple SOMs in the first level to map the original high-dimensional vector space to a reduced one based on its output clusters, and then uses ART in the second level as the clustering algorithm to improve the quality of the document clusters. In [10] we presented a neural-network-based document clustering method using word clusters, which uses SOM to cluster the words into word_clusters based on the occurrence of words in the documents, and then clusters the documents using ART with word_clusters instead of words as features, achieving high-quality document clustering.

In this paper, we present a novel term weighting scheme using wavelet and fusion. We perform wavelet estimation with different scales on the TF values and obtain different estimated values. Finally, a conventional fusion method is applied to produce combined results. The proposed method has two key benefits: first, the wavelet transform, with its denoising and estimating capability, can produce different estimates of the TF values; second, the fusion technique performs well with the different values generated by the WT, which improves the quality of the document clusters. The rest of this paper is organized as follows: Section 2 presents our proposed weighting scheme using fusion and wavelet. Section 3 discusses the experimental results. Finally, we conclude and give directions for future work in Section 4.
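As a concrete illustration of the bag-of-words representation and TF weighting discussed above, the following Python sketch builds a word-by-document term-frequency matrix. The tokenizer, the sample documents, and the function name are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def build_tf_matrix(documents):
    """Build a word-by-document term-frequency (TF) matrix.

    Rows correspond to unique words (the features), columns to
    documents; entry (i, k) counts occurrences of word i in
    document k.
    """
    # Simplified tokenization: lowercase, whitespace split
    # (the paper additionally applies stopword removal and stemming).
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(documents) for _ in vocab]
    for k, toks in enumerate(tokenized):
        for w, count in Counter(toks).items():
            matrix[index[w]][k] = count
    return vocab, matrix

docs = ["wavelet fusion improves clustering",
        "wavelet transform denoises the signal"]
vocab, tf = build_tf_matrix(docs)
```

Binary or tf/idf weighting would replace the raw counts in the final loop; the rest of the construction is unchanged.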

2. SOM BASED DOCUMENT CLUSTERING USING WAVELET FUSION

In this section, we introduce our proposed weighting scheme using the WT and fusion. The key idea of our method is to combine the capability of the WT to approximate the noisy TF weighting values at different scales with a fusion step that produces a weighting value of high accuracy, as shown in Fig. 1. Given a set of documents D, the two-step weighting method starts by preparing the word-by-document matrix W = (wik), where wik is the TF weight of word i in document k. In the first step, the TF values are approximated using the wavelet with different scales, so each word i has several approximated TF values. In the second step, a fusion technique is applied to these approximated TF values to produce a high-quality weighting value for word i. Finally, to produce the final clusters using the enhanced weighting scheme, we apply the SOM clustering algorithm and show that our method leads to higher clustering quality. The weighting scheme using wavelet and fusion is summarized in Algorithm 1.

Figure 1: The Proposed Document Clustering Process using Wavelet Fusion.

Algorithm 1: Wavelet Fusion Weighting Scheme
Given a document collection D:
1. Prepare the word-by-document matrix A of set D using the TF weighting scheme.
2. Apply multiple wavelets with different coefficient scales to matrix A to approximate each value wik by j values (wik1, wik2, ..., wikj).
3. Construct the new word-by-document matrix B of set D using the MIN/MAX/AVG fusion/selection methods:
   For each document k
      For each word i
         Do fusion/selection using (wik1, wik2, ..., wikj).
      End for
   End for
4. Apply the SOM clustering algorithm to matrix B to produce the clusters.
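Steps 2 and 3 of Algorithm 1 can be sketched in NumPy as follows. Since the paper does not list its filter implementation, this sketch uses an à trous scheme with a B3-spline kernel dilated per scale as an illustrative stand-in for the quadratic B-spline filter; the function names, default scales, and averaging fusion are assumptions.

```python
import numpy as np

def atrous_approximation(signal, scale):
    """One à trous approximation at the given scale.

    The B3-spline kernel [1, 4, 6, 4, 1] / 16 is dilated by
    inserting 2**scale - 1 zeros between taps; edge padding keeps
    the output the same length as the input.
    """
    base = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    step = 2 ** scale
    kernel = np.zeros((len(base) - 1) * step + 1)
    kernel[::step] = base
    pad = len(kernel) // 2
    padded = np.pad(signal, pad, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def wavelet_fusion_weights(tf_matrix, scales=(0, 1, 2), fuse=np.mean):
    """Approximate each document's TF vector (one column of the
    word-by-document matrix) at several scales, then fuse the
    stacked approximations elementwise into matrix B."""
    approx = np.stack([
        np.apply_along_axis(atrous_approximation, 0, tf_matrix, s)
        for s in scales
    ])  # shape: (num_scales, num_words, num_docs)
    return fuse(approx, axis=0)
```

Only the approximation coefficients are kept, matching the paper's smoothing-only use of the transform; the detail coefficients are never formed.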

2.1. IMPLEMENTED FUSION/SELECTION FUNCTIONS


In order to compare the performance of the suggested algorithm, three different functions are implemented separately. Two of these functions select either the maximum or the minimum of the WT coefficients. The third is an arithmetic fusion function that averages the WT coefficients over the various scale levels:

Maximum value (MAX) scheme: picks the maximum approximated term frequency value produced by the multiple wavelet transforms with different coefficient scales.

Minimum value (MIN) scheme: picks the minimum approximated term frequency value produced by the multiple wavelet transforms with different coefficient scales.

Average value (AVG) scheme: computes the average of the approximated term frequency values produced by the multiple wavelet transforms with different coefficient scales.
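The three fusion/selection schemes amount to elementwise reductions over the stacked multi-scale approximations. A minimal sketch, assuming the approximated TF matrices are stacked along a leading scale axis:

```python
import numpy as np

# Stacked approximations: shape (num_scales, num_words, num_docs).
# Each slice approx[j] holds the TF values estimated at scale j.

def fuse_max(approx):
    """MAX scheme: keep the largest approximated TF per entry."""
    return np.max(approx, axis=0)

def fuse_min(approx):
    """MIN scheme: keep the smallest approximated TF per entry."""
    return np.min(approx, axis=0)

def fuse_avg(approx):
    """AVG scheme: average the approximated TFs per entry."""
    return np.mean(approx, axis=0)

# Toy example: one word, two documents, three scales.
approx = np.array([[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]])
```

Each function returns a word-by-document matrix of the same shape as a single-scale approximation, which then serves as matrix B in Algorithm 1.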

3. EXPERIMENTAL RESULTS AND EVALUATIONS

We conducted experiments to demonstrate that the proposed weighting scheme provides more valuable features and improves clustering performance. The Self Organizing Map (SOM) was chosen as the clustering algorithm; the WT with different scales was used to approximate the original TF weight of each feature, and fusion was then used to produce the final weighting value. The resulting clusters were evaluated by F-measure and entropy.

3.1. EXPERIMENTAL SETUP

We used 1000 documents from the REUTERS test corpus. Common words were removed using a "stopword" list, and suffixes were removed using the Porter stemmer. The TF representation was used to prepare a word-by-document matrix for the set of 1000 documents; the resulting vector space size is 1000X4411. In the first step, the TF values of the vector space are approximated using the WT with different scales to produce different approximated values. The nonredundant wavelet transform, using the à trous method with quadratic B-spline wavelets, is used to decompose the TF vectors into details and approximations. To obtain the smoothed vectors, only the approximation coefficients, at different scale levels, are considered in this work. The documents were clustered using the SOM algorithm implemented with the SOM-PAK package developed by Kohonen et al. [11]. The configurations of these document clustering techniques were as follows: the SOM was used with dimensions ranging from 16 units (4X4) to 100 units (10X10), with a learning rate of 0.02.

3.2. QUALITY MEASURES

Two measures are widely used in the text mining literature to evaluate the quality of clustering algorithms: cluster entropy and F-measure [12]. Both rely on a labeled test corpus, where each document is assigned a class label. The F-measure combines the precision and recall concepts from information retrieval, where each class is treated as the desired result for a query and each cluster as the actual result for that query; for more details refer to [3]. Cluster entropy uses the entropy concept from information theory and measures the homogeneity of the clusters: lower entropy indicates more homogeneous clusters, and vice versa. In the test corpus, some documents have multiple classes assigned to them. This does not affect the F-measure; however, the cluster entropy can then no longer be calculated. Instead of cluster entropy, we define the class-entropy [3], which measures the homogeneity of a class rather than the homogeneity of the clusters.

3.3. RESULTS ANALYSIS

Basically, we would like to maximize the F-measure and minimize the class-entropy of the clusters to achieve high-quality clustering. In the conducted experiments, we first studied the approximation capability of the wavelet transform applied to the original TF weighting values without fusion, as shown in Fig. 2. The average reduction of the class-entropy using the wavelet approximation applied to TF, rather than the original TF values, is -2.88%, -3.65%, and 1.13% for the WT1, WT2, and WT3 scales respectively. The corresponding improvement in the F-measure is 3.40%, 6.15%, and 20.71% for the WT1, WT2, and WT3 scales respectively. As mentioned previously, the proposed weighting scheme using wavelet and fusion is intended to find valuable features that help improve the clustering results. We further examined this improvement, as shown in Fig. 3, evaluated by F-measure and class-entropy respectively. Similarly, the average reduction of the class-entropy using the proposed weighting scheme instead of TF is -2.78%, -3.26%, and 6.83% for MAX, MIN, and AVG respectively, while the corresponding improvement in F-measure is 7.22%, 6.23%, and 21.92%. It can be seen from the results that the proposed weighting scheme using average fusion (AVG) outperforms the original TF; this may be explained by the noise in the original TF values.
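The two quality measures described in Section 3.2 can be computed as follows. This is a standard formulation of cluster entropy and clustering F-measure in the style of Steinbach et al. [12], not the authors' exact code; the data layout (lists of document ids plus a label map) is an assumption.

```python
from math import log2

def cluster_entropy(clusters, labels):
    """Weighted average entropy over clusters; lower is better.

    clusters: list of lists of document ids
    labels:   dict mapping document id -> class label
    """
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = {}
        for doc in c:
            counts[labels[doc]] = counts.get(labels[doc], 0) + 1
        # Entropy of the class distribution inside this cluster.
        h = -sum((m / len(c)) * log2(m / len(c)) for m in counts.values())
        total += (len(c) / n) * h
    return total

def f_measure(clusters, labels):
    """Overall F-measure: for each class take the best F over all
    clusters, then weight by class size; higher is better."""
    n = sum(len(c) for c in clusters)
    classes = {}
    for c in clusters:
        for doc in c:
            classes.setdefault(labels[doc], set()).add(doc)
    total = 0.0
    for label, members in classes.items():
        best = 0.0
        for c in clusters:
            hits = sum(1 for doc in c if labels[doc] == label)
            if hits == 0:
                continue
            precision = hits / len(c)
            recall = hits / len(members)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (len(members) / n) * best
    return total
```

The class-entropy variant used in the paper inverts the roles of classes and clusters in the entropy computation; the structure of the code is the same.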

[Plots for Figures 2 and 3 omitted: class entropy and F-measure versus the number of clusters (4 to 100), comparing TF against the WT1/WT2/WT3 schemes and against the MAX/MIN/AVG schemes.]

Figure 2: Class entropy and F-measure for the 1000 document set, using the TF, WT1, WT2, and WT3 weighting schemes, clustered by the SOM technique.

Figure 3: Class entropy and F-measure for the 1000 document set, using the TF, MAX, MIN, and AVG weighting schemes, clustered by the SOM technique.

4. CONCLUSION AND FUTURE WORK


In this paper, we proposed a novel weighting scheme using a wavelet and fusion method for clustering. In the first step, it uses the wavelet with different scales to produce multiple approximated values of TF; in the second step, fusion is applied to obtain an enhanced weighting value. This method has some attractive properties, as it combines the capability of the wavelet to approximate and denoise the data with the capability of fusion to produce a good value from different values. Finally, we provided experimental results to validate the effectiveness of our weighting scheme by clustering a set of documents from the REUTERS corpus and comparing it to the TF weighting scheme. The results show that our weighting scheme using wavelet and fusion with the average function (AVG) clearly outperforms TF and improves the clustering results. In future work, we will look at evolving this approach into a cluster-ensemble system to improve the quality and robustness of the clustering solutions. In such a system, we can use the different approximated vector spaces obtained with different coefficient scales as inputs to SOM clustering techniques, and then integrate the different clustering results into a combined one.

REFERENCES

[1] S. Mallat, "A theory for multiresolution signal decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-693, 1989.
[2] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, second edition, 1999.
[3] J. Bakus, M. F. Hussin, and M. Kamel, "A SOM-based document clustering using phrases," in Proc. of the 9th Intl. Conf. on Neural Information Processing, Singapore, pp. 2212-2216, November 2002.
[4] M. F. Hussin, J. Bakus, and M. Kamel, "Enhanced phrase-based document clustering using Self-Organizing Map (SOM) architectures," book chapter in Neural Information Processing: Research and Development, Springer-Verlag, pp. 405-424, May 2004.
[5] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, 1995.
[6] J. Lampinen and E. Oja, "Clustering properties of hierarchical self-organizing maps," Journal of Mathematical Imaging and Vision, pp. 261-272, 1992.
[7] G. A. Carpenter and S. Grossberg, "A massively parallel architecture for a self-organizing neural pattern recognition machine," Computer Vision, Graphics, and Image Processing, vol. 34, pp. 54-115, 1987.
[8] M. F. Hussin and M. Kamel, "Document clustering using hierarchical SOMART neural network," in Proceedings of the 2003 Intl. Joint Conf. on Neural Networks, Portland, Oregon, USA, pp. 2238-2242, July 2003.
[9] M. F. Hussin, M. Kamel, and M. Nagi, "An efficient two-level SOMART document clustering through dimensionality reduction," in ICONIP 2004, India, Springer, pp. 158-165, 2004.
[10] M. F. Hussin and M. Kamel, "Enhanced neural network document clustering using SOM based word clusters," in ICONIP 2005, Taiwan, Springer, pp. 716-720, 2005.
[11] T. Kohonen, J. Kangas, and J. Laaksonen, "SOM-PAK: the self-organizing map program package, ver. 3.1," SOM Programming Team of Helsinki University of Technology, Apr. 1995.
[12] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," KDD-2000 Workshop on Text Mining, 2000.
