Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt
mfarouk@pami.uwaterloo.ca
** Dept. of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada
mkamel@pami.uwaterloo.ca
ABSTRACT
Most term weighting schemes for text document clustering depend on term-frequency-based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of terms in a document, is their limited ability to filter out noise. In this paper, we propose a novel weighting approach that combines a fusion technique with wavelet-based estimation to achieve consistent improvements in clustering. Our approach involves three steps: (1) Term Frequency (TF) weighting, (2) estimation with multiple wavelets, and (3) data fusion. Specifically, we apply the wavelet at different scales to produce different estimates of the original TF values, and use the fusion of these estimates as new features for clustering the documents. Experiments on clustering documents from the Reuters corpus verify that our weighting schemes using wavelet and fusion techniques effectively reduce noise and improve clustering performance, as evaluated using entropy and F-measure.
1. INTRODUCTION
The growth of the Internet has brought an explosion in the amount of available information, and document clustering plays an important role in helping people organize this vast amount of data. It attempts to organize documents into groups such that documents within a group are more similar to each other than to documents belonging to different groups. Document clustering has many applications, such as providing a summarized view of the information space by grouping documents by topic, clustering search engine results to present organized and understandable results to the user (e.g. Vivisimo), clustering documents in a collection (e.g. digital libraries), and improving the efficiency of information retrieval by focusing on relevant groups (clusters) rather than the whole collection.
The first stage in any document clustering technique is the document representation model. The most common approaches for this model are based on the bag-of-words paradigm. The basic idea is to extract the unique words from the set of documents, treat these words as features, and represent each document as a vector of features. Each feature in this vector is assigned a weight in one of several ways: binary, term frequency, or term frequency/inverse document frequency (tf/idf), depending on the application. We therefore believe that an improvement in the weighting scheme will improve the accuracy of the clustering process. These schemes work well for clustering text documents, since text document collections usually contain focused and mostly relevant vocabulary. However, owing to the nature of text, the feature space usually also contains various kinds of noise features.
During the last decade, the Wavelet Transform (WT) has become an important and powerful tool in many fields, including signal processing, analysis, denoising, and data mining. WT is capable of providing a simultaneous time-frequency representation of a signal. Furthermore, WT hierarchically decomposes the signal into detail and approximation information. This multiscale coarse-to-fine structure retains different features of the signal [1][2]. The approximation coefficients hold the global information, whereas the detail coefficients contain the local features. Fusion, in turn, is a technique that combines results achieved by different systems to form enhanced results.
The document representation used in this work builds on our earlier work. In [3] and [4], we considered phrase-based document representation for clustering with a flat Self-Organizing Map (SOM) [5] and a hierarchical SOM (HSOM) [6], and proposed a combination of SOM and Adaptive Resonance Theory (ART) [7] in HSOMART [8]. In [9] we presented the Two-Level SOMART (TLSOMART) clustering method, which uses multiple SOMs in the first level to map the original high-dimensional vector space to a reduced one based on its output clusters, and then ART in the second level as the clustering algorithm to improve the quality of the document clusters. In [10] we presented neural-network-based document clustering using word clusters, which uses SOM to cluster the words into word_clusters based on the occurrence of words in the documents, and then clusters the documents with ART using word_clusters instead of words as features to achieve high-quality document clustering.
In this paper, we present a novel term weighting scheme using wavelets and fusion. We perform wavelet estimation with different scales on the TF values and obtain different estimated values. Finally, a conventional fusion method is applied to produce combined results. The proposed method has two key benefits. First, the wavelet transform, with its denoising and estimating capability, can produce different estimates of the TF values. Second, the fusion technique performs well with the different values generated by the WT, which improves the quality of the document clusters.
The rest of this paper is organized as follows: Section 2 presents our proposed weighting scheme using fusion and wavelets. Section 3 discusses the experimental results. Finally, we conclude and give directions for future work in Section 4.
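As an illustration of the bag-of-words representation with TF weighting described above, the following minimal sketch builds a word-by-document matrix from raw text. The function name `build_tf_matrix`, the whitespace tokenizer, and the sample documents are assumptions made for this example only; the paper does not specify its preprocessing.

```python
from collections import Counter


def build_tf_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][k] is the raw count
    of word i in document k (word-by-document layout)."""
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[count[word] for count in counts] for word in vocabulary]
    return vocabulary, matrix


# Hypothetical toy collection, not taken from the paper's corpus.
docs = [
    "wavelet transform denoising",
    "document clustering clustering",
    "wavelet document",
]
vocab, A = build_tf_matrix(docs)
for word, row in zip(vocab, A):
    print(word, row)
```

Each row corresponds to one word and each column to one document, so a document's feature vector is a column of the matrix.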
Algorithm 1: Wavelet Fusion Weighting Scheme
Given a document collection D:
1. Prepare the word-by-document matrix A of D using the TF weighting scheme.
2. Apply multiple wavelets with different coefficient scales to matrix A to approximate each value $w_{ik}$ by j values $(w_{ik}^{1}, w_{ik}^{2}, \ldots, w_{ik}^{j})$.
3. Construct the new word-by-document matrix B of D using the MIN/MAX/AVG fusion/selection methods:
   For each document k
     For each word i
       Do fusion/selection using $(w_{ik}^{1}, w_{ik}^{2}, \ldots, w_{ik}^{j})$.
     End for
   End for
4. Apply the SOM clustering algorithm to matrix B to produce clusters.
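Steps 2 and 3 of Algorithm 1 can be sketched as follows, under assumptions that go beyond what this section states. The wavelet family, the axis along which the wavelet is applied, and the exact meaning of "coefficient scales" are not given here. The sketch assumes a Haar wavelet applied along the word axis of each document column, and takes the level-L estimate to be the signal rebuilt from the level-L approximation alone, which for Haar is the mean of each block of 2^L consecutive values. The names `haar_approximation` and `wavelet_fusion_weights` are hypothetical.

```python
def haar_approximation(signal, level):
    """Haar approximation at the given level: every block of 2**level
    consecutive samples is replaced by the block mean.
    Assumption: a trailing partial block is averaged over its own length."""
    block = 2 ** level
    smoothed = []
    for start in range(0, len(signal), block):
        chunk = signal[start:start + block]
        mean = sum(chunk) / len(chunk)
        smoothed.extend([mean] * len(chunk))
    return smoothed


# Fusion/selection rules named in Algorithm 1.
FUSION = {
    "MIN": min,
    "MAX": max,
    "AVG": lambda values: sum(values) / len(values),
}


def wavelet_fusion_weights(tf_matrix, levels=(1, 2, 3), method="AVG"):
    """tf_matrix[i][k] is the TF of word i in document k.
    Returns matrix B of the same shape holding the fused estimates."""
    fuse = FUSION[method]
    n_words, n_docs = len(tf_matrix), len(tf_matrix[0])
    fused = [[0.0] * n_docs for _ in range(n_words)]
    for k in range(n_docs):
        # Assumption: the wavelet runs over the words of one document.
        column = [tf_matrix[i][k] for i in range(n_words)]
        estimates = [haar_approximation(column, level) for level in levels]
        for i in range(n_words):
            fused[i][k] = fuse([estimate[i] for estimate in estimates])
    return fused


# Hypothetical 4-word, 2-document TF matrix.
A = [[4, 0], [0, 2], [0, 2], [0, 0]]
B_max = wavelet_fusion_weights(A, levels=(1, 2), method="MAX")
B_min = wavelet_fusion_weights(A, levels=(1, 2), method="MIN")
B_avg = wavelet_fusion_weights(A, levels=(1, 2), method="AVG")
```

The default of three levels mirrors the three wavelet variants (WT1, WT2, WT3) named in the figures. Other wavelet families or boundary conventions would give different estimates.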
The SOM was used with map sizes ranging from 16 units (4×4) to 100 units (10×10), with a learning rate of 0.02.
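For readers unfamiliar with SOM clustering, a minimal sketch of a rectangular SOM with a fixed learning rate of 0.02 is given below. Only the map sizes and the learning rate come from the text. The Gaussian neighbourhood, the linearly shrinking radius, the number of epochs, the random initialisation, and the function names `train_som` and `best_matching_unit` are assumptions for illustration, not the configuration used in the experiments.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Index of the map unit whose weight vector is closest to `vector`."""
    return min(
        range(len(weights)),
        key=lambda u: sum((w - x) ** 2 for w, x in zip(weights[u], vector)),
    )


def train_som(vectors, rows=4, cols=4, learning_rate=0.02, epochs=20, seed=0):
    """Train a rows x cols SOM and return one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    initial_radius = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Assumption: neighbourhood radius shrinks linearly, floored at 0.5.
        radius = max(initial_radius * (1.0 - epoch / epochs), 0.5)
        for vector in vectors:
            bmu = best_matching_unit(weights, vector)
            bmu_row, bmu_col = divmod(bmu, cols)
            for unit, weight in enumerate(weights):
                row, col = divmod(unit, cols)
                grid_dist_sq = (row - bmu_row) ** 2 + (col - bmu_col) ** 2
                influence = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
                for d in range(dim):
                    weight[d] += learning_rate * influence * (vector[d] - weight[d])
    return weights


# Hypothetical document vectors (one per document, i.e. columns of matrix B).
document_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
som_weights = train_som(document_vectors)
assignments = [best_matching_unit(som_weights, v) for v in document_vectors]
```

Each document is assigned to the cluster of its best matching unit, so a 4×4 map yields at most 16 clusters and a 10×10 map at most 100.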
[Figure plots not recovered. Axes: class entropy and F-measure (y) versus number of clusters (x, 4 to 100); legend: TF, WT1, WT2, WT3.]
Figure 2: Class entropy and F-measure for 1000 document set, using TF, WT1, WT2, and WT3 weighting schemes and clustered by SOM technique.
Figure 3: Class entropy and F-measure for 1000 document set, using TF, MAX, MIN, and AVG weighting schemes and clustered by SOM technique.
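The two quality measures reported in the figures can be computed as in the sketch below. The exact definitions used in the experiments are not given in this section, so the sketch assumes the common formulations from Steinbach et al. [12]: entropy as the size-weighted average over clusters of the entropy of the class distribution (lower is better), and F-measure as the class-size-weighted average of the best F value each class reaches in any cluster (higher is better). The natural logarithm and the function names are also assumptions.

```python
import math
from collections import Counter


def clustering_entropy(labels, clusters):
    """Size-weighted average of the class-distribution entropy per cluster."""
    n = len(labels)
    total = 0.0
    for cluster in set(clusters):
        members = [label for label, c in zip(labels, clusters) if c == cluster]
        size = len(members)
        entropy = -sum(
            (m / size) * math.log(m / size) for m in Counter(members).values()
        )
        total += size / n * entropy
    return total


def overall_f_measure(labels, clusters):
    """Class-size-weighted average of the best F value of each class."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for cluster, n_j in cluster_sizes.items():
            n_ij = joint[(cls, cluster)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i / n * best
    return total


# Hypothetical labels: a perfect clustering and a single-cluster baseline.
true_labels = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]
single = [0, 0, 0, 0]
```

With these toy labels, the perfect clustering has entropy 0 and F-measure 1, while placing everything in one cluster gives entropy log 2 and F-measure 2/3.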
using different coefficient scales as input to the SOM clustering techniques, and then integrating the different clustering results into a combined one.
REFERENCES
[1] S. Mallat, A theory for multiresolution signal decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-693, 1989.
[2] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, second edition, 1999.
[3] J. Bakus, M.F. Hussin, and M. Kamel, A SOM-Based Document Clustering using Phrases, In Proc. of the 9th Intl. Conf. on Neural Information Processing, Singapore, pp. 2212-2216, November 2002.
[4] M.F. Hussin, J. Bakus, and M. Kamel, Enhanced phrase-based document clustering using Self-Organizing Map (SOM) architectures, Book chapter in: Neural Information Processing: Research and Development, Springer-Verlag, pp. 405-424, May 2004.
[5] T. Kohonen, Self-organizing maps, Springer-Verlag, Berlin, 1995.
[6] J. Lampinen and E. Oja, Clustering properties of hierarchical self-organizing maps, Journal of Mathematical Imaging and Vision, pp. 261-272, 1992.
[7] G.A. Carpenter and S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine, Computer Vision, Graphics, and Image Processing, vol. 34, pp. 54-115, 1987.
[8] M.F. Hussin and M. Kamel, Document clustering using hierarchical SOMART neural network, In Proc. of the 2003 Intl. Joint Conf. on Neural Networks, Portland, Oregon, USA, pp. 2238-2242, July 2003.
[9] M.F. Hussin, M. Kamel, and M. Nagi, An Efficient Two-level SOMART Document Clustering Through Dimensionality Reduction, In ICONIP 2004, India, Springer, pp. 158-165, 2004.
[10] M.F. Hussin and M. Kamel, Enhanced Neural Network Document Clustering Using SOM based Word Clusters, In ICONIP 2005, Taiwan, Springer, pp. 716-720, 2005.
[11] T. Kohonen, J. Kangas, and J. Laaksonen, SOM-PAK: the self-organizing map program package ver. 3.1, SOM programming team of Helsinki University of Technology, Apr. 1995.
[12] M. Steinbach, G. Karypis, and V. Kumar, A comparison of document clustering techniques, KDD 2000 Workshop on Text Mining, 2000.