
ROBUST MULTI-RESOLUTION WEB USAGE MINING WITH GENETIC NICHE CLUSTERING

Olfa Nasraoui
Dept. of Electrical and Computer Engineering
The University of Memphis
206 Engr. Science Bldg., Memphis, TN 38152
onasraou@memphis.edu

Raghu Krishnapuram
IBM India Research Lab
Block 1, Indian Institute of Technology, Hauz Khas
New Delhi 110016, India
kraghura@in.ibm.com

ABSTRACT

In this paper, we present a new hierarchical clustering technique based on the concept of genetic niches, called Hierarchical Unsupervised Niche Clustering (HUNC), which is considerably faster than its non-hierarchical counterpart (UNC) and offers the advantage of multi-resolution clustering. We use HUNC as part of a complete system of knowledge discovery in Web usage data. Our new approach does not necessitate fixing the number of clusters in advance, is insensitive to initialization, can handle noisy data and general non-differentiable similarity measures, and can provide profiles to match any desired level of detail or resolution. Our experiments show that our algorithm is not only capable of extracting meaningful user profiles on real Web sites, but also discovers associations between distinct URL pages on a site at no additional cost. Unlike content-based association methods, our approach discovers associations between different Web pages based only on the user access patterns and not on the page content. Also, unlike traditional context-blind association discovery methods, HUNC discovers context-sensitive associations.

INTRODUCTION

Manually entered Web user profiles have raised serious privacy concerns, are subjective, and do not adapt to the users' changing interests. Mass profiling, on the other hand, is based on general trends of usage patterns (thus protecting privacy) compiled from all users on a site, and can be achieved by mining or discovering user profiles from the historical data stored in server access logs. Current Web usage mining approaches avoid the feature representation dilemma of Web data by resorting to memory- and computation-intensive relational clustering (which requires the computation of all pairwise dissimilarities) or computation-intensive association rule discovery (because very low support thresholds are needed to discover typical profiles). A classical non-relational approach requires a differentiable dissimilarity measure. However, for data mining (DM) problems, a domain-specific similarity measure should be designed free of any constraints.

Recently (Nasraoui and Krishnapuram, 2000), we presented a new evolutionary approach to robust clustering based on the Unsupervised Niche Clustering algorithm (UNC). UNC seeks dense areas in feature space and determines their number by converting the clustering problem into a multimodal function optimization problem within the context of genetic niching, striving to locate the peaks of niches or subpopulations in the search space. Niching methods (Holland, 1975; Mahfoud, 1992) were designed to identify multiple optima within multimodal domains. Each peak in a multimodal domain can be thought of as a niche. In nature, niches correspond to different subspaces of the environment that can support different types of life such as species or organisms. Genetic optimization makes UNC much less prone to suboptimal solutions than other objective-function-based approaches. Fig. 1 shows the evolution of an initial random population (denoted by square symbols) using UNC, for a noisy data set, toward the correct niches in subsequent generations.

Figure 1: Evolution of the population using UNC: (a) original data set, (b) initial population, (c) population after 30 generations, (d) final extracted centers.

We propose a hierarchical modification of UNC, called HUNC, that departs from the traditional limited flat view of the data and generates instead a hierarchy of clusters that gives more insight into the Web mining process. We use HUNC as part of a complete system of knowledge discovery in Web usage data. Our new approach does not necessitate fixing the number of clusters in advance, can provide profiles to match any desired level of detail or resolution, and requires no analytical derivation of the prototypes. Thus, it can handle a vast array of general subjective, even non-metric dissimilarities, making it suitable for many applications including data and Web mining.

THE KNOWLEDGE DISCOVERY PROCESS OF WEB SESSION PROFILING

The access log for a given Web server consists of a record of all files accessed by users. Each log entry consists of: (i) the user's IP address, (ii) the access time, (iii) the URL of the page accessed, etc. A user session consists of accesses originating from the same IP address within a predefined time period. Each URL in the site is assigned a unique number $j \in \{1, \ldots, N_U\}$, where $N_U$ is the total number of valid URLs.
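The session extraction step described above can be sketched as follows. This is only an illustration under stated assumptions: log entries are taken to be already parsed into (IP address, timestamp in seconds, URL) tuples, the "predefined time period" is read as an inactivity gap, and the names `build_sessions`, `index_urls`, and the 30-minute default are hypothetical rather than taken from the paper. URL numbers are 0-based here.

```python
from collections import defaultdict


def build_sessions(entries, max_gap_seconds=1800):
    """Group (ip, timestamp, url) log entries into per-IP sessions.

    Assumption: a new session starts whenever the gap since the previous
    access from the same IP exceeds max_gap_seconds (hypothetical value).
    """
    by_ip = defaultdict(list)
    for ip, timestamp, url in entries:
        by_ip[ip].append((timestamp, url))

    sessions = []
    for ip in sorted(by_ip):
        current, last_time = [], None
        for timestamp, url in sorted(by_ip[ip]):
            if last_time is not None and timestamp - last_time > max_gap_seconds:
                sessions.append(current)
                current = []
            current.append(url)
            last_time = timestamp
        if current:
            sessions.append(current)
    return sessions


def index_urls(sessions):
    """Assign every distinct URL a unique number (0-based in this sketch)."""
    urls = sorted({url for session in sessions for url in session})
    return {url: j for j, url in enumerate(urls)}


# Toy log: two IP addresses, one long pause for the first one.
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/courses.html"),
    ("10.0.0.2", 30, "/people.html"),
    ("10.0.0.1", 5000, "/faculty.html"),
]
sessions = build_sessions(log)
url_index = index_urls(sessions)
```

A gap-based timeout is only one possible reading; a fixed time window per IP address would change `build_sessions` but not the URL numbering.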
The $i$-th user session is thus encoded as an $N_U$-dimensional binary attribute vector $s^{(i)}$ (one component per valid URL) with the property

$$s^{(i)}_j = \begin{cases} 1 & \text{if the user accessed the $j$-th URL during the $i$-th session} \\ 0 & \text{otherwise.} \end{cases}$$

The ensemble of all sessions extracted from the server log file is denoted $\mathcal{S}$. The similarity measure between two user sessions (Nasraoui et al., 1999, 2000) takes the site's structure into account, and satisfies the desirable property of becoming more stringent as the accessed URLs get farther from the root, because the amount of specificity in user accesses increases correspondingly.

The sessions in cluster $X_i$ are summarized by a typical session profile vector (Nasraoui et al., 1999) $P_i = (P_{i1}, \ldots, P_{iN_U})^t$. The components of $P_i$ are URL relevance weights, estimated by the probability of access of each URL during the sessions of $X_i$:

$$P_{ij} = \frac{|X_{ij}|}{|X_i|}, \qquad X_{ij} = \{ s^{(k)} \in X_i : s^{(k)}_j = 1 \}.$$

They measure the significance of a given URL to the profile. The final profiles can be evaluated based on the average dissimilarity, which for the $i$-th cluster is given by

$$\bar{d}_i = \frac{1}{|X_i|} \sum_{s^{(k)} \in X_i} d_{ik},$$

where $d_{ik}$ is the dissimilarity between session $s^{(k)}$ and the prototype of the $i$-th cluster.
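The encoding, the relevance weights, and the average dissimilarity can be illustrated with a short sketch. All names below are hypothetical. In particular, `placeholder_dissimilarity` is a squared (1 - cosine) stand-in chosen only so that the example runs; it is not the site-structure-aware session measure of Nasraoui et al. (1999, 2000), which is not reproduced in this paper.

```python
import math


def encode_session(session_urls, url_index):
    """Binary attribute vector: component j is 1 if URL j was accessed."""
    vector = [0] * len(url_index)
    for url in session_urls:
        vector[url_index[url]] = 1
    return vector


def profile_vector(cluster):
    """URL relevance weights: fraction of the cluster's sessions hitting each URL."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def placeholder_dissimilarity(a, b):
    """Squared (1 - cosine similarity); a stand-in, NOT the paper's measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(a) * sum(b))  # for binary vectors, |a|^2 == sum(a)
    return (1.0 - dot / norm) ** 2 if norm else 1.0


def average_dissimilarity(cluster, prototype, dissimilarity):
    """Mean dissimilarity between the cluster's sessions and its prototype."""
    return sum(dissimilarity(s, prototype) for s in cluster) / len(cluster)


url_index = {"/": 0, "/courses.html": 1, "/people.html": 2}
cluster = [
    encode_session(["/", "/courses.html"], url_index),
    encode_session(["/"], url_index),
    encode_session(["/", "/people.html"], url_index),
    encode_session(["/", "/courses.html"], url_index),
]
profile = profile_vector(cluster)   # weights for "/", "/courses.html", "/people.html"
prototype = cluster[0]              # a binary session vector used as the prototype
avg = average_dissimilarity(cluster, prototype, placeholder_dissimilarity)
```

Passing the dissimilarity in as a function mirrors the point made in the text: nothing in the profile or evaluation step requires the measure to be differentiable or even a metric.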

Another measure is the robust cardinality, which for the $i$-th cluster $X_i$ is given by

$$N^*_i = \sum_{s^{(k)} \in X_i} w_{ik},$$

where $w_{ik}$ is a robust weight (that is high for inliers/good data and low for outliers/noise).

HIERARCHICAL UNSUPERVISED NICHE CLUSTERING AND ITS APPLICATION TO WEB USAGE MINING

We retain the principal structure of UNC (Nasraoui and Krishnapuram, 2000), except for a few differences that result from the distinct nature of the session data: the solution space for possible session prototypes consists of binary chromosome strings, which are defined to be the binary session attribute vectors, and the new Web session dissimilarity measure is used instead of the Euclidean distance to take the Web site topology into account.

UNC's computational time can be significantly reduced if we perform clustering in a hierarchical mode. In other words, we could cluster smaller subsets of the data using a smaller population size at multiple levels, instead of clustering the entire data set on a single level, which would necessitate a larger population size. The computational complexity of UNC is $O(N_P \times N)$, where $N_P$ is the population size and $N$ is the number of data points to be clustered. Since $N_P$ can usually be a very small fraction of $N$ (typical in the hierarchical mode; example from our experiments: [...] to [...]), this complexity is much lower than that of relational clustering techniques such as Agglomerative Hierarchical Clustering (AHC) (Duda and Hart, 1973) and the closely related graph-theoretic Minimum Spanning Tree (MST) (Duda and Hart, 1973).

The hierarchical clustering is performed recursively, starting from the top level (lowest resolution), until a termination criterion, based on the minimum acceptable size of a cluster, $N_{\min}$, and its maximum allowable mean squared error, $\sigma^2_{\max}$, is met. Let $L$ denote the current level. Let $X^L = \{X^L_1, \ldots, X^L_{C_L}\}$ denote the data set partitioned at level $L$, where $X^L_i$ is the $i$-th cluster found at level $L$. Hierarchical clustering proceeds by re-clustering the data in each of the above clusters in a recursive fashion to obtain the next level, $L + 1$.
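The recursion just described can be sketched as below. This is a structural illustration only: `flat_cluster` stands in for one run of UNC on a subset (UNC itself is not implemented here), the toy data are one-dimensional numbers rather than sessions, and the reading of the termination criterion (split a cluster only while it is larger than the minimum size and its mean squared error exceeds the allowed maximum) is an assumption. The `max_level` guard and all function names are hypothetical.

```python
def hierarchical_cluster(data, flat_cluster, mean_squared_error,
                         min_size, max_mse, level=1, max_level=5):
    """Recursively re-cluster each cluster; return (level, cluster) leaves.

    flat_cluster(data) returns a list of clusters and is a stand-in for a
    single UNC run. A cluster is split further only if it is large enough
    and still too spread out (assumed reading of the termination criterion).
    """
    leaves = []
    for cluster in flat_cluster(data):
        splittable = (len(cluster) > min_size
                      and mean_squared_error(cluster) > max_mse
                      and level < max_level)
        if splittable:
            leaves.extend(hierarchical_cluster(
                cluster, flat_cluster, mean_squared_error,
                min_size, max_mse, level + 1, max_level))
        else:
            leaves.append((level, cluster))
    return leaves


def split_at_mean(values):
    """Toy flat clusterer: cut a list of numbers at its mean."""
    mean = sum(values) / len(values)
    low = [v for v in values if v <= mean]
    high = [v for v in values if v > mean]
    return [part for part in (low, high) if part]


def variance(values):
    """Mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


data = [1.0, 1.2, 1.4, 4.0, 4.2, 9.0, 9.2, 9.4]
leaves = hierarchical_cluster(data, split_at_mean, variance,
                              min_size=2, max_mse=0.5)
```

On this toy input the compact group around 9 is accepted at level 1, while the looser low group is refined at level 2, which is the multi-resolution behaviour the text describes. The final re-partitioning of the data that HUNC performs at the last level is omitted from the sketch.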
Even though the above parameters will eventually determine the number of clusters at the last level of the partitioning, they are not crucial to the performance of HUNC. This is because, unlike classical divisive hierarchical clustering techniques, our approach relies on robust weights to suppress the influence of outliers and data belonging to other clusters, and on a multimodal optimization approach where multiple clusters are sought in parallel at each level. This means that at any given level, HUNC is expected to identify as many good clusters as the population size allows, while classical hierarchical approaches are expected to yield the optimal cluster prototypes only at the optimal level of the partition that corresponds to the known correct number of clusters. Also, contrary to classical hierarchical techniques, HUNC re-partitions the data at the end of clustering (the last level of the hierarchy). Thus, the partition is not subject to any final commitment at each level, one of the well-known pitfalls of hierarchical clustering techniques.

WEB USAGE MINING EXPERIMENTAL RESULTS

The parameters for the robust hierarchical UNC were fixed to the following values: the crossover and mutation probabilities are [...] and [...] respectively. UNC used [...] generations per clustering with a population size of [...]. Since all session dissimilarities are confined in [...], it is reasonable to choose [...]. The profile vectors are displayed in the format {relevance weight - URL} in Table 1, illustrating typical profiles.

MU-CECS1 Data

The 12-day access record (during 1998) of the Web site of the Dept. of Computer Science and Computer Engineering at the University of Missouri, Columbia generated 1703 sessions accessing 369 distinct URLs. The results obtained at levels [...] are summarized in Table 3, showing profiles that reflect typical access patterns: the general outside visitor is captured in profiles 1 and 3; prospective students in profiles 2 and 4; CECS 352 students in profile 7; etc. The quality of these clusters is confirmed by their low average dissimilarity compared to the maximal value of [...].

(i) Robust profiling is obtained by retaining profile members whose robust weights, $w_{ik}$, exceed a given threshold, $w_{\min}$, equal to [...] in our experiments. This allows us to concentrate on the core of each profile by filtering out the noise sessions assigned to the closest profile. Different $w_{\min}$ values generate different $\alpha$-cuts of the profiles when these are viewed as fuzzy sets. The $w_{\min}$-core of the $i$-th profile is defined as

$$X^*_i = \{ s^{(k)} \in X_i : w_{ik} > w_{\min} \}.$$
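The core extraction in (i), together with the robust cardinality defined earlier, amounts to a sum and a threshold filter, sketched below. The robust weights are hypothetical numbers, since the paper does not restate how UNC computes them, and the threshold value is arbitrary. The 20-member cut-off for a weak profile follows the discussion of the results; the function names are illustrative.

```python
def robust_cardinality(weights):
    """Sum of the robust weights of the sessions assigned to a profile."""
    return sum(weights)


def profile_core(sessions, weights, w_min):
    """Keep only the sessions whose robust weight exceeds the threshold."""
    return [s for s, w in zip(sessions, weights) if w > w_min]


def is_weak(core, min_core_size=20):
    """A profile is treated as weak when its core has too few members."""
    return len(core) < min_core_size


session_ids = ["s1", "s2", "s3", "s4", "s5"]
weights = [0.9, 0.8, 0.1, 0.6, 0.2]   # hypothetical robust weights in [0, 1]

core = profile_core(session_ids, weights, w_min=0.5)   # s3 and s5 are filtered out
cardinality = robust_cardinality(weights)
weak = is_weak(core)                                   # only 3 core members here
```

Raising `w_min` shrinks the core toward the densest part of the profile, which is the fuzzy-set reading of the cut mentioned in the text.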


The cores of profiles Nos. [...] and [...] end up having fewer than 20 members, hence making weak profiles. Also, the core of the spurious cluster No. [...] was discovered to contain sessions accessing the site manager's pages.

(ii) Multiresolution profiling: Note how profile 2 (at level 1) in Table 2 is split into many profiles with distinct user interests (at level 2) (profiles No. 5, 6, 7, 8, and 9), as shown in Table 3. The same observation can be made about the first cluster (general inquiries about the CECS department), which at level 2 gets split into profiles No. 1, 2, 3, and 4, with each such profile showing a more specific kind of interest in the department.

(iii) Inferring associations between different URLs: Profile 2 (at level 1) in Table 2 contains accesses to two different courses taught by different professors, signaling an association. It was later revealed that one of the courses (CECS 352: Operating Systems) relies for the implementation of its projects on the C programming language, which is taught in the other course (CECS 333: Object Oriented Design).

CONCLUSION

For Web usage mining, the session dissimilarity measure is not a distance metric, and dealing with relational data is impractical given the huge dimension of the data sets. Therefore, we presented an adaptation of UNC, called Hierarchical Unsupervised Niche Clustering (HUNC), which is considerably faster than its non-hierarchical counterpart. Our new approach does not necessitate fixing the number of clusters in advance, can provide profiles to match any desired level of detail or resolution, and requires no analytical derivation of the prototypes. Thus, it can handle a vast array of general subjective, even non-metric dissimilarities, making it suitable for many applications in data and Web mining. We have illustrated through several examples that our clustering process results in the discovery of associations between different URL addresses on a given site at no additional cost. Also, the associations are meaningful only within well-defined distinct profiles/contexts (context-sensitive), as opposed to all or none of the data (context-blind). This approach of discovering context-sensitive associations via clustering can be generalized to other transactional data.

ACKNOWLEDGMENTS

Partial support of this work by the National Science Foundation Grant IIS 9800899 is gratefully acknowledged.

Table 1: Examples of profiles discovered by HUNC from MU-CECS1 data at [...] and [...]

 .83 - /CECS computer.class
 .95 - /courses.html
 .95 - /courses index.html
 .95 - /courses100.html
 .19 - /courses200.html
 .19 - /people.html
 .19 - /people index.html
 .19 - /faculty.html
 .93 - /
1.00 - /
 .67 - /CECS computer.class

Table 2: 4 of the 7 profiles discovered by HUNC from MU-CECS1 data at level 1

No.  Size  Core size  Robust cardinality  Avg. dissimilarity  Description
1    572   312        362.2               0.32                main page, class list, course enquiries, people and main degree page
2    305   170        191.0               0.54                Dr. Saab's and Dr. Joshi's course pages (CECS 333 and CECS 352 respectively)
3    185   111        124.0               0.2                 Accesses to the CECS227 class pages
4    162   84         102.2               0.37                Dr. Shi's CECS345 pages
5    73    56         51.0                0.08                Dr. Shang's course pages

Table 3: Some of the 16 profiles discovered by HUNC from MU-CECS1 data at level 2 and [...]

No.  Size  Core size  Robust cardinality  Avg. dissimilarity  Description
1    219   132        140.5               0.16                main page, class list, course enquiries and people
2    119   73         77.0                0.27                main page, class list, course and undergraduate degree enquiries
3    140   85         91.6                0.13                Short sessions mostly limited to main page and class list
4    129   71         80.7                0.39                main page, people, individual faculty, research and graduate degree pages
6    133   80         85.2                0.46                Dr. Saab's CECS333 pages (long detailed sessions)
8    47    28         29.4                0.16                Dr. Saab's CECS333 pages (short sessions)
9    53    [...]      33.4                0.19                Dr. Saab's CECS303 pages (long detailed sessions)
10   184   111        123.3               0.2                 Accesses to the CECS227 class pages
11   77    49         49.3                0.27                Dr. Shi's CECS345 (main page and Java examples)
12   47    30         30.0                0.26                Dr. Shi's CECS345 (long sessions: lectures and project No. 1)
13   34    -          22.5                0.19                Dr. Shi's CECS345 (short sessions to main page)

References

Duda, R., Hart, P., 1973. Pattern Classification and Scene Analysis. Wiley Interscience, NY.
Holland, J. H., 1975. Adaptation in Natural and Artificial Systems. MIT Press.
Mahfoud, S. W., Sep. 1992. Crowding and preselection revisited. In: 2nd Conf. Parallel Problem Solving from Nature, PPSN 92. Brussels, Belgium.
Nasraoui, O., Krishnapuram, R., May 2000. A novel approach to unsupervised robust clustering using genetic niching. In: Ninth IEEE International Conference on Fuzzy Systems. San Antonio, TX, pp. 170-175.
Nasraoui, O., Krishnapuram, R., Frigui, H., Joshi, A., 2000. Extracting web user profiles using relational competitive fuzzy clustering. To appear in International Journal of Artificial Intelligence.
Nasraoui, O., Krishnapuram, R., Joshi, A., Jun. 1999. Mining web access logs using a relational clustering algorithm based on a robust estimator. In: NAFIPS Conference. New York, NY, pp. 705-709.
