
A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites


Olfa Nasraoui, Member, IEEE, Maha Soliman, Member, IEEE, Esin Saka, Member, IEEE,
Antonio Badia, Member, IEEE, and Richard Germain
Abstract: In this paper, we present a complete framework and findings in mining Web usage patterns from Web log files of a real Web
site that has all the challenging aspects of real-life Web usage mining, including evolving user profiles and external data describing an
ontology of the Web content. Even though the Web site under study is part of a nonprofit organization that does not sell any products,
it was crucial to understand who the users were, what they looked at, and how their interests changed with time, all of which are
important questions in Customer Relationship Management (CRM). Hence, we present an approach for discovering and tracking
evolving user profiles. We also describe how the discovered user profiles can be enriched with explicit information need that is inferred
from search queries extracted from Web log data. Profiles are also enriched with other domain-specific information facets that give a
panoramic view of the discovered mass usage modes. An objective validation strategy is also used to assess the quality of the mined
profiles, in particular their adaptability in the face of evolving user behavior.
Index Terms: Mining evolving clickstreams, user profiles, Web usage mining, semantic Web mining.

1 INTRODUCTION
Customer Relationship Management (CRM) can use data
from within and outside an organization to allow an
understanding of its customers on an individual basis or on
a group basis such as by forming customer profiles. An
improved understanding of the customers' habits, needs,
and interests can allow the business to profit by, for
instance, cross selling or selling items related to the ones
that the customer wants to purchase. Hence, reliable
knowledge about the customers' preferences and needs
forms the basis for effective CRM. As businesses move
online, the competition between businesses to keep the
loyalty of their old customers and to lure new customers is
even more important, since a competitor's Web site may be
only one click away. The fast pace and large amounts of
data available in these online settings have recently made it
imperative to use automated data mining or knowledge
discovery techniques to discover Web user profiles. These
different modes of usage or the so-called mass user profiles
can be discovered using Web usage mining techniques that
can automatically extract frequent access patterns from the
history of previous user clickstreams stored in Web log files.
These profiles can later be harnessed toward personalizing
the Web site to the user or to support targeted marketing.
Although there have been considerable advances in Web
usage mining, there have been no detailed studies present-
ing a fully integrated approach to mine a real Web site with
the challenging characteristics of today's Web sites, such as
evolving profiles, dynamic content, and the availability of
taxonomy or databases in addition to Web logs.
In this paper, we present a complete framework and a
summary of our experience in mining Web usage patterns
with real-world challenges such as evolving access
patterns, dynamic pages, and external data describing an
ontology of the Web content and how it relates to the
business actors (in the case of the studied Web site, the
companies, contractors, consultants, etc., in corrosion). The
Web site in this study is a portal that provides access to
news, events, resources, company information (such as
companies or contractors supplying related products and
services), and a library of technical and regulatory
documentation related to corrosion and surface treatment.
The portal also offers a virtual meeting place between
companies or organizations seeking information about
other companies or organizations. Without loss of general-
ity, in the rest of this paper, we will refer to all the Web
site participants (organizations, contractors, consultants,
agencies, corporations, centers, etc.) simply as
"companies." The Web site in our study is managed by a
nonprofit organization that does not sell anything but only
provides free information that is ideally complete, accu-
rate, and up to date. Hence, it was crucial to understand
the different modes of usage and to know what kind of
information the visitors seek and read on the Web site and
how this information evolves with time. For this reason,
we perform clustering of the user sessions extracted from
the Web logs to partition the users into several homo-
geneous groups with similar activities and then extract
user profiles from each cluster as a set of relevant URLs.
This procedure is repeated in subsequent new periods of
Web logging (such as biweekly), then the previously
202 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 2, FEBRUARY 2008
• O. Nasraoui, M. Soliman, E. Saka, and A. Badia are with the Department
of Computer Engineering and Computer Science, Speed School of
Engineering, University of Louisville, Louisville, KY 40292.
E-mail: {olfa.nasraoui, masoli01, esin.saka, abadia}@louisville.edu.
• R. Germain is with the College of Business, University of Louisville,
154 College of Business, Louisville, KY 40292.
E-mail: richard.germain@louisville.edu.
Manuscript received 21 Feb. 2006; revised 12 Oct. 2006; accepted 10 Aug.
2007; published online 4 Sept. 2007.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-0088-0206.
Digital Object Identifier no. 10.1109/TKDE.2007.190667.
1041-4347/08/$25.00 © 2008 IEEE Published by the IEEE Computer Society
discovered user profiles are tracked, and their evolution
pattern is categorized. When clustering the user sessions,
we exploit the Web site hierarchy to give partial weights in
the session similarity between URLs that are distinct and
yet located closer together on this hierarchy. The Web site
hierarchy is inferred both from the URL address and from
a Web site database that organizes most of the dynamic
URLs along an "is-a" ontology of items. We also enrich the
cluster profiles with various facets, including search
queries submitted just before landing on the Web site,
and inquiring and inquired companies, in case users from
(inquiring) companies inquire about any of the (inquired)
companies listed on the Web site, which provide related
services.
The rest of this paper is organized as follows: In Section 2,
we present an overview of Web usage mining, in particular
advances involving semantics and profile evolution. In
Section 3, we describe our approach to profile discovery
using Web usage mining. In Section 4, we discuss our
approach for handling dynamic content and exploiting
external data that describes an ontology of the Web content
derived from the server database. In Section 5, we discuss
our approach for tracking evolving user profiles. In Section 6,
we present a systematic and objective validation strategy for
the discovered user profiles. In Section 7, we present our
results in mining evolving user profiles. Finally, in Section 8,
we present our conclusions.
2 AN OVERVIEW OF WEB USAGE MINING
Recently, data mining techniques have been applied to
extract usage patterns from Web log data [1], [2], [3], [4], [5].
This process, known as Web usage mining, is traditionally
performed in several stages [1], [3] to achieve its goals:
1. collection of Web data such as activities/clickstreams
recorded in Web server logs,
2. preprocessing of Web data such as filtering crawlers'
requests, requests to graphics, and identifying
unique sessions,
3. analysis of Web data, also known as Web Usage
Mining [4], to discover interesting usage patterns or
profiles, and
4. interpretation/evaluation of the discovered profiles.
In this paper, we further added a fifth step after a
repetitive application of steps 1-4 on multiple time
periods, i.e.,
5. tracking the evolution of the discovered profiles.
Web usage mining can use various data mining or
machine learning techniques to model and understand Web
user activity. In [6], clustering was used to segment user
sessions into clusters or profiles that can later form the basis
for personalization. In [7], the notion of an adaptive Web
site was proposed, where the users' access pattern can be
used to automatically synthesize index pages. The work in
[1] is based on using association rule discovery as the basis
for modeling Web user activity, whereas the approach
proposed in [8] used probabilistic grammars to model Web
navigation patterns for the purpose of prediction. The
approach in [9] proposed building data cubes from Web log
data and later applying Online Analytical Processing
(OLAP) and data mining on the cube model. Web
Utilization Miner (WUM) was presented [5] to discover
navigation patterns with user-specified characteristics over
an aggregated materialized view of the Web log, consisting
of a trie of sequences of Web views. New fuzzy relational
clustering techniques were used to discover user profiles
that can overlap [4], whereas robust clustering [3] was
proposed to mine profiles that are resistant to noise that is
naturally present in clickstream data. A robust density-
based evolutionary clustering technique was proposed to
discover an optimal number of multiresolution and robust
user profiles [10]. Many Web usage mining approaches are
surveyed in [4].
2.1 Handling Profile Evolution
Most previous research efforts in Web usage mining have
worked with the assumption that the Web usage data is
static. However, the dynamic aspects of Web usage have
recently become important. This is because Web access
patterns on a Web site are dynamic due not only to the
dynamics of Web site content and structure but also to
changes in the users' interests and, thus, their navigation
patterns. Thus, it is desirable to study and discover Web
usage patterns at a higher level, where such dynamic
tendencies and temporal events can be distinguished.
Mining evolving clickstreams is the subject of only a few
recent research efforts [11], [12], [13]. In [11], an
immune-system-inspired approach, called Tracking Evolving Clusters
in NOisy Streams (TECNO-STREAMS), was proposed
to continuously learn and adapt to new incoming patterns
by detecting an unknown number of clusters in evolving
noisy data in a single pass. This stream summary or
synopsis consists of a set of cluster representatives with
properties such as scale and age. Apart from the recent
interest in studying evolution in Web clickstreams, there
have been several research efforts in machine learning
regarding the related notion of concept drift. According to
Maloof and Michalski [14], learning evolving concepts adds
another layer of difficulty to the process of online learning,
since concepts can no longer be assumed to be constant. In
an evolving scenario, with time, past training examples may
become obsolete and therefore need to be replaced by more
recent examples. One of the earliest works [14] presented a
method for selecting training examples for a partial
memory learning system, which was later extended in
[15], by using a time-based forgetting function to remove
examples that are older than a certain age from a partial
memory. Within the area of personalization, Mitchell et al.'s
Personal Assistant [16] trained decision trees to learn how
an individual's meetings can be scheduled in a personalized
calendar. A time window was used to confine and adapt the
training samples for learning changing user preferences.
NewsDude [17] is an intelligent agent built to adapt to
changing users' interests by learning two separate user
models that represent short-term and long-term interests.
The short-term model is learned from the most recent
observations only, whereas the long-term (default) model
represents the user's general preferences. In [18], a user
profiling system was developed based on monitoring the
user's Web browsing and e-mail habits. This system used a
NASRAOUI ET AL.: A WEB USAGE MINING FRAMEWORK FOR MINING EVOLVING USER PROFILES IN DYNAMIC WEB SITES 203
clustering algorithm to group user interests into several
interest themes, and the user profiles had to adapt to
changing interests of the users over time.
Maloof and Michalski [15] classified online learning in
the presence of concept drift as either evolutionary or
revolutionary with regard to adaptation to change. An
evolutionary scheme modifies existing knowledge based on
completely new training examples (for example, STAGGER
[19]), whereas a revolutionary approach discards old knowl-
edge and learns new knowledge from the new training
examples (for example, window-based techniques [20]). A
third approach includes hybrids that inherit from both the
revolutionary and evolutionary approaches. For instance,
Mitchell's Calendar Learning Apprentice [16] learns new
decision rules from training data and incorporates these
new rules into the existing knowledge base. Maloof and
Michalski [14] further classified the way online learning
systems work into three different modes: no memory, partial
memory, or full memory. In the no-memory mode, the system
does not use any past training examples for updating the
current model (for example, STAGGER [19]), whereas in the
partial-memory mode, a subset of the previously seen
training examples is used for later learning. Finally, in the
full-memory mode, all past training examples are used in
updating an existing model. A continuum between no
memory and full memory (gradual forgetting) used a
forgetting function-based approach in supervised learning
[21] and to cluster evolving streams [11].
It is important to note that apart from [18] (which was
limited to a small number of attributes and users), all of the
above approaches were proposed within a supervised
learning framework (classification) or focused on adaptation
to a single user (predicting whether an object is relevant or
not). On the other hand, the work that we present in this
paper is based on an unsupervised learning framework that
tries to learn mass anonymous user profiles on the server side.
Nonetheless, according to Maloof and Michalski's categorization
of concept drift systems [14], our proposed system
can be categorized as a no-memory revolutionary user profile
mining approach. However, the user profile tracking and
validation approach works in the full-memory mode. Further-
more, in this paper, we are more interested in quantifying
and categorizing or annotating the various types of evolution
(not only detecting evolution and adapting to it), and this,
in turn, can form a higher level of knowledge, in addition to
the description of the profiles themselves as user models.
We adopt an approach based on periodical batch mining
that has the advantage of being easy to adapt to use any
other unsupervised learning tool that automatically dis-
covers clusters in static or dynamic data. In this work, we
use the full memory (periodical or window based), in part,
because our goal was to describe the user profiles in certain
periodical increments (about two weeks each). Hence, it
was essential to fully mine the Web logs from each period
and then compare the subsequent results.
2.2 Integrating Semantics in Web Usage Mining
Relying only on Web usage data for user modeling or for
personalization can be inefficient, either when there is
insufficient usage data for the purpose of mining certain
patterns or when new pages are added and thus do not
accumulate sufficient usage data at first. The lack of usage
data in these cases can be compensated by adding other
information such as the content of Web pages [22] or the
structure of a Web site [2], [3]. In [22], the keywords that
appear in Web pages are used to generate document
vectors, which are later clustered in the document space
to further augment user profiles. In [2], [3], [10], the Web
site's own hierarchical structure is treated like an implicit
taxonomy or concept hierarchy that is exploited in
computing the similarity between any two Web pages on
the Web site. This allows a better comparison between
sessions that contain visits to Web pages that are different
and yet semantically related (for example, under the same
more general topic). The idea of exploiting concept
hierarchies or taxonomies has already been found to
enhance association rule mining [23] and to facilitate
information searching in textual data [24]. Even though
keywords that are present in the Web pages have been used
to add a content aspect to usage data, the keyword-based
approach remains incapable of capturing more complex
relationships at a deeper semantic level. Thus, in [25], a
general framework was proposed for using domain ontolo-
gies to automatically characterize usage profiles containing
a set of structured Web objects.
The advent of dynamic URLs mostly in tandem with Web
databases has recently made it even more difficult to
interpret URLs in terms of user behavior, interests, and
intentions. For instance, consider the following cryptic
association rule within the context of an online bookstore
[26]: "If http://www.the_shop.com/show.html?item=123,
then http://www.the_shop.com/show.html?item=456,
support = 0.05, and confidence = 0.4." A more meaningful rule
would be "users who bought Hamlet also tended to buy How
to Stop Worrying and Start Living." This, in turn, has
motivated the work in [26], which mined patterns of application events
instead of patterns of URLs by exploiting the semantics of the
visited pages. Within this spirit, service-based concept
hierarchies were introduced earlier [27] for analyzing the
search behavior of visitors, that is, how they navigate rather
than what they retrieve. In this case, concept hierarchies
form the basic method of grouping Web pages together
before Web usage mining. In [26], usage mining was
enhanced by describing the user behavior in terms of an
ontology underlying a particular Web site. The semantic
annotation of the Web content was assumed to have been
performed a priori, since the Web site in question was a
knowledge portal with an inherent RDF annotation. In order
to mine interesting patterns, first, the Web logs were
semantically enriched with ontology concepts. Then, these
semantic Web logs were mined to extract patterns such as
groups of users, users' preferences, and rules. Following a
similar approach, in [28], Web usage logs were enriched with
semantics derived from the content of the Web site's pages.
Content keywords were first mapped to the categories of a
manually constructed domain-specific taxonomy through
the use of a thesaurus, and then the Web documents were
clustered based on the taxonomy categories. The enhanced
Web logs, called C-Logs, were then used as input to Web
usage mining.
Most of the efforts cited above rely on an
explicit taxonomy that needs to be handcrafted by an expert
before the analysis. On the other hand, the implicit
taxonomy, as used in [2], [3], is inferred automatically and
quickly from the Web site directory structure via URL
tokenization. Furthermore, this implicit taxonomy does not
require any modification to the underlying data mining
algorithm, since it is only incorporated within the similarity
measure used to cluster the user sessions. In this paper, we
will exploit both an implicit taxonomy as inferred from the
Web site directory structure and an explicit taxonomy as
inferred from data that is external to the Web logs and that
is already available with the Web site's content database.
3 PROFILE DISCOVERY BASED ON WEB
USAGE MINING
The framework for our Web usage mining and a road map
to the rest of this paper is summarized in Fig. 1, which starts
with the integration and preprocessing of Web server logs
and server content databases, includes data cleaning and
sessionization, and then continues with the data mining/
pattern discovery via clustering. This is followed by a
postprocessing of the clustering results to obtain Web user
profiles and finally ends with tracking profile evolution.
The automatic identification of user profiles is a knowledge
discovery task consisting of periodically mining new
contents of the user access log files and is summarized in
the following steps:
1. Preprocess Web log file to extract user sessions.
2. Cluster the user sessions by using Hierarchical
Unsupervised Niche Clustering (H-UNC) [10].
3. Summarize session clusters/categories into user
profiles.
4. Enrich the user profiles with additional facets by
using additional Web log data and external domain
knowledge.
5. Track current profiles against existing profiles.
3.1 Preprocessing the Web Log File to Extract User
Sessions
The access log of a Web server is a record of all files (URLs)
accessed by users on a Web site. Each log entry consists of
the access time, IP address, URL viewed, REFERRER (the
Web page visited just prior to the current one), etc. The first
step in preprocessing [1], [2] consists of mapping the $N_U$
URLs on a Web site to distinct indices. A user session
consists of requests from the same IP address within a
predefined time period. Each URL in the site is assigned a
unique number $j \in \{1, \ldots, N_U\}$, where $N_U$ is the total number
of valid URLs. The $i$th user session is then encoded as an
$N_U$-dimensional binary attribute vector $\mathbf{s}^{(i)}$ with the
following property:

$$s_j^{(i)} = \begin{cases} 1 & \text{if user } i \text{ accessed URL } j, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
In addition to URLs, we encode the search query terms
from the initial request's REFERRER field and take
advantage of the Power Law properties of session lengths
(with the majority tending to be short) [29] to implement
sessions as lists instead of vectors, thus saving on memory
and computational costs.
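
The session extraction and encoding described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the `(ip, timestamp, url)` layout of the log entries, the function name `build_sessions`, and the 30-minute inactivity cutoff standing in for the "predefined time period" are all assumptions made for illustration. Each session is kept as a sorted list of URL indices rather than as an $N_U$-dimensional binary vector, in the spirit of the list-based representation mentioned above.

```python
def build_sessions(entries, timeout=1800):
    """Group (ip, timestamp_seconds, url) log entries into sessions.

    Returns (sessions, url_index). Each session is a sorted list of URL
    indices, i.e. the positions where the binary vector in (1) equals 1.
    `timeout` (seconds of inactivity that closes a session) is an
    assumed parameter, not a value taken from the paper.
    """
    url_index = {}   # URL -> unique index j in 1..N_U
    last_seen = {}   # ip -> timestamp of that ip's latest request
    current = {}     # ip -> set of URL indices in the open session
    sessions = []
    for ip, ts, url in sorted(entries, key=lambda e: e[1]):
        j = url_index.setdefault(url, len(url_index) + 1)
        # Close the open session of this ip after a long silence.
        if ip in current and ts - last_seen[ip] > timeout:
            sessions.append(sorted(current.pop(ip)))
        current.setdefault(ip, set()).add(j)
        last_seen[ip] = ts
    # Flush the sessions that are still open at the end of the log.
    sessions.extend(sorted(s) for s in current.values())
    return sessions, url_index


log = [
    ("1.1.1.1", 0, "/a"),
    ("2.2.2.2", 30, "/a"),
    ("1.1.1.1", 60, "/b"),
    ("1.1.1.1", 5000, "/c"),  # more than 30 minutes later: new session
]
sessions, url_index = build_sessions(log)
```

Because most sessions are short, storing only the indices of the visited URLs keeps memory proportional to the session length rather than to $N_U$.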
3.2 Clustering Sessions into an Optimal Number of
Categories
To cluster user sessions, we use H-UNC [10], a divisive
hierarchical version of a robust clustering approach (Un-
supervised Niche Clustering (UNC)) [30] that uses a
Genetic Algorithm (GA) [31] to evolve a population of
candidate solutions through generations of competition and
reproduction. The main outline of the H-UNC algorithm is
sketched in the following. The reason that we use H-UNC
instead of other clustering algorithms is that unlike most
other algorithms, H-UNC can handle noise in the data and
automatically determines the number of clusters. In addi-
tion, evolutionary optimization allows the use of any
domain-specific optimization criterion and any similarity
measure, in particular a subjective measure that exploits
domain knowledge or ontologies, as given in (3). However,
unlike purely evolutionary search-based algorithms,
H-UNC combines evolution with local Picard updates to
estimate the scale $\sigma_i$ of each profile, thus converging fast
(about 20 generations). H-UNC is outlined as follows (more
details can be found in [10]):
Fig. 1. Web usage mining process and discovered profile facets.
3.3 Similarity Measure Used in Clustering
The similarity score between an input session $\mathbf{s}$ and the
$i$th profile $\mathbf{p}_i$ can be computed using the cosine similarity as
follows (where $N_U$ is the total number of URLs):

$$S^{\text{cosine}}_{si} = \frac{\sum_{k=1}^{N_U} p_{ik}\, s_k}{\sqrt{\sum_{k=1}^{N_U} p_{ik} \sum_{k=1}^{N_U} s_k}}. \qquad (2)$$
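
As a minimal Python sketch of (2), assume (for illustration only) that a session is held as a set of visited URL indices and a profile as a dictionary from URL index to weight $p_{ik}$; the function name `cosine_similarity` is hypothetical. The code follows (2) as written, which coincides with the usual cosine measure when the profile is binary.

```python
import math


def cosine_similarity(session, profile):
    """Similarity (2) between a binary session and a profile.

    session: set of URL indices k with s_k = 1.
    profile: dict mapping URL index k -> weight p_ik (missing means 0).
    """
    # Numerator: sum of p_ik over the URLs that the session visited.
    num = sum(w for k, w in profile.items() if k in session)
    # Denominator: sqrt(sum_k p_ik * sum_k s_k); sum_k s_k = len(session).
    den = math.sqrt(sum(profile.values()) * len(session))
    return num / den if den > 0 else 0.0


identical = cosine_similarity({1, 2}, {1: 1.0, 2: 1.0})
partial = cosine_similarity({1, 2, 3}, {1: 1.0, 2: 1.0})
disjoint = cosine_similarity({1, 2}, {3: 1.0})
```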
If a hierarchical Web site structure is to be taken into
account, then a modification of the cosine similarity that
we introduced in [1], [2] can be used to yield the following
similarity measure:

$$S^{\text{Web}}_{si} = \max\left\{ \frac{\sum_{l=1}^{N_U} \sum_{k=1}^{N_U} p_{il}\, S_u(l,k)\, s_k}{\sum_{l=1}^{N_U} p_{il} \sum_{k=1}^{N_U} s_k},\; S^{\text{cosine}}_{si} \right\}, \qquad (3)$$
where $S_u(i,j)$ is a URL-to-URL similarity function that is
computed based on the amount of overlap between the
paths $P_i$ and $P_j$ leading from the root of the Web site (the
main page) to any two URLs $i$ and $j$. This is given by

$$S_u(i,j) = \begin{cases} 1 & \text{if } i = j, \\ \min\left(1, \dfrac{|P_i \cap P_j|}{\max\left(1, \max(|P_i|, |P_j|) - 1\right)}\right) & \text{otherwise.} \end{cases} \qquad (4)$$
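
The URL-to-URL similarity in (4) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in this study: a path is taken to be the list of "/"-separated tokens of the URL, its length is the number of tokens, and the overlap $|P_i \cap P_j|$ is taken to be the length of the common prefix of the two token lists. The helper names `url_path` and `url_similarity` are hypothetical.

```python
def url_path(url):
    """Tokenize a URL into its path from the site root: 'a/b/c' -> ['a', 'b', 'c']."""
    return [token for token in url.strip("/").split("/") if token]


def url_similarity(url_i, url_j):
    """URL-to-URL similarity S_u(i, j) following (4)."""
    if url_i == url_j:
        return 1.0
    p_i, p_j = url_path(url_i), url_path(url_j)
    # Overlap: number of leading tokens shared by both paths (assumed
    # reading of |P_i ∩ P_j|).
    overlap = 0
    for a, b in zip(p_i, p_j):
        if a != b:
            break
        overlap += 1
    longest = max(len(p_i), len(p_j))
    return min(1.0, overlap / max(1, longest - 1))


same_branch = url_similarity("/courses/cecs345/syllabus.html",
                             "/courses/cecs401/tests.html")
unrelated = url_similarity("/news/today.html", "/library/doc1.html")
```

Under this token-counting convention, two pages that share only their top-level directory in a three-level path obtain a similarity of 0.5, whereas pages in unrelated branches obtain 0.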
We refer to the special similarity in (3) as the Web
Session Similarity. This Web similarity takes into account
not only the hierarchical structure of the Web site content
as inferred from the URL address itself (for example, URLs
a/b/c and a/b/d are related from the hierarchical
structure aspect) but also how different content items on
the Web site relate to each other according to an externally
defined Web site ontology (for example, URLs pages.aspx?x=30
and pages.aspx?x=40 are semantically related
from an external ontology aspect if these two URLs can be
mapped to A/B and A/C, that is, if they refer to content
areas B and C that share the same parent A). Thus, the
combination of hierarchical site structure and external
ontology occurs naturally in two stages: First, each URL is
parsed to extract the structure, and then, each remaining
dynamic URL (after the first stage) is mapped according to
the ontology, as explained in Section 4. In this case, we use
a simple ontology based on "is-a" relationships (that is, a
taxonomy) between individual dynamic URLs and higher
level categories encoded in a Web site ontology. This
similarity is used in our clustering algorithm (H-UNC) to
group similar user sessions into clusters or profiles. The
URL-to-URL similarities in (4) form a sparse matrix; hence,
only nonzero values are stored. Furthermore, access to
these values is accelerated by hashing the two indices
corresponding to a given pair of URLs. In addition, for the
purpose of clustering, the similarity measure $S^{\text{Web}}$ in (3) is
mapped to a distance $d^{\text{Web}} = (1 - S^{\text{Web}})^2$. Because the
distances $1 - S^{\text{Web}}$ are in [0, 1], squaring them was
found to cause more distinction between smaller and
larger distances and therefore helped delineate clusters
more easily. This distance measure will also be used to
compare and track evolving profiles, as explained in
Section 6.
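
The Web Session Similarity in (3), the sparse storage of URL-to-URL similarities, and the squared distance can be sketched together in Python. The sketch is illustrative only: sessions are assumed to be sets of URL indices, profiles are assumed to be dictionaries from URL index to weight, and the nonzero URL-to-URL similarities are assumed to be precomputed and stored in a dictionary keyed by an ordered pair of indices (a hash-based stand-in for the sparse matrix). All function names are hypothetical.

```python
import math


def make_sparse_url_sim(nonzero):
    """Build S_u(l, k) from a dict {(min_index, max_index): similarity}.

    Only nonzero off-diagonal values are stored; lookups hash the pair
    of indices, and any missing pair is treated as similarity 0.
    """
    def url_sim(l, k):
        if l == k:
            return 1.0
        return nonzero.get((min(l, k), max(l, k)), 0.0)
    return url_sim


def web_session_similarity(session, profile, url_sim):
    """Web Session Similarity following (3)."""
    norm = sum(profile.values()) * len(session)
    if norm == 0:
        return 0.0
    # Structure-aware term: every (profile URL, session URL) pair
    # contributes according to its URL-to-URL similarity.
    cross = sum(w * url_sim(l, k) for l, w in profile.items() for k in session)
    # Plain cosine term of (2) for a binary session.
    cosine = sum(w for l, w in profile.items() if l in session) / math.sqrt(norm)
    return max(cross / norm, cosine)


def web_distance(session, profile, url_sim):
    """Squared distance d = (1 - S_Web)^2 used for clustering."""
    return (1.0 - web_session_similarity(session, profile, url_sim)) ** 2


url_sim = make_sparse_url_sim({(1, 2): 0.5, (1, 3): 0.5, (2, 3): 0.5})
similarity = web_session_similarity({1, 2}, {1: 1.0, 3: 1.0}, url_sim)
distance = web_distance({1, 2}, {1: 1.0, 3: 1.0}, url_sim)
```

In this toy example, the session and the profile share only one URL, so the cosine term is 0.5, while the structure-aware term is 0.625 because the remaining URLs are partially related; the maximum of the two is retained.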
Our approach only implicitly incorporates information
about the Web pages' content. This is different from methods
based on the explicit content of the Web pages, as we infer
this information from the hierarchy knowledge that is
external to the Web logs. Related to our semantic similarity
in (4) are several measures proposed in the past by Resnik
[32], Wu and Palmer [33], and Lin [34] to relate concepts and
by Ziegler et al. [35] for recommendations. A good survey
with extensions in a fuzzy-set theoretic framework can also
be found in [36]. When formulated using our notation above
for path length and path intersection, Wu and Palmer's
similarity [33] can be written as

$$S_{\text{WuPalmer}} = \frac{|P_i \cap P_j|}{\left(|P_i| + |P_j|\right)/2}.$$
Thus, normalization is done by dividing by the average of
the path lengths, whereas our similarity divides by the
maximal length. Hence, our similarity is more restrictive
and penalizes more for widely differing path lengths that
correspond to concepts at widely different levels of
specificity and generality. Resnik's similarity [32] is
defined as the Information Content (IC) of the closest
common ancestor concept $c_3$ of concepts $c_1$ and $c_2$. In our
notation, $c_3$ is located at the end of the intersection (shared)
path between their URL paths $P_1$ and $P_2$. However, IC is
computed based on the log probability of occurrence of the
concept over an entire text corpus's content. Our measure
relies only on superficial URLs in tokenized form (for example,
a/b/c) and not their content. Hence, it is lighter (faster and
easier) to implement. In addition, IC requires a rigid
taxonomy and reliable corpus to be able to accurately
estimate the probabilities. Aside from that, since Web
sessions follow a Power Law distribution [29], the majority
of concepts/URLs are at the long tail of the distribution and
thus have very low probabilities that can limit the
applicability of the IC measure as a pure similarity measure.
Lin [34] extended Resnik's similarity by dividing it by the
sum of the ICs of the concepts $c_1$ and $c_2$, hence suffering
from the same drawbacks of the former measure. We also
note that our similarity measure in (4), first proposed in [2],
[3], has recently been generalized in [37] to be used for
information retrieval within the context of digital libraries.
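
To make the comparison between the two normalizations concrete, the following Python sketch evaluates both measures directly from path lengths and overlap. The function names and the example numbers are hypothetical and chosen only to illustrate the case of widely differing path lengths discussed above.

```python
def path_sim_max(overlap, len_i, len_j):
    """Similarity (4) for distinct URLs: overlap divided by the maximal path length minus 1."""
    return min(1.0, overlap / max(1, max(len_i, len_j) - 1))


def path_sim_wu_palmer(overlap, len_i, len_j):
    """Wu and Palmer's similarity: overlap divided by the average path length."""
    return overlap / ((len_i + len_j) / 2)


# A general concept (path length 2) versus a very specific one (path
# length 8) that lies below it, so the two paths share 2 elements.
ours = path_sim_max(2, 2, 8)
wu_palmer = path_sim_wu_palmer(2, 2, 8)
```

For these values, the maximum-based normalization yields 2/7 (about 0.29), while the average-based normalization yields 0.4, so the former penalizes the difference in specificity more strongly.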
3.4 Postprocessing and Enrichment of Session
Clusters into Multifaceted User Profiles
After automatically grouping sessions into different clusters,
we summarize the session categories in terms of user
profile vectors [3], [4] $\mathbf{p}_i$. The $k$th component/weight of this
vector ($p_{ik}$) captures the relevance of URL$_k$ in the $i$th profile,
as estimated by the conditional probability that URL$_k$ is
accessed in a session belonging to the $i$th cluster (this is the
frequency with which URL$_k$ was accessed in the sessions
belonging to the $i$th cluster). The profiles are then converted
to binary vectors (sets) so that only URLs with weights > 0.15
remain. The model is further extended to a robust profile
[2], [3] based on robust weights ($w_{ij}$) computed in the UNC
algorithm (see Section 3.2) that assign only sessions with
high robust weights (that is, $w_{ij} > w_{\min}$) to a cluster's core.
The core of a profile consists only of sessions that are very
similar to the representative profile. Thus, noisy sessions
are eliminated from the recomputation of profiles. Each
profile $\mathbf{p}_i$ is discovered along with an automatically
determined measure of scale $\sigma_i$ that represents the amount
of variance or dispersion of the user sessions in a given
cluster around the cluster representative (profile). This
measure will later serve an important role in determining
the boundary of each cluster and thus allows us to
automatically determine whether two profiles are compatible
or not.
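
The summarization step can be illustrated with a short Python sketch. It assumes that the sessions of one cluster are available as sets of URL indices and that the robust weights are supplied by the clustering step; the function name `summarize_profile` and the way the weights are passed in are assumptions, and the computation of the robust weights themselves is not reproduced here.

```python
from collections import Counter


def summarize_profile(sessions, weights=None, w_min=0.0, threshold=0.15):
    """Summarize one session cluster into a profile.

    sessions: list of sets of URL indices assigned to the cluster.
    weights:  optional robust weights, one per session; only sessions
              with weight > w_min are kept as the cluster's core.
    Returns (relevance, profile): relevance maps URL k -> p_ik (fraction
    of kept sessions that accessed k), and profile is the set of URLs
    whose relevance exceeds `threshold`.
    """
    if weights is not None:
        sessions = [s for s, w in zip(sessions, weights) if w > w_min]
    if not sessions:
        return {}, set()
    counts = Counter(url for s in sessions for url in s)
    relevance = {url: c / len(sessions) for url, c in counts.items()}
    profile = {url for url, p in relevance.items() if p > threshold}
    return relevance, profile


cluster = [{1, 2}, {1, 3}] + [{1}] * 8
relevance, profile = summarize_profile(cluster)

# Robust variant: the third session has a low weight and is left out.
core_relevance, core_profile = summarize_profile(
    [{1, 2}, {1, 3}, {4}], weights=[0.9, 0.8, 0.1], w_min=0.5)
```

In the first call, URL 1 appears in all 10 sessions, whereas URLs 2 and 3 each appear in only 1 of 10, so only URL 1 passes the 0.15 threshold.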
In addition to the cluster-induced user profiles above, we
were interested in several descriptors of the users in each
cluster. The additional profile descriptors, which we refer to as
facets (Fig. 1), were extracted partly from the Web logs
themselves (from the REFERRER field), partly from external
public information (the www.whois.com Web service), and
partly from domain-specific information (the Web site
content and registration database). In addition to the
viewed Web pages, the profile properties include the
following facets (see Fig. 3 for a real example):
1. Search queries. These are queries submitted to search
engines before visiting the Web site for sessions that
belong to this profile.
2. Inquiring companies. These are companies/organiza-
tions of registered users or unregistered users whose
IP addresses can be mapped.
3. Inquired companies. These are companies/organiza-
tions that have been inquired about during the
sessions belonging to this profile.
Such a rounded representation of a profile gives a
panoramic view of a cluster of Web site visitors that can
help in understanding their interests better and further be
harnessed toward supporting personalization efforts.
3.4.1 Enriching User Profiles with Search Query Terms
(Search Queries)
In addition to the relevant URLs that are extracted from the
sessions assigned to each profile, we can extract information
about the explicit information need of the users in each
profile from the queries that they could have typed prior to
visiting the Web site when this information is available
from the readily available REFERRER field in the Web log
files. Hence, for each profile, we accumulate all the search
phrases extracted from the REFERRER fields of the
assigned user sessions. This allows us to describe each
profile in terms of either a set of significant URLs or a set of
explicit search query phrases and terms.
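
Extracting search phrases from the REFERRER field can be sketched in Python with the standard `urllib.parse` module. The query parameter names in `QUERY_PARAMS`, the example referrer URLs, and the function names are assumptions made for illustration; a real deployment would need the parameter names used by the search engines that actually appear in the logs, and would also check that the referrer host is a search engine.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Assumed names of the URL parameter that carries the search phrase.
QUERY_PARAMS = ("q", "query", "p")


def search_phrase(referrer):
    """Return the search phrase found in a referrer URL, or None."""
    params = parse_qs(urlparse(referrer).query)
    for name in QUERY_PARAMS:
        if name in params:
            return params[name][0].strip().lower()
    return None


def profile_search_queries(referrers):
    """Count the search phrases over the referrers of one profile's sessions."""
    phrases = (search_phrase(r) for r in referrers if r)
    return Counter(p for p in phrases if p)


referrers = [
    "http://www.google.com/search?q=corrosion+inhibitors&hl=en",
    "http://search.example.com/?query=Corrosion%20Inhibitors",
    "-",                             # no referrer recorded
    "http://example.org/page.html",  # referrer without a search query
]
queries = profile_search_queries(referrers)
```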
3.4.2 Enriching User Profiles with Inquiring Company
Information (from Companies)
In addition to the relevant URLs that are extracted from the
sessions assigned to each profile, we can extract information
about which companies or organizations tend to visit the
Web site and fall in this profile. We extract this information
from two complementary sources: 1) by getting the
company information that corresponds to an ID in the
server content Database, where the ID is extracted from the
Web log file in case the visitors register and sign in through
the registration page, or 2) if the visitors did not sign in
through the registration page, then an attempt is made to
obtain the company affiliation from a specialized Web
service (www.whois.com), which can be queried with an IP
address via an API. This allows us to determine not only what
information was found relevant on the Web site but also to whom
it was relevant, which helps support further personalization efforts.
3.4.3 Enriching User Profiles with Queried Company
Information (about Companies)
The Web site under study provides a virtual meeting point
between different companies providing various services
that are related to the portal's subject. Hence, it was
important to know not only which companies take part in
each cluster of activities but also what company information
seemed to be relevant to users in each cluster. For this
reason, in addition to the relevant URLs that are extracted
from the sessions assigned to each profile, we extracted
information about which companies have been inquired
about by visitors in this profile, that is, whenever a user searches
for and clicks on a listed company's contact information on
the Web site (see footnote 1). We parse the identity of the
company from the Web log file and map it to a specific company
via the server content database.
1. Users that search for companies in a certain area, click on a company
name in order to have the Web server return the company specialty area
and contact information.
4 EXPLOITING AN EXTERNAL ONTOLOGY FOR
MAPPING AND RELATING DYNAMIC WEB PAGES
Most of today's Web sites deliver a large number of dynamic
URLs, if not only dynamic URLs. A dynamic URL is a page address
that results from the search of a database-driven Web site or
a Web site that runs a script. Unlike static URLs, in which
the contents of the Web page do not change, dynamic URLs
are typically generated from specific queries to a site's
database. Even though the examples given in the following
discussion consistently use the ASP extension, this exten-
sion can be replaced by any other dynamic URL extension
(such as PHP), without any changes in our generic
approach. Although static Web pages tend to have meaningful
URLs such as /reports/fall_2003/benefits.html, most
dynamic URLs such as /universal.aspx?id=55&codes_id=60
are unfortunately hard to discern or even recognize
based only on their URL. We resolved this issue by
resorting to available external data (see footnote 2) that maps
database contents to a dynamic resource and its parameter values.
The ASP codes in most menus can be mapped during the
preprocessing phase to a parent/child structure by using
external data (excerpt in Table 1), thus mapping URLs to
meaningful hierarchical descriptions.
We illustrate our mapping procedure with the "Regulations
and Laws" page. In the Web log data, this URL is
only recorded as universal.aspx?id=56. Table 2 lists its
content information as "Regulations and Laws" under the
field item_name. Furthermore, its parent (at level
item_level = 0) is the item with code menus_id = 4,939
with label "NST Center&reg", as given by the field
item_name. Hence, this URL is mapped to a semantic label:
"NST Center&reg/Regulations and Laws".
In general, we need to read the parent of each item and
then recursively map a dynamic URL such as
universal.aspx?id=56 to a string consisting of tokens separated by "/",
where tokens are labels[parent-items]. Insertion is done in
reverse order from the end to the start of the final
composed label until we reach the parent at level 0. Both
implicit (URL itself) and explicit (Table 1) taxonomy
information are seamlessly incorporated into the session
clustering via the computation of the special session
similarity measure in (3).
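The mapping from a dynamic URL parameter to a hierarchical label can be sketched as below. This is only an illustration under stated assumptions: the TAXONOMY dictionary, its ids, and the parent link are hypothetical stand-ins for the external data of Tables 1 and 2, and the walk is written iteratively (prepending each parent label) rather than recursively.

```python
# Hypothetical taxonomy rows keyed by item id: (item_name, parent id or None).
# The ids and the parent link are illustrative and are not taken from Table 1.
TAXONOMY = {
    1: ("NST Center&reg", None),       # level-0 parent
    56: ("Regulations and Laws", 1),   # child page, e.g. universal.aspx?id=56
}


def semantic_label(item_id, taxonomy):
    """Compose 'parent/.../item' by walking parent links up to the level-0 item."""
    tokens = []
    seen = set()
    current = item_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle detected in taxonomy data")
        seen.add(current)
        name, parent = taxonomy[current]
        tokens.insert(0, name)  # insert at the front: the label is built end to start
        current = parent
    return "/".join(tokens)
```

With these illustrative rows, semantic_label(56, TAXONOMY) composes the label from the level-0 parent down to the page itself.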
5 TRACKING EVOLVING USER PROFILES
Tracking different profile events across different time
periods can generate a better understanding of the evolu-
tion of user access patterns and seasonality. Note that both
profiles and clickstreams are typically evolving, since the
profiles are nothing more than summaries of the click-
streams, which are themselves evolving. Each profile $p_i$ is
discovered along with an automatically determined measure
of scale $\sigma_i$ that represents the amount of variance or
dispersion of the user sessions in a given cluster around the
cluster representative. This measure is used to determine
the boundary around each cluster (an area located at a
distance $\sigma_i$ from the profile $p_i$) and thus allows us to
automatically determine whether two profiles are compa-
tible. Two profiles are compatible if their boundaries
overlap. The notion of compatibility between profiles is
essential for tracking evolving profiles. After mining the
Web log of a given period, we perform an automated
comparison between all the profiles discovered in the
current batch and the profiles discovered in the previous
batch by a sequence of SQL queries on the profiles that have
been stored in a database, as shown in the TrackProfiles
Algorithm. A typical query for retrieving corresponding
profiles between Periods $T_1$ and $T_1 + 1$ is "SELECT ThisProfile,
TothisProfile FROM ProfileTrail WHERE Period = $T_1$."
We define a profile evolution event as a coarse categorization
of possible real evolution scenarios that describe how
profiles discovered during a certain period relate
to profiles discovered in another period.
208 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 2, FEBRUARY 2008
2. That is, external to the Web logs.
TABLE 1
Partial Taxonomy of a Few Dynamic URLs (Identified by Base URL (url) and Parameter (menus_id))
TABLE 2
Taxonomy Data for the Dynamic URL universal.aspx?id=56
The above comparison process determines which new profiles are
compatible with the old profiles and which new profiles
are incompatible with any previous profile. These last two
cases, respectively, give rise to two kinds of events:
Persistence and Birth. A third event Death arises in case
an old profile does not find a compatible profile from the
new batch. It is also possible to track profile reemergence
in the long term. This is the case of an old profile that
disappears and then reappears when it is found to be
compatible with a new profile in the current batch. This
event is labeled as Atavism. We can visualize the temporal
dynamics of profile birth, persistence, death, and atavism
(rebirth) by labeling the x-axis with the periods corre-
sponding to the different Web log batches that undergo
Web usage mining: period 1, period 2, etc. On the other
hand, the y-axis is used to indicate the profile index: New
profiles are vertically expanded by adding new indices on
top of existing ones. Finally, we generate a plot depicting
the Web site user trend evolution by adding a special
symbol whenever profile y appears in period x and
possibly adding event labels such as Birth, Death, and
Atavism, as these occur. This idea is illustrated in Fig. 2.
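The event labeling described above can be sketched as follows. This is a minimal sketch and not the TrackProfiles Algorithm itself: the compatibility test is passed in as a function, and the jaccard_compatible predicate (with its 0.5 threshold) is a hypothetical stand-in for the boundary overlap test, with profiles simplified to sets of URLs.

```python
def track_events(old_profiles, new_profiles, archived_profiles, compatible):
    """Label new profiles against the previous batch and an archive of dead profiles.

    Each argument is a dict mapping a profile id to a profile.
    Returns (events, deaths): events maps each new profile id to an event label,
    and deaths lists old profile ids with no compatible new profile.
    """
    events = {}
    matched_old = set()
    for new_id, new in new_profiles.items():
        matches = [oid for oid, old in old_profiles.items() if compatible(old, new)]
        if matches:
            events[new_id] = "Persistence"
            matched_old.update(matches)
        elif any(compatible(dead, new) for dead in archived_profiles.values()):
            events[new_id] = "Atavism"  # reappearance of a previously dead profile
        else:
            events[new_id] = "Birth"
    deaths = [oid for oid in old_profiles if oid not in matched_old]
    return events, deaths


def jaccard_compatible(a, b, threshold=0.5):
    """Hypothetical compatibility test on URL sets (assumption, not the paper's test)."""
    union = a | b
    return bool(union) and len(a & b) / len(union) >= threshold
```

In practice, the predicate would compare profiles by using their scale-based boundaries, and the archive would be read from the profile database.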
Note that this tracking takes advantage of a database
management system to accelerate the access to archived
user profiles (which, as a summary, are a negligible
number compared to the original input data). Moreover,
this process is performed offline and only periodically, as a
retrospective analysis of the results of Web usage mining that
helps track the evolution of the user profiles; hence, it adds no
burden to the data mining/clustering itself.
The choice of the basic period length can be
either arbitrary or based on the domain knowledge and
intuition (like whether changes have been made to the
Web site or whether new events related to the Web site
domain may have occurred). In our experiments, we have
chosen periods that varied from one week to one month.
In general, if the periods are too small, then fewer changes
will be detected, as opposed to longer periods. Thus, the
right period length should be determined by trial and
error.
The analysis of profile evolution, as shown in Fig. 2, can
improve our understanding of the user activity trends and
detect seasonality in their access patterns, especially over a
long time span. It also helps in implementing a dynamic
recommendation strategy, for instance, by caching fre-
quently reemerging (atavistic) profiles. Dead profiles can be
relegated to the secondary memory for possible reemer-
gence, whereas persistent profiles can be kept in the
primary memory for fast access and then relegated to the
secondary memory when they die. Similarly, dead profiles
that have been persistent during an earlier period should be
distinguished from dead profiles that have never been
persistent, that is, volatile profiles. Table 3 lists the formal
conditions defining evolution events and potential implica-
tions in a marketing context. The time period T denotes the
basic unit of analysis (for example, one week). We define
the Boolean predicate $Comp(p_i, p_j)$ to be TRUE only if the
profiles are compatible. We denote consecutive time
periods as $T - 1$, $T$, $T + 1$, $T + 2$, etc., with $T$ being the
current period and $P_T$ being the set of mass user profiles
discovered during period $T$. Table 4 shows profile events
determined for real sessions in consecutive periods based
on the discovered profiles in Table 5. Aggregating evolution
events can help track profiles over many periods. For
example, averaging the number of atavisms of a profile over
several periods can summarize its changes.
Fig. 2. Visualization of the profile evolution. A dot in location (x, y)
indicates that profile y was active in time period x.
Fig. 3. Different facets of a cluster profile (inquiring companies were obfuscated for the sake of privacy). Center: pages viewed. Right: search queries
submitted to external engines prior to landing on this Web site. Top and left: inquiring companies that are active in this profile (from registered and
unregistered users, respectively). Bottom: inquired companies about which users in this profile have sought information.
6 A SYSTEMATIC APPROACH TO PROFILE AND
EVOLUTION VALIDATION IN AN INFORMATION
RETRIEVAL CONTEXT
In this paper, we view the discovered profiles as frequent
patterns that provide one way of forming a summary of the
input data. As a summary, profiles represent a reduced
form of the data that is, at the same time, as close as possible
to the original input data. This description is reminiscent of
an information retrieval scenario in the sense that profiles
that are retrieved should be as close as possible to the
original session data. Closeness should take both of the
following into account:
1. Precision. A summary profile's items are all correct or
included in the original input data; that is, they
include only the true data items.
2. Coverage/Recall. A summary profile's items are
complete compared to the data that is summarized;
that is, they include all the data items.
These criteria are clearly contradictory, since precision will
favor only the smallest profiles, eventually with a single
URL, whereas coverage will favor the largest possible
profiles. Ideally, each data query should be answered by a
profile that is identical to this query. However, this is
unrealistic, since it requires the profile summary to be
identical to the entire input database. Therefore, the
summary should consist of the smallest number of profiles
that are as similar as possible to the input data. Our
validation procedure [38] attempts to answer the following
questions:
1. Is the data set completely summarized/represented
by the mined profiles/patterns?
2. Is the data set faithfully/accurately summarized/
represented by the mined profiles/patterns?
Question 1 is answered by computing coverage/recall, and
question 2 is answered by computing precision, each as a
quality or interestingness measure $Q_{ij}$. First, we
compute these measures for each discovered profile, letting
$Q_{ij} = Cov_{ij}$ (that is, coverage) to answer question 1 and
$Q_{ij} = Prec_{ij}$ (that is, precision) to answer question 2. Here,
coverage and precision for a discovered mass profile $p_i$ as a
summary of an input session $s_j$ are given by
$Cov_{ij} = |s_j \cap p_i| / |s_j|$ and $Prec_{ij} = |s_j \cap p_i| / |p_i|$.
A combined measure of precision and coverage is given by the $F_1$
information retrieval metric $Q_{ij} = F_{1,ij}$, which
simultaneously answers questions 1 and 2 and is given by
\[ F_{1,ij} = \frac{2 \, Prec_{ij} \cdot Cov_{ij}}{Prec_{ij} + Cov_{ij}}. \]
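The three per-pair measures can be sketched as follows. This is a minimal illustration that assumes that both a session and a profile are represented as sets of URLs (a simplification of the profile vectors); the function names are ours.

```python
def coverage(session, profile):
    """Cov_ij = |s_j ∩ p_i| / |s_j| for a session and a profile given as URL sets."""
    return len(session & profile) / len(session)


def precision(session, profile):
    """Prec_ij = |s_j ∩ p_i| / |p_i|."""
    return len(session & profile) / len(profile)


def f1(session, profile):
    """F1_ij = 2 * Prec * Cov / (Prec + Cov), taken as 0 when there is no overlap."""
    prec = precision(session, profile)
    cov = coverage(session, profile)
    if prec + cov == 0:
        return 0.0
    return 2 * prec * cov / (prec + cov)
```

For example, a session with four URLs and a profile with three URLs that share two URLs give a coverage of 2/4, a precision of 2/3, and hence $F_1 = 4/7$.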
Now, let
\[ S^{*}_{T_1,T_2} = \{ s_j \in S_{T_1} \mid \max_{i \mid p_i \in P_{T_2}} (Q_{ij}) \geq Q_{min} \} \]
be the subset of input user sessions $S_{T_1}$, logged during
period $T_1$, that are summarized by any of the user profiles
$P_{T_2}$, discovered at period $T_2$, with a quality level higher than
a given minimum quality threshold $Q_{min}$. Then, we can
capture the concept drift by the decline of the metric defined
as follows, particularly when $T_2$ occurs earlier than $T_1$:
TABLE 3
Profile Evolution Events and Interpretation within a Marketing Context
TABLE 4
Profile Evolution Corresponding to Table 5 for June through September 2004
\[ Q(S_{T_1}, P_{T_2}) = \frac{|S^{*}_{T_1,T_2}|}{|S_{T_1}|}, \tag{5} \]
where $|\cdot|$ denotes the cardinality of a set. When $Q_{ij} = Cov_{ij}$,
we call $Q(S_{T_1}, P_{T_2})$ the Cumulative Coverage of sessions, and it
answers question 1. When $Q_{ij} = Prec_{ij}$, we call $Q(S_{T_1}, P_{T_2})$
the Cumulative Precision of sessions, and it answers question 2.
Both questions are answered when $Q_{ij} = F_{1,ij}$. The above
measures are computed for the different techniques over
the entire range of the quality threshold $Q_{min}$ from 0 percent
to 100 percent, in increments of 10 percent, and compared
as shown in Fig. 4. The $F_1$ values represent the match
between the output profiles and the input sessions, with a
perfect score of 1 when a profile matches a session perfectly,
without a missing URL (which would reduce the recall) or an
extra/spurious URL (which would reduce the precision).
Lower scores mean that the profile matches the input session
less perfectly. For example, obtaining 0.9 in $F_1$ for
22 percent of the sessions means that 22 percent of the
input sessions can be matched to a profile with an $F_1$ score
$\geq 0.9$. This validation process tries to predict, in advance,
how the discovered profiles would fare if they were used as
part of a nearest profile collaborative filtering recommender
system, as long as the user activity remains similar to the
activity during the time when the profiles were mined from
the Web logs. This validation strategy can both serve 1) to
evaluate the quality of the data mining under different
scenarios and 2) to test how current a set of previously
discovered profiles remains as new sessions arrive, that is,
to test for the concept drift or evolution of the user activities.
We will illustrate how we used the above validation
procedure in Section 7. The above definition of the quality
of discovered profiles (5) can also be interpreted from a
probabilistic point of view as the probability that the quality
with which a discovered profile at period $T_2$ summarizes any
input session from period $T_1$ is higher than a minimum level
$Q_{min}$. Finally, it is important to note that our evaluation
targets the quality of the mass user profiles as a summary of
the input sessions and is not necessarily relevant within the
framework of a recommendation system. In other words,
what we are interested in is a fast prediagnostic evaluation of
the quality of the data mining task and not the quality of
personalization. Our validation is more analogous to the
validation of the results of clustering algorithms within an
unsupervised learning framework. In addition, in our current
context, this evaluation should constitute the last phase of the
multistage process of Knowledge Discovery in Data (KDD) and
not the evaluation of a recommendation system. This is
because the latter is a decision-making task that can
significantly be supported by the former but is nonetheless
separate from, and not necessarily equivalent to, the KDD process.
TABLE 5
Illustration of Tracking of Evolving User Profiles (a Subset of Table 4 Obtained by Using the TrackProfiles Algorithm)
with Several Example Profiles (Underlined and Boldface Items Indicate Information Inferred from the Web Site Ontology
of What Would Otherwise Appear as Encoded Dynamic URLs)
Profiles in the same row are found to be compatible based on their similarity compared to the profile boundaries. The Birth of a new profile can be
seen when the cells preceding it on the same row are empty (for example, profile 8), whereas Death is when the cells following it are empty (for
example, profile 2). Atavism is when two profiles on the same row are separated by some empty cells (for example, profile 5). Persistent profiles
show activity in contiguous cells on the same row. In this case, notice how some profiles can be split into more specific profiles in subsequent
periods, and vice versa, some profiles seem to merge into one more general profile with time (for example, profiles 2 and 3).
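The cumulative measure in (5), evaluated over the range of thresholds, can be sketched as follows. This is an illustration under stated assumptions: sessions and profiles are simplified to sets of URLs, the $F_1$ measure is used as the default quality, and the function names are ours.

```python
def f1(session, profile):
    """F1 between a session and a profile, both given as sets of URLs."""
    if not session or not profile:
        return 0.0
    common = len(session & profile)
    if common == 0:
        return 0.0
    prec = common / len(profile)
    cov = common / len(session)
    return 2 * prec * cov / (prec + cov)


def cumulative_quality(sessions, profiles, q_min, quality=f1):
    """Fraction of sessions whose best-matching profile reaches quality >= q_min."""
    if not sessions:
        return 0.0
    hits = sum(
        1
        for s in sessions
        if max((quality(s, p) for p in profiles), default=0.0) >= q_min
    )
    return hits / len(sessions)


def quality_curve(sessions, profiles, quality=f1):
    """Evaluate the cumulative quality for Q_min = 0.0, 0.1, ..., 1.0."""
    return [(k / 10, cumulative_quality(sessions, profiles, k / 10, quality))
            for k in range(11)]
```

Passing sessions from one period and profiles from an earlier period gives a cross-period curve; a drop relative to the same-period curve would suggest that the profiles have become outdated.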
7 EXPERIMENTAL RESULTS AND VALIDATION OF
PROFILE DISCOVERY
7.1 Results of Mining Multifaceted User Profiles
from Web Usage Data
H-UNC [10] was applied on a set of Web sessions
preprocessed from Web log data for several months in
2004 and 2005. After filtering out irrelevant entries, the data
was segmented into sessions based on the client IP address
and a time-out threshold between two consecutive accesses
in the same session of 45 minutes. After filtering irrelevant
URLs (for example, graphics) and requests from Web
crawlers, we get a number of unique sessions varying from
800 to 1,500 per week, accessing between 3,000 and 6,000 URLs
(not counting graphics). For the studied periods, H-UNC
was applied to the Web sessions by using a maximal
number of levels $L = 10$ in the hierarchy and the following
parameters that control the final resolution [10]: $N_{split} = 30$
and $\sigma_{split} = 0.01$. H-UNC partitioned the Web user sessions
of each period into several clusters (that ranged from 20 to
35 clusters, depending on the period), and each cluster was
characterized by one of the profile vectors $p_i$. We show one
profile with all of its facets combined in Fig. 3, including the
viewed pages (at the center), the search queries (on the
right), the inquiring companies that are on the left and at
the top (from companies), and the inquired companies at
the bottom (about companies).
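The segmentation of log entries into sessions by client IP address and a 45-minute time-out can be sketched as follows. This is a minimal sketch, not the preprocessing code used in the experiments: the (ip, timestamp, url) tuple layout is an assumption, and a gap strictly greater than the time-out is assumed to start a new session.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=45)


def sessionize(hits, timeout=TIMEOUT):
    """Group (ip, timestamp, url) hits into sessions per client IP address.

    A new session starts when the gap since the previous hit from the same IP
    exceeds the time-out. Returns a list of (ip, [urls]) in order of first hit.
    """
    sessions = []
    last_seen = {}  # ip -> (timestamp of last hit, url list of the open session)
    for ip, ts, url in sorted(hits, key=lambda h: h[1]):
        previous = last_seen.get(ip)
        if previous is None or ts - previous[0] > timeout:
            urls = []
            sessions.append((ip, urls))
        else:
            urls = previous[1]
        urls.append(url)
        last_seen[ip] = (ts, urls)
    return sessions
```

Filtering of graphics and crawler requests would be applied before this step, so that only relevant URLs enter the sessions.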
7.2 Results of Tracking Evolving Access Patterns
Table 4 illustrates the results of the automated profile
tracking and comparison process from June to September
2004 (the other months are omitted because of the lack of
space). The detailed contents of the profiles are again
included in Table 5. Some months were too large for
processing in one batch; therefore, they were divided into
halves. For these months, such as August, we use "August
2004 Part 1" and "August 2004 Part 2" to indicate the first
half and second half of August 2004, respectively. The
analysis of profile evolution (Table 4) can improve our
understanding of the user activity trends and detect
seasonality in their access patterns, especially over a long
time span. We note that the profiles that are discovered
correspond to compact clusters, as can be judged by their
scales $\sigma_i$, which are typically very small (on the
order of 0.1), and also because the sessions assigned to each
profile tend to have similar access patterns. They are also
sufficiently distinct, as can be judged by their different URL
components, and also because the sessions assigned to
different profiles exhibit different access patterns.
7.3 Validation of the Discovered Profiles under the
Effect of User Behavior Evolution
We use the validation procedure described in Section 6 to
predict how a previously discovered set of profiles fares
with the arrival of newer sessions, that is, to test for the
evolution of the user activities. First, we illustrate how we
used the validation procedure by testing how well the
mined profiles of a given period perform against the
sessions from the same period (these light shaded curves
are all shown in Fig. 4). Then, we perform a cross-period
validation that evaluates how well the mined profiles of an
earlier period perform against the sessions from a later
period (these dark curves are all shown in Fig. 4). These
results can help gauge the quality of the mined profiles
from the freshness point of view. When the quality is seen
to deteriorate, this sends a signal that it may be necessary to
mine an updated set of user profiles. Hence, this validation
can form a systematic way of guiding the Web usage
mining process as the user access patterns evolve with time.
To illustrate this procedure, we performed the validation
based on the combined $F_1$ measure by testing how accurately the
profiles discovered from an earlier period still summarize
the user sessions in a new time period. To do this, we first
validate the profiles of the first two weeks of September
2005 against the user sessions from the same period. This is
depicted using the light shaded (red) curve in Figs. 4a, 4b, and 4c.
Fig. 4 shows the results of the validation procedure by using
$Q(S_{T_1}, P_{T_2})$, as defined in (5), which is used to verify the
quality of the discovered profiles. The light shaded curve is
obtained with the same period for the input user sessions and
the discovered profiles ($T_1 = T_2$). The curve shows the percentage of
sessions (relative to the entire Web log of a specific period)
that can be retrieved with a given minimal quality level
by one of the discovered profiles, where the minimal
quality level $Q_{min}$ is varied from 0 percent to 100 percent.
The quality measure shown is the $F_1$ score. By reading
the y-axis for a given minimal $F_1$ quality level on the x-axis,
we can obtain the percentage of sessions that achieve a
given $F_1$ quality level with at least one of the discovered
profiles. For instance, in Fig. 4a, the light shaded curve (for
the pure/same period validation of the mined profiles
against the input Web sessions of the first two weeks of
September 2005) shows that 33 percent of the sessions
achieve $F_1 \geq 0.5$ (this corresponds to the point (0.5, 0.33) on
the graph), whereas 22 percent of the sessions achieve
$F_1 \geq 0.9$.
Fig. 4. $Q(S_{T_1}, P_{T_2})$: Comparison of $F_1$-based pure (same period) validation of profiles from $T_2$ = first two weeks of September 2005 against
sessions from the same period, that is, $T_1$ = first two weeks of September 2005 (light shaded) with, respectively, the cross-period validations that
compare the performance of user profiles of period $T_2$ = first two weeks of September 2005 against the sessions from the following later periods
(dark) showing varying levels of deterioration of performance with the lowest for a gap of two periods in between old profiles and new input sessions,
that is, for the first two weeks of October 2005. (a) $T_1$ = last two weeks of September 2005. (b) $T_1$ = first two weeks of October 2005. (c) $T_1$ = last
two weeks of October 2005.
7.3.1 Cross-Period Validation for
Batch Period = Two Weeks
The data was partitioned by date, with each partition
consisting of user sessions for two weeks. Clustering was
carried out on the first partition consisting of data from the
first two weeks of September, and the $F_1$ measure was
calculated using the remaining partitions. These cross
validations are shown in Figs. 4a, 4b, and 4c, respectively,
by the dark curves. For instance, the dark curve in Fig. 4a
is obtained for the cross validation of the profiles of period
$T_2$ = first two weeks of September 2005 against the input
data/Web sessions of the period $T_1$ = last two weeks of
September 2005. This curve shows that 29 percent of the
sessions achieve $F_1 \geq 0.5$, whereas close to 20 percent of
the sessions achieve $F_1 \geq 0.9$. Here, we notice that the
quality of the same old profiles in summarizing more
recent input user sessions deteriorates to varying degrees,
which is not necessarily correlated with the number of
gaps between the earlier and later periods but is more
reasonably due to how different the user sessions in these
different periods are. Fig. 5 shows a similar comparison
between pure (same later period) validation and cross-period
validations, where the profiles from an earlier
period are validated against the sessions from the
immediately following period (that is, the gap between
the profiles and the input sessions is one period). Figs. 5a,
5b, and 5c, respectively, show this comparison when the
earlier period was the first two weeks of September 2005,
then the last two weeks of September 2005, and, finally,
the first two weeks of October 2005. In this case, the
deterioration happens to be the most significant for the
transition between the first two weeks and the last
two weeks of September 2005 (Fig. 5a). On the other hand, there is no
significant change between the first and last halves of
October 2005 (Fig. 5c), signaling that there is no significant
advantage in mining the later batch of data.
7.3.2 Cross-Period Validation for
Batch Period = One Week
We repeated the entire mining and validation procedure
above by using a smaller batch period of one week instead
of two weeks. Fig. 6 shows a comparison between the pure
(same period) validation and cross-period validation, where
the profiles from an earlier period in September 2005 are
validated against the sessions from a later period in
September 2005 (the gap between the profiles and the
sessions is one period in Figs. 6a and 6b, and the gap is two
periods in Fig. 6c). We notice that for a gap of one week,
there is no significant change between the first and second
weeks. However, there is a change between the second and
third weeks. We also note the biggest change for a gap of
two weeks between the profile discovery and the input
sessions, which happens between the first and third weeks.
8 CONCLUSION
We presented a framework for mining, tracking, and
validating evolving multifaceted user profiles on Web sites
that have all the challenging aspects of real-life Web usage
mining, including evolving user profiles and access
patterns, dynamic Web pages, and external data describing
an ontology of the Web content. A multifaceted user profile
summarizes a group of users with similar access activities
and consists of their viewed pages, search engine queries,
and inquiring and inquired companies. The choice of the
period length for analysis depends on the application or can
be set, depending on the cross-period validation results.
Even though we did not focus on scalability, the latter can
be addressed by following an approach similar to [11],
where Web clickstreams are considered as an evolving data
stream, or by mapping some new sessions to persistent
profiles and updating these profiles, hence eliminating most
sessions from further analysis and focusing the mining on
truly new sessions.
Fig. 5. $Q(S_{T_1}, P_{T_2})$: Comparison of $F_1$-based pure (same period) validation of new (later) profiles against new sessions from the same period (light
shaded) with the cross-period validation (dark) that compares the performance of older user profiles against the newer sessions.
(a) Earlier = first two weeks of September 2005, and later = last two weeks of September 2005. (b) Earlier = last two weeks of September 2005,
and later = first two weeks of October 2005. (c) Earlier = first two weeks of October 2005, and later = last two weeks of October 2005.
ACKNOWLEDGMENTS
This research was supported by the US National Science
Foundation CAREER Award IIS-0133948 to O. Nasraoui.
Partial support was also provided by a grant from
Innovative Productivity Inc. A preliminary version of this
paper was presented at the International Workshop on
Customer Relationship Management: Data Mining Meets
Marketing that was held in New York on 18-19 November
2005.
REFERENCES
[1] R. Cooley, B. Mobasher, and J. Srivastava, Web Mining:
Information and Pattern Discovery on the World Wide Web,
Proc. Ninth IEEE Intl Conf. Tools with AI (ICTAI 97), pp. 558-567,
1997.
[2] O. Nasraoui, R. Krishnapuram, and A. Joshi, Mining Web Access
Logs Using a Relational Clustering Algorithm Based on a Robust
Estimator, Proc. Eighth Intl World Wide Web Conf. (WWW 99),
pp. 40-41, 1999.
[3] O. Nasraoui, R. Krishnapuram, H. Frigui, and A. Joshi, Extract-
ing Web User Profiles Using Relational Competitive Fuzzy
Clustering, Intl J. Artificial Intelligence Tools, vol. 9, no. 4,
pp. 509-526, 2000.
[4] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, Web
Usage Mining: Discovery and Applications of Usage Patterns
from Web Data, SIGKDD Explorations, vol. 1, no. 2, pp. 1-12, Jan.
2000.
[5] M. Spiliopoulou and L.C. Faulstich, WUM: A Web Utilization
Miner, Proc. First Intl Workshop Web and Databases (WebDB 98),
1998.
[6] T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal, From User
Access Patterns to Dynamic Hypertext Linking, Proc. Fifth Intl
World Wide Web Conf. (WWW 96), 1996.
[7] M. Perkowitz and O. Etzioni, Adaptive Web Sites: Automatically
Learning for User Access Pattern, Proc. Sixth Intl WWW Conf.
(WWW 97), 1997.
[8] J. Borges and M. Levene, Data Mining of User Navigation
Patterns, Web Usage Analysis and User Profiling, LNCS,
H.A. Abbass, R.A. Sarker, and C.S. Newton, eds. pp. 92-111,
Springer-Verlag, 1999.
[9] O. Zaiane, M. Xin, and J. Han, Discovering Web Access Patterns
and Trends by Applying OLAP and Data Mining Technology on
Web Logs, Proc. Advances in Digital Libraries (ADL 98), pp. 19-29,
1998.
[10] O. Nasraoui and R. Krishnapuram, A New Evolutionary
Approach to Web Usage and Context Sensitive Associations
Mining, Intl J. Computational Intelligence and Applications, special
issue on Internet intelligent systems, vol. 2, no. 3, pp. 339-348,
Sept. 2002.
[11] O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez, Mining
Evolving User Profiles in Noisy Web Clickstream Data with a
Scalable Immune System Clustering Algorithm, Proc. Workshop
Web Mining as a Premise to Effective and Intelligent Web Applications
(WebKDD 03), pp. 71-81, Aug. 2003.
[12] P. Desikan and J. Srivastava, Mining Temporally Evolving
Graphs, Proc. Workshop Web Mining and Web Usage Analysis
(WebKDD 04), 2004.
[13] O. Nasraoui, C. Rojas, and C. Cardona, A Framework for Mining
Evolving Trends in Web Data Streams Using Dynamic Learning
and Retrospective Validation, Computer Networks, special issue
on Web dynamics, vol. 50, no. 14, Oct. 2006.
[14] M.A. Maloof and R.S. Michalski, Learning Evolving Concepts
Using Partial Memory Approach, Working Notes AAAI Fall Symp.
Active Learning 1995, pp. 70-73, 1995.
[15] M.A. Maloof and R.S. Michalski, Selecting Examples for Partial
Memory Learning, Machine Learning, vol. 41, no. 11, pp. 27-52,
2000.
[16] T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D.
Zabowski, Experience with a Learning Personal Assistant,
Comm. ACM, vol. 37, no. 7, pp. 80-91, 1994.
[17] D. Billsus and M.J. Pazzani, A Hybrid User Model for News
Classification, Proc. Seventh Intl Conf. User Modeling (UM 99),
J. Kay, ed., pp. 99-108, 1999.
[18] I. Grabtree and S. Soltysiak, Identifying and Tracking Changing
Interests, Intl J. Digital Libraries, vol. 2, pp. 38-53,
[19] J. Schlimmer and R. Granger, Incremental Learning from Noisy
Data, Machine Learning, vol. 1, no. 3, pp. 317-357, 1986.
[20] G. Widmer and M. Kubat, "Learning in the Presence of Concept
Drift and Hidden Contexts," Machine Learning, vol. 23, pp. 69-101,
1996.
[21] I. Koychev, "Gradual Forgetting for Adaptation to Concept Drift,"
Proc. ECAI Workshop Current Issues in Spatio-Temporal Reasoning
'00, pp. 101-106, 2000.
[22] B. Mobasher, H. Dai, T. Luo, Y. Sung, and J. Zhu, "Integrating
Web Usage and Content Mining for More Effective Personalization,"
Proc. Int'l Conf. e-Commerce and Web Technologies (ECWeb '00),
Sept. 2000.
Fig. 6. QS_{T1}; P_{T2}: Validation and cross-validation for a different mining period (one week), that is, comparison of the F_1-based pure (same-period)
validation of new (later) profiles against new sessions from the same period (light shaded) with the cross-period validation (dark) that compares the
performance of older user profiles against the newer sessions. (a) Earlier: week 1 of September 2005; later: week 2 of September
2005. (b) Earlier: week 2 of September 2005; later: week 3 of September 2005. (c) Earlier: week 1 of September 2005; later: week 3 of
September 2005.
[23] R. Srikant and R. Agrawal, "Mining Generalized Association
Rules," Proc. 21st Int'l Conf. Very Large Data Bases (VLDB '95), Sept.
1995.
[24] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan, "Using
Taxonomy, Discriminants, and Signatures for Navigation in Text
Databases," Proc. 23rd Int'l Conf. Very Large Data Bases (VLDB '97),
1997.
[25] H. Dai and B. Mobasher, "Using Ontologies to Discover Domain-Level
Web Usage Profiles," Proc. Second ECML/PKDD Semantic
Web Mining Workshop, 2002.
[26] D. Oberle, B. Berendt, A. Hotho, and J. Gonzalez, "Conceptual
User Tracking," Proc. First Int'l Atlantic Web Intelligence Conf.
(AWIC '03), 2003.
[27] B. Berendt and M. Spiliopoulou, "Analysis of Navigational
Behavior in Web Sites Integrating Multiple Information Systems,"
VLDB J., vol. 9, no. 1, pp. 56-75, 2000.
[28] M. Eirinaki, H. Lampos, M. Vazirgiannis, and I. Varlamis,
"SEWeP: Using Site Semantics and a Taxonomy to Enhance the
Web Personalization Process," Proc. ACM SIGKDD '03, Aug. 2003.
[29] M. Levene, J. Borges, and G. Loizou, "Zipf's Law for Web
Surfers," Knowledge and Information Systems, vol. 3, no. 1,
pp. 120-129, Feb. 2001.
[30] O. Nasraoui and R. Krishnapuram, "A Novel Approach to
Unsupervised Robust Clustering Using Genetic Niching," Proc.
Ninth IEEE Int'l Conf. Fuzzy Systems (FUZZ '00), pp. 170-175, May
2000.
[31] J.H. Holland, Adaptation in Natural and Artificial Systems. MIT
Press, 1975.
[32] P. Resnik, "Semantic Similarity in a Taxonomy: An Information-Based
Measure and Its Application to Problems of Ambiguity in
Natural Language," J. Artificial Intelligence Research, vol. 11,
pp. 95-130, 1999.
[33] Z. Wu and M. Palmer, "Verb Semantics and Lexical Selection,"
Proc. 32nd Ann. Meeting of the Assoc. Computational Linguistics,
pp. 133-138, June 1994.
[34] D. Lin, "An Information-Theoretic Definition of Similarity," Proc.
15th Int'l Conf. Machine Learning (ICML '98), 1998.
[35] C. Ziegler, G. Lausen, and L. Schmidt-Thieme, "Taxonomy-Driven
Computation of Product Recommendations," Proc. 13th ACM
Conf. Information and Knowledge Management (CIKM '04),
pp. 406-415, 2004.
[36] V. Cross, "Fuzzy Semantic Distance Measures between Ontological
Concepts," Proc. Ann. Meeting North Am. Fuzzy Information
Processing Soc. (NAFIPS '04), pp. 392-397, June 2004.
[37] P. Ganesan, H. Garcia-Molina, and J. Widom, "Exploiting
Hierarchical Domain Structure to Compute Similarity," ACM
Trans. Information Systems, vol. 21, no. 1, pp. 64-93, 2003.
[38] O. Nasraoui and S. Goswami, "Mining and Validating Localized
Frequent Itemsets with Dynamic Tolerance," Proc. Sixth SIAM Int'l
Conf. Data Mining (SDM '06), pp. 578-582, Apr. 2006.
Olfa Nasraoui received the PhD degree in computer engineering and computer science from the University of Missouri, Columbia, in 1999. From 2000 to 2004, she was an assistant professor at the University of Memphis. She is currently the director of the Knowledge Discovery and Web Mining Laboratory, University of Louisville, where she is also an associate professor of computer engineering and computer science and the Endowed Chair of e-Commerce. Her research interests include Web mining and stream data mining. She is a member of the IEEE and the ACM. She is a recipient of a US National Science Foundation Faculty Early Career Development (CAREER) Award.
Maha Soliman received the PhD degree in computer science and computer engineering from the University of Louisville in 2004. She joined the Knowledge Discovery and Web Mining Laboratory, University of Louisville, in 2005 as a postdoctoral researcher, and, in 2006, she joined the Kentucky Biomedical Laboratory, University of Louisville. Her research interests include distributed data mining, Web mining, and bioinformatics. She is a member of the IEEE.
Esin Saka received the BS and MS degrees (with a double major in mathematics) in computer engineering from the Middle East Technical University (METU), Turkey. She is currently working toward the PhD degree in the Department of Computer Engineering and Computer Science at the University of Louisville. She was with the Scientific and Technical Research Council of Turkey (TÜBİTAK) as a researcher, and, in the summer of 2007, was with Yahoo! Search Marketing as an intern. Her research interests include Web mining, multiagent systems, and genetic programming. She is a member of the IEEE.
Antonio Badia received the PhD degree in computer science from Indiana University in 1997. He is currently an associate professor and the founding director of the Database Laboratory, Department of Computer Science and Computer Engineering, University of Louisville. His research interests include query processing and optimization, extended query languages, integration of documents in databases, and data mining and its applications to counterterrorism. He is a member of the IEEE and the ACM. He received a US National Science Foundation Faculty Early Career Development (CAREER) Award.
Richard Germain received the PhD degree in marketing from Michigan State University in 1989. He is the author of three books and several articles, mostly in supply chain management, which have appeared in the Strategic Management Journal, the Decision Sciences Journal, the Journal of Business Logistics, the Journal of Marketing Research, the Journal of International Business Studies, and Academy of Marketing Science, among others.