1 INTRODUCTION
Customer Relationship Management (CRM) can use data from within and outside an organization to allow an understanding of its customers on an individual basis or on a group basis, such as by forming customer profiles. An improved understanding of the customers' habits, needs, and interests can allow the business to profit by, for instance, cross-selling or selling items related to the ones that the customer wants to purchase. Hence, reliable knowledge about the customers' preferences and needs forms the basis for effective CRM. As businesses move online, the competition between businesses to keep the loyalty of their old customers and to lure new customers is even more important, since a competitor's Web site may be only one click away. The fast pace and large amounts of
data available in these online settings have recently made it
imperative to use automated data mining or knowledge
discovery techniques to discover Web user profiles. These
different modes of usage or the so-called mass user profiles
can be discovered using Web usage mining techniques that
can automatically extract frequent access patterns from the
history of previous user clickstreams stored in Web log files.
These profiles can later be harnessed toward personalizing
the Web site to the user or to support targeted marketing.
Although there have been considerable advances in Web usage mining, there have been no detailed studies presenting a fully integrated approach to mining a real Web site with the challenging characteristics of today's Web sites, such as evolving profiles, dynamic content, and the availability of a taxonomy or databases in addition to Web logs.
In this paper, we present a complete framework and a
summary of our experience in mining Web usage patterns
with real-world challenges such as evolving access
patterns, dynamic pages, and external data describing an
ontology of the Web content and how it relates to the business actors (in the case of the studied Web site, the companies, contractors, consultants, etc., in the corrosion industry). The
Web site in this study is a portal that provides access to
news, events, resources, company information (such as
companies or contractors supplying related products and
services), and a library of technical and regulatory
documentation related to corrosion and surface treatment.
The portal also offers a virtual meeting place between
companies or organizations seeking information about
other companies or organizations. Without loss of generality, in the rest of this paper, we will refer to all the Web site participants (organizations, contractors, consultants, agencies, corporations, centers, etc.) simply as companies. The Web site in our study is managed by a
nonprofit organization that does not sell anything but only
provides free information that is ideally complete, accurate, and up to date. Hence, it was crucial to understand
the different modes of usage and to know what kind of
information the visitors seek and read on the Web site and
how this information evolves with time. For this reason,
we perform clustering of the user sessions extracted from
the Web logs to partition the users into several homo-
geneous groups with similar activities and then extract
user profiles from each cluster as a set of relevant URLs.
This procedure is repeated in subsequent new periods of
Web logging (such as biweekly), then the previously
202 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 2, FEBRUARY 2008
. O. Nasraoui, M. Soliman, E. Saka, and A. Badia are with the Department
of Computer Engineering and Computer Science, Speed School of
Engineering, University of Louisville, Louisville, KY 40292.
E-mail: {olfa.nasraoui, masoli01, esin.saka, abadia}@louisville.edu.
. R. Germain is with the College of Business, University of Louisville,
154 College of Business, Louisville, KY 40292.
E-mail: richard.germain@louisville.edu.
Manuscript received 21 Feb. 2006; revised 12 Oct. 2006; accepted 10 Aug.
2007; published online 4 Sept. 2007.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-0088-0206.
Digital Object Identifier no. 10.1109/TKDE.2007.190667.
1041-4347/08/$25.00 2008 IEEE Published by the IEEE Computer Society
discovered user profiles are tracked, and their evolution
pattern is categorized. When clustering the user sessions,
we exploit the Web site hierarchy to give partial weights in
the session similarity between URLs that are distinct and
yet located closer together on this hierarchy. The Web site
hierarchy is inferred both from the URL address and from
a Web site database that organizes most of the dynamic
URLs along an is-a ontology of items. We also enrich the cluster profiles with various facets, including the search queries submitted just before landing on the Web site, as well as the inquiring and inquired companies, for the case where users from (inquiring) companies inquire about any of the (inquired) companies listed on the Web site that provide related services.
The rest of this paper is organized as follows: In Section 2,
we present an overview of Web usage mining, in particular
advances involving semantics and profile evolution. In
Section 3, we describe our approach to profile discovery
using Web usage mining. In Section 4, we discuss our
approach for handling dynamic content and exploiting
external data that describes an ontology of the Web content
derived from the server database. In Section 5, we discuss
our approach for tracking evolving user profiles. In Section 6,
we present a systematic and objective validation strategy for
the discovered user profiles. In Section 7, we present our
results in mining evolving user profiles. Finally, in Section 8,
we present our conclusions.
2 AN OVERVIEW OF WEB USAGE MINING
Recently, data mining techniques have been applied to
extract usage patterns from Web log data [1], [2], [3], [4], [5].
This process, known as Web usage mining, is traditionally
performed in several stages [1], [3] to achieve its goals:
1. collection of Web data such as activities/clickstreams
recorded in Web server logs,
2. preprocessing of Web data, such as filtering crawlers' requests and requests for graphics, and identifying unique sessions,
3. analysis of Web data, also known as Web Usage
Mining [4], to discover interesting usage patterns or
profiles, and
4. interpretation/evaluation of the discovered profiles.
In this paper, we further added a fifth step after a
repetitive application of steps 1-4 on multiple time
periods, i.e.,
5. tracking the evolution of the discovered profiles.
Web usage mining can use various data mining or
machine learning techniques to model and understand Web
user activity. In [6], clustering was used to segment user
sessions into clusters or profiles that can later form the basis
for personalization. In [7], the notion of an adaptive Web site was proposed, where the users' access patterns can be used to automatically synthesize index pages. The work in
[1] is based on using association rule discovery as the basis
for modeling Web user activity, whereas the approach
proposed in [8] used probabilistic grammars to model Web
navigation patterns for the purpose of prediction. The
approach in [9] proposed building data cubes from Web log
data and later applying Online Analytical Processing
(OLAP) and data mining on the cube model. Web
Utilization Miner (WUM) was presented in [5] to discover
navigation patterns with user-specified characteristics over
an aggregated materialized view of the Web log, consisting
of a trie of sequences of Web views. New fuzzy relational
clustering techniques were used to discover user profiles
that can overlap [4], whereas robust clustering [3] was
proposed to mine profiles that are resistant to noise that is
naturally present in clickstream data. A robust density-
based evolutionary clustering technique was proposed to
discover an optimal number of multiresolution and robust
user profiles [10]. Many Web usage mining approaches are
surveyed in [4].
2.1 Handling Profile Evolution
Most previous research efforts in Web usage mining have
worked with the assumption that the Web usage data is
static. However, the dynamic aspects of Web usage have
recently become important. This is because Web access
patterns on a Web site are dynamic due not only to the
dynamics of Web site content and structure but also to changes in the users' interests and, thus, their navigation patterns. Hence, it is desirable to study and discover Web
usage patterns at a higher level, where such dynamic
tendencies and temporal events can be distinguished.
Mining evolving clickstreams is the subject of only a few
recent research efforts [11], [12], [13]. In [11], an immune-system-inspired approach, called Tracking Evolving Clusters in NOisy Streams (TECNO-STREAMS), was proposed
to continuously learn and adapt to new incoming patterns
by detecting an unknown number of clusters in evolving
noisy data in a single pass. This stream summary or
synopsis consists of a set of cluster representatives with
properties such as scale and age. Apart from the recent
interest in studying evolution in Web clickstreams, there
have been several research efforts in machine learning
regarding the related notion of concept drift. According to
Maloof and Michalski [14], learning evolving concepts adds
another layer of difficulty to the process of online learning,
since concepts can no longer be assumed to be constant. In
an evolving scenario, with time, past training examples may
become obsolete and therefore need to be replaced by more
recent examples. One of the earliest works [14] presented a method for selecting training examples for a partial-memory learning system, which was later extended in [15] by using a time-based forgetting function to remove examples that are older than a certain age from the partial memory. Within the area of personalization, Mitchell et al.'s Personal Assistant [16] trained decision trees to learn how an individual's meetings can be scheduled in a personalized calendar. A time window was used to confine and adapt the training samples for learning changing user preferences.
NewsDude [17] is an intelligent agent built to adapt to changing user interests by learning two separate user models that represent short-term and long-term interests. The short-term model is learned from the most recent observations only, whereas the long-term (default) model represents the user's general preferences. In [18], a user
profiling system was developed based on monitoring the
user's Web browsing and e-mail habits. This system used a
NASRAOUI ET AL.: A WEB USAGE MINING FRAMEWORK FOR MINING EVOLVING USER PROFILES IN DYNAMIC WEB SITES 203
clustering algorithm to group user interests into several
interest themes, and the user profiles had to adapt to
changing interests of the users over time.
Maloof and Michalski [15] classified online learning in
the presence of concept drift as either evolutionary or
revolutionary with regard to adaptation to change. An
evolutionary scheme modifies existing knowledge based on
completely new training examples (for example, STAGGER
[19]), whereas a revolutionary approach discards old knowledge and learns new knowledge from the new training
examples (for example, window-based techniques [20]). A
third approach includes hybrids that inherit from both the
revolutionary and evolutionary approaches. For instance,
Mitchell's Calendar Learning Apprentice [16] learns new
decision rules from training data and incorporates these
new rules into the existing knowledge base. Maloof and
Michalski [14] further classified the way online learning
systems work into three different modes: no memory, partial
memory, or full memory. In the no-memory mode, the system
does not use any past training examples for updating the
current model (for example, STAGGER [19]), whereas in the
partial-memory mode, a subset of the previously seen
training examples is used for later learning. Finally, in the
full-memory mode, all past training examples are used in
updating an existing model. A continuum between no memory and full memory (gradual forgetting) has also been explored, using a forgetting-function-based approach both in supervised learning [21] and in clustering evolving streams [11].
It is important to note that apart from [18] (which was
limited to a small number of attributes and users), all of the
above approaches were proposed within a supervised
learning framework (classification) or focused on adaptation
to a single user (predicting whether an object is relevant or
not). On the other hand, the work that we present in this
paper is based on an unsupervised learning framework that
tries to learn mass anonymous user profiles on the server side.
Nonetheless, according to Maloof and Michalski's categorization of concept drift systems [14], our proposed system
can be categorized as a no-memory revolutionary user profile
mining approach. However, the user profile tracking and
validation approach works in the full-memory mode. Furthermore, in this paper, we are more interested in quantifying
and categorizing or annotating the various types of evolution
(not only detecting evolution and adapting to it), and this,
in turn, can form a higher level of knowledge, in addition to
the description of the profiles themselves as user models.
We adopt an approach based on periodical batch mining, which has the advantage of being easy to adapt to any other unsupervised learning tool that automatically discovers clusters in static or dynamic data. In this work, we use the full-memory mode (periodical or window based), in part because our goal was to describe the user profiles in periodical increments (about two weeks each). Hence, it
was essential to fully mine the Web logs from each period
and then compare the subsequent results.
2.2 Integrating Semantics in Web Usage Mining
Relying only on Web usage data for user modeling or for
personalization can be inefficient, either when there is
insufficient usage data for the purpose of mining certain
patterns or when new pages are added and thus do not
accumulate sufficient usage data at first. The lack of usage
data in these cases can be compensated by adding other
information such as the content of Web pages [22] or the
structure of a Web site [2], [3]. In [22], the keywords that
appear in Web pages are used to generate document
vectors, which are later clustered in the document space
to further augment user profiles. In [2], [3], [10], the Web site's own hierarchical structure is treated as an implicit
taxonomy or concept hierarchy that is exploited in
computing the similarity between any two Web pages on
the Web site. This allows a better comparison between
sessions that contain visits to Web pages that are different
and yet semantically related (for example, under the same
more general topic). The idea of exploiting concept
hierarchies or taxonomies has already been found to
enhance association rule mining [23] and to facilitate
information searching in textual data [24]. Even though
keywords that are present in the Web pages have been used
to add a content aspect to usage data, the keyword-based
approach remains incapable of capturing more complex
relationships at a deeper semantic level. Thus, in [25], a
general framework was proposed for using domain ontolo-
gies to automatically characterize usage profiles containing
a set of structured Web objects.
The advent of dynamic URLs, mostly in tandem with Web databases, has recently made it even more difficult to interpret URLs in terms of user behavior, interests, and
intentions. For instance, consider the following cryptic
association rule within the context of an online bookstore
[26]: if http://www.the_shop.com/show.html?item=123, then http://www.the_shop.com/show.html?item=456, with support = 0.05 and confidence = 0.4. A more meaningful rule would be "users who bought Hamlet also tended to buy How to Stop Worrying and Start Living." This, in turn, motivated [26], which mined patterns of application events instead of patterns of URLs by exploiting the semantics of the
visited pages. In this spirit, service-based concept hierarchies were introduced earlier [27] for analyzing the search behavior of visitors, that is, how they navigate rather than what they retrieve. In this case, concept hierarchies
form the basic method of grouping Web pages together
before Web usage mining. In [26], usage mining was
enhanced by describing the user behavior in terms of an
ontology underlying a particular Web site. The semantic
annotation of the Web content was assumed to have been
performed a priori, since the Web site in question was a
knowledge portal with an inherent RDF annotation. In order
to mine interesting patterns, first, the Web logs were
semantically enriched with ontology concepts. Then, these
semantic Web logs were mined to extract patterns such as
groups of users, users' preferences, and rules. Following a similar approach, in [28], Web usage logs were enriched with semantics derived from the content of the Web site's pages.
Content keywords were first mapped to the categories of a
manually constructed domain-specific taxonomy through
the use of a thesaurus, and then the Web documents were
clustered based on the taxonomy categories. The enhanced
Web logs, called C-Logs, were then used as input to Web
usage mining.
Most of the efforts cited above rely on an explicit taxonomy that needs to be handcrafted by an expert
before the analysis. On the other hand, the implicit
taxonomy, as used in [2], [3], is inferred automatically and
quickly from the Web site directory structure via URL
tokenization. Furthermore, this implicit taxonomy does not
require any modification to the underlying data mining
algorithm, since it is only incorporated within the similarity
measure used to cluster the user sessions. In this paper, we
will exploit both an implicit taxonomy as inferred from the
Web site directory structure and an explicit taxonomy as
inferred from data that is external to the Web logs and that is already available in the Web site's content database.
3 PROFILE DISCOVERY BASED ON WEB
USAGE MINING
The framework for our Web usage mining and a road map to the rest of this paper are summarized in Fig. 1, which starts
with the integration and preprocessing of Web server logs
and server content databases, includes data cleaning and
sessionization, and then continues with the data mining/
pattern discovery via clustering. This is followed by a
postprocessing of the clustering results to obtain Web user
profiles and finally ends with tracking profile evolution.
The automatic identification of user profiles is a knowledge
discovery task consisting of periodically mining new
contents of the user access log files and is summarized in
the following steps:
1. Preprocess Web log file to extract user sessions.
2. Cluster the user sessions by using Hierarchical
Unsupervised Niche Clustering (H-UNC) [10].
3. Summarize session clusters/categories into user
profiles.
4. Enrich the user profiles with additional facets by using further Web log data and external domain knowledge.
5. Track current profiles against existing profiles.
3.1 Preprocessing the Web Log File to Extract User
Sessions
The access log of a Web server is a record of all files (URLs)
accessed by users on a Web site. Each log entry consists of
the access time, IP address, URL viewed, REFERRER (the
Web page visited just prior to the current one), etc. The first step in preprocessing [1], [2] consists of mapping the $N_U$ URLs on a Web site to distinct indices. A user session consists of requests from the same IP address within a predefined time period. Each URL on the site is assigned a unique number $j \in \{1, \ldots, N_U\}$, where $N_U$ is the total number of valid URLs. The $i$th user session is then encoded as an $N_U$-dimensional binary attribute vector $\mathbf{s}_i$ with the following property:

$$s_{ij} = \begin{cases} 1 & \text{if user } i \text{ accessed URL } j, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
In addition to URLs, we encode the search query terms from the initial request's REFERRER field and take advantage of the power-law properties of session lengths (with the majority tending to be short) [29] to implement sessions as lists instead of vectors, thus saving memory and computational costs.
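As an illustration, the sessionization and sparse encoding described above can be sketched as follows. The log-entry format `(ip, unix_time, url)` and the 30-minute timeout are simplifying assumptions of this sketch, not the exact preprocessing used in the study:

```python
def sessionize(log_entries, timeout=1800):
    """Group (ip, unix_time, url) log entries into user sessions:
    requests from the same IP within the timeout belong to one session."""
    entries = sorted(log_entries, key=lambda e: (e[0], e[1]))
    sessions, current, last_time = [], {}, {}
    for ip, t, url in entries:
        # Start a new session on a first request or after a long gap.
        if ip not in current or t - last_time[ip] > timeout:
            current[ip] = set()
            sessions.append(current[ip])
        current[ip].add(url)
        last_time[ip] = t
    return sessions

def encode(sessions):
    """Assign each distinct URL an index j and represent each binary
    session vector s_i as the sparse set of indices it accessed
    (exploiting the fact that most sessions are short)."""
    urls = sorted({u for s in sessions for u in s})
    index = {u: j for j, u in enumerate(urls)}
    return [{index[u] for u in s} for s in sessions], index
```

Storing only the accessed indices per session is what makes the list representation cheaper than a dense $N_U$-dimensional vector.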
3.2 Clustering Sessions into an Optimal Number of
Categories
To cluster user sessions, we use H-UNC [10], a divisive hierarchical version of a robust clustering approach, Unsupervised Niche Clustering (UNC) [30], that uses a Genetic Algorithm (GA) [31] to evolve a population of candidate solutions through generations of competition and reproduction. The reason that we use H-UNC instead of other clustering algorithms is that, unlike most other algorithms, H-UNC can handle noise in the data and automatically determines the number of clusters. In addition, evolutionary optimization allows the use of any domain-specific optimization criterion and any similarity measure, in particular a subjective measure that exploits domain knowledge or ontologies, as given in (3). However, unlike purely evolutionary search-based algorithms, H-UNC combines evolution with local Picard updates to estimate the scale $\sigma_i$ of each profile, thus converging fast (about 20 generations). More details on the H-UNC algorithm can be found in [10].
Fig. 1. Web usage mining process and discovered profile facets.
3.3 Similarity Measure Used in Clustering
The similarity score between an input session $\mathbf{s}$ and the $i$th profile $\mathbf{p}_i$ can be computed using the cosine similarity as follows (where $N_U$ is the total number of URLs):

$$S_{\text{cosine}}(s, i) = \frac{\sum_{k=1}^{N_U} p_{ik}\, s_k}{\sqrt{\sum_{k=1}^{N_U} p_{ik}^2}\, \sqrt{\sum_{k=1}^{N_U} s_k^2}}. \qquad (2)$$
If the hierarchical Web site structure is to be taken into account, then a modification of the cosine similarity, which we introduced in [1], [2], can be used to yield the following similarity measure:

$$S_{\text{Web}}(s, i) = \max\left( \frac{\sum_{l=1}^{N_U} \sum_{k=1}^{N_U} p_{il}\, S_u(l, k)\, s_k}{\sum_{l=1}^{N_U} p_{il} \sum_{k=1}^{N_U} s_k},\; S_{\text{cosine}}(s, i) \right), \qquad (3)$$
where $S_u(i, j)$ is a URL-to-URL similarity function that is computed based on the amount of overlap between the paths $P_i$ and $P_j$ leading from the root of the Web site (the main page) to the two URLs $i$ and $j$. This is given by

$$S_u(i, j) = \begin{cases} 1 & \text{if } i = j, \\ \min\left(1, \dfrac{|P_i \cap P_j|}{\max\left(1, \max(|P_i|, |P_j|) - 1\right)}\right) & \text{otherwise.} \end{cases} \qquad (4)$$
We refer to the special similarity in (3) as the Web Session Similarity. This Web similarity takes into account not only the hierarchical structure of the Web site content as inferred from the URL address itself (for example, URLs a/b/c and a/b/d are related from the hierarchical structure aspect) but also how different content items on the Web site relate to each other according to an externally defined Web site ontology (for example, URLs pages.aspx?x=30 and pages.aspx?x=40 are semantically related from an external ontology aspect if these two URLs can be mapped to A/B and A/C, that is, if they refer to content areas B and C that share the same parent A). Thus, the combination of hierarchical site structure and external ontology occurs naturally in two stages: first, each URL is parsed to extract the structure, and then each remaining dynamic URL (after the first stage) is mapped according to the ontology, as explained in Section 4. In this case, we use a simple ontology based on is-a relationships (that is, a taxonomy) between individual dynamic URLs and higher level categories encoded in a Web site ontology. This similarity is used in our clustering algorithm (H-UNC) to group similar user sessions into clusters or profiles. The URL-to-URL similarities in (4) form a sparse matrix; hence, only nonzero values are stored. Furthermore, access to these values is accelerated by hashing the two indices corresponding to a given pair of URLs. In addition, for the purpose of clustering, the similarity measure $S_{\text{Web}}$ in (3) is mapped to a distance $d_{\text{Web}} = (1 - S_{\text{Web}})^2$. Because the distances $1 - S_{\text{Web}}$ are in $[0, 1]$, squaring them was found to cause more distinction between smaller and larger distances and therefore helped delineate clusters more easily. This distance measure will also be used to compare and track evolving profiles, as explained in Section 6.
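A minimal sketch of the similarity and distance computations in (2), (3), and (4) is given below. The representation choices are ours: profiles are URL-to-weight dictionaries, sessions are sets of URLs, and `paths` maps each URL to its tokenized path from the site root.

```python
def url_similarity(path_i, path_j):
    """S_u in (4): overlap of the tokenized paths from the site root,
    e.g. path_i = ("a", "b", "c") for URL a/b/c."""
    if path_i == path_j:
        return 1.0
    shared = 0
    for a, b in zip(path_i, path_j):  # length of the shared prefix
        if a != b:
            break
        shared += 1
    return min(1.0, shared / max(1, max(len(path_i), len(path_j)) - 1))

def cosine_similarity(profile, session):
    """S_cosine in (2), with the session as a binary (set) vector."""
    num = sum(profile.get(u, 0.0) for u in session)
    den = (sum(w * w for w in profile.values()) ** 0.5) * (len(session) ** 0.5)
    return num / den if den else 0.0

def web_session_similarity(profile, session, paths):
    """S_Web in (3): gives partial credit to distinct but related URLs."""
    num = sum(profile[l] * url_similarity(paths[l], paths[k])
              for l in profile for k in session)
    den = sum(profile.values()) * len(session)
    soft = num / den if den else 0.0
    return max(soft, cosine_similarity(profile, session))

def web_distance(profile, session, paths):
    """d_Web = (1 - S_Web)^2, used for clustering and profile tracking."""
    return (1.0 - web_session_similarity(profile, session, paths)) ** 2
```

Note that for sibling URLs such as a/b/c and a/b/d, $S_u$ evaluates to 1 under (4), so sessions touching nearby parts of the hierarchy are still considered close even when their URL sets do not intersect.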
Our approach only implicitly incorporates information about the Web pages' content. This is different from methods based on the explicit content of the Web pages, as we infer this information from hierarchy knowledge that is external to the Web logs. Related to our semantic similarity in (4) are several measures proposed in the past by Resnick [32], Wu and Palmer [33], and Lin [34] to relate concepts, and by Ziegler et al. [35] for recommendations. A good survey with extensions in a fuzzy-set-theoretic framework can also be found in [36]. When formulated using our notation above for path length and path intersection, Wu and Palmer's similarity [33] can be written as

$$S_{\text{WuPalmer}} = \frac{|P_i \cap P_j|}{(|P_i| + |P_j|)/2}.$$
Thus, normalization is done by dividing by the average of
the path lengths, whereas our similarity divides by the
maximal length. Hence, our similarity is more restrictive
and penalizes more for widely differing path lengths that
correspond to concepts at widely different levels of
specificity and generality. Resnick's similarity [32] is defined as the Information Content (IC) of the closest common ancestor concept $c_3$ of concepts $c_1$ and $c_2$. In our notation, $c_3$ is located at the end of the intersection (shared) path between their URL paths $P_1$ and $P_2$. However, IC is computed based on the log probability of occurrence of the concept over an entire text corpus's content. Our measure relies only on superficial URLs in tokenized form (for example, a/b/c) and not their content. Hence, it is lighter (faster and easier) to implement. In addition, IC requires a rigid taxonomy and a reliable corpus to be able to accurately estimate the probabilities. Aside from that, since Web sessions follow a power-law distribution [29], the majority of concepts/URLs are in the long tail of the distribution and thus have very low probabilities, which can limit the applicability of the IC measure as a pure similarity measure. Lin [34] extended Resnick's similarity by dividing it by the sum of the ICs of the concepts $c_1$ and $c_2$, hence suffering from the same drawbacks as the former measure. We also note that our similarity measure in (4), first proposed in [2], [3], has recently been generalized in [37] for information retrieval within the context of digital libraries.
3.4 Postprocessing and Enrichment of Session
Clusters into Multifaceted User Profiles
After automatically grouping sessions into different clusters, we summarize the session categories in terms of user profile vectors [3], [4] $\mathbf{p}_i$. The $k$th component/weight of this vector ($p_{ik}$) captures the relevance of URL$_k$ in the $i$th profile, as estimated by the conditional probability that URL$_k$ is accessed in a session belonging to the $i$th cluster (this is the frequency with which URL$_k$ was accessed in the sessions belonging to the $i$th cluster). The profiles are then converted to binary vectors (sets) so that only URLs with weights $> 0.15$ remain. The model is further extended to a robust profile [2], [3] based on robust weights ($w_{ij}$) computed in the UNC algorithm (see Section 3.2) that assign only sessions with high robust weights (that is, $w_{ij} > w_{\min}$) to a cluster's core. The core of a profile consists only of sessions that are very similar to the representative profile. Thus, noisy sessions are eliminated from the recomputation of profiles. Each profile $\mathbf{p}_i$ is discovered along with an automatically determined measure of scale $\sigma_i$ that represents the amount of variance or dispersion of the user sessions in a given cluster around the cluster representative (profile). This measure will later serve an important role in determining the boundary of each cluster and thus allows us to automatically determine whether two profiles are compatible or not.
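The summarization step can be sketched as follows. The robust weights $w_{ij}$ and the threshold $w_{\min}$ are produced by UNC, so the values used in this sketch are purely illustrative; the 0.15 cutoff is the URL-relevance threshold stated above.

```python
def profile_from_cluster(sessions, robust_weights, w_min=0.5, url_cutoff=0.15):
    """Summarize a session cluster into a binary profile (set of URLs).
    p_ik is estimated as the fraction of core sessions (those with
    robust weight w_ij > w_min) that accessed URL k; only URLs with
    p_ik > url_cutoff are kept."""
    core = [s for s, w in zip(sessions, robust_weights) if w > w_min]
    if not core:
        return set()
    counts = {}
    for s in core:
        for url in s:
            counts[url] = counts.get(url, 0) + 1
    return {url for url, c in counts.items() if c / len(core) > url_cutoff}
```

Restricting the count to core sessions is what makes the recomputed profile robust: outlying (noisy) sessions with low $w_{ij}$ simply never contribute to $p_{ik}$.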
In addition to the cluster-induced user profiles above, we
were interested in several descriptors of the users in each
cluster. The additional profile descriptors, which we refer to as facets (Fig. 1), were extracted partly from the Web logs themselves (from the REFERRER field), partly from external public information (the www.whois.com Web service), and partly from domain-specific information (the Web site content and registration database). In addition to the viewed Web pages, the profile properties include the following facets (see Fig. 3 for a real example):
1. Search queries. These are queries submitted to search
engines before visiting the Web site for sessions that
belong to this profile.
2. Inquiring companies. These are companies/organiza-
tions of registered users or unregistered users whose
IP addresses can be mapped.
3. Inquired companies. These are companies/organiza-
tions that have been inquired about during the
sessions belonging to this profile.
Such a rounded representation of a profile gives a
panoramic view of a cluster of Web site visitors that can
help in understanding their interests better and further be
harnessed toward supporting personalization efforts.
3.4.1 Enriching User Profiles with Search Query Terms
(Search Queries)
In addition to the relevant URLs that are extracted from the sessions assigned to each profile, we can extract information about the explicit information need of the users in each profile from the queries that they may have typed prior to visiting the Web site, when this information is available from the REFERRER field in the Web log files. Hence, for each profile, we accumulate all the search phrases extracted from the REFERRER fields of the assigned user sessions. This allows us to describe each profile in terms of either a set of significant URLs or a set of explicit search query phrases and terms.
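For instance, a search phrase can be recovered from a REFERRER value as sketched below. The query-parameter names checked ('q', 'p', 'query') are common search-engine conventions and are an assumption of this sketch, not an exhaustive list.

```python
from urllib.parse import urlparse, parse_qs

def search_phrase(referrer):
    """Return the search phrase embedded in a REFERRER URL, or None
    if the referrer does not look like a search-engine result page."""
    query = parse_qs(urlparse(referrer).query)
    for key in ("q", "p", "query"):  # common engine parameter names
        if key in query:
            return query[key][0]
    return None
```

`parse_qs` already decodes '+' signs and percent-escapes, so the returned phrase is the plain text the user typed.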
3.4.2 Enriching User Profiles with Inquiring Company
Information (from Companies)
In addition to the relevant URLs that are extracted from the
sessions assigned to each profile, we can extract information
about which companies or organizations tend to visit the
Web site and fall into this profile. We extract this information from two complementary sources: 1) by getting the company information that corresponds to an ID in the server content database, where the ID is extracted from the Web log file in case the visitors registered and signed in through the registration page, or 2) if the visitors did not sign in through the registration page, an attempt is made to obtain the company affiliation from a specialized Web service (www.whois.com), which can be queried with an IP address via an API. This makes it possible to determine not only what information was found relevant on the Web site but also to whom it was relevant, to help support further personalization efforts.
3.4.3 Enriching User Profiles with Queried Company
Information (about Companies)
The Web site under study provides a virtual meeting point between different companies providing various services related to the portal's subject. Hence, it was important to know not only which companies take part in each cluster of activities but also what company information seemed to be relevant to users in each cluster. For this reason, in addition to the relevant URLs that are extracted from the sessions assigned to each profile, we extracted information about which companies have been inquired about by visitors in this profile, in case a user searches for and clicks on one of the listed companies' contact information on the Web site.¹ We parse the identity of the company from the Web log file and map it to a specific company via the server content database.
NASRAOUI ET AL.: A WEB USAGE MINING FRAMEWORK FOR MINING EVOLVING USER PROFILES IN DYNAMIC WEB SITES 207
1. Users that search for companies in a certain area, click on a company
name in order to have the Web server return the company specialty area
and contact information.
4 EXPLOITING AN EXTERNAL ONTOLOGY FOR
MAPPING AND RELATING DYNAMIC WEB PAGES
Most of today's Web sites deliver a large proportion of dynamic URLs, if not only dynamic URLs. A dynamic URL is a page address that results from the search of a database-driven Web site or a Web site that runs a script. Unlike static URLs, for which the contents of the Web page do not change, dynamic URLs are typically generated from specific queries to a site's database. Even though the examples given in the following discussion consistently use the ASP extension, this extension can be replaced by any other dynamic URL extension (such as PHP) without any changes to our generic approach. Although static Web pages tend to have meaningful URLs such as /reports/fall_2003/benefits.html, most dynamic URLs, such as /universal.aspx?id=55&codes_id=60, are unfortunately hard to discern or even recognize based only on their URL. We resolved this issue by resorting to available external data2 that maps database contents to a dynamic resource and its parameter values. The ASP codes in most menus can be mapped during the preprocessing phase to a parent/child structure by using external data (excerpt in Table 1), thus mapping URLs to meaningful hierarchical descriptions.
We illustrate our mapping procedure with the Regulations and Laws page. In the Web log data, this URL is recorded only as universal.aspx?id=56. Table 2 lists its content information as Regulations and Laws under the field item_name. Furthermore, its parent (at item level 0) is the item with code menus_id = 4,939, whose label is NST Center®, as given by the field item_name. Hence, this URL is mapped to a semantic label: NST Center®/Regulations and Laws.
In general, we need to read the parent of each item and then recursively map a dynamic URL such as universal.aspx?id=56 to a string consisting of tokens separated by /, where the tokens are the labels of the parent items. Insertion is done in reverse order, from the end to the start of the final composed label, until we reach the parent at level 0. Both implicit (the URL itself) and explicit (Table 1) taxonomy information are seamlessly incorporated into the session clustering via the computation of the special session similarity measure in (3).
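The recursive label composition can be sketched as below. The in-memory taxonomy dict stands in for the external table excerpted in Table 1; the field names (item_name, parent_id) and the keying by the menus_id parameter are assumptions made for illustration.

```python
# Tiny excerpt of the external taxonomy, keyed by menus_id.
# Each entry holds the item's label and its parent's menus_id
# (None at item level 0). Field names are illustrative.
TAXONOMY = {
    4939: {"item_name": "NST Center®", "parent_id": None},
    56:   {"item_name": "Regulations and Laws", "parent_id": 4939},
}

def semantic_label(menus_id, taxonomy=TAXONOMY):
    """Map a dynamic URL parameter to a '/'-separated hierarchical label,
    walking from the item up to its level-0 ancestor and prepending each
    label (reverse-order insertion, as described in the text)."""
    tokens = []
    while menus_id is not None:
        item = taxonomy[menus_id]
        tokens.insert(0, item["item_name"])  # prepend the parent's label
        menus_id = item["parent_id"]
    return "/".join(tokens)
```

For example, `semantic_label(56)` reproduces the mapping of universal.aspx?id=56 to NST Center®/Regulations and Laws.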
5 TRACKING EVOLVING USER PROFILES
Tracking different profile events across different time
periods can generate a better understanding of the evolution of user access patterns and seasonality. Note that both profiles and clickstreams are typically evolving, since the profiles are nothing more than summaries of the clickstreams, which are themselves evolving. Each profile p_i is discovered along with an automatically determined measure of scale σ_i that represents the amount of variance or dispersion of the user sessions in a given cluster around the cluster representative. This measure is used to determine the boundary around each cluster (an area located at a distance σ_i from the profile p_i) and thus allows us to automatically determine whether two profiles are compatible. Two profiles are compatible if their boundaries overlap. The notion of compatibility between profiles is
essential for tracking evolving profiles. After mining the
Web log of a given period, we perform an automated comparison between all the profiles discovered in the current batch and the profiles discovered in the previous batch via a sequence of SQL queries on the profiles stored in a database, as shown in the TrackProfiles algorithm. A typical query for retrieving corresponding profiles between periods T−1 and T is SELECT ThisProfile, TothisProfile FROM ProfileTrail WHERE Period = T−1.
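The batch-to-batch comparison rests on the compatibility test between profile boundaries. A minimal sketch, assuming each profile carries its representative (center) and its scale sigma, and that a distance function consistent with the session similarity in (3) is supplied; the dict layout is illustrative, not the paper's data model.

```python
def compatible(profile_a, profile_b, distance):
    """Comp(p_i, p_j): TRUE only if the cluster boundaries overlap, i.e.,
    the distance between the two representatives does not exceed the sum
    of the scales sigma_i + sigma_j."""
    return distance(profile_a["center"], profile_b["center"]) <= (
        profile_a["sigma"] + profile_b["sigma"]
    )

def match_profiles(current, previous, distance):
    """For each profile in the current batch, list the indices of the
    compatible profiles from the previous batch (an empty list marks a
    Birth candidate; an unmatched previous profile marks a Death)."""
    return {
        i: [j for j, old in enumerate(previous) if compatible(new, old, distance)]
        for i, new in enumerate(current)
    }
```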
We define a profile evolution event as a coarse categorization of the possible real evolution scenarios that describe how profiles discovered during a certain period relate to profiles discovered in another period. The above
208 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 2, FEBRUARY 2008
2. That is, external to the Web logs.
TABLE 1
Partial Taxonomy of a Few Dynamic URLs (Identified by Base URL (url) and Parameter (menus_id))
TABLE 2
Taxonomy Data for the Dynamic URL universal.aspx?id=56
comparison process determines which new profiles are
compatible with the old profiles and which new profiles
are incompatible with any previous profile. These last two
cases, respectively, give rise to two kinds of events: Persistence and Birth. A third event, Death, arises when an old profile does not find a compatible profile in the new batch. It is also possible to track profile reemergence
in the long term. This is the case of an old profile that
disappears and then reappears when it is found to be
compatible with a new profile in the current batch. This
event is labeled as Atavism. We can visualize the temporal dynamics of profile birth, persistence, death, and atavism (rebirth) by labeling the x-axis with the periods corresponding to the different Web log batches that undergo
Web usage mining: period 1, period 2, etc. On the other
hand, the y-axis is used to indicate the profile index: New
profiles are vertically expanded by adding new indices on
top of existing ones. Finally, we generate a plot depicting
the Web site user trend evolution by adding a special
symbol whenever profile y appears in period x and
possibly adding event labels such as Birth, Death, and
Atavism, as these occur. This idea is illustrated in Fig. 2.
Note that this tracking takes advantage of a database management system to accelerate access to archived user profiles (which, being summaries, are negligible in number compared to the original input data). Moreover, this process is performed offline and only periodically (adding no burden on the data mining/clustering itself), since it is an offline analysis of the Web usage mining results that helps track the evolution of user profiles in retrospect. The choice of the basic period length can be either arbitrary or based on domain knowledge and intuition (for example, whether changes have been made to the Web site or whether new events related to the Web site's domain have occurred). In our experiments, we chose periods that varied from one week to one month. In general, periods that are too short will detect fewer changes than longer periods. Thus, the right period length should be determined by trial and error.
The analysis of profile evolution, as shown in Fig. 2, can improve our understanding of user activity trends and detect seasonality in access patterns, especially over a long time span. It also helps in implementing a dynamic recommendation strategy, for instance, by caching frequently reemerging (atavistic) profiles. Dead profiles can be relegated to secondary memory for possible reemergence, whereas persistent profiles can be kept in primary memory for fast access and then relegated to secondary memory when they die. Similarly, dead profiles that have been persistent during an earlier period should be distinguished from dead profiles that have never been persistent, that is, volatile profiles. Table 3 lists the formal conditions defining evolution events and their potential implications in a marketing context. The time period T denotes the basic unit of analysis (for example, one week). We define the Boolean predicate Comp(p_i, p_j) to be TRUE only if the profiles are compatible. We denote consecutive time periods as T−1, T, T+1, T+2, etc., with T being the current period and P_T being the set of mass user profiles discovered during period T. Table 4 shows profile events determined for real sessions in consecutive periods based on the profiles discovered in Table 5. Aggregating evolution events can help track profiles over many periods. For example, averaging the number of atavisms of a profile over several periods can summarize its changes.
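The event categorization can be sketched from the compatibility predicate alone. A hedged illustration: Comp is passed in as a Boolean function, and the Atavism check against archived (dead) profiles is omitted for brevity; Table 3 gives the authoritative conditions.

```python
def classify_events(previous, current, comp):
    """Label the evolution events between two consecutive batches:
    - Persistence: a current profile compatible with some previous profile,
    - Birth: a current profile compatible with no previous profile,
    - Death: a previous profile with no compatible current profile.
    `comp(p, q)` is the Boolean compatibility predicate Comp(p_i, p_j)."""
    events = []
    for q in current:
        kind = "Persistence" if any(comp(p, q) for p in previous) else "Birth"
        events.append((kind, q))
    for p in previous:
        if not any(comp(p, q) for q in current):
            events.append(("Death", p))
    return events
```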
Fig. 2. Visualization of the profile evolution. A dot in location x; y
indicates that profile y was active in time period x.
Fig. 3. Different facets of a cluster profile (inquiring companies were obfuscated for the sake of privacy). Center: pages viewed. Right: search queries
submitted to external engines prior to landing on this Web site. Top and left: inquiring companies that are active in this profile (from registered and
unregistered users, respectively). Bottom: inquired companies about which users in this profile have sought information.
6 A SYSTEMATIC APPROACH TO PROFILE AND
EVOLUTION VALIDATION IN AN INFORMATION
RETRIEVAL CONTEXT
In this paper, we view the discovered profiles as frequent
patterns that provide one way of forming a summary of the
input data. As a summary, profiles represent a reduced
form of the data that is, at the same time, as close as possible
to the original input data. This description is reminiscent of
an information retrieval scenario in the sense that profiles
that are retrieved should be as close as possible to the
original session data. Closeness should take both of the
following into account:
1. Precision. A summary profile's items are all correct, that is, included in the original input data; they include only true data items.
2. Coverage/Recall. A summary profile's items are complete with respect to the data being summarized; that is, they include all the data items.
These criteria are clearly contradictory, since precision favors only the smallest profiles, possibly consisting of a single URL, whereas coverage favors the largest possible profiles. Ideally, each data query would be answered by a profile that is identical to this query. However, this is unrealistic, since it would require the profile summary to be identical to the entire input database. Therefore, the
summary should consist of the smallest number of profiles
that are as similar as possible to the input data. Our
validation procedure [38] attempts to answer the following
questions:
1. Is the data set completely summarized/represented
by the mined profiles/patterns?
2. Is the data set faithfully/accurately summarized/
represented by the mined profiles/patterns?
Each of the previous questions is answered by computing coverage/recall as part of a quality or interestingness measure to answer question 1, and precision as part of a quality or interestingness measure to answer question 2. First, we compute the following interestingness measures for each discovered profile, letting the quality or interestingness measure Q_ij = Cov_ij (that is, coverage) answer question 1 and Q_ij = Prec_ij (that is, precision) answer question 2. Here, the coverage and precision of a discovered mass profile p_i as a summary of an input session s_j are given by Cov_ij = |s_j ∩ p_i| / |s_j| and Prec_ij = |s_j ∩ p_i| / |p_i|. A combined measure of precision and coverage is given by the F_1 information retrieval metric Q_ij = F_{1,ij}, which simultaneously answers questions 1 and 2 and is given by

F_{1,ij} = 2 · Prec_ij · Cov_ij / (Prec_ij + Cov_ij).
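With sessions and profiles both represented as sets of URLs, the three measures reduce to a few lines. A minimal sketch, assuming set-valued, nonempty sessions and profiles:

```python
def coverage(session, profile):
    """Cov_ij = |s_j ∩ p_i| / |s_j|: the fraction of the session's URLs
    that the profile includes (answers question 1)."""
    return len(session & profile) / len(session)

def precision(session, profile):
    """Prec_ij = |s_j ∩ p_i| / |p_i|: the fraction of the profile's URLs
    that actually occur in the session (answers question 2)."""
    return len(session & profile) / len(profile)

def f1(session, profile):
    """F_{1,ij} = 2 * Prec_ij * Cov_ij / (Prec_ij + Cov_ij), the harmonic
    mean that balances the two contradictory criteria."""
    p, c = precision(session, profile), coverage(session, profile)
    return 0.0 if p + c == 0 else 2 * p * c / (p + c)
```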
Now, let

S(T_1, T_2) = { s_j ∈ S_{T_1} | max_{p_i ∈ P_{T_2}} Q_ij ≥ Q_min }

be the subset of the input user sessions S_{T_1}, logged during period T_1, that are summarized by any of the user profiles P_{T_2}, discovered in period T_2, with a quality level higher than a given minimum quality threshold Q_min. Then, we can capture concept drift by the decline of the metric defined as follows, particularly when T_2 occurs earlier than T_1:
TABLE 3
Profile Evolution Events and Interpretation within a Marketing Context
TABLE 4
Profile Evolution Corresponding to Table 5 for June through September 2004
Q(S_{T_1}, P_{T_2}) = |S(T_1, T_2)| / |S_{T_1}|,     (5)

where |·| denotes the cardinality of a set. When Q_ij = Cov_ij, we call Q(S_{T_1}, P_{T_2}) the Cumulative Coverage of sessions, and it answers question 1. When Q_ij = Prec_ij, we call Q(S_{T_1}, P_{T_2}) the Cumulative Precision of sessions, and it answers question 2.
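The cumulative quality in (5) is the fraction of sessions whose best match against the discovered profiles reaches the threshold. A sketch, reusing set-valued sessions and profiles with a pluggable per-pair quality measure Q_ij:

```python
def cumulative_quality(sessions_t1, profiles_t2, q, q_min):
    """Q(S_T1, P_T2) = |S(T1, T2)| / |S_T1|: the fraction of the sessions
    logged in period T1 that are summarized by some profile of period T2
    with per-pair quality q(s, p) of at least q_min (cf. Eq. (5))."""
    summarized = [
        s for s in sessions_t1
        if max(q(s, p) for p in profiles_t2) >= q_min
    ]
    return len(summarized) / len(sessions_t1)
```

Plugging in the set-based coverage for q yields the Cumulative Coverage; plugging in precision yields the Cumulative Precision.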
BITAK) as a researcher and, in the summer of 2007, was an intern with Yahoo! Search Marketing. Her research interests include Web mining, multiagent systems, and genetic programming. She is a member of the IEEE.
Antonio Badia received the PhD degree in
computer science from Indiana University in
1997. He is currently an associate professor and
the founding director of the Database Laboratory, Department of Computer Science and
Computer Engineering, University of Louisville.
His research interests include query processing
and optimization, extended query languages,
integration of documents in databases, and data
mining and its applications to counterterrorism.
He is a member of the IEEE and the ACM. He received a US National
Science Foundation Faculty Early Career Development (CAREER)
Award.
Richard Germain received the PhD degree in
marketing from Michigan State University in
1989. He is the author of three books and
several articles, mostly in supply chain management, which have appeared in the Strategic
Management Journal, the Decision Sciences
Journal, the Journal of Business Logistics, the
Journal of Marketing Research, the Journal of
International Business Studies, Academy of
Marketing Science, and so forth.