Web Navigation Pattern Prediction Process

N. Janani, II CSE
Computer Science and Engineering
Cheran College of Engineering
Karur, TamilNadu
jankodumudi@gmail.com
G. Sasikala, II CSE
Computer Science and Engineering
Cheran College of Engineering
Karur, TamilNadu
gmsweetsai@gmail.com

Abstract: A large number of users now surf the World Wide Web,
and the information on the Web is growing explosively. This makes
it more difficult for users to find relevant and useful information
within the large amount available. As a result, the discovery and
analysis of useful information from the World Wide Web has become
an important issue. Web access prediction is used to predict a
user's navigation pattern through a web site. Web usage mining
is the application of data mining techniques to discover useful
patterns from Web data. Web access logs are used for this purpose.
An access log is a file to which the Web server writes information
every time a user requests a resource. It contains the details of
each request, such as the page accessed, the request method, the
date, and the time. In this paper, access logs are used for web
navigation prediction, which is the process of predicting the next
set of pages that a user may visit based on the previously visited
pages.

Index Terms: Web usage mining, Association Rule Mining,
Markov model.

II. RELATED WORK
The web navigation problem was addressed in [2] by using
a Markov chain, based on computing the information
contained in a typical navigation trail. Association Rule
Mining and kNN-based collaborative filtering have been
used to perform web personalization. The web navigation
problem has also been addressed in [9], [10] by using
Support Vector Machines (SVM), Artificial Neural
Networks (ANN), and the Markov Model. This paper is
based on [1], where Association Rule Mining (ARM) and
the Markov Model (MM) are used. Our work uses Weighted
Association Rule Mining (WARM) for navigation
prediction and combines the results of WARM and MM
to improve prediction accuracy.

I. INTRODUCTION
The World Wide Web has become a pervasive tool, used in many
areas to find information related to business, education, and
much more. The Internet provides a rich environment for users to
retrieve information. At the same time, it also makes it easy for
a user to get lost in the sea of information. The activity of
searching for information consists of a cycle: (i) submitting a
query to a search engine, (ii) selecting a page for browsing from
the returned list of pages, and (iii) navigating (surfing through
links). This paper focuses on the navigation step. The purpose is
to predict the future requests of users while they surf a website.
This is called web navigation prediction.

The Web Navigation Prediction problem is to predict the next set
of pages that a user may visit, based on knowledge of the pages
previously visited while surfing a website. The prediction relies
mainly on web server access logs. The web server maintains a log
that records the details of every request made to a website. Web
Usage Mining attempts to find useful information in these server
logs. When a prediction model for a certain Web site is available,
the search engine can use it to cache the next set of pages that
users might visit [1], [3].

III. WEB USAGE MINING
Web usage mining is the area of data mining that deals
with the discovery and analysis of usage patterns from Web
data, specifically web logs, in order to improve web-based
applications. Web usage mining consists of three phases:
preprocessing, pattern discovery, and pattern analysis. After
these three phases are complete, the user can find the
required usage patterns and use this information for
specific needs. It is mainly based on access logs. The
format of the log file is:

<ip_addr><base_url><date><method>
<file><protocol><code><bytes>

Fig. 1. Common Log Format

IV. DATA PREPROCESSING
Web log preprocessing aims to reformat the original web
logs in order to identify all web access sessions. The Web
server usually records all users' access activities on the
website as Web server logs. Because server settings differ,
there are many types of web logs, but the log files typically
share the same basic information, such as client IP address,
request time, requested URL, HTTP status code, and referrer.
Generally, several preprocessing tasks need to be done
before web mining algorithms are run on the Web server
logs. The preprocessing consists of two steps: data cleaning
and session identification.

[Figure: the log file is passed to preprocessing, which consists
of data cleaning followed by session identification.]

Fig. 2. Preprocessing

Data cleaning
In the original web logs, not all log entries are valid for
web usage mining. We only want to keep the entries that carry
relevant information. Therefore, data cleaning is used to
eliminate the irrelevant entries from the log file. HTTP
issues a separate request for every file that is fetched from
the web server. A user's request to view a particular page
therefore often results in several independent log entries,
since hyperlinked elements such as images, style sheets, and
scripts are downloaded in addition to the HTML file. In most
cases, only the log entry for the HTML file request is relevant
and should be kept for the user sessions. Since the main
intention of Web Usage Mining is to get a picture of the user's
behavior, it does not make sense to process requests for such
embedded files. Removing them also reduces the size of the data
to be analyzed.

Example Log File

66.249.65.107 - - [08/Oct/2007:04:54:20 -0400]
"GET /support.html HTTP/1.1" 200 11179 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"

111.111.111.111 - - [08/Oct/2007:11:17:55 -0400]
"GET /style.css HTTP/1.1" 200 3225
"http://www.loganalyzer.net/" "Mozilla/5.0 (Windows;
U; Windows NT 5.2; en-US; rv:1.8.1.7)
Gecko/20070914 Firefox/2.0.0.7"
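
The data cleaning step can be sketched in Python as follows. This is
a minimal illustration, not the implementation used in this work: it
assumes each entry is on a single line in the layout of the example
entries above, and the names LOG_PATTERN, RESOURCE_EXTENSIONS,
parse_line, is_page_request and clean_log, as well as the list of
file extensions to discard, are hypothetical choices.

```python
import re

# Pattern for one log entry laid out like the example entries above
# (IP, timestamp, request line, status, bytes, optional referrer and agent).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Assumed list of embedded-resource extensions to discard.
RESOURCE_EXTENSIONS = (".css", ".js", ".gif", ".jpg", ".jpeg", ".png", ".ico")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_page_request(entry):
    """True unless the requested URL points at an embedded resource."""
    path = entry["url"].split("?", 1)[0].lower()
    return not path.endswith(RESOURCE_EXTENSIONS)


def clean_log(lines):
    """Parse raw log lines and keep only the page requests."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None and is_page_request(entry)]
```

Applied to the two example entries, this sketch would keep the request
for /support.html and discard the request for /style.css.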

Session identification
A user session can be defined as the record of a user's navigation
history within a particular period of time. For logs that span long
periods of time, it is very likely that users will visit the web
site more than once. The goal of session identification is to
divide the page accesses of each user into individual sessions.
Session identification is carried out under the assumption that
if the time between two consecutive accesses exceeds a certain
predefined period, a new session starts at that point.
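
A minimal Python sketch of this timeout rule is given below. It rests
on two assumptions that are not fixed by the text: users are
identified by IP address, and the predefined period is 30 minutes.
The names SESSION_TIMEOUT, parse_timestamp and identify_sessions are
hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed threshold; the text only requires "a certain predefined period".
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_timestamp(text):
    """Parse a log timestamp such as '08/Oct/2007:04:54:20 -0400'."""
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


def identify_sessions(requests, timeout=SESSION_TIMEOUT):
    """Split (ip, timestamp, url) requests into per-user sessions.

    A new session starts whenever the gap between two consecutive
    requests from the same IP address exceeds `timeout`.
    """
    by_user = defaultdict(list)
    for ip, timestamp, url in requests:
        by_user[ip].append((timestamp, url))

    sessions = []
    for ip, visits in by_user.items():
        visits.sort(key=lambda visit: visit[0])
        current = [visits[0][1]]
        for (previous_time, _), (time, url) in zip(visits, visits[1:]):
            if time - previous_time > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((ip, current))
                current = []
            current.append(url)
        sessions.append((ip, current))
    return sessions
```

Each returned session is an ordered list of URLs, which is the form
the prediction models in the following section operate on.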

V. BACKGROUND
In this section we present the prediction models that we use in
our work. The Markov Model (MM) is used to predict the next set
of pages. ARM and WARM are based on generating association rules,
and they are used together with MM to improve accuracy.

Association rule mining

ARM is a data mining technique that has been applied
successfully to discover related transactions. In ARM,
relationships among item sets are discovered based on their
co-occurrence in the transactions. Specifically, ARM focuses on
associations among frequent item sets. For example, in a
supermarket, ARM helps uncover items that are purchased
together, which can be used for shelving and ordering
processes. In the following, we briefly present how we apply
ARM to Web navigation prediction. For more details and
background about ARM, see [6] and [8]. In Web navigation,
prediction is conducted according to the association rules that
satisfy a minimum support and confidence, as follows. For each
rule R = X → Y, X is the user session and Y denotes the target
destination page.
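
As an illustration, support and confidence for a rule X → Y can be
computed as in the Python sketch below. It uses the standard ARM
definitions (support is the fraction of sessions containing the
itemset; confidence is support(X ∪ {Y}) / support(X)) and treats each
session as an unordered set of pages. The function names and the
default thresholds are hypothetical, not values taken from this work.

```python
def support(itemset, sessions):
    """Fraction of sessions that contain every page in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for session in sessions if itemset <= set(session))
    return hits / len(sessions)


def confidence(antecedent, target, sessions):
    """Confidence of the rule antecedent -> target."""
    base = support(antecedent, sessions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | {target}, sessions) / base


def predict_next(visited, candidates, sessions,
                 min_support=0.2, min_confidence=0.5):
    """Rank candidate pages Y by the confidence of the rule visited -> Y.

    The thresholds are illustrative assumptions.
    """
    ranked = []
    for page in candidates:
        if page in visited:
            continue
        rule_support = support(set(visited) | {page}, sessions)
        rule_confidence = confidence(visited, page, sessions)
        if rule_support >= min_support and rule_confidence >= min_confidence:
            ranked.append((page, rule_support, rule_confidence))
    # Highest-confidence rule first.
    return sorted(ranked, key=lambda rule: rule[2], reverse=True)
```

Rules whose support or confidence falls below the thresholds are
discarded, so a session with no qualifying rule yields no prediction.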

Weighted Association Rule Mining

The ARM model considers only whether an item is present in a
transaction or not. It does not take into account the
statistical features of an item. Weighted Association Rule
Mining (WARM) makes it possible to associate a weight with
each data item in a resulting association rule. A Weighted
Association Rule (WAR) does not interfere with the process of
generating frequent itemsets. Rather, it focuses on how weighted
association rules can be generated by examining the weighting
factors of the items included in the generated frequent itemsets.
Let I = {i_1, i_2, ..., i_m} be a set of distinct items and W
be a set of non-negative real numbers. A pair (x, w) is called
a weighted item, where x ∈ I is an item and w ∈ W is the
weight associated with x. A transaction is a set of weighted
items, each of which may appear in multiple transactions with
different weights. In web navigation, each session is considered
as one transaction, or one itemset. The weighted support and
confidence values are generated for each page, and the
prediction is performed based on these values.
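
The text does not give the formulas for weighted support and
confidence, so the Python sketch below adopts one common convention
purely as an assumption: the weight of a session is the mean weight
of its pages, and the weighted support of an itemset is the total
weight of the sessions containing it divided by the total weight of
all sessions. Each session is a dictionary mapping a page to its
weight in that session (for instance, seconds spent on the page,
which is also a hypothetical choice). All function names are
illustrative.

```python
def session_weight(session):
    """Weight of a session: the mean weight of its pages (an assumption)."""
    return sum(session.values()) / len(session)


def weighted_support(itemset, sessions):
    """Share of total session weight held by sessions containing `itemset`."""
    itemset = set(itemset)
    total = sum(session_weight(session) for session in sessions)
    if total == 0:
        return 0.0
    covered = sum(
        session_weight(session)
        for session in sessions
        if itemset.issubset(session)  # iterating a dict yields its pages
    )
    return covered / total


def weighted_confidence(antecedent, target, sessions):
    """Weighted confidence of the rule antecedent -> target."""
    base = weighted_support(antecedent, sessions)
    if base == 0:
        return 0.0
    return weighted_support(set(antecedent) | {target}, sessions) / base
```

Under this convention, a session in which pages carry high weights
counts more towards a rule than a session with low weights, whereas
plain ARM counts every session equally.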

Markov Model

The basic concept of the Markov model is to predict the next
action depending on the result of previous actions. In Web
prediction, the next action corresponds to predicting the next
page to be visited. The previous actions correspond to the
pages that have already been visited. In Web prediction, the
Kth-order Markov model is the probability that a user will
visit the kth page provided that she has visited the ordered
k − 1 pages [10], [20]. For example, in the second-order
Markov model, prediction of the next Web page is computed
based only on the two Web pages previously visited.
The main advantages of the Markov model are its efficiency and
performance in terms of model building and prediction time. It
can be easily shown that building the kth-order Markov model
is linear in the size of the training set [10]. The key idea
is to use an efficient data structure, such as a hash table, to
build and keep track of each pattern along with its probability.
Prediction is performed in constant time because the running
time of accessing an entry in a hash table is constant. Note that
a Markov model of a specific order cannot make a prediction for
a session that was not observed in the training set, since such
a session will have zero probability.
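
The hash-table idea can be sketched in Python as follows, using a
dictionary keyed by the last `order` pages. This is an illustrative
sketch rather than the implementation evaluated in this work, and the
names build_markov_model and predict are hypothetical. The `order`
argument is assumed to be at least 1.

```python
from collections import Counter, defaultdict


def build_markov_model(sessions, order):
    """Count, for every run of `order` consecutive pages, which page follows.

    One pass over the training sessions, so building the model is
    linear in the size of the training set.
    """
    model = defaultdict(Counter)
    for session in sessions:
        for i in range(len(session) - order):
            state = tuple(session[i:i + order])
            model[state][session[i + order]] += 1
    return model


def predict(model, visited, order):
    """Return (next_page, probability) given the last `order` visited pages.

    Returns None when the state was never observed in training,
    which is the zero-probability case described above.
    """
    state = tuple(visited[-order:])
    followers = model.get(state)  # hash-table lookup
    if not followers:
        return None
    page, count = followers.most_common(1)[0]
    return page, count / sum(followers.values())
```

For a second-order model, predict looks only at the two most recently
visited pages and ignores everything before them.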

VI. RESULTS

The prediction experiments were conducted by combining the
Markov Model with Association Rule Mining and with Weighted
Association Rule Mining. The results show that the combination
with Weighted Association Rule Mining achieves better accuracy.

VII. CONCLUSION
This paper addresses the problem of the accuracy of the
prediction process. The experiments were conducted using the
Markov Model, Association Rule Mining, and Weighted
Association Rule Mining. The results showed that better
prediction accuracy is obtained by combining the Markov
Model and Weighted Association Rule Mining. Future work
will apply Bagging and Boosting techniques to the prediction
process.

REFERENCES
[1] M. A. Awad and I. Khalil, "Prediction of User's Web
Browsing Behavior: Application of Markov Model," IEEE
Transactions on Systems, Man, and Cybernetics, vol. 42, no. 42,
August 2012.
[2] M. Levene and G. Loizou, "Computing the entropy of user
navigation in the Web," Int. J. Inf. Technol. Decision Making,
vol. 2, no. 3, pp. 459-476, 2003.
[3] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa, "Effective
personalization based on association rule discovery from Web usage
data," in Proc. ACM Workshop WIDM, Atlanta, GA, Nov. 2001.
[4] R. Agrawal, T. Imielinski, and A. Swami, "Mining association
rules between sets of items in large databases," in Proc. ACM
SIGMOD Conf. Manage. Data, Washington, DC, May 1993.
[5] M. Baumgarten, A. G. Büchner, S. S. Anand, D. D. Mulvenna,
and J. B. Hughes, "Navigation Pattern Discovery from Internet Data,
User-Driven Navigation Pattern Discovery from Internet Data."
Heidelberg, Germany: Springer-Verlag, 2000, pp. 74-91.
[6] R. Agrawal and R. Srikant, "Fast algorithms for mining
association rules," in Proc. 20th Int. Conf. VLDB, Santiago, Chile,
1994.
[7] R. Cooley, B. Mobasher, and J. Srivastava, "Data preparation for
mining World Wide Web browsing patterns," J. Knowl. Inf. Syst.,
vol. 1, no. 1, pp. 5-32, 1999.
[8] M. T. Hassan, K. N. Junejo, and A. Karim, "Learning and
predicting key Web navigation patterns using Bayesian models," in
Proc. Int. Conf. Comput. Sci. Appl. II, Seoul, Korea, 2009,
pp. 877-887.
[9] M. Awad, L. Khan, and B. Thuraisingham, "Predicting WWW
surfing using multiple evidence combination," VLDB J., vol. 17,
no. 3, pp. 401-417, May 2008.
[10] M. Awad and L. Khan, "Web navigation prediction using
multiple evidence combination and domain knowledge," IEEE Trans.
Syst., Man, Cybern. A, Syst., Humans, vol. 37, no. 6,
pp. 1054-1062, Nov. 2007.
[11] Internet Traffic Archive. [Online]. Available:
http://ita.ee.lbl.gov/html/traces.html
[12] Y. Fu, H. Paul, and N. Shetty, "Improving mobile Web
navigation using N-Gram prediction model," Int. J. Intell. Inf.
Technol., vol. 3, no. 2, pp. 51-64, 2007.
[13] M. Perkowitz and O. Etzioni, "Adaptive Web sites: An AI
challenge," in Proc. IJCAI Workshop, Nagoya, Japan, 1997.
[14] C. R. Anderson, P. Domingos, and D. S. Weld, "Adaptive Web
navigation for wireless devices," in Proc. IJCAI Workshop, Seattle,
WA, 2001.
[15] D. W. Albrecht, I. Zukerman, and A. E. Nicholson, "Pre-sending
documents on the WWW: A comparative study," in Proc. 16th
IJCAI, 1999, pp. 1274-1279.
