Prioritizing Web Links Based On Web Usage and Content Data

Kamika Chaudhary
Department of Computer Science & Engineering
Krishna Institute of Engineering & Technology
Ghaziabad-201206, India
kamika.agrohi@gmail.com

Santosh Kumar Gupta
Department of Computer Science & Engineering
Krishna Institute of Engineering & Technology
Ghaziabad-201206, India
santoshg25@gmail.com

Abstract- The Web has grown enormously and is still growing rapidly day by day. With this huge amount of information on the web, it has become difficult for search engines to retrieve the required and relevant information efficiently. Web mining techniques, using different approaches, have contributed a lot to providing relevant information in response to a user query. This paper introduces a new method for prioritizing web pages based on web usage and web content data. The proposed method uses a Genetic Algorithm to provide good quality web pages as the result of a user query. Prioritization of web pages falls in the category of NP-complete problems, and a genetic algorithm is used to deal with this. The method includes parameters from both web usage and web content mining. Experimental results show that the proposed approach performed better than the existing approach.
Keywords-- Genetic algorithm; web usage mining; web content mining; common entry and exit points
I. INTRODUCTION
The World Wide Web has brought revolutionary changes in the popularity of the internet. It has grown into a huge, global information space. The information present on the web is distributed in nature and growing at an exponential rate. Getting the desired information without wandering through the pages of a website has become an irksome job. Different types of methods are required to organize and manage the information so that it can be used efficiently for business purposes. Web mining techniques are needed in order to explore such a gigantic information base. Web mining is the process of uncovering user-desired information from web documents by applying data mining techniques. It aims to develop new methods for effective retrieval of potentially useful information. A large amount of the information on the web is redundant, resulting in multiple pages carrying similar content. There is also heterogeneity among the data present on websites.
Based on the type of data present in web documents, web mining is divided into three classes: web content mining, web structure mining and web usage mining. Web content mining searches for information in the structured, semi-structured or unstructured content of the web. A number of links are present on web pages, which connect and organize the information together. These hyperlink structures are utilized by web structure mining for retrieval of information. Web usage mining discovers the usage patterns of visitors by mining log files. It works by preprocessing the initial log data to remove redundancy, then detecting patterns, and then analyzing these patterns in order to find out user behavior. Several optimization techniques have been used to find the most useful pages of a web site by using web usage and web content mining. The proposed approach uses a natural optimization technique called the genetic algorithm to explore the search space by using both content and usage mining. The inspiration behind the genetic algorithm is the process of natural selection and genetic dynamics [5]. The genetic algorithm has its roots in Darwin's theory of survival of the fittest; it is a search algorithm based upon the process of natural selection and population genetics [19]. The proposed approach applies a genetic algorithm to the data collected by integrating web usage mining and web content mining in order to find the pages of a web site which are of utmost importance to the user. Our approach is compared with the approach in [20], hereafter named EA, and the results are found to be better.
978-1-4799-2900-9/14/$31.00 ©2014 IEEE
Section II presents the literature review. Section III introduces the concept of the Genetic Algorithm. Section IV presents the proposed algorithm. Implementation details of the proposed approach are given in Section V. Sections VI and VII describe the experimentation and the conclusion, respectively.
II. RELATED WORK
Web usage mining is the most crucial field of web mining. A lot of research has been done in this area, which shows the importance of web usage mining to search engines. Speed and precision are the most desirable characteristics of search engines. Evolutionary algorithms, more specifically genetic algorithms, play a vital role in achieving these characteristics. These algorithms also play an important role in the mining of web usage data. In [1] the authors discuss the use of a genetic algorithm for mining information from the web. They found that the results of queries provided by search engines suffered from the problem of poor information and irrelevant pages. They provide a genetic strategy for search engines and consider web search as a standard optimization problem. The efficiency of a search engine can be improved through web usage mining by using the MASEL (matrix analysis on search engine log) algorithm proposed in [2]. The relationship among user, query and resource is the central idea of this algorithm. MASEL considers a resource to be good if it is accessed by many good users. Improving search engine retrieval performance is addressed in [3]. The authors propose a genetic programming based framework for discovering ranking functions, which improves retrieval performance by prioritizing web pages in decreasing order of relevance. The results are compared with, and found to be better than, other existing ranking functions for information retrieval. In [4] grammar based genetic programming is used as a data mining optimization technique in an e-learning system. A group of useful education prediction (EP) rules is developed and provided to courseware authors to improve adaptive systems for web based education (ASWE).
The Genetic Algorithm is an algorithm based on the theory of natural selection and is used for solving optimization problems. It is an adaptive heuristic search algorithm based on the concept of survival of the fittest. Selection, crossover, mutation and acceptance are the main steps used for finding the solution to a problem. A fitness function is used for finding the goodness of any solution, and mutation helps the population escape from local optima [5]. A probabilistic web user model based on a genetic algorithm for improving web site structure is proposed in [6]. Adjacency matrices are used to represent the genetic population, and ranking acts as a parameter for fitness scaling. A random binary vector is created by using scattered crossover. The results show an improvement over another method.
Web usage mining works on the data collected from client-server interaction. It utilizes secondary data present in web server logs, browser logs, proxy server logs, registration data, user profiles, cookies or any other source for mining interesting patterns. It mainly consists of three phases: data preprocessing, pattern discovery and pattern analysis [7]. Pattern discovery is performed in order to draw useful patterns from preprocessed data [8]. A system called Web Sift is designed to perform usage mining. It utilizes data from the web server log in order to perform the mining task. This data suffers from real world challenges; a framework dealing with all these challenges is discussed in [9]. A number of soft computing techniques have been used for retrieving information, such as in the field of web mining [10]. A soft computing technique called the self organizing map (SOM) is applied to preprocessed data in web usage mining in order to find visitors' navigation behavior [11]. This behavior is used for discovering useful knowledge from secondary data [12, 21]. Authors have proposed an optimization technique called the ant colony clustering algorithm (ACLUSTER) for detecting useful trends and used linear genetic programming for the analysis of user trends. The ACLUSTER algorithm is applied to the preprocessed and cleaned data; the number of objects in the area and their similarity are used as an independent threshold to form clusters of web usage patterns. It is important to improve the structure of a web site from time to time, as sites outgrow their design by compiling links and pages together. In [13] websites are reorganized by using a 0-1 programming approach. This method is based on the co-occurrence frequencies between web pages, which are obtained from user access patterns. In order to reduce the search depth and information overload for users, two constraints are used: the number of outward links from each page and the length of the shortest path from the home page to each page. Web personalization is the way of providing a service to the web visitor for retrieving the information of his/her interest. This is achieved by predicting the next page accessed by the user. An accurate recommendation system for predicting the next page access by using web usage mining is discussed in [14]. Pairwise nearest neighbor clustering is used for identifying similar access patterns. The method provides good prediction accuracy and minimizes state space complexity. A two step strategy to improve retrieval effectiveness for personalizing the web is presented in [15]. In the first step the user's queries are categorized automatically by the system based on his search history, and then these categories are used for performing the web search.
An intelligent miner (i-miner) framework has been used for analyzing user trends [16]. A hybrid evolutionary approach called FCM has been used for forming clusters that separate users with similar interests, and the Takagi-Sugeno fuzzy inference system has been used for analyzing the trends. Another approach for exploring navigational patterns is to discover the relationships existing among users and web objects. A system based on probabilistic latent semantic analysis (PLSA) [17] has been developed for automatically characterizing user preferences and interests. Probabilistic inference is used for performing the analysis tasks. The authors in [18] have proposed a workflow that shows how usage data can be extracted and processed for a real world tourism web site.
III. THE PROPOSED APPROACH
The Genetic Algorithm (GA) is a natural optimization and adaptive heuristic search technique whose basic idea depends upon the process of natural evolution. The mechanism of evolution is parallel in nature and has been used for solving several computational problems [19]. GA is used for solving general purpose optimization problems [5].
In a computational problem, a genetic algorithm begins by selecting an initial population in the form of chromosomes and then applying a fitness function, which minimizes the cost, to the selected chromosomes. Then two parent chromosomes having greater fitness are selected. Crossover and mutation are performed on the selected parents. The process is repeated until the best solution among the current population is retrieved. After selection, crossover is performed between two parent strings and results in offspring strings. Mutation is another operator, applied after crossover, that changes the genetic material passed from the parents to the offspring. Then, on the basis of Darwinism, the offspring which survives the most is chosen to be the fittest [20].
A collection of web pages is used to represent a chromosome in the web usage mining problem. GA is used in this approach to find the web pages that are of utmost importance to the user. A unique number has been assigned to each web page.
International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT)
These pages are indexed by assigning an ID to each of them, so the chromosomal representation looks as shown in Fig. 1.
Chromosome 1 = {set of web links} = {P1, P2, P3, P4, P5, P6, P7, P8, P9, P10}
where P1, P2, ... represent web links.
43 | 48 | 10 | 13 | 37 | 38 | 44 | 36 | 14 | 49
Fig.1. Representation of web links in chromosomal form

A. Chromosome Representation
Chromosomes are used for representing the initial population. Each chromosome represents a candidate solution. To represent a web page, we assign a unique numeric ID to each unique URL taken from the web server log. For further processing, this unique ID is used instead of the URL of the pages visited by the user.

TABLE I. UNIQUE ID ASSIGNED TO WEB PAGE URLS
Unique Id | URL
 | 192.168.30.95:51854
 | 192.168.30.15:45682
 | 192.168.1.5:32773
 | 192.168.1.5:32773
 | 192.168.30.128:60339

B. Fitness Function
The fitness function is an objective function used for selecting the best individual among all individuals. It quantifies the optimality of a solution, measuring the goodness of a solution by assigning ranks to solutions [21]. Various parameters are required for calculating the fitness of a solution, as presented below.
1) Access frequency: Access frequency measures the number of times a particular page is visited, irrespective of user ID. In web usage mining, the usefulness of any particular page can be measured by calculating the access frequency: the higher the access frequency, the more useful the page could be. Table II shows the URLs and their related access frequency.

TABLE II. URLS AND THEIR RELATED ACCESS COUNT
URL | Access Count
192.168.30.95 | 33
192.168.30.15 | 12
192.168.1.5 |
192.168.1.5 | 24
192.168.30.127 | 19

2) Number of unique visitors: This factor shows the importance of a web page on the basis of the unique visitors who visited the page. A URL is more popular among users if it is visited by a larger number of distinct visitors. Table III shows the unique visitors.

TABLE III. UNIQUE VISITORS AND THEIR CORRESPONDING USER ID
Unique Id | URL | Number of Unique Users
 | 192.168.30.95 | 23
 | 192.168.30.15 | 15
 | 192.168.1.5 | 35
 | 192.168.1.5 | 24
 | 192.168.30.127 | 19

3) Time Duration: The amount of time spent on a page shows the relevance of the page for the user. If a user spends more time on a particular page, then that page is considered to be useful for the user. Table IV shows the duration for particular URLs.

TABLE IV. AMOUNT OF TIME USER STAYED ON THE PAGE WITH RESPECT TO URL
Columns: URL | Unique Id | Duration (seconds)
URL: 192.168.30.95, 192.168.30.15, 192.168.1.5, 192.168.1.5, 192.168.30.127
Duration (seconds): 45, 217, 84, 24
4) Number of bytes received: The quantity of data downloaded by the user from a web page shows that the page has content which is relevant for the user. The number of bytes received by the user is recorded in the web server log entry. From this entry we can deduce whether a page is important or not.

TABLE V. NUMBER OF BYTES RECEIVED BY USER
Unique Id | Amount of bytes received
 | 270
 | 2254
 | 1059
 | 124
 | 1609

5) Common entry and exit points: A visitor begins his search by clicking on a link which forwards him to a page of the website. This page is considered the entry point of the user. The exit point signifies the destination of the visitor. It tells what visitors are looking for in the website.
6) Number of advertisements: The importance of a web page can also be recognized by analyzing the number of advertisements present on that page. If a page contains a larger number of advertisements, then that page is thought to be visited by a larger number of visitors. Advertisements are placed on the pages which have a higher frequency of visits by users, so they signify the importance of the page.

1. Access frequency of each page
2. Number of unique users
3. The amount of time the user stayed on the page
4. Number of bytes received
5. Common entry and exit points
6. Number of advertisements
$Cost_{\text{Access frequency}}(AF) = \sum_{i=1}^{n} AF_i$
where n = number of entries in the web log and AF is the number of times a page is accessed by visitors.
$Cost_{\text{Unique user}}(UNQ) = \sum_{i=1}^{n} UNQ_i$
where n = number of entries in the web log and UNQ is the number of different users who visited a URL.
$Cost_{\text{Duration}}(DUR) = \sum_{i=1}^{n} DUR_i$
where n = number of entries in the web log and DUR is the amount of time the user stayed on a web page.
$Cost_{\text{Bytes received}}(BR) = \sum_{i=1}^{n} BR_i$
where n = number of entries in the web log and BR is the amount of data the user fetched from a web page.
$Cost_{\text{Common entry exit point}}(EP) = \sum_{i=1}^{n} EP_i$
where n = number of entries and EP shows the pages at the beginning and end of a user access session.
$Cost_{\text{Number of advertisements}}(AD) = \sum_{i=1}^{n} AD_i$
where n = number of entries and AD signifies the number of advertisements present on a web page.
Cost Function
$C(x) = C_1 \sum_{i=1}^{n} AF_i + C_2 \sum_{i=1}^{n} UNQ_i + C_3 \sum_{i=1}^{n} DUR_i + C_4 \sum_{i=1}^{n} BR_i + C_5 \sum_{i=1}^{n} EP_i + C_6 \sum_{i=1}^{n} AD_i$
where C1, C2, C3, C4, C5 and C6 represent different constants used for adjusting the values of the different parameters.
Fig.2. Cost function and its parameters

An example of calculating the cost from the various parameters is shown below:
F(x) = C1.CostAccessFrequency + C2.CostDuration + C3.CostUniqueUser + C4.CostBytesReceived + C5.CostCommonPoints + C6.CostAdvertisements
C1, C2, C3, C4, C5 and C6 are constants whose function is to normalize the values of the parameters. In this example C2 weights the stay duration and C3 the number of unique users.
C1 = 2.4, C2 = 0.05, C3 = 0.6, C4 = 0.003, C5 = 0.6, C6 = 1.5
In this example the values are taken from the above tables for calculating the cost:
CostAccessFrequency = 33 (Table II)
CostStayDuration = 217 (Table IV)
CostUniqueUser = 35 (Table III)
CostBytesReceived = 2254 (Table V)
CostCommonPoints = 20
CostAdvertisements = 30
C(x) = 2.4*33 + 0.05*217 + 0.6*35 + 0.003*2254 + 0.6*20 + 1.5*30 = 174.812

C. Selection
Selection is the process of choosing the fitter chromosomes from the population. The main objective of selection is to give importance to good solutions and to ignore bad ones. In our approach we use binary tournament selection, which picks two individuals at random from the large set of the population.
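The weighted cost from the worked example and the binary tournament selection described above can be sketched together as follows. This is a minimal illustration, not the authors' Java implementation: the names `weighted_cost` and `binary_tournament` are hypothetical, and the `prefer_higher` flag is an assumption, since the text does not state unambiguously whether the fitter chromosome is the one with the higher or the lower cost.

```python
import random

# Constants C1..C6 from the worked example (C2 weights duration, C3 unique users).
WEIGHTS = {
    "access_frequency": 2.4,
    "duration": 0.05,
    "unique_users": 0.6,
    "bytes_received": 0.003,
    "common_points": 0.6,
    "advertisements": 1.5,
}


def weighted_cost(params, weights=WEIGHTS):
    """Return the weighted sum of the six usage/content parameters."""
    return sum(weights[name] * params[name] for name in weights)


def binary_tournament(population, cost_of, rng=random, prefer_higher=True):
    """Pick two distinct individuals at random and return the fitter one.

    prefer_higher=True treats a larger cost as fitter; this is an assumption.
    """
    a, b = rng.sample(population, 2)
    if prefer_higher:
        return a if cost_of(a) >= cost_of(b) else b
    return a if cost_of(a) <= cost_of(b) else b


# Parameter values used in the worked example.
example = {
    "access_frequency": 33,
    "duration": 217,
    "unique_users": 35,
    "bytes_received": 2254,
    "common_points": 20,
    "advertisements": 30,
}
print(round(weighted_cost(example), 3))  # 174.812
```

Keeping the weights in one mapping makes it easy to check each term of the sum against the worked example, and passing `cost_of` as a function lets the same tournament be reused with any chromosome encoding.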
D. Crossover
Crossover is the method which exchanges the genetic material of both parents to produce new offspring. The main function of crossover is to recombine two strings to get a new, better string. Various types of crossover exist; among them, cyclic crossover is used in the proposed work.
Fig.3. Process of crossover (Parent 1 and Parent 2 before, Offspring 1 and Offspring 2 after cyclic crossover)
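As an illustration of the operator named above, the following sketch implements the standard cycle crossover (CX) for permutation-encoded chromosomes. It assumes that both parents contain the same set of page IDs in different orders; the function name and the page IDs are hypothetical, and the paper does not spell out the exact variant used.

```python
def cycle_crossover(parent1, parent2):
    """Standard cycle crossover (CX) on two permutations of the same IDs.

    Positions are grouped into cycles; genes in alternating cycles are
    copied from alternating parents, so each child is again a permutation.
    """
    if sorted(parent1) != sorted(parent2):
        raise ValueError("parents must be permutations of the same IDs")
    n = len(parent1)
    position_in_p1 = {gene: i for i, gene in enumerate(parent1)}
    child1, child2 = [None] * n, [None] * n
    visited = [False] * n
    cycle = 0
    for start in range(n):
        if visited[start]:
            continue
        i = start
        while not visited[i]:
            visited[i] = True
            if cycle % 2 == 0:
                child1[i], child2[i] = parent1[i], parent2[i]
            else:
                child1[i], child2[i] = parent2[i], parent1[i]
            # Follow the cycle: find where parent2's gene sits in parent1.
            i = position_in_p1[parent2[i]]
        cycle += 1
    return child1, child2


# Hypothetical page IDs, used only to show the behaviour.
parent_a = [1, 2, 3, 4, 5, 6, 7, 8]
parent_b = [8, 5, 2, 1, 3, 6, 4, 7]
offspring_a, offspring_b = cycle_crossover(parent_a, parent_b)
print(offspring_a)  # [1, 5, 2, 4, 3, 6, 7, 8]
print(offspring_b)  # [8, 2, 3, 1, 5, 6, 4, 7]
```

Because every gene keeps a position it held in one of the parents, no page ID is duplicated or lost in the offspring.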
E. Mutation
Mutation is the third operator of GA. It maintains diversity in the population by altering some bits present in the chromosome. It randomly redistributes genetic information and reduces the probability that the algorithm suffers from the problem of local optima [20]. There are many types of mutation operator: flip bit, boundary, uniform, non-uniform and Gaussian. Mutation explores the search space more thoroughly and results in a better solution.
Fig.4. Process of flip bit mutation (chromosome before and after mutation)
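The text names flip bit mutation, but the chromosomes hold page IDs rather than bits. The sketch below therefore swaps two randomly chosen positions, which is only one plausible adaptation for ID-encoded chromosomes and should be read as an assumption, not as the authors' operator. The function name `swap_mutation` is hypothetical.

```python
import random


def swap_mutation(chromosome, mutation_rate, rng=random):
    """With probability mutation_rate, swap two distinct positions.

    Assumption: a swap stands in for 'flip bit' on ID-encoded chromosomes,
    because it changes the chromosome without duplicating or losing an ID.
    """
    mutated = list(chromosome)  # work on a copy, leave the input untouched
    if len(mutated) >= 2 and rng.random() < mutation_rate:
        i, j = rng.sample(range(len(mutated)), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


# Page IDs taken from Fig. 1; the mutation rate here is illustrative only.
original = [43, 48, 10, 13, 37, 38, 44, 36, 14, 49]
print(swap_mutation(original, mutation_rate=1.0, rng=random.Random(1)))
```

A swap keeps the set of page IDs intact, so the result can still be fed to a permutation-based crossover.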
IV. PROPOSED ALGORITHM
The proposed GA based algorithm (PGA) applies a fitness function to a randomly selected initial population to produce a set of web links which are of higher priority (TopLink-P) compared to other existing links. The fitness function includes a number of parameters from both the content and the usage pattern of web links. PGA starts by randomly selecting an initial population and then applies the crossover and mutation operators to the population for several generations until the population converges and the result is produced. The whole process is represented in the form of steps in Fig. 5.
V. AN EXAMPLE
An example depicting the procedure of the proposed GA based algorithm (PGA) is shown in Fig. 6. PGA was implemented in the Java programming language, and the program is run for 50 generations with an initial population consisting of 10 chromosomes. Each chromosome includes a set of 5 pages, and their cost is calculated by applying the genetic operators. The program runs until the last generation, which implies the convergence of the cost. In our experiment the generations converge at cost 509. Fig. 6 shows the chromosomes with their fitness cost at generations 1, 2, 3 and 50 with a crossover rate of 75%.
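The procedure and settings described above (10 chromosomes of 5 pages each, 50 generations, crossover rate 75%) can be sketched as a small generational loop. This is an illustrative toy, not the authors' Java program: `run_pga`, the per-page cost table, the mutation rate and the seed are all hypothetical. A simple order-preserving crossover and a replace-one-page mutation stand in for cyclic crossover and flip bit mutation because the chromosomes here are subsets of a larger page set, and a higher cost is assumed to be fitter.

```python
import random


def run_pga(page_cost, pop_size=10, pages_per_chromosome=5, generations=50,
            crossover_rate=0.75, mutation_rate=0.05, seed=0):
    """Toy GA over sets of page IDs; returns (best_chromosome, best_cost)."""
    rng = random.Random(seed)
    page_ids = sorted(page_cost)

    def cost(chromosome):
        return sum(page_cost[p] for p in chromosome)

    def tournament(population):
        # Binary tournament: two random individuals, the fitter one wins.
        a, b = rng.sample(population, 2)
        return a if cost(a) >= cost(b) else b

    def crossover(p1, p2):
        # Stand-in operator: prefix of p1, then unseen genes from p2, then p1.
        child = p1[:rng.randrange(1, len(p1))]
        for gene in p2 + p1:
            if len(child) == len(p1):
                break
            if gene not in child:
                child.append(gene)
        return child

    def mutate(chromosome):
        # Stand-in operator: replace one page by a page not yet present.
        unused = [p for p in page_ids if p not in chromosome]
        if unused and rng.random() < mutation_rate:
            chromosome = list(chromosome)
            chromosome[rng.randrange(len(chromosome))] = rng.choice(unused)
        return chromosome

    population = [rng.sample(page_ids, pages_per_chromosome)
                  for _ in range(pop_size)]
    for _ in range(generations):
        next_population = []
        while len(next_population) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            child = crossover(p1, p2) if rng.random() < crossover_rate else list(p1)
            next_population.append(mutate(child))
        population = next_population
    best = max(population, key=cost)
    return best, cost(best)


# Hypothetical per-page costs for page IDs 10..19.
hypothetical_page_cost = {page_id: float(page_id) for page_id in range(10, 20)}
best_pages, best_cost = run_pga(hypothetical_page_cost)
print(best_pages, best_cost)
```

Fixing the seed makes runs repeatable, which helps when comparing crossover rates as done in the experiments.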
Input:
Initial Population Size, PopSize
Number of generations, N
Crossover Rate, CR
Mutation Rate, MR
Output: Set of Top Priority Web links, TopLink-P
Cost Function:
Access frequency of each page (AFi)
The amount of time user stayed on the page (DURi)
Number of unique users (UNQi)
Number of bytes received (BRi)
Common entry and exit points (EPi)
Number of advertisements (ADi)
$Cost(C) = C_1 \sum_{i=1}^{n} AF_i + C_2 \sum_{i=1}^{n} UNQ_i + C_3 \sum_{i=1}^{n} DUR_i + C_4 \sum_{i=1}^{n} BR_i + C_5 \sum_{i=1}^{n} EP_i + C_6 \sum_{i=1}^{n} AD_i$
Method:
1. Generate Initial Population Set of randomly selected web links, WebLinks[PopSize]
2. Evaluate each Top-P Web Link in the set of Top-P web links, WebLinks[PopSize], using cost function
3. While Generation <= N Do
   a) Perform Binary Tournament Selection, CrossoverWebLinks[PopSize]
   b) Apply Cyclic Crossover among CrossoverWebLinks[PopSize]
      (WLinkParent1, WLinkParent2) = RandomlyChoose(CrossoverWebLinks[PopSize])
      (WLinkOffspring1, WLinkOffspring2) = CyclicCrossover(WLinkParent1, WLinkParent2)
   c) Copy (WLinkOffspring1, WLinkOffspring2) to NewWebLinks
      NewWebLinks[] = (WLinkOffspring1, WLinkOffspring2)
   d) Perform mutation with mutation rate, MR
   e) Copy NewWebLinks to Initial set of Top-P WebLinks
      WebLinks[] = NewWebLinks[]
   End While
4. TopLink-P WebLinks = LowCostWebLink(WebLinks[])
5. Return TopLink-P
Fig.5. Proposed GA based Algorithm (PGA)

Fig.6. Chromosomes CR1 to CR10 with their cost at the first, second, third and fiftieth generations (Step 1, Step 2, Step 3 and Step n), with selection, crossover and mutation applied between steps

VI. EXPERIMENTATION AND RESULTS
The results produced after implementing PGA in a programming language are shown in the form of graphs. In our program we have included parameters from both the content of the web and the usage pattern of the web pages. The comparison of the results of PGA and the existing approach EA is shown in Fig. 7. We ran the program for 50 generations with crossover rates ranging from 50% to 75%. The cost of the web links varies with the crossover rate. We also studied the quality of the web pages over different generations: we increased the number of generations up to 400, kept the crossover rate constant at 75%, and compared the values of the fitness score. The findings show that the cost varies on moving from one generation to the next. We also studied the effect of the crossover rate on the cost. The experimental results show that in most cases the cost of the TopLink-P web pages is better for the proposed approach than for the existing approach.
Fig.7. PGA vs EA at crossover rates of 50%, 60%, 70% and 75%

VII. CONCLUSION
As the amount of information present on the internet has become gigantic, it has become a necessity to increase the efficiency of search engines. Web mining aims in this direction. It helps in mining information on the basis of the content, structure and usage of web pages. The proposed GA based approach combines information from both the content and the usage of a web page in order to provide the required and relevant pages to the user. We calculated the cost of the web pages until the value converged in order to get the most optimized result. This cost is used as a parameter for finding the relevance of TopLink-P web pages. We have presented the experimental results in graphical form. These results show the superiority of the proposed approach compared to the existing approach.

REFERENCES
[1] F. Picarougne, N. Monmarche, A. Oliver and G. Venturini, "GeniMiner: Web Mining with a Genetic-Based Algorithm," ICWI, pp. 263-270, 2002.
[2] D. Zhang and Y. Dong, "A novel web usage mining approach for search engines," Computer Networks, vol 39(3), pp. 303-310, 2002.
[3] W. Fan, M. Gordon and P. Pathak, "Genetic programming-based discovery of ranking functions for effective web search," Journal of Management Information Systems, vol 21(4), pp. 37-56, 2005.
[4] C. Romero, S. Ventura, C. Hervas and P. Gonzalez, "Rule Discovery in web-based educational systems using Grammar-Based Genetic Programming," Data Mining XI: Data Mining, Text Mining and Their Business Applications, pp. 205-214, 2005.
[5] R. C. Chakraborty, "Fundamentals of Genetic Algorithms," Artificial Intelligence, 2010.
[6] E. Andaur, S. Rios, P. Roman and J. Velasquez, "Best Web Site Structure for Users Based on a Genetic Algorithm Approach," University of Chile, 2010.
[7] J. Srivastava, R. Cooley, M. Deshpande and P. N. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," ACM SIGKDD Explorations Newsletter, 1(2), pp. 12-23, 2000.
[8] R. L. Haupt, "Practical Genetic Algorithms," John Wiley & Sons Inc., Chapter 1-7, pp. 1-251, 2004.
[9] J. Srivastava, R. Cooley, M. Deshpande and P. N. Tan, "Web usage mining: Discovery and applications of usage patterns from web data," ACM SIGKDD Explorations Newsletter, 1(2), pp. 12-23, 2000.
[10] S. P. Nina, M. Rahman, K. I. Bhuiyan and K. Ahmed, "Pattern discovery of web usage mining," In Computer Technology and Development, ICCTD 09 International Conference on, vol. 1, pp. 499-503, IEEE, 2009.
[11] O. Nasraoui, M. Soliman, E. Saka, A. Badia and R. Germain, "A web usage mining framework for mining evolving user profiles in dynamic web sites," Knowledge and Data Engineering, IEEE Transactions, vol 20(2), pp. 202-215, 2008.
[12] S. K. Pal, V. Talwar and P. Mitra, "Web mining in soft computing framework: Relevance, state of the art and future directions," Neural Networks, IEEE Transactions, vol 13(5), pp. 1163-1177, 2002.
[13] K. Etminani, A. R. Delui, N. R. Yanehsari and M. Rouhani, "Web usage mining: Discovery of the users' navigational patterns using SOM," IEEE First International Conference in Networked Digital Technologies, NDT'09, pp. 224-249, 2009.
[14] A. Abraham and V. Ramos, "Web usage mining using artificial ant colony clustering and linear genetic programming," In Evolutionary Computation, CEC'03, IEEE, vol. 2, pp. 1384-1391, 2003.
[15] C. C. Lin, "Optimal Web site reorganization considering information overload and search depth," European Journal of Operational Research, 173(3), pp. 839-848, 2006.
[16] X. Jin, Y. Zhou and B. Mobasher, "Web usage mining based on probabilistic latent semantic analysis," In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 197-205, 2004.
[17] A. Pitman, M. Zanker, M. Fuchs and M. Lexhagen, "Web usage mining in tourism-a query term analysis and clustering approach," Information and Communication Technologies in Tourism, pp. 393-403, 2010.
[18] M. Mitchell, "An Introduction to Genetic Algorithms," MIT Press, Chapter 1-6, pp. 1-203, 1998.
[19] T. V. Mathew, "Genetic Algorithm," Indian Institute of Technology Bombay, Mumbai, pp. 1-15, 2012.
[20] A. R. Simpson, G. C. Dandy and L. J. Murphy, "Genetic algorithms compared to other techniques for pipe optimization," Journal of Water Resources Planning and Management, 120(4), pp. 423-443, 1994.
[21] A. K. Mishra, M. K. Mishra, V. Chaturvedi, S. K. Gupta and J. Singh, "Web usage mining using self organized maps," International Journal of Advanced Research in Computer Science and Software Engineering, vol 3(6), pp. 532-539, 2013.
