A THESIS
Submitted by
VENKETESH P
SEPTEMBER 2013
ABSTRACT
The growth of the Internet and web services constantly demands good infrastructure to deliver web contents to users with minimal delay. The tremendous increase in global traffic due to demands from a large number of users strains the servers and network, making it essential to mitigate the user perceived latency. This thesis is focused on the study of web prefetching techniques that predict the user's future requests, fetch the contents, and store them in local cache before the user actually requests them. This minimizes the latency perceived by users when accessing the content. Predictions can be generated by analyzing the user's browsing activity. Server based predictions consider the access history of several users stored in a log file.
Hypertext information associated with each hyperlink is used to compute its priority, and the hyperlinks are then sorted (highest to lowest) to create the prediction (hint) list. Both prediction and prefetching engines are located at the client machine in this approach.

The server based scheme builds a Precedence Graph by analyzing the user access patterns from log files. It considers the object URI and referrer recorded for each request to conceive the precedence relation for adding arcs between nodes to build the graph. Predictions generated from the graph achieve a good hit rate. The client-side cache is partitioned into regular and prefetch cache to enhance the services of web caching and prefetching. The LRU algorithm is used to manage the contents of the prefetch cache, and a Fuzzy Inference System (FIS) based algorithm is used to manage the contents of the regular cache. When web objects available in the prefetch cache are frequently accessed by the user, they are moved to the regular cache.
TABLE OF CONTENTS
ABSTRACT
1. INTRODUCTION
2. LITERATURE SURVEY
   2.1 INTRODUCTION
   2.2 CONTENT BASED PREDICTION
   2.3 ACCESS PATTERN BASED PREDICTION
       2.3.1 Graph Models
           2.3.1.1 Dependency Graph
           2.3.1.2 Double Dependency Graph
       2.3.2 Markov Models
       2.3.3 PPM Models
       2.3.4 Web Mining Models
   3.1 INTRODUCTION
   3.2 NAÏVE BAYES APPROACH
       3.2.1 Prediction/Prefetch Procedure
       3.2.2 Implementation
           3.2.2.1 Prediction Engine
               3.2.2.1.1 Tokenizer
               3.2.2.1.2 User-Accessed Repository
               3.2.2.1.3 Computing Priority Value of Hyperlinks
               3.2.2.1.4 Prediction List
           3.2.2.2 Prefetching Engine
   3.3 FUZZY LOGIC APPROACH
       3.3.1 Prediction/Prefetch Procedure
       3.3.2 Implementation
           3.3.2.1 Prediction Engine
               3.3.2.1.1 Predicted-Unused Repository
REFERENCES
LIST OF TABLES
LIST OF FIGURES
5.14 Byte Hit Ratio using traces of Group-B (user 6 to 10)
5.15 Byte Hit Ratio using traces of Group-C (user 11 to 15)
LIST OF ABBREVIATIONS
DG - Dependency Graph
PG - Precedence Graph
RG - Referrer Graph
CHAPTER 1
INTRODUCTION
The emergence of new applications and services has dramatically increased the number of users accessing the web. The problem of tackling web access latency continues to challenge researchers even with the growth in network capacity and storage space. Downloading a web object from a server involves two main components: a) the time taken by the request to reach the server plus the time taken by the response to reach the client (i.e. RTT, the Round Trip Time) and b) the object transfer time (which depends on the bandwidth between client and server). Web caching provides a solution for improving web performance by reducing the user perceived latency through the usage of local storage (cache) and effective management of the web objects stored in the cache. The limitations of web caching are handled using the web prefetching technique, which fetches web documents from the server even before the user actually requests them. Web prefetching has been proposed as a complement to web caching; it fetches in advance each web object that is referenced by its Uniform Resource Identifier (URI). Web object is a term used for all possible objects (HTML pages, images, videos etc.) that can be transferred over the web.
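To make the two-component cost model concrete, the following minimal Python sketch (an illustration with assumed numbers, not part of the thesis implementation) estimates the download latency of a single object as the sum of the round trip time and the transfer time.

    # Two-component download latency model: RTT + object transfer time.
    # All numbers below are illustrative assumptions, not measured values.

    def download_latency(rtt_seconds: float, object_bytes: int,
                         bandwidth_bytes_per_sec: float) -> float:
        """Estimate the time to fetch one web object from a server."""
        transfer_time = object_bytes / bandwidth_bytes_per_sec
        return rtt_seconds + transfer_time

    # A 100 KB page over a 1 Mbit/s link with 80 ms RTT:
    latency = download_latency(0.080, 100_000, 125_000)
    print(f"{latency:.2f} s")  # ~0.88 s; a cache hit avoids both components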
Web caching benefits from temporal locality, whereby recently accessed objects are likely to be accessed again, often by users in the same geographical area. Spatial locality indicates that an object closer to the currently accessed object has a high probability of being accessed in the near future. Web prefetching exploits the spatial locality property. The latency perceived when accessing web pages is represented as the time interval between issuing a request for a web page and the actual display of the page in the browser window. Web prefetching reduces this latency by exploiting the spatial locality inherent in the user's accesses to web objects.
The prediction engine uses user information for generating the predictions (hint list), which is used to prefetch (download) the web objects and store them in cache before they are actually requested by the users. The benefits of prefetching were constrained in the past because it can increase network traffic when its predictions are not accurate enough. In current networks, prefetching can be performed at a reasonable cost.
Prefetching can be termed a proactive caching scheme, since it caches web pages prior to receiving the requests for those pages from users. The prediction engine implements an algorithm to predict the user's next request and provides these predictions as hints. The prefetching engine that receives the hints decides to prefetch (download) the web objects when it has the available bandwidth and idle time. Prefetching web objects in advance reduces the user perceived latency when these objects are actually requested by the users.
The prediction and prefetching engines can be located in any part of the web architecture: client, proxy or server. The prefetching engine acts independently from the prediction engine and can be placed in any element of the web architecture to receive the hint list. The common trend is to place the prefetching engine at the client, where the prefetched objects are stored in the cache until they are demanded by the user or evicted from the cache. The system needs to predict and prefetch only the web objects that can be stored in the cache. An inaccurate prediction algorithm can lower the actual cache hit rate by prefetching useless web objects and storing them in the cache.
Prefetching is carried out during the idle time between the user requests. The number of pages 'N' that can be prefetched depends on the page view time (VT): if the user spends more time on a page, then a larger number of pages can be prefetched. A web object is prefetched only if it is cacheable and its retrieval is safe for the system. Web objects that can be prefetched include documents such as jpg, gif, png, html, htm and asp files. Prefetch requests are served by the web server along with the demand web page requests from users. Web prefetching is used to enhance several web services, such as accessing static web objects and dynamically generated objects (timing problem).
Prediction algorithms are categorized based on the type of information used to generate the predictions. Based on the location of the prediction and prefetching engines, schemes are classified according to their placement in the web architecture.
One common approach places the prediction engine at the web server and the prefetching engine at the client, as shown in Figure 1.2. The prediction engine uses the access patterns of all the users to a specific web server for generating the predictions. The prefetching engine receives the hints (predictions) from the web server and uses them to prefetch the web objects during idle time. This approach has been widely explored in the literature because of its prediction accuracy and its potential use in real scenarios. The web server is able to observe every client access and provide accurate access information when proxies are not involved during the communication.
[Figure 1.2: Prediction engine located at the web server and prefetching engine at the client. The client sends requests to the server, which returns responses together with hints; the client uses the hints to prefetch objects.]
Another approach places both engines at the client, as shown in Figure 1.3, where the prediction engine analyzes the navigational behavior of the user. The prefetching engine at the client receives the hints (predictions) and uses them to prefetch web objects during idle time. The mechanism covers the usage behavior of a single user or few users across different web servers. When the client based prediction model is built from an individual user's accesses, it generates predictions that are highly personalized and thus reflect the behavior patterns of the individual user.
[Figure 1.3: Prediction and prefetching engines located at the client. The client requests pages from the web server and prefetches web objects using locally generated hints.]
The proxy server that sits between the web server and the client can hold both the prediction and prefetching engines to perform the following tasks: a) prefetch web objects on its own and store them in its cache, and b) provide hints to the client about the web objects it can prefetch in its idle time. The mechanism covers different users accessing multiple web servers. It offers advantages such as: a) it allows users of non-prefetching browsers to benefit from the server provided prefetch hints, and b) it has the ability to perform complex and precise predictions due to the access information aggregated at the proxy.
A collaborative scheme utilizes the access information available in both the proxy and the server to generate the predictions. The access information available in proxies will serve data prefetching for a group of clients sharing common surfing interests. The access information in the web server will be used for objects that do not qualify for prefetching at the proxies. Client and proxy side prefetching provides greater geographic and IP proximity to the client by separating caching from the HTTP server and placing it closer to the clients.
These schemes effectively mitigate the user access latency. The contributions made in this thesis are: a) content based prediction schemes (Naïve Bayes and Fuzzy Logic) through which hyperlink predictions are generated, b) a Precedence Graph based scheme in which predictions are derived from the graph, and c) a client-side cache that has been partitioned into two parts: regular cache (for web caching) and prefetch cache (for web prefetching).
Chapter 2 analyzes several web prediction algorithms found in the literature that generate web predictions based on web content and user access patterns. Detailed observations on these algorithms are discussed. Chapter 3 deals with predicting the hyperlinks that can be prefetched during browser idle time. It discusses two techniques, Naïve Bayes and Fuzzy Logic, for generating the predictions. In these schemes, both the prediction and prefetching engines are located at the client machine, and the user's browsing behavior is monitored through the web browser. Chapter 4 places the prediction engine at the web server and the prefetching engine at the client machine. It uses the access patterns of users stored in log files at the server to build a Precedence Graph that generates the predictions. Chapter 5 partitions the client-side cache into two parts: regular cache (to support web caching) and prefetch cache (to support web prefetching). The contents of the regular cache are managed using the Fuzzy Inference System (FIS) algorithm and the contents of the prefetch cache are managed using the LRU algorithm. Web objects in the prefetch cache that are frequently accessed will be moved to the regular cache for effectively satisfying the user requests. The final chapter concludes the research work.
CHAPTER 2
LITERATURE SURVEY
2.1 INTRODUCTION
There has been a significant amount of research work carried out in the past for enhancing the performance of web prefetching. Several techniques were proposed for enhancing the delivery of web pages to the users. The browsing behavior of users was analyzed to identify interests on specific domains for supporting services like prefetching. Prediction algorithms have been implemented in different parts of the web architecture: client, proxy and server. The algorithms are categorized based on the type of information used to generate the predictions: a) algorithms that make predictions by analyzing the content of recently visited web pages and b) algorithms that predict future accesses based on the past user access patterns. The prediction algorithms discussed in the literature used different data structures and techniques to represent the user accesses.
[Figures 2.1 and 2.2: Timelines of page access without prefetching (T0, T1, T2, where T2 - T1 is the page access latency) and with prefetching (T0, T1, Tp, T2).]
Figures 2.1 and 2.2 represent the page access without and with prefetching. In page access without prefetching, the page Pi is retrieved from the server, which consumes time, resulting in a noticeable delay when the user accesses the page. In case of page access with prefetching, Pi is prefetched in advance with the anticipation that it will be accessed in the future. When the user actually requests Pi, it is served from the local cache with minimal delay, which reflects the advantages of and the need for applying prefetching in the web architecture.
Cache replacement algorithms manage the limited space in a cache, which helps to satisfy a large number of user requests with minimal access latency using the web objects stored in the cache. We discuss several cache replacement schemes proposed in the literature for improving the cache usage to satisfy the user requests. The performance of web prefetching can be fine tuned by using an efficient replacement algorithm for managing the prefetch cache apart from the regular cache used for web caching. Prefetching is beneficial to the system only if the prefetched pages are really requested by users before they become invalid or are purged from cache. Otherwise, resources are wasted on fetching unwanted pages from the server, which degrades the overall system performance.
Early prefetching work selected the objects that have the longest lifetime in order to minimize the bandwidth requirement. Object lifetime reflects the average time interval between consecutive updates to the object. A prefetching algorithm should consider objects with longer lifetimes, since they are the best candidates to minimize the extra bandwidth consumption that prefetching introduces. Other schemes prefetched popular objects in order to increase the cache hit rate and reduce user latency. The selection of objects considered factors such as access frequency, update intervals and size. Web servers can proactively 'push' fresh copies of objects to caches; such schemes must balance object popularity and object update rate to achieve a good hit rate.
Content based prediction schemes use the text surrounding the hyperlinks and labels with metadata information to generate the predictions. Anchor text is one of the major resources for getting information about the target page.
(Chakrabarti et al 1998) performed analysis of text and links for determining the
web resources that are suitable for a particular topic. Davison (2000) conducted
an analysis that focused on examining the descriptive quality of web pages and
the presence of textual overlap in web pages. The text in and around the
hypertext anchors of selected web pages were used (Davison 2002) to determine
the user’s interest in accessing the web pages. Craswell et al (2001) indicated that
the anchor texts were highly useful in site finding based on the analysis of link
and content based ranking methods in finding the web sites. A framework that
used a link analysis algorithm was designed (Chen et al 2002) to exploit the information available in link structures.
2004) used neural networks to predict the future requests based on semantic
preferences of past retrieved web documents. Topical locality assumes that pages connected by links are more likely to be about the same topic that the user is interested in. One scheme improved the access to slow loading web pages by semantically bundling them with faster loading web pages.
Semantic link prefetcher (Pons 2006a) was used to predict and prefetch the web
objects during the limited view time interval of web pages. A transparent and
speculative algorithm designed (Georgakis and Li 2006) for content based web page prefetching indicates that the textual information in the visited pages is useful for generating predictions. Another scheme combined usage data and link analysis techniques for ranking and
recommending the web pages to end user. Web pages of different categories
were analyzed (Chauhan and Sharma 2007) to suggest usage of cohesive and
non-cohesive text present near the anchor text for extracting information about
the target web page. Georgakis (2007) presented a client side algorithm that
learnt and predicted user requests based on user behavior profile that was built
using the user’s web surfing behavior. It used part-of-speech tagger to filter
useful user keywords. Tagging was used to identify the lexical or linguistic
category for individual words. Dutta et al (2009) proposed web page prediction that ranks the links in the current web page to improve prediction accuracy.
The approaches proposed in this thesis generate predictions using the content of the hypertext associated with each hyperlink. Naïve Bayes and Fuzzy Logic techniques are used for reducing the access latency with less system complexity. Bayesian networks have received considerable attention from scientists and engineers across various fields such as engineering and science. Naive Bayes, a simple Bayesian network, has been applied successfully in many domains. It assumes that the attributes are independent and it ignores any correlation among them. It has been used extensively in applications such as email spam filtering, mining log files for system events and hierarchical text categorization. Naive Bayes classifiers are very fast and they have very low storage requirements. They are very good in domains with many attributes.
Access pattern based schemes use the past access information for predicting the user's future requests. Several techniques were explored in the literature that predicted future requests based on the past sequence of user requests. The information available in web access logs varies depending on the format of the logs and the log data selections made by administrators. Path profiling is a log data analysis procedure that is used to determine the pages that are most likely to be accessed. One scheme built a prefix tree (path profile) based on the requests in server logs and used the longest matching prefix to generate predictions. A prediction model (Cooley et al 2000) based on support logic used information such as usage, content and structure. Later work reduced the prediction model size to fit into main memory, with improved prediction accuracy and a moderate decrease in applicability. Web pages were clustered into groups based on the users' access behavior. Pages were categorized into levels based on their page rank, and those pages at the top levels had a higher probability of being predicted and prefetched.
A coordinated prefetching scheme relied on the origin server or an intermediate proxy server to provide the list of web documents to be prefetched, utilizing the access information available in both proxies and web servers. The access information stored in proxies served data prefetching for clients sharing common surfing interests. Web server access information was utilized for data objects that were not qualified for proxy based prefetching.
The history based prefetching algorithm (Liu and Oba 2008) achieved
high prediction accuracy with limited memory by storing only the useful request
sequences and discarding those that will not yield useful predictions.
Dimopoulos et al (2010) modeled users’ navigation history and web page content
with weighted suffix trees to support web page usage prediction. Based on the
access time of user requests, the access sequences were partitioned into different
data blocks (Ban and Bao 2011). They used a decision method to select the data blocks appropriate for prediction.
Padmanabhan and Mogul (1996) built the Dependency Graph (DG) for representing the access patterns of users. Predictions were generated based on the graph, which was updated with each user access. It achieved acceptable performance when it was proposed, but it did not consider the structure of current web pages (i.e. an HTML object with several embedded objects). The Double Dependency Graph (DDG) incorporates the page structure when analyzing the sequence of accesses. It considers that two objects are related if they are accessed close to each other. Prediction algorithms based on these graphs (DG and DDG) provide hints for both the standard object requests and prefetch requests. A web prediction algorithm that built a Referrer Graph (RG) based on the object URI and its referrer was designed by de la Ossa et al (2010). The graph relies on the referrer information recorded in each user request.
The Dependency Graph has a node for every object that has been accessed by the user. An arc is drawn between nodes A and B if at some point in time the client accessed node B within 'w' accesses after A was accessed, where 'w' represents the lookahead window size. The confidence of each arc is the ratio of the arc occurrence count to the occurrence count of the source node. The figure below shows a graph built with a lookahead window size of 2 using the access patterns of two users. The access sequence of user-1 is {HTML1, IMG1, HTML2, IMG2, HTML4, IMG4}, and user-2 follows a second sequence over an overlapping set of objects. Each node is represented with its object and occurrence count. Each arc is represented with a pair of values {arc count, arc confidence}, e.g. {1, 0.5} between HTML1 and HTML3, where the arc confidence is computed as arc count / occurrence count of the source node.
[Figure: Dependency Graph with lookahead window size 2. Nodes HTML1, HTML2, HTML3, HTML4, IMG1, IMG2 and IMG4 are labeled with their occurrence counts; arcs are labeled {arc count, arc confidence}, e.g. {1, 0.5} from HTML1 to HTML3 and {2, 1} from HTML1 to IMG1.]
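The construction just described can be sketched in Python as follows; this is an illustrative reading of the Dependency Graph algorithm, not the code used in the cited work, and the function and variable names are chosen for exposition.

    # Dependency Graph: arc A -> B is reinforced whenever B is accessed
    # within w accesses after A; confidence = arc count / source node count.
    from collections import defaultdict

    def build_dg(accesses, w=2):
        node_count = defaultdict(int)    # occurrence count per object
        arc_count = defaultdict(int)     # occurrence count per (A, B) arc
        for i, obj in enumerate(accesses):
            node_count[obj] += 1
            for prev in accesses[max(0, i - w):i]:
                if prev != obj:          # skip self transitions for clarity
                    arc_count[(prev, obj)] += 1
        confidence = {arc: c / node_count[arc[0]]
                      for arc, c in arc_count.items()}
        return node_count, arc_count, confidence

    # Access sequence of user-1 from the example above:
    nodes, arcs, conf = build_dg(
        ["HTML1", "IMG1", "HTML2", "IMG2", "HTML4", "IMG4"], w=2)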
The Double Dependency Graph (DDG) distinguishes dependencies between objects of the same page and dependencies to objects of another page. The graph has a node for every object that has been accessed, with an arc from node A to B if the client accessed B within 'w' accesses of A. The arc is termed primary if A and B are objects of different pages, i.e. either B is an HTML object or the user accessed one HTML object between A and B. The arc is termed secondary if there are no HTML accesses between A and B. The graph has the same order of complexity as that of DG, but it differentiates the two classes of arcs.
[Figure: Double Dependency Graph for the same access sequences, with the same nodes and {arc count, arc confidence} labels as the DG example; primary and secondary arcs are distinguished.]
The graph is built with a lookahead window size of 2 using the access patterns of the same two users; the access sequence of user-1 is {HTML1, IMG1, HTML2, IMG2, HTML4, IMG4}. In the figure, primary arcs are drawn with solid lines and secondary arcs with dashed lines. Predictions are obtained by applying a confidence threshold on the arcs. The Precedence Graph proposed in this thesis uses the requested object and its referrer for generating the web predictions. The Precedence Graph has a smaller number of arcs when compared to DG and DDG, which helps it generate predictions with lower memory and computational cost than these algorithms.
Markov models generate predictions using the information gathered from web logs. They focus on minimizing the system latency or improving the web server efficiency. The precision of Markov models comes from the consideration of consecutive orders of preceding pages. The goal is to build effective user behavioral models that can be used to predict the web pages that the user will most likely access in the future. The order of a Markov model indicates how many past user accesses are used to define the context in a node. Low order Markov models lack web page prediction accuracy due to minimal usage of pages in history, and high order Markov models suffer from high state space complexity.
Early schemes based on Markov chains predicted the next request using the history of user accesses. A hybrid approach (Davison 2004) used information from user access history and web page content to accurately predict the user's next request. Deshpande and Karypis (2004) presented Markov models with reduced state complexity and improved prediction accuracy compared to different order Markov models. Three pruning schemes (support, confidence and error) were presented to prune the states of the All-Kth order Markov model.
Integrating semantic information into Markov models for prediction (Mabroukeh and Ezeife 2009) allowed low order Markov models to make intelligent, accurate predictions with less complexity than higher order models. Feng et al (2009) constructed a Markov tree using web page access patterns for effective page predictions and cache management. Other schemes combined association rules and Markov models to achieve better prediction accuracy, or combined Markov models with rough set theory to achieve higher accuracy, better coverage and overall performance. A two level prediction model (TLPM) exploited the hierarchical property in web log data. TLPM could decrease the size of the candidate set of web pages and increase the prediction speed with adequate accuracy. A Markov model was used in level one to predict the categories, and a Bayesian model was used in level two to predict the desired web pages within the predicted categories, which reduced resource consumption.
Prediction by Partial Match (PPM) models have been used in web prefetching for predicting the user's next request by extracting useful knowledge from historical user requests. Factors such as page access frequency were incorporated into the algorithm to generate the predictions. Chen and Zhang (2003) used the popularity of URL access patterns to build a PPM model for generating accurate predictions. A PPM model based on a non-compact suffix tree used maximum entropy to improve its prediction capability using a target function; it selected the node with maximum entropy for generating predictions. Web mining models extract useful knowledge from web data for improving the web services. Mining web access sequences helps to discover useful knowledge from web logs that can be applied to a variety of applications, such as navigation suggestions for users, customer classification and efficient access across related web pages. Several research efforts in web usage mining applied techniques (such as association rules, sequential patterns and clustering) for analyzing the web data and generating the desired output.
Clustering of web user access patterns helps to build user profiles by capturing
common user interests that can be applied to applications such as web caching
and prefetching. Association rules help to optimize the organization and structure
of websites.
A model for web usage mining based on data mining techniques was designed (Borges and Levene 1999), which used high probability strings to represent the user's preferred trails. A web usage mining prediction based proxy server was designed (Huang and Hsu 2008) to effectively serve user requests. It consists of three components: a log file filter, an access sequence miner and a prediction based buffer manager. The log file filter removes irrelevant records from the log file and feeds the cleaned file as input to the access sequence miner. The sequence miner processes the popular access sequences to generate a rule table. The buffer manager decides on caching/prefetching or buffer size adjustment based on the buffer contents and the rule table.
An integrated approach to caching and prefetching used a web navigational graph to represent the user requests. Its efficiency was tested using a purpose-built simulation environment.
The statistical analysis and web usage mining techniques were combined to analyze website usage by considering the client side data. Browsing time (statistical analysis) helps to effectively evaluate the website, and graph mining (web usage mining) helps to discover the navigation patterns. Another system consisted of components such as: a) DPS, adapted for the modern web framework, b) Adaptive Rate Controller (ARC), which determined the prefetch rate based on the dynamic memory status, and c) a miner of web access sequences that efficiently handled both forward and backward references and could perform both static and incremental mining of web access sequences. Random Indexing (RI) was used to build user profiles for applications such as web prefetching.
Search engine results may sometimes include the first page of the result list as a hint embedded in the HTML code. If the web browser has prefetching capabilities, then it can request the hinted page in advance. Browsers such as SeaMonkey, Netscape, Camino and Epiphany that are based on the Mozilla engine support this mechanism. Google Web Accelerator (Google 05), a free web browser extension available for Mozilla browsers, prefetches hints included in the HTML body; it also prefetches all the links in the current page. Another browser extension (introduced in 2005) prefetches all the hyperlinks found in the current page during browser idle time. PeakJet, a commercial product for end users available around 1998, included several tools to improve user access to the web. It could prefetch based on history or links: it could prefetch the links on the current web page that were visited by the user in the past, or all the links on the current web page. NetAccelerator, a product commercialized between 1998 and 2005, prefetched all the links in the page being visited and stored the objects in the browser cache. It could also refresh the cached contents to keep them fresh.
Early studies examined the web architecture for performing prefetching and provided insight into the efficiency of prefetching, showing that prefetching was profitable even in the presence of a good caching system. A later work (2006c) analyzed a large set of key metrics used by various researchers to propose a taxonomy based on three main categories for better understanding and evaluation of prefetching systems. Prediction indexes are used to quantify the efficiency and efficacy of prediction algorithms. Resource indexes are used to quantify the additional cost incurred due to prefetching. End-to-End latency indexes are used to highlight the system's impact on the latency perceived by users.
Prediction quality is measured using the recall and byte recall indexes. Experimental results indicated that the user perceived latency improves when the predictions are accurate. An estimation model was designed that used the information available in the server to accurately calculate the extra server load and network traffic generated by prefetching in the web architecture. It also offered the flexibility to place the prediction and prefetching engines at any part (client, proxy, or server) of the web architecture.
Precision (Pc) measures the ratio of the prefetched objects that are then finally requested by the user (prefetch hits) to the total number of objects prefetched:

    Pc = Prefetch Hits / Prefetches

Recall (Rc) measures the ratio of the prefetch hits to the total number of user requests:

    Rc = Prefetch Hits / User Requests
Resource usage indexes quantify the extra resources consumed by prefetching, which can degrade the overall performance. Prefetching has two side effects: objects not used and overhead. When prefetched objects are not used, network bandwidth is wasted because they were never requested by the user. The traffic increase index estimates the extra traffic that clients will get when using prefetching, as the ratio of the amount of prefetched objects never used with respect to the total user's requests.
Object latency indexes quantify the latency reduction perceived by web users. The page latency ratio compares the latency with and without prefetching:

    Page latency ratio = Average page latency with prefetch / Average page latency without prefetch
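As an illustration of how these indexes can be computed, the following minimal Python sketch (an example with hypothetical inputs, not code from the cited works) derives precision and recall from sets of prefetched and demanded URIs collected in a session.

    def precision(prefetched: set, user_requests: set) -> float:
        # Pc = prefetch hits / total objects prefetched
        hits = len(prefetched & user_requests)
        return hits / len(prefetched) if prefetched else 0.0

    def recall(prefetched: set, user_requests: set) -> float:
        # Rc = prefetch hits / total user requests
        hits = len(prefetched & user_requests)
        return hits / len(user_requests) if user_requests else 0.0

    # Hypothetical session data:
    pf = {"/a.html", "/b.html", "/c.html"}
    rq = {"/a.html", "/b.html", "/d.html", "/e.html"}
    print(precision(pf, rq), recall(pf, rq))   # 0.666..., 0.5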
Cache replacement policies manage the contents of web caches, and several policies were proposed in the literature for this purpose. When the cache capacity reaches its maximum limit, objects already stored in the cache are purged to store the newly downloaded web objects. The decision about the objects to be purged is made by the replacement policy, which will achieve better performance when it receives a stream of requests with high temporal locality (2005).
Content aware replacement policies improved the caching performance over the content blind schemes. Some cache replacement schemes used the popularity rank of the object (Chen et al 2003), a cost function for object retrieval (Cao and Irani 2002) or a page grade (Bian and Chen 2008) to decide the cacheability or purging of web objects from the cache. The grading mechanisms trade off Hit Ratio (HR) against Byte Hit Ratio (BHR), focusing on improving one or the other. Replacement policies for different document types were studied (2007), suggesting policies that will provide good performance for each document type. Wong (2006) suggested replacement policies for proxies with limited processing power, and also analyzed policies that will be better for proxies at the ISP and root levels. Romano and ElAarag (2008) considered factors such as frequency and recency when comparing cache replacement policies. A replacement algorithm chooses better victims for eviction from the cache when it considers several factors for making the decision.
One scheme organized the cache by fragmenting it into three slices: Sleep Slice (SS), Active Slice (AS) and Trash Slice (TS). Based on the hit count, slicing was performed to group the cached pages, which helps in reducing the latency when they are retrieved. One time hit pages were discriminated from the other pages to ensure that hot pages were available when users request them. Performance metrics such as File Hit Ratio, Speedup, Delay Saving Ratio and Number of Evictions were used to evaluate the scheme. Another model predicted caching performance using object features such as the HTTP responses of the server, the access log of the cache and the HTML structure of the object. The drawbacks of the model were a) its complexity, b) an intensive learning phase and c) the large number of inputs to the model. A neural
network based web proxy cache replacement scheme was designed by Cobb and ElAarag; it estimated the frequency count and recency time within a sliding window. A scheme that selected the objects to be evicted from cache based on the inputs frequency, recency, size and delay time was discussed by Ali and Shamsuddin (2007). Ali and Shamsuddin (2009) proposed an approach that partitioned the client cache into short-term and long-term cache; the short-term cache is managed using the LRU algorithm and the long-term cache using a neuro-fuzzy system. A Support Vector Machine (SVM) based approach was designed by Ali et al (2011) to predict web object classes using frequency, recency, size and object type as inputs, evaluated using metrics such as Correct Classification Rate (CCR), True Positive Rate (TPR) and True Negative Rate (TNR). Another scheme (2003) balanced the large and small documents that existed in the cache.
2.8 SUMMARY
This chapter surveyed web prediction algorithms by grouping them into two categories: a) algorithms that generate predictions based on web content and b) algorithms that generate predictions based on user access patterns. In chapter 3, we discuss content based web predictions that use Naïve Bayes and Fuzzy Logic mechanisms for computing the priority values based on hypertext information. The client cache is later partitioned into regular and prefetch cache for managing the objects; the regular cache is managed using the FIS algorithm and the prefetch cache using the LRU algorithm.
CHAPTER 3
3.1 INTRODUCTION
The usage of the Internet has increased tremendously over the years, and users are leveraging its benefits to access a variety of services provided over the network. Due to the massive growth of the Internet, network load and access time have increased significantly. The user perceived latency when accessing web pages is affected by several factors, including the load on the server, the round trip time and the object size. Implementing caches either remotely (in the web server or proxy server) or locally (in the browser's cache or a local proxy server) significantly reduces the access latency. The usage of cache can be complemented by prefetching contents with the anticipation that these contents will be requested by users in the near future. Web prefetching exploits the spatial locality exhibited by users when navigating web pages; the predicted objects are prefetched from the server for satisfying the user requests. Web predictions can be generated using information such as the content of web pages and object popularity, depending on the location (server, proxy or client) of the implementation. The client decides to prefetch web objects based on the following factors (Mogul 1998): a) object availability in cache and its current timestamp, b) idleness of the user for more than the threshold interval and c) available network bandwidth.
A prefetching system implemented by Zhang et al (2003) generated a set of URLs that the user is likely to request next. Some schemes rely on keywords or metadata for predicting the probable pages that will be requested in the future. These schemes use information attached to or extracted from the web pages as input for generating the predictions. They should have the provision of being enabled or disabled when the user enters or exits the web services, since they are domain specific and focus on the hyperlinks that are used to access web objects across different web pages. When a web page has higher usability, then prefetching it will improve the system performance.
User navigation reflects not only the interests on a topic, but also the structure of web pages. For example, if the user is currently viewing page Pi, then there is a high probability of using the links available in that page to visit page Pi+1. The user navigates through pages by clicking links based on the text anchored around them. It is assumed that there is a relationship between the textual content of web pages and the user's interests.
The proposed schemes work with the hypertext of the URL that refers to a web object. A hyperlink (URL) represents a relation between two different web pages or two parts of the same web page. Hypertext is the text associated with a hyperlink. Predictions are generated using two approaches: a) the Naïve Bayes approach and b) the Fuzzy Logic approach. They are responsible for computing the priority values of hyperlinks, which decide their inclusion in the prediction list. The client is responsible for performing web prediction and prefetching, where it prefetches the objects during browser idle time based on the
generated predictions. Predictions are generated dynamically for each new web
page visited by the user based on the information maintained in the repositories
when the user requests the main page. Requests for embedded objects are separated from regular user requests. To avoid interference between the prefetch and demand requests, any spare resources available on the servers and network should be used for prefetching. Prefetching hit ratio and bandwidth overhead are the most popular parameters for evaluating prefetching systems. The browser idle time will vary; if the user navigates too fast between the pages, then the web browser will not have enough time to prefetch all the hints. This occurs even if the prediction accuracy is high. To overcome this situation, it is important to provide good hints in order, so that the most useful links are prefetched first.
The priority value of a hyperlink is computed by applying the Naïve Bayes Classifier to find the probability of each token in the hypertext; the token probabilities are combined to finally produce the priority value of the link. The Naïve Bayes Classifier is used for computing the priority value because it is fast and accurate, simple to implement, and has the ability to dynamically learn from newly observed tokens. The procedure involves tokenizing the hypertext, updating the repositories and computing the priority value of each hyperlink.
The user navigates by clicking hyperlinks in the web page. When the cache contents are used to satisfy the requests, the user perceived latency is minimized. When the user visits a web page and spends some time either reading or exploring some valuable information, then the textual content of that page reveals a 'region of interest' that matches the user's interest. When users are not visiting web pages according to a pattern, then it indicates that they are browsing without a specific goal. The proposed approach aims to prefetch web objects for satisfying the user requests with minimal latency. It uses a client-side prefetching mechanism where the client directly prefetches web objects from the server and stores them in its local cache (prefetch cache) to serve the user requests. Both the prediction and prefetching engines are located in the client machine, and the prediction engine uses components such as the Tokenizer and the token count repositories. A request is generated to the server only if the requested web page is not available in the cache.
When the user visits a web page, the hyperlinks in that page form a pool of URLs from which the user selects a hyperlink that suits his/her interest to visit the next page. The prediction engine analyzes the hyperlinks in a web page that reflect the user's interest to create a prediction list that is used for prefetching the web objects during browser idle time. The overall procedure for creating the prediction list and prefetching the web objects is shown in Figure 3.1. The procedural steps for performing the prediction and prefetching (as in Figure 3.1) are explained as follows:
1. The user initially requests a new web page by typing its URL in the web browser.
2. The requested web page is retrieved and displayed to the user.
[Figure 3.1: Prediction and prefetching in the Naïve Bayes approach. The numbered flow through the web browser: the user enters a URL, the web page is displayed, its hyperlinks are extracted, the hypertext is converted to tokens, token counts are maintained in the user-accessed repository, the Naïve Bayes classifier computes the hyperlink priorities, the prediction list is built, web objects are prefetched from the Internet into the prefetch cache during idle time, and later requests are served from the cache or retrieved from the server and displayed.]
8. The priority value of each hyperlink is computed using the Naïve Bayes computation.
9. Web objects are prefetched during browser idle time using the hyperlinks in the prediction list.
10. When the user requests a new web page by either typing its URL in the web browser or clicking a hyperlink, its availability is first checked in the prefetch cache.
12. When the requested page is not available in the cache, the web page will be retrieved from the server and displayed to the user.
3.2.2 Implementation
Both engines reside in the client machine, with the prediction engine responsible for generating the predictions and the prefetching engine responsible for retrieving the web objects and storing them in the prefetch cache. A prefetch is abandoned to avoid adding latency when the predicted object has already been demand requested by the user and is waiting in the browser queue for a connection to the web server.
The prediction engine computes the priority value of hyperlinks by applying the Naïve Bayes classifier on the set of tokens associated with each hypertext. The advantages of using the Naïve Bayes Classifier (Rish 2001) are: a) it is fast in computing the priority value for the specified data, b) it requires minimal storage, since it maintains only the token counts, and c) it performs an incremental update whenever new data is processed.
3.2.2.1.1 Tokenizer
When user visits a new web page, Tokenizer parses that page to extract
meaningful keywords that act as tokens of that text. Hypertext refers to the text
that surrounds hyperlink definitions (hrefs) in web pages (Xu and Ibrahim 2004).
In our approach, the text between the tags <a> and </a> is used to compute the priority value of hyperlinks. When the user clicks a hyperlink to visit a new web page, the tokens of its hypertext are stored in the user-accessed repository. When a token makes a new entry in the user-accessed repository, it gets an initial count value of 1. For the tokens that already exist in the repository, the count value is incremented.
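As an illustration, the sketch below extracts each hyperlink together with its hypertext using Python's standard html.parser module. The class name and the simple whitespace tokenization are assumptions made for this example; the thesis does not prescribe a particular parser.

    # Collect (href, tokens) pairs from the text between <a> and </a>.
    from html.parser import HTMLParser

    class AnchorTokenizer(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []          # list of (href, [tokens]) pairs
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href is not None:
                tokens = "".join(self._text).lower().split()
                self.links.append((self._href, tokens))
                self._href = None

    parser = AnchorTokenizer()
    parser.feed('<a href="/news">latest cricket news</a>')
    print(parser.links)  # [('/news', ['latest', 'cricket', 'news'])]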
Stop words are removed from the tokens before computing the priority value of hyperlinks. To remove the stop words, the tokens of the hypertext are compared with a database that contains commonly occurring stop words, such as the one shown in Figure 3.2. After removing the stop words, stemming is applied to reduce words from their inflectional or derivationally related forms to their common base form. Factors to be considered when stemming the words are: a) different words with the same base meaning are converted to the same form and b) words with distinct meanings are kept separate. The Porter stemming algorithm (Porter 1980) is used for this purpose.
able, about, above, again, after, and, any, back, be, been, before,
below, but, by, came, can, can't, did, do, each, edu, eg, even, ever, far,
for, few, get, go, gone, got, has, have, her, here, how, if, in, is, isn't,
keep, kept, last, less, little, like, let's, make, may, many, miss, more,
my, name, new, next, not, now, of, often, one, once, only, over, plus,
per, please, quite, right, round, saw, say, seen, sent, shall, since, still,
take, than, that, this, there, thing, twice, two, use, us, via, want, was,
we, way, when, where, who, why, yet, you, yours, zero
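The stop word removal and stemming steps can be sketched as follows, assuming the NLTK library is installed for the Porter stemmer; the stop word set here is a small sample of Figure 3.2 above, not the full database.

    # Remove stop words, then reduce tokens to their base form.
    from nltk.stem import PorterStemmer   # assumes NLTK is installed

    STOP_WORDS = {"able", "about", "above", "and", "the", "was", "when"}
    stemmer = PorterStemmer()

    def normalize(tokens):
        """Drop stop words and stem the remaining tokens."""
        return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

    print(normalize(["running", "and", "runs"]))  # ['run', 'run']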
The user-accessed repository stores the tokens from the hypertexts of the hyperlinks that the user clicks to visit the new web pages. Each token is stored with an initial count of 1, which gets incremented when the same token is added to the repository from the hypertexts of used hyperlinks. The tokens stored in the repository exhibit the user's browsing interests, and they are used for computing the priority value of hyperlinks. The repository information reflects user and session characteristics, where a session represents the time interval between the start and end of the user's browsing instance. During a browsing session, the user clicks several hyperlinks to visit web pages of his interest. The hypertext of each used hyperlink is converted into independent tokens (keywords). When the user has a long browsing session and surfs the web focusing on the same topic, then there will be saturation in identifying new tokens.
Tokens can be stored in the repository either with or without stemming before being used for computing the priority value of hyperlinks. Each entry in the repository contains a token with its occurrence count. When stemming is not applied, words that are closely related to each other have separate entries in the repository, each with its own occurrence count. This leads to the following drawbacks: the repository holds more entries, and the occurrence count of related words is spread across multiple entries. When stemming is applied, related words are confined to a single base form, which reduces the number of entries in the repository. It also improves the occurrence count, since the counts of related words accumulate in a single entry. When the user browses pages without a specific interest, most of the keywords generated would be trivial and cannot be used for generating the predictions. The repository is of fixed size, and new tokens are added into it by eliminating old tokens when the repository size reaches its limit. This helps to maintain a pool of legitimate tokens and to prevent trivial tokens from occupying the space for a longer time.
To compute the priority value of a hyperlink, its hypertext is taken and its tokens are compared with the tokens stored in the user-accessed repository. The priority value of the hyperlink is obtained by multiplying the probability values of its tokens. Hyperlinks are then ranked using the computed values.
    Pr(U|A) = [Pr(A|U) · Pr(U)] / Pr(A)        (3.1)

where U is the user-accessed repository and A is the hypertext (the set of tokens T1 ... Tm) of a hyperlink. The value of Pr(U) will be 1, since it is the only repository used for the computation. The value of Pr(A), which acts as a scaling factor for Pr(U|A), is treated as a constant.

    Pr(Ti | U) = (Count of Ti in U) / (Total count of tokens in U),  where i = 1 to m        (3.3)

    Pr(A|U) = Π (i = 1 to m) [C + Pr(Ti | U)]        (3.4)
A constant C (= 1) is added to each token probability value, irrespective of whether the token is present in the repository. When a token of the hypertext is not present in the user-accessed repository, its probability value will be zero. The reason for adding 'C' to each token probability is to achieve the following two conditions: 1) the probability value of a hypertext should not be less than its individual token probabilities, and 2) when a token of the hypertext is present in the user-accessed repository, its contribution raises the product above 1. The priority value of a hypertext will be greater than 1 if either a few tokens or all the tokens of the hypertext are present in the repository. Hyperlinks having priority value greater than 1 are considered for inclusion in the prediction list.
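The computation in equations (3.1) to (3.4) can be sketched in Python as follows. This is an illustrative reading of the method; the repository contents and hyperlink names are hypothetical.

    C = 1.0   # constant added to each token probability, as in equation (3.4)

    def hyperlink_priority(tokens, repository):
        """repository: dict mapping token -> occurrence count in U."""
        total = sum(repository.values())
        priority = 1.0
        for t in tokens:
            pr_t = repository.get(t, 0) / total if total else 0.0  # Pr(Ti | U)
            priority *= C + pr_t                                   # equation (3.4)
        return priority

    # Hypothetical repository and hyperlinks:
    repo = {"cricket": 4, "news": 2, "score": 1}
    links = {"/news": ["cricket", "news"], "/jobs": ["job", "alerts"]}
    ranked = sorted(links, key=lambda h: hyperlink_priority(links[h], repo),
                    reverse=True)
    # "/news" scores (1 + 4/7) * (1 + 2/7) > 1 and enters the prediction list;
    # "/jobs" scores exactly 1 and is excluded.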
For each web page, based on the computed priority value of hyperlinks, the prediction list is created by including the hyperlinks with good priority values. The list is maintained as a queue, with the highest priority links at the top of the queue. The prefetching engine takes links from the top of the queue for prefetching.
[Figure 3.3: The prediction engine computes a priority for each (hyperlink, hypertext) pair and fills the prediction list (Link1 ... Link7, ordered by priority), from which the prefetch engine takes links.]
When user navigates to new web page, prediction list will be cleared
and filled with new set of hyperlinks based on the new web page. It helps to
eliminate prefetching of irrelevant links during the user's browsing session. Figure 3.3 illustrates this process. The prefetching engine downloads the web objects using the hyperlinks taken from the prediction list and stores them in the prefetch cache maintained at the client machine to serve user requests with minimal latency. Prefetching is carried out only during the browser's idle time. Prefetch requests are given lower priority than regular user requests, so that demand requests are not delayed by the prefetching activity. The number of links that can be prefetched varies depending on the amount of time a user spends on each web page during a browsing session. If the user spends more time on a page, then more links can be prefetched.
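A minimal sketch of this idle-time behavior is given below; it assumes a browser_is_idle callback and uses Python's standard urllib for fetching, neither of which is specified in the thesis.

    # Prefetch links from the top of the prediction list while idle.
    import urllib.request

    def prefetch_during_idle(prediction_list, prefetch_cache, browser_is_idle):
        while prediction_list and browser_is_idle():
            url = prediction_list.pop(0)        # highest priority first
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    prefetch_cache[url] = resp.read()
            except OSError:
                continue                        # skip unreachable links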
To adapt to changes in user access patterns, the client maintains the prefetch cache separately from the browser's in-built cache. When new web objects need to be stored in the prefetch cache and the cache is full, objects that have not been accessed for a long time are purged from the cache to make space for storing the newly downloaded objects.
Fuzzy Logic has been used over the years in several domains such as expert systems, data mining and pattern recognition. It deals with fuzzy sets (Zadeh 1965), which allow partial membership in a set, represented by a degree of membership. Vagueness and imprecision exist in several information retrieval (IR) tasks (Chris Tseng 2007), and fuzzy logic helps to handle them. The proposed approach analyzes the hyperlinks in a web page to predict the web objects to be prefetched for satisfying the user's future requests. The prediction engine applies fuzzy inference over the set of tokens associated with each hypertext and uses it to generate the prediction list, i.e. the set of hyperlinks in a web page that reflects the user's interest. The prefetching engine uses the prediction list to prefetch web objects and store them in the prefetch cache before the user actually requests them.
Figure 3.4 represents the process of applying fuzzy logic to decide the set of hyperlinks in the prediction list, which will be used by the prefetching engine during browser idle time:
1. The user initially requests a web page by typing its URL in the web browser.
3. All the hyperlinks and their associated hypertexts are extracted from the displayed page.
5. When the user visits a new web page by clicking a hyperlink in the current page, the tokens of the used hyperlink are stored in the user-accessed repository.
[Figure 3.4: Prediction and prefetching in the fuzzy logic approach. The numbered flow through the web browser: the user enters a URL, the web page is displayed, its hyperlinks are extracted, the hypertext is converted to tokens, token counts are maintained in the user-accessed repository and the predicted-unused repository, fuzzy logic computes the priority values, the prediction list is built, web objects are prefetched from the Internet into the prefetch cache, and later requests are served from the cache or retrieved from the server and displayed.]
Fuzzy inference is applied over the set of tokens associated with each hypertext, with reference to both repositories, to compute the priority values.
10. When the user wishes to visit a new web page by either clicking a hyperlink or typing a URL, the availability of the page is first checked in the prefetch cache.
11. When the prefetch cache is able to satisfy the user request, the contents are served directly from the cache.
12. When the requested contents are not available in the prefetch cache, the request is forwarded to the server.
13. The tokens of the hyperlinks in the prediction list that are not used by the user are stored in the predicted-unused repository. When the user visits a new web page, the prediction list will be cleared and rebuilt, which avoids irrelevant prefetching activity.
3.3.2 Implementation
Both the prediction and prefetching engines are implemented in the client machine. In the fuzzy logic approach, two repositories are used for computing the priority of hyperlinks.
The prediction engine computes the priority values using the set of tokens related to the hyperlinks. The role of the Tokenizer and the user-accessed repository in this approach is similar to that in the Naïve Bayes approach. The Tokenizer is responsible for parsing the web page to extract the hyperlinks along with their hypertexts, which are analyzed to form a set of tokens from each hypertext; the tokens are stored in the user-accessed repository when the hyperlinks are used by the user. The predicted-unused repository stores the tokens of hyperlinks that were predicted but not used. It provides feedback to the prediction engine for tuning the generation of predictions from web pages. The repository helps to identify the tokens in the user-accessed repository that are of less or no interest to the user when computing the priority values of hyperlinks in each web page. When 'n' predictions are generated for a web page, only one in the prediction list may match the hyperlink used by the user to visit the next page. The tokens of the hyperlinks in the prediction list that do not match the user's interests are stored in this repository. The tokens of a hyperlink are subjected to stop word removal and stemming before they are stored in this repository.
Each hypertext is converted into tokens before computing its priority value. Fuzzy logic is applied over the set of tokens by associating each token with a fuzzy set (i.e. a repository storing the tokens). The membership value of a token Ti relative to repository R1 is computed as:
    µR1(Ti) = (TCi)R1 / [(TCi)R1 + (TCi)R2]        (3.5)

where (TCi)R1 and (TCi)R2 are the occurrence counts of token Ti in repositories R1 and R2. The membership values of Ti relative to the two repositories are compared to decide whether to include Ti in computing the priority value of the hyperlink. The Token Acceptance (TAi) of Ti will be set to 1 if µR1(Ti) > µR2(Ti), i.e. the membership value of Ti relative to R1 is greater than its membership value relative to R2.

For Ti with its TAi set to 1, the token popularity (TPi) in repository R1 is computed by dividing its token count by the maximum token count value in R1:

    TPi = (TCi)R1 / max[(TC)R1]        (3.7)

The priority value (PV) of a hyperlink is computed as the mean popularity of its accepted tokens:

    PV = (1/n) Σ (i = 1 to n) TPi        (3.8)
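The following Python sketch illustrates one reading of equations (3.5) to (3.8); the repositories are hypothetical dictionaries, and the handling of tokens absent from both repositories is an assumption of this example.

    def fuzzy_priority(tokens, r1, r2):
        """r1: user-accessed repository, r2: predicted-unused repository,
        both dicts mapping token -> occurrence count."""
        max_r1 = max(r1.values(), default=1)
        accepted = []
        for t in tokens:
            c1, c2 = r1.get(t, 0), r2.get(t, 0)
            if c1 + c2 == 0:
                continue                       # token unknown to both repositories
            mu_r1 = c1 / (c1 + c2)             # membership relative to R1 (3.5)
            mu_r2 = c2 / (c1 + c2)             # membership relative to R2
            if mu_r1 > mu_r2:                  # token acceptance: TAi = 1
                accepted.append(c1 / max_r1)   # token popularity TPi (3.7)
        # priority value PV: mean popularity of accepted tokens (3.8)
        return sum(accepted) / len(accepted) if accepted else 0.0

    r1 = {"cricket": 4, "news": 2}             # user-accessed repository
    r2 = {"alerts": 3, "news": 2}              # predicted-unused repository
    print(fuzzy_priority(["cricket", "news"], r1, r2))   # 1.0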
The prediction list is organized as in the Naïve Bayes approach to maintain the high priority links at the top of the queue. The hyperlinks are sorted based on the computed priority value, which falls within the range of 0 to 1. Figure 3.5 represents the prediction list that arranges the hyperlinks based on their priority values.
[Figure 3.5: The prediction engine computes a priority for each (hyperlink, hypertext) pair and maintains the prediction list (Link1 ... Link7, sorted by priority), which the prefetching engine consumes.]
The prefetching engine uses the prediction list to download web objects during browser idle time, which helps to avoid interference with regular user requests. The prefetching engine will not download all the predicted hyperlinks; the number prefetched depends on the idle time available and the navigation behavior of the users.
The prediction list should contain hyperlinks that reflect the user's interests; otherwise, prefetching wastes resources. The prefetch cache is maintained separately from the regular cache and managed using the LRU algorithm. If a demand requested web object resides in the prefetch cache, then it is moved to the regular cache. A web object will not reside in both the caches (regular and prefetch cache) at the same time.
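A minimal sketch of this partitioned cache is shown below, assuming an LRU policy for the prefetch cache; the regular cache, managed by the FIS algorithm in the thesis, is left as a plain dictionary here.

    # Partitioned client cache: LRU prefetch cache plus a regular cache;
    # a prefetched object is promoted on demand request, so no object
    # lives in both caches at once.
    from collections import OrderedDict

    class PartitionedCache:
        def __init__(self, prefetch_capacity):
            self.regular = {}                    # managed by FIS in the thesis
            self.prefetch = OrderedDict()        # LRU-managed prefetch cache
            self.capacity = prefetch_capacity

        def store_prefetched(self, url, obj):
            if len(self.prefetch) >= self.capacity:
                self.prefetch.popitem(last=False)   # evict least recently used
            self.prefetch[url] = obj

        def demand_request(self, url):
            if url in self.prefetch:                # prefetch hit: promote
                self.regular[url] = self.prefetch.pop(url)
            return self.regular.get(url)            # None -> fetch from server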
3.4 EVALUATION
Real user browsing sessions are used to evaluate the performance of the Naïve Bayes and Fuzzy Logic approaches, since web traces do not capture the hypertext information that these content based schemes require. The evaluation analyzes the browsing pattern in each session. The user visits a new web page either by typing its URL in the web browser or by clicking a hyperlink in a web page. The experimental setup allows the user to configure prefetch settings based on the requirements. Each
browsing session has active and idle periods based on the access pattern of user.
Active period represents the phase where web objects are demand requested by
user and idle period represents the phase where the displayed web objects are
viewed by user. The retrieval of main html file and its embedded objects initiates
the active period. Idle period is used by prefetching engine to download web
objects using the hyperlinks in prediction list and store them in prefetch cache. A
log file is used to record the user requests during browsing sessions, which is later used to analyze the performance. View time represents the time the user spends reading the displayed web page, and it is considered an important attribute
in predicting user’s interests (Liang and Lai 2002; Gunduz and Ozsu 2003; Guo
et al 2007). The user navigation time was partitioned into four discrete intervals
(Xing and Shen 2004): passing, simple viewing, normal viewing and preferred
viewing. If user spends more time reading a web page (preferred viewing), it
increases browser idle time allowing many hyperlinks in the prediction list to be
prefetched. When user visits a web page that contains content irrelevant to his
interest, then prediction list for that page will have either zero or less number of
hyperlinks. When user demand requests a web object, its availability is first
checked in the local cache (regular or prefetch cache) before forwarding the request to the server. The user-accessed repository will receive tokens once the user starts using hyperlinks present in web pages; the predicted-unused repository receives tokens of hyperlinks that are predicted but not utilized to prefetch web objects or do not match the user's interests.
The performance of each approach (Naïve Bayes and Fuzzy Logic) depends on the browsing interest of individual users, and hence the results from different users are not directly comparable. Results are obtained by analyzing the browsing sessions carried out over a period of six weeks. To cover a range of user access behaviors, sessions were collected from users with different browsing habits. The metrics used for evaluation are Recall (hit rate) and Precision (accuracy). Recall represents the percentage of user requests served using the contents of the prefetch cache against the total number of user requests. When the recall is high, the user perceived latency is effectively minimized, since most of the user requests are served from the local cache.
    Rc = Prefetch Hits / Total User Requests        (3.9)
Precision represents the percentage of prefetched web pages requested by users against the total number of web pages prefetched during the browsing sessions.
    Pc = Prefetch Hits / Total Prefetches        (3.10)
In the Naïve Bayes approach, hyperlinks are prefetched based on the priority value computed using the Naïve Bayes formula. Since the prefetch operation is carried out only during browser idle time, the number of hyperlinks prefetched varies dynamically depending on the user's access pattern. Figure 3.6 represents the hyperlinks that are predicted and prefetched for the pages accessed by users (e.g. user-A, user-B) in a browsing session. Initially, few pages in the session have predictions, since the user-accessed repository remains empty and cannot be used to make predictions. When the user starts clicking hyperlinks, the repository gets filled with tokens, which improves the predictions for the subsequent pages. When the user visits the pages quickly, then only a few objects are predicted and prefetched.
[Figure 3.6: Number of pages predicted and prefetched for each of the 18 pages accessed in a session, shown separately for User-A and User-B.]
[Figure 3.7: Recall for User-A and User-B against the user access session length (10 to 70 minutes).]
The recall achieved varies across the browsing sessions based on the users' access patterns. The percentage of Recall varies between the users (e.g. user-A, user-B) due to the fact that the usefulness of the predictions differs in each session. As shown in the graph, when the user has long browsing sessions, the repositories contain a large number of tokens, which helps to predict useful hyperlinks. The number of links prefetched depends on the browser idle time, and it varies depending on the user's navigation pattern across pages.
[Figure 3.8: Precision for User-A and User-B against the user access session length (10 to 70 minutes).]
When the user-accessed repository collects tokens that effectively reflect the user interest, the prefetched pages will be useful to the user. During browsing sessions, users searched for information related to a specific topic, which helped to utilize the prefetched pages. When users browse pages without looking for specific information, the usage of prefetched pages will be low. The fuzzy logic approach is also evaluated by considering the metrics Recall and Precision. This approach uses two repositories, where the predicted-unused repository provides feedback to the system regarding the unused hyperlinks, which helps to fine tune the predictions.
[Figure 3.9: Recall for User-A and User-B against the user access session length (10 to 70 minutes), fuzzy logic approach.]
Figure 3.9 indicates the Recall (Rc) achieved during the browsing sessions for the fuzzy logic approach. As indicated in the graph, when the user has a long browsing session, the percentage of recall improves and helps to reduce the user access latency, since the links are suggested based on the inference from both the repositories.
[Figure 3.10: Precision for User-A and User-B against the user access session length (10 to 70 minutes), fuzzy logic approach.]
Figure 3.10 indicates the Precision (Pc) achieved during the browsing sessions for the fuzzy logic approach. The use of the predicted-unused repository helps to identify the tokens of hypertext that are of less interest to users and eliminate them from being used for computing the predictions. As a result, the hyperlinks predicted and prefetched closely match the user interests, thus minimizing the wastage of resources. The proposed approaches are also compared with the Top-down approach by Markatos and Chronaki (1998) and the Bigrams in link approach, varying the number of links prefetched during browser idle time.
[Figure: Recall of the Top-down, Naïve Bayes, Bigrams-Link and Fuzzy Logic approaches against the number of links prefetched (2 to 10).]
The number of links prefetched during browser idle time is varied between 2 and 10 to analyze the effectiveness of each approach in satisfying the user requests. When more links are prefetched, the user requests can be satisfied more easily. As indicated in the graph, fuzzy logic provides better performance over the other approaches in all the cases. Naïve Bayes provides comparable performance to the Bigrams in link approach. The reason the fuzzy logic approach is able to produce better results is that it can refine the generation of predictions using the feedback available in the predicted-unused repository.
[Figure: Precision of the Top-down, Naïve Bayes, Bigrams-Link and Fuzzy Logic approaches against the number of links prefetched (2 to 10).]
When fewer links are prefetched, the precision remains in the range (0.3 to 0.55) across the various approaches. Precision in the range (0.4 to 0.65) is achieved when more than four links are prefetched, with an upper bound of up to eight links. Prefetching more than eight links will satisfy the user requests, but the number of unused links will be higher, resulting in wastage of resources and a poor precision rate. The fuzzy logic approach provides better performance across all the cases.
3.5 CONCLUSION
This chapter presented content based prediction schemes that use the hypertext associated with the hyperlinks present in web pages. Two approaches, Naïve Bayes and Fuzzy Logic, are designed to compute the priority value of hyperlinks using the hypertext information. The Naïve Bayes approach used only the token information in the user-accessed repository, while the fuzzy logic approach also used the predicted-unused repository. The predictions are generated dynamically for each new web page visited by the user. Experimental results indicate that the proposed approaches effectively serve user requests from the prefetch cache. The recall and precision metrics clearly demonstrate the efficiency of the fuzzy logic and naïve bayes approaches.
CHAPTER 4
4.1 INTRODUCTION
Predictions can be generated using criteria such as access patterns, object popularity and the structure of the accessed web pages. The proposed scheme builds a Precedence Graph (PG) by analyzing user access patterns to generate predictions that reflect the user's future requests. It uses the object URI and the referrer in each user request to build a precedence relation between the requested web object and its source. The algorithm differentiates the dependencies between objects of the same web page and objects of different web pages to incorporate the characteristics of current websites when building the graph. It uses a simple data structure for implementing the graph, which requires minimal memory and computational resources. When the user requests a new web page, the request is satisfied using the contents of the prefetch cache whenever possible. The prediction engine located at the web server uses the information related to user requests for the web pages stored in the server to build the graph. The output of the prediction engine is the hint list (predictions), a set of URIs that are likely to be requested by the user in the near future. The prefetching engine located at the client receives the predictions from the server and uses them to download web objects and store them in the prefetch cache maintained at the client. The Precedence Graph gets updated dynamically with new nodes and arcs when the user requests new web objects, which ensures that the predictions are generated based on the recent access patterns. Updating involves either adding new arcs (when a relation is discovered for the first time) between the nodes or updating the occurrence counts of existing nodes and arcs.
Figure 4.1 represents the basic interaction between the web client and the web server, with the server hosting the prediction engine and the client performing the prefetching operation. When the user clicks on a URL in a displayed web page or types a new URL in the web browser, an HTTP GET request is sent by the browser to the server for fetching the object. The web server, on receiving the request, invokes the prediction engine, which performs predictions for the request and generates the hint list, which the web server
sends to the client through the HTTP response by including HTTP response headers (e.g. Link: <P2>; rel=prefetch) that indicate the links to be prefetched.
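A minimal sketch of a server that attaches such hint headers is shown below, using Python's standard http.server module; the HINTS table and handler are illustrative assumptions, not the implementation used in this thesis.

    # Attach prediction hints as Link response headers.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    HINTS = {"/P1": ["/P2", "/P3"]}   # hypothetical hint list per page

    class HintHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            for hint in HINTS.get(self.path, []):
                # e.g. Link: </P2>; rel=prefetch
                self.send_header("Link", f"<{hint}>; rel=prefetch")
            self.end_headers()
            self.wfile.write(b"<html>page body</html>")

    # HTTPServer(("localhost", 8080), HintHandler).serve_forever()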
[Figure 4.1 (flow): The user requests P1; the browser sends GET P1; the web server uses the Precedence Graph to predict and generate hints; the 200 OK response for P1 carries the headers Link: <P2>; rel=prefetch and Link: <P3>; rel=prefetch; during browser idle time the client prefetches P2 (GET P2, 200 OK) into the prefetch cache; a later user request for P2 is served from the prefetch cache.]
Figure 4.1 Browser & Server interaction with Prediction and Prefetching
When the web browser is idle, it uses the hints received in the HTTP response to download the web objects and store them in the prefetch cache. If the user requests a new web page (e.g. P2) and it is available in the prefetch cache, then it is served with minimal latency. The number of objects prefetched depends on factors such as the browser idle time, the time spent on each page and the size of the visited web pages. The prefetching algorithm should aim to balance the cache-hit ratio and usefulness against the bandwidth overhead. The objectives in generating the predictions are:
· To build the graph with a minimal number of arcs, which reduces the memory and computational requirements.
4.2.1 Introduction
The Precedence Graph is a directed graph that represents the user access patterns, with nodes representing the web objects and arcs representing the relations between the web objects. Arcs are added between the nodes when a user request reports the requested object (successor node) and the source object (predecessor node) from which the request was generated. Each arc has a transition weight associated with it that represents the transition confidence between the two objects. Each user request can update the graph in two ways: 1) add a new node to represent the web object and a new arc to represent the relationship between the new node and an existing node in the graph, or 2) update an existing node and arc by incrementing the node and arc occurrence counts. The updated graph generates predictions for the user request by analyzing the nodes and arcs that are related to the node representing the requested web object. Arcs with an occurrence count greater than a threshold value are considered for generating the predictions. The prefetching engine uses the generated predictions to download web objects from the server during browser idle time.
Web access log files are filtered to select the appropriate HTTP method (i.e. GET) and HTTP response codes (200 – OK, 304 – Not Modified, 206 – Partial Content) before they are used for building the Precedence Graph. HTTP headers offer accurate information, since they are explicitly provided by the web server.
Input:
    · User requested object (URL)
    · Referrer in the user request
    · Object type (primary/secondary)
Output: updated Precedence Graph

Step 1: Adding a new node (or) updating an existing node
    ‘x’ → a node in the graph
    Find ‘x’ that matches the user requested object
    If ‘x’ is available, then
        occurrence(x) ← occurrence(x) + 1        // count updated
    Else
        create ‘x’                               // represents the requested object
        occurrence(x) ← 1
    End if

Step 2: Adding a new arc (or) updating an existing arc
    ‘y’ → a node in the graph
    Find ‘y’ that matches the referrer in the user request
    If ‘y’ is available, then
        Find the arc ‘yx’                        // transition from node y to x (y → x)
        If ‘yx’ is available, then
            occurrence(yx) ← occurrence(yx) + 1
        Else
            create ‘yx’                          // arc from y to x
            occurrence(yx) ← 1
        End if
    Else
        no arc is added or updated in the graph
    End if

Step 3: Compute the transition confidence of all arcs from node ‘y’
    arc transition confidence ← arc occurrence / ‘y’ occurrence

Return Precedence Graph
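For concreteness, a minimal Python sketch of this update procedure is given below (the class and attribute names are illustrative, not from the implementation; requests are assumed to arrive as (URL, referrer) pairs):

    import collections

    class PrecedenceGraph:
        def __init__(self):
            self.node_count = collections.Counter()   # URI -> occurrence count
            self.arc_count = collections.Counter()    # (referrer URI, URI) -> count

        def update(self, url, referrer):
            # Step 1: add a new node or increment the count of an existing one
            self.node_count[url] += 1
            # Step 2: add/update the arc only if the referrer is already a node
            if referrer is not None and self.node_count[referrer] > 0:
                self.arc_count[(referrer, url)] += 1

        def confidence(self, referrer, url):
            # Step 3: transition confidence = arc occurrence / source occurrence
            return self.arc_count[(referrer, url)] / self.node_count[referrer]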
The algorithm described above builds the Precedence Graph used for predicting the user's future requests. Each node in the graph represents a user-requested web object, created with an initial count value of 1. When the user requests the same web object again, the node's occurrence count is incremented. Each arc in the graph represents a transition from one web object to another. For each user-requested web object, an arc is drawn with an initial count value of 1, provided its source node (predecessor) is specified in the request. The arc reflects the precedence relation that exists between the source object (predecessor node) and the newly requested object (successor node). When the user repeats a request for the same web object through the same source, the occurrence count of the existing arc is incremented.

Example: Consider a user requesting a new web object ‘B’ through source object ‘A’. An arc is created from A to B with an initial count value of 1; when the user again requests B through A, the arc count becomes 2. The source of a request is identified using the referrer that is recorded for each user request in the log file. The referrer information is used to create an arc from the source object (predecessor) to the requested object (successor).

[Figure 4.2: Precedence Graph fragment with primary nodes P1.html, P2.html and P3.html, secondary nodes P1.gif and P2.jpg, and primary/secondary arcs connecting them]

Each web page contains a main object (html), referred to as the primary object, and several embedded objects (jpg, png), referred to as secondary objects. The primary nodes in the graph represent main objects that are demand requested by users, while secondary nodes represent embedded objects that are requested by the web browser. The graph accordingly distinguishes arcs between primary and secondary nodes: arcs connecting two primary nodes are termed primary arcs and those connecting primary and secondary nodes are termed secondary arcs. As shown in Figure 4.2, objects (P2.html, P1.gif) are requested from P1.html, objects (P2.jpg, P3.html) are requested from P2.html, and P1.html is requested from P3.html.
The graph is updated dynamically, enabling continuous learning of user access patterns. Each node in the graph contains the object URI, the node occurrence count, the access time and the list of primary/secondary arcs originating from it. Each arc in the graph contains the destination URI, the arc type and the arc occurrence count. The node occurrence count represents the number of user requests to the web object represented by the node, while the arc occurrence count represents the number of transitions observed between its two nodes. If a requested object already has a node in the graph, its occurrence count is incremented; else a new node is created and added to the graph to represent the web object. Likewise, if an arc already exists from the source (predecessor) to the requested object (successor), its occurrence count is incremented; else a new arc is created between the nodes and added to the graph to represent the transition. The graph grows in size during this learning process; its growth is controlled by removing nodes and arcs that least represent the user's interests, as described in the trimming procedure below.
When a user requests a web page through the web browser, the primary object of the web page is requested first, and then the secondary objects are requested either from the server or from the local cache. For each web page, the perfect prediction algorithm (de la Ossa et al 2009) reports three types of hints: a) the primary object of the next web page to be requested by the user, b) the secondary objects associated with the next web page and c) further next pages. The proposed prediction algorithm generates hints for a web object by analyzing nodes in the Precedence Graph. If a node representing the requested web object is available in the graph, its associated arcs are analyzed to generate the predictions; else no predictions are generated for that object. The prediction engine needs to provide hints covering both the primary and secondary objects of the next pages. In outline:

1. For each user-requested web object, find the matching primary node in the graph and add the destinations of its qualifying arcs to the prediction list.
6. The final prediction list contains object URIs from both primary and secondary nodes; each is delivered to the client as a response header of the form Link: <URI>; rel=prefetch.
Figure 4.3 shows a sample HTTP request header that supplies the referrer information to the server. It is recorded in the access log file at the server and used to establish precedence relations when building the graph. Figure 4.4 shows a sample HTTP response from the server that includes the link to be prefetched during idle time; the client uses it to download the web object. A prefetching-enabled browser prefetches URLs provided over HTTP without their embedded objects, and it will not prefetch URLs that contain parameters (queries). Mozilla Firefox, for example, is an open-source web browser with these web prefetching capabilities.
Content-Encoding: gzip
Content-Length: 10933
Content-Type: text/html
Date: Sat, 24 Nov 2012 08:16:14 GMT
Proxy-Connection: keep-alive
Server: Apache
Vary: Accept-Encoding
Via: 1.0 localhost:8080 (squid/2.6.STABLE6)
X-Cache: MISS from localhost
X-Cache-Lookup: MISS from localhost:8080
Link: <http://www.rediff.com/getahead/img1.jpg>; rel="prefetch"    ← link to be prefetched
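For illustration, a client could extract such hints from a response's Link header with a few lines of Python (a sketch; real browsers implement this natively):

    import re

    def prefetch_links(link_header):
        # Collect the URLs whose Link entry carries rel=prefetch
        return re.findall(r'<([^>]+)>\s*;\s*rel\s*=\s*"?prefetch"?', link_header)

    # prefetch_links('<http://www.rediff.com/getahead/img1.jpg>; rel="prefetch"')
    # -> ['http://www.rediff.com/getahead/img1.jpg']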
The prefetching engine receives the predictions from the server via the HTTP response header and uses them to prefetch web objects during browser idle time, which avoids interference with regular user requests. Before downloading, it verifies whether a web object is eligible for caching. The downloaded objects are stored in the prefetch cache, which is maintained separately from the regular cache to improve the hit rate. A web object is not prefetched if it already exists in either the regular or the prefetch cache. When the user demand-requests a new web page that is available in the client machine, the request is served from either the local regular or prefetch cache. The prefetching engine should not retrieve a web page that will be accessed only after a long time, since at the time of access the page may contain stale data. User-perceived latency can be reduced further by prefetching more pages, but prefetch accuracy diminishes if the prefetched pages are never referenced by users.
The sample requests listed in Table 4.1 are used to build the Precedence Graph shown in Figure 4.5. Primary and secondary nodes in the graph are labeled with object URIs and their occurrence counts, and primary and secondary arcs are labeled with their occurrence counts.

[Figure 4.5: Precedence Graph built from the sample requests, with nodes such as P1.html (count 3), P5.html (count 1), P3.jpg (count 1) and P3.gif (count 1), and occurrence counts on the arcs]
An adjacency map is used to implement the Precedence Graph, in which the primary nodes of the graph are stored as keys and the primary/secondary arcs originating from each node are stored as a list associated with its key. Secondary nodes are not stored as keys, since in most cases they do not act as the source of a new web object. Each element in the list carries three fields: object URI, arc occurrence count and arc transition confidence.

Key (node, count)    Value (arc list: URI, occurrence count, transition confidence)
P1.html (3)          P1.gif (3, 1.0); P1.jpg (3, 1.0); P2.html (2, 0.6); P4.html (1, 0.3)
P5.html (1)          —

Example: The transition confidence of each arc originating from P1.html is its arc occurrence count divided by the node occurrence count of P1.html; e.g. for P1.gif it is 3/3 = 1, for P2.html 2/3 ≈ 0.6 and for P4.html 1/3 ≈ 0.3.
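A sketch of hint generation over this adjacency map (structure as just described; names illustrative):

    def predict(adjacency, url, threshold=0.3):
        # adjacency: primary URI -> list of (destination URI, arc count, confidence)
        arcs = adjacency.get(url, [])
        hints = [(dest, conf) for dest, count, conf in arcs if conf >= threshold]
        hints.sort(key=lambda item: item[1], reverse=True)  # most confident first
        return [dest for dest, _ in hints]

    # predict({'P1.html': [('P1.gif', 3, 1.0), ('P2.html', 2, 0.6),
    #                      ('P4.html', 1, 0.3)]}, 'P1.html')
    # -> ['P1.gif', 'P2.html', 'P4.html']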
Table 4.2 presents the hints generated for user requests based on the graph. Both primary and secondary objects can be reported as hints, but compared to secondary objects, the primary objects provide more useful predictions, since a demand request for a primary object also triggers the retrieval of its secondary objects.
The Precedence Graph learns by being constantly updated with request information whenever the user accesses web pages. As the graph grows over time, there is an increase in the memory and computation needed to maintain it, and parts of the graph will become obsolete due to the following factors: a) changes in the access patterns of the user due to a change in topic of interest; b) web pages previously accessed by the users may be removed from the website or referred to by a different URL. In these cases, the occurrence counts of the affected nodes and arcs are no longer updated and remain at their old values. When the graph contains such obsolete information, the quality of its predictions degrades. This can be solved by periodically trimming the graph to remove nodes and arcs that no longer represent user interest. Trimming should not affect the prediction accuracy of the algorithm by removing useful nodes and arcs from the graph. The trimming operation analyzes the entire graph, covering all of its nodes and arcs. Nodes are removed from the graph based on their popularity (number of accesses) and access time: if a node does not reach its minimum popularity or has not been accessed for a long time, it is removed from the graph; nodes having popularity greater than the prescribed threshold, or that have been accessed recently, are retained. Table 4.3 lists the variables used in the trimming procedure.

Variable     Meaning
T_C          Time counter holding the latest node access time
T_Th         Trimming threshold (T_C + interval duration)
n_a_t        Node access time
Time_Diff    Threshold on (T_C − n_a_t) for removing stale nodes

The time counter and trimming threshold are used to decide when the trimming operation is invoked. The trimming threshold is set by the user and should be chosen so that it does not affect the performance of the system. The procedure used for invoking the graph trimming operation is given below.
Initialization:
    T_C ← node access time
    T_Th ← T_C + Interval Duration

Step 1: Updating the graph and the node access time
    While (T_C < T_Th) {
        Update the graph:                  // add a new node or arc; else increment its count
            Increment node/arc occurrence count; (OR)
            Add new node/arc;
        T_C ← node access time             // updated with the new access time
    }

Step 2: Invoking the trimming operation on the graph
    If (T_C ≥ T_Th) {
        Invoke Trimming algorithm          // access time reached the threshold
    }

Step 3: Resetting the counters
    After trimming is completed, the counters are reset:
        T_C ← 0
        T_Th ← 0
        Access time in every node ← 0
    Go to Step 1 to restart the activity with:
        T_C ← new access time
        T_Th ← new threshold value
The time counter keeps track of the current access time whenever the graph is updated with user request information. The graph is updated in two ways: a) addition of a new node or arc to reflect the access information; b) incrementing the occurrence count of an existing node or arc. During trimming, the nodes and arcs of the graph are analyzed and, based on their popularity and access times, retained or removed. The trimming threshold is computed from the interval duration specified by the user and the initial value set in the time counter before the updating activity starts. Whenever the time counter is initialized with a new access time, the trimming threshold takes a new value that acts as the timeline deciding the invocation of the trimming operation. The interval duration should be chosen in such a way that it does not degrade the performance of the prediction algorithm. If the interval duration is too long, the graph will contain outdated information that is of no use to the user when given as predictions. If the interval duration is too short, it leads to frequent trimming of the graph, which wastes computational resources and may also remove useful nodes and arcs.

Each node in the graph maintains a node access time (n_a_t) that is updated with the current access time to reflect the request information. The node access time is compared with the time counter value to determine when the particular node was last accessed. If [T_C − n_a_t] is greater than or equal to the threshold (Time_Diff) value, the node is removed from the graph, since it has not been accessed for a long time. If [T_C − n_a_t] is less than the threshold (Time_Diff) value, the arcs associated with the node are analyzed instead, and based on the arc threshold its primary and secondary arcs are selected for removal from the graph.
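Putting the removal criteria together, a compact Python sketch follows (node/arc attributes mirror the variables above; the thresholds are the user-set parameters, and all names are illustrative):

    def trim(graph, t_c, time_diff, min_popularity, arc_threshold):
        # Remove stale or unpopular nodes; prune weak arcs of the survivors.
        for node in list(graph.nodes.values()):
            stale = (t_c - node.n_a_t) >= time_diff
            if stale or node.n_occ < min_popularity:
                graph.remove_node(node)        # its arcs disappear with it
            else:
                node.arcs = [a for a in node.arcs if a.count >= arc_threshold]
        # After trimming, reset access times and decay the counts to 10%
        for node in graph.nodes.values():
            node.n_a_t = 0
            node.n_occ = max(1, node.n_occ // 10)
            for a in node.arcs:
                a.count = max(1, a.count // 10)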
Example: T_Th = T_C + Interval Duration = 0 + 3600 = 3600. The time counter is initialized to zero and receives new access times as the graph is updated, so trimming is invoked at a time interval of one hour. The minimum number of node occurrences (popularity threshold) is fixed by the user.
In the graph shown in Figure 4.10, n_occ denotes the node occurrence count and n_a_t the node access time.

[Figure 4.10: Precedence Graph before trimming – primary nodes P1.html (n_occ = 100, n_a_t = 3600), P2.html (n_occ = 70, n_a_t = 2854), P3.html (n_occ = 100, n_a_t = 2500), P4.html (n_occ = 60, n_a_t = 2850) and P5.html (n_occ = 20, n_a_t = 1700); secondary nodes P1.gif, P1.jpg, P2.gif and P3.jpg (n_occ = 100 each); arc occurrence counts shown on the arcs]
Each primary node is represented with information such as object URI, node occurrence count and current access time. Each secondary node is represented with object URI and node occurrence count. Primary and secondary arcs are represented with their occurrence counts. The access time of a secondary object is taken to be the same as that of its primary object, because in most cases they are accessed together. Let us consider that the time counter (T_C) has reached its threshold limit, i.e. T_C = T_Th = 3600. The graph is now subjected to the trimming operation.
1. For P1.html, T_C − n_a_t = 3600 − 3600 = 0, which is below the Time_Diff threshold, so P1.html is retained in the graph, and its primary and secondary arcs are analyzed next.
2. P1.html has two secondary arcs leading to P1.gif and P1.jpg. In both nodes the occurrence counts are above the prescribed threshold, so these arcs are retained.
After trimming, the access times of all retained nodes are initialized to 0 and the node and arc occurrence counts are reduced to 10% of their original values, so that an accurate analysis can be carried out in the next interval. Figure 4.11 shows the graph after trimming has been performed, with the node access times set to 0 and the count values reduced accordingly.

[Figure 4.11: Precedence Graph after trimming – node and arc occurrence counts reduced to 10% of their original values (e.g. 100 → 10, 70 → 7, 60 → 6) and all n_a_t reset to 0; the stale node P5.html no longer appears]
This section describes the experimental environment used to evaluate the proposed algorithm and the workload characteristics used for building the graph. The setup consists of a web server running the prediction engine and a client running the prefetching engine. The web server builds the Precedence Graph from user access patterns and then generates predictions for user requests; the client receives the predictions from the web server and uses them to download web objects during browser idle time. To simulate a group of users accessing the web server, real web traces are fed to a client that uses a prefetching-enabled web browser. The time interval between two successive web requests is computed from the timestamp values recorded in the log file to mimic actual client behavior. Each user request and its response are recorded in a log file during simulation. The prediction engine constantly learns the users' access patterns during the experiments, thus guaranteeing that predictions reflect recent behavior. Since most web prefetching techniques use part of the user access sequence as training data, its length matters: if the training data has too few access requests, relevant user requests will be missed; if it includes excessive user accesses, it may contain outdated user access patterns.

Web access log files record users' access patterns during website navigation. The log files can be maintained at the client, server or proxy in the web architecture, and several research initiatives have used log files maintained at the web server. The log files for experimentation are collected from our institutional web server, which provides news articles, admission details, course details, examination details etc. to the faculty members and students. Most web pages maintained in the server are static.

The log file includes information such as the requested URLs, request time, object type, an identifier assigned to the IP address of the user requesting the URL, and the elapsed time for serving each request. Table 4.4 lists the important fields in a log entry, for example:

Field            Description
192.168.10.1     Client IP address that made the request

The log entries are preprocessed to identify web access sessions, and this information is used to build the Precedence Graph. Data cleaning is performed by removing redundant and useless entries from the web log file, retaining only entries that reflect actual user navigation. Entries removed during the data cleaning operation include those that do not use the selected HTTP method (GET) or response codes (200, 304, 206). The cleaned requests are then grouped into sessions: each user session consists of a sequence of web pages visited over a period of time, and when a user remains idle for more than 30 minutes without making any request, the next request from the same user is considered the start of a new session.
4.5 RESULTS
The prediction algorithm is evaluated on its effectiveness in generating predictions for user requests by measuring the Recall and Precision metrics, which quantify the efficiency and usefulness of the generated predictions. When the contents of the prefetch cache are used to satisfy a user request, it counts as a prefetch hit; otherwise it is a prefetch miss. The prediction algorithm needs to generate useful hints for each web request to achieve a high hit rate and a reduction in user-perceived latency. Recall (hit rate) represents the ratio of prefetch hits to the total number of user requests, and Precision (accuracy) represents the ratio of prefetched pages requested by the user to the total number of prefetched pages. Both metrics are controlled by the prediction threshold, which determines the number of predictions generated for the user requests.
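Expressed as formulas, consistent with the definitions above:

\( \mathrm{Recall} = \dfrac{\text{prefetch hits}}{\text{total user requests}} \qquad \mathrm{Precision} = \dfrac{\text{prefetch hits}}{\text{total objects prefetched}} \)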
Figure 4.9 shows the Recall achieved for user requests with different prediction thresholds (0.2 to 0.5). When the threshold is 0.5, the number of links recommended as hints from the graph is minimal, resulting in a moderate number of user requests being satisfied from the cache. For a threshold of 0.2 the Recall achieved is very high, because the graph generates many more hints per request.

[Figure 4.9: Recall (0 to 1) versus number of user requests (5000 to 20000) for thresholds Th = 0.2, 0.3, 0.4 and 0.5]
A smaller threshold value achieves a higher hit rate but increases the amount of traffic generated by prefetching; it can still be used to achieve a higher hit rate when waiting time is a critical requirement for users. Figure 4.10 shows the Precision achieved for user requests with the same thresholds. A smaller threshold results in predicting and prefetching more objects to satisfy the user requests, but in practice this lowers the fraction of prefetched objects that are actually used.

[Figure 4.10: Precision (0 to 0.8) versus number of user requests (5000 to 20000) for thresholds Th = 0.2, 0.3, 0.4 and 0.5]
Resource consumption depends on the number of nodes and arcs that constitute the graph. For each object requested by the user, a node is created and added to the graph in all the algorithms (PG, DG and DDG); the difference lies in the number of arcs added for the same user requests. The results clearly indicate that the Precedence Graph (PG) has fewer arcs than the existing methods (DG and DDG). PG adds an arc between nodes based on the precedence relation inferred from the requests stored in the log file and does not consider the sequence of user requests, whereas DG and DDG add arcs between nodes based on the sequence of user requests recorded in the log file.

[Figure: Number of arcs (0 to 15000) versus number of user requests (5000 to 20000) for DG, DDG and PG]
When implementing the Precedence Graph, only its primary nodes are added as keys in the adjacency map; secondary nodes are added only as link elements under a particular key. This avoids wasting key entries on secondary nodes, which are not going to contribute any link elements.
Predictions are generated from the graph by analyzing the arcs and nodes associated with the node representing the requested web object. The prediction time therefore grows with the total number of arcs in the graph, and the Precedence Graph is able to generate predictions quickly because it has fewer arcs than DG and DDG.
The results presented above are for a Precedence Graph built normally from the user access patterns, without any trimming; the graph grew in size as it learned the access patterns. When the trimming algorithm is applied, significant changes can be noticed in the number of nodes and arcs the graph possesses, as shown in the figures below.

[Figure: Number of nodes (0 to 5000) versus number of user requests (5000 to 20000) for PG with and without trimming]

The figures compare the graph with and without the trimming operation. As shown, with trimming the numbers of nodes and arcs are far smaller than in the untrimmed graph.

[Figure: Number of arcs (0 to 5000) versus number of user requests (5000 to 20000) for PG with and without trimming]
4.6 CONCLUSION
This chapter presented a prediction scheme that builds a Precedence Graph by learning the user access patterns to predict the user's future requests. The scheme accounts for both primary objects (e.g., HTML pages) and secondary objects (e.g., images) by considering two types of arcs (primary and secondary) when constructing the graph. The Precedence Graph is built with fewer arcs than the existing approaches (DG and DDG), since it considers only the precedence relation for each user request rather than the user access sequences recorded in the log file. The graph structure is updated dynamically with new nodes/arcs based on the user requests, which ensures that the predictions reflect recent access patterns. The algorithm achieves good Recall and Precision with minimal resource consumption (i.e. usage of memory and processing time). Periodic trimming removes obsolete nodes and arcs from the graph, which helps the Precedence Graph learn new user access patterns without unbounded growth.
CHAPTER 5
PREFETCHING
5.1 INTRODUCTION
Web caching and prefetching enhance the response time experienced by end users: web objects are stored at locations closer to end users so that their requests are served with minimal delay. Web caching exploits the temporal locality and prefetching exploits the spatial locality inherent in the user access patterns of web objects. Web caches are categorized into client cache, proxy cache and server cache, depending on where they are deployed in the web architecture (Zeng et al 2004). A server cache is deployed alongside the web server and reduces its workload. Proxy caches, often located near network gateways, allow several users to share resources and reduce bandwidth consumption. Client caches, also referred to as browser caches, are located close to the web users and provide short response times when the requested object is available in the cache, enhancing the web experience of end users.

Web prefetching downloads objects from the web server ahead of demand. Two approaches for prefetching objects from the server are: a) the online approach, which fetches web objects during the short pauses that occur while the user reads the displayed page on screen; b) the offline approach, which fetches web objects during off-peak periods or when the user remains idle for a certain duration. If prefetched objects are never used, they cause cache pollution by replacing useful data in the cache. Similarly, if web objects stored in the cache as part of web caching are not accessed again, they occupy space that could serve other content and degrade performance. To utilize the limited cache capacity effectively and avoid pollution, a replacement policy manages the cache by carefully selecting the objects to be evicted to make room for new ones. Earlier works indicate that replacement activity based on intelligent approaches performs better than conventional policies.

This chapter proposes a scheme for managing the client-side cache, which is partitioned into a regular cache and a prefetch cache for handling web caching and prefetching. The regular cache stores web objects received from the following sources: a) objects that are demand requested by the users and b) frequently accessed objects in the prefetch cache that are transferred to the regular cache. The prefetch cache stores web objects downloaded based on the predictions. The contents of the regular cache are managed using a replacement algorithm based on the Fuzzy Inference System (FIS), while the LRU algorithm is used to manage the contents of the prefetch cache. The proposed scheme is designed to retain useful web objects for a longer duration and to remove unwanted objects from the cache. Keeping prefetched objects in a separate prefetch cache also protects the hit ratio, because prefetched objects do not evict demand-requested objects from the regular cache.
A replacement policy decides which web objects are evicted from the cache. The proposed scheme computes a priority value for each web object in the cache and uses it to select the objects to be evicted. Factors considered for computing the priority of web objects are: popularity (frequency), recency, object size, popularity consistency, access latency (delay) and object type (html/text, image etc.). Access latency represents the time interval between sending the user request and receiving the last byte of the requested content as a response. Recency represents the time when the object was last referenced and reflects the temporal locality that exists in user access patterns. Web objects with the lowest priority are selected for eviction when the cache reaches its maximum limit, or when objects have not been used for a long duration.
Deciding which web objects to remove from the cache is not an easy task, as each factor captures a different aspect of the workload. Locality of reference characterizes the ability to predict future accesses to web objects based on past accesses. The two main types of locality are temporal and spatial: temporal locality indicates that recently accessed objects are likely to be accessed again in the future, while spatial locality indicates that accesses to certain objects make accesses to related objects likely. For web objects, the URL is the unique characteristic used to identify an object, and most replacement policies therefore operate on whole objects identified by URL.
Factors such as object size, object type and access latency are static; they are determined only once, when the object is initially requested by the user. Factors such as frequency, recency and popularity consistency are dynamic; they are recomputed repeatedly while the object resides in the cache.
Replacement policies are commonly categorized as:
· Frequency based – the least frequently used objects are removed from the cache.
· Recency based – the least recently used objects are removed from the cache.
· Randomized – objects are randomly selected for removal from the cache.
Object size and the cost of fetching an object from the server, along with temporal locality and long-term popularity, play a significant role in the performance of a replacement policy.
The Fuzzy Inference System (FIS) is a popular computing framework based on the concepts of fuzzy set theory, fuzzy if-then rules and fuzzy reasoning. Fuzzy inference is the process of formulating the mapping from a given input to an output using fuzzy logic; the mapping provides a basis from which decisions can be made or patterns discerned. An FIS fuzzifies crisp inputs into degrees of match with linguistic values. Its knowledge base comprises two components: a rule base and a database. The rule base contains the fuzzy if-then rules, and the database defines the membership functions of the fuzzy sets used in the rules.

[Figure: FIS architecture – crisp inputs pass through fuzzification, are evaluated against the knowledge base (rule base and database), and the aggregated result is defuzzified to produce the output]

The rule base defines the conditions the system considers when making a decision; fewer rules in the database result in better system performance. A membership function defines how each point in the input space is mapped to a membership value in a fuzzy set. It can be chosen arbitrarily by the user based on experience, and common shapes are Gaussian, trapezoidal, bell-shaped and triangular.
· Gaussian function

\( f(x;\sigma,c) = e^{-\frac{(x-c)^2}{2\sigma^2}} \)

· Trapezoidal function

\( f(x;a,b,c,d) = \begin{cases} 0, & x \le a \\ \frac{x-a}{b-a}, & a \le x \le b \\ 1, & b \le x \le c \\ \frac{d-x}{d-c}, & c \le x \le d \\ 0, & d \le x \end{cases} \)

· Bell function

\( f(x;a,b,c) = \dfrac{1}{1+\left|\frac{x-c}{a}\right|^{2b}} \)

· Triangular function

\( f(x;a,b,c) = \begin{cases} 0, & x \le a \\ \frac{x-a}{b-a}, & a \le x \le b \\ \frac{c-x}{c-b}, & b \le x \le c \\ 0, & c \le x \end{cases} \)
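As an illustration, the Gaussian and triangular forms above translate directly into code (a sketch; parameter names follow the formulas):

    import math

    def gaussian(x, sigma, c):
        # f(x; sigma, c) = exp(-(x - c)^2 / (2 * sigma^2))
        return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

    def triangular(x, a, b, c):
        # Piecewise-linear membership rising on [a, b] and falling on [b, c]
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)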
The fuzzification step maps the input parameters to membership values in the range 0 to 1. The input space consists of the parameters Recency, Frequency, Delay Time and Object Size. Each input parameter is associated with three membership functions {low, medium, high} that map the input values to linguistic degrees; the output is likewise associated with three membership functions {low, medium, high}, illustrated using a bell curve. Figure 5.3 represents the Frequency input being mapped to its membership functions.

Fuzzy if-then rules define the mapping between input and output values; the IF part is called the "antecedent" and the THEN part the "consequent".

Example: IF recency is high AND frequency is high (antecedent), THEN the output is high (consequent), where the output's linguistic value is defined by its membership function.
If the antecedent of a rule has more than one part, a fuzzy operator (AND) is applied to obtain a single value that represents the antecedent result for that rule. Several rules fire in parallel in a Fuzzy Inference System, so the rules must be combined to generate the final output. Aggregation is the process by which the fuzzy sets representing the output of each rule are combined into a single fuzzy set; it occurs only once for each output variable.
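A sketch of these two steps, assuming the common min operator for AND and pointwise max for aggregation (the operator choices are illustrative):

    def fuzzy_and(*degrees):
        # AND over the parts of an antecedent: take the minimum membership degree
        return min(degrees)

    def aggregate(rule_outputs):
        # Combine the sampled output fuzzy sets of all fired rules pointwise by max
        return [max(values) for values in zip(*rule_outputs)]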
5.3.3 Defuzzification

Defuzzification takes the aggregated output fuzzy set as input and produces a single crisp value as output. Commonly used defuzzification methods are:
· Centroid of area:

\( z_{COA} = \dfrac{\int_Z \mu_A(z)\, z\, dz}{\int_Z \mu_A(z)\, dz} \)

· Bisector of area: \( z_{BOA} \) satisfies

\( \int_{a}^{z_{BOA}} \mu_A(z)\, dz = \int_{z_{BOA}}^{b} \mu_A(z)\, dz \)

· Mean of maximum:

\( z_{MOM} = \dfrac{\int_{z'} z\, dz}{\int_{z'} dz} \), where \( z' \) is the set of z at which \( \mu_A(z) \) is maximal.

If the maximum is attained on an interval \([z_1, z_2]\), then \( z_{MOM} = \dfrac{z_1 + z_2}{2} \).

· Smallest of maximum: among all z that belong to [z1, z2], the smallest is called \( z_{SOM} \).
· Largest of maximum: among all z that belong to [z1, z2], the largest is called \( z_{LOM} \).
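Since an implementation works on sampled membership values, the centroid of area reduces to a weighted average; a minimal sketch:

    def centroid(zs, mus):
        # Discrete centroid of area: sum(mu * z) / sum(mu) over sampled points
        num = sum(z * m for z, m in zip(zs, mus))
        den = sum(mus)
        return num / den if den else 0.0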
The proposed scheme manages the client-side cache by partitioning it into two parts: a regular cache and a prefetch cache. Each part has its own storage space and is managed independently using a separate replacement algorithm.

[Figure: Client-side cache architecture – the browser interacts with the web server; demand responses are stored in the regular cache, whose objects are purged using FIS, and prefetch responses are stored in the prefetch cache, whose objects are purged using LRU]
The regular cache stores web objects that are demand requested by users together with frequently accessed objects transferred from the prefetch cache. The prefetch cache stores web objects downloaded from the server using the predictions generated as part of web prefetching. When the user frequently accesses objects stored in the prefetch cache, they are moved to the regular cache to ensure that popular objects reside in the cache for a longer duration. The scheme effectively removes useless objects to alleviate cache pollution and maximize the hit ratio.
When a user's request is satisfied using the contents of either the regular or the prefetch cache, it is a cache hit and the request is not forwarded to the web server. In case of a cache miss, the request is forwarded to the server. On receiving the user request, the server serves the requested object and records the request in its access log. The server analyzes the user requests stored in the access log file to generate predictions and delivers them to the client. On receiving the requested object along with the list of predictions, the client stores and manages the content as summarized in the flow below.

[Figure: Client-side caching flow – check whether the content is cacheable; prefetched content goes to the prefetch cache (purging with LRU when full) and demand-requested content goes to the regular cache (purging with FIS when full); objects in the prefetch cache that are frequently accessed are moved to the regular cache, after which the object is available for client access]
The client can also take responsibility for generating the predictions on its own and use them to prefetch web objects from the server; these downloaded objects are likewise stored in the prefetch cache. Before storing any received web object in the cache, the caching system first verifies whether the object is cacheable. If it is cacheable and was prefetched, the prefetch cache is checked for free space; the LRU algorithm is used to purge objects from the prefetch cache when it is full. When objects residing in the prefetch cache are accessed frequently within a short time period, they are moved to the regular cache. The regular cache must accommodate both the objects coming from the prefetch cache and the objects demand requested by the user; when it is full, objects are purged based on the outcome of the FIS algorithm. The objects stored in the regular and prefetch caches are used to satisfy client requests with minimal latency. If an object is not cacheable, it is delivered to the user without being stored.
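A condensed sketch of this two-part cache in Python (capacity counted in objects for brevity; fis_score stands for a function mapping a cached URL to the FIS outcome described below, and all names are illustrative):

    from collections import OrderedDict

    class PartitionedCache:
        def __init__(self, capacity, fis_score, promote_after=2):
            self.regular = {}                    # purged by lowest FIS outcome
            self.prefetch = OrderedDict()        # purged in LRU order
            self.part_capacity = capacity // 2   # 50/50 split between the parts
            self.fis_score = fis_score
            self.promote_after = promote_after   # hits before promotion
            self.hits = {}

        def store_prefetched(self, url, obj):
            if url in self.regular or url in self.prefetch:
                return                            # never prefetch what is cached
            if len(self.prefetch) >= self.part_capacity:
                self.prefetch.popitem(last=False) # evict least recently used
            self.prefetch[url] = obj

        def lookup(self, url):
            if url in self.prefetch:
                self.prefetch.move_to_end(url)
                self.hits[url] = self.hits.get(url, 0) + 1
                if self.hits[url] >= self.promote_after:
                    self._promote(url)            # frequently accessed: promote
                return True
            return url in self.regular

        def _promote(self, url):
            obj = self.prefetch.pop(url)
            if len(self.regular) >= self.part_capacity:
                victim = min(self.regular, key=self.fis_score)
                del self.regular[victim]          # purge lowest-scoring object
            self.regular[url] = obj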
The main factors characterizing an object are frequency, recency and object size. Object popularity is a good estimator when verifying the cacheability of documents, since objects that are more popular have a high probability of being referenced again by the user in the near future. An object is treated as cacheable only if conditions such as the following hold:
· its size is greater than zero;
· it was requested using the GET or HEAD method and the response status code indicates success.
Dynamic requests are not cached, since they return unique objects on each invocation. The input variables given to the FIS are listed below:
Variable Meaning
IP1 Recency of Web object
IP2 Frequency of Web object
IP3 Retrieval time of Web object
IP4 Size of Web object
Frequency and recency for the objects are estimated based on a sliding window of requests. The sliding window of a request represents the time before and after the request is made. The symbols used are:
Symbol Meaning
Oi requested object
∆Ti time period since object Oi was last requested
Fi Frequency of object Oi within sliding window
SWL Sliding Window length
OT Target Output
If an object is requested for the first time, its recency is fixed at SWL; otherwise it takes the maximum of SWL and ∆Ti:

\( \mathrm{recency}(O_i) = \max(\mathrm{SWL}, \Delta T_i) \)

The frequency is updated as

\( \mathrm{frequency}(O_i) = \begin{cases} F_i + 1, & \Delta T_i \le \mathrm{SWL} \\ 1, & O_i \text{ accessed beyond SWL} \end{cases} \)

i.e. if the time interval between the previous request and the new request is within the sliding window length, the frequency is incremented; otherwise it is reinitialized to 1.
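A sketch of deriving the (recency, frequency, target) training tuples from a request trace of (timestamp, url) pairs, with SWL = 1200 s (20 minutes) as in the experiments (names illustrative):

    SWL = 1200

    def build_training_data(trace):
        times = {}                                 # url -> all request times
        for t, url in trace:
            times.setdefault(url, []).append(t)
        last_seen, freq, rows = {}, {}, []
        for t, url in trace:
            if url in last_seen:
                dt = t - last_seen[url]
                recency = max(SWL, dt)
                freq[url] = freq[url] + 1 if dt <= SWL else 1
            else:
                recency = SWL                      # first request
                freq[url] = 1
            # target = 1 if the object is re-requested within the forward window
            target = int(any(t < u <= t + SWL for u in times[url]))
            rows.append((recency, freq[url], target))
            last_seen[url] = t
        return rows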
The target output OT is set to 1 if the object is re-requested within the forward-looking sliding window; else OT is 0. The objective is to use information about a web object's past requests to predict its revisit in the near future. When the cache has free space, a newly received object is simply stored. Otherwise, the scheme evicts objects based on the outcome generated by the Fuzzy Inference System (FIS) algorithm: FIS takes the input parameters of an object and decides whether it can remain in the cache or should be purged. The rule outputs are combined by aggregation using the output membership functions, and the aggregated output is defuzzified using the centroid of area method.
When the object has high recency and frequency, it has a good chance of residing in the cache. If the outcome from the FIS is a value greater than 0.5, the object can reside in the cache; else it can be purged.
5.5 IMPLEMENTATION

This section discusses the training data used for the simulation and the process of training the Fuzzy Inference System. A real web trace (1995) is used for the simulation; the data collection consists of 9633 files requested by users. The trace files contain sequences of web object requests that were served either from the origin server or from the browser's internal cache.

Each line in the log file represents a URL requested by a user. It consists of the machine name, the timestamp when the request was made, the User_ID number, the requested URL, the size of the document and the object retrieval time in seconds. If a log entry records the number of bytes as zero and the retrieval delay as zero, the request was satisfied using the contents of the browser's internal cache.

The raw log files are preprocessed to extract useful information that reflects user navigational behavior. Figure 5.9 shows a sample log file used for the preprocessing operation. The main preprocessing steps are:
· Parse the log file to identify the distinct fields in each record entry and discard malformed entries from the file.
· Extract the useful fields from each line in the log file to be used for analysis.
The output file generated after preprocessing contains, among other fields:
· Requested URL
· Delay time
Table 5.3 shows sample preprocessed data created from the log file; it is used to obtain the training data given as input to the Fuzzy Inference System.
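A sketch of this preprocessing step, assuming whitespace-separated fields in the order described above (machine name, timestamp, user id, URL, size, retrieval time); the field layout is illustrative:

    def preprocess(log_path, out_path):
        with open(log_path) as src, open(out_path, 'w') as dst:
            for line in src:
                fields = line.split()
                if len(fields) < 6:
                    continue                       # drop malformed entries
                machine, ts, uid, url, size, delay = fields[:6]
                if float(size) == 0 and float(delay) == 0:
                    continue                       # served from browser cache
                dst.write(f'{url} {delay} {size}\n')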
The preprocessed records are converted into training data as shown in Table 5.4. The recency and frequency values are derived using the sliding window. The time period for the sliding window length (SWL), in both the forward and backward directions, is taken as 20 minutes to simulate user browsing patterns, since users tend to change their browsing patterns often and may have short browsing sessions.

Table 5.4 columns – Inputs: Recency, Frequency, Retrieval Time (ms), Size (bytes); Target output.

When an object is requested for the first time, or its re-request falls within the SWL, its recency is set to 1200 (seconds). If the time difference between the new request and the previous request to an object is greater than the SWL, its recency is the value of that time difference.

The frequency of an object is set to 1 if it is requested for the first time. If the object is re-requested within the SWL, its frequency is incremented; if it is re-requested beyond the SWL, the frequency is reset to 1 irrespective of its previous value. The target output is set to 1 if the object has a future reference within the forward SWL; else its value is set to 0.
The proposed scheme is simulated against common replacement policies. The storage space allocated for the cache in the client machine is distributed equally between the regular and prefetch caches, i.e. 50% of the total capacity to the regular cache and the remaining 50% to the prefetch cache. When a page is displayed in the browser, predictions are made for that page and the predicted objects are prefetched and stored in the prefetch cache. If the objects in the prefetch cache are frequently accessed, they are moved to the regular cache.

The performance of web caching and prefetching is evaluated using the metrics Hit Rate (HR) and Byte Hit Rate (BHR). HR represents the percentage of user requests served from the cache, and BHR represents the percentage of bytes served from the cache against the total number of bytes requested.

An important point to note is that Hit Rate and Byte Hit Rate cannot both be optimized at the same time (Podlipnig 2003): strategies that optimize Hit Rate give preference to smaller objects, which tends to decrease the Byte Hit Rate. The proposed scheme is compared with two common replacement policies, LRU and LFU, in terms of HR and BHR. In LRU, the least recently used objects are removed first; it is a simple and efficient scheme for uniformly sized objects. In LFU, the least frequently used objects are removed first; its advantage is its simplicity. The scheme is also compared with NNPCR-2 (Romano and ElAarag 2011), an intelligent web caching approach that uses a neural network to make replacement decisions.
The algorithms are simulated by varying the cache size from 10MB to 100MB. Log files of 15 different users are split into three groups: users 1 to 5 in Group A, users 6 to 10 in Group B and users 11 to 15 in Group C. HR and BHR for each group are evaluated separately to analyze the behavior of the algorithms across user populations. Figure 5.10 shows the hit rate of the different policies using the traces of Group A (users 1 to 5), Figure 5.11 using the traces of Group B (users 6 to 10) and Figure 5.12 using the traces of Group C (users 11 to 15).

[Figure 5.10: Hit Ratio (%) versus cache size (10 to 100 MB) for LRU, LFU, NNPCR-2 and FIS-LRU, Group-A traces]
When the cache size increases, the HR improves for all the replacement policies, since a larger cache can store more web objects to satisfy user requests. As observed in the graphs for the different traces, the HR of the proposed scheme (FIS-LRU) is better than NNPCR-2 in most cases, and in a few cases the results match those of NNPCR-2.

[Figure 5.11: Hit Ratio (%) versus cache size (10 to 100 MB) for LRU, LFU, NNPCR-2 and FIS-LRU, Group-B traces]

[Figure 5.12: Hit Ratio (%) versus cache size (10 to 100 MB) for LRU, LFU, NNPCR-2 and FIS-LRU, Group-C traces]
[Figure 5.13: Byte Hit Ratio (%) versus cache size (10 to 100 MB) for LRU, LFU, NNPCR-2 and FIS-LRU, Group-A traces]

[Figure 5.14 plot: Byte Hit Ratio (%) versus cache size (10 to 100 MB) for LRU, LFU, NNPCR-2 and FIS-LRU]
Figure 5.14 Byte Hit Ratio using Traces of Group-B (user 6 to 10)
[Figure 5.15 plot: Byte Hit Ratio (%) versus cache size (10 to 100 MB) for LRU, LFU, NNPCR-2 and FIS-LRU]
Figure 5.15 Byte Hit Ratio using Traces of Group-C (user 11 to 15)
The BHR obtained using the traces of Group-A is shown in Figure 5.13, using the traces of Group-B in Figure 5.14 and using the traces of Group-C in Figure 5.15. As observed in these graphs, the BHR of the proposed scheme (FIS-LRU) is better in all cases than that of the compared policies.

5.7 CONCLUSION

This chapter presented a scheme that manages the client-side cache, which is partitioned into regular and prefetch caches for handling web caching and prefetching. The proposed scheme uses a Fuzzy Inference System (FIS) based algorithm for managing the contents of the regular cache and the LRU algorithm for managing the contents of the prefetch cache. When objects stored in the prefetch cache are frequently accessed by users, they are moved to the regular cache, where they are managed based on the outcome of the FIS algorithm. The scheme retains useful objects for a longer period while effectively removing unwanted objects from the cache. Performance in terms of HR and BHR is compared with various algorithms (LRU, LFU and NNPCR-2), where LRU and LFU are basic algorithms and NNPCR-2 is an intelligent, neural network based algorithm; HR and BHR of the proposed scheme are computed considering both the regular and prefetch caches. The results clearly indicate that the proposed scheme (FIS-LRU) outperforms the other policies.
CHAPTER 6

CONCLUSION

Web caching and prefetching techniques have been designed and used to reduce the latency perceived by web users. Web prefetching predicts the user's future requests and downloads the objects before the user actually demand-requests them, which alleviates the problems encountered in web caching. Several researchers over the years have proposed solutions for reducing the latency, and it has been observed that fast and accurate prediction is central: prefetching techniques can usefully prefetch a large number of web objects only when the prediction accuracy is high.

In the client-side approach, the prediction and prefetching engines are deployed in the client machine and the access patterns are observed when the user views web pages in the browser. Two new approaches (Naïve Bayes and Fuzzy Logic) have been proposed to generate the predictions. When the user views a web page, the hyperlinks in that page are analyzed and assigned priority values; hyperlinks with high priority values form part of the prediction list (hints) that is used by the prefetching engine to download web objects during browser idle time. The predictions are refined as the user continues to navigate the web pages: a predicted-unused repository stores information about predictions that were not used, which helps the prediction engine tune its predictions. Both approaches generate effective predictions that minimize the access latency, and they are most effective when the user exhibits consistent access behavior.

In the server-side approach, the user access patterns recorded in server log files are used to build the Precedence Graph, which effectively records the predecessor and successor relationships between the user requests. The resulting graph has fewer arcs than those built by the existing algorithms (DG and DDG). Graph trimming has been employed to keep the size of the graph under control while preserving its useful parts. The server intimates the predictions (hints) to the client through HTTP response headers that are easily recognized by the browser; during idle time, the client uses the predictions to download web objects and store them in the cache.

A new cache management scheme has been proposed to effectively manage the client-side cache that supports both caching and prefetching. The cache is partitioned into two parts: the regular cache (for caching) and the prefetch cache (for prefetching). The regular cache is managed using the Fuzzy Inference System based algorithm and the prefetch cache is managed using the LRU algorithm. The objective is to retain useful objects for a longer duration and effectively remove unwanted objects from the cache; hit rates improve when the frequently accessed prefetched objects are properly retained in the regular cache.

Future work includes combining user access patterns with web page content to build simple and accurate prediction models for prefetching, and further tuning of the caching and prefetching schemes to improve performance.
LIST OF PUBLICATIONS
INTERNATIONAL JOURNALS
CURRICULUM VITAE

The author received his Computer Technology degree from the same institution in May 2001 and obtained his M.S (By Research) degree from Anna University, Chennai. He is currently working in Coimbatore.