
A Web Browsing Traffic Model for Simulation: Measurement and Analysis

Lourens O. Walters
Data Networks Architecture Group University of Cape Town Private Bag, Rondebosch, 7701 Tel: (021) 650 2663, Fax: (021) 689 9465 Email: lwalters@cs.uct.ac.za

Pieter S. Kritzinger
Data Networks Architecture Group University of Cape Town Private Bag, Rondebosch, 7701 Tel: (021) 650 2663, Fax: (021) 689 9465 Email: psk@cs.uct.ac.za

Abstract: The simulation of packet switched networks depends on accurate workload models to serve as input for network models. We derive a workload model for the data traffic generated by an individual browsing the web. The parameters of the model characterize user, web browser software, and web server software behaviour, independently of underlying network characteristics such as latency and throughput, which differ between network sites. We measured data on a campus network by capturing packet traces of IP, TCP and HTTP header data. The data were processed by extracting datasets for the model parameters. The approach of obtaining datasets for the parameters is novel in that a heuristic algorithm was used to achieve this. The parameter datasets were analyzed using visual techniques and goodness-of-fit measures in order to derive analytic distributions. The workload model will be implemented as a traffic generating module in a network simulator. Traffic generated by the model will be validated by analyzing its burstiness and self-similarity characteristics, as well as by comparing it to independently measured traffic. This paper presents our work to date, which includes the definition of our workload model, the measurement and processing of web traffic data, and initial findings from our statistical analysis of the data.

I. INTRODUCTION

Structural modelling is an approach to network traffic modelling which takes into account the underlying characteristics of traffic streams. For example, Internet traffic can be broken down into source-destination specific traffic, application specific traffic or user specific traffic. This approach contrasts with the black box modelling approach, which analyzes aggregate traffic as a monolithic body of data. By modelling component traffic streams it is often possible to shed light on characteristics of aggregate traffic streams, e.g. Willinger et al. [1] found a strong connection between the self-similarity of aggregate network traffic and the occurrence of heavy-tailed, infinite variance distributions within individual source-destination network connections. We employed the structural modelling approach by defining a detailed characterization of the traffic generated by an individual web user. By detailed we mean a characterization of web traffic that takes into account the nuances of the traffic generated by a user, e.g. a small textual HTML file being downloaded is almost always followed by a series of large graphics files, or the relatively small but numerous

request packets generated by a web browser client in order to download the active content or graphics files making up a web document. One might ask why modelling the numerous small request packets is important when most of the bandwidth is used by much larger response packets. The interaction of TCP and HTTP, and in particular the slow start mechanism of TCP, has a considerable influence on the performance of HTTP [2]. TCP is fundamentally a bulk transfer protocol and is poorly suited to frequent, short, request-response-style traffic such as web traffic. Short connections, such as those required to transmit small request packets, interact poorly with TCP's slow start congestion avoidance algorithm, which causes increased latency for most web users [2]. Several recent simulation studies have taken these facts into consideration by using detailed web traffic, TCP and radio interface models, e.g. Staehle et al. [3] used simulation to show that the QoS of Internet access with GPRS in its first phase is comparable to that of a modem with a speed of 32 kbps, for medium traffic loads. Using a bidirectional web traffic model, Kalden et al. [4] showed that GPRS provides bandwidth-efficient support for bursty applications such as web access.

The paper is organized as follows. Section II discusses related work. We present our workload model in Section III. Section IV discusses the measurement methodology we employed to obtain data. The heuristic algorithm we implemented to extract datasets for the model parameters is presented in Section V. The statistical methodology we followed is explained in Section VI, and preliminary results are discussed in Section VII. Section VIII concludes the paper and discusses remaining work.

II. RELATED WORK

During the latter half of the 1990s, World Wide Web traffic came to dominate the Internet and became the main focus of traffic modelling. Traffic modelling work also started to take into consideration the self-similar nature of network traffic, as suggested by the seminal paper on the self-similarity of Ethernet traffic by Leland et al. [5]. Crovella et al. [6] showed that web traffic is self-similar, and that the self-similarity is in all likelihood attributable to the heavy-tailed distributions

of transmission times of documents and silent times between document requests. The data traces used in the Crovella study were recorded by Cunha et al. [7]. These traces served as the basis of the workload generator SURGE developed by Barford et al. [8]. The objective of the SURGE workload generator is to generate traffic representative of the World Wide Web in order to exercise web servers and networks. SURGE generates web traffic equivalent to that of a set of real users accessing a web server. Mah [9] developed an empirical model of HTTP network traffic. The parameters modelled are represented by their empirical cumulative distribution functions (as opposed to analytic distribution functions). The Inverse Transformation Method is applied to these in order to generate the relevant random numbers (a small sketch of this sampling approach follows Table I). The model is derived from packet traces and models bidirectional traffic, i.e. traffic from the web client to the web server (requests) as well as traffic in the opposite direction (responses). Choi et al. [10] developed an analytic behavioural model of web traffic. The parameters modelled are represented by analytic distribution functions which were chosen by applying the visual Quantile-Quantile plot technique to datasets. The parameters characterize the behaviour of web users and model unidirectional traffic from the server to the client. We briefly give an overview of Mah's model as it is closest in structure and purpose to ours. Table I shows the parameters modelled by Mah. We use the word User throughout this article to refer to a web user (person) making requests by using a Web Client (browser software).
TABLE I
MAH PARAMETERS MODELLED

HTTP User Request Size
HTTP Web Client Request Size
HTTP User Response Size
HTTP Web Client Response Size
Number of Web Client Requests
Think Time
Consecutive Document Retrievals per Server
Server Popularity
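As an informal illustration of how empirical distribution functions can drive such a model, the sketch below applies the Inverse Transformation Method to a small set of observed values. It is our own illustrative Python restatement of the general idea, not Mah's implementation, and the sample values are placeholders.

import random

# Illustrative sketch: sample from an empirical distribution by mapping a
# uniform random number through the empirical inverse CDF (quantile function).
observed_think_times = sorted([2.1, 3.5, 5.0, 8.2, 15.0, 40.0, 120.0])  # placeholder data

def sample_empirical(values):
    u = random.random()                              # uniform(0, 1)
    idx = min(int(u * len(values)), len(values) - 1)
    return values[idx]                               # the u-th empirical quantile

print(sample_empirical(observed_think_times))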

During a simulation, traffic is generated according to Mah's model as follows:
1) A web server is selected according to the Server Popularity table (a table containing the relative popularity of servers).
2) A period of length Think Time elapses.
3) A User clicks on a web page, generating a request of size HTTP User Request Size, which is sent to the web server.
4) The web server receives the request, and responds with a document of size HTTP User Response Size.
5) The Web Client receives the response, and responds by generating a number of requests equal to Number of Web Client Requests, each of size HTTP Web Client Request Size.
6) The web server receives each of the requests, and responds with a document of size HTTP Web Client Response Size for each request.
7) The process returns to Step 2 and repeats Consecutive Document Retrievals per Server number of times before a new server is chosen in Step 1.

It should be noted that the model does not represent interarrival times. Interarrival times are influenced in part by TCP flow control and congestion control algorithms. These algorithms depend on the latency and bandwidth of the network environment that the web client and servers are in. Also, the web proxy cache server/s between the web client and server have different cache management algorithms which affect interarrival times. Any measurements of interarrival times will be affected by these factors. A simulation of web traffic using Mah's model will therefore have to include a simulation of the actual TCP algorithms as well as the web proxy caching algorithms.

III. WORKLOAD MODEL

As mentioned before, our workload model is similar to the one developed by Mah [9]. It is bidirectional and therefore differentiates between User and Web Client Requests, and User and Web Client Responses. The model is layered, and is shown in Figure 1. The figure depicts the Browsing Session Arrival Process (BS: Browsing Session), the Web Request Arrival Process within a session (WR: Web Request), and the client side and server side of a Web Request Dialogue (UReq: User Request, UResp: User Response, CReq: Web Client Request, CResp: Web Client Response).

Fig. 1. Layered Web Browsing Workload Model

The Browsing Session Layer models the time during which a user browses the web. We chose a 15 minute period of inactivity to signal the end of a Browsing Session. We chose 15 minutes because we observed that most users do not spend more than 15 minutes reading a single web page; they either request another page or stop browsing. The Web Request Layer models requests for web documents. A web document is composed of a textual HTML file and the graphical images displayed along with the text.
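As an illustration, the following sketch shows how such a 15 minute inactivity threshold might be applied to a sorted list of a user's request timestamps to delimit Browsing Sessions. The function name and data layout are our own illustrative assumptions, not part of the measurement software.

# Illustrative sketch only: split request timestamps (seconds, sorted ascending)
# into Browsing Sessions using a 15 minute idle gap.
SESSION_GAP = 15 * 60  # seconds of inactivity that ends a Browsing Session

def split_into_sessions(timestamps, gap=SESSION_GAP):
    sessions, current = [], []
    for t in timestamps:
        if current and t - current[-1] > gap:
            sessions.append(current)      # previous session ended by inactivity
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Example: three requests, a 20 minute pause, then two more requests -> 2 sessions
print(len(split_into_sessions([0, 30, 90, 90 + 20 * 60, 90 + 20 * 60 + 5])))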

The Web Request Dialogue Layer shows an example of a typical interaction between a web client and server during a Web Request. The arrows in Figure 1 indicate when a request or response is sent, and the size of a box indicates the size of the file and hence the transmission time of the file. A response is sent immediately after a file is received on either side. A typical scenario is for a user to request an HTML file. On arrival of the request at the server side, the server responds by sending the requested file to the client. The client responds by sending requests for all the graphical objects that need to be displayed along with the text. On arrival of these requests at the server side, the server responds by sending the requested files to the client. The parameters we model are displayed in Table II; a sketch of how a generator might draw these parameters follows the table.
TABLE II
PARAMETERS MODELLED

Browsing Inter-Session Time
Number of Web Requests per Session
Web Request Interarrival Time
User Request Size
Web Client Request Size
Web Client Request Interarrival Time
User Response Size
Number of Web Client Responses
Web Client Response Size
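The sketch below illustrates how the layered model and the Table II parameters could drive a traffic generator. Every draw() call stands in for sampling the named parameter from its fitted distribution; the distributions and constants used here are arbitrary placeholders chosen for illustration, not our measured results.

import random

# Illustrative sketch only: generate the events of one Browsing Session.
def draw(name):
    placeholders = {
        "Number of Web Requests per Session":   lambda: random.randint(1, 20),
        "Web Request Interarrival Time":        lambda: random.expovariate(1 / 30.0),
        "User Request Size":                    lambda: random.randint(200, 600),
        "User Response Size":                   lambda: int(random.paretovariate(1.2) * 1000),
        "Number of Web Client Responses":       lambda: random.randint(0, 15),
        "Web Client Request Interarrival Time": lambda: random.expovariate(1 / 0.05),
        "Web Client Request Size":              lambda: random.randint(200, 600),
        "Web Client Response Size":             lambda: int(random.paretovariate(1.2) * 2000),
    }
    return placeholders[name]()

def generate_browsing_session():
    events, t = [], 0.0                    # (time, event type, size in bytes)
    for _ in range(draw("Number of Web Requests per Session")):
        t += draw("Web Request Interarrival Time")             # user clicks a link
        events.append((t, "UReq", draw("User Request Size")))
        events.append((t, "UResp", draw("User Response Size")))
        tc = t
        for _ in range(draw("Number of Web Client Responses")):
            tc += draw("Web Client Request Interarrival Time")  # browser fetches embedded objects
            events.append((tc, "CReq", draw("Web Client Request Size")))
            events.append((tc, "CResp", draw("Web Client Response Size")))
    return events   # successive sessions would be separated by a Browsing Inter-Session Time

print(len(generate_browsing_session()))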

The Browsing Inter-Session Time is the time between the end of one Browsing Session and the start of another. The Web Request Interarrival Time and Web Client Request Interarrival Time are the only interarrival time parameters modelled. As mentioned in Section II, it is not advisable to use interarrival times measured on the Internet in traffic models, due to the influence that different network environments have on interarrival times. However, we will show in Section IV that the environment in which we took our measurements had no undue influence on the Web Request Interarrival Time and Web Client Request Interarrival Time parameters.

IV. DATA MEASUREMENT METHODOLOGY

There are generally three methods of obtaining web traffic data, and they have been used in various studies. They are: Server Logs [11], Client Logs [7], and Packet Traces [9], [12]. Web Server Logs are very useful for analyzing the workload of a specific web server. They are not appropriate for our work, as it is an impossible task to analyze the logs of all the servers visited by a specific user during a browsing session. Client Logs are created by customising web browser software to write logfiles of the required data. This approach suits our needs as it provides exactly the data we require. However, the browser used by most people at the university is Microsoft Internet Explorer, and as we cannot make changes to this software, this method is also impossible for us to implement. Packet Traces are obtained by recording data from a shared medium such as an Ethernet LAN. The recorded data are used to reconstruct the browsing sessions of individuals browsing the web. The advantage of this approach is that a large sample of user datasets can be obtained relatively easily. The disadvantage is that, unlike client logs, the required data have to be obtained by reconstructing browsing sessions, from sometimes incomplete data, using heuristic methods. As the first method is inappropriate for our needs, and the second too difficult to implement, we chose to use packet traces to obtain web traffic datasets. We developed a data capturing tool, similar to the Bi-Layer Tracing tool [13], for collecting packet traces. The tool extracts data from Network, Transport and Application Layer packet headers and writes it to a logfile. Table III shows which data are extracted by the tool; a sketch of the HTTP field extraction follows the table.

TABLE III
DATA EXTRACTED FROM IP, TCP AND HTTP HEADERS

Data                            Protocol Header
IP address of browsing host     IP
TCP port of browsing host       TCP
Requested URL                   HTTP
Referer URL                     HTTP
Content-Length of Request       HTTP
Content-Length of Response      HTTP
Content-Type field              HTTP
Arrival time                    IP, TCP and HTTP
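As an informal illustration of the Application Layer part of this extraction, the sketch below pulls the Table III HTTP fields out of the text of a single request header. It is a simplified Python stand-in for the C capture tool, and the function name and example header are ours.

import re

# Illustrative sketch (not the C capture tool): extract the Table III HTTP
# fields from the text of one HTTP request header.
def extract_http_fields(header_text):
    fields = {}
    request_line = header_text.splitlines()[0]            # e.g. "GET /index.html HTTP/1.0"
    m = re.match(r"(\S+)\s+(\S+)\s+HTTP/", request_line)
    if m:
        fields["method"], fields["requested_url"] = m.group(1), m.group(2)
    for name in ("Referer", "Content-Length", "Content-Type"):
        m = re.search(rf"^{name}:\s*(.+)$", header_text, re.MULTILINE | re.IGNORECASE)
        if m:
            fields[name.lower()] = m.group(1).strip()
    return fields

example = "GET /index.html HTTP/1.0\r\nReferer: http://www.uct.ac.za/\r\nContent-Length: 0\r\n\r\n"
print(extract_http_fields(example))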

The arrival time data are measured by the system clock. The tool is implemented in C using the Linux Socket Filter (LSF). The Linux Socket Filter filters packets in kernel space. Packets meeting the filter criteria are passed to user space, while all other packets are discarded. All processing by the TCP/IP stack is therefore avoided. The performance gain of kernel space filtering enabled us to measure traffic on a 100 Mbps link. The tool is implemented for both Intel and SPARC architectures. We instrumented two machines with the data capturing tool and placed them at the Trace Collection Points on the campus network, as illustrated in Figure 2. The position of the machines on the network ensured that all HTTP requests from, and responses to, university students and staff browsing the web were captured. We captured traffic generated by 6 689 hosts over a one month long period. The measurement machines were synchronized by means of the Network Time Protocol (NTP) [14]. Synchronization was necessary as the data captured by the two machines were merged into one dataset. The resultant dataset was used for analysis. Some parts of the analysis required accuracy in the order of hundreds of milliseconds; the machines' clocks were synchronized to within tens of milliseconds of each other. The parameter Web Client Request Interarrival Time is measured in microseconds, which could cause a problem during the merging of datasets. Fortunately, all Web Client Requests in a set are always sent to the same proxy machine, and the analysis is therefore not affected by the merging process.
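A minimal sketch of the merging step, assuming each trace file has one record per line beginning with a numeric arrival timestamp; the file names and record layout are illustrative assumptions.

import heapq

# Illustrative sketch: merge two NTP-synchronized trace files into one dataset
# ordered by arrival time. Assumes each line starts with a numeric timestamp.
def merge_traces(path_a, path_b, out_path):
    def records(path):
        with open(path) as f:
            for line in f:
                yield (float(line.split()[0]), line)
    with open(out_path, "w") as out:
        for _, line in heapq.merge(records(path_a), records(path_b)):
            out.write(line)

# merge_traces("trace_point_1.log", "trace_point_2.log", "merged.log")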

Fig. 2. Network Configuration (campus network, switch, gateway router to the Internet, two HTTP proxies, and the two Trace Collection Points)

As mentioned in Section III, the Web Request Interarrival Time and Web Client Request Interarrival Time are the only interarrival time parameters we model. These parameters model the time between web user clicks and between web client initiated requests for graphics files respectively. They are not intrinsically linked to network behaviour, and if we had in fact been able to instrument all the web browsers at the university, we would have been able to measure these parameters without any influence from network conditions. As Figure 2 however shows, our measurement machines were placed on the network. This means that the time it takes for a packet to travel from a web user's host on the university campus network to the measurement machine influences the measurements of these parameters. The university campus network however has a fibre optic backbone with a capacity of 100 Mbps. The utilization of this network is on average very low, which results in low latency and low variability of latency between users on the campus network and the measurement machines. We therefore measure data for these parameters at the Trace Collection Points.

V. DATA PROCESSING

As mentioned in Section IV, the datasets obtained by means of a packet trace do not contain direct information about the 9 modelled parameters shown in Table II. The challenge we faced was to use the information contained in the datasets, as shown in Table III, to obtain the information shown in Table II. We used heuristic methods to obtain best-guess values for the 9 parameters. The methods were implemented in a Data Extraction Program which extracts datasets for the 9 parameters from the 5051 datasets. There are three main problems the Data Extraction Program has to solve:

The User vs. Web Client Request Differentiation Problem: Is a request a User or Web Client Request? I.e. did a person or a web browser generate the request? All further extraction depends on the answer to this question. For instance, it would not be possible to decide when a new Web Request is made, or when a new Browsing Session starts, without knowing whether a request was generated by a User or Web Client.

The Web Client Request Matching Problem: How do we match Web Client Requests found in the datasets to the correct User Requests responsible for them? I.e. which User Request is responsible for the generation of a specific set of Web Client Requests? The Web Client Request Interarrival Time and Number of Web Client Responses parameters are dependent on the correct matchings.

The HTTP Response Matching Problem: How do we match HTTP response packets to the GET request packets which requested them? The information obtained by solving this problem is used in the solution of the previous two problems. The information also enables us to dispose of TCP connections, and the data associated with these connections, which terminated abnormally.

A. User vs. Web Client Request Differentiation Algorithm

The differentiation algorithm is heuristic. By studying the measured web traffic, we identified a set of request characteristics, each of which implies that a request is a Web Client Request. The characteristics we identified are based on the following properties of requests:
Requested URL
Referer URL
Type of request
Interarrival time between requests
The Requested and Referer URLs are recorded in the datasets, and the interarrival time between requests can be calculated from values in the datasets. The type of request can be inferred from the file extension of the Requested URL (a sketch of this classification is given after Table IV). We distinguish between the following types of requests:
HTML: requests for files with HTML content, i.e. files with the following extensions: .html, .js, .cgi, .php, .asp, .pl, .cfm, .vbs and .css
GRAPHICS: requests for graphics files, i.e. files with the following extensions: .gif, .jpg, .png and .jpeg
OTHER: requests for all other files
The list of request characteristics we identified is long, and is therefore not listed here. We list two of the characteristics in Table IV.

TABLE IV
REQUEST CHARACTERISTICS

No.  Description
1    A GRAPHICS request with a Referer URL matching the Requested URL of a preceding HTML request is a Web Client Request
2    A GRAPHICS request with a Referer URL matching that of any other request preceding it by at most 10 seconds is a Web Client Request
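The following sketch illustrates the file extension based type classification described above. It is an informal Python restatement of the rule, not the Data Extraction Program itself, and the function name is chosen for illustration.

from urllib.parse import urlparse

HTML_EXT = {".html", ".js", ".cgi", ".php", ".asp", ".pl", ".cfm", ".vbs", ".css"}
GRAPHICS_EXT = {".gif", ".jpg", ".png", ".jpeg"}

def request_type(requested_url):
    # Classify a request as HTML, GRAPHICS or OTHER from the file extension
    # of the Requested URL (query strings are ignored).
    path = urlparse(requested_url).path.lower()
    dot = path.rfind(".")
    ext = path[dot:] if dot != -1 else ""
    if ext in HTML_EXT:
        return "HTML"
    if ext in GRAPHICS_EXT:
        return "GRAPHICS"
    return "OTHER"

print(request_type("http://www.uct.ac.za/index.html"))   # HTML
print(request_type("http://www.uct.ac.za/logo.gif"))     # GRAPHICS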

A request is categorized by matching its properties against those of each entry in the list of characteristics. If it matches any entry, it is categorized as a Web Client Request; if it does not match a single entry, it is categorized as a User Request. Characteristic No. 1 in Table IV follows intuitively from the interaction between a web client and web server, i.e. a requested HTML page usually generates several requests for graphics files. Characteristic No. 2 follows from the observation that the images making up a web page are downloaded in quick succession to one another. Figure 3 shows the implementation of Characteristics No. 1 and 2.
if (this filetype == GRAPHICS)                    /* Characteristic No. 1 */
    for (all previous requests in html request queue)
        if (this referer url == prev request url)
            this category = Web Client Request
            break

if (file has not been classified yet)             /* Characteristic No. 2 */
    if (this filetype == GRAPHICS)
        for (all previous requests in all request queue)
            if (arrival time difference > 10)
                break
            else if (this referer url == prev referer url)
                this category = Web Client Request
                break

Fig. 3. Implementation of Characteristics No. 1 and 2

The all request queue in Figure 3 is a queue which stores the properties of all previous requests (i.e. HTML, GRAPHICS and OTHER requests). The html request queue is a queue which stores the properties of HTML requests only. The code in Figure 3 for Characteristic No. 1 does the following: it loops through all the requests in the html request queue; if for any one of these the Requested URL matches the Referer URL of the current request, we categorize the current request as a Web Client Request. We tested the algorithm by capturing web traffic generated by ourselves, whilst keeping track of the URLs we requested. We processed the measured data using the Data Extraction Program, and compared the results with the records we kept during the test. The program performed extremely well: it correctly categorized all requests except for requests issued from pop-up windows opened by some web pages.

B. Web Client Request Matching Problem

This problem is partially solved in the process of solving the User vs. Web Client Request Differentiation Problem. Minor additions to the program keep track of which User Request generated a particular Web Client Request. We therefore will not discuss this problem any further here.

C. HTTP Response Matching Problem

The matching of HTTP responses to GET requests is difficult, as an HTTP header contains no information associating a particular response with a particular request. The problem was solved by using our knowledge of how HTTP uses TCP connections. HTTP requests are pipelined on a TCP connection, meaning that multiple requests are made without waiting for responses to return. Responses return in the same sequence as the requests were made. If we keep track of all GET requests made on all TCP connections, we can match responses to requests. We use a queue to keep track of the GET requests on a TCP connection. New requests are added to the tail of the queue, and requests are removed from the head of the queue and matched to incoming responses. We maintain a queue for every TCP connection opened by a host. We recorded SYN, FIN and RST flags in our data. These flags tell us when a TCP connection is opened, closed or terminated abnormally. We open a TCP connection queue every time a SYN flag is encountered in a dataset, and close the queue if a FIN or RST packet is encountered on that connection. Whilst a TCP connection queue is open, we add GET requests to it as they occur in the dataset. When an HTTP response arrives, we match it to the GET request at the head of the queue, and remove the request from the queue.
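A minimal sketch of this queue based matching, assuming each trace record carries a connection identifier and an event type for SYN/FIN/RST, GET and response events; the record format and names are illustrative, not the actual Data Extraction Program.

from collections import deque

# Illustrative sketch of per-connection FIFO matching of HTTP responses to
# GET requests. Each record is assumed to be a dict with a connection id and
# an event type; real trace records differ.
def match_responses(records):
    queues = {}          # connection id -> deque of pending GET requests
    matches = []         # (get_request, response) pairs
    for rec in records:
        conn = rec["conn_id"]
        if rec["event"] == "SYN":
            queues[conn] = deque()
        elif rec["event"] in ("FIN", "RST"):
            queues.pop(conn, None)                           # connection closed or aborted
        elif rec["event"] == "GET" and conn in queues:
            queues[conn].append(rec)                         # add request to tail
        elif rec["event"] == "RESPONSE" and conn in queues and queues[conn]:
            matches.append((queues[conn].popleft(), rec))    # match head of queue
    return matches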

VI. STATISTICAL METHODOLOGY

We will find analytic distributions to represent each of the nine parameters of our workload model. By an analytic distribution we mean both the family of the distribution, e.g. normal or exponential, and the parameters associated with this family, e.g. location and scale. We use the following visual techniques to uncover characteristics of the data that are suggestive of underlying mathematical properties:
1) The histogram
2) The empirical cumulative distribution function
3) The log empirical complementary cumulative distribution function
4) The probability plot
We use goodness-of-fit techniques to test inferences suggested by visual analysis, and to quantify the evidence suggested by visual analysis. We use the Anderson-Darling test and the discrepancy measure. We use both measures because the Anderson-Darling test is more accurate than the discrepancy measure for small datasets, but cannot be used for very large datasets [15], [8]. The discrepancy measure as defined by Pederson et al. [16] is a binning technique based on the chi-square goodness-of-fit statistic. The Anderson-Darling test is based on the empirical distribution function (EDF) [17]. EDF tests are generally more powerful than tests based on binning techniques such as the chi-square test.

VII. PRELIMINARY RESULTS

We have completed the analysis of the Browsing Inter-Session Time parameter. The visual analysis as well as the Anderson-Darling test showed that the Weibull distribution provides the best fit for most of the 2526 datasets under test; most of the datasets passed the Anderson-Darling test at the significance level we used. Figure 4 shows Weibull probability plots for 2 randomly selected datasets.

Fig. 4. Weibull probability plots for 2 of the datasets (Plot No. 1 and Plot No. 2; Z values against Log(Int. Times))

The regression statistics for the straight lines fitted to the Weibull probability plots are shown in Table V.
TABLE V
LEAST SQUARES REGRESSION RESULTS FOR WEIBULL PROBABILITY PLOTS

Plot No.                   1        2
                           0.976    0.975
Std. Error of Estimate     0.198    0.204

It is clear from Table V that the Weibull distribution is a very good fit to the datasets. The shape and scale parameters for the Weibull distribution were estimated using Maximum Likelihood Estimation.
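For illustration, the sketch below performs this kind of analysis on a single inter-session time dataset using SciPy: a Maximum Likelihood Weibull fit, the correlation coefficient of the straight line fitted to the probability plot, and an Anderson-Darling test on the log-transformed data. The use of SciPy and the sample values are our own illustrative choices, not the software used for the paper.

import numpy as np
from scipy import stats

# Placeholder inter-session times (seconds); not measured data.
data = np.array([120.0, 450.0, 900.0, 1800.0, 2600.0, 5400.0, 7200.0, 11000.0])

shape, loc, scale = stats.weibull_min.fit(data, floc=0)   # MLE with location fixed at 0
(osm, osr), (slope, intercept, r) = stats.probplot(data, sparams=(shape,),
                                                   dist=stats.weibull_min, fit=True)
print(f"shape={shape:.3f} scale={scale:.3f} probability-plot r={r:.3f}")

# The log of Weibull-distributed data follows a Gumbel (minimum) distribution, so
# one way to apply the Anderson-Darling test of Section VI is on the log of the data.
print(stats.anderson(np.log(data), dist='gumbel'))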

VIII. CONCLUSION

There is a need for detailed web workload models in the simulation field. The characterization of the relevant parameters for such models depends on the acquisition of reliable, relevant and recent traffic measurements. Such measurements are often not easily obtainable. We found that freely available traffic measurement software does not provide the functionality necessary for taking specialized measurements such as ours. With the help of the Linux Socket Filter, and custom written software using a heuristic algorithm, we were able to obtain the necessary data. The analysis of data traffic often involves analyzing very large datasets. Common statistical tests such as the Anderson-Darling test are not suitable for analyzing large datasets. We have implemented both the Anderson-Darling test and the discrepancy measure, which can be used to analyze large datasets. We used this methodology, assisted by the relevant visual techniques, to obtain the best-fit distribution for one of our model parameters. We will find analytic distributions for the remaining eight parameters, and implement the model as a traffic generation module in the ns simulation package. We will validate the model by testing the generated traffic for burstiness and self-similarity, as well as by comparing it to the characteristics of measured traffic obtained independently from the traffic the model was derived from.

REFERENCES

[1] W. Willinger, V. Paxson, and M. S. Taqqu, "Self-similarity and Heavy Tails: Structural Modeling of Network Traffic," in A Practical Guide to Heavy Tails: Statistical Techniques and Applications, R. J. Adler, R. E. Feldman, and M. S. Taqqu, Eds. Boston: Birkhauser, 1998.
[2] J. Heidemann, K. Obraczka, and J. Touch, "Modeling the performance of HTTP over several transport protocols," IEEE/ACM Transactions on Networking, vol. 5, no. 5, pp. 616-630, October 1997.
[3] D. Staehle, K. Leibnitz, and K. Tsipotis, "QoS of internet access with GPRS," in Proceedings of the 4th ACM International Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems. Rome, Italy: ACM Press, 2001, pp. 57-64.
[4] R. Kalden, I. Meirick, and M. Meyer, "Wireless Internet Access Based on GPRS," IEEE Personal Communications, vol. 7, no. 2, pp. 8-18, April 2000.
[5] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, "On the Self-Similar Nature of Ethernet Traffic," in ACM SIGCOMM, D. P. Sidhu, Ed., San Francisco, California, 1993, pp. 183-193.
[6] M. E. Crovella and A. Bestavros, "Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes," in Proceedings of ACM SIGMETRICS 1996: The ACM International Conference on Measurement and Modeling of Computer Systems, Philadelphia, Pennsylvania, May 1996, pp. 151-160; also in Performance Evaluation Review, May 1996, 24(1):160-169.
[7] C. Cunha, A. Bestavros, and M. E. Crovella, "Characteristics of World Wide Web Client-based Traces," Boston University, CS Dept, Boston, MA 02215, Tech. Rep. BUCS-TR-1995-010, April 1995. [Online]. Available: http://www.cs.bu.edu/techreports/pdf/1995-010-www-clienttraces.pdf
[8] P. Barford and M. Crovella, "Generating Representative Web Workloads for Network and Server Performance Evaluation," in Performance 1998/ACM SIGMETRICS 1998, 1998, pp. 151-160.
[9] B. A. Mah, "An Empirical Model of HTTP Network Traffic," in 17th IEEE INFOCOM Conference, April 1997.
[10] H. Choi and J. Limb, "A Behavioural Model of Web Traffic," in 7th International Conference on Network Protocols (ICNP 1999), Toronto, Canada, October 1999.
[11] M. F. Arlitt and C. L. Williamson, "Web Server Workload Characterization: The Search for Invariants," in Measurement and Modeling of Computer Systems, 1996, pp. 126-137.
[12] A. Reyes-Lecuona, E. Gonzalez-Parada, E. Casilari, J. C. Casasola, and A. Diaz-Estrella, "A page-oriented WWW traffic model for wireless system simulations," in 16th International Teletraffic Congress (ITC16), D. Smith and P. Key, Eds., Edinburgh, June 1999, pp. 1271-1280.
[13] A. Feldmann, "BLT: Bi-Layer Tracing of HTTP and TCP/IP," WWW9/Computer Networks, vol. 33, no. 1-6, pp. 321-335, 2000. [Online]. Available: citeseer.nj.nec.com/201157.html
[14] D. Mills, "RFC 1305: Network Time Protocol," March 1992, Draft Standard.
[15] V. Paxson, "Empirically-Derived Analytic Models of Wide-Area TCP Connections," IEEE/ACM Transactions on Networking, pp. 316-336, August 1994.
[16] S. P. Pederson and M. E. Johnson, "Estimating Model Discrepancy," Technometrics, vol. 32, pp. 305-314, 1990.
[17] M. A. Stephens, "Tests Based on EDF Statistics," Statistics: Textbooks and Monographs, vol. 68, pp. 97-185, 1986.
Lourens Walters received his BA degree in History and BSc Hons degree in Computer Science at the University of Cape Town, in 1998 and 1999 respectively, and is currently working towards the MSc degree.

Pieter Kritzinger has a PhD in Computer Science from the University of Waterloo in Canada and is a full professor in the Computer Science Department at the University of Cape Town.
