Professional Documents
Culture Documents
Overview
Introduction Business Application of Web Mining Web Mining Technologies Architecture of WEBMINER system Browsing behavior models Preprocessing
1. 2. 3. 4. Data cleaning User identification Session identification Path completion
Introduction
The Web, then, is becoming the apocryphal Vox Populi. The WWW continues to grow at an astounding rate resulting in increase of complexity of tasks such as web site design, web server design and of simply navigating through a web site An important input to these design tasks is analysis of how a web site is used. Usage information can be used to restructure a web site in order to better serve the needs of users of a site in providing E-Business Web mining is the application of data mining techniques to large web data repositories in order to produce results that can be used in these design tasks.
Some of the data mining algorithms that are commonly used in web usage mining are:
i) Association rule generation: Association rule mining techniques discover unordered correlations between items found in a database of transactions. 3
ii) Sequential Pattern generation: This is concerned with finding intertransaction patterns such that the presence of a set of items is followed by another item in the time-stamp ordered transaction set iii) Clustering: Clustering analysis allows one to group together users or data items that have similar characterstics The input for the web usage mining process is a file, referred to as a user session file, that gives an exact accounting of who accessed the web site, what pages were requested and in what order, and how long each page was viewed Web server log does not reliably represent a user session file. Hence, several preprocessing tasks must be performed prior to applying data mining algorithms to the data collected from server logs. 4
Web mining is the application of data mining or other information process techniques to WWW, to find useful patterns. Business organizations can take advantage of these patterns to access WWW more efficiently
The business benefits that Web mining affords to digital service providers include personalization, collaborative filtering, enhanced customer support, product and service strategy definition, particle marketing and fraud detection. The powerful computing available on the Internet can be exploited through the use of meta-computing technology, i.e. large scale (internet-based) parallel and distributed computing.
The WEBMINER system recognizes five main types of pages:-Head Page: a page whose purpose is to be the first page the users visit web site is providing -Content Page: a page that contains a portion of the information content -Navigation Page: a page whose purpose is to provide links to guide users on to content pages -Look-up Page: a page used to provide a definition or acronym expansion -Personal Page: a page used to present information of biographical nature Each of these types of pages is expected to exhibit certain physical characteristics
ii) Users Model: Analogous to each of the common physical characteristics of the different page types, there is expected to be common usage characteristics among different users
9
For the purposes of association rule discovery, it is really the content page references that are of interest. The other pages are just to facilitate the browsing of a user while searching for information, and are referred to as auxiliary pages
Transactions can be defined in two ways using the concept of auxiliary and content page references.
Auxiliary content transaction consists of all the auxiliary references up to and including each content reference for a given user . Mining these would give the common traversal paths through the website to a given content page Content only transaction consists of all the content references for a given user. Mining these would give association between the content pages of a site, without any information as to the path taken between uses.
10
Preprocessing
Two of the biggest impediments to collecting reliable usage data are local caching and proxy servers In order to improve performance and minimize network traffic, most web browsers cache the pages that have been requested. As a result, when a user hits a back button, the cached page is displayed and the web server is not aware of the repeat page access
Proxy servers provide an intermediate level of caching and create even more problems with identifying site usage. In a web server log, all requests potentially represent more than one user. Also due to proxy server level caching, a single request from the server could actually be viewed by multiple users through an extended period of time Hence to input reliable and more accurate data to the mining algorithms the following preprocessing tasks are to be done on the web server log data:Data cleaning Session identification User identification Path completion
11
Data Cleaning
Techniques to clean a server log to eliminate irrelevant items are of importance for any type of web log analysis
The discovered associations or reported statistics are only useful if data represented in the server log gives an accurate picture of the user access to the web site Problem: The HTTP protocol requires a separate connection for every file that is requested from the web server. Therefore, a users request to view a particular page often results in several log entries since graphics and scripts are downloaded in addition to the HTML file. In most cases, only the the log entry of the HTML file request is relevant and should be kept for the user session file Solution: Elimination of items deemed irrelevant can be reasonably accomplished by checking the suffix of URL name. All log entries with filename suffixes such as gif, jpeg,GIF,JPEG,JPG,jpg and map can be removed. However, the list can be modified depending on the site being analyzed
12
User Identification
User identification is the process of associating page references, even with those with same IP addresses, with different users. Problem:This task is greatly complicated by the existence of local caches, corporate firewalls and proxy servers. Solution: Even if the IP address is same, if the agent shows a change in browser software or operating system, a reasonable assumption to make is that each different agent type for an IP address represents a different user
Solution: If a page is requested that is not directly reachable by a hyperlink from any of the pages visited by the user,it implies there is another user with the same IP address For the sample log, three unique users are identified with browsing paths of A-B-F-O-G-A-D, A-B-C-J, and L-R, respectively
13
Advantages
The preprocessing tasks described in this paper have several advantages over current methods of collecting information like the use of cookies and cache busting
Cache busting is the practice of preventing browsers from using stored local versions of a page from the server every time it is viewed Cache busting defeats the speed advantage that caching was created to provide
Cookies can be deleted or disabled by the user These problems can be overcome by applying the mentioned preprocessing tasks to the web server log
15
Disadvantages
Two users with the same IP address that use the same browser on the same type of machine can easily be confused as a single user if they are looking at same set of pages A single user with two different browsers running, or who types in URLs directly without using a sites link structure can be mistaken for multiple users
While computing missing page references, we can be misled, by the fact that the user might have known the URL for a page and typed it in directly. (it is assumed that this does not occur often enough to affect mining algorithms)
16
Conclusion
Through this paper, we have tried to highlight how to gain business advantage in e-Business, by using the WWW, which is considered to be the largest pool of information resources. This technology helps in gaining meaningful insights related to the day-today business activities by using the useful information made accessible through the web. Web sites present enormous potential for direct marketing. In a nutshell, Web Mining can be summed up as a very viable technology, application and a product suite meant for discovering knowledge pertaining to the routine business activities and can prove to be a very important approach for gaining competitive advantage.
17
Thank You
18