
Introduction

1.1 Web Mining: Mining means extracting something useful or valuable from a baser substance, such as mining gold from the earth. Web mining is the application of data mining and machine learning techniques to web-based data in order to discover patterns and extract knowledge. In web usage mining, the goal is to examine web page usage patterns in order to learn about a web system's users or the relationships between its documents. According to the analysis target, web mining can be divided into three types: web usage mining, web content mining and web structure mining.

1.1.1 Web Usage Mining: Web usage mining is the process of extracting useful information from server logs, i.e., users' browsing history. Its goal is to find out what users are looking for on the Internet: some users might be looking only at textual data, whereas others might be interested in multimedia data.

1.1.2 Web Content Mining: Web content mining is the mining, extraction and integration of useful data, information and knowledge from web page contents. The heterogeneity and lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery, organization, search and indexing difficult. Tools such as Lycos, AltaVista, WebCrawler and MetaCrawler provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter or interpret documents.

In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, and to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information. Web content mining can be approached from two different points of view: the information retrieval view and the database view. R. Kosala summarized the research done on unstructured and semi-structured data from the information retrieval view. It shows that most of the research uses the bag-of-words model, which is based on statistics about single words in isolation, to represent unstructured text, taking each single word found in the training corpus as a feature. For semi-structured data, all of the works utilize the HTML structure inside the documents, and some also utilize the hyperlink structure between documents, for document representation. In the database view, in order to achieve better information management and querying on the web, the mining tries to infer the structure of a web site so as to transform it into a database. There are several ways to represent documents; the vector space model is typically used, in which the documents constitute the whole vector space.

1.1.3 Web Structure Mining: Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:

1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to a different location.

2. Mining the document structure: analysis of the tree-like structure of a page to describe HTML or XML tag usage.
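To make the first kind concrete, the following is a minimal illustrative sketch in Python, assuming the networkx library is available; the page names and links are invented examples, and in-degree and PageRank are shown only as two common ways of summarizing hyperlink structure.

# Minimal sketch: treating hyperlink structure as a directed graph.
# The page names and links below are hypothetical examples.
import networkx as nx

links = [
    ("home.html", "products.html"),
    ("home.html", "about.html"),
    ("products.html", "item1.html"),
    ("products.html", "item2.html"),
    ("about.html", "home.html"),
    ("item1.html", "products.html"),
]

graph = nx.DiGraph()
graph.add_edges_from(links)

# In-degree: how many pages link to each page.
for page, indeg in sorted(graph.in_degree(), key=lambda x: -x[1]):
    print(f"{page}: {indeg} incoming link(s)")

# PageRank is one common way to score pages by their hyperlink structure.
print(nx.pagerank(graph))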

1.2 Document Clustering: Document clustering (also referred to as Text clustering) is closely related to the concept of data clustering. Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering. A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories.

Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.

The application of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared with offline applications. In general, there are two common classes of algorithms. The first is the hierarchical algorithm, which includes single link, complete linkage, group average and Ward's method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing; however, such algorithms usually suffer from efficiency problems. The other class is based on the K-means algorithm and its variants, which is usually more efficient, but less accurate, than the hierarchical algorithms. Other approaches involve graph-based clustering, ontology-supported clustering and order-sensitive clustering. Hierarchical algorithms build clusters step by step, creating new clusters based upon the relationships of the data within the set. They can do this in one of two ways, from the bottom up or from the top down; these two methods are known as agglomerative and divisive, respectively. Agglomerative algorithms start with the individual elements of the set as clusters, and then merge them into successively larger clusters. Divisive algorithms start with the entire dataset in one cluster, and then break it up into successively smaller clusters. Because hierarchical algorithms must analyze all the relationships inherent in the dataset, they tend to be costly in terms of time and processing power.
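As an illustration of the agglomerative (bottom-up) approach, the following minimal Python sketch applies SciPy's hierarchical clustering to a handful of invented document vectors; any of the linkage methods mentioned above (single, complete, average or Ward's) can be passed to linkage().

# Minimal sketch of agglomerative (bottom-up) hierarchical clustering.
# The five vectors stand in for term-frequency vectors of five documents.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

docs = np.array([
    [1, 2, 0, 0, 1],
    [3, 1, 2, 3, 0],
    [3, 0, 0, 0, 1],
    [2, 1, 0, 3, 0],
    [2, 2, 1, 5, 1],
], dtype=float)

# 'ward' could be replaced by 'single', 'complete' or 'average'.
merge_tree = linkage(docs, method="ward")

# Cut the hierarchy to obtain, for example, two flat clusters.
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)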

Partitional algorithms determine the clusters at one time, at the beginning of the clustering process. Once the clusters have been created, each element of the dataset is analyzed and placed within the cluster it is closest to. Partitional algorithms run much faster than hierarchical ones, which allows them to be used for analyzing large datasets, but they have their disadvantages as well. Generally, the initial choice of clusters is arbitrary and does not necessarily reflect all of the actual groups that exist within the dataset. Therefore, if a particular group is missed in the initial clustering decision, the members of that group will be placed within the clusters that are closest to them, according to the predetermined parameters of the algorithm. In addition, partitional algorithms can yield inconsistent results: the clusters determined by one run of the algorithm probably won't be the same as the clusters generated the next time it is run on the same dataset.
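This sensitivity to the arbitrary initial choice of clusters can be seen in a minimal k-means sketch; the document vectors are invented, scikit-learn is assumed to be available, and two runs with different random seeds may assign the documents to different clusters.

# Minimal sketch of a partitional method (k-means) and its sensitivity
# to the arbitrary initial choice of clusters. Data are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

docs = np.array([
    [1, 2, 0, 0, 1],
    [3, 1, 2, 3, 0],
    [3, 0, 0, 0, 1],
    [2, 1, 0, 3, 0],
    [2, 2, 1, 5, 1],
], dtype=float)

# Two runs with different random initializations may produce different clusterings.
for seed in (0, 1):
    km = KMeans(n_clusters=2, n_init=1, random_state=seed)
    print(km.fit_predict(docs))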

Clustering in Mining Web Documents

The World Wide Web is a vast resource of information and services that continues to grow rapidly. Powerful search engines have been developed to aid in locating unfamiliar documents by category, contents or subject. However, queries often return inconsistent results, with document referrals that meet the search criteria but are of no interest to the user. While it may not currently be feasible to extract the full meaning of an HTML document, intelligent software agents have been developed that extract features from the words or structures of an HTML document and employ them to classify and categorize the documents. Under classification, the researcher attempts to assign a data item to a predefined category based on a model created from pre-classified training data (supervised learning). The goal of clustering is to separate a given group of data items (the data set) into groups called clusters, such that items in the same cluster are similar to each other and dissimilar to items in other clusters; in other words, to identify distinct groups in a dataset. The results of clustering can then be used to automatically formulate queries and search for other similar documents on the web, to organize bookmark files, or to construct a user profile.

In contrast to the highly structured tabular data upon which most machine learning methods are expected to operate, web and text documents are semi-structured. Web documents have well-defined structural elements such as letters, words, sentences, paragraphs, sections, punctuation marks, HTML tags and so forth. We know that words make up sentences, sentences make up paragraphs, and so on, but many of the rules governing the order in which the various elements are allowed to appear are vague or ill-defined and can vary dramatically between documents. It is estimated that as much as 85% of all digital business information, most of it web-related, is stored in non-structured formats (i.e., non-tabular formats, unlike those used in databases and spreadsheets). Developing improved methods of performing machine learning on this vast amount of non-tabular, semi-structured web data is therefore highly desirable.

Clustering and classification have been useful and active areas of machine learning research that promise to help us cope with the problem of information overload on the Internet. With clustering, the goal is to separate a given group of data items into clusters such that items in the same cluster are similar to each other and dissimilar to items in other clusters; no labeled examples are provided in advance for training (this is called unsupervised learning). Under classification, we attempt to assign a data item to a predefined category based on a model that is created from pre-classified training data (supervised learning). In more general terms, both clustering and classification come under the area of knowledge discovery in databases, or data mining. Applying data mining techniques to web page content is referred to as web content mining, a newer sub-area of web mining partially built upon the established field of information retrieval.

When representing text and web document content for clustering and classification, a vector-space model is typically used. In this model, each possible term that can appear in a document becomes a feature dimension. The value assigned to each dimension of a document may indicate the number of times the corresponding term appears in it, or it may be a weight that takes into account other frequency information, such as the number of documents in which the term appears. This model is simple and allows the use of traditional machine learning methods that deal with numerical feature vectors in a Euclidean feature space. However, it discards information such as the order in which the terms appear, where in the document the terms appear, how close the terms are to each other, and so forth. (A small sketch of this representation is given at the end of this section.)

In this seminar we examine the use of advanced data clustering techniques in algorithms that employ abstract categories for the pattern matching and pattern recognition procedures used in data mining searches of web documents. With the rapid advances in data mining software technology now taking place, website managers and search engine designers have begun to struggle to maintain efficiency in "mining" for patterns of information and user behavior. Part of the problem is the enormous amount of data being generated, making the search of web document databases in real time difficult. Real-time searching is critical for real-time problem solving, high-level documentation searches and prevention of database security breaches. The analysis of this problem will be followed by a detailed description of weaknesses in data mining methods, with suggestions for a reduction of pre-processing to improve the performance of search engine algorithms, and a recommendation of an optimum algorithm for this task.

The first investigators to give serious thought to the problem of algorithm speed were those conducting research in the area of database searches. The field is still in its infancy; most of the tools and techniques used for data mining today come from other related fields such as pattern recognition, statistics and complexity theory. Only recently have researchers from these various fields been interacting to solve mining and timing issues.
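Returning to the vector-space model described above, the following minimal sketch builds plain term-frequency vectors for a few invented documents; the vocabulary and documents are illustrative only.

# Minimal sketch of the vector-space (term-frequency) representation.
# The documents and the resulting vocabulary are hypothetical examples.
from collections import Counter

docs = [
    "web mining applies data mining to web data",
    "document clustering groups similar web documents",
    "search engines index web documents",
]

# Every distinct term becomes one feature dimension.
vocabulary = sorted({term for doc in docs for term in doc.split()})

def to_vector(text):
    # Map a document to a term-frequency vector over the vocabulary.
    counts = Counter(text.split())
    return [counts.get(term, 0) for term in vocabulary]

vectors = [to_vector(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)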

Overview of the Methodologies


The methodology presented in this seminar is experimental analysis, with the objective of testing the feasibility of abstract-category data clustering algorithms for a real-world web application. In order to perform this test, a group of six linear-time clustering algorithms will be applied to a sample group of online web documents, simulating the activities of a web search engine looking for similar words, phrases or sequences in a large database of web articles, publications and records. The six techniques compared will be the K-Means, Single Pass, Fractionation, Buckshot, Suffix Tree and AprioriAll clustering algorithms. The procedure will be to measure the execution time of the test algorithms in clustering data sets consisting of whole documents, excerpts and key words of a fixed quantity and size.

3.1 Purpose of the Study: The purpose of this study is to conduct research that will analyze and improve the use of data clustering techniques in creating abstract categories in algorithms, allowing data analysts to conduct more efficient large-scale searches. Increasing the efficiency of the search process requires a detailed knowledge of abstract categories, pattern matching techniques, and their relationship to search engine speed. Data mining involves the use of search engine algorithms looking for hidden predictive information, patterns and correlations within large databases. The technique of data clustering divides datasets into mutually exclusive groups. The distance between groups is measured with respect to all the available variables, rather than variables that are specific predictors, to produce "abstract categories" for analysis. Search engine algorithms and user audit trails are complex, leading to time-consuming quests for specific information. It is anticipated that the proposed study will identify the most efficient and effective data clustering algorithms for this purpose.

3.2 Background of the Problem: Data clustering is a technique employed for the purpose of analyzing statistical data sets. Clustering is the classification of objects with similarities into different groups. This is accomplished by partitioning data into groups, known as clusters, so that the elements in each cluster share some common trait, usually proximity according to a defined distance measure. Essentially, the goal of clustering is to identify distinct groups within a dataset, and then place the data within those groups according to their relationships with each other. Document clustering automatically groups documents into sets. There is no predetermined taxonomy; the taxonomy of the clusters is determined at run time. One of the main points of document clustering is that it isn't so much a matter of finding the documents on the web; that's what search engines are for. Search engines do a fantastic job by themselves when the user has a specific query and knows exactly what words will get the desired results. But the user gets a ranked list of questionable relevance, because the engine turns up anything that contains the query word, and the only metric is the number of times that word appears in each document. With a good clustering program, the user is instead presented with multiple avenues of inquiry, organized into broad groups that become more specific as selections are made. Clustering exploits similarities between the documents to be clustered. The similarity of two documents is computed as a function of the distance between the corresponding term vectors for those documents. Of the various measures used to compute this distance, the cosine measure has proved the most reliable and accurate.
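A minimal sketch of the cosine measure over two invented term-frequency vectors (the dot product normalized by the vector lengths) is given below.

# Minimal sketch of the cosine measure between two term vectors.
# The two vectors are hypothetical term-frequency vectors.
import math

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b (1.0 means identical direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc1 = [1, 2, 0, 0, 1]
doc2 = [3, 1, 2, 3, 0]
print(round(cosine_similarity(doc1, doc2), 3))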

Data clustering algorithms come in two basic types, as discussed above: hierarchical and partitional. In partitional clustering, searches are matched against the designated clusters, and the documents in the highest-scoring clusters are returned as the result. When a hierarchy processes a query, it moves down the tree along the highest-scoring branches until it reaches a predetermined stopping condition; the subtree where the stopping condition is satisfied is then returned as the result of the search. Clustering is a methodology for more effective search and retrieval over datasets. The principle is simple enough: documents with a high degree of similarity will tend to be retrieved by the same query, so by automatically placing the documents in groups based upon similarity (i.e., clusters), the search is effectively broadened.

Study of Algorithms: Of the six algorithms to be discussed here, three are hierarchical and three are partitional. The three hierarchical methods are the suffix tree, single pass and AprioriAll. The three partitional algorithms are k-means, buckshot and fractionation.

The suffix tree, as defined, is a compact representation of a trie corresponding to the suffixes of a given string, in which all nodes with one child are merged with their parents. It is a divisive method: it begins with the dataset as a whole and divides it into progressively smaller clusters, each composed of a node with suffixes branching off of it like leaves.

Single-pass clustering, on the other hand, is an agglomerative, or bottom-up, method. It begins with a single cluster and then analyzes each element in turn, either assigning it to an existing cluster or placing it in a new cluster, depending on the similarity threshold set by the analyst.

Suppose that we have the following set of documents and terms, and that we are interested in clustering the terms using the single-pass method (note that the same method can be used to cluster the documents, but in that case we would be using the document vectors (rows) rather than the term vectors (columns)):

        T1  T2  T3  T4  T5
Doc1     1   2   0   0   1
Doc2     3   1   2   3   0
Doc3     3   0   0   0   1
Doc4     2   1   0   3   0
Doc5     2   2   1   5   1

Start with T1 in a cluster by itself, say C1. At this point C1 contains only one item, T1, so the centroid of C1 is simply the vector for T1: C1 = <1, 3, 3, 2, 2>. Now compare (i.e., measure the similarity of) the next item, T2, to the centroids of all existing clusters. At this point we have only one cluster, C1 (we will use the dot product for simplicity): SIM(T2, C1) = 2*1 + 1*3 + 0*3 + 1*2 + 2*2 = 11. Now we need a pre-specified similarity threshold; let's say that our threshold is 10. This means that if the similarity of T2 to the cluster centroid is >= 10, then we add T2 to the cluster; otherwise we use T2 to start a new cluster. In this case, SIM(T2, C1) = 11 > 10, therefore we add T2 to cluster C1. We now need to compute the new centroid for C1 (which now contains T1 and T2). The centroid, which is the average of the vectors for T1 and T2, is: C1 = <3/2, 4/2, 3/2, 3/2, 4/2>

Now, we move to the next item, T3. Again, there is only one cluster, C1, so we only need to compare T3 with C1 centroid. The dot product of T3 and the above centroid is: SIM(T3, C1) = 0 + 8/2 + 0 + 0 + 4/2 = 6 This time, T3 does not pass the threshold test (the similarity is less than 10). Therefore, we use T3 to start a new cluster, C2. Now we have two clusters C1 = {T1, T2} C2 = {T3} We move to the next unclustered item, T4. Since we now have two clusters, we need to compute the MAX similarity of T4 to the 2 cluster centroids (note that the centroid of cluster C2 right now is just the vector for T3): SIM(T4, C1) = <0, 3, 0, 3, 5> . <3/2, 4/2, 3/2, 3/2, 4/2> = 0 + 12/2 + 0 + 9/2 + 20/2 = 20.5 SIM(T4, C2) = <0, 3, 0, 3, 5> . <0, 2, 0, 0, 1> = 0 + 6 + 0 + 0 + 5 = 11

Note that both similarity scores pass the threshold (10); however, we pick the MAX, and therefore T4 is added to cluster C1. Now we have the following: C1 = {T1, T2, T4}, C2 = {T3}. The centroid for C2 is still just the vector for T3:

C2 = <0, 2, 0, 0, 1> and the new centroid for C1 is now: C1 = <3/3, 7/3, 3/3, 6/3, 9/3> The only item left unclustered is T5. We compute its similarity to the centroids of existing clusters: SIM(T5, C1) = <1, 0, 1, 0, 1> . <3/3, 7/3, 3/3, 6/3, 9/3> = 3/3 + 0 + 3/3 + 0 + 9/3 = 5 SIM(T5, C2) = <1, 0, 1, 0, 1> . <0, 2, 0, 0, 1> = 0 + 0 + 0 + 0 +1 = 1

Neither of these similarity values passes the threshold. Therefore, T5 will have to go into a new cluster, C3. There are no more unclustered items, so we are done (after making a single pass through the items). The final clusters are: C1 = {T1, T2, T4}, C2 = {T3}, C3 = {T5}. Note: obviously, the results of this method are highly dependent on the similarity threshold that is used. You should use your judgment in setting this threshold so that you are left with a reasonable number of clusters.
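The worked example above can be reproduced with a short Python sketch of the single-pass procedure; the dot-product similarity, the threshold of 10 and the term vectors are taken directly from the example, while the helper names are of course our own.

# Minimal sketch of single-pass clustering, reproducing the worked example above.
# Similarity is the dot product and the threshold is 10, as in the text.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def single_pass(items, threshold):
    clusters = []  # each cluster is a list of (name, vector) pairs
    for name, vec in items:
        # Similarity of the item to every existing cluster centroid.
        sims = [dot(vec, centroid([v for _, v in c])) for c in clusters]
        if sims and max(sims) >= threshold:
            clusters[sims.index(max(sims))].append((name, vec))
        else:
            clusters.append([(name, vec)])  # start a new cluster
    return clusters

terms = [
    ("T1", [1, 3, 3, 2, 2]),
    ("T2", [2, 1, 0, 1, 2]),
    ("T3", [0, 2, 0, 0, 1]),
    ("T4", [0, 3, 0, 3, 5]),
    ("T5", [1, 0, 1, 0, 1]),
]

for cluster in single_pass(terms, threshold=10):
    print([name for name, _ in cluster])
# Expected output: ['T1', 'T2', 'T4'], then ['T3'], then ['T5'].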

The AprioriAll algorithm builds upon a principle called association learning. Association learning consists of discovering strong associations among transaction items or, by extension, among object variables and attributes. Examples include "customers who bought product A also bought product B" or "children of 4-5 years old with younger siblings are more likely to be kind to others". The purpose of association learning is to find rules that meet two user-defined conditions: minimum support and minimum confidence.

K-means derives its clusters from distance calculations over the elements of the dataset: it assigns each element to the closest centroid (the data point that is the mean of the values in each dimension of a set of multi-dimensional data points).

Buckshot is a hybrid clustering method that combines the partitional and hierarchical clustering methods. More precisely, it combines Hierarchical Agglomerative Clustering (HAC) and K-means clustering by using HAC to bootstrap K-means (a small sketch of this bootstrapping idea is given below). In the algorithm, the number of clusters, k, must first be selected. Next we create the starting clusters by randomly selecting N documents from the collection S and putting each of the N documents in its own cluster. We then compute the similarity between every pair of clusters and merge the two closest clusters into one, continuing to merge the two closest clusters until we are left with k clusters. The centroids of the k clusters are then used as the starting centroids for the k-means algorithm described above. The main reason for using HAC to select the k-means starting centroids is that it helps avoid selecting bad starting seeds. HAC is known to produce quality clusters but is often too costly to run on the whole dataset; by using it only to select the starting centroids for k-means, we get the benefits of both algorithms.

Fractionation is a more careful clustering algorithm which divides the dataset into smaller and smaller groups through successive iterations of the clustering subroutine. Fractionation requires more processing power, and therefore more time.
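As a rough illustration of the buckshot idea of bootstrapping k-means with HAC, the following sketch clusters a random sample hierarchically and uses the resulting centroids as k-means seeds; the data, sample size and parameter choices are illustrative assumptions rather than the exact procedure of any particular paper.

# Rough sketch of the buckshot idea: run HAC on a small random sample,
# then use the resulting cluster centroids as the starting seeds for k-means.
# The data and the sample size are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.random((200, 10))   # 200 hypothetical document vectors
k = 5
sample = docs[rng.choice(len(docs), size=40, replace=False)]

# Agglomerative clustering on the sample, cut into k flat clusters.
labels = fcluster(linkage(sample, method="average"), t=k, criterion="maxclust")
seeds = np.array([sample[labels == c].mean(axis=0) for c in range(1, k + 1)])

# k-means over the full collection, bootstrapped with the HAC seeds.
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(docs)
print(np.bincount(km.labels_))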
