
Document Clustering

Outline

- Introduction
- Background and Motivation
- Clustering Techniques
- Web Document Clustering
- Conclusion

1. Introduction

Document - A set of words, e.g. a research paper or a Web page.

Cluster - A group of similar objects.


Clustering - Expanding queries, by including terms.

2. Background and Motivation


Old information retrieval (IR) systems.


To browse a collection of documents or the results returned by a search engine.

To generate hierarchical clusters of documents automatically.


Document clustering was investigated mainly as a means of improving the performance of search engines.

3. Clustering Techniques
Hierarchical Agglomerative clustering

- Compute the similarity between all pairs of clusters and merge the two closest clusters; repeat.
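
The merge loop can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes documents are given as sparse term-count dictionaries, uses cosine similarity, and uses average-link as the cluster similarity. The slide fixes none of these choices, and the names `cosine` and `hac` are made up for the example.

```python
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two sparse term-count dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hac(vectors, num_clusters):
    """Merge the two most similar clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # start with singletons

    def avg_sim(c1, c2):
        # Average-link: mean pairwise similarity (an assumed choice).
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters (a < b by construction).
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

docs = [{"cat": 1, "dog": 1}, {"cat": 1, "dog": 2},
        {"stock": 1, "market": 1}, {"stock": 2, "market": 1}]
print(hac(docs, 2))
```

Recomputing every pairwise similarity on each pass keeps the sketch short but is slow; this is the kind of cost that motivates faster methods for Web search results.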


K-means clustering

- Based on the idea of using a center point (centroid) to represent each cluster.
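
A minimal sketch of the center-point idea, assuming documents have already been turned into equal-length numeric vectors. Seeding the centers with the first k points and running a fixed number of iterations are simplifications made for this example only; the function name `kmeans` is illustrative.

```python
def kmeans(points, k, iterations=20):
    """Alternate between assigning points to centers and moving centers."""
    centers = [list(p) for p in points[:k]]  # assumption: seed with first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centers = kmeans(points, 2)
print(assignment, centers)
```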

Clustering Techniques (Contd)


One pass clustering

- Start with a cluster of size one and compute its distance to all remaining nodes. Add the closest node to the cluster.
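
A literal reading of this description can be sketched as below. The slide does not say when to stop adding nodes or how the distance from a node to a cluster is measured, so the `max_dist` threshold and the nearest-member (single-link) distance are both assumptions of this example, as is the name `grow_cluster`.

```python
def grow_cluster(points, seed, max_dist):
    """Grow one cluster from a seed by repeatedly adding the closest node."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def gap(i, cluster):
        # Assumed node-to-cluster distance: distance to the nearest member.
        return min(dist(points[i], points[m]) for m in cluster)

    cluster = [seed]  # the cluster starts with size one
    remaining = [i for i in range(len(points)) if i != seed]
    while remaining:
        nearest = min(remaining, key=lambda i: gap(i, cluster))
        if gap(nearest, cluster) > max_dist:  # assumed stopping rule
            break
        cluster.append(nearest)
        remaining.remove(nearest)
    return cluster

points = [[0, 0], [1, 0], [2, 0], [10, 0]]
print(grow_cluster(points, seed=0, max_dist=1.5))
```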

Buckshot clustering

Suffix tree clustering

4. Web Document Clustering


Introduction

- Applied to the small set of documents returned in response to a query.

- Model [diagram: user query, search engine, clustering engine, clusters]

Introduction (Contd)

Key requirements for Web document clustering methods

- Relevance
- Browsable summaries
- Overlap
- Snippet-tolerance
- Speed
- Incrementality

Suffix Tree Clustering


Three logical steps

- Step 1: document cleaning
- Step 2: identifying base clusters using a suffix tree
- Step 3: combining these base clusters into clusters

Suffix Tree Clustering (contd)


Step 1 Document Cleaning

- Transformation of the string of text representing each document.
- Marking of sentence boundaries and stripping of non-word tokens.
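
The cleaning step might look like the sketch below. The regular expressions, the lowercasing, and the treatment of HTML tags and numbers as non-word tokens are assumptions of this example; no stemming is attempted. The name `clean_document` is illustrative.

```python
import re

def clean_document(text):
    """Return a list of sentences, each a list of lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)    # strip HTML tags
    sentences = re.split(r"[.!?]+", text)   # mark sentence boundaries
    cleaned = []
    for sentence in sentences:
        # Keep alphabetic tokens only; numbers and punctuation are dropped.
        words = re.findall(r"[a-z]+", sentence.lower())
        if words:
            cleaned.append(words)
    return cleaned

print(clean_document("<b>Cat ate cheese.</b> Mouse ate cheese too! 42"))
```

Keeping sentence boundaries matters for the next step, because shared phrases should not span two sentences.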

Suffix Tree Clustering (contd)


Step 2 Identifying Base Clusters

- A suffix tree of a string S is a rooted, directed tree.
- Each internal node has at least 2 children.
- Each edge is labeled with a non-empty substring of S.
- The label of a node is the concatenation of the edge labels on the path from the root to that node.
- For each suffix s of S, there exists a suffix-node whose label equals s.
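
To show what a base cluster is without building a real suffix tree, the sketch below swaps in a naive phrase index: it enumerates every word sequence of up to `max_len` words in each sentence and records which documents contain it. This is not the suffix tree construction and does not have its time complexity. It also reports every shared phrase, where the tree would only report those that label a node. The parameters `max_len` and `min_docs` and the function name `base_clusters` are assumptions of this example. Each document is assumed to be a list of sentences, each a list of word tokens.

```python
from collections import defaultdict

def base_clusters(documents, max_len=3, min_docs=2):
    """Map each shared phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, sentences in enumerate(documents):
        for words in sentences:
            for start in range(len(words)):
                stop = min(start + max_len, len(words))
                for end in range(start + 1, stop + 1):
                    index[tuple(words[start:end])].add(doc_id)
    # A base cluster is a phrase shared by at least min_docs documents.
    return {p: d for p, d in index.items() if len(d) >= min_docs}

documents = [
    [["cat", "ate", "cheese"]],
    [["mouse", "ate", "cheese", "too"]],
    [["cat", "ate", "mouse", "too"]],
]
for phrase, doc_ids in sorted(base_clusters(documents).items()):
    print(" ".join(phrase), sorted(doc_ids))
```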

Suffix Tree Clustering (contd)

Step 3 Combining Base Clusters

- The document sets of distinct base clusters may overlap and may even be identical.
- This step merges base clusters with a high overlap in their document sets.

Suffix Tree Clustering (contd)

A binary similarity measure

- Given 2 base clusters Bm and Bn, with sizes |Bm| and |Bn|, let |Bm ∩ Bn| be the number of documents common to both clusters.
- The similarity of Bm and Bn is defined to be 1 iff
  - |Bm ∩ Bn| / |Bm| > 0.5 and
  - |Bm ∩ Bn| / |Bn| > 0.5
- Otherwise, their similarity is defined to be 0.
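
The measure translates directly into code. For the merge itself, the sketch below assumes that base clusters linked by similarity 1, directly or through a chain of other base clusters, end up in the same final cluster (connected components of the similarity graph). The slides only say that highly overlapping base clusters are merged, so treat that rule, and the names `similar` and `merge_base_clusters`, as assumptions. Base clusters are assumed to be non-empty sets of document ids.

```python
def similar(bm, bn):
    """Binary similarity: 1 iff the overlap exceeds half of each cluster."""
    common = len(bm & bn)
    return 1 if common / len(bm) > 0.5 and common / len(bn) > 0.5 else 0

def merge_base_clusters(base):
    """Union the document sets of base clusters connected by similarity 1."""
    seen = set()
    clusters = []
    for start in range(len(base)):
        if start in seen:
            continue
        seen.add(start)
        stack, merged = [start], set()
        while stack:  # depth-first walk over one connected component
            i = stack.pop()
            merged |= base[i]
            for j in range(len(base)):
                if j not in seen and similar(base[i], base[j]):
                    seen.add(j)
                    stack.append(j)
        clusters.append(merged)
    return clusters

base = [{0, 1}, {0, 1, 2}, {5, 6}]
print(merge_base_clusters(base))
```

Note that the threshold is strict: two base clusters of two documents each that share exactly one document have a ratio of 0.5 and are not similar.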

Suffix Tree Clustering (contd)

Experiments

Effectiveness for information retrieval

Snippets vs. Whole document

- Web documents contained 760 words on average.
- Snippets contained 50 words on average.

Execution Time

Pros and Cons

Pros

- Clustering can give the user an overview of the contents of a document collection
- It can also reduce the search space

Cons

- Computationally expensive
- Difficult to identify which cluster or clusters should be searched

Conclusion

- The identification of the unique requirements of document clustering of Web search engine results.
- The definition of STC, an incremental, O(n) time clustering algorithm that satisfies these requirements.
- The first experimental evaluation of clustering algorithms on Web search engine results, forming a baseline for future work.

Questions & Answers Session
