Document Clustering

Document clustering techniques are used to group similar documents together to improve search engine performance. Common techniques include hierarchical agglomerative clustering, k-means clustering, and suffix tree clustering. Suffix tree clustering identifies base clusters using a suffix tree and then combines base clusters with high overlap. Experiments show suffix tree clustering is effective for information retrieval by giving users an overview of a document collection and reducing the search space, though it is computationally expensive.

Uploaded by

SHAIK CHAND PASHA

Document Clustering

Outline

- Introduction
- Background and Motivation
- Clustering Techniques
- Web Document Clustering
- Conclusion

1. Introduction

Document - A set of words, e.g. a research paper or a Web page.

Cluster - A group of similar objects.

Clustering - The grouping process itself; used, for example, to expand queries by including additional terms.

2. Background and Motivation

First used in older information retrieval (IR) systems.

To browse a collection of documents or the results returned by a search engine.

To generate hierarchical clusters of documents automatically.

Investigated mainly as a means of improving the performance of search engines.

3. Clustering Techniques
Hierarchical Agglomerative clustering

- Compute similarity and merge closest two clusters.
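
The merge loop described above can be sketched as follows. This is a minimal illustration, not the presentation's own method: it assumes documents are represented as word sets, uses Jaccard similarity, and uses single-link similarity between clusters. The function names and the stopping parameter `k` are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two word sets (assumed document model)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_similarity(c1, c2, docs):
    """Single-link: similarity of the most similar cross-cluster pair."""
    return max(jaccard(docs[i], docs[j]) for i in c1 for j in c2)

def agglomerative(docs, k):
    """Start with one cluster per document; merge the two most
    similar clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]], docs),
        )
        # a < b always holds, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

docs = [
    {"suffix", "tree", "clustering"},
    {"suffix", "tree", "index"},
    {"k", "means", "centroid"},
    {"k", "means", "clustering", "centroid"},
]
result = agglomerative(docs, 2)
print(result)
```

Recomputing every pairwise similarity on each pass keeps the sketch short but is slow; it is only meant to show the compute-and-merge step.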

K-means clustering

- Based on the idea of center point to represent a cluster.
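
A minimal sketch of the center-point idea, assuming points are numeric tuples, squared Euclidean distance, and a naive initialization that takes the first k points as centers. These choices, the fixed iteration count, and the function names are assumptions for illustration only.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    centers = list(points[:k])  # assumption: naive initialization
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers, groups

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, groups = kmeans(points, 2)
print(centers)
```

For document clustering the points would typically be term-weight vectors rather than 2-D coordinates, but the assign-then-recenter loop is the same.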

Clustering Techniques (Contd)

One pass clustering

- Start with a cluster of size one and compute the distance to all remaining nodes. Add the closest node to the cluster.
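
The step above can be sketched as follows. The slide does not say when to stop adding nodes, so this sketch assumes a hypothetical `max_distance` threshold; it also assumes 1-D numeric points and measures the distance from a node to the cluster as the distance to its nearest member. All names are illustrative.

```python
def grow_cluster(points, seed_index, max_distance):
    """Grow one cluster from a single seed by repeatedly adding
    the closest remaining node (assumed stopping rule: threshold)."""
    cluster = [seed_index]
    remaining = [i for i in range(len(points)) if i != seed_index]

    def dist_to_cluster(i):
        # Distance from node i to its nearest cluster member.
        return min(abs(points[i] - points[j]) for j in cluster)

    while remaining:
        closest = min(remaining, key=dist_to_cluster)
        if dist_to_cluster(closest) > max_distance:
            break  # assumption: stop once the closest node is too far
        cluster.append(closest)
        remaining.remove(closest)
    return cluster

points = [1.0, 1.5, 2.2, 9.0, 9.4]
cluster = grow_cluster(points, seed_index=0, max_distance=1.0)
print(cluster)
```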

Buckshot clustering

Suffix tree clustering

4. Web Document Clustering

Introduction

- Applied to the small set of documents returned in response to a query.

- Model (diagram): User query -> Search Engine -> Clustering Engine -> clusters

Introduction (Contd)

Key requirements for Web Document Clustering methods

- Relevance
- Browsable Summaries
- Overlap
- Snippet-tolerance
- Speed
- Incrementality

Suffix Tree Clustering

Three logical steps

- Step 1 - document cleaning
- Step 2 - identifying base clusters using a suffix tree
- Step 3 - combining these base clusters into clusters

Suffix Tree Clustering (contd)

Step 1 Document Cleaning

- Transformation of the string of text representing each document.
- Marking of sentence boundaries and stripping of non-word tokens.
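
A minimal sketch of this cleaning step using Python's `re` module. The exact rules are assumptions: sentences are split on `.`, `!` and `?`, HTML-like tags and anything other than ASCII letters are treated as non-word tokens, and words are lowercased. The function name is hypothetical.

```python
import re

def clean_document(text):
    """Return a list of sentences, each a list of lowercase word tokens."""
    # Assumption: drop HTML-like tags before tokenizing.
    text = re.sub(r"<[^>]+>", " ", text)
    cleaned = []
    # Mark sentence boundaries by splitting on end-of-sentence punctuation.
    for sentence in re.split(r"[.!?]+", text):
        # Keep alphabetic tokens only; numbers and punctuation are stripped.
        words = re.findall(r"[A-Za-z]+", sentence.lower())
        if words:
            cleaned.append(words)
    return cleaned

sentences = clean_document("Cat ate <b>cheese</b>. Mouse ate cheese too! 42")
print(sentences)
```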

Suffix Tree Clustering (contd)

Step 2 Identifying Base Clusters

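A suffix tree of a string S has the following properties:

- It is a rooted, directed tree.
- Each internal node has at least 2 children.
- Each edge is labeled with a non-empty substring of S.
- The label of a node is the concatenation of the edge labels on the path from the root to that node.
- For each suffix s of S, there exists a suffix-node whose label equals s.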
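
To illustrate what a base cluster is, the sketch below swaps the suffix tree for a naive enumeration of every word-level phrase in every sentence, which is far slower. It records which documents contain each phrase and keeps phrases shared by at least two documents. Because it is not a real suffix tree, it also reports phrases that would not be tree nodes (for example a phrase that always occurs as the prefix of one longer phrase). The function name, input format and `min_docs` parameter are assumptions.

```python
from collections import defaultdict

def base_clusters(docs, min_docs=2):
    """Map each shared phrase (tuple of words) to the set of ids of
    the documents containing it. docs[i] is a list of sentences,
    each sentence a list of words."""
    phrase_docs = defaultdict(set)
    for doc_id, sentences in enumerate(docs):
        for words in sentences:
            # Every prefix of every suffix of the sentence is a phrase.
            for start in range(len(words)):
                for end in range(start + 1, len(words) + 1):
                    phrase_docs[tuple(words[start:end])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

docs = [
    [["cat", "ate", "cheese"]],
    [["mouse", "ate", "cheese", "too"]],
    [["cat", "ate", "mouse", "too"]],
]
clusters = base_clusters(docs)
for phrase, members in sorted(clusters.items()):
    print(" ".join(phrase), sorted(members))
```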

Suffix Tree Clustering (contd)

Step 3 Combining Base Clusters

- Distinct base clusters may have overlapping or even identical document sets.
- Merges base clusters with a high overlap in their document sets.

Suffix Tree Clustering (contd)

A binary similarity measure

- Given 2 base clusters Bm and Bn, with sizes |Bm| and |Bn|, let |Bm ∩ Bn| be the number of documents common to both clusters.

- The similarity of Bm and Bn is defined to be 1 iff
  |Bm ∩ Bn| / |Bm| > 0.5 and
  |Bm ∩ Bn| / |Bn| > 0.5.
- Otherwise, their similarity is defined to be 0.
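
The binary similarity measure can be sketched directly from the definition above. The merge step shown here is an assumption about how "combining" is carried out: base clusters are treated as graph nodes, an edge joins two base clusters whose similarity is 1, and each connected component becomes one final cluster. Function names are hypothetical, and base clusters are assumed to be non-empty sets of document ids.

```python
def similar(bm, bn, threshold=0.5):
    """Binary similarity: True (1) iff the overlap exceeds the
    threshold relative to both cluster sizes, else False (0)."""
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold

def merge_base_clusters(base):
    """Union the document sets of each connected component of the
    graph whose edges join similar base clusters (assumed merge rule)."""
    seen = set()
    merged = []
    for start in range(len(base)):
        if start in seen:
            continue
        seen.add(start)
        stack, docs = [start], set()
        while stack:  # depth-first walk over one component
            i = stack.pop()
            docs |= base[i]
            for j in range(len(base)):
                if j not in seen and similar(base[i], base[j]):
                    seen.add(j)
                    stack.append(j)
        merged.append(docs)
    return merged

base = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {8, 9}]
merged = merge_base_clusters(base)
print(merged)
```

In this example the first two base clusters share 2 of 3 documents each, so they merge; the last two share 1 of 2, which is exactly 0.5 and therefore not above the threshold.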

Suffix Tree Clustering (contd)

Experiments

Effectiveness for information retrieval

Snippets vs. Whole document

- Web documents contained 760 words on average.
- Snippets contained 50 words on average.

Execution Time

Pros and Cons

Pros

- Clustering can give the user an overview of the contents of a document collection
- Can also reduce the search space

Cons

- Computationally expensive
- Difficult to identify which cluster or clusters should be searched

Conclusion
The identification of the unique requirements of document clustering of Web search engine results.

The definition of STC, an incremental, O(n) time clustering algorithm that satisfies these requirements.

The first experimental evaluation of clustering algorithms on Web search engine results, forming a baseline for future work.

Questions & Answers Session
