
WEB CLUSTERING ENGINES

What Is a Search Engine?
Search engines are an invaluable tool for
retrieving information from the Web. In
response to a user query, they return a list of
results ranked in order of relevance to the
query.
E.g.: Google, Yahoo, etc.

Flat Ranked vs. Clustered

Google (Flat Ranked Search Engine)

Northern Light (Clustered Search Engine)

Why Web Clustering Engines?


Conventional engines are not very effective on
ambiguous queries.
The results returned by a conventional search
engine for such a query mix the different meanings
together in one list, so irrelevant items appear
among the relevant ones.

These systems group the results returned by a
search engine into a hierarchy of labeled
clusters (also called categories).
Web clustering engines:
1. Northern Light - predefined set of clusters
2. Credo Reference
3. KartOO
4. Eyeplorer

Main advantages of the cluster hierarchy

It provides shortcuts to the items that relate to the
same meaning.
It allows better topic understanding.

Issues in Implementation of Clusters

Short input data description.


Meaningful labels.
Selection of similarity measure.
Grouping of objects into clusters.
Computational efficiency.
Unknown number of clusters.

Architecture & Techniques

1. Search Results Acquisition


Provides input for the rest of the system.
Based on the query, the acquisition component
must deliver 50 to 500 results, each of which
should contain a title, a contextual snippet, and
the URL.
The source of search results can be any public
search engine, such as Google or Yahoo.
The component works by fetching results from these
other search engines.
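
The record delivered by the acquisition component can be sketched as below. This is only an illustration: the names SearchResult and collect_results, and the dictionary keys of the raw input, are assumptions made for the example and do not come from any particular search engine API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """One search result: the three fields named on the slide."""
    title: str
    snippet: str
    url: str


def collect_results(raw_results, max_results=500):
    """Keep only complete results, stopping once max_results are collected.

    raw_results is assumed to be an iterable of dicts with the
    hypothetical keys "title", "snippet" and "url".
    """
    results = []
    for raw in raw_results:
        if len(results) >= max_results:
            break
        title, snippet, url = raw.get("title"), raw.get("snippet"), raw.get("url")
        # Skip results that lack any of the three required fields.
        if title and snippet and url:
            results.append(SearchResult(title, snippet, url))
    return results


raw = [
    {"title": "Jaguar cars", "snippet": "Luxury vehicles", "url": "http://a.example"},
    {"title": "No snippet here", "url": "http://b.example"},
]
results = collect_results(raw)
```

Here the second raw entry is dropped because it has no snippet, so only one record reaches the preprocessing stage.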

2. Preprocessing of Search Results

The primary aim is to convert the search results
into features.
Steps:
i. Language identification
ii. Tokenization
iii. Stemming
iv. Feature selection

ii. Tokenization:
The text of each search result is split into a
sequence of basic independent units called
tokens, each representing a word, number, or symbol.
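
A minimal sketch of the tokenization step, assuming a simple regular-expression tokenizer is acceptable. The pattern and the function name tokenize are choices made for this example; real engines may use language-specific tokenizers.

```python
import re

# A token is a run of letters, a run of digits, or a single symbol.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def tokenize(text):
    """Split text into word, number, and symbol tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Polly had 2 dogs!")
# tokens == ['Polly', 'had', '2', 'dogs', '!']
```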

iii. Stemming:
Remove the inflectional prefixes and suffixes of
each word to reduce the different grammatical forms of
the word to a common base form called a stem.
E.g.:
connected, connecting & interconnection -> connect
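
The example above can be reproduced with a toy affix stripper. This is a deliberately naive sketch, not the Porter stemmer or any other published algorithm: the affix lists and the min_stem_len guard are assumptions chosen so that the slide's three words reduce to the same stem.

```python
# Tiny, hand-picked affix lists (assumptions for this illustration only).
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed")


def toy_stem(word, min_stem_len=4):
    """Strip at most one known prefix and one known suffix from word."""
    stem = word.lower()
    for prefix in PREFIXES:
        # Only strip if enough of the word remains afterwards.
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem_len:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
            stem = stem[:-len(suffix)]
            break
    return stem


stems = [toy_stem(w) for w in ("connected", "connecting", "interconnection")]
# stems == ['connect', 'connect', 'connect']
```

The length guard keeps short words such as "king" intact, but a rule set this small will still mis-stem many words.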

iv. Feature selection:
Extract features for each search result present
in the input.
Features are atomic entities by which we can
describe an object and represent its most
important characteristics to an algorithm.
Features vary from single words to tuples of
words.
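
As a sketch of features that range from single words to tuples of words, the function below emits every run of 1 to max_n consecutive tokens. The name extract_features and the choice of consecutive-word tuples are assumptions for illustration; the slide does not say how tuples are formed.

```python
def extract_features(tokens, max_n=2):
    """Return all tuples of 1..max_n consecutive tokens, shortest first."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i:i + n]))
    return features


features = extract_features(["web", "clustering", "engine"])
# features == [('web',), ('clustering',), ('engine',),
#              ('web', 'clustering'), ('clustering', 'engine')]
```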

How can we represent a feature/text?

Vector Space Model (VSM)
Document d is represented in the VSM as a vector
[w_t0, w_t1, ..., w_tn],
where t0, t1, ..., tn is a set of words/features
and w_ti is the weight/importance of feature ti.
E.g.:
d: "Polly had a dog and the dog had Polly"

VSM representation
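
For the example sentence, one possible VSM vector can be computed as follows. The slide does not state the weighting scheme, so this sketch assumes raw term frequency as the weight and an alphabetically sorted vocabulary; other schemes such as tf-idf would give different numbers.

```python
from collections import Counter


def to_vector(tokens, vocabulary):
    """Weight each vocabulary term by how often it occurs in tokens."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


document = "Polly had a dog and the dog had Polly"
tokens = document.lower().split()
vocabulary = sorted(set(tokens))
vector = to_vector(tokens, vocabulary)
# vocabulary == ['a', 'and', 'dog', 'had', 'polly', 'the']
# vector     == [1, 1, 2, 2, 2, 1]
```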

3. Cluster Construction & Labelling

The set of search results, along with their features,
is the input to the clustering algorithm,
which builds the clusters and labels them.
Three types of algorithms:
1. Data-centric
2. Description-aware
3. Description-centric

Data-Centric Clustering Algorithm

Scatter: the collection of documents is initially
clustered into a set of k clusters.
Gather: at query time, the user selects the clusters of
interest and the system re-clusters
the documents in them.
The process repeats until a small cluster of
relevant documents is found.
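
The scatter and gather steps can be sketched as below. The clustering used here is a stand-in chosen for brevity, a single assignment pass that uses the first k documents as seeds and cosine similarity over term counts; it is not the clustering procedure of any specific published system, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def scatter(docs, k):
    """Split docs into at most k clusters around the first k docs (seeds)."""
    vectors = [Counter(d.lower().split()) for d in docs]
    seeds = vectors[:k]
    clusters = [[] for _ in seeds]
    for doc, vec in zip(docs, vectors):
        # Assign each document to its most similar seed.
        best = max(range(len(seeds)), key=lambda i: cosine(vec, seeds[i]))
        clusters[best].append(doc)
    return clusters


def gather(clusters, chosen):
    """Merge the clusters the user selected into one smaller collection."""
    return [doc for i in chosen for doc in clusters[i]]


docs = ["jaguar car speed", "jaguar cat jungle",
        "car engine speed", "cat jungle animal"]
clusters = scatter(docs, k=2)          # scatter: initial clustering
selected = gather(clusters, [1])       # gather: user picks cluster 1
refined = scatter(selected, k=2)       # re-cluster the selected documents
```

In an interactive system the scatter and gather calls would alternate, driven by the user's selections, until the remaining cluster is small enough to read.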

Difficulties in Data-Centric Algorithms

These algorithms are not incremental in
nature: an incremental algorithm would clean each
document as it arrives from the web and add it to
the available model.
Meaningful labels are missing.

4. Visualization of Clustered Results
One prominent approach is based on hierarchical folders.
Clusty, CREDO, Lingo3G - hierarchical folder visualization
approach
Grokker - nesting and zooming approach
KartOO - graph-based interfaces

THANK YOU
