Web Clustering Engine

Web clustering engines are an emerging trend in the field of data retrieval. They organize search results by topic, providing a complementary view to the flat ranked list returned by standard search engines.

Uploaded by FACTS Computer Software House L.L.C

WEB CLUSTERING ENGINES

Search Engine?
Search engines are an invaluable tool for
retrieving information from the Web. In
response to a user query, they return a list of
results ranked in order of relevance to the
query.
Examples: Google, Yahoo.

Flat Ranked vs. Clustered

Google (flat ranked search engine)

Northern Light (clustered search engine)

Why Web Clustering Engines?

Conventional engines are not very effective for
ambiguous queries.
The results that a conventional search engine returns
for such a query are mixed together in one list,
so irrelevant items appear among the relevant ones.

These systems group the results returned by a
search engine into a hierarchy of labeled
clusters (also called categories).
Web clustering engines:
1. Northern Light - predefined set of clusters
2. Credo Reference
3. KartOO
4. Eyeplorer

Main Advantages of the Cluster Hierarchy
It provides shortcuts to the items that relate to the
same meaning.
It allows better topic understanding.

Issues in the Implementation of Clusters

Short input data description.
Meaningful labels.
Selection of similarity measure.
Grouping of objects into clusters.
Computational efficiency.
Unknown number of clusters.
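
One of the issues listed above is the selection of a similarity measure. As an illustration only, the sketch below uses cosine similarity over raw word counts, which is one common choice for short texts. The function name `cosine_similarity` and the sample snippets are assumptions made for this example and do not come from the slides.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts, using raw word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that both texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical snippets for the ambiguous query "jaguar".
car = "jaguar car dealer"
cat = "jaguar big cat habitat"
print(cosine_similarity(car, car))  # identical texts
print(cosine_similarity(car, cat))  # texts that share only one word
```

Short snippets share few words, so scores between related results can be low; this is one reason the short input data description is listed as an issue.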

Architecture & Techniques

1. Search Results Acquisition

Provides input for the rest of the system.
Based on the query, the acquisition component
must deliver 50 to 500 results, each of which
should contain a title, a contextual snippet, and
the URL.
The source of search results can be any public
search engine, such as Google or Yahoo; the
results are fetched from that engine.
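
A minimal sketch of the acquisition step, assuming each result is stored as a small record with the three fields named above. `SearchResult`, `acquire`, and `fake_backend` are hypothetical names; `fake_backend` only stands in for a real search engine and does not call any actual service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SearchResult:
    title: str
    snippet: str
    url: str


def acquire(query: str,
            backend: Callable[[str, int], List[SearchResult]],
            limit: int = 100) -> List[SearchResult]:
    """Fetch up to `limit` results and drop those missing any field."""
    if not 50 <= limit <= 500:
        raise ValueError("limit should be between 50 and 500")
    results = backend(query, limit)
    return [r for r in results[:limit] if r.title and r.snippet and r.url]


def fake_backend(query: str, limit: int) -> List[SearchResult]:
    """Stand-in for a real search engine (purely illustrative)."""
    return [
        SearchResult(f"{query} result {i}",
                     f"Snippet about {query} #{i}",
                     f"https://example.com/{i}")
        for i in range(limit)
    ]


results = acquire("jaguar", fake_backend, limit=50)
print(len(results), results[0].title)
```

Passing the backend as a function keeps the rest of the pipeline independent of which search engine supplies the results.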

2. Preprocessing of Search Results

The primary aim is to convert the search results
into features.
Steps:
i. Language identification
ii. Tokenization
iii. Stemming
iv. Feature selection

ii. Tokenization:
The text of each search result is split into a
sequence of basic independent units called
tokens, each representing a word, number, or symbol.
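
Tokenization can be sketched with a single regular expression. This is an assumption about one reasonable splitting rule (runs of letters, runs of digits, or single symbols), not the rule used by any particular engine; `tokenize` is a hypothetical helper name.

```python
import re
from typing import List

# A token is a run of letters, a run of digits, or one other symbol.
# [^\W\d_] matches any Unicode letter; whitespace and underscores are skipped.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Top 10 web-clustering engines!"))
# ['Top', '10', 'web', '-', 'clustering', 'engines', '!']
```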

iii. Stemming:
Remove the inflectional prefixes and suffixes of
each word to reduce its different grammatical forms
to a common base form called a stem.
E.g.:
connected, connecting, interconnection → connect
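
The example above can be reproduced with a toy affix stripper. This is only a sketch: the prefix and suffix lists are assumptions chosen to cover the three example words, and the rules will mis-stem many other words (for instance, "interest"). Production systems typically rely on an established algorithm such as the Porter stemmer, which strips suffixes only.

```python
# Hypothetical affix lists, chosen only to cover the example words.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip at most one known prefix and one known suffix from a word."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem_len:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
            stem = stem[:-len(suffix)]
            break
    return stem


for w in ("connected", "connecting", "interconnection"):
    print(w, "->", toy_stem(w))  # each prints "... -> connect"
```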

iv. Feature selection:
Extract features for each search result present
in the input.
Features are atomic entities by which we can
describe an object and represent its most
important characteristics to an algorithm.
Features vary from single words to tuples of
words.
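
As a sketch of what features ranging from single words to tuples of words could look like, the function below emits every word n-gram up to a chosen length. The name `extract_features` and the `max_n` parameter are assumptions for illustration; a real engine would also filter and weight these candidates.

```python
from typing import List, Tuple


def extract_features(tokens: List[str], max_n: int = 2) -> List[Tuple[str, ...]]:
    """Return all word n-grams of length 1..max_n, shortest n-grams first."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i:i + n]))
    return features


print(extract_features(["web", "clustering", "engine"]))
# [('web',), ('clustering',), ('engine',),
#  ('web', 'clustering'), ('clustering', 'engine')]
```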

How can we represent a feature/text?

Vector Space Model (VSM)
A document d is represented in the VSM as a vector
$[w_{t_0}, w_{t_1}, \ldots, w_{t_n}]$
where $t_0, t_1, \ldots, t_n$ is a set of words/features
and $w_{t_i}$ is the weight/importance of feature $t_i$.
E.g.:
d = "Polly had a dog and the dog had Polly"

[Figure: VSM representation of d]
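
For the example document, a minimal sketch of a VSM vector follows. It assumes the simplest possible weight, the raw term frequency, with lowercased words as features; the slides do not say which weighting scheme they use.

```python
from collections import Counter

d = "Polly had a dog and the dog had Polly"
tokens = d.lower().split()

# Fix the feature order t_0..t_n by sorting the distinct words.
vocabulary = sorted(set(tokens))
counts = Counter(tokens)
# Assumed weight: w_ti is the number of times t_i occurs in d.
vector = [counts[t] for t in vocabulary]

for term, weight in zip(vocabulary, vector):
    print(f"{term}: {weight}")
# a: 1, and: 1, dog: 2, had: 2, polly: 2, the: 1 (one pair per line)
```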

3. Cluster Construction & Labelling

The set of search results, along with their features,
is input to the clustering algorithm,
which builds and labels the clusters.
Three types of algorithms:
1. Data-centric
2. Description-aware
3. Description-centric

Data-Centric Clustering Algorithm

It starts with an initial clustering of a collection of
documents into a set of k clusters (scatter).
At query time, the user selects the clusters of
interest (gather) and the system re-clusters
those documents.
The process repeats until a small cluster of
relevant documents is found.
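
The scatter and gather steps can be sketched as two small functions. The clustering inside `scatter` is a simple k-means-style pass over word counts, seeded with the first k documents; it was chosen for brevity and is an assumption, not the algorithm the slides refer to. All names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import List


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def scatter(docs: List[str], k: int, rounds: int = 5) -> List[List[str]]:
    """Partition docs into at most k non-empty clusters."""
    vectors = [Counter(d.lower().split()) for d in docs]
    centroids = [Counter(v) for v in vectors[:k]]  # seed with the first k docs
    assignment = [0] * len(docs)
    for _ in range(rounds):
        # Assign each document to its most similar centroid.
        assignment = [
            max(range(len(centroids)), key=lambda c: _cosine(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the sum of its members' word counts.
        centroids = [
            sum((v for v, a in zip(vectors, assignment) if a == c), Counter())
            for c in range(len(centroids))
        ]
    clusters = [[d for d, a in zip(docs, assignment) if a == c]
                for c in range(len(centroids))]
    return [c for c in clusters if c]


def gather(clusters: List[List[str]], chosen: List[int]) -> List[str]:
    """Merge the clusters the user selected into one document list."""
    return [doc for i in chosen for doc in clusters[i]]


docs = ["jaguar car price", "jaguar cat habitat",
        "jaguar car dealer", "jaguar cat diet"]
clusters = scatter(docs, k=2)       # scatter
selected = gather(clusters, [0])    # user picks the first cluster (gather)
print(clusters)
print(scatter(selected, k=2))       # re-cluster only the gathered documents
```

In an interactive system the last two steps would repeat, with the user choosing clusters each round, until the remaining cluster is small enough to read.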

Difficulties in Data-Centric Algorithms

These algorithms are not incremental in
nature: they cannot simply clean each document as it
arrives from the web and add it to the existing model.
They lack meaningful labels.

4. Visualization of Clustered Results

One prominent approach is based on hierarchical folders.
Clusty, CREDO, Lingo3G - hierarchical folder visualization
approach
Grokker - nesting and zooming approach
KartOO - graph-based interface
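
The hierarchical folder approach can be sketched as a labeled tree printed with indentation. The `Cluster` class, the `render` function, and the sample labels and URLs are all assumptions for illustration; they do not describe how any of the engines named above are built.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    label: str
    documents: List[str] = field(default_factory=list)
    children: List["Cluster"] = field(default_factory=list)

    def size(self) -> int:
        """Number of documents in this cluster and all of its sub-clusters."""
        return len(self.documents) + sum(c.size() for c in self.children)


def render(cluster: Cluster, depth: int = 0) -> List[str]:
    """Return one indented line per folder, parents before children."""
    lines = [f"{'  ' * depth}{cluster.label} ({cluster.size()})"]
    for child in cluster.children:
        lines.extend(render(child, depth + 1))
    return lines


root = Cluster("jaguar", children=[
    Cluster("Cars", documents=["https://example.com/1", "https://example.com/2"]),
    Cluster("Animals", children=[
        Cluster("Habitat", documents=["https://example.com/3"]),
    ]),
])
print("\n".join(render(root)))
# jaguar (3)
#   Cars (2)
#   Animals (1)
#     Habitat (1)
```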

THANK YOU
