Cmpsci 446 Search Engines

Search Engine Architecture
All slides CIIR, 2014-2016

Course topics (where are we?)
Overview
Architecture of a search engine
Data acquisition
Text representation
Indexing
Query processing
Ranking
Evaluation
Classification and clustering
IR and Search Engines
Information Retrieval
Relevance
-Effective ranking
Evaluation
-Testing and measuring
Information needs
-User interaction
Search Engines
Performance
-Efficient search and indexing
Incorporating new data
-Coverage and freshness
Scalability
-Growing with data and users
Adaptability
-Tuning for applications
Specific problems
-e.g. Spam
Figure 1.1
Search Engine Architecture
A software architecture consists of software
components, the interfaces provided by those
components, and the relationships between
them
describes a system at a particular level of abstraction
Architecture of a search engine determined by
two requirements
effectiveness (quality of results) and efficiency
(response time and throughput)
Indexing Process
Figure 2.1
Query Process
Figure 2.2
Indexing Process
Figure 2.1
Details: Text Acquisition
Crawler
Identifies and acquires documents for search
engine
Many types: web, enterprise, desktop
Web crawlers follow links to find documents
Must efficiently find huge numbers of web pages
(coverage) and keep them up-to-date (freshness)
Single site crawlers for site search
Topical or focused crawlers for vertical search
Document crawlers for enterprise and desktop
search
Follow links and scan directories
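The link-following behavior of a web crawler can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: `crawl`, `get_links`, and `TOY_WEB` are hypothetical names, and the in-memory link graph stands in for real page fetching.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl: visit each URL once, following links outward from the seeds."""
    frontier = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)            # URLs already queued, so nothing is fetched twice
    visited = []                 # order in which pages were fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):   # hypothetical fetch-and-extract-links step
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for the web (illustrative data only).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
```

Calling `crawl(["a"], lambda url: TOY_WEB.get(url, []))` visits every page reachable from the seed. A real crawler would also need politeness delays and revisit scheduling for freshness.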
Text Acquisition
Feeds
Real-time streams of documents
e.g., web feeds for news, blogs, video, radio, TV
RSS is common standard
RSS reader can provide new XML documents to search
engine
Conversion
Convert variety of documents into a consistent text
plus metadata format
e.g. HTML, XML, Word, PDF, etc. → XML
Convert text encoding for different languages
Using a Unicode standard like UTF-8
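As a rough sketch of both steps, the code below reads items from an RSS document with Python's standard `xml.etree.ElementTree` and re-encodes text to UTF-8. The names `SAMPLE_RSS`, `feed_items`, and `to_utf8` are illustrative, and the feed content is made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document (illustrative data, not a real feed).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def feed_items(rss_text):
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

def to_utf8(raw_bytes, source_encoding):
    """Re-encode bytes from a known source encoding into UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")
```

The conversion function assumes the source encoding is already known. Detecting it for arbitrary documents is a separate problem.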
Text Acquisition
Document data store
Stores text, metadata, and other related content
for documents
Metadata is information about document such as type
and creation date
Other content includes links, anchor text
Provides fast access to document contents for
search engine components
e.g. result list generation
Could use relational database system
More typically, a simpler, more efficient storage system
is used due to huge numbers of documents
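One way to picture the "simpler storage system" is a key-value store from document id to a compressed record. The sketch below is an assumption for illustration only: `DocumentStore` is a hypothetical class, and a real store would live on disk and be spread over many machines.

```python
import json
import zlib

class DocumentStore:
    """In-memory key-value store: document id -> compressed text plus metadata."""

    def __init__(self):
        self._records = {}

    def put(self, doc_id, text, **metadata):
        # Serialize text and metadata together, then compress to save space.
        record = json.dumps({"text": text, "metadata": metadata})
        self._records[doc_id] = zlib.compress(record.encode("utf-8"))

    def get(self, doc_id):
        # Decompress and parse the record on demand, e.g. for result list generation.
        raw = zlib.decompress(self._records[doc_id])
        return json.loads(raw.decode("utf-8"))
```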
Indexing Process
Figure 2.1
Text Transformation
Parser
Processing the sequence of text tokens in the
document to recognize structural elements
e.g., titles, links, headings, etc.
Tokenizer recognizes words in the text
must consider issues like capitalization, hyphens,
apostrophes, non-alpha characters, separators
Markup languages such as HTML, XML often used to
specify structure
Tags used to specify document elements
E.g., <h2> Overview </h2>
Document parser uses syntax of markup language (or other
formatting) to identify structure
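A small sketch of structure recognition and tokenizing, using Python's standard `html.parser`. `StructureParser` and `tokenize` are illustrative names, and the tokenizing policy shown (lowercase, split on non-alphanumerics) is only one of many possible choices.

```python
import re
from html.parser import HTMLParser

class StructureParser(HTMLParser):
    """Collects (tag, text) pairs so later stages know which element each text came from."""

    def __init__(self):
        super().__init__()
        self._open_tags = []   # stack of currently open tags
        self.elements = []     # (innermost tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Simplification: void tags such as <br> are not treated specially.
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if self._open_tags and self._open_tags[-1] == tag:
            self._open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            tag = self._open_tags[-1] if self._open_tags else None
            self.elements.append((tag, text))

def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())
```

With this policy, apostrophes and hyphens split words, so `tokenize("O'Neill's e-mail")` yields five tokens. That is exactly the kind of issue the tokenizer bullet above warns about.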
Text Transformation
Stopping
Remove common words
e.g., and, or, the, in
Some impact on efficiency and effectiveness
Can be a problem for some queries
Stemming
Group words derived from a common stem
e.g., computer, computers, computing, compute
Usually effective, but not for all queries
Benefits vary for different languages
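Stopping and stemming can be sketched in a few lines. The stopword list below contains only the slide's four examples, and `crude_stem` is a deliberately naive suffix stripper written for this illustration. It is not the Porter stemmer or any other published algorithm.

```python
STOPWORDS = {"and", "or", "the", "in"}   # the slide's examples; real lists are longer

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Strip one common English suffix, keeping at least three characters of stem."""
    for suffix in ("ing", "ers", "er", "es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token
```

All four words in the stemming example above reduce to the same stem, `comput`, under this rule. Such aggressive stripping will also conflate unrelated words, which is one reason stemming does not help every query.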
Text Transformation
Link Analysis
Makes use of links and anchor text in web pages
Link analysis identifies popularity and community
information
e.g., PageRank
Anchor text can significantly enhance the
representation of pages pointed to by links
Significant impact on web search
Less importance in other applications
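A minimal power-iteration sketch of PageRank, assuming the link graph is a dict in which every linked-to page also appears as a key. The damping factor 0.85 and the iteration count are conventional illustrative choices, not values taken from these slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Each page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank
```

For the toy graph `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page `c` ends up with the highest rank because two pages link to it.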
Text Transformation
Information Extraction
Identify classes of index terms that are important for
some applications
e.g., named entity recognizers identify classes such as
people, locations, companies, dates, etc.
Classifier
Identifies class-related metadata for documents or
part of documents
i.e., assigns labels to documents
Topics, reading levels, senGment, genre
Spam vs. non-spam
Non-content parts of documents, e.g. advertisements
Use depends on application
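As a very small illustration of information extraction, the sketch below tags date-like spans with a regular expression. Real named entity recognizers are typically statistical and cover far more classes. `DATE_PATTERN` and `extract_dates` are hypothetical names, and the pattern only matches ISO-style dates.

```python
import re

# Matches ISO-style dates such as 2014-09-02; real recognizers cover far more formats.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract_dates(text):
    """Return (start, end, label, matched_text) for every date-like span in the text."""
    return [(m.start(), m.end(), "DATE", m.group())
            for m in DATE_PATTERN.finditer(text)]
```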
Indexing Process
Figure 2.1
Index Creation
Document Statistics
Gathers counts and positions of words and other
features
Used in ranking algorithm
Weighting
Computes weights for index terms
Usually reflect importance of term in the document
Used in ranking algorithm
e.g., tf.idf weight
Combination of term frequency in document and
inverse document frequency in the collection
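One common form of the tf.idf weight multiplies the raw term count by log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below assumes that particular variant. Many others exist, and `tf_idf` is an illustrative name.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict of doc_id -> list of tokens. Returns doc_id -> {term: weight}."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))            # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)              # raw term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights
```

Under this variant, a term that occurs in every document gets weight zero, which reflects the idea that it does not help distinguish documents.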
Index Creation
Inversion
Core of indexing process
Converts document-term information to
term-document for indexing
Difficult for very large numbers of documents
Format of inverted file is designed for fast query
processing
Must also handle updates
Compression used for efficiency
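Inversion can be sketched as turning a document-to-terms mapping into a term-to-documents mapping. The in-memory version below ignores the scale and update problems noted above. `invert` and `delta_encode` are illustrative names, and delta encoding is shown as just one simple idea behind index compression.

```python
from collections import defaultdict

def invert(docs):
    """docs: doc_id -> list of tokens. Returns term -> sorted list of (doc_id, positions)."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            postings[term].setdefault(doc_id, []).append(position)
    return {term: sorted(by_doc.items()) for term, by_doc in postings.items()}

def delta_encode(doc_ids):
    """Replace sorted integer doc ids with gaps; small gaps need fewer bits than raw ids."""
    return [b - a for a, b in zip([0] + doc_ids, doc_ids)]
```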
Index Creation
Index Distribution
Distributes indexes across multiple computers
and/or multiple sites
Essential for fast query processing with large
numbers of documents
Many variations
Document distribution, term distribution, replication
P2P and distributed IR involve search across
multiple sites
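Document distribution can be illustrated by assigning each document to one shard with a stable hash. This is a sketch under assumptions: `shard_for` and `partition` are hypothetical names, and CRC32 is used only because it gives the same result on every run.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Document distribution: each document lives on exactly one shard, chosen by a stable hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def partition(doc_ids, num_shards):
    """Group document ids into num_shards lists according to shard_for."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

Term distribution would instead place whole posting lists on shards by term, and replication would keep copies of the same shard on several machines.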
Note outputs of Indexing Process
Figure 2.1
Query Process
Figure 2.2
Query Process
Figure 2.2
User Interaction
Query input
Provides interface and parser for query language
Query language used to describe complex queries
Using operators
Indicate special treatment for query text
Most web search query languages are very simple
Small number of operators
There are more complicated query languages
e.g., Boolean queries, Indri and Galago query languages
IR query languages also allow content and structure
specifications, but focus on content
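A toy Boolean query evaluator shows how operators give query text special treatment. The query syntax here (terms separated by `AND`, `OR`, `NOT`, evaluated strictly left to right) is invented for this sketch and is not the Indri or Galago query language.

```python
def boolean_search(query, index):
    """Evaluate a flat query such as 'fish AND tank' left to right.

    index maps each term to the set of doc ids containing it.
    NOT is treated as set difference ('fish NOT tank' = fish minus tank).
    """
    tokens = query.split()
    result = set(index.get(tokens[0].lower(), set()))
    for op, term in zip(tokens[1::2], tokens[2::2]):
        docs = index.get(term.lower(), set())
        if op == "AND":
            result &= docs
        elif op == "OR":
            result |= docs
        elif op == "NOT":
            result -= docs
        else:
            raise ValueError(f"unknown operator: {op}")
    return result
```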
User Interaction
Query transformation
Improves initial query, both before and after initial
search
Includes text transformation techniques used for
documents
Spell checking and query suggestion provide
alternatives to original query
Query expansion and relevance feedback modify
the original query with additional terms
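Two of these transformations can be sketched briefly. Spelling suggestion below leans on `difflib.get_close_matches` from the Python standard library, and query expansion uses a hand-made table of related terms. Both the function names and the related-terms table are assumptions for illustration.

```python
import difflib

def suggest_spelling(word, vocabulary, n=1):
    """Return up to n vocabulary words that look most similar to the given word."""
    return difflib.get_close_matches(word, vocabulary, n=n)

def expand_query(tokens, related_terms):
    """Append related terms (from a hypothetical lookup table) to the original query tokens."""
    expanded = list(tokens)
    for token in tokens:
        for extra in related_terms.get(token, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded
```

In practice the vocabulary and the related terms would usually come from the collection or from query logs rather than a fixed table.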
User Interaction
Results output
Constructs the display of ranked documents for a
query
Generates snippets to show how queries match
documents
Highlights important words and passages
Retrieves appropriate advertising in many
applications (related things)
May provide clustering and other visualization
tools
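Snippet generation with highlighting might look like the sketch below: take a window of text around the first query-term match and mark every match inside it. The `*...*` marking, the window size, and the name `snippet` are arbitrary choices for this example, and matching is by substring rather than whole words.

```python
import re

def snippet(text, query_terms, width=40):
    """Return a window of text around the first query-term match, with matches marked *like this*."""
    pattern = re.compile("|".join(re.escape(t) for t in query_terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text[:width]               # no match: fall back to the start of the text
    start = max(0, match.start() - width // 2)
    window = text[start:start + width]
    return pattern.sub(lambda m: "*" + m.group() + "*", window)
```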
Query Process
Figure 2.2
Ranking
Scoring
Calculates scores for documents using a ranking
algorithm
Core component of search engine
Basic form of score is $\sum_i q_i d_i$
$q_i$ and $d_i$ are query and document term weights for
term $i$
Many variations of ranking algorithms and
retrieval models
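The basic score is a sum over terms of query weight times document weight, which the following sketch computes directly from sparse term-weight dictionaries. `score` and `rank` are illustrative names, and the weights in any example are made up.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the query terms; terms missing from the document contribute 0."""
    return sum(q * doc_weights.get(term, 0.0) for term, q in query_weights.items())

def rank(query_weights, docs):
    """docs: doc_id -> {term: weight}. Returns doc ids sorted by descending score."""
    return sorted(docs, key=lambda d: score(query_weights, docs[d]), reverse=True)
```

Different retrieval models mostly differ in how the query and document weights are defined, not in this outer sum.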
Ranking
Performance optimization
Designing ranking algorithms for efficient
processing
Term-at-a-time vs. document-at-a-time processing
Safe vs. unsafe optimizations
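The difference between the two processing orders can be sketched as follows. Both functions compute the same scores and differ only in the order the work is done. The index format (term mapped to a list of `(doc_id, weight)` pairs) is an assumption for this example, and the document-at-a-time version uses dictionaries where a real engine would advance cursors through sorted posting lists.

```python
from collections import defaultdict

def term_at_a_time(query_terms, index):
    """Process one posting list fully before the next, keeping partial scores in accumulators."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    return dict(accumulators)

def document_at_a_time(query_terms, index):
    """Finish each document's score before moving on to the next document id."""
    lists = [dict(index.get(term, [])) for term in query_terms]
    scores = {}
    for doc_id in sorted(set().union(*lists)):
        scores[doc_id] = sum(postings.get(doc_id, 0.0) for postings in lists)
    return scores
```

Term-at-a-time needs an accumulator for every candidate document, while document-at-a-time can emit each final score as soon as it is computed.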
Distribution
Processing queries in a distributed environment
Query broker distributes queries and assembles
results
Caching is a form of distributed searching
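A query broker and a result cache can be sketched in a few lines. Here each shard is modeled as a plain function from a query to a list of `(score, doc_id)` pairs, which is an assumption of this example. `broker_search` and `cached` are hypothetical names.

```python
import heapq

def broker_search(query, shards, k=3):
    """Send the query to every shard and merge the per-shard results into one top-k list."""
    merged = []
    for search_shard in shards:           # each shard: query -> [(score, doc_id), ...]
        merged.extend(search_shard(query))
    return heapq.nlargest(k, merged)      # highest scores first

def cached(search):
    """Wrap a search function so repeated queries are answered from memory."""
    cache = {}
    def lookup(query):
        if query not in cache:
            cache[query] = search(query)
        return cache[query]
    return lookup
```

The sketch assumes scores from different shards are directly comparable, which in practice requires shards to share collection statistics or normalize their scores.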
Query Process
Figure 2.2
Evaluation
Logging
Logging user queries and interaction is crucial for
improving search effectiveness and efficiency
Query logs and clickthrough data used for query
suggestion, spell checking, query caching, ranking,
advertising search, and other components
Ranking analysis
Measuring and tuning ranking effectiveness
Performance analysis
Measuring and tuning system efficiency
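As one small example of putting a query log to use, the sketch below suggests completions for a prefix by counting how often each logged query was issued. The log format (a plain list of query strings) and the name `suggest_from_log` are assumptions for illustration.

```python
from collections import Counter

def suggest_from_log(prefix, query_log, n=3):
    """Rank logged queries that start with the prefix by how often they were issued."""
    counts = Counter(q for q in query_log if q.startswith(prefix))
    return [q for q, _ in counts.most_common(n)]
```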
How Does It Really Work?
This course explains these components of a
search engine in more detail
Often many possible approaches and techniques
for a given component
Focus is on the most important alternatives
i.e., explain a small number of approaches in detail
rather than many approaches
Importance based on research results and use in
actual search engines
Alternatives described in references
Course topics (where are we?)
Overview
Architecture of a search engine
Data acquisition
Text representation
Indexing
Query processing
Ranking
Evaluation
Classification and clustering
