
Search Engines

How do they work?
Which of these is a Search Engine?
Purpose of Search Engines
Let people find information
Convert their need into a query
In the form of a web page or other format
In the shortest time possible
Searching is different from database queries
Structured data tends to refer to information in tables.
Unstructured data typically refers to free text, which allows:
Keyword queries including operators
More sophisticated concept queries, e.g., find all web pages dealing with drug abuse
Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
These days we frequently think first of web search, but there are many other cases:
E-mail search
Searching your laptop
Corporate knowledge bases
Legal information retrieval

A Search Engine: Google
[Diagram: the Google search pipeline — on the offline side, Web → Crawling → Indexing/Sorting → Page Ranker; on the serving side, Query → Query Parser → Query Engine → Relevance Ranker → Formatter.]
How does Crawling work?
Begin with known seed URLs
Fetch and parse them
Extract the URLs they point to
Place the extracted URLs on a queue
Fetch each URL on the queue and repeat
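The crawl loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not a real crawler: the "web" here is an in-memory dict mapping URL → outgoing links, standing in for actual HTTP fetching and parsing, and the URLs are made up.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs it links to.
FAKE_WEB = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": ["http://a.example"],
}

def crawl(seed_urls, web=FAKE_WEB):
    frontier = deque(seed_urls)   # queue of URLs waiting to be fetched
    seen = set(seed_urls)         # never fetch the same URL twice
    order = []
    while frontier:
        url = frontier.popleft()          # fetch the next URL on the queue
        order.append(url)
        for link in web.get(url, []):     # "parse" it and extract URLs
            if link not in seen:          # place unseen URLs on the queue
                seen.add(link)
                frontier.append(link)
    return order

print(crawl(["http://a.example"]))
# -> ['http://a.example', 'http://b.example', 'http://c.example']
```

The `seen` set is what keeps the loop from revisiting pages when the link graph contains cycles, as it does here (C links back to A).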
Web Crawlers
[Diagram: seed pages feed the URL frontier; URLs are crawled and parsed, with the unseen web beyond.]
What a crawler MUST do
Be Polite: Respect implicit and explicit politeness considerations
Only crawl allowed pages
Respect robots.txt (more on this shortly)
Be Robust: Be immune to spider traps and other malicious behavior from web servers
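The "only crawl allowed pages" rule is usually enforced by checking robots.txt before fetching. A sketch using the Python standard library's parser; the rules and URLs below are made-up examples:

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt that disallows the /private/ subtree.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A polite crawler calls can_fetch() before fetching each URL.
print(rp.can_fetch("MyCrawler", "http://example.com/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/page.html"))  # False
```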
What any crawler SHOULD do
Be capable of distributed operation: designed to run on multiple distributed machines
Be scalable: designed to increase the crawl rate by adding more machines
Performance/efficiency: permit full use of available processing and network resources
Fetch pages of higher quality first
Continuous operation: continue fetching fresh copies of a previously fetched page
Extensible: adapt to new data formats and protocols
Updated Crawling Picture
[Diagram: multiple crawling threads consume the URL frontier; seed pages and URLs crawled and parsed, with the unseen web beyond.]
Parsing a
document
WhaL formaL ls lL ln?
pdf/word/excel/hLml?
WhaL language ls lL ln?
WhaL characLer seL ls ln use?
(C1232, u1l-8, .)

How good are the retrieved docs?
Precision: fraction of retrieved docs that are relevant to the user's information need
Recall: fraction of relevant docs in the collection that are retrieved
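The two measures can be computed directly from the sets of retrieved and relevant documents. A toy example with hypothetical doc IDs:

```python
retrieved = {1, 2, 3, 4}        # docs the engine returned
relevant  = {2, 4, 5, 6}        # docs that actually satisfy the information need

true_positives = retrieved & relevant          # {2, 4}
precision = len(true_positives) / len(retrieved)   # 2/4 = 0.5
recall    = len(true_positives) / len(relevant)    # 2/4 = 0.5
print(precision, recall)  # 0.5 0.5
```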
Indexing
Once we have crawled all the pages, we need to index them to make retrieval easier.

But how does one index all the pages?
Term-document incidence matrices

            Antony  Julius  Tempest  Hamlet  Othello  Macbeth
Antony         1       1       0        0       0        1
Brutus         1       1       0        1       0        0
Caesar         1       1       0        1       1        1
Calpurnia      0       1       0        0       0        0
Cleopatra      1       0       0        0       0        0
mercy          1       0       1        1       1        1
worser         1       0       1        1       1        0

Entry (t, d) is 1 if term t occurs in document d, and 0 otherwise.
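With the incidence matrix, each term is a 0/1 vector over the documents, and a Boolean query becomes bitwise arithmetic on those vectors. A sketch using rows consistent with the matrix above:

```python
# Each term's row is a bit vector over the six documents.
brutus    = [1, 1, 0, 1, 0, 0]
caesar    = [1, 1, 0, 1, 1, 1]
calpurnia = [0, 1, 0, 0, 0, 0]

# Query: Brutus AND Caesar AND NOT Calpurnia
# -> AND the Brutus and Caesar vectors with the complement of Calpurnia's.
result = [b & c & (1 - cal) for b, c, cal in zip(brutus, caesar, calpurnia)]
print(result)  # [1, 0, 0, 1, 0, 0]  -> documents 1 and 4 match
```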
Can't build the matrix
A 500K x 1M matrix has half a trillion 0s and 1s.
But it has no more than one billion 1s.
The matrix is extremely sparse.
What's a better representation?
We only record the 1 positions.

Inverted Index
For each term t, we must store a list of all documents that contain t.
Identify each doc by a docID, a document serial number.
Can we use fixed-size arrays?
What happens if the word Caesar is added to document 14?

Brutus    -> 1 2 4 11 31 45 173 174
Caesar    -> 1 2 4 5 6 16 57 132
Calpurnia -> 2 31 54 101
Inverted Index
We need variable-size postings lists
On disk, a continuous run of postings is normal and best
In memory, can use linked lists or variable-length arrays
Some tradeoffs in size/ease of insertion

Dictionary -> Postings (each docID in a list is a posting)
Brutus    -> 1 2 4 11 31 45 173 174
Caesar    -> 1 2 4 5 6 16 57 132
Calpurnia -> 2 31 54 101
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer -> token stream: Friends Romans Countrymen
Linguistic modules -> modified tokens: friend roman countryman
Indexer -> inverted index:
friend     -> 2 4
roman      -> 1 2
countryman -> 13 16
Initial Text
Processing
Tokenization
Cut character sequence into word tokens
Deal with "John's", "a state-of-the-art solution"
Normalization
Map text and query term to same form
You want U.S.A. and USA to match
Stemming
We may wish different forms of a root to match
authorize, authorization
Stop words
We may omit very common words (or not)
the, a, to, of
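The pipeline above (tokenize → normalize → index) can be sketched end to end. This is a toy illustration: the "stemmer" is just a crude plural-'s' strip standing in for a real linguistic module, and the documents are made up.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Cut the character sequence into word tokens, case-folded.
    return re.findall(r"[a-z']+", text.lower())

def normalize(token):
    # Toy stand-in for stemming: strip a trailing plural 's'.
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def build_index(docs):
    # Map each term to the sorted list of docIDs containing it.
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for term in set(normalize(t) for t in tokenize(text)):
            index[term].append(doc_id)
    return dict(index)

docs = {1: "Friends, Romans, countrymen.", 2: "Roman friends"}
index = build_index(docs)
print(index["friend"])  # [1, 2]
print(index["roman"])   # [1, 2]
```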
Query processing: AND
Consider processing the query: Brutus AND Caesar
Locate Brutus in the Dictionary; retrieve its postings.
Locate Caesar in the Dictionary; retrieve its postings.
Merge the two postings (intersect the document sets):

Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34
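The merge step above walks the two sorted postings lists with two pointers, advancing whichever points at the smaller docID. A sketch:

```python
def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2)) time."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Keeping the postings sorted by docID is what makes this linear-time merge possible, which is why postings are stored that way.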
Are all the pages of the same importance?
Page Rank
A method for rating the importance of web pages objectively and mechanically, using the link structure of the web.
PageRank was developed by Larry Page (hence the name PageRank) and Sergey Brin.
It first appeared as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998 known as Google.
PageRank in a single Equation

R(u) = c * sum over v in B_u of R(v) / N_v

u: a web page
B_u: the set of u's backlinks
N_v: the number of forward links of page v
c: the normalization factor to make ||R||_L1 = 1 (||R||_L1 = |R_1 + ... + R_n|)
Probabilistic Interpretation of PageRank
The Random Surfer Model:
PageRank is the standing probability distribution of a random walk on the graph of the web: a surfer who simply keeps clicking successive links at random.
Example
[Slides: PageRank matrices for a simple version of the web; the PageRank calculation at the first iteration, the second iteration, and convergence after many iterations.]
Problem with simplified PageRank
Loops: during each iteration, a loop accumulates rank but never distributes rank to other pages!
[Example slides: the first iteration, the second iteration, and convergence after many iterations on a graph containing a loop.]
Solution
Modify the random surfer so that he simply keeps clicking successive links at random, but periodically gets bored and jumps to a random page based on the distribution of E:

R(u) = c * (sum over v in B_u of R(v) / N_v + E(u))

E(u): a distribution of ranks of web pages that users jump to when they get bored after following successive links at random.
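The full computation is usually done by power iteration. A sketch on a made-up three-page graph, using a uniform random-jump term with damping factor d = 0.85 to play the role of the distribution E above:

```python
def pagerank(links, d=0.85, iterations=100):
    """Power iteration on a dict mapping page -> list of forward links."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}      # start from the uniform distribution
    for _ in range(iterations):
        new = {}
        for u in pages:
            # sum of R(v) / N_v over v in B_u (the backlinks of u)
            backlink_sum = sum(rank[v] / len(links[v])
                               for v in pages if u in links[v])
            # (1 - d)/n is the "bored surfer" jump; d weights link-following
            new[u] = (1 - d) / n + d * backlink_sum
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # toy web graph
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
```

Because C has two backlinks (from A and B) while B has only one, C ends up with a higher rank than B, and the ranks still sum to 1 since every page has at least one forward link.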
Google in 1997
Ways of index partitioning
By doc: each shard has the index for a subset of docs
pro: each shard can process queries independently
pro: easy to keep additional per-doc information
pro: network traffic (requests/responses) small
con: query has to be processed by each shard
con: O(K*N) disk seeks for a K-word query on N shards

Ways of index partitioning
By word: each shard has a subset of words for all docs
pro: a K-word query is handled by at most K shards
pro: O(K) disk seeks for a K-word query
con: much higher network bandwidth needed; data about each word for each matching doc must be collected in one place
con: harder to have per-doc information
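The doc-partitioned scheme can be sketched as a simple fan-out: every shard indexes only its own docs, so a query is sent to all shards and the per-shard results are merged. The shard contents are illustrative.

```python
# Hypothetical doc-partitioned index: shard 0 holds docs 1-5, shard 1 holds 6-10.
shards = [
    {"apple": [1, 3], "banana": [3]},
    {"apple": [7], "cherry": [6, 9]},
]

def search(term):
    hits = []
    for shard in shards:                    # the query fans out to every shard
        hits.extend(shard.get(term, []))    # each shard answers independently
    return sorted(hits)                     # merge the per-shard results

print(search("apple"))   # [1, 3, 7]
```

This makes the "query has to be processed by each shard" con concrete: the loop touches all N shards even when only one of them holds matching docs.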
Google in 1999
Caching
Cache both index results and doc snippets
Hit rates typically 30-60%
depends on frequency of index updates, mix of query traffic, level of personalization, etc.
Better performance: 10s of machines do the work of 100s or 1000s
reduced query latency on hits
queries that hit in cache tend to be both popular and expensive (common words, lots of documents to score, etc.)
Beware: big latency spike/capacity drop when the index is updated or the cache is flushed
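A result cache of the kind described above can be sketched as a small LRU cache keyed by the query string; `run_query` here is a hypothetical stand-in for the real query engine.

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache for query results, with hit/miss counters."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def get(self, query, run_query):
        if query in self.entries:
            self.hits += 1
            self.entries.move_to_end(query)        # mark as recently used
            return self.entries[query]
        self.misses += 1
        result = self.entries[query] = run_query(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict least recently used
        return result

cache = QueryCache(capacity=2)
run = lambda q: f"results for {q!r}"   # stand-in for the real query engine
cache.get("circle of life", run)       # miss: computed and cached
cache.get("circle of life", run)       # hit: served from cache
print(cache.hits, cache.misses)        # 1 1
```

An index update would invalidate everything at once (e.g. `cache.entries.clear()`), which is exactly the latency spike / capacity drop the slide warns about: every popular, expensive query suddenly misses.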
Google 2000
Dealing with growth
Google 2001
In-memory Indexing
Big increase in throughput
Big decrease in latency
especially at the tail: expensive queries that previously needed GBs of disk I/O became much faster, e.g. [ circle of life ]
Variance: queries touch 1000s of machines, not dozens
Availability: 1 or few replicas of each doc's index data
Google 2004
Google 2007 : Universal Search
References
Stanford CS276 (Information Retrieval and Web Search): http://www.stanford.edu/class/cs276/
The PageRank Citation Ranking: Bringing Order to the Web. EECS 584, University of Michigan: http://web.eecs.umich.edu/~michjc/eecs584/notes/lecture19-pagerank.ppt
Challenges in Building Large-Scale Information Retrieval Systems, by Jeff Dean: http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf
