Project Idea Presentations

Today
• Project idea presentations
• Introduction to IR
• Assignments for next week
Data Retrieval vs. Information Retrieval
• Information retrieval solutions may incorporate data retrieval
– Data retrieval as a subset of information retrieval
• For this class, data retrieval alone is not interesting

Next
Models of information retrieval
Takeaway Messages
• Modeling information need and user activity is complex
• Be aware of simplistic assumptions that most IR work makes

Textbook Roadmap
Generic Document Retrieval Model
[Figure: generic document retrieval model. Labels: Information Need, Documents, Query Language, Ranking Algorithm, Representation of Information Need, Representation of Document Content, Prior Knowledge & Assumptions, Documents]

Best Matching Models Covered
In this class:
• Boolean Model
• Vector Space Model
Potentially more if time and interest permit.
Index Terms
• Roughly a word or phrase describing the content of a document
• Manual Indexing
– A human reads or scans a document and assigns it index terms
– (e.g. Library of Congress subject terms)
• Automatic Indexing
– Full-text indexing
– Every word in the document becomes an index term

Full-text Indexing Models
• Boolean model
• Vector space model
• Probabilistic model
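A minimal sketch of the automatic full-text indexing described under Index Terms, where every word in a document becomes an index term. The names `tokenize` and `build_index`, the lowercase word tokenizer, and the two sample documents are illustrative assumptions, not part of the slides.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word (index term) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)

# Hypothetical sample documents.
docs = {
    "d1": "House for sale in Corvallis",
    "d2": "Condo for rent in Eugene",
}
index = build_index(docs)
print(sorted(index["for"]))        # ['d1', 'd2']
print(sorted(index["corvallis"]))  # ['d1']
```

Real systems usually add steps such as stop-word removal or stemming; those are left out here to keep the "every word is an index term" idea visible.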
Generic Full-Text Indexing Definitions
• K = {k1, ..., kt}
– Set of all index terms
• wi,j > 0
– Weight of term ki in document dj
– wi,j = 0 if ki is not in dj
• Each document is described as a vector
– dj = (w1,j, w2,j, ..., wt,j)
Modern Information Retrieval, Baeza-Yates & Ribeiro-Neto

Boolean Model
• Compute vectors for each document (dj)
• If a keyword ki appears in dj, then wi,j is 1, otherwise 0
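The Boolean model's binary weights can be sketched as follows. This is an illustration under stated assumptions: the helper name `binary_vector` and the two sample documents are hypothetical, and the index terms are sorted only so that every vector uses the same term order.

```python
def binary_vector(doc_terms, vocabulary):
    """Return (w_1j, ..., w_tj): 1 if term k_i occurs in the document, else 0."""
    return [1 if term in doc_terms else 0 for term in vocabulary]

# Hypothetical documents, already reduced to their sets of index terms.
docs = {
    "d1": {"house", "corvallis", "oregon"},
    "d2": {"condo", "eugene", "oregon"},
}

# K: the set of all index terms, in a fixed (sorted) order.
K = sorted(set().union(*docs.values()))
vectors = {doc_id: binary_vector(terms, K) for doc_id, terms in docs.items()}

print(K)              # ['condo', 'corvallis', 'eugene', 'house', 'oregon']
print(vectors["d1"])  # [0, 1, 0, 1, 1]
print(vectors["d2"])  # [1, 0, 1, 0, 1]
```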
Boolean Query Language
• Terms
– Words
– Phrases
• Operators
– AND
– OR
– NOT

Example Boolean Queries
• House
• House AND Corvallis
• House OR Corvallis
• (House OR Condo) AND Corvallis
• House AND Oregon AND NOT Eugene
• (House OR Condo) AND Corvallis AND NOT Eugene
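One common way to evaluate queries like these is with set operations over an inverted index: AND becomes intersection, OR becomes union, and NOT becomes the complement relative to the whole collection. The sketch below assumes a tiny hypothetical index of four documents; `docs_with` is an illustrative helper, not something defined in the slides.

```python
# Hypothetical inverted index: index term -> ids of documents containing it.
index = {
    "house": {"d1", "d3"},
    "condo": {"d2", "d4"},
    "corvallis": {"d1", "d2"},
    "eugene": {"d3", "d4"},
    "oregon": {"d1", "d2", "d3", "d4"},
}
all_docs = set().union(*index.values())

def docs_with(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term.lower(), set())

def not_(doc_set):
    """NOT is the complement relative to the whole collection."""
    return all_docs - doc_set

# House AND Corvallis
q1 = docs_with("House") & docs_with("Corvallis")
# House OR Corvallis
q2 = docs_with("House") | docs_with("Corvallis")
# House AND Oregon AND NOT Eugene
q3 = docs_with("House") & docs_with("Oregon") & not_(docs_with("Eugene"))
# (House OR Condo) AND Corvallis AND NOT Eugene
q4 = (docs_with("House") | docs_with("Condo")) & docs_with("Corvallis") & not_(docs_with("Eugene"))

for name, result in [("q1", q1), ("q2", q2), ("q3", q3), ("q4", q4)]:
    print(name, sorted(result))
```

Each result is an unordered set: a document either matches the query or it does not.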
Rules for Boolean Logic
• De Morgan’s Laws
– NOT (A AND B) = (NOT A) OR (NOT B)
– NOT (A OR B) = (NOT A) AND (NOT B)
• Search for “Boolean Logic” if you want to know more

Informal Pseudo Boolean Notation
• Evolved in web search engines
• +house +Corvallis
– house AND Corvallis
• +house +Oregon –Eugene
– house AND Oregon AND NOT Eugene
• House +Corvallis
– ?
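The +/– notation can be translated back into explicit Boolean operators with a few lines of code. This is a sketch under stated assumptions: `parse_pseudo_boolean` and `to_boolean` are hypothetical names, both the ASCII hyphen and the en dash are accepted as the exclusion prefix, and unmarked terms (as in "House +Corvallis") are only collected, not interpreted, because the slide leaves their meaning as an open question.

```python
def parse_pseudo_boolean(query):
    """Split a +/- style query into (required, excluded, unmarked) term lists."""
    required, excluded, unmarked = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token[0] in "-\u2013" and len(token) > 1:
            # Accept an ASCII hyphen or an en dash as the exclusion prefix.
            excluded.append(token[1:])
        else:
            unmarked.append(token)
    return required, excluded, unmarked

def to_boolean(query):
    """Render required and excluded terms as an explicit Boolean expression."""
    required, excluded, unmarked = parse_pseudo_boolean(query)
    if unmarked:
        # The reading of unmarked terms varies by system, so refuse to guess.
        raise ValueError("unmarked terms have no single Boolean reading: %r" % unmarked)
    return " AND ".join(required + ["NOT " + term for term in excluded])

print(to_boolean("+house +Corvallis"))        # house AND Corvallis
print(to_boolean("+house +Oregon -Eugene"))   # house AND Oregon AND NOT Eugene
print(parse_pseudo_boolean("House +Corvallis"))
```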
Boolean Searching
• Upsides
– Easy to implement
– Simple queries are easy
– Query language gives significant control over results
• Downsides
– Binary relevance decision
• No ordering criteria
• Usually too much or too little
– Syntax can be complex

Who Uses Boolean Searches
• Everybody until about ten to fifteen years ago
• Even now, many commercial systems (library catalogs, abstracting services, etc.)