
CS419/519
Information Filtering and Retrieval
Jon Herlocker
Dept. of Computer Science
Oregon State University
Tu/Th Day 2

Review of Last Lecture
• Tour of a research information retrieval system
– Traditional IR search engine
– Collaborative filtering recommendations
– Crawling, pre-processing, and indexing
– Personalization
– User study

Today
• Project idea presentations
• Introduction to IR
• Assignments for next week

Project Idea Presentations

What is Information Retrieval?
• How does it differ from databases (data retrieval)?

Data Retrieval vs. Information Retrieval

                     Data retrieval     Information retrieval
Content              Data               Information
Data object          Table              Document, image, other
Matching             Exact match        Partial match, best match
Items wanted         Matching           Relevant
Query language       SQL (artificial)   Natural
Query specification  Complete           Incomplete
Model                Deterministic      Probabilistic
Structure            Highly structured  Less structured

Table by Xin Xao, Drexel University

Data Retrieval vs. Information Retrieval
• Information retrieval solutions may incorporate data retrieval
– Data retrieval as a subset of information retrieval
• For this class, data retrieval alone is not interesting

Next
Models of information retrieval

Traditional Information Need Model
1. User has an information need
2. User forms a query
3. IR system makes best match with documents
4. User evaluates ranked documents
5. Is the need met?
6a. Yes -> done
6b. No -> reformulate query

Some Assumptions of Traditional Model
• It is possible for the user to specify their exact needs
• Document texts are functionally equivalent to information needs
– Essentially document retrieval
• The information need remains constant throughout the search process
• The user will always recognize relevant documents
(Belkin, Oddy, & Brooks)

Relevance?
• What does relevance mean?
• A document might be relevant for many reasons
– Answers a question with a fact
– Gives part of an answer
– Gives link to the answer
– Gives related information
• Relevance is subjective!
– We’ll return to this discussion later

Other Models
• Anomalous States of Knowledge (ASKs)
– (Belkin, Oddy, and Brooks)
– A recognized anomaly in a user’s state of knowledge that the user is not able to specify precisely
• Berry-Picking Model
– (Bates 90)
– Interesting information scattered like berries on bushes
– The query is continually shifting

Takeaway Messages
• Modeling information need and user activity is complex
• Be aware of simplistic assumptions that most IR work makes

Textbook Roadmap

MIR – Chap. 2 - Models
• Goal is to retrieve documents that are relevant to a user’s single information need
– Based on those traditional assumptions discussed earlier

Models for Retrieval
• What documents best match the need described by a query?
• Modeling
– Information need
• From a query
– Information content
• From a document
– Closeness or similarity
• Between need and content
• Based on content analysis
– Measurable attributes of queries and documents

Generic Document Retrieval Model
[Diagram. Labels: Information Need, Query Language, Representation of Information Need, Ranking Algorithm, Representation of Document Content, Documents, Prior Knowledge & Assumptions]

Best Matching Models Covered
In this class:
• Boolean Model
• Vector Space Model
Potentially more if time and interest permit.

Index Terms
• Roughly a word or phrase describing the content of a document
• Manual Indexing
– A human reads or scans a document and assigns it index terms
– (e.g., Library of Congress subject terms)
• Automatic Indexing
– Full-text indexing
– Every word in the document becomes an index term

Full-text Indexing Models
• Boolean model
• Vector space model
• Probabilistic model
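A minimal sketch of automatic full-text indexing, where every word in a document becomes an index term. The tokenizer (lowercasing, splitting on non-alphanumeric characters), the function names, and the sample documents are assumptions made for illustration; they are not taken from the lecture.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


# Hypothetical two-document collection.
docs = {
    "d1": "House for sale in Corvallis",
    "d2": "Condo for rent in Eugene",
}
index = build_index(docs)
```

Looking up a term such as `index["house"]` then gives the documents that contain it, which is the structure the Boolean model queries against.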

Generic Full-Text Indexing Definitions
• K = {k1, ..., kt}
– Set of all index terms
• wi,j > 0
– Weight of term ki in document dj
– wi,j = 0 if ki is not in dj
• Each document is described as a vector
– dj = (w1,j, w2,j, ..., wt,j)
Modern Information Retrieval, Baeza-Yates & Ribeiro-Neto

Boolean Model
• Compute vectors for each document (dj)
• If a keyword ki appears in dj, then wi,j is 1, otherwise 0
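The Boolean-model weights can be sketched as follows: given the set of index terms K in a fixed order, a document dj becomes the vector (w1,j, ..., wt,j) with wi,j = 1 when ki appears in dj and 0 otherwise. The function name and the sample vocabulary are hypothetical.

```python
def binary_vector(doc_terms, vocabulary):
    """Return (w_1j, ..., w_tj): 1 if term k_i occurs in the document, else 0."""
    present = set(doc_terms)
    return tuple(1 if k in present else 0 for k in vocabulary)


# K: all index terms, in a fixed order (illustrative vocabulary).
K = ["condo", "corvallis", "eugene", "house"]

# A document as a list of its terms; repeats do not change a binary weight.
d1 = ["house", "corvallis", "house"]
v1 = binary_vector(d1, K)
```

Note that the repeated term "house" still gets weight 1: the Boolean model records presence, not frequency.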

Boolean Query Language
• Terms
– Words
– Phrases
• Operators
– AND
– OR
– NOT

Example Boolean Queries
• House
• House AND Corvallis
• House OR Corvallis
• (House OR Condo) AND Corvallis
• House AND Oregon AND NOT Eugene
• (House OR Condo) AND Corvallis AND NOT Eugene
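One common way to evaluate queries like these is with set operations over an inverted index: AND is intersection, OR is union, and NOT is the complement against the whole collection. The sketch below assumes a small hand-made index; the document ids and the helper names (`term`, `and_`, `or_`, `not_`) are hypothetical.

```python
# Hypothetical inverted index: term -> ids of documents containing it.
index = {
    "house": {"d1", "d3"},
    "condo": {"d2", "d4"},
    "corvallis": {"d1", "d2"},
    "oregon": {"d1", "d2", "d3", "d4"},
    "eugene": {"d3", "d4"},
}
all_docs = {"d1", "d2", "d3", "d4"}


def term(t):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(t.lower(), set())


def and_(a, b):
    return a & b


def or_(a, b):
    return a | b


def not_(a):
    """Complement relative to the whole collection."""
    return all_docs - a


# House AND Corvallis
q1 = and_(term("House"), term("Corvallis"))
# (House OR Condo) AND Corvallis
q2 = and_(or_(term("House"), term("Condo")), term("Corvallis"))
# House AND Oregon AND NOT Eugene
q3 = and_(and_(term("House"), term("Oregon")), not_(term("Eugene")))
```

Each result is an unordered set, which matches the pure Boolean view that a document either satisfies the query or does not.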

Rules for Boolean Logic
• DeMorgan’s Law
– NOT (A AND B) = (NOT A) OR (NOT B)
– NOT (A OR B) = (NOT A) AND (NOT B)
• Search for “Boolean Logic” if you want to know more

Informal Pseudo Boolean Notation
• Evolved in web search engines
• +house +Corvallis
– House AND Corvallis
• +house +Oregon –Eugene
– House AND Oregon AND NOT Eugene
• House +Corvallis
– ?
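DeMorgan’s laws from the Rules for Boolean Logic slide can be checked exhaustively, since two Boolean variables have only four combinations of values. This is a small illustrative check, not part of the original slides; the function name is made up.

```python
from itertools import product


def de_morgan_holds():
    """Check both DeMorgan identities for every combination of A and B."""
    for a, b in product([False, True], repeat=2):
        # NOT (A AND B) = (NOT A) OR (NOT B)
        if (not (a and b)) != ((not a) or (not b)):
            return False
        # NOT (A OR B) = (NOT A) AND (NOT B)
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True


result = de_morgan_holds()
```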

Pseudo Boolean Notation
• House +Corvallis
– (Corvallis AND House) OR Corvallis
• +House +Condo Corvallis Salem
– (House AND Condo AND Corvallis) OR (House AND Condo AND Salem) OR (House AND Condo)

Ordering of Retrieved Documents
• Pure Boolean has no order
– All returned documents are equally relevant
• In reality, different approaches can be taken
– Chronologically
– Order by number of times a specified term occurs
– Other approaches – get further and further away from Boolean.
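A rough sketch of how the pseudo-Boolean notation and one of the ordering approaches could fit together. Following the translations on the slides, terms marked `+` are required and unmarked terms do not change which documents match; treating `-` terms as excluded follows the `+house +Oregon –Eugene` example. Two things are assumptions of this sketch rather than anything stated in the lecture: a query with no `+` terms matches on any unmarked term, and results are ordered by how many times the query terms occur. All function names are hypothetical.

```python
def parse_pseudo_boolean(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token[0] in "-\u2013" and len(token) > 1:  # hyphen or en dash
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def matches(doc_terms, required, excluded, optional):
    """Boolean match: no excluded terms and all required terms present."""
    terms = set(doc_terms)
    if excluded & terms:
        return False
    if required:
        return required <= terms
    # Assumption: with no '+' terms, any optional term is enough (OR).
    return bool(optional & terms)


def rank(docs, query):
    """Return matching doc ids, most query-term occurrences first."""
    required, excluded, optional = parse_pseudo_boolean(query)
    query_terms = required | optional
    hits = []
    for doc_id, terms in docs.items():
        if matches(terms, required, excluded, optional):
            score = sum(1 for t in terms if t in query_terms)
            hits.append((score, doc_id))
    hits.sort(key=lambda h: (-h[0], h[1]))  # ties broken by doc id
    return [doc_id for _, doc_id in hits]


# Hypothetical collection: doc id -> list of terms in the document.
docs = {
    "d1": ["house", "corvallis"],
    "d2": ["condo", "corvallis", "corvallis", "house", "house"],
    "d3": ["house", "eugene"],
    "d4": ["corvallis"],
}
ranked = rank(docs, "House +Corvallis")
```

Here the optional term "house" does not change which documents match, only where they land in the ordering, which is one way the result drifts away from pure Boolean retrieval.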

Boolean Searching
• Upsides
– Easy to implement
– Simple queries are easy
– Query language gives significant control over results
• Downsides
– Binary relevance decision
• No ordering criteria
• Usually too much or too little
– Syntax can be complex

Who Uses Boolean Searches
• Everybody until about ten to fifteen years ago
• Even now, many commercial systems (library catalogs, abstracting services, etc.)

Boolean model – Data Retrieval or Information Retrieval?
• Very close – hard to distinguish
• Differences
– Enormous number of attributes
– A document only has values for a few of those attributes
– Inefficient to store and search using traditional data retrieval methods
– Ordering may still be important

Proximity operators
• NEAR
• WITHIN
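The slides name the proximity operators without defining them. As a sketch under one plausible reading, `A NEAR B` could match when the two terms occur within some maximum number of word positions of each other. The `max_distance` parameter, its default, and the function name are assumptions of this sketch, not definitions from the lecture.

```python
def near(tokens, a, b, max_distance=3):
    """True if terms a and b occur within max_distance positions of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


# Hypothetical document, already tokenized into words.
tokens = "house for sale near downtown corvallis".split()
close = near(tokens, "house", "sale")
far = near(tokens, "house", "corvallis")
```

Evaluating such an operator needs term positions, not just term presence, so an index that supports it has to store more than the binary weights of the Boolean model.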
