
CSC483: Information Retrieval

Lecture 1: Introduction
Salah Hammami

2008

Course Goals

- dimensions of the IR "problem":
  - functions of an IR system
  - components of an IR system
  - factors which optimize the IR process
- examine current research issues in IR
- explore examples of industrial IR applications
- form a broad picture of the IR field


Course Material
- Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, 2001
- Information Retrieval: A Survey by Ed Greengrass, 2000
- Information Discovery by Theo van der Weide, 2001
- Introduction to Information Retrieval by C. D. Manning, P. Raghavan and H. Schütze (in preparation)
- Lecture slides & notes
- Additional study material & Web links


Course Lectures Overview


http://tech.groups.yahoo.com/group/csc483/

IR introduction
1. IR Models
2. IR Query Languages & Operations
3. Searcher Feedback

IR research issues / Applications of IR
4. Language Modeling for IR
5. Search Engines
6. Semantics in IR
8. Multimedia IR
9. Structured Content


IR Related Areas

- Database Management
- Library and Information Science
- Artificial Intelligence
- Natural Language Processing
- Machine Learning


IR Related Areas
- Database Management
  - structured data in relational tables vs. free-form text
  - well-defined queries in a formal language (SQL)
  - recent move towards semi-structured data (XML)
- Library and Information Science
  - user aspects of IR
  - categorization of human knowledge
  - citation analysis
  - bibliometrics (structure of information)
  - recent work on digital libraries
- Artificial Intelligence (AI)
  - knowledge representation, reasoning, formalisms, e.g. first-order predicate logic, Bayesian networks
  - recent work on Web ontologies and Intelligent Information Agents

IR Related Areas
- Natural Language Processing (NLP)
  - syntactic, semantic, and pragmatic analysis of text & discourse
  - retrieval based on meaning rather than keywords: analyzing the syntax (phrase structure) and semantics
  - determining the sense of ambiguous words (context-based)
  - identifying specific pieces of information in a document
  - answering specific NL questions
  - recent work in GATE (General Architecture for Text Engineering, http://gate.ac.uk/)
- Machine Learning (ML)
  - computational systems whose performance improves with experience
  - automated classification of examples (supervised learning)
  - automated clustering of examples (unsupervised learning)


IR is not databases


Current situation: The Information Age


Main task: Information retrieval
- increasing amount of information
- understand
- manage
- various information
- precision
- distributed information repositories
- complex user goals
- dynamic user demands
- customization demand
- speed


Information Retrieval
Definition

Information Retrieval (IR) is the task of finding relevant texts within a large amount of unstructured data.
- Relevant = texts matching some specific criteria.
- Examples of IR tasks: searching for emails from a given person, searching for an event that occurred on a given date using the Internet, etc.
- Examples of IR systems: WWW search engines, specific search engines (laws, medical documents), etc.
- NB: Database Management Systems (DBMS) are different from IR systems (data stored in a DB are structured!).

Information Retrieval

- The goal of IR is to retrieve all and only the relevant documents in a collection for a particular user with a particular need for information.
- Relevance is a central concept in IR theory.
- How does an IR system work when the collection is all documents available on the Web?
- Web search engines are stress-testing the traditional IR models.


Information Retrieval
- The goal is to search large document collections (millions of documents) to retrieve small subsets relevant to the user's information need.
- Examples are:
  - Internet search engines
  - Digital library catalogues
- Some application areas within IR:
  - Cross-language retrieval
  - Speech/broadcast retrieval
  - Text categorization
  - Text summarization
- Subject to objective testing and evaluation:
  - hundreds of queries
  - millions of documents (the TREC set and conference)

Information Retrieval IR in general ...


IR is the discipline that deals with
- retrieval
- representation
- storage
- organization
- access
of structured, semi-structured and unstructured data (information objects) in response to a query (topic statement), which may be
- structured (e.g. a Boolean expression)
- unstructured (e.g. a sentence, a document)


Information Retrieval in other words


The process of applying algorithms over unstructured, semi-structured or structured data in order to satisfy a given (explicit) information query.

Efficiency with respect to:
- algorithms
- query building
- data organization/structure


Information Retrieval and in other words

[Figure: the elements of an IR system and the questions each raises]
- CM (content model): what attributes, what structure, what rules, how to build
- Data: how to organize, what structures, what data
- Algorithm: optimal, what CM attributes
- Query: how to build, what CM attributes

Data vs. Information Retrieval


- Data retrieval
  - which docs contain a set of keywords?
  - well-defined semantics
  - a single erroneous object implies failure!
- Information retrieval
  - information about a subject or topic
  - semantics is frequently loose
  - small errors are tolerated
- An IR system
  - interprets the contents of information items
  - generates a ranking which reflects relevance
  - the notion of relevance is most important
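
The contrast between the two kinds of retrieval can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document collection, the function names `data_retrieval` and `information_retrieval`, and the overlap-count score are all hypothetical and stand in for whatever matching and ranking a real system would use.

```python
# Hypothetical toy collection (not from the lecture).
docs = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "a database query returns exactly the matching rows",
    "d3": "retrieval of facts from text documents",
}

def data_retrieval(keywords, docs):
    """Exact match: ids of documents that contain every keyword."""
    wanted = set(keywords)
    return sorted(doc_id for doc_id, text in docs.items()
                  if wanted <= set(text.split()))

def information_retrieval(keywords, docs):
    """Loose match: rank documents by the number of query keywords they share."""
    wanted = set(keywords)
    scores = {doc_id: len(wanted & set(text.split()))
              for doc_id, text in docs.items()}
    # Highest score first; ties broken by document id.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]

query = ["information", "retrieval", "documents"]
exact = data_retrieval(query, docs)          # only documents with all keywords
ranked = information_retrieval(query, docs)  # partial matches are tolerated
```

In this sketch the exact match returns a single document, while the ranked variant also returns a partially matching document further down the list, which is the "small errors are tolerated" behaviour.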

IR Systems

User Query -> IR System -> Ranked list of documents

The IR system:
- interprets the contents of information objects
- generates a ranking which reflects relevance


IR Systems
IR System
- disclosure for a collection O of n information objects
- the user is interested in information objects
- the interest is modelled as a partial order on the collection: a set of relevant and a set of irrelevant documents
- the system produces a (total) ordering resembling the user's interest (a comparative model of the user's information need)

Information Need
- has qualitative and quantitative aspects
- is expressed in a query
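
One way to make "a total ordering resembling the user's interest" concrete is to count how many (relevant, irrelevant) pairs the system ranking puts in the right order. The sketch below is an illustration, not something the lecture prescribes: the function name `pairwise_agreement`, the object ids, and the choice of this particular agreement measure are assumptions.

```python
from itertools import product

def pairwise_agreement(ranking, relevant):
    """Fraction of (relevant, irrelevant) pairs that the ranking orders correctly."""
    position = {doc_id: rank for rank, doc_id in enumerate(ranking)}
    relevant_docs = [d for d in ranking if d in relevant]
    irrelevant_docs = [d for d in ranking if d not in relevant]
    pairs = list(product(relevant_docs, irrelevant_docs))
    if not pairs:
        return 1.0  # nothing to disagree about
    correct = sum(1 for rel, irr in pairs if position[rel] < position[irr])
    return correct / len(pairs)

# Hypothetical user interest: a partial order with relevant objects above the rest.
relevant = {"o1", "o4"}
# Hypothetical total ordering produced by the system.
system_ranking = ["o1", "o2", "o4", "o3"]
agreement = pairwise_agreement(system_ranking, relevant)
```

A ranking that places every relevant object above every irrelevant one scores 1.0; here one of the four pairs (o4 below o2) is inverted.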


Basic Concepts


Basic Concepts:

The User Task

- Pull actions: the user requests information in an interactive manner
- Push actions: software agents push the information towards the users

Basic Concepts:

The User Task

Document

- single unit of information
- typically text in a digital form, or other media
- a complete logical unit (e.g. book, article)
- a part of a larger text (e.g. passage, section, entry in a dictionary)
- any physical unit (e.g. file, email, web page)

The Standard Retrieval Interaction Model


Standard Model of IR

Assumptions:
- The goal is maximizing precision and recall simultaneously
- The information need remains static
- The value is in the resulting document set
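
For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch, assuming set-based (unranked) evaluation; the function name and the document ids are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)  # 2 of 4 retrieved, 2 of 3 relevant
```

Retrieving more documents tends to raise recall and lower precision, which is why maximizing both at once is stated as a goal rather than taken for granted.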


Problems with Standard Model


- Users learn during the search process:
  - scanning titles of retrieved documents
  - reading retrieved documents
  - viewing lists of related topics/thesaurus terms
  - navigating hyperlinks
- Some users don't like long (apparently) disorganized lists of documents


IR is an Iterative Process

[Figure labels: Goals, Repositories, Workspace]

IR is a Dialog

- The exchange doesn't end with the first answer
- Users can recognize elements of a useful answer, even when incomplete
- Questions and understanding change as the process continues

Information Retrieval

Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries.

This set of assumptions underlies the field of Information Retrieval.


The Retrieval Process ...

[Figure: the retrieval process]
- User Interface: the user specifies the user need as text; user feedback changes the query
- Text Operations: define the logical view of the text
- Query Operations: generate the query from the logical view
- Indexing: builds the index (inverted file) over the Text Database, through the DB Manager Module
- Searching: uses the index to produce the retrieved docs
- Ranking: orders the retrieved docs into ranked docs
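
The indexing and searching stages of the process can be sketched with a tiny inverted file. This is a simplified illustration, assuming whitespace tokenization, punctuation stripping and conjunctive (AND) matching; the function names `text_operations`, `build_index` and `search` and the sample documents are hypothetical and do not come from the lecture.

```python
from collections import defaultdict

def text_operations(text):
    """Reduce raw text to its logical view: lowercase terms without punctuation."""
    terms = (word.strip(".,;:!?").lower() for word in text.split())
    return [term for term in terms if term]

def build_index(docs):
    """Indexing: map each term to the set of documents containing it (inverted file)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching: ids of documents that contain every query term."""
    terms = text_operations(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical text database.
docs = {
    1: "Web search engines index text.",
    2: "An inverted file maps terms to documents.",
    3: "Search engines build an inverted index.",
}
index = build_index(docs)
hits = search(index, "Inverted index")
```

The same text operations are applied to documents and queries so that both share one logical view; ranking of the retrieved documents is a separate step not shown here.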

The Retrieval Process ...


[Figure: documents are represented by index terms and the information need by a query; query and document representations are matched and the results ranked]


The Retrieval Process ...


- Ordering retrieved documents reflects their relevance to the user query
- Fundamental premises for relevance:
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model

- Matching index terms is quite imprecise
- Users are frequently dissatisfied
- Users have no training in query formation
- Frequent dissatisfaction of Web users
- Relevance is critical for IR systems: ranking
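
As an illustration of the second premise (sharing of weighted terms), the sketch below ranks documents by the cosine similarity of TF-IDF vectors. TF-IDF weighting and cosine similarity are assumptions made for this example; the slide does not name a weighting scheme, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(term for terms in tokenized.values() for term in set(terms))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {doc_id: {term: tf * idf[term] for term, tf in Counter(terms).items()}
               for doc_id, terms in tokenized.items()}
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document ids ordered by decreasing similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q = {term: tf * idf.get(term, 0.0)
         for term, tf in Counter(query.lower().split()).items()}
    scores = {doc_id: cosine(q, vec) for doc_id, vec in vectors.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical collection.
docs = {
    "d1": "ranking reflects relevance",
    "d2": "boolean queries match index terms",
    "d3": "relevance ranking of web documents",
}
ranked_ids = rank("relevance ranking", docs)
```

Both d1 and d3 contain the two query terms, but d1 ranks higher because less of its vector is taken up by terms outside the query; a pure common-index-terms match would not distinguish them.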

IR Taxonomy

User Task
- Retrieval: Ad hoc, Filtering
- Browsing

Classic Models: Boolean, Vector, Probabilistic
- Set Theoretic: Fuzzy, Extended Boolean
- Algebraic: Generalized Vector, Latent Semantic Index, Neural Networks
- Probabilistic: Inference Network, Belief Network

Structured Models: Non-Overlapping Lists, Proximal Nodes

Browsing: Flat, Structure Guided, Hypertext


IR Taxonomy: Ad Hoc Retrieval


[Figure: new queries Q1 to Q5 are submitted to a collection of fixed size]
- new queries are submitted to the system
- the collection remains relatively static

- a person having a need for information
- a set of information objects to satisfy the need
- models to formalize the information need
- stable (fixed) info collection
- user interest is valid during some period of time
- the query only expresses the information need at some point in time

IR Taxonomy: Filtering
[Figure: a stream of documents is matched against User 1 Profile and User 2 Profile, giving the docs filtered for User 1 and for User 2]
- queries remain relatively static
- new documents come into the system

- (continuous) stream of documents, e.g. newsgroups
- decision for each document
- no preprocessing of all documents
- universal contents description, or incrementally when new documents arrive (e.g. WWW)
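
Filtering can be sketched as a loop that makes one decision per incoming document against each static user profile. The profiles, the sample stream, the overlap threshold and the function names below are assumptions made for illustration, not part of the lecture.

```python
def matches(profile, document, threshold=2):
    """Decide for one document: does it share enough terms with the profile?"""
    terms = set(document.lower().split())
    return len(profile & terms) >= threshold

def filter_stream(stream, profiles, threshold=2):
    """Route each arriving document to every user whose profile it matches."""
    delivered = {user: [] for user in profiles}
    for doc_id, text in stream:  # documents are seen one at a time, in arrival order
        for user, profile in profiles.items():
            if matches(profile, text, threshold):
                delivered[user].append(doc_id)
    return delivered

# Hypothetical static profiles (the "queries" that remain relatively static).
profiles = {
    "user1": {"football", "league", "goal"},
    "user2": {"search", "engine", "index"},
}
# Hypothetical stream of new documents.
stream = [
    ("n1", "late goal decides the football league"),
    ("n2", "new search engine updates its index"),
    ("n3", "weather stays mild"),
]
delivered = filter_stream(stream, profiles)
```

Unlike ad hoc retrieval, nothing is indexed in advance here: the collection is never complete, so each document is judged on arrival.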
