IR1 Introduction

CSC483: Information Retrieval
Lecture 1: Introduction
Salah Hammami
2008
Course Goals
Dimensions of the IR "problem":
- functions of an IR system
- components of an IR system
- factors which optimize the IR process

- examine current research issues in IR
- explore examples of industrial IR applications
- form a broad picture of the IR field
Course Material
- Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, 2001
- Information Retrieval: A Survey by Ed Greengrass, 2000
- Information Discovery by Theo van der Weide, 2001
- Introduction to Information Retrieval by C. D. Manning, P. Raghavan and H. Schütze (in preparation)
- Lecture slides & notes
- Additional study material & Web links
Course Lectures Overview

http://tech.groups.yahoo.com/group/csc483/
IR introduction
1. IR Models
2. IR Query Languages & Operations
3. Searcher Feedback
IR research issues
Applications of IR
4. Language Modeling for IR
8. Multimedia IR
6. Semantics in IR
5. Search Engines
9. Structured Content
IR Related Areas
- Database Management
- Library and Information Science
- Artificial Intelligence
- Natural Language Processing
- Machine Learning
IR Related Areas
Database Management
- structured data in relational tables vs. free-form text
- well-defined queries in a formal language (SQL)
- recent move towards semi-structured data (XML)
Library and Information Science
- user aspects of IR
- categorization of human knowledge
- citation analysis
- bibliometrics (structure of information)
- recent work on digital libraries
Artificial Intelligence (AI)
- knowledge representation, reasoning, formalisms, e.g. first-order predicate logic, Bayesian networks
- recent work on Web ontologies and Intelligent Information Agents
IR Related Areas
Natural Language Processing (NLP)
- syntactic, semantic, and pragmatic analysis of text & discourse
- retrieval based on meaning rather than keywords
- analyzing the syntax (phrase structure) and semantics
- determining the sense of ambiguous words (context-based)
- identifying specific pieces of information in a document
- answering specific NL questions
- recent work in GATE (General Architecture for Text Engineering, http://gate.ac.uk/)
Machine Learning (ML)
- computational systems that improve their performance based on experience
- automated classification of examples (supervised learning)
- automated clustering of examples (unsupervised learning)
IR is not databases
Current situation: The Information Age

Main task: Information retrieval
- increasing amount of information
- understand and manage various information
- precision
- distributed information repositories
- complex user goals
- dynamic user demands
- customization demand
- speed
Information Retrieval
Definition
Information Retrieval (IR) is the task of finding relevant texts within a large amount of unstructured data.
- Relevant = texts matching some specific criteria.
- Examples of IR tasks: searching for emails from a given person, searching the Internet for an event that occurred on a given date, etc.
- Examples of IR systems: WWW search engines, specialized search engines (laws, medical documents), etc.
- NB: Database Management Systems (DBMS) are different from IR systems (data stored in a DB are structured!)

- The goal of IR is to retrieve all and only the relevant documents in a collection for a particular user with a particular need for information.
- Relevance is a central concept in IR theory.
- How does an IR system work when the collection is all documents available on the Web?
- Web search engines are stress-testing the traditional IR models.
The goal is to search large document collections (millions of documents) to retrieve small subsets relevant to the user's information need.
Examples are:
- Internet search engines
- Digital library catalogues
Some application areas within IR:
- Cross language retrieval
- Speech/broadcast retrieval
- Text categorization
- Text summarization
Subject to objective testing and evaluation:
- hundreds of queries
- millions of documents (the TREC set and conference)
Information Retrieval IR in general ...

IR is the discipline that deals with the
- retrieval
- representation
- storage
- organization
- access
of structured, semi-structured and unstructured data (information objects) in response to a query (topic statement), which may be
- structured (e.g. Boolean expression)
- unstructured (e.g. sentence, document)
Information Retrieval in other words

The process of applying algorithms over unstructured, semi-structured or structured data in order to satisfy a given (explicit) information query.
Efficiency with respect to:
- algorithms
- query building
- data organization/structure
Information Retrieval and in other words
[Figure: four components and the questions each raises]
- Content model (CM): what attributes, what structure, what rules, how to build
- Data: how to organize, what structures, what data
- Algorithm: optimal, what CM attributes
- Query: how to build, what CM attributes
Data vs. Information Retrieval

Data retrieval
- which docs contain a set of keywords?
- well-defined semantics
- a single erroneous object implies failure!
Information retrieval
- information about a subject or topic
- semantics is frequently loose
- small errors are tolerated
IR system
- interprets contents of information items
- generates a ranking which reflects relevance
- notion of relevance is most important
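The contrast above can be illustrated with a minimal sketch. The toy collection, the function names `data_retrieval` and `information_retrieval`, and the overlap-count score are all assumptions made for illustration; they are not taken from the lecture.

```python
# Toy collection (hypothetical documents, invented for this sketch).
docs = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "database systems answer exact queries over structured data",
    "d3": "web search engines rank pages for a user query",
}

def data_retrieval(keywords, docs):
    """Exact match: a document qualifies only if it contains every keyword."""
    wanted = set(keywords)
    return sorted(d for d, text in docs.items() if wanted <= set(text.split()))

def information_retrieval(query, docs):
    """Loose match: rank documents by how many query terms they share."""
    terms = set(query.split())
    scored = [(len(terms & set(text.split())), d) for d, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in ranked if score > 0]

exact = data_retrieval(["rank", "documents", "query"], docs)
ranked = information_retrieval("rank documents query", docs)
```

Here the exact-match call returns nothing because no single document holds all three keywords, while the ranked call still returns the partially matching documents, best match first.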
IR Systems
An IR system must:
- interpret contents of information objects
- generate a ranking which reflects relevance
User Query -> IR System -> Ranked list of documents
IR Systems
Information Need
IR system:
- disclosure for a collection O of n information objects
- the user is interested in some of the information objects
- interest is modeled as a partial order on the collection: a set of relevant and a set of irrelevant documents
- the system produces a (total) ordering resembling the user's interest
- comparative model to the user's information need
The Information Need has qualitative and quantitative aspects and is expressed in a query.
Basic Concepts
Basic Concepts:
The User Task
Pull actions: the user requests information in an interactive manner.
Push actions: software agents push the information towards the users.
Basic Concepts:
The User Task
Document
A single unit of information:
- typically text in a digital form, but also other media
- a complete logical unit (e.g. book, article)
- a part of a larger text (e.g. passage, section, entry in a dictionary)
- any physical unit (e.g. file, email, web page)
The Standard Retrieval Interaction Model
Standard Model of IR
Assumptions:
- The goal is maximizing precision and recall simultaneously
- The information need remains static
- The value is in the resulting document set
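The slides use precision and recall without defining them. Under the usual definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved), they can be computed as in the sketch below. The function name and the example document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    An empty denominator is reported as 0.0 (a convention chosen for this sketch).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and relevance judgments.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Retrieving more documents tends to raise recall but lower precision, which is why maximizing both simultaneously is stated as a goal rather than a given.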
Problems with Standard Model

Users learn during the search process:
- Scanning titles of retrieved documents
- Reading retrieved documents
- Viewing lists of related topics/thesaurus terms
- Navigating hyperlinks
Some users don't like long (apparently) disorganized lists of documents
IR is an Iterative Process
Goals
Repositories
Workspace
IR is a Dialog
- The exchange doesn't end with the first answer
- Users can recognize elements of a useful answer, even when incomplete
- Questions and understanding change as the process continues

Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries.
This set of assumptions underlies the field of Information Retrieval.
The Retrieval Process ...
[Figure: the retrieval process. Through the User Interface the user specifies the user need as text. Text Operations define the logical view of both the query text and the documents. Query Operations generate the query. Indexing, through the DB Manager Module, builds the inverted file (Index) over the Text Database. Searching uses the Index to return the retrieved docs, and Ranking produces the ranked docs. User feedback can change the query.]
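The indexing and searching steps of the retrieval process can be sketched in a few lines. This is a simplified illustration under stated assumptions: `text_operations` only lowercases and strips punctuation, the index maps each term to a set of document ids, and a query matches documents that contain all of its terms. The collection contents are invented.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?"

def text_operations(text):
    """Logical view of a text: lowercase tokens with edge punctuation removed."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]

def build_index(collection):
    """Indexing: build an inverted file mapping each term to the ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching: ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in text_operations(query)]
    return set.intersection(*postings) if postings else set()

# Hypothetical text database.
collection = {
    "d1": "Web search engines.",
    "d2": "Search the digital library",
    "d3": "Library catalogues",
}
index = build_index(collection)
hits = search(index, "digital library")
```

The same `text_operations` function is applied to documents and queries so that both share one logical view; ranking the retrieved set is a separate step.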

[Figure: documents are represented by index terms and the information need by a query; matching the query against the index terms produces the ranking.]

Ordering of retrieved documents reflects their relevance to the user query.
Fundamental premises for relevance:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance
Each set of premises leads to a distinct IR model.

- Matching index terms is quite imprecise
- Users are frequently dissatisfied
- Users have no training in query formation
- Frequent dissatisfaction of Web users
- Relevance is critical for IR systems: ranking
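As one illustration of the "sharing of weighted terms" premise, the sketch below ranks documents by cosine similarity between tf-idf weighted term vectors. The tf-idf weighting and the cosine measure are assumptions chosen for this example (the slides do not name a weighting scheme), and the documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return ids of documents with a positive score, most similar first."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    q = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
    scored = [(cosine(q, vec), d) for d, vec in vectors.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in ordered if score > 0]

# Hypothetical collection.
docs = {
    "d1": "text retrieval models",
    "d2": "image retrieval",
    "d3": "database systems",
}
ranking = rank("text retrieval", docs)
```

A document sharing only the common term "retrieval" still appears in the ranking, but below the document that also shares the rarer term "text".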
IR Taxonomy
User Task
- Retrieval: Ad hoc, Filtering
- Browsing
Classic Models: Boolean, Vector, Probabilistic
- Set Theoretic: Fuzzy, Extended Boolean
- Algebraic: Generalized Vector, Latent Semantic Index, Neural Networks
- Probabilistic: Inference Network, Belief Network
Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext
IR Taxonomy: Ad Hoc Retrieval

New queries are submitted to the system while the collection remains relatively static.
[Figure: queries Q1 to Q5 posed against a collection of fixed size]
- a person having a need for information
- a set of information objects to satisfy the need
- models to formalize the information need
- stable (fixed) info collection
- user interest is valid during some period of time
- the query only expresses the information need at some point in time
IR Taxonomy: Filtering
Queries remain relatively static while new documents come into the system.
[Figure: a documents stream is matched against the User 1 and User 2 profiles, yielding the docs filtered for User 1 and for User 2]
- (continuous) stream of documents, e.g. newsgroups
- a decision is made for each document
- no preprocessing of all documents
- universal contents description, or incrementally when new documents arrive (e.g. WWW)
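Filtering can be sketched as a loop over an incoming stream in which each document is decided on as it arrives, against profiles that stay fixed. The profile representation (a set of terms per user), the threshold rule, and the sample stream are assumptions made for this illustration.

```python
def filter_stream(stream, profiles, threshold=1):
    """Deliver each incoming document to every user whose profile shares
    at least `threshold` terms with it. Documents are handled one at a
    time, so the stream never has to be preprocessed as a whole."""
    delivered = {user: [] for user in profiles}
    for doc_id, text in stream:
        words = set(text.lower().split())
        for user, profile in profiles.items():
            if len(words & profile) >= threshold:
                delivered[user].append(doc_id)
    return delivered

# Hypothetical user profiles and document stream.
profiles = {
    "user1": {"retrieval", "search"},
    "user2": {"football"},
}
stream = [
    ("n1", "New search engine released"),
    ("n2", "Football results"),
    ("n3", "Weather report"),
]
delivered = filter_stream(stream, profiles)
```

This mirrors ad hoc retrieval with the roles swapped: the profiles play the part of standing queries, and the documents are what changes.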
