Lecture 1: Introduction
Salah Hammami
2008
Course Goals
examine current research issues in IR
explore examples of industrial IR applications
form a broad picture of the IR field
Course Material
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, 2001
Information Retrieval: A Survey by Ed Greengrass, 2000
Information Discovery by Theo van der Weide, 2001
Introduction to Information Retrieval by C. D. Manning, P. Raghavan and H. Schütze (in preparation)
Lecture slides & notes
Additional study material & Web links
IR research issues
Applications of IR
5. Search Engines
9. Structured Content
IR Related Areas
Database Management
Library and Information Science
Artificial Intelligence
Natural Language Processing
Machine Learning
IR Related Areas
Database Management
  structured data in relational tables vs. free-form text
  well-defined queries in a formal language (SQL)
  recent move towards semi-structured data (XML)
Library and Information Science
  user aspects of IR
  categorization of human knowledge
  citation analysis
  bibliometrics (structure of information)
  recent work on digital libraries
Artificial Intelligence (AI)
  knowledge representation, reasoning, formalisms (e.g. first-order predicate logic, Bayesian networks)
  recent work on Web ontologies and Intelligent Information Agents
IR Related Areas
Natural Language Processing (NLP)
  syntactic, semantic, and pragmatic analysis of text & discourse
  retrieval based on meaning rather than keywords
  analyzing the syntax (phrase structure) and semantics
  determining the sense of ambiguous words (context-based)
  identifying specific pieces of information in a document
  answering specific NL questions
  recent work in GATE (General Architecture for Text Engineering, http://gate.ac.uk/)
Machine Learning (ML)
  computational systems that improve their performance with experience
  automated classification of examples (supervised learning)
  automated clustering of examples (unsupervised learning)
IR is not databases
Information Retrieval
Definition
Information Retrieval (IR) is the task of finding relevant texts within a large amount of unstructured data.
Relevant = texts matching some specific criteria.
Examples of IR tasks: searching for emails from a given person, searching the Internet for an event that occurred on a given date, etc.
Examples of IR systems: Web search engines, specialized search engines (laws, medical documents), etc.
NB: Database Management Systems (DBMS) are different from IR systems (data stored in a DB are structured!)
Information Retrieval
The goal of IR is to retrieve all and only the relevant documents in a collection for a particular user with a particular need for information
Relevance is a central concept in IR theory
How does an IR system work when the collection is all documents available on the Web?
Web search engines are stress-testing the traditional IR models
Information Retrieval
The goal is to search large document collections (millions of documents) to retrieve small subsets relevant to the user's information need
Examples are:
  Internet search engines
  Digital library catalogues
Some application areas within IR:
  Cross-language retrieval
  Speech/broadcast retrieval
  Text categorization
  Text summarization
Subject to objective testing and evaluation:
  hundreds of queries
  millions of documents (the TREC set and conference)
[Figure: Data, content model (CM), Algorithm, and Query. Questions posed: which CM attributes are optimal, and how to build them.]
IR Systems
[Figure: User Query and IR System]
IR Systems
Information Need
  user is interested in information objects
  interest modeled as a partial order on the collection: a set of relevant and a set of irrelevant documents
IR System
  disclosure for a collection O of n information objects
  produces a (total) ordering resembling the user's interest
  comparative model to the user's information need
The Information Need has qualitative and quantitative aspects, expressed in a query
Basic Concepts
Basic Concepts:
Pull actions
  User requests information in an interactive manner
Push actions
  Software agents push the information towards the users
Basic Concepts:
Document
single unit of information
typically text in a digital form
other media
a complete logical unit (e.g. book, article)
a part of a larger text (e.g. passage, section, entry in a dictionary)
any physical unit (e.g. file, email, web page)
Standard Model of IR
Assumptions:
  The goal is maximizing precision and recall simultaneously
  The information need remains static
  The value is in the resulting document set
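The slide names precision and recall without defining them. Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved. A minimal sketch for a single query follows; the function name and the document ids are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids returned by the system
    relevant:  ids judged relevant for the information need
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Returning more documents can only keep or raise recall but usually lowers precision, which is why maximizing both at once is stated as an assumption rather than a given.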
IR is an Iterative Process
[Figure: the iterative IR process, linking Goals, Repositories, and Workspace]
IR is a Dialog
The exchange doesn't end with the first answer
Users can recognize elements of a useful answer, even when incomplete
Questions and understanding change as the process continues
Information Retrieval
Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries
This set of assumptions underlies the field of Information Retrieval
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flows: text, logical view, query, retrieved docs, ranked docs.]
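The figure names text operations, indexing into an inverted file, and searching. A minimal sketch of those three steps follows, assuming whitespace tokenization and Boolean AND matching; the function names and sample documents are hypothetical and are not taken from the lecture.

```python
from collections import defaultdict


def tokenize(text):
    """Toy text operation: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    postings = [set(index.get(term, [])) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical three-document collection, for illustration only.
docs = {
    1: "information retrieval finds relevant documents",
    2: "database systems store structured data",
    3: "web search engines retrieve relevant web documents",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "relevant documents"))  # [1, 3]
```

The search step returns an unordered match set; a ranking step would then order those documents before they reach the user interface.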
Matching index terms is quite imprecise
Users are frequently dissatisfied
Users have no training in query formulation
Frequent dissatisfaction of Web users
Relevance is critical for IR systems: ranking
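To illustrate why ranking helps when plain term matching is imprecise, here is a minimal sketch that orders documents by a TF-IDF score. TF-IDF is a common weighting chosen here as an assumption; the slide does not name a specific scheme, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def rank(docs, query):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in tokenized.values() for term in set(terms))
    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                # Term frequency times inverse document frequency.
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    # Sort by score (highest first), breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical collection, for illustration only.
docs = {
    "d1": "web search engines rank web pages",
    "d2": "library catalogues list books",
    "d3": "search the library",
}
ranking = rank(docs, "web search")
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

A document matching only the common term "search" still appears, but below the one that also matches the rarer term "web", so an imprecise query degrades the ordering gradually instead of producing an all-or-nothing result set.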
IR Taxonomy
[Figure: taxonomy of IR models, organized by user task]
Classic Models
Set Theoretic
  Fuzzy
  Extended Boolean
Algebraic
  Generalized Vector
  Latent Semantic Index
  Neural Networks
Probabilistic
  Inference Network
  Belief Network
Structured Models
  Non-Overlapping Lists
  Proximal Nodes
Browsing
  Flat
  Structure Guided
  Hypertext
[Figure: successive queries (Q3, Q4, Q5 shown) posed against the collection]
Collection remains relatively static
a person having a need for information
a set of information objects to satisfy the need
models to formalize the information need
stable (fixed) info collection
user interest is valid during some period of time
query only expresses the information need at some point in time
IR Taxonomy: Filtering
Queries remain relatively static
Documents Stream
(continuous) stream of documents, e.g. newsgroups
decision for each document
no preprocessing of all documents
universal contents description, or incrementally when new documents arrive (e.g. WWW)
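The filtering setting can be sketched as a standing query (a user profile) applied to each document as it arrives, with no index built over the whole collection beforehand. The term-overlap test, the function names, and the sample stream below are assumptions made for illustration.

```python
def make_profile(terms):
    """A standing query: a fixed set of terms describing a long-term interest."""
    return {term.lower() for term in terms}


def filter_stream(stream, profile, min_overlap=1):
    """Yield each arriving document that shares enough terms with the profile."""
    for doc in stream:
        terms = set(doc.lower().split())
        # One decision per document, made when the document arrives.
        if len(terms & profile) >= min_overlap:
            yield doc


# Hypothetical profile and document stream, for illustration only.
profile = make_profile(["retrieval", "indexing"])
stream = iter([
    "new results on text retrieval",
    "football scores for the weekend",
    "indexing and retrieval for web pages",
])
accepted = list(filter_stream(stream, profile))
print(accepted)
```

Because `filter_stream` is a generator, it never needs the full stream in memory, which mirrors the "no preprocessing of all documents" point.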