
Multilingual Information Retrieval

by T.Mehbub Basha

Overview
Introduction
Document Preprocessing
Monolingual Information Retrieval

Introduction
Information Retrieval (IR):
IR is concerned with satisfying the information needs of users. Example: for large collections such as documents or World Wide Web (WWW) websites, efficient approaches are required to retrieve the subsets relevant to specific information needs.

As the number of information items constantly increases, the retrieval techniques applied to Web search must be adapted to these new scenarios.

Why do we need multilingual retrieval?

Websites, social networks and personal emails are written in different languages (English-language users account for 27.3% of Internet users; statistics last accessed November 16, 2010).

People from different nations and languages are connected in social networks.

Internet usage statistics as presented in Figure 1.1 show that only about one fourth of Internet users are native English speakers.

Figure 1.1: Statistics of the number of Internet users by language

Cont.
Many information retrieval approaches are based on Machine Translation (MT) systems. However, these systems still have high error rates (for example, errors in grammar and meaning).

This motivates the development of multilingual retrieval methods that do not depend on MT, or that are at least able to compensate for errors introduced by the translation systems.

DEFINITION OF INFORMATION RETRIEVAL:
Given a collection D containing information items d_i and a keyword query q representing an information need, IR is defined as the task of retrieving a ranked list of information items d_1, d_2, ... sorted by their relevance with respect to the specified information need.

The overall search process is visualized in Figure II.1. This process consists of two parts.

1. Indexing part
2. Search part

The indexing part processes the entire document collection to build index structures. Each document is thereby preprocessed and mapped to a vector representation.

The search part is based on the same preprocessing step, which is also applied to the query. Using the vector representation of the query, the matching algorithm determines relevant documents, which are then returned as ranked results.
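
The two parts can be illustrated with a minimal sketch. The slides do not specify the term weighting or the matching algorithm, so this example assumes plain term-frequency vectors and cosine similarity; the function names and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def preprocess(text):
    # Assumed preprocessing: lowercase and split on whitespace.
    return text.lower().split()


def to_vector(text):
    # Map a text to a sparse term-frequency vector (term -> count).
    return Counter(preprocess(text))


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_index(collection):
    # Indexing part: preprocess every document once and store its vector.
    return {doc_id: to_vector(text) for doc_id, text in collection.items()}


def search(index, query):
    # Search part: apply the same preprocessing to the query, score every
    # document, and return the matching ones sorted by decreasing score.
    q = to_vector(query)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in index.items()]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


# Hypothetical sample collection.
collection = {
    "d1": "multilingual information retrieval",
    "d2": "machine translation systems",
    "d3": "information retrieval on the web",
}
index = build_index(collection)
results = search(index, "information retrieval")
```

Here `results` lists d1 before d3, because d1 is shorter and shares the same two query terms; d2 shares no term with the query and is left out. A real system would typically use an inverted index rather than scoring every document.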

In the monolingual case, the content of the information items d_i and the keyword query q are written in the same language.

In Cross-lingual and Multilingual IR, the information need and the corresponding query of the user may be formulated in languages other than the one in which the documents are written.

Document Preprocessing

Preprocessing takes a set of raw documents as input and produces a set of tokens as output.

Depending on language, script and other factors, the process for identifying terms can differ substantially.

For Western European languages, the terms used in IR systems are often defined by the words of these languages. In Chinese, however, words are not separated by whitespace, so character sequences are used as terms instead, which avoids the problem of detecting word borders.
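
As a rough illustration of using character sequences as terms, the sketch below produces overlapping character n-grams. The window size n=2 and the function name `char_ngrams` are assumptions made for this example; the slides only say that character sequences are used.

```python
def char_ngrams(text, n=2):
    # Drop whitespace, then slide a window of n characters over the text.
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Text shorter than the window: return it as a single term.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# A six-character phrase yields five overlapping bigrams,
# without any knowledge of where the word borders are.
tokens = char_ngrams("信息检索系统", n=2)
```

Because the windows overlap, every true word of length n appears among the terms, at the cost of also indexing sequences that straddle word borders.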

Common techniques used for document preprocessing cover document syntax, encoding, tokenization and normalization of tokens.
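
A minimal sketch of how these four steps might be chained is shown below. Every concrete choice here is an assumption for illustration: UTF-8 as the encoding, HTML-like tags as the document syntax, a word-character regular expression as the tokenizer, and NFKC plus case folding as the normalization.

```python
import re
import unicodedata


def decode(raw, encoding="utf-8"):
    # Encoding: turn raw bytes into text; undecodable bytes are replaced.
    return raw.decode(encoding, errors="replace")


def strip_markup(text):
    # Document syntax: drop anything that looks like an HTML/XML tag.
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    # Tokenization: runs of word characters (letters, digits, underscore).
    return re.findall(r"\w+", text)


def normalize(token):
    # Normalization: Unicode NFKC form plus case folding.
    return unicodedata.normalize("NFKC", token).casefold()


def preprocess(raw):
    # Raw bytes in, list of normalized tokens out.
    text = strip_markup(decode(raw))
    return [normalize(tok) for tok in tokenize(text)]


tokens = preprocess("<p>Information Retrieval</p> für Nutzer".encode("utf-8"))
```

The whitespace-and-punctuation tokenizer used here only suits languages that separate words with spaces; for scripts such as Chinese it would have to be swapped for a character-sequence approach.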
