
Seminar

On

Multilingual Web Search and Navigation


Presented By

Dhage Manoj Madhavrao
MTech 1st Year
School of Information Technology, IIT Kharagpur.
Seminar Guide

Dr. D. Samanta.

Abstract

The Internet today holds a very large amount of information. Countries connected to the Internet may publish pages in their native languages or in a widely used language such as English. When users in a particular country search for a topic, the search engine should therefore not limit the search to the user's native language or to the language of the country in which the search engine was built. Creating a site for a global audience is also a significant challenge, because the site must be ready to answer requests in foreign languages.

Web pages can be written in any language, and a single page can contain text in several languages. Some languages are written left to right and others right to left; if a page mixes such languages, the pattern-matching algorithm must handle both directions, and special care is needed when accessing information on these pages. A user should be able to query in any language and still retrieve information contained in pages written in other languages.

With the spread of the Internet, the number of people who know basic English has probably increased, but these people may not be able to follow a complex subject of interest in English. A search engine should therefore be able to translate web pages on the user's demand. The translation of user queries and search results should be accurate enough to preserve the original theme and meaning, and the underlying heterogeneous character representations should be transparent to the user.

Keywords: Information Retrieval, Multilingual, Unicode, Mulinex.

Contents

1. Introduction
   1.1 What is Multilingual Web Search?
   1.2 Translingual information retrieval
   1.3 Defining Multilingual Information Retrieval
   1.4 How to represent each character from so many languages uniquely?
   1.5 Basic steps for providing single language search
   1.6 Basic steps in a multilingual search
2. MULINEX: Multilingual Web Search Engine
   2.1 Architecture of Mulinex
   2.2 The Mulinex system components
3. Document Acquisition
   3.1 Document acquisition steps
4. Search and Navigation
   4.1 Query Translation
   4.2 Search Server
5. Support for multilingual search in other search engines
   5.1 Google
   5.2 Alltheweb
6. Multilingual Web Navigation
7. References

1. Introduction

1.1 What is Multilingual Web Search?

Multilingual means involving multiple languages. A multilingual collection may consist of different pages in different languages, each page in a single language, or of single pages that each contain several languages. Multilingual Web Search is searching the WWW for information based on keywords, independent of the language of any page or of the query.

1.2 Translingual information retrieval:

Translingual information retrieval (TIR) consists of providing a query in one language and searching document collections in one or more languages.

1.3 Defining Multilingual Information Retrieval:

Any cross-language querying for information retrieval is multilingual information retrieval (MLIR). It can be of the following types.

1. Information retrieval on a monolingual document collection that can be queried in any language, possibly different from the language of the documents. E.g. an English document collection can be queried in German.
2. Information retrieval on a collection of documents in different languages, where the result of a query can contain documents in different languages. E.g. when a collection of English and German documents is queried in English, the result may contain pages in both languages.
3. Information retrieval on multilingual documents, in which each document may contain information in a number of languages. E.g. any typical school admission form.

In the first type of MLIR the document collection is monolingual, but the retrieval system can process queries in a number of different languages and retrieve documents across language boundaries. In the second type the document collection is multilingual; it can be searched in any of its component languages, and a single search can retrieve documents in multiple languages. The third type generalizes the definition still further, to documents that are themselves multilingual. The three types are listed in order of increasing complexity.

MLIR does not require machine translation, because the translated representations of documents or queries are used only by the machine and are never read by a human being. The entire problem of syntactic generation, and the problems associated with incorrect resolution of ambiguity, can therefore be avoided.

Any language can be represented in a computer by assigning numbers to the characters of its character set. Different countries therefore used to assign the numbers 0 to n to the character set of their native language, and any information page stores such numbers to represent the words and sentences of its language. If the search engine is not aware of the language of a document, it may misinterpret these numbers and hence the content of the document. In a multilingual document, however, each character must have a unique character code.
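The ambiguity of language-specific character codes can be illustrated with a short Python sketch. The three legacy encodings below are chosen only as examples (the text above does not name specific encodings). The same byte value decodes to a different character under each of them, while Unicode gives each of those characters its own code point.

```python
# One byte value, three legacy encodings (example choices, not from the text).
RAW = bytes([0xE4])
LEGACY_ENCODINGS = ["latin-1", "iso8859_7", "cp1251"]


def decode_under_each(raw):
    """Decode the same bytes under each legacy encoding."""
    return {name: raw.decode(name) for name in LEGACY_ENCODINGS}


def code_points(text):
    """Return the Unicode code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


decoded = decode_under_each(RAW)
for name, text in decoded.items():
    # Each encoding turns byte 0xE4 into a different character.
    print(name, text, code_points(text))
```

A search engine that guesses the wrong encoding would index the wrong character, which is why the encoding (or a common representation such as Unicode) must be known before indexing.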

1.4 How to represent each character from so many languages uniquely?

Unicode was originally designed as a 16-bit character code, able to represent 64k (65,536) characters. The standard has since been extended to a code space of more than one million code points (U+0000 to U+10FFFF), which is sufficient to cover the languages of the world together with their punctuation marks.

Before Unicode was defined, there were many different encoding systems for different languages. No single encoding could contain enough characters, and the encoding systems conflicted with one another: two encodings could use the same number for two different characters, or different numbers for the same character. Any given computer had to support many different encodings, and even then data passed between different encodings and platforms always ran the risk of corruption. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

1.5 The basic steps for providing single language search are as follows:

1. Document gathering: Gather all documents that will be searched into a database.
2. Indexing: Divide each document into words, i.e. character strings separated by commonly and consistently understood separators such as spaces. Build a list of unique words in a dictionary; here "dictionary" means an index. Repeat for all documents, with the dictionary growing as new unique words appear. Store an index of words and the documents in which they are found. Additionally store offsets from the start of the document if proximity searches are to be permitted.
3. Querying and retrieval: When a search request is received, match the requested string of characters against the dictionary and report the location of the string if it exists in the database.

1.6 Basic steps in a multilingual search:

1. Document gathering: Gather all documents that will be searched into a database.
2. Language identification: Identify the language of each document.
3. Indexing: Build a separate dictionary for each language and add each page to the dictionary of its language. Store an index of words and the documents in which they are found. Additionally store offsets from the start of the document if proximity searches are to be permitted.
4. Querying and retrieval: When a search request is received in any language, translate the requested string into each of the indexed languages, match it against all dictionaries, and report the location of the string if it exists in the database.
5. Automatic translation: Translate the retrieved pages on the user's demand.

We will now look at an example of a multilingual search engine, MULINEX. It was developed by the MULINEX Consortium, which consists of five European companies:

1. DFKI
2. Grolier Interactive
3. Bertelsmann Telemedia
4. DATAMAT
5. TRADOS

2. MULINEX: Multilingual Web Search Engine

2.1 Architecture of Mulinex:

Ref: http://mulinex.dfki.de

2.2 The Mulinex system consists of three major clusters of components:

1. The document acquisition and analysis cluster: This cluster is responsible for collecting documents from the WWW, language identification, categorisation and producing an automatic summary. Its components are:
   I. Document acquisition: This subsystem collects documents from the web and prepares them so that they can easily be used to populate the search server's database.
   II. Language Identification: This subsystem identifies the language of a given document.
   III. Document Summariser: This subsystem produces a summary of a given text.
   IV. Document Classifier: This subsystem classifies acquired documents with respect to pre-defined categories, which are maintained in a database.

2. The search and navigation cluster: This cluster is responsible for the user interface, search queries, translation of the queries, and navigation of the search results. Its components are:
   I. User Interface: This subsystem makes the system's functionality available to the user by providing the communication between the user and the system.
   II. Presentation Server: This subsystem presents the result to the user. The result page contains URLs of the web pages containing the keywords, translation links, summaries, etc.
   III. Task Manager: The task manager is the main control and co-ordination unit of the system. It receives a user request from the user interface, dispatches the request, elaborates a work plan, executes it, and constructs the top-level response that is returned to the user.
   IV. Search Server: This subsystem consists of a search manager and a search server. The search manager receives search requests from the task manager and invokes the corresponding operations.
   V. Query Translation Server: This subsystem translates a query or query terms so that the translated query can be used to retrieve documents in the corresponding target language. It encapsulates dictionaries and term translation subcomponents.
   VI. Query Expansion Server: This subsystem provides mechanisms and methods for concept-based query expansion.
   VII. Text Translation Server: This subsystem consists of a translation manager and a translation module. The translation module is a machine translation system used to translate text.

3. The personal agent cluster: This cluster is responsible for user-customized presentation and other services. Its components are:
   I. Mulinex Agent: A server-side software agent that provides advanced functionality to assist the user in finding the right information. It autonomously performs tasks such as initiating search requests on user-selected topics and informing the user when new documents are available.
   II. User Profile Server: Provides information about a registered user and encapsulates the user profile repository, which contains relevant information about registered users (user profiles, agent configuration, etc.).
   III. Information Extraction System: Extracts relevant information from documents that belong to specific categories.

3. Document Acquisition

3.1 Document acquisition consists of three steps:

1. Gathering of documents
2. Document analysis
3. Indexing

During the document collection phase, the information about each document is refined successively, step by step. The gathering step obtains the information that is specified in HTTP and HTML, such as the size, the modification time, the URL, the character encoding, and the full text of the document. All this information is encoded in a SOIF (Summary Object Interchange Format) object, which is then passed to the document analysis component.

The document analysis component analyses the content of the document to determine its language and thematic categories, and to create a document summary. All this information, along with the information from the SOIF object, is then encoded into an SQL statement, which is used to create or update a record in the Fulcrum SearchServer. The SearchServer is used as the information retrieval system; it allows documents to be retrieved according to their attributes, including a full-text search on the document content.
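The three acquisition steps, together with the per-language dictionaries of Section 1.6, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Mulinex implementation: the stop-word language guess, the function names and the example URLs are hypothetical, and an in-memory dictionary stands in for the SOIF object and the Fulcrum SearchServer record.

```python
from collections import defaultdict

PUNCT = '.,;:!?"()'

# Tiny stop-word lists for a naive language guess (an assumption of this
# sketch, not the language identifier used by Mulinex).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in"},
    "de": {"der", "die", "und", "ist", "das"},
    "fr": {"le", "la", "et", "est", "les"},
}


def tokenize(text):
    """Split text on whitespace, strip punctuation, and lower-case."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def gather(url, text, encoding="utf-8"):
    """Step 1: record the attributes known at fetch time."""
    return {"url": url, "size": len(text.encode(encoding)),
            "encoding": encoding, "text": text}


def analyse(record):
    """Step 2: refine the record with a guessed language."""
    words = set(tokenize(record["text"]))
    record["language"] = max(
        STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))
    return record


def index(record, indexes):
    """Step 3: add each word and its offset to that language's index."""
    for offset, word in enumerate(tokenize(record["text"])):
        indexes[record["language"]][word].append((record["url"], offset))


# language -> word -> list of (url, word offset)
indexes = defaultdict(lambda: defaultdict(list))
pages = [
    ("http://example.org/en", "The fair is in the city and the hall"),
    ("http://example.org/de", "Die Messe ist in der Stadt und das ist gut"),
]
for url, text in pages:
    index(analyse(gather(url, text)), indexes)

print(sorted(indexes))
print(indexes["de"]["messe"])
```

Keeping one index per language means a query term is only matched against documents of the language it was translated into; the stored word offsets are what a proximity search would use.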

Ref: http://mulinex.dfki.de

In the Mulinex system, the summariser is used during document gathering to generate summaries of the documents, which are stored in the SearchServer database. The categoriser classifies the documents by subject, as indicated by their keywords, e.g. politics, computers, business.
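A keyword-based categoriser of the kind described above might look like the following Python sketch. The category keyword sets and the scoring rule (number of distinct matching keywords) are assumptions made for illustration; the text does not specify how the Mulinex categoriser scores documents.

```python
# Hypothetical keyword sets for the example categories named in the text.
CATEGORY_KEYWORDS = {
    "politics": {"election", "parliament", "minister", "vote"},
    "computers": {"software", "processor", "network", "internet"},
    "business": {"market", "profit", "shares", "company"},
}


def categorise(text, min_hits=1):
    """Return the categories whose keywords occur in text, best match first."""
    words = {w.strip('.,;:!?"()').lower() for w in text.split()}
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    hits = [cat for cat, score in scores.items() if score >= min_hits]
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(hits, key=lambda cat: (-scores[cat], cat))


print(categorise(
    "The company reported a profit as its software reached the market."))
```

A document may fall into several categories; raising `min_hits` makes the assignment stricter.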

4. Search and Navigation

4.1 Query Translation:

The MULINEX system accepts a user query and translates it. A translated query may carry a meaning different from the one originally intended, because each word can be translated into more than one word in the target language. This problem is solved by interaction with the user, who disambiguates the query by selecting among the alternative translations. To help users who do not understand the target language, the query assistant displays for each translation how it translates back into the original query language. As the following example for the query term "fair" shows, the back translations (right-hand column) allow the user to eliminate German translations that are irrelevant to the intended meaning, even if the user has no knowledge of German.

FAIR
  schön         beautiful, lovely
  hübsch        pretty, nice
  gerecht       just, legitimate
  anständig     proper, decent, respectable
  Messe         mass, market
  Ausstellung   exhibition, show, display

The translated queries are the input to the search in the document collection. The search is performed separately for each language, in order to avoid retrieving irrelevant documents because of accidental cross-language matches. For query translation the system uses bilingual dictionaries. Currently it supports only English, French and German.

4.2 Search Server

The following types of queries can be used for information retrieval from MULINEX.

1. + (must include): Words preceded by + must be contained in the document. Enter +recipe +cake and MULINEX will retrieve all documents containing both the word recipe and the word cake.
2. - (must not include): Words preceded by - must not be contained in the document. Enter recipe -cake and you will receive documents containing the word recipe but not cake.
3. ! (must not translate): Words preceded by ! will not be translated. This is especially useful for names and places.
4. "..." (phrase): Words enclosed in "" are understood as a phrase and are translated as a phrase. MULINEX will retrieve documents containing the phrase rather than the individual words in the phrase.
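The four query operators, combined with the back-translation display of Section 4.1, can be sketched in Python as follows. This is an illustration only: the regular expression, the function names and the two-entry dictionary (abridged from the "fair" example) are assumptions of the sketch, not part of MULINEX.

```python
import re

# Optional operator (+, - or !) followed by a quoted phrase or a single word.
TOKEN = re.compile(r'([+\-!]?)(?:"([^"]*)"|(\S+))')

# Toy English-to-German dictionary with back translations, abridged from the
# "fair" example in Section 4.1.
EN_DE = {
    "fair": {"gerecht": ["just", "legitimate"], "Messe": ["mass", "market"]},
}


def parse(query):
    """Split a query into terms tagged with the operator that applies."""
    terms = []
    for op, phrase, word in TOKEN.findall(query):
        terms.append({
            "text": phrase or word,
            "required": op == "+",    # + must include
            "excluded": op == "-",    # - must not include
            "translate": op != "!",   # ! must not translate
            "phrase": bool(phrase),   # "..." treated as one unit
        })
    return terms


def translation_choices(term):
    """Map each candidate translation to its back translations."""
    if not term["translate"]:
        return {term["text"]: []}  # kept verbatim, e.g. names and places
    return EN_DE.get(term["text"].lower(), {})


for term in parse('+fair -cake !Paris "chocolate cake"'):
    print(term["text"], term["required"], term["excluded"],
          translation_choices(term))
```

The user would pick one or more of the candidates returned by `translation_choices`, using the back translations to judge which candidates match the intended meaning.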


5. Support for multilingual search in other search engines:

5.1 Google:

Although Google is not a fully multilingual search engine, it provides language tools for cross-language information retrieval.

1. Cross Language Query: We can search for a topic in web pages written in a language different from the query language. E.g. we can search for supercomputer in pages written in Japanese.

2. Translation: Google also provides automatic language translation. E.g. suppose that we have a page in Japanese. If we want to convert this page into English, we can do so by choosing the appropriate option, as follows.


i. Original page in Japanese.

ii. Translated page in English.


5.2 Alltheweb:

Alltheweb also provides cross-language information retrieval. Taking the same example, we can search for supercomputer in pages written in Japanese.


6. Multilingual Web Navigation

6.1 For developing web pages in multiple languages we need a character representation that allows multiple character sets to be used together. Unicode can be used for this purpose.

6.2 Depending on the language, we may have to change the design and navigation of the website. E.g. if menu bars run from left to right in English, we may need to display them from right to left in Arabic. If the language is written right-to-left, like Arabic, Persian and Hebrew, the operating system must be configured so that the typing direction can be switched between left-to-right and right-to-left.

6.3 If the web site is in two or more languages, the user may be asked on the entrance page itself to choose the navigation language. Alternatively, every page can carry a link to its versions in the other languages.

6.4 Independent of the language of the multilingual website, the domain name should be in English.
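The layout decision in 6.2 can be automated by inspecting the text itself. The Python sketch below uses the standard `unicodedata.bidirectional` function, which reports the Unicode bidirectional class of a character, to decide whether a menu should be rendered left-to-right or right-to-left; the majority rule and the `menu_html` helper are assumptions made for illustration.

```python
import unicodedata


def text_direction(text):
    """Return "rtl" if right-to-left letters outnumber left-to-right ones."""
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            rtl += 1
        elif bidi == "L":         # left-to-right letters, e.g. Latin
            ltr += 1
    return "rtl" if rtl > ltr else "ltr"


def menu_html(items, sample_text):
    """Build a menu list whose dir attribute follows the page language."""
    direction = text_direction(sample_text)
    entries = "".join(f"<li>{item}</li>" for item in items)
    return f'<ul dir="{direction}">{entries}</ul>'


print(text_direction("Home page"))
print(text_direction("\u0645\u0631\u062d\u0628\u0627"))  # Arabic sample word
print(menu_html(["Home", "Search"], "Home page"))
```

Digits, spaces and punctuation are neutral or weak in the bidirectional classification and are ignored here, so a mostly Arabic page that contains a few numbers is still detected as right-to-left.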


7. References

1. Joanne Capstick, Abdel Kader Diagne, Gregor Erbach, Hans Uszkoreit, MULINEX: Multilingual Web Search and Navigation.
2. David A. Hull, Gregory Grefenstette, Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval.
3. Spectrum Business Support L., Full Text Search in Multilingual Documents.
4. www.unicode.org
5. www.google.com
6. www.alltheweb.com
7. mulinex.dfki.de

