
CE314 - Natural Language Engineering Assignment 2: Indexing for Web Search

Udo Kruschwitz 15th November 2010

Plagiarism
You are reminded that this work is for credit towards the composite mark in CE314, and that the work you submit must therefore be your own. Any material you make use of, whether it be from textbooks, the Web or any other source, must be acknowledged as a comment in the program, and the extent of the reference clearly indicated.

The Problem
A typical search engine consists of three parts:
- A crawler that collects pages over the Web
- An indexing component which processes those pages, selects the important keywords and writes them to a database
- A front end for querying the index database

A good indexing component is essential to give good answers to a user query. The main problem of the indexing step is to decide which words in the text are interesting and which ones can be ignored. Most Web search engines do not do very sophisticated language processing in order to build their indexes. They simply delete some stopwords and pass the remaining words to the query system.
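The stopword-only baseline described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed solution: the function name `index_terms` and the stopword list are assumptions made for the example, and a realistic stopword list would contain several hundred words.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}


def index_terms(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(index_terms("The crawler collects pages on the Web"))
```

Every remaining token is passed on unchanged, which is exactly the weakness the assignment asks you to improve on: the baseline makes no attempt to judge which of the surviving words are actually informative.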

The Task
Your task is to build a simple indexing component for a Web search system. Your system should take HTML pages as input, process them using the kind of techniques that we have been looking at in the module, and output an index consisting of a list of keywords. The ideal system would begin by deleting markup and replacing HTML special symbols (such as &amp;) with their ASCII equivalents (&). It would then part-of-speech tag the input, use the POS tags to decide which parts of the text to keep as keywords (e.g. only choose nouns), and apply a stemmer (or a tool for baseform reduction).

This assignment comes in stages. Marks are given for each stage. You may choose not to attempt some stages. You might also implement a system that does not strictly follow the stages but will work in the same way. The stages are as follows:

- Input/Output (10%). The system must be able to read some input (for example from a file) and produce appropriately formatted output, which could be a simple list of words.

- Deleting Markup (20%). Before the text can be analyzed it is necessary to get rid of the HTML tags. The result will be plain text. You could use finite state methods for this. Note, however, that if you simply delete all HTML tags, you will lose information such as meta tag keywords. Therefore, I strongly suggest that you use some tool to perform this task.

- Pre-processing: Sentence Splitting, Tokenization and Normalization (10%). The next step should be to transform the input text into a normal form of your choice.

- Part-of-Speech Tagging (10%). The input should be tagged with a part-of-speech tagger (e.g. OpenNLP, QTag or the Brill tagger), so that the result can then be processed in the next steps.

- Selecting Keywords (20%). One aim of your system is to identify the words or phrases in the text that are most useful for indexing purposes. Your system should remove words which are not useful, such as very frequent words or stopwords. You should develop a selection method, possibly using POS tags (e.g. nouns and noun phrases are usually good indices) in combination with statistical/frequency information.

- Stemming or Morphological Analysis (10%). Writing word stems to the database rather than words allows the various inflected forms of a word to be treated in the same way. For example, bus and busses refer to exactly the same thing even though they are different words.

- Engineering a Complete System (10%). The final system should have control over all the individual components so that there is a single call and all the above steps will be performed.
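To make the idea of a single call concrete, here is a hedged Python sketch of a reduced pipeline using only the standard library. It covers markup deletion (keeping meta tag keywords), a simple normalization, stopword removal and frequency ranking. Part-of-speech tagging and stemming are deliberately left out because they need external tools such as the taggers and stemmers named in this handout. The names `TextExtractor`, `build_index` and `STOPWORDS` are hypothetical, and the stopword list is far too short for real use.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects visible text and the content of <meta name="keywords">."""

    SKIPPED = {"script", "style"}  # element content that is not page text

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.meta_keywords = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "keywords":
                content = attributes.get("content") or ""
                self.meta_keywords.extend(
                    k.strip() for k in content.split(",") if k.strip()
                )

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def build_index(html, top_n=10):
    """Single entry point: strip markup, normalize, filter, rank by frequency."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts + parser.meta_keywords)
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    page = (
        '<html><head><meta name="keywords" content="bus, timetable"></head>'
        "<body><p>The bus &amp; the bus stop</p></body></html>"
    )
    print(build_index(page))
```

A tagger or stemmer would slot in between tokenization and counting inside `build_index`, so the caller still makes only one call. Raw frequency is the crudest possible selection method and is used here only to keep the sketch short.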

The Report
You will have noticed that the percentages above only add up to 90%. This is because one of the important aspects of the project is that your work should be well documented and your code well commented. 10% of your mark will come from this. You should submit:
- A description of your implementation: what the code does, and the software you used
- Clearly commented code
- Unedited output from a run of the code submitted using this Web page: http://news.bbc.co.uk/ (feel free to submit other runs as well, i.e. using Web pages of your own choice)
- Commented output

You may work in pairs. If you do, you only need to submit one report. Both members of a pair will get the same mark unless there is reason to do otherwise.

Software
You can implement your system either on the Linux or the Windows machines. Perl, Java, Python, C/C++, and shell scripts are good choices for this project (you may even use Prolog for some of the processing steps), but you are by no means restricted to those languages. You can use any of the software discussed in the labs, or any additional software you find appropriate. On the Windows machines, besides Perl and Java, you can use QTag (or other software installed in the labs such as Connexor, NLTK or GATE). On the Linux machines you can use shell scripts and the Brill tagger that can be accessed from the command line. If you want to use an existing stemmer, the Porter stemmer (briefly discussed in the lectures) would be a candidate. The algorithm is described in the textbook by Jurafsky and Martin. A Web site that provides Java, Perl and C implementations (as well as many others) is the following: http://www.tartarus.org/~martin/PorterStemmer/

Submission
The assignment, which counts for 20% of the overall mark, should be submitted via the electronic submission system by Friday, 17 December 2010, 11:59 (mid-day) (see the submission guidelines provided for Assignment 1). The guidelines about late assignments are explained in the handbook. The assignments will be marked by 17 January 2011.
