
Contents

1. Introduction
   1.1 Project Overview
   1.2 Problem Definition
2. Requirement Analysis
   2.1 System Specification
3. Related Works
4. System Design
   4.1 Function Specification
5. Implementation
6. Data Flow Diagram
7. Screenshot
8. Future Enhancement
9. Conclusion
10. Bibliography

1. INTRODUCTION

With the continuing growth of the World Wide Web and online text collections, it has become increasingly important to provide improved mechanisms for finding information quickly. Conventional IR systems rank and present documents based on their measured relevance to the user query. Unfortunately, not all documents retrieved by the system are likely to be of interest to the user. Presenting the user with summaries of the matching documents can help the user identify which documents are most relevant to his or her needs. Such a summary can either be a generic summary, which gives an overall sense of the document's content, or a query-relevant summary, which presents the content most closely related to the initial search query. This project deals with text summarization, whose goal is to produce a shorter version of a source text while still retaining its main semantic content. Research in this field is flourishing (Mani, 2001; Minel, 2004; NIST, 2005); it is motivated by the increasing size and availability of digital documents and the necessity for more efficient methods of information retrieval and assimilation. Automatic summarization involves reducing a text document, or a larger corpus of multiple documents, into a short set of words or a paragraph that conveys the main meaning of the text. Extractive methods work by selecting a subset of the existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary closer to what a human might generate. Such a summary may contain words not explicitly present in the original. The state-of-the-art abstractive methods are still quite weak, so most research has focused on extractive methods.

1.1 PROJECT OVERVIEW


Our program essentially works on the following logic:

a. WORD SCORING:

1. Stop Words: These are insignificant words so commonly used in the English language that no text can be created without them. They provide no real idea of the textual theme and have therefore been neglected while scoring sentences. E.g. I, a, an, of, am, the, et cetera.

2. Cue Words: These are words usually used in the concluding sentences of a text, making sentences that contain them crucial for any given summary. Cue words provide closure to a given matter and have therefore been given prime importance while scoring sentences. E.g. thus, hence, summary, conclusion, et cetera.

3. Basic Dictionary Words: 850 words of the English language have been defined as the most frequently used words that add meaning to a sentence. These words form the backbone of our algorithm and have been vital in the creation of a sensible summary. We have hence given these words moderate importance while scoring sentences.

4. Proper Nouns: Proper nouns in most cases form the central theme of a given text. Although identifying proper nouns without the use of linguistic methods was difficult, we have been successful in identifying them in most cases. Proper nouns provide semantics to the summary and have therefore been given high importance while scoring sentences.

5. Keywords: The user has been given the option to generate a summary that contains a particular word, the keyword. Though this is greatly limited by the absence of NLP, we have tried our best to produce results.

6. Word Frequency: Once basic scores have been allotted to words, their final score is calculated on the basis of their frequency of occurrence in the document. Words that are repeated more frequently than others carry a more profound impression of the context and have therefore been given higher importance (see the sketch below).
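The following is a minimal Java sketch of the word-scoring scheme just described. The weight values, the proper-noun heuristic, and the word sets are illustrative assumptions, not the project's actual constants.

import java.util.Map;
import java.util.Set;

// Sketch of the word-scoring scheme: stop words score nothing, cue words
// score highest, proper nouns high, Basic English words moderately.
public class WordScorer {
    private final Set<String> stopWords;   // e.g. "i", "a", "an", "of", "the"
    private final Set<String> cueWords;    // e.g. "thus", "hence", "conclusion"
    private final Set<String> basicWords;  // the 850 Basic English words

    public WordScorer(Set<String> stop, Set<String> cue, Set<String> basic) {
        this.stopWords = stop;
        this.cueWords = cue;
        this.basicWords = basic;
    }

    // Base score for a single word, before frequency weighting.
    public double baseScore(String word, boolean startsSentence) {
        String w = word.toLowerCase();
        if (stopWords.contains(w)) return 0.0;   // neglected entirely
        if (cueWords.contains(w))  return 3.0;   // prime importance
        // Crude proper-noun test: capitalized but not sentence-initial.
        if (!startsSentence && Character.isUpperCase(word.charAt(0))) return 2.5;
        if (basicWords.contains(w)) return 1.0;  // moderate importance
        return 0.5;                              // any other word
    }

    // Final score: base score weighted by document frequency (item 6).
    public double finalScore(String word, boolean startsSentence,
                             Map<String, Integer> frequency) {
        Integer f = frequency.get(word.toLowerCase());
        return baseScore(word, startsSentence) * ((f == null) ? 1 : f);
    }
}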

b. SENTENCE SCORING:

1. Primary Score: Using the above methods, a final word score is calculated, and the sum of the word scores gives a sentence score. This gives long sentences a clear advantage over their shorter counterparts, which are not necessarily of lesser importance.

2. Final Score: By multiplying the score so obtained by the ratio (average sentence length / current sentence length), the above drawback can be nullified to a large extent, and a final sentence score is obtained.
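A minimal sketch of this two-step sentence scoring, with the method and parameter names chosen here for illustration:

// wordScores holds the scores of the words in one sentence; avgLength is
// the average sentence length (in words) over the whole document.
public static double sentenceScore(double[] wordScores, double avgLength) {
    double primary = 0.0;               // 1. primary score: sum of word scores
    for (double s : wordScores) {
        primary += s;
    }
    // 2. final score: scale by (average length / current length) so that
    // long sentences lose their built-in advantage.
    return primary * (avgLength / wordScores.length);
}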

The most noteworthy aspect has been the successful merger of frequency-based and definition-based categorization of words into one efficient algorithm, generating as complete a summary as possible for a given sensible text.

1.2 PROBLEM DEFINITION


The auto summarization tool requires the following:

- A text file, given as input; the summary is generated from this file.
- WordNet, which is used to stem the words in the input file.
- A file containing the Basic dictionary words.
- Lists of stop words and cue words.

2. REQUIREMENT ANALYSIS
This phase deals with understanding the problem, the goals, and the constraints. Requirement analysis starts with some general statement of need or a high-level problem statement. During analysis, the problem domain and the environment are modeled in an effort to understand the system's behavior, its constraints, its inputs, and its outputs. The basic purpose of the activity is to obtain a thorough understanding of what the software needs to provide. This understanding of the requirements leads to the requirement specification. The analysis produces a large amount of information and knowledge, possibly with redundancies; properly organizing and describing the requirements is therefore an important goal of this activity.

2.1 SYSTEM SPECIFICATIONS

HARDWARE SPECIFICATIONS
Processor: any processor of the present generation.
Main memory: 256 MB and above.
Operating system: Windows or Linux.

SOFTWARE SPECIFICATIONS
Programming environment: Java JDK 1.6.0_03.
Package: WordNet 2.1.

3. RELATED WORKS

Extracting


Text extracts are produced by identifying interesting sentences in the source document and simply joining them to produce what is hoped to be a legible summary. Various methods exist to determine which sentences should be extracted, and a number of commercial systems using them are available (Copernic Summarizer, Microsoft Word Summarizer, Pertinence Summarizer, etc.). Extracting is a simple, robust method, but it suffers from a number of problems. The one we focus on is the fact that extracted sentences may be wordy and may not quite reach the goal of summarizing the document sufficiently.

Abstracting
An abstract is a summary at least some of whose material is not present in the input (Mani, 2001:129). An abstract may be produced by reducing sentences from the source text, joining sentence fragments, generalizing, etc. This method has greater potential for producing readable summaries. Although the most ambitious abstracting methods require a full analysis of the input text, much previous work has relied on limited analysis, for instance information extraction templates (Rau et al., 1989; Paice and Jones, 1993; McKeown and Radev, 1995), rhetorical structure trees (Marcu, 1996, 1999), and a comparison of a noisy-channel and a decision-tree approach (Knight and Marcu, 2002). Some researchers have tried to identify linguistic text reduction techniques that preserve meaning (Jing and McKeown, 1999; Saggion and Lapalme, 2000). These techniques vary considerably, and some are much harder to implement than others. However, all require a fairly good syntactic analysis of the source text. This implies having a wide-coverage grammar, a robust parser, and generation techniques that defy most existing systems.

4. SYSTEM DESIGN

There are many factors to be considered while making summaries. In our summarizer we have basically considered:

1. Term-Frequency model (TF)
2. Cue-Phrase model
3. Position-Based model

Term-Frequency model:
This model relies on the fact that words which are important to the document will appear many times in it. The method traverses the document and counts the number of occurrences of each word. But just adding the scores would not be correct: since sentences can be of any size, a sentence which contains more words (though not important enough) may get a higher rank than a sentence which is important with respect to the document but has fewer words. This condition is unacceptable. The solution we have used is to divide by the number of words, so we compute the rank of each sentence per word. An important pre-processing step is that we stem each word before calculating the TF. This matters because an input document on summarization may contain words like summary, summarize, summarizer, and summaries, whose root word is always summary; stemming therefore increases the TF of the word summary. For stemming, the WordNet software is used: given a word as input, it returns the stemmed word as output. For example, the stemmed output of summarizing is summary.
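A minimal sketch of this step, where stem() is a placeholder for the project's WordNet lookup (shown here as an identity function so the sketch compiles):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Term-Frequency step: every word is stemmed before counting, so
// "summarize", "summaries" and "summary" all increment the same entry.
public class TermFrequency {

    public static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String word : words) {
            String root = stem(word.toLowerCase());
            Integer old = tf.get(root);
            tf.put(root, (old == null) ? 1 : old + 1);
        }
        return tf;
    }

    // Per-word rank: total TF of the sentence divided by its word count, so
    // merely long sentences do not outrank short but important ones.
    public static double sentenceRank(List<String> sentence, Map<String, Integer> tf) {
        double sum = 0.0;
        for (String w : sentence) {
            Integer f = tf.get(stem(w.toLowerCase()));
            sum += (f == null) ? 0 : f;
        }
        return sum / sentence.size();
    }

    // Placeholder: the real project delegates this to WordNet.
    private static String stem(String word) {
        return word;
    }
}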

Cue-Phrase Model:

This model is very important since it enhances the effect of the Term-Frequency model. Cue phrases are categories of phrases such as {the most important, hence we conclude, in this paper we show, etc.} whose presence automatically implies that the sentences containing them should receive higher ranks. The approach we have used is to check for any cue phrases occurring in the document; if a sentence contains any of the phrases, it is assigned a very high rank so that the probability of the sentence occurring in the final summary increases.
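A minimal sketch of the cue-phrase check; the phrase list and the boost value are illustrative assumptions:

// Any sentence containing a cue phrase gets a large rank boost.
public class CuePhraseModel {
    static final String[] CUE_PHRASES = {
        "the most important", "hence we conclude", "in this paper we show"
    };

    public static double cuePhraseBoost(String sentence) {
        String s = sentence.toLowerCase();
        for (String phrase : CUE_PHRASES) {
            if (s.contains(phrase)) {
                return 10.0;   // very high rank for cue-phrase sentences
            }
        }
        return 0.0;
    }
}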

Position-Based model:

This model relies on the fact that the opening sentence is, in most cases, an important sentence of the document. For example, in newspaper articles the first sentence is usually the most important sentence of the article. This model is hence named the lead method, since it assigns higher ranks to the initial sentences.
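A minimal sketch of the lead method; the linear decay is an illustrative assumption, as the report only states that initial sentences rank higher:

// Position-Based ("lead") model: earlier sentences get higher weights.
public class PositionModel {
    public static double positionWeight(int sentenceIndex, int totalSentences) {
        // Index 0 (the lead sentence) gets weight 1.0, the last almost 0.
        return 1.0 - (double) sentenceIndex / totalSentences;
    }
}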


4.1 FUNCTION SPECIFICATIONS

The text summarizer is implemented in Java. The input is a text document and the output is the summary of the input document. There are several classes, each with its own specific task.

Input: A text document
Output: The summary of the input document

The various classes are described below.

Text: A plain text is given as input.

Sentence:


It is a derived class of the Text class. Its functions are sentenceseparator, sentencestopwordeliminator, and sentenceposition.

Sentenceseparator(): This function takes the input document and produces a file which contains each of the sentences separated by newlines.

SentenceStopwordEliminator(): This has the same functionality as the word-level stop-word eliminator, but it removes the stop words from whole sentences. It takes its input from the output of Sentenceseparator. Removing stop words from sentences is important for calculating the Term Frequencies (TF).

Sentenceposition(): Ranks the sentences based on their position in the text. A sentence is ranked higher if it is the starting or ending sentence of a paragraph.

Word: It is also a derived class of the Text class. Its functions are wordseparator, stopwordeliminator, wordstemmer, wordfrequency, and keyword. The WordNet interface is also attached to this class.

Wordseparator(): This function takes the input document and produces a file with each word separated by a newline. Pattern matching with regular expressions helps to split a sentence into words: a regular expression can match white space or commas so that words can be separated from a sentence. (A sketch of both separators follows.)
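A minimal sketch of the two separator functions; the exact regular expressions used by the project are not given in the report, so these patterns are plausible assumptions:

import java.util.Arrays;
import java.util.List;

public class Separators {

    // Roughly split text into sentences at '.', '!' or '?' followed by space.
    public static List<String> sentenceSeparator(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+"));
    }

    // Split a sentence into words at whitespace and commas.
    public static List<String> wordSeparator(String sentence) {
        return Arrays.asList(sentence.split("[\\s,]+"));
    }
}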


StopwordEliminator(): This function takes as input the word-separated file (the output of wordseparator) and removes stop words such as a, an, the, etc., which occur very frequently but are of no importance. It checks whether each word is a stop word; if yes, the word is tagged stop=1 so that it need not be processed by the other modules, otherwise it is tagged stop=0 and passed on.

WordStemmer(): The words are stemmed so that the Term Frequency of similar words increases. This function takes its input from the output produced by SentenceStopwordEliminator, stems each word, and stores the stem as a tag for that word. A stemming algorithm is a process of linguistic normalization in which the variant forms of a word are reduced to a common form. We include the WordNet dictionary to find the stems of the words.

WordFrequency(): The Term Frequency (TF) of each word is determined. We create a hash map of all terms and calculate the frequency of each term in a single traversal. Min-max thresholds can be set for the frequencies. The word frequency is calculated as the number of occurrences of the (stemmed) term in the document.
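Below is a minimal sketch of how WordNet stemming might be invoked through the MIT Java WordNet Interface (JWI) listed in the bibliography. The dictionary path and the use of JWI's WordnetStemmer are assumptions; the report does not show the actual interface code.

import java.io.File;
import java.util.List;
import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.POS;
import edu.mit.jwi.morph.WordnetStemmer;

public class StemDemo {
    public static void main(String[] args) throws Exception {
        // Assumed installation path of the WordNet 2.1 dictionary files.
        IDictionary dict = new Dictionary(new File("WordNet-2.1/dict"));
        dict.open();
        WordnetStemmer stemmer = new WordnetStemmer(dict);
        // findStems returns the root forms known to WordNet,
        // e.g. "summaries" -> [summary].
        List<String> stems = stemmer.findStems("summaries", POS.NOUN);
        System.out.println(stems);
        dict.close();
    }
}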

Keyword(): Checks whether the word matches any keyword or pattern, and assigns a rank to the tag based on the keyword. These keywords, along with their ranks, are stored in the Keywords table.


Scoring: This class inherits from both the Sentence and Word classes. Its functions are wordscore, sentencescore, and sentenceselector.

Wordscore(): Since the ranks are assigned in user-defined tags, we have actually converted the given input document to XML so that we can use the XML APIs for easy manipulation of the ranks. This module thus uses the XML APIs for Java and integrates the ranks from the different modules.
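A minimal sketch of this XML conversion using Java's standard DOM API; the element and attribute names are assumptions, since the report does not give the actual schema:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Every word becomes an element whose attributes carry the module-assigned tags.
public class XmlRanks {
    public static Document toXml(String[] words, double[] ranks) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        Element root = doc.createElement("document");
        doc.appendChild(root);
        for (int i = 0; i < words.length; i++) {
            Element w = doc.createElement("word");
            w.setTextContent(words[i]);
            w.setAttribute("rank", String.valueOf(ranks[i])); // integrated rank
            root.appendChild(w);
        }
        return doc;
    }
}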

Sentencescore(): Scores are given to the sentences. Primary score: using the above methods, a final word score is calculated, and the sum of the word scores gives a sentence score. This gives long sentences a clear advantage over their shorter counterparts, which are not necessarily of lesser importance. Final score: by multiplying the score so obtained by the ratio (average sentence length / current sentence length), the above drawback can be nullified to a large extent, and a final sentence score is obtained.

Sentenceselector(): Based on the ranks assigned by the different modules, the most relevant sentences are selected. Sentences are taken in decreasing order of rank and given as output.
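A minimal sketch of the selector; the summary-size parameter is an illustrative assumption:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sort sentences by final score, decreasing, and keep the top-ranked ones.
public class SentenceSelector {
    public static List<String> select(Map<String, Double> scores, int howMany) {
        List<Map.Entry<String, Double>> ranked =
                new ArrayList<Map.Entry<String, Double>>(scores.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return b.getValue().compareTo(a.getValue()); // decreasing rank
            }
        });
        List<String> summary = new ArrayList<String>();
        for (int i = 0; i < Math.min(howMany, ranked.size()); i++) {
            summary.add(ranked.get(i).getKey());
        }
        return summary;
    }
}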


5. IMPLEMENTATION

We have implemented an Automatic Text Summarizer. The coding language used is Java. The input is a plain text document and the output is the summary of the input document. We have 11 source files, each with its own specific task. We briefly describe the role played by each source file.

Input: A plain text document.
Output: The summary of the input document.

sentenceSeparator.java: This file takes the input document and produces a file which contains each of the sentences separated by newlines. Separating the sentences is important for the later stages.

sentenceSeparatorforoutput.java: This file also takes the input document and produces a file which contains each of the sentences separated by newlines. This output file is helpful for printing the final summary.

wordSeparator.java: This file also takes the input document and produces a file with each word separated by a newline.


stopWordEliminator.java: This file takes as input the word-separated file (the output of wordSeparator.java) and removes stop words such as a, an, the, etc., which occur very frequently but are of no importance.

blankRemover.java: This file mainly does the job of pre-processing, as it removes any blanks produced after removing the stop words.

sentenceStopwordEliminator.java: This has the same functionality as the previous file, but it removes the stop words from sentences. It takes its input from the output of sentenceSeparator.java. Removing stop words from sentences is important for calculating the Term Frequencies (TF).

sentenceStemmer.java: This is an important part of our project. We stem each word so that the Term Frequency of similar words increases. This file takes its input from the output produced by sentenceStopwordEliminator.java and stems each word. We have included the WordNet dictionary to find the stems of the words.

wordFrequency.java: This is the most important source file. Here we find the Term Frequency (TF) of each term. We create a Hash Map of all the terms and calculate the frequency of each term in a single traversal.


6. DATA FLOW DIAGRAM

The input document flows through the modules as follows:

Input Document → wordSeparator.java → stopWordEliminator.java → blankRemover.java
Input Document → sentenceSeparator.java → sentenceStopwordEliminator.java → sentenceStemmer.java (using wordNetInterface.java) → wordFrequency.java
Input Document → sentenceSeparatorforoutput.java → Output Summary


7. SCREENSHOT

[Screenshots of the application; images not reproduced in this text version.]


8. FUTURE ENHANCEMENT

Abstractive summarizers are difficult to implement, but only abstractive summarizers can produce efficient summaries that are close to human expectations, so a future enhancement could use abstractive (linguistic) methods. Such a phase could be applied to the sentences picked out by the statistical summarizer. The linguistic method is more complex: it removes material and fuses related topics into more general ones. A linguistic approach understands the essence of the document and generates an exact summary. It involves Natural Language Processing and artificial neural networks.


9. CONCLUSION

We have developed a Text Summarizer that produces a summary of an input document based on statistical and linguistic methods. In the present age, electronic documents and files are gaining popularity, and a summarizer becomes very useful when such documents and files are large.


10. BIBLIOGRAPHY

Mohamed AbdelFattah and Fuji Ren. Automatic Text Summarization. World Academy of Science, Engineering and Technology, 37, 2008.

Shanmugasundaram and Rengaramanujam. Investigations in Single Document Summarization by Extraction Method.

www.ics.mq.edu.au

Mani, Inderjeet (2001). Automatic Summarization.

Mark A. Finlayson. MIT Java WordNet Interface, Series 2 User's Guide.

www.wikipedia.com
www.google.com

