Kevyn Reinholt, Shelly Lukon, Patrick Juola

Software Demonstration: Machine-aided Back-of-the-book Indexer

A well-crafted index is an important
aspect of a book and greatly contributes to that book's reuse. A reader who needs information on photosynthesis will begin by eliminating unwanted resources from the library: first all texts except those relating to biology, then all texts not related to plants. Only a few books remain that might contain information on photosynthesis. A book with no index requires the reader to skim through all (or much) of the text; a poor index narrows the search only slightly; a reliable index points the reader directly to the desired information. Because an index this dependable is time-consuming and often expensive to produce, there is a great need for a way to generate an index that is both time-efficient and, most importantly, effective. The goal of this project is to create an indexer that demonstrates those qualities. The theoretical model that will help us achieve that goal consists of the following stages:

Documents → Tagger → Frequency → TF-IDF → EVD → HCA → WSD

The first section, Documents, allows the user to select a document for indexing. Currently the program handles only plain-text documents, but it will expand to Microsoft Word, Adobe PDF, LaTeX, and XML formats. The text will appear on the screen if the user chooses (which is recommended, to ensure that the correct draft of the text was selected).

Next, Tagger makes use of the Stanford POS (part-of-speech) tagger (Toutanova et al., 2003). The text retrieved in Documents is divided into a list of paragraphs and then run through the tagger, which attempts to attribute a part of speech to each word by examining how it is used in the sentence. For example, the tagger should output something along the lines of The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN.
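To make the slash-delimited tag format concrete, here is a minimal sketch (in Python, for illustration; the demonstration system itself is Java-based, and this is not its actual code) of parsing the tagger's output into word/tag pairs and applying the kind of manual correction described below, such as retagging jumps from a plural noun (NNS) to a singular verb (VBZ):

```python
# Illustrative sketch, not the demo system's code: parse the Stanford
# tagger's "word/TAG" output and apply a user's manual retagging.

def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) tuples."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

def retag(pairs, word, new_tag):
    """Apply a user's correction to every instance of a word."""
    return [(w, new_tag if w == word else t) for w, t in pairs]

# Suppose the tagger mistakenly tagged "jumps" as a plural noun (NNS):
tagged = parse_tagged(
    "The/DT quick/JJ brown/JJ fox/NN jumps/NNS over/IN the/DT lazy/JJ dog/NN")
tagged = retag(tagged, "jumps", "VBZ")  # user corrects it to a verb
```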
After the tagger finishes, the tagged text will appear and the user may change any parts of speech that were interpreted incorrectly (e.g., changing jumps from a plural noun to a singular verb). We give the user this control, rather than letting the computer have total control, because the author of the text likely knows better than the computer how each word is being used. Checking every word is not recommended (that would be very time-consuming for a large document), but this option lets the user verify that a particular word intended for the index is tagged properly.

After the user confirms the text, the program moves to the Frequency section, where it determines which words the user might want in the index. The user inputs frequency thresholds within which words should appear in the document (for example, between 15 and 25 occurrences); the program then finds and displays all nouns that meet this requirement. The interface is divided into three sections. The left half displays the text, which is helpful if the user wishes to double-check a word's occurrences. The right half is split into a top and a bottom: the top lists all words that fall within the thresholds, while the bottom lists all words that do not. These lists allow the user to add or remove any words they feel belong in the index, even words outside the thresholds. Buttons on the interface include Add Terms, Remove Terms, and Combine Terms. The combine-terms feature is useful if, for instance, the user wishes to use the word utensil for the words spoon, fork, and knife; in that case, utensil will appear in the index but will point to important occurrences of those three words. During this step the program also builds an array of the words in the text together with their frequencies, which improves the results of the later sections.

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, generates a value for each term in each paragraph reflecting how important that term is: its frequency within the paragraph compared with its frequency within the entire document. The mean is then taken across all paragraphs, giving each individual word a unique value. Finally, a covariance matrix is created, whose entries measure how two terms vary together. Although this step requires little human interaction, the program still outputs the values for the first ten words so the user can confirm that everything is working to their standards, and the original text again appears in the left half of the interface for consistency.

EVD (Eigenvalue Decomposition) makes use of the JAMA (Java Matrix) package to evaluate matrix decompositions on the covariance matrix created in the previous step. Through these decompositions, the program orders the dimensions for every word from most important to least. Once this is complete, the original text appears in the left half of the interface, and the right side shows a graph of the meaningful words, with each word's coordinates determined by the first two (most significant) dimensions. Theoretically, words with similar meanings, such as dog, cat, and rabbit, should appear close together.
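The TF-IDF, covariance, and EVD steps can be sketched as follows. This is an illustrative Python/NumPy version, not the system's own Java/JAMA code; the term counts are toy numbers, and projecting each term onto the top two eigenvectors (scaled by the square roots of their eigenvalues) is one conventional way to obtain 2-D plotting coordinates, assumed here rather than taken from the paper:

```python
import numpy as np

# tf[i][j] = raw count of term j in paragraph i (toy numbers, not real data)
tf = np.array([[3.0, 0.0, 1.0],
               [0.0, 4.0, 1.0],
               [2.0, 1.0, 0.0]])
n_paragraphs, n_terms = tf.shape

# Inverse document frequency: log(N / number of paragraphs containing the term)
df = (tf > 0).sum(axis=0)
idf = np.log(n_paragraphs / df)
tfidf = tf * idf                      # one TF-IDF value per term per paragraph

# Covariance of terms across paragraphs: how two terms vary together
cov = np.cov(tfidf, rowvar=False)     # n_terms x n_terms, symmetric

# Eigenvalue decomposition; eigh returns eigenvalues in ascending order,
# so the last two columns of the eigenvector matrix are the two most
# significant dimensions.
eigvals, eigvecs = np.linalg.eigh(cov)

# One common choice of 2-D coordinates for plotting each term:
# its loadings on the top two eigenvectors.
coords = eigvecs[:, -2:] * np.sqrt(np.maximum(eigvals[-2:], 0.0))
# coords[j] is the (x, y) position of term j on the graph
```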
In HCA (Hierarchical Cluster Analysis), the system clusters related terms together and gives the user an opportunity to refine these groupings by making the associations more tightly or more loosely clustered. HCA partitions terms into subsets with similar properties or characteristics: members of an optimally clustered group share maximal characteristics with one another and minimal characteristics with terms in any other group, so terms that are semantically similar cluster together. A general category heading can then be assigned to each cluster. Therefore, if spoons, forks, and knives were not manually grouped together in the Frequency section as mentioned above, they could still be grouped here under the category heading utensils. Again, the original text appears on the left half of the interface. On the right half, all instances of the indexed terms appear underneath their category heading, with their corresponding x and y values as well as the sum of squares.

In WSD (Word Sense Disambiguation), the system checks for terms that are spelled the same but have different senses/meanings, and gives the user an opportunity to label them so they can be distinguished. Within a certain distance threshold, we can group together instances whose surrounding text shares one average context and split apart instances whose surrounding text has another. On the next cycle of HCA, these instances are spelled/tagged differently, so they may end up in different clusters. The user can see these differences visually, along with the contexts in which they appear: each data point on the graph can be selected, showing how that particular instance is used in the document. This section in particular allows for great user interaction, letting the user decide which word instances should be placed in the index.
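The clustering stage can be illustrated with a small sketch. This is a naive single-linkage agglomerative clustering written in Python for clarity, not the system's actual HCA implementation; the word coordinates are hypothetical stand-ins for the 2-D positions produced by the EVD step:

```python
import numpy as np

# Hypothetical 2-D coordinates for words, as produced by the EVD step
words = ["dog", "cat", "rabbit", "spoon", "fork", "knife"]
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])

def hca(points, k):
    """Naive single-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]   # merge the closest pair of clusters
        del clusters[b]
    return clusters

groups = hca(points, 2)
# Each group can then be assigned a category heading, e.g. "utensils"
named = [sorted(words[i] for i in c) for c in groups]
```

Semantically similar words end up in the same cluster (the animals together, the utensils together), which is exactly the behavior the category-heading step relies on.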
References:

"JAMA: Java Matrix Package." Mathematics, Statistics and Computational Science at NIST. Web. 31 Aug. 2010. <http://math.nist.gov/javanumerics/jama>.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.

Lukon, S., Juola, P. (2006). A Context-Sensitive Machine-Aided Index Generator. Proceedings of the Joint Annual Conference of the Association for Computing and the Humanities and the Association for Literary and Linguistic Computing (ACH/ALLC 2006). University of Paris-Sorbonne. July 5, 2006: pp. 327-328.
