Journal of Computer Applications (JCA) ISSN: 0974-1925, Volume VI, Issue 3, 2013

Measuring Vocabulary Consistency

S.Manikandan a,*, P.Vijay Anand b,1, R.Prabhu c,1, D.Suresh Babu d,1

Abstract - Measuring Vocabulary Consistency aims at evaluating a paper that is to be published for consistency of vocabulary. It is an application that helps authors document their own papers. The application works at the corpus level: a pre-generated corpus contains the words of a particular domain. The input paper is traversed and the words related to the corpus domain are collected into what we call the Paper Dictionary. The corpus holds each word together with its synonyms, graded as high, medium or low. The words from the Paper Dictionary are graded by comparison with the corpus, and the application suggests appropriate words that fit the corresponding grade of vocabulary. The final output is the original document along with the suggested vocabulary of equivalent grade for maintaining the consistency of the paper.

Index Terms - Corpus, vocabulary.

Figure 1. Measuring Vocabulary Consistency

I. INTRODUCTION

Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.
Manuscript received 02/September/2013.
S.Manikandan, Assistant Professor, Department of Information Technology, Vel Tech Multi Tech Dr.Rangarajan Dr.Sakunthala Engineering College. (E-mail:
P.Vijay Anand, Assistant Professor, Department of Information Technology, Vel Tech Multi Tech Dr.Rangarajan Dr.Sakunthala Engineering College. (E-mail:
R.Prabhu, Assistant Professor, Department of Information Technology, Vel Tech Multi Tech Dr.Rangarajan Dr.Sakunthala Engineering College. (E-mail:
D.Suresh Babu, Assistant Professor, Department of Information Technology, Vel Tech Multi Tech Dr.Rangarajan Dr.Sakunthala Engineering College. (E-mail:

Measuring Vocabulary Consistency measures the consistency of the words used in a paper that is being prepared for acceptance. The input paper is traversed in full, and its words are collected and stored in a database. These words are compared with the corpus dictionary, which contains the pre-graded vocabulary, and the words from the paper are graded by this comparison.

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other forms of knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, authorized terms that have been preselected by the designer of the vocabulary, in contrast to natural language vocabularies, where there is no restriction on the vocabulary.

When indexing a document, the indexer also has to choose the level of indexing exhaustivity, that is, the level of detail in which the document is described. For example, with low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general, the higher the indexing exhaustivity, the more terms are indexed for each document. In recent years free text search as a means of access to documents has become popular. This involves using natural language indexing with indexing exhaustivity set to maximum (every word in the text is indexed). Many studies have been done to compare the efficiency and effectiveness of free text searches against documents that have been indexed by experts using a few well-chosen controlled vocabulary descriptors.

II. RELATED WORK

2.1 A Study of Indexing Consistency

[1] [6] [7] The article compares the indexing consistency between Library of Congress (LC) and British Library (BL) catalogers with regard to their use of the Library of Congress Subject Headings (LCSH). Eighty-two titles, published in 1987 in the field of Library and Information Science (LIS), were identified for comparison, and for each title the LC subject headings assigned by both LC and BL catalogers were compared. By applying Hooper's "consistency of a pair" equation, the average indexing consistency value was found for the 82 titles. The average indexing consistency value between LC and BL catalogers is 16 percent for exact matches, and 36 percent for partial matches. The major findings of the study are discussed, and the Appendix provides examples of the LCSH assigned by both LC and BL catalogers to the same document, along with their consistency values. Indexing consistency in a group of indexers is defined as "the degree of agreement in the representation of the essential information content of the document by certain sets of indexing terms selected individually and independently by each of the indexers in the group". It should be borne in mind that the sample used in this study is small. Moreover, since two different subject indexing tools (i.e., LCSH and PRECIS) are used in LC and BL, it may not be very meaningful, if at all, to compare the two groups of catalogers. Yet the indexing consistency value found in this study is similar to those reported in other consistency studies. In conclusion, the indexing consistency value between LC and BL catalogers for the books in the field of LIS is 16 percent for exact matches and 36 percent for both exact and partial matches, which is quite low.
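The consistency-of-a-pair value used in the study above can be illustrated with a short sketch. It assumes the commonly cited form of Hooper's measure, A / (A + M + N), where A is the number of terms both indexers assigned and M and N are the numbers of terms assigned by only one of them. Only exact matches (after lowercasing) are counted; partial matching is not modelled. The function name and the sample headings are hypothetical.

```python
def hooper_consistency(terms_a, terms_b):
    """Consistency of a pair of indexers: A / (A + M + N).

    A = terms assigned by both indexers,
    M = terms assigned only by the first, N = only by the second.
    Assumption: terms match only when equal after trimming and lowercasing.
    """
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    union = a | b  # A + M + N distinct terms in total
    if not union:
        return 0.0  # neither indexer assigned anything
    return len(a & b) / len(union)


# Hypothetical subject headings assigned to the same title by two catalogers.
lc_terms = ["Indexing", "Cataloging"]
bl_terms = ["indexing", "Subject headings"]
pair_value = hooper_consistency(lc_terms, bl_terms)  # one shared term out of three
```

Averaging this value over all titles in a sample gives a figure comparable in kind to the percentages reported above.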

2.2 Measuring Inter-Indexer Consistency

[2] When professional indexers independently assign terms to a given document, the term sets generally differ between indexers. Studies of inter-indexer consistency measure the percentage of matching index terms, but none of them consider the semantic relationships that exist amongst these terms. The authors propose to represent the data of multiple indexers in a vector space and to use the cosine metric as a new consistency measure that can be extended by semantic relations between index terms. They argue that this new measure is more accurate and realistic than existing ones and therefore more suitable for evaluating automatically extracted index terms. Indexing consistency has been defined as the degree of agreement in the representation of the (essential) information content of a document by certain sets of indexing terms selected individually and independently by each of the indexers. [8] Several different measures have been proposed, and many studies of inter-indexer consistency have been reported. They generally conclude that a high level of consistency is hard to achieve and that indexers are more likely to agree on what concepts should be indexed than on the exact terms that best represent them. Surprisingly, existing consistency measures do not take into account the semantic relations that exist between terms in the indexing vocabulary, which intuitively would seem likely to improve accuracy. [4] Existing measures of indexing consistency are flawed because they ignore semantic relations between the terms that different indexers assign. [5] The cited paper shows how the vector space model that underlies the cosine metric supports an elegant linear generalization of similarity that takes thesaurus relations into account. [9] It introduces coefficients that reflect the importance of the thesaurus relations relative to the term-identity relation, with values chosen to optimize the performance of a set of professional human indexers. Alternatively, for request-oriented indexing, where a document's retrievability is more important than the consistency of its representation, the weights could be derived from searchers' relevance judgments. [10] The authors plan to use this measure to assess the quality of automatically produced key phrases and to compare them with ones extracted by human indexers. Analysis of the conceptual relations between the phrases, instead of simple matching of their stems, will provide a sounder basis for judging the usability of automatic extraction in real-world applications.

2.3 Measuring vocabulary levels of English Textbooks

[3] The purpose of this study was to create a means for comparing the vocabulary levels of Japanese junior and senior high school (JSH) texts, Japanese college qualification tests, English proficiency tests, and EGP, ESP and semi-ESP college textbooks in order to determine what the vocabulary levels are, and what additional vocabulary is required for students to understand 95% of these materials. This was done by creating a lemmatized and ranked high frequency word list (BNC HFWL) from the British National Corpus. This study found that although most college students should be prepared to take the TOEIC, and high school students should be able to pass both the Daigaku Center Nyushi and Eiken 2nd grade tests, most college entrance exams contain vocabulary that is significantly above the level of high school graduates. Specialized vocabulary lists can be helpful in bridging vocabulary gaps between JSH and ESP, and between JSH and the TOEFL. Since learners depend on vocabulary as their first resource (Huckin and Bloch 1993), a rich vocabulary makes the skills of listening, speaking, reading, and writing easier to perform (Nation 1994: viii). Therefore, there has been continuing interest in whether there is a language knowledge threshold which marks the boundary between having and not having sufficient language knowledge for successful language use (Nation 2001: 144). Historically, experienced teachers such as West (1926) considered one unknown word in every fifty words to be the minimum threshold necessary for the adequate comprehension of a text. Others such as Hatori (1979) and Johns (as cited in Bensoussan and Laufer 1984) Hu and Nation (2000) concluded that for largely unassisted reading for pleasure, learners would need to know around 98% of the running words in the text; however, the current thinking in the field of vocabulary teaching and learning puts the threshold of meaningful input at 95% (Schmitt and McCarthy 1997, Tono et al. 1997, Read 2000, Nation 2001, and Hayashi 2002). How, then, is this goal attained in the classroom? Nation (2001) assures us that "[i]f more than five percent of the running words are unknown, then it is likely that there is no longer meaning-focused learning because so much attention has to be given to language features" (pp. 388-389). Clearly, it is first necessary to examine what vocabulary exists in learners' textbooks, and to determine whether the learners are able to meet the 95% comprehension criterion. If not, educators must then provide the supplemental vocabulary to bridge this gap. Without this kind of bridge, learners would face a daunting amount of
dictionary work. Variables included were demographic data, views of health promotion, health promotion activities at the school, barriers and opportunities to implement health promotion activities.

III. SYSTEM AND ADVERSARY MODEL

3.1 EXISTING SYSTEM

The existing system has a committee of members who go through the paper and check it for consistency. The paper that is to be published may be written by a group of people involved in the research, and each author may submit his or her own portion of the work. When the final paper is prepared by combining these portions, it may not be consistent at the vocabulary level. When the committee members go through this paper, they sort out the inconsistencies and re-circulate the paper for correction. This takes more time, and an additional evaluator may be needed for documenting the paper.

Figure 2. Existing system model

3.2 PROPOSED SYSTEM

In our proposed system, the application traverses the paper and collects each word. The collected words are stored in a database for comparison. The stored words are then fetched and compared with the manually maintained dictionary. When a word matches an entry in the high, medium or low table, it is graded accordingly. This is done for every word, so the entire paper is graded section by section. Finally, the average of the counts is taken, and words of the other grades are brought to the average by suggesting synonyms of the corresponding grade. The final paper is thus brought to the average grade of all the graded words and is said to be consistent. When certain lemmas are not found, they are given to the author for manual alteration and then go through the consistency check.

Figure 3. Proposed system model

3.2.1 ENTRY MODULE

The Entry module maintains the user profile, counts failed attempts and blocks a user who exceeds 3 failed attempts. When a new user uploads a paper, a randomly generated number is issued to identify that user. The user can upload only .doc, .docx, .pdf and .txt files as input.

3.2.2 TRAVERSE MODULE

The input paper is traversed and the words are collected. The collected word list is compared with the corpus dictionary words. The resulting list holds the words graded as high, medium or low and is stored in a separate database.

Figure 4. Traverse system model
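The Traverse Module's two steps, collecting the words of the paper and grading them against the corpus dictionary, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the corpus is a small hypothetical in-memory dictionary instead of a database, the tokenizer keeps only runs of ASCII letters, and all names (CORPUS_GRADES, traverse, grade_words) are invented for the example.

```python
import re
from collections import Counter

# Hypothetical miniature corpus: word -> grade. The real corpus is
# domain-specific and pre-generated; these entries are illustrative only.
CORPUS_GRADES = {
    "utilize": "high",
    "employ": "medium",
    "use": "low",
    "ascertain": "high",
    "determine": "medium",
    "find": "low",
}


def traverse(text):
    """Collect the words of the paper in order of appearance, lowercased."""
    return re.findall(r"[a-z]+", text.lower())


def grade_words(words, corpus=CORPUS_GRADES):
    """Split words into (word, grade) pairs and words missing from the corpus."""
    graded, ungraded = [], []
    for word in words:
        if word in corpus:
            graded.append((word, corpus[word]))
        else:
            ungraded.append(word)  # left for manual handling by the author
    return graded, ungraded


words = traverse("We utilize a corpus to find terms.")
graded, ungraded = grade_words(words)
grade_counts = Counter(grade for _, grade in graded)  # occurrences per grade
```

The graded list corresponds to what the module would store in its separate database, and the per-grade counts feed the later consistency check.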

3.2.3 CONSISTENCY MODULE

The separated words are checked for consistency of occurrence and for the frequency with which corpus-graded words occur. Their meanings are checked along with their occurrences, and synonyms of each word are suggested along with their definitions. These are given as output to the user, who may replace the necessary words as desired. Finally, the altered paper may again go through the consistency check, resulting in a consistent paper as output.
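One way to read the consistency step is sketched below. The paper does not specify how the average grade is computed, so this sketch assumes a numeric scale (low = 1, medium = 2, high = 3), takes the mean over all graded occurrences, rounds it to the nearest grade, and suggests the synonym of that grade for every word graded differently. The synonym groups, definitions and function names are hypothetical.

```python
from statistics import mean

# Assumed numeric scale for the three grades (not given in the paper).
GRADE_VALUE = {"low": 1, "medium": 2, "high": 3}
VALUE_GRADE = {value: grade for grade, value in GRADE_VALUE.items()}

# Hypothetical synonym groups: one word per grade plus a shared definition.
SYNONYM_GROUPS = [
    {"low": "use", "medium": "employ", "high": "utilize",
     "definition": "put something to a purpose"},
    {"low": "find", "medium": "determine", "high": "ascertain",
     "definition": "establish something by examination"},
]

# Index every corpus word to its grade and its synonym group.
WORD_INDEX = {group[grade]: (grade, group)
              for group in SYNONYM_GROUPS for grade in GRADE_VALUE}


def target_grade(words):
    """Average grade of all graded occurrences, rounded half up."""
    values = [GRADE_VALUE[WORD_INDEX[w][0]] for w in words if w in WORD_INDEX]
    if not values:
        return None  # nothing in the paper matched the corpus
    return VALUE_GRADE[int(mean(values) + 0.5)]


def suggest(words):
    """Return the target grade and (word, replacement, definition) triples."""
    target = target_grade(words)
    suggestions = []
    for word in words:
        if word in WORD_INDEX:
            grade, group = WORD_INDEX[word]
            if grade != target:
                suggestions.append((word, group[target], group["definition"]))
    return target, suggestions


paper_words = ["we", "utilize", "a", "corpus", "to", "find", "terms"]
target, suggestions = suggest(paper_words)
```

Running the check again on a paper in which every suggestion has been accepted yields no further suggestions, which matches the expectation that a corrected paper passes the consistency check.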

Figure 5. Consistency system model

IV. CONCLUSION

Our project is a humble venture to enable authors to document their research articles on their own. Several user-friendly coding practices have also been adopted. The package should prove powerful in satisfying the requirements of the committee members who check the consistency of a paper. A third person is not needed for documenting the paper or for handling diversified vocabulary levels. The application helps the author by finding inconsistent vocabulary terms and suggesting words at the vocabulary level that suits the grade. When the paper is traversed again, the consistency level must be the same. The words are suggested along with their definitions, which may help the author choose the appropriate word for each term.

V. FUTURE ENHANCEMENT

The future enhancement would be concept mapping. At present, when a paper is checked for consistency, the application traverses the paper and reports the consistency level. It also accepts a paper with zero concepts, and still finds and suggests grades. Such a paper is blocked only in later stages, through a database that holds the results of the concept check. This check could instead be performed at this stage, during the consistency check itself. The limitation can be overcome with an ontology, which helps to map concepts and check for worthy ideas. When the concepts are also mapped, our application becomes complete.

REFERENCES

[1] Yasar Tonta, "A study of indexing consistency: consistency between the Library of Congress and the British Library catalogers".
[2] Olena Medelyan, "Measuring Inter-Indexer Consistency Using a Thesaurus".
[3] Hooper, R.S. (1965). Indexer consistency tests - Origin, measurements, results and utilization. IBM, Bethesda.
[4] Mirja Iivonen, "Consistency in the selection of search concepts and search terms", Information Processing and Management: an International Journal, v.31 n.2, p.173-190, March/April 1995.
[5] Markey, K. (1984). Inter-indexer consistency tests. Library and Information Science Research, 6, 155-177.
[6] Rolling, L. (1981). Indexing consistency, quality and efficiency. Information Processing and Management, 17, 69-76.
[7] Zunde, P., & Dexter, M.E. (1969). Indexing consistency and quality. American Documentation, 20, 259 26.
[8] Asadi, R. Schwartz and J. Makhoul, "Automatic Modeling for Adding New Words to a Large-Vocabulary Continuous Speech Recognition System", IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 305-308, 1991.
[9] Jimeng Sun, Huiming Qu, Deepayan Chakrabarti and Christos Faloutsos, "Relevance Search and Anomaly Detection using Bipartite Graphs", Carnegie Mellon Univ., Univ. of Pittsburgh, Yahoo! Research.
[10] Elena Lloret and Manuel Palomar, "Challenging Issues of Automatic Summarization: Relevance Detection and Quality-based Evaluation", Department of Software and Computing Systems, University of Alicante, Spain.

Mr. S. Manikandan received his B.Tech degree in Information Technology in 2006 from Anna University, Chennai, India, and his M.E degree in Systems Engineering and Operation in 2009 from Anna University, Chennai, India. He is currently employed at Vel Tech Multi Tech Engineering College, Chennai. His research interests include Data Mining, Cryptography and Mobile Networks.

Mr. D. Suresh Babu received his B.Tech degree in Information Technology in 2006 from Anna University, Chennai, and his M.E degree in Software Engineering in 2009 from Anna University, Chennai. He is currently employed at Vel Tech Multi Tech Engineering College, Chennai. His research interests include Adhoc Networks, Data Mining, Network Security and Mobile Networks.

Mr. R. Prabu received his B.Tech degree in Information Technology in 2004 from Periyar University, Salem, and his M.Tech degree in Information Technology in 2008 from Sathyabama University, Chennai. He is currently employed at Vel Tech Multi Tech Engineering College, Chennai. His research interests include Adhoc Networks, Data Mining and Mobile Networks.

Mr. P. Vijay Anand received his B.Tech degree in Information Technology in 2005 from Anna University, Chennai, and his M.Tech degree in Information Technology in 2011 from Sathyabama University, Chennai. He is currently employed at Vel Tech Multi Tech Engineering College, Chennai. His research interests include Adhoc Networks, Data Mining, Network Security and Mobile Networks.