
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org, Email: editor@ijettcs.org, editorijettcs@gmail.com, Volume 2, Issue 1, January-February 2013, ISSN 2278-6856

AUTOMATIC TEXT CLASSIFICATION AND FOCUSED CRAWLING


Tamanna Verma
Banasthali University, Rajasthan, India.

Abstract: Automatic classification of text documents has become an important research issue nowadays. We develop an automatic text categorization approach to text retrieval. Automatic text classification basically categorizes the text into bi-words using an N-gram approach; once the text is categorized, the main objective is to crawl documents in the particular area. Our experiments clearly indicate that automatic categorization improves retrieval performance. Maintaining the currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web, so focused crawling is used to search for content in specific documents. A document is also retrieved when the queried bi-word occurs many times in it, within the particular area of interest. Today there is a need for fast retrieval, and with the focused crawler the speed is roughly doubled, because the crawler is based on bi-words and searches only in the particular area.

1. INTRODUCTION
There is a large amount of text available online, which needs to be organized systematically for proper utilization. Classification is done for ease of storage, searching and retrieval of relevant documents [9]. Automatic text classification is attractive because it relieves organizations of the need to organize document bases manually, which is not only expensive and time consuming but also error prone [1]. The classification basically depends on the features extracted from the text document. The interaction of all the components in our categorization approach is illustrated in Fig. 1. For a given domain, we first invoke the parameter selection process to determine the appropriate parameter values; this step is carried out only once, off-line, at the beginning. After this step, we can determine the categories for a new document via the category extraction process, which can be done efficiently on-line. The category learning model will be presented first. Focused crawling is a technique that crawls information in a limited area. In this paper the aim is to crawl text documents on the basis of a bi-gram approach: if the user queries "co-education", for example, then the retrieved documents must contain both words contiguously, i.e. "co-education". Search engines typically consist of a crawler, which traverses the web retrieving documents, and a search front-end, which provides the user interface to the acquired information; search engines are also based on link analysis and term frequency. A focused crawler is used here because the search is restricted to a particular area: it crawls data on the basis of the co-occurrence of the query words in the document.

Fig. 1: Interaction of the components in the categorization approach
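The bi-gram retrieval criterion mentioned above (for the query "co-education", both words must occur contiguously in the document) can be illustrated with a small hypothetical check; the tokenization details are assumptions, not the paper's implementation:

```python
# Return True only if the two query words occur next to each other in the document.
def contains_bigram(document: str, w1: str, w2: str) -> bool:
    tokens = document.lower().replace("-", " ").split()
    return any(a == w1 and b == w2 for a, b in zip(tokens, tokens[1:]))

print(contains_bigram("Co-education improves outcomes", "co", "education"))     # True
print(contains_bigram("education of women and co workers", "co", "education"))  # False
```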

2. RELATED WORK
[2] gives the idea that the categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. [3] gives the idea that, with the explosion of information fuelled by the growth of the World Wide Web, it is no longer feasible for a human observer to understand all the incoming data or even classify it into categories; with this growth of information and the simultaneous growth of available computing power, automatic classification of data, particularly textual data, gains increasingly high importance. [4] gives the idea that automatic text classification is a semi-supervised machine learning task that automatically assigns a given document to a set of pre-defined categories based on its textual content and extracted features. [5] gives the idea that a focused crawler is a web crawler that traverses the web to explore only information related to a particular topic of interest; such crawlers can be categorized as focused crawling based on content analysis, focused crawling based on link analysis, and focused crawling based on both content and link analysis. [6] gives the idea that automatic text classification is based on machine learning techniques, viz. supervised, unsupervised and semi-supervised, presents a review of various text classification approaches under the machine learning paradigm, and provides insights into the relationships among these techniques as well as the future research trend in this domain.

[7] gives the idea of enhancing a dispersed and heterogeneous industrial digital ecosystem for e-Learning; its target is to discover and classify industrial information automatically using focused crawlers.

3. AUTOMATIC TEXT CLASSIFICATION TECHNIQUES

3.1 K-Nearest Neighbor Classifier
This is a well-known pattern recognition algorithm. Given a test document, the kNN algorithm finds the k nearest neighbors among the training documents and uses the categories of those neighbors to weight the category candidates [10]. The similarity score of each neighbor document to the test document is used as the weight of the categories of that neighbor document. The algorithm is based on the assumption that the characteristics of members of the same class should be similar, so observations located close together in covariate space are members of the same class. It is suitable for data streams and does not build a classifier in advance.
Merits: This method is effective, simple, non-parametric and easy to implement.
Demerits: Its major drawback is that it becomes slow as the training set grows, and the presence of irrelevant features severely degrades its accuracy.

Example
Fig. 2: A two-class kNN example; the 0/1 output values are color-coded green and red.
Based on this training data, the object is to find a predictor for new data, designated the test data. One method is to take the k nearest neighbors of the new inputs and predict the new output as the most frequent outcome, 0 or 1, among these neighbors; by taking k odd we avoid ties. This is the kNN classifier, and the idea is easily generalized to more than two output classes and more than two inputs. Based only on this training data, the best possible choice of k can then be determined.
The kNN classifier is one of the most robust and useful classifiers and is often used to provide a benchmark for more complex classifiers such as artificial neural networks and support vector machines.
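As a concrete illustration of the kNN text classifier described above, the following minimal sketch uses scikit-learn with a small hypothetical corpus (the paper does not name an implementation, so the library choice, the documents and the categories are all assumptions):

```python
# Minimal kNN text-classification sketch (hypothetical toy corpus; assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_docs = [
    "stock markets fell sharply today",
    "the central bank raised interest rates",
    "the team won the championship match",
    "the striker scored two goals",
]
train_labels = ["finance", "finance", "sports", "sports"]

# k is kept odd to avoid ties, as noted above; weights="distance" lets
# closer (more similar) neighbor documents carry more weight.
knn = make_pipeline(TfidfVectorizer(),
                    KNeighborsClassifier(n_neighbors=3, weights="distance"))
knn.fit(train_docs, train_labels)

print(knn.predict(["interest rates and the stock market"]))  # likely: ['finance']
```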

3.2 Naïve Bayes Method
The Bayesian method that makes an independence assumption is termed the Naïve Bayes classifier [10]. It predicts by reading a set of examples in attribute-value representation and then using Bayes' theorem to estimate the posterior probabilities of all classifications. The independence assumption makes the order of the features irrelevant, and the presence of one feature does not affect the other features in the classification task [12].
Merits: This method requires only a small amount of training data to estimate the parameters necessary for classification. Classifiers based on this algorithm exhibit high accuracy and speed when applied to large databases.
Demerits: This method works well only if the assumed features are independent; when dependencies arise it gives low performance.
For a document d and a class c, posterior = prior × likelihood / evidence:
P(c|d) = P(d|c) P(c) / P(d)
The probability of an event before we consider additional knowledge is called the prior probability, while the new probability that results from the new knowledge is the posterior probability. P(d|c) is the likelihood, the probability of the document occurring in class c, and P(c) is the prior probability of class c; if the document does not provide clear evidence for one class, the higher prior probability is used. Our goal is to find the best class for the document. To avoid zero probabilities, a small bias value is added to the estimates, which is often called smoothing.
Example: let H = having a headache and F = coming down with flu, with P(H) = 1/10, P(F) = 1/40 and P(H|F) = 1/2. Headaches are rare and flu is rarer, but if you are coming down with flu there is a one-in-two chance of a headache; by Bayes' theorem, P(F|H) = P(H|F) P(F) / P(H) = (1/2 × 1/40) / (1/10) = 1/8.
A Bayesian classifier assigns a new instance d, described by a tuple of attribute values <x1, x2, ..., xn>, to one of the classes cj in C:
cMAP = argmax over cj in C of P(cj | x1, x2, ..., xn) = argmax over cj in C of P(x1, x2, ..., xn | cj) P(cj) / P(x1, x2, ..., xn)
P(cj) can be estimated from the frequency of the classes in the training examples, but P(x1, x2, ..., xn | cj) has O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples were available.
Naïve Bayes conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj). Features detect term presence and are independent of each other given the class. Here <x1, x2, ..., xnd> are the tokens of document d that are part of the vocabulary we use for classification, and nd is the number of tokens in d.

Fig. 3: The Naïve Bayes classifier

Example: consider an instance x with attribute values (medium, red, circle) and two classes, positive and negative.
p(positive | x) = p(positive) · p(medium | positive) · p(red | positive) · p(circle | positive) / p(x) = 0.0405 / p(x)
p(negative | x) = p(negative) · p(medium | negative) · p(red | negative) · p(circle | negative) / p(x) = 0.5 × 0.2 × 0.3 × 0.3 / p(x) = 0.009 / p(x)
Since the two posteriors must sum to one, p(x) = 0.0405 + 0.009 = 0.0495, so
p(positive | x) = 0.0405 / 0.0495 = 0.8181 and p(negative | x) = 0.009 / 0.0495 = 0.1818.
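The arithmetic of this example can be checked with a short script. The negative-class values (0.5, 0.2, 0.3, 0.3) appear in the text; the positive-class prior and conditionals below are assumed values chosen only to reproduce the printed product 0.0405:

```python
# Naïve Bayes posteriors for the (medium, red, circle) example.
from math import prod

priors = {"positive": 0.5, "negative": 0.5}   # positive prior assumed; negative value from the text
cond = {
    "positive": {"medium": 0.1, "red": 0.9, "circle": 0.9},  # assumed, consistent with 0.0405
    "negative": {"medium": 0.2, "red": 0.3, "circle": 0.3},  # from the text
}
x = ["medium", "red", "circle"]

# Unnormalized score per class: prior * product of P(attribute | class).
scores = {c: priors[c] * prod(cond[c][a] for a in x) for c in cond}
p_x = sum(scores.values())                    # evidence P(x) = 0.0405 + 0.009 = 0.0495
posteriors = {c: s / p_x for c, s in scores.items()}
print(posteriors)                             # {'positive': 0.8181..., 'negative': 0.1818...}
```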

3.3 Decision Trees
The decision tree categorizes the training documents by constructing well-defined true/false queries in the form of a tree structure, in which the leaves represent the corresponding categories of the text documents and the branches represent conjunctions of features that lead to those categories [12].
Merits: This method works on data of any type and is fast even in the presence of a large number of attributes.
Demerits: The major risk in implementing a decision tree is that it overfits the training data when an alternative tree exists.
The construction of a tree involves the following three elements: 1. the selection of the splits; 2. the decision when to declare a node terminal or to continue splitting it; 3. the assignment of each terminal node to a class.
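A rough sketch of how such a tree of true/false term queries could be fit to text is given below; the corpus, labels and use of scikit-learn are assumptions for illustration, not the paper's setup:

```python
# Decision-tree text classification sketch (hypothetical corpus; assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

docs = ["cheap loans apply now", "meeting agenda attached",
        "win a free prize now", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

# Each internal node tests term counts (a true/false query); max_depth is
# one simple guard against the overfitting risk noted above.
tree = make_pipeline(CountVectorizer(),
                     DecisionTreeClassifier(max_depth=3, random_state=0))
tree.fit(docs, labels)
print(tree.predict(["free prize inside"]))
```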

3.4 Decision Rules Classification
This method uses rule-based inference to classify documents into their annotated categories [10]. These classifiers are useful for analyzing non-standard data. The method constructs a rule set that describes the profile of each category. Rules are of the form "If condition Then conclusion", where the condition part is filled by features of the category and the conclusion part is the category's name or another rule to be tested.
Fig. 5
Merits: This method is capable of performing semantic analysis.
Demerits: Its major drawback is that human experts must be involved to construct or update the rule set.
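A toy illustration of the "If condition Then conclusion" form follows; the rules, terms and categories are purely hypothetical, since the paper defines no concrete rule set:

```python
# Tiny rule-based classifier: each rule pairs a condition on term presence
# with a conclusion (a category name), and rules are tested in order.
rules = [
    (lambda terms: "goal" in terms or "match" in terms, "sports"),
    (lambda terms: "stock" in terms and "market" in terms, "finance"),
]

def classify(document: str, default: str = "unknown") -> str:
    terms = set(document.lower().split())
    for condition, conclusion in rules:
        if condition(terms):
            return conclusion
    return default          # no rule fired

print(classify("the stock market rallied today"))  # -> 'finance'
```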

3.5 Support Vector Machines
This is a statistically based learning algorithm [10]. It addresses the general problem of learning to discriminate between positive and negative members of a given class of n-dimensional vectors, and it is based on the Structural Risk Minimization principle from computational learning theory. The SVM needs both a positive and a negative training set, which is uncommon for other classification methods. The performance of SVM classification remains unchanged even if documents that do not belong to the support vectors are removed from the training data; this is one of its major advantages.
Merits: Among existing supervised learning algorithms for text categorization, SVM has been recognized as one of the most effective text classification methods [11][13], since it is able to manage large feature spaces and has high generalization ability.
Demerits: This also makes the SVM algorithm relatively complex, which in turn demands high time and memory consumption during both the training stage and the classification stage.

Fig. 6: Support Vector Machines classifier
One separator is said to be better than another if it generalizes better, i.e. shows better performance on documents outside the training set. It turns out that the generalization quality of the separating plane is related to the distance between the plane and the data points that lie on the boundary of the two data classes. These data points are called "support vectors", and the SVM algorithm determines the plane that is as far from all support vectors as possible. In other words, SVM finds the separator with the maximum margin and is therefore often called a "maximum margin classifier".
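A minimal sketch of a linear SVM text classifier is shown below; scikit-learn's LinearSVC and the two-class toy corpus are assumptions made for illustration:

```python
# Maximum-margin (linear SVM) text classification sketch (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["the parliament passed the new bill",
        "elections will be held next spring",
        "the new phone has a faster processor",
        "the laptop ships with more memory"]
labels = ["politics", "politics", "tech", "tech"]

# LinearSVC learns the separating hyperplane with the largest margin;
# only the boundary documents (support vectors) determine that plane.
svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(docs, labels)
print(svm.predict(["a faster processor and more memory"]))  # likely: ['tech']
```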

3.6 N-Gram Approach
An n-gram of size 1 is referred to as a "unigram", size 2 is a "bigram" (or, less commonly, a "digram") and size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g. "four-gram", "five-gram", and so on. The main motivation behind this approach is that similar words will have a high proportion of N-grams in common. Typical values for n are 2 or 3; these correspond to the use of bigrams or trigrams, respectively. For example, the word "computer" results in the generation of the bigrams
*C, CO, OM, MP, PU, UT, TE, ER, R*
Character N-gram matching for computing a string similarity measure is a widely used technique in information retrieval, stemming, spelling and error correction [22-24], text compression [17], language identification [18-19], and text search and retrieval [20-21]. The N-gram-based similarity between two strings is measured by Dice's coefficient. To measure the similarity between the words "computer" and "computation", we first find all the bi-grams of the word "computation":
*C, CO, OM, MP, PU, UT, TA, AT, TI, IO, ON, N*
The number of unique bi-grams in the word "computer" is 9 and in the word "computation" is 12, and there are 6 bi-grams common to both words. Similarity measured by Dice's coefficient is calculated as 2C/(A+B), where A and B are the numbers of unique bigrams in the pair of words and C is the number of bigrams common to the pair. For statistical stemming, terms are clustered using the single-link clustering method along with the above similarity measure. For spelling correction, tri-gram matching gives significant results [21]. Some IR systems [23] use character N-grams rather than words as index terms for retrieval, and such a system works unmodified for documents in English, French, Spanish and Chinese. In this work, however, the 2-gram approach is used: the whole document is divided into pairs, because a 2-gram consists of a pair. 2-grams can also be used for efficient approximate matching: by converting a sequence of items into a set of 2-grams, it can be embedded in a vector space, allowing the sequence to be compared with other sequences in an efficient manner. For example, if we convert strings containing only letters of the English alphabet into 3-grams, we get a 26^3-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation we lose information about the string; for example, both the strings "abc" and "bca" give rise to exactly the same 2-gram "bc" (although {"ab", "bc"} is clearly not the same as {"bc", "ca"}). However, we know empirically that if two strings of real text have a similar vector representation then they are likely to be similar.
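The bigram extraction and Dice similarity described above are easy to check in code. The short sketch below pads each word with "*" as in the example; the implementation details are assumptions, but the counts match those given in the text:

```python
# Character-bigram similarity via Dice's coefficient, 2*C / (A + B).
def char_bigrams(word: str) -> set:
    padded = "*" + word.lower() + "*"                 # '*' marks the word boundaries
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def dice(w1: str, w2: str) -> float:
    a, b = char_bigrams(w1), char_bigrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(len(char_bigrams("computer")))                  # 9 unique bigrams
print(len(char_bigrams("computation")))               # 12 unique bigrams
print(round(dice("computer", "computation"), 3))      # 2*6 / (9 + 12) = 0.571
```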

4. CONCLUSION
Automatic text classification combined with focused crawling limits the set of fetched documents: crawling time is saved because the documents are filtered and the remaining documents are fetched easily. We basically use an n-gram approach, which helps to classify the documents into phrases, and the documents are fetched on that basis. First we stem the documents and remove stop words, then we classify the documents using the 2-gram approach, and then we crawl the documents.
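The paper gives no implementation, but the pipeline summarized above (pre-processing, bi-gram matching of the query against each document, and crawling outward only from matching pages) might look roughly like the following hypothetical sketch; the stop-word list, helper names and crawl policy are all assumptions:

```python
# Hypothetical focused-crawl sketch: pre-process, match the query bi-words,
# and only follow links from documents that contain them.
from collections import deque

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def preprocess(text):
    return [t for t in text.lower().replace("-", " ").split() if t not in STOP_WORDS]

def word_bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def focused_crawl(seed_urls, fetch, extract_links, query, limit=100):
    # fetch(url) -> page text and extract_links(text) -> urls are supplied by the caller.
    query_bigrams = word_bigrams(preprocess(query))
    frontier, seen, relevant = deque(seed_urls), set(seed_urls), []
    while frontier and len(seen) <= limit:
        url = frontier.popleft()
        text = fetch(url)
        if query_bigrams & word_bigrams(preprocess(text)):   # the bi-words occur together
            relevant.append(url)
            for link in extract_links(text):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
    return relevant
```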

REFERENCES
[1] Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.
[2] Wai Lam, Miguel Ruiz, and Padmini Srinivasan, IEEE, Nov./Dec. 1999.
[3] Aigars Mahinovs and Ashutosh Tiwari, April 2007.
[4] Mita K. Dalal, Sarvajanik College of Engineering & Technology, Surat, India, and Mukesh A. Zaveri, Sardar Vallabhbhai National Institute of Technology, Surat, India, Aug. 2011.
[5] S. Samarawickrama and L. Jayaratne, 26-28 Sept. 2011.
[6] Shweta C. Dharmadhikari, Maya Ingle, and Parag Kulkarni, Pune Institute of Computer Technology / EkLat Solutions, Pune, Maharashtra, India, Nov. 2011.
[7] H. Dong et al., 2011.
[8] Marcelo Mendoza, "A new term-weighting scheme for naïve Bayes text categorization", International Journal of Web Information Systems, Vol. 8, Iss. 1, pp. 55-72, 2012.
[9] L. Tang, S. Rajan, and V. K. Narayanan, "Large Scale Multi-Label Classification via MetaLabeler", in Proceedings of Data Mining and Learning, 2009.
[10] A. Khan, B. Baharudin, and Lam Hong Lee, "A Review of Machine Learning Algorithms for Text-Documents Classification", Journal of Advances in Information Technology, Vol. 1, No. 1, Feb. 2010.
[11] Z. Wang, X. Sun, and D. Zhang, "An optimal text categorization algorithm based on SVM".
[12] Arzucan Ozgur, "Supervised and unsupervised machine learning techniques for text document categorization", thesis, Department of Computer Science, Bogazici University, 2004.
[13] Mita K. Dalal and Mukesh A. Zaveri, "Automatic Text Classification: A Technical Review", International Journal of Computer Applications (0975-8887), Vol. 28, No. 2, August 2011.
[14] Soumen Chakrabarti, Martin van den Berg, and Byron Dom, "Focused crawling: a new approach to topic-specific Web resource discovery", Computer Science and Engineering, Indian Institute of Technology, Bombay, India / FX Palo Alto Laboratory, Palo Alto, CA, USA.
[15] Konstantin Mertsalov and Michael McCreary, "Document Classification with Support Vector Machines", Rational Retention, LLC, January 2009.
[16] R. Angell, G. Freund, and P. Willett, "Automatic spelling correction using a trigram similarity measure", Information Processing & Management, 19(4), pp. 305-316, 1983.
[17] E. J. Yannakoudakis, P. Goyal, and J. A. Huggill, "The Generation and Use of Text Fragments for Data Compression", Inf. Proc. Mgt. 18, 15, 1982.
[18] J. C. Schmitt, "Trigram-based Method of Language Identification", U.S. Patent No. 5,062,143, 1990.
[19] W. B. Cavnar and J. M. Trenkle, "N-gram-based Text Categorization", Proceedings of the Symposium on Document Analysis and Information Retrieval, University of Nevada, Las Vegas, 1994.
[20] P. Willett, "Document Retrieval Experiments Using Indexing Vocabularies of Varying Size. II. Hashing, Truncation, Digram and Trigram Encoding of Index Terms", J. Doc. 35, 296, 1979.
[21] W. B. Cavnar, "N-gram-based Text Filtering for TREC-2", The Second Text Retrieval Conference (TREC-2), NIST Special Publication 500-215, National Institute of Standards and Technology, Gaithersburg, Maryland, 1994.
[22] C. Y. Suen, "N-gram Statistics for Natural Language Understanding and Text Processing", IEEE Trans. on Pattern Analysis & Machine Intelligence, PAMI-1(2), pp. 164-172, April 1979.
[23] Ethan Miller, Dan Shen, Junli Liu, and Charles Nicholas, "Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System", Journal of Digital Information, Vol. 1, Issue 5.
[24] R. C. Angell, G. E. Freund, and P. Willett, "Automatic Spelling Correction Using Trigram Similarity Measure", Inf. Proc. Mgt. 18, 255, 1983.

