
HIERARCHICAL CLASSIFICATION OF TEXT DOCUMENTS IN PUNJABI
LANGUAGE USING ONTOLOGY BASED APPROACH
Kaveesh Nayak1 and Charanjiv Singh1
1Department of Computer Engineering, University College of Engineering, Punjabi University, Patiala

ABSTRACT
Because of the large amount of digital content available online, categorizing the data for
suitable access has become a pressing need. Classifying each document manually is a
troublesome task. To overcome this problem, text mining techniques can be used to create a
computer program that automatically assigns unclassified text documents to appropriate
categories. The approach used here to classify documents is ontology based: an ontology is
created for each category. The technique implemented not only assigns a text document to a
category but also assigns it to its appropriate subcategory.

KEYWORDS
Document Classification, Ontology, Punjabi Text Classification.

1. INTRODUCTION
Although previous work has classified documents in the Punjabi language, a system for
hierarchical classification of Punjabi text has not yet been developed. The system implemented
here treats the text document submitted by the user as unclassified text, which it classifies
into categories using the underlying ontology based approach. One of the categories used in the
system is (Dharam), which contains the following subcategories: (Islam Dharam), (Isai Dharam),
(Sikh Guru), (Sikh Dharam), (Hindu Dharam), (Jain Dharam), (Bodhi Dharam). These categories and
subcategories are constructed with the help of a Punjabi corpus. An ontology is created in the
Punjabi language for each category, and each category consists of related terms. An ontology
based classification system does not require a separate training set, which makes it easier to
implement.

2. PROPOSED METHODOLOGY
Hierarchical classification of documents in the Punjabi language aims to assign a category and
a subcategory to an unknown text document submitted by the user. The application does this
automatically. First, the classifier needs to be trained to classify unknown text documents
correctly. The phases of the classifier are described below:

2.1. Pre-processing phase

An uncategorized text document in the Punjabi language is treated as a bag of words. To
remove unnecessary terms from this bag of words, the document goes through a
pre-processing phase. In addition to unnecessary words (stop words), punctuation marks, special
symbols etc. are also removed. , , , are some of the stop words used in Punjabi
language.
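
The pre-processing step can be sketched in Python as follows. This is a minimal illustration and not the authors' VB.net implementation: the function name `preprocess`, the five-word stop-word set and the sample sentence are assumptions made here, whereas the system described in this paper stores 957 stop words. Punctuation and symbols are detected through their Unicode general category, which also covers the Gurmukhi danda.

```python
import unicodedata

# Hypothetical, tiny stop-word set for illustration only;
# it is not the 957-word list used by the system in the paper.
STOP_WORDS = {"ਦੇ", "ਦੀ", "ਹੈ", "ਨੂੰ", "ਅਤੇ"}


def preprocess(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` without punctuation, symbols or stop words."""
    cleaned = []
    for ch in text:
        # General categories starting with "P" are punctuation and
        # those starting with "S" are symbols; both become spaces.
        if unicodedata.category(ch)[0] in ("P", "S"):
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    tokens = "".join(cleaned).split()
    return [tok for tok in tokens if tok not in stop_words]


if __name__ == "__main__":
    # Assumed sample input, not taken from the paper's corpus.
    print(preprocess("ਧਰਮ ਦੇ ਗੁਰੂ।"))
```

Splitting on whitespace after blanking punctuation keeps vowel signs attached to their consonants, since combining marks are not in the punctuation or symbol categories.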

2.2. Training phase


The training phase of the classifier works according to the steps below:
1. For each category, a table with the name of the category is created.
2. The category table has the fields Key, Word, Subcategory and Frequency.
3. The Key field serves as the primary key of the table.
4. An equal number of documents is needed for each category from which its ontology is created.
Likewise, each subcategory within a category should have an equal number of training documents
for constructing its ontology. This is done to avoid bias.
5. Each new word (excluding stop words) belonging to a category is inserted into the table
of the corresponding category.
6. In the same row as the word, the subcategory to which the word belongs is stored.
7. The frequency with which the word occurs across all documents of that subcategory is then
stored.
8. This completes the storage of one word in the category table. The process is repeated for
each word in every category and subcategory.
9. End of the training phase.
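
The training steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module in place of the SQL Server database used by the application, and the function name `build_category_table`, the input layout (a mapping from subcategory name to a list of tokenized documents) and the toy tokens are hypothetical.

```python
import sqlite3
from collections import Counter


def build_category_table(conn, category, docs_by_subcategory, stop_words=frozenset()):
    """Create the table for one category and fill it with word frequencies.

    docs_by_subcategory maps a subcategory name to a list of documents,
    where each document is a list of tokens.
    """
    # Step 4: every subcategory must contribute the same number of documents.
    sizes = {len(docs) for docs in docs_by_subcategory.values()}
    if len(sizes) > 1:
        raise ValueError("each subcategory needs the same number of training documents")

    # Steps 1-3: one table per category with Key as the primary key.
    table = '"' + category.replace('"', '""') + '"'
    conn.execute(
        f"CREATE TABLE {table} ("
        '"Key" INTEGER PRIMARY KEY, "Word" TEXT, '
        '"Subcategory" TEXT, "Frequency" INTEGER)'
    )

    # Steps 5-8: one row per (word, subcategory) pair with its frequency
    # across all documents of that subcategory.
    for subcategory, docs in docs_by_subcategory.items():
        counts = Counter(t for doc in docs for t in doc if t not in stop_words)
        conn.executemany(
            f'INSERT INTO {table} ("Word", "Subcategory", "Frequency") VALUES (?, ?, ?)',
            [(word, subcategory, n) for word, n in counts.items()],
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Toy, transliterated tokens; assumed data, not the paper's corpus.
    build_category_table(
        conn,
        "Dharam",
        {"Sikh Dharam": [["guru", "granth", "guru"]], "Islam Dharam": [["namaz", "guru"]]},
    )
    for row in conn.execute('SELECT * FROM "Dharam"'):
        print(row)
```

Because a word is stored once per subcategory in which it appears, the same word can occur in several rows of a category table, which is the case the testing phase handles by comparing frequencies.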

2.3. Testing phase


This is the last phase of the classifier: when an unlabelled document is submitted, the
classifier automatically assigns a category and a subcategory to it. The steps followed in the
testing phase are:
1. Tokenize the text document submitted by the user; the result is a bag-of-words model.
2. Remove stop words from the resulting bag of tokenized words.
3. For each remaining word (let abc be one such word) in the bag of words:
(i) Take x variables for the x categories.
(ii) In each category table, add up the frequencies stored for the word abc and store the
sum in the variable for that category.
(iii) Take the variable with the maximum value among the x variables.
(iv) Suppose variable x4 is the maximum; x4 denotes one particular category.
(v) The word abc is then taken to belong to that category.
(vi) Repeat the above steps for all words in the text document submitted by the user.
(vii) The category to whose table the maximum number of words in the document belong
is the main category of the text document.
4. The category of the submitted text document is now known. For each word (let abc be one
such word) in the document, excluding stop words:
(i) Look up the word in the table of that category.
(ii) If the word abc is present in the category table, check how many times it occurs
in the table.
(iii) If the word abc occurs only once in the category table, the subcategory field of
that row is taken as the subcategory of the document.
(iv) If the word abc occurs more than once in the category table, the subcategory field
of the row with the maximum frequency can be considered as the subcategory of the
document.
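
The testing steps above can be sketched in Python as follows. This is an illustration under assumptions rather than the application's code: the ontology is held in a plain dictionary instead of SQL tables, the names `classify`, `CategoryB` and `SubB1` and all toy tokens are hypothetical, and words found in no category table are ignored. The steps do not say how the per-word subcategories are combined into one subcategory for the document, so the sketch assumes a majority vote, mirroring step 3(vii).

```python
from collections import Counter


def classify(tokens, ontology, stop_words=frozenset()):
    """Return (category, subcategory) for a tokenized document.

    ontology maps a category name to its table rows, each row being a
    (word, subcategory, frequency) tuple.
    """
    words = [t for t in tokens if t not in stop_words]

    # Step 3: each word votes for the category with the largest summed frequency.
    category_votes = Counter()
    for word in words:
        totals = {
            cat: sum(freq for w, _, freq in rows if w == word)
            for cat, rows in ontology.items()
        }
        best = max(totals, key=totals.get, default=None)
        if best is not None and totals[best] > 0:
            category_votes[best] += 1
    if not category_votes:
        return None, None
    category = category_votes.most_common(1)[0][0]

    # Step 4: within the chosen category, each word points to the subcategory
    # of its highest-frequency row; the document takes the majority (assumed).
    subcategory_votes = Counter()
    for word in words:
        matches = [(freq, sub) for w, sub, freq in ontology[category] if w == word]
        if matches:
            subcategory_votes[max(matches, key=lambda m: m[0])[1]] += 1
    subcategory = subcategory_votes.most_common(1)[0][0] if subcategory_votes else None
    return category, subcategory


if __name__ == "__main__":
    # Toy ontology with transliterated tokens; assumed data for illustration.
    ontology = {
        "Dharam": [
            ("guru", "Sikh Dharam", 5),
            ("guru", "Hindu Dharam", 2),
            ("granth", "Sikh Dharam", 3),
            ("namaz", "Islam Dharam", 4),
        ],
        "CategoryB": [("match", "SubB1", 6), ("guru", "SubB1", 1)],
    }
    print(classify(["guru", "granth", "match", "the"], ontology, stop_words={"the"}))
```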

Figure 1. Flowchart of how classification process works

3. ANALYSIS AND RESULTS

3.1. Dataset
Around 235 Punjabi text pages per category are used for the training phase of the classifier;
235 pages multiplied by 8 categories gives a total training set of 1880 pages. The text pages
in each category contain words related to that category. 100 pages in the Punjabi language are
used for testing. The documents are online content taken from websites such as
ajit.com, pa.wikipedia.org, jagbani.com and punjabipedia.org. The data is divided into
eight categories, each of which is further divided into subcategories, ranging from five subcategories
to seven. Categories implemented in the application are: , ,
, , , , , . VB.net is used
for the front end of the application, while SQL Server is used as the database. The 957 stop
words and the category-related ontology are stored in SQL tables in the database.

3.2. Screenshot
Figure 2 shows the application's main page, which classifies the text entered in the text box.
It first shows a Show category button, which displays the category of the entered text. Once
the category has been given, a Show subcategory button appears, which displays the subcategory
of the text.

Figure 2. Punjabi Classifier Application

3.3. Exploratory results


The F-score is calculated first for each category and then for each subcategory using the following equations:
F-score = (2 * Precision * Recall) / (Precision + Recall)
Precision = (documents correctly classified in category C) / (total documents retrieved in category C)
Recall = (documents correctly classified in category C) / (total relevant documents in the test set that belong to category C)
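
As a small worked illustration of these formulas, the sketch below computes precision, recall and F-score from three counts. The function name `f_score`, its parameter names and the example counts are assumptions made here; the counts are not results from this paper.

```python
def f_score(correct, retrieved, relevant):
    """Return (precision, recall, F-score) for one category C.

    correct   -- documents correctly classified in category C
    retrieved -- total documents the classifier assigned to category C
    relevant  -- total test documents that truly belong to category C
    """
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 8 correct out of 10 retrieved, 9 relevant in the test set.
    precision, recall, f = f_score(8, 10, 9)
    print(round(precision, 2), round(recall, 2), round(f, 2))
```

The guards against zero denominators cover a category for which nothing was retrieved or nothing was relevant, a case the formulas above leave undefined.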
Table 1. F-score of each category

Category    F-Score
            0.88
            0.94
            0.94
            1
            0.88
            0.90
            1
            0.94

Table 2. F-score of each subcategory

Subcategory    F-Score
               0.88
               0.88
               0.88
               1
               0.82
               0.90
               1
               0.88

The two tables above show the F-scores by category and then by subcategory. A total of
80 documents were used to carry out this test. As the tables show, the F-scores for categories
all lie above 0.85, and those for subcategories do so in all but one case (0.82), which is a
good result considering the benchmarks.

4. CONCLUSION
An ontology of different categories and subcategories is created to construct a Punjabi
classifier application. Thanks to the ontology based approach, there is no need for a separate
training set. The results obtained by the classifier using the ontology based approach are
quite satisfactory (>85%). They show how a correctly implemented ontology can help in building
a text classifier.
Although an ontology based classifier for the Punjabi language has been developed earlier,
this work takes it to the next stage by constructing a hierarchical classifier for the Punjabi
language.

