ABSTRACT
Given the large amount of digital content available online, categorizing data for convenient access
has become a necessity. Classifying each document manually is a tedious task. To avoid manual
classification, text mining techniques can be used to build a program that automatically assigns
unclassified text documents to appropriate categories. The approach used here to classify documents
is ontology based: an ontology is created for each category. The technique not only assigns a text
document to a category but also assigns it to the appropriate subcategory within that category.
KEYWORDS
Document Classification, Ontology, Punjabi Text Classification.
1. INTRODUCTION
Although previous work has addressed the classification of documents in the Punjabi language, a
system for hierarchical classification of Punjabi text has not yet been developed. The system
implemented here treats the text document submitted by the user as unclassified text and
classifies it into a category using the underlying ontology based approach. One of the categories
used in the system is Dharam, which contains the following subcategories: Islam Dharam, Isai
Dharam, Sikh Guru, Sikh Dharam, Hindu Dharam, Jain Dharam and Bodhi Dharam. These categories
and their subcategories are constructed with the help of a Punjabi corpus. An ontology for each
category is created in the Punjabi language, and each category consists of related terms. An
ontology based classification system does not require a separate training set, which makes it
easier to implement.
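To illustrate the idea that each category is represented by a set of related terms, a minimal Python sketch follows. The ontology contents, the transliterated terms, the second category name and the function name `classify_category` are all hypothetical, and the scoring rule (counting matching terms) is an assumption rather than the rule confirmed for this system.

```python
# Hypothetical ontology: category name -> set of related terms (transliterated).
# These terms are placeholders, not the contents of the system's ontology.
ONTOLOGY = {
    "Dharam": {"guru", "granth", "mandir"},
    "HypotheticalCategory": {"match", "team", "goal"},
}


def classify_category(text, ontology):
    """Return the category whose term set matches the most tokens, or None."""
    tokens = text.split()
    best_category, best_score = None, 0
    for category, terms in ontology.items():
        # Assumed scoring rule: number of tokens found in the category's terms.
        score = sum(1 for token in tokens if token in terms)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


if __name__ == "__main__":
    print(classify_category("guru mandir team", ONTOLOGY))
```

Because the lookup relies only on the stored terms, no labelled training documents are needed at classification time, which is the property the paragraph above describes.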
2. PROPOSED METHODOLOGY
Hierarchical classification of Punjabi documents aims to assign a category and a subcategory to
the unknown text document submitted by the user. The application does this automatically. First,
the classifier needs to be trained to classify unknown text documents correctly. The phases of
the classifier work as described below:
(iii) If the word (abc) occurs only once in the category table, then the subcategory field
corresponding to that word in the category table is taken as the subcategory of the
document.
(iv) If the word (abc) occurs more than once in the category table, then the subcategory
field of the row for (abc) that has the maximum value in the frequency field is taken as
the subcategory of the document.
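Rules (iii) and (iv) can be sketched in Python as follows. The row layout `(word, subcategory, frequency)`, the sample rows and the name `pick_subcategory` are assumptions made for illustration; how the word (abc) itself is selected belongs to the earlier steps and is not modelled here.

```python
# Hypothetical category table rows: (word, subcategory, frequency).
CATEGORY_TABLE = [
    ("guru", "Sikh Guru", 12),
    ("guru", "Sikh Dharam", 5),
    ("masjid", "Islam Dharam", 7),
]


def pick_subcategory(word, category_table):
    """Apply rules (iii) and (iv) to one word; return None if the word is absent."""
    rows = [row for row in category_table if row[0] == word]
    if not rows:
        return None
    if len(rows) == 1:
        # Rule (iii): a single occurrence decides the subcategory directly.
        return rows[0][1]
    # Rule (iv): several occurrences, so take the row with the highest frequency.
    return max(rows, key=lambda row: row[2])[1]


if __name__ == "__main__":
    print(pick_subcategory("masjid", CATEGORY_TABLE))
    print(pick_subcategory("guru", CATEGORY_TABLE))
```

With the sample rows, "masjid" falls under rule (iii) and "guru" under rule (iv).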
3.1 Dataset
Around 235 Punjabi text pages per category are used for the training phase of the classifier;
with 8 categories this gives a training set of 1880 pages. The text pages in each category
contain words related to that category. 100 pages in Punjabi are used for testing. The
documents are online content taken from websites such as ajit.com, pa.wikipedia.org,
jagbani.com and punjabipedia.org. The data is divided into eight categories, each of which is
further divided into five to seven subcategories. VB.NET is used for the front end of the
application and SQL Server is used as the database. 957 stopwords and the category related
ontology are stored in SQL tables in the database.
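The storage layout described above can be sketched in Python, with an in-memory SQLite database standing in for SQL Server. The table names, column names and sample rows are hypothetical; the actual schema used by the application is not given in the text.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical stopword and ontology tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stopwords (word TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE ontology ("
        "word TEXT, category TEXT, subcategory TEXT, frequency INTEGER)"
    )
    # Placeholder rows; the real system stores 957 stopwords.
    conn.executemany("INSERT INTO stopwords VALUES (?)", [("da",), ("di",), ("te",)])
    conn.executemany(
        "INSERT INTO ontology VALUES (?, ?, ?, ?)",
        [("guru", "Dharam", "Sikh Guru", 12)],
    )
    conn.commit()
    return conn


def remove_stopwords(tokens, conn):
    """Drop every token that appears in the stopwords table."""
    stop = {row[0] for row in conn.execute("SELECT word FROM stopwords")}
    return [token for token in tokens if token not in stop]


def ontology_rows(word, conn):
    """Return (category, subcategory, frequency) rows stored for a word."""
    query = "SELECT category, subcategory, frequency FROM ontology WHERE word = ?"
    return conn.execute(query, (word,)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(remove_stopwords(["guru", "da", "ghar"], db))
    print(ontology_rows("guru", db))
```

Keeping stopwords and ontology terms in tables means that either list can be extended without changing the application code.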
3.2 Screenshot
Figure 2 shows the application's main page, which classifies the text entered in the text box.
It first shows a Show category button, which displays the category of the entered text. Once
the category has been displayed, a Show subcategory button appears, which displays the
subcategory of the text.
Category (row)     F-Score
1                  0.88
2                  0.94
3                  0.94
4                  1
5                  0.88
6                  0.90
7                  1
8                  0.94

Subcategory (row)  F-Score
1                  0.88
2                  0.88
3                  0.88
4                  1
5                  0.82
6                  0.90
7                  1
8                  0.88
The two tables above show the F-score category wise and subcategory wise. A total of
80 documents were used for this test. All category F-scores and all but one subcategory
F-score (0.82) lie above 0.85, which is a good score considering the benchmarks.
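As a cross-check on the reported figures, the short Python sketch below averages the listed scores. It assumes the F-score is the usual balanced F1, the harmonic mean of precision and recall; the text does not state which variant was used, and the helper names are illustrative.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F-scores copied from the two tables, in row order.
CATEGORY_F = [0.88, 0.94, 0.94, 1.0, 0.88, 0.90, 1.0, 0.94]
SUBCATEGORY_F = [0.88, 0.88, 0.88, 1.0, 0.82, 0.90, 1.0, 0.88]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    print(round(mean(CATEGORY_F), 3))
    print(round(mean(SUBCATEGORY_F), 3))
    print(min(SUBCATEGORY_F))
```

The averages come to about 0.935 for categories and 0.905 for subcategories, so the mean lies above 0.85 at both levels even though one subcategory row is at 0.82.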
4. CONCLUSION
Ontologies for different categories and subcategories were created to build a Punjabi classifier
application. Thanks to the ontology based approach, there is no need for a separate training set.
The results obtained by the classifier using the ontology based approach are quite satisfactory
(>85%). They show how a correctly implemented ontology can help in building a text classifier.
Although an ontology based classifier for the Punjabi language had been developed earlier, this
work takes it to the next stage by constructing a hierarchical classifier for the Punjabi
language.