
The Effect of Part-of-Speech Tagging on IR Performance for Turkish

B. Taner Dinçer and Bahar Karaoğlan


Ege Üniversitesi, Uluslararası Bilgisayar Enstitüsü, 35100 Bornova, İzmir, Türkiye {dtaner,bahar}@ube.ege.edu.tr

Abstract. In this paper, we experimentally evaluate the effect of Part-of-Speech (POS) tagging on Information Retrieval performance for Turkish. We used four term-weighting schemas to index the SABANCI-METU Turkish Treebank corpus: tf, tf x idf, Ltu, and Okapi. Each weighting scheme is factored over three POS tagging cases, namely no POS tagging, POS tag with no history (i.e. 1-gram), and POS tag with one step of history (i.e. 2-gram). The Meta-scoring function is used to analyze the effect of these nine factor combinations on IR performance. Results show that the weighting schemas are significantly different from each other with a p-value of 0.04 (Friedman non-parametric test), but there is not enough evidence in the corpus to reject the null hypothesis that the three weighting schemas, on average, show equal performance over the three cases of POS tagging (p-value of 0.36).

1 Introduction

Information Retrieval (IR) systems are used to search large collections of electronic documents for user information needs. The information in a document is carried by the semantics of its words. Hence, an IR system actually deals with those words, which represent the semantics that are the true building blocks of the intended information. Index term selection is the task of finding, manually or automatically, the words or collocations that represent the potential information in a particular document. Those terms are then used to represent the document in an IR system for further processing. The major task of IR systems that use automatic indexing is to find the most important terms to represent the documents. Term weighting, or index term weighting, is the task of assigning a quantitative value to a particular term according to its importance as a representative of each document in the collection. POS (Part-of-Speech) tagging is the process of assigning grammatical functions to terms in a written text. Tagging a term with a part-of-speech allows the use of this information in term weighting schemas. Natural language processing techniques such as POS tagging may improve retrieval performance, but the degree of improvement varies from minimal to moderately helpful [1].

C. Aykanat et al. (Eds.): ISCIS 2004, LNCS 3280, pp. 771–778, 2004. © Springer-Verlag Berlin Heidelberg 2004


In this study, we examine the effect of POS tagging on three different index term weighting schemas and, consequently, on retrieval performance. The standard tf x idf [2], the Ltu [3] and the Okapi [4] schemes are used as weighting schemas in our experiments. Term weighting without POS tagging, with the POS tag of the target term, and with the POS tags of the target and previous terms are taken as the treatment groups. The Meta-Scoring function [1] is used as a quantitative metric to compare the results obtained from the nine constructed test beds without precision and recall. Jin et al. [1] state that Meta-scores are always consistent with average precision for all of their six test collections and four different term weighting schemas. The results reveal that there is a significant difference between weighting schemas with or without POS tag information for Turkish texts, free from statistical fluctuation (Friedman non-parametric test, p-value of 0.04). Although there seems to be no difference between no tagging, tagging with the target term's POS, and tagging with the target and previous terms' POS for any weighting schema, it would be false to say that there is no tendency to favor POS tagging over no tagging. The p-value of 0.36 is not strong evidence of complete randomness; however, it is also not enough to say that a difference exists. This inconclusive situation may be due to the small size of the test collection and is further discussed in Section 3. The paper is organized as follows: in Section 2, we give our experimental design; in Section 3, we present the results and discussion of our experiments; the conclusions are given in Section 4.

2 Experimental Design

2.1 The Test Corpus

We used the SABANCI-METU Turkish Treebank [5, 6] as our test corpus. It contains 7,262 sentences that are manually POS-tagged, syntactic-function-tagged, and disambiguated. Some statistics and the proportions of the genres of the corpus are given in Table 1. The corpus has been constructed from parts of the METU Turkish Corpus [7] and has approximately the same proportions of genres. Intuitively, a corpus length of 63,916 tokens is not large enough for statistical hypothesis testing with a high level of confidence. However, it is not so small as to conclude that it carries no information for testing some linguistic methods. Besides, it is the only manually tagged and disambiguated corpus for Turkish.

Table 1. Some descriptive statistics of the SABANCI-METU Turkish Treebank

Topics        Doc. %   Parag. %   Sent. %   Token #   Token %
Memoirs         3.03       3.30      3.72      4519      7.07
Research        6.06       3.09     24.98     13135     20.55
Essay           9.09      15.55      5.08      2284      3.57
Travel          3.03       1.10      3.18      2908      4.55
News           27.27      16.04      4.79      4040      6.32
Paper           9.09       2.28     17.55      7659     11.98
Story          18.18      22.83      2.27      2142      3.35
Short Story     6.06       3.34     11.78     11919     18.65
Novel          18.18      32.48     26.66     15310     23.95
Total         100        100       100       63916    100

The vocabulary size of the corpus is 17,518 tokens. This value is approximately equal to 30% of the corpus size, which causes the document-term matrices for any weighting scheme to be sparse. This huge vocabulary size is a side effect of Turkish being an agglutinative language, and it brings the need for stemming as a major preparation step for Information Retrieval tasks. However, we did no stemming in our study, because our aim is to find the effect of POS tagging on retrieval performance independently of any other factor. Another problem concerns our evaluation metric: Meta-Scoring may be thought of as a discriminant analysis of the n-dimensional document space. In other words, the score increases as a weighting schema discriminates the document content space into a number of orthogonal axes that fits the number of collection space dimensions. However, a vocabulary of that size may readily discriminate the space into all possible orthogonal axes. In this case, the Meta-Scoring assessment is limited to the small room for discrimination that can be produced by the weighting schemas, which may be meaningless with respect to the statistical significance of change.

2.2 Evaluation Metric: Meta-Scoring

In Information Retrieval, performance is evaluated with the widely accepted precision and recall metrics [2]. Evaluating these metrics for a corpus with given queries requires human judgments about the relevance between the queries and the retrieved documents. There are two disadvantages to this method of performance evaluation: (1) relevance judgments by human subjects for every document against all queries are very expensive because of the large size of the collections used for IR purposes; (2) human judgment about the relevance between documents and queries is not free from bias. Human judgments are subjective, thus more than one person is needed to judge relevance for an objective evaluation of IR performance. The most essential property of the Meta-Scoring method is that it evaluates an objective relevance judgment for all possible queries that can be formulated from the considered collection against all documents.


Meta-scoring [1] is a way of bypassing the human effort for relevance judgment. It compares two different document-term matrices generated by two different term weighting schemas and computes a goodness score for each of them. Jin et al. state that these goodness scores for different weighting schemas are correlated with their information retrieval performance as evaluated by the average precision metric. The Meta-score is based on the notion of Mutual Information between two random variables C and D, and is written as:

I(C; D) = H(C) - H(C \mid D)    (1)

In other words, the Mutual Information I(C; D) is the difference between the entropy of the random variable C, H(C), and the average entropy of C when the value of the random variable D is given, H(C|D). In the Meta-Scoring formulation, the random variable C represents the document contents and the random variable D represents the content vectors. To model the document contents, Jin et al. use the idea of Latent Semantic Indexing (LSI) [8]. Therefore, the random variable C can take values from the set of eigenvectors {v_1, v_2, ..., v_n} of the document matrix D. The probability that C is equal to an eigenvector v_i, i.e. P(C = v_i), can be calculated from the corresponding eigenvalues \lambda_i, i = 1, ..., n, of the document matrix D as:

P(C = v_i) = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}, \quad 1 \le i \le n    (2)

The document matrix D is a square, symmetric matrix evaluated from the document-term matrix M as:

D_{n \times n} = M_{n \times m} M^{T}_{m \times n}    (3)

The random variable D can take values from the set of document vectors, i.e. the documents in the collection d_1, d_2, ..., d_n. Since every document in the collection has equal importance, P(D = d_i) is constant:

P(D = d_i) = \frac{1}{n}, \quad 1 \le i \le n    (4)

To compute the intended mutual information, the last probability that must be evaluated is P(C = v_i | D = d_j), in other words, the probability that the document content is equal to the eigenvector v_i given that the random variable D is equal to the document vector d_j:

P(C = v_i \mid D = d_j) = \frac{|d_j^{T} v_i|}{\sum_{k=1}^{n} |d_j^{T} v_k|}, \quad 1 \le i, j \le n    (5)
2.3 Weighting Schemas Used in Experiments

In our study, we used three different weighting schemas, shown in Table 2, to evaluate the effect of POS tagging on IR performance: the standard tfxidf [2], the Ltu [3] and the Okapi [4] weighting schemes. In addition to these three methods, tf (i.e. raw term frequency) is used as a baseline for the performance comparisons. Ltu and Okapi are two well-known term weighting schemas; both take document length into account, which is not a property of the tfxidf scheme.
Table 2. Term-weighting schemas used in the experiments. tf is the frequency of a term, max_tf is the maximum term frequency in a document, N is the total number of documents in the collection, df is the document frequency, dl is the document length and avg_dl is the average document length

Name     Term weighting schema
tf       tf
tfxidf   tf x log(N / df)
Ltu      ((log(tf) + 1) x log(N / df)) / (0.8 + 0.2 x dl / avg_dl)
Okapi    (tf / (0.5 + 1.5 x dl / avg_dl + tf)) x log((N - df + 0.5) / (df + 0.5))
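The four schemes of Table 2 translate directly into vectorized code. The sketch below is illustrative (not the authors' implementation); it assumes the natural logarithm, a docs x terms term-frequency matrix in which every term occurs in at least one document, and zero weight for entries with tf = 0.

```python
import numpy as np

def weight_matrices(tf):
    """Apply the four Table 2 schemes to a raw term-frequency matrix tf (docs x terms)."""
    tf = np.asarray(tf, dtype=float)
    N = tf.shape[0]                               # number of documents in the collection
    df = (tf > 0).sum(axis=0)                     # document frequency of each term
    dl = tf.sum(axis=1, keepdims=True)            # document lengths
    avg_dl = dl.mean()                            # average document length
    idf = np.log(N / df)

    log_tf = np.log(np.where(tf > 0, tf, 1.0))    # log(tf), harmless 0 where tf = 0
    pivot = 0.8 + 0.2 * dl / avg_dl               # pivoted document-length normalization
    ltu = np.where(tf > 0, (log_tf + 1) * idf, 0.0) / pivot

    okapi_idf = np.log((N - df + 0.5) / (df + 0.5))
    okapi = tf / (0.5 + 1.5 * dl / avg_dl + tf) * okapi_idf

    return {"tf": tf, "tfxidf": tf * idf, "Ltu": ltu, "Okapi": okapi}
```

Each returned matrix is one of the document-term matrices that the Meta-score then compares; note that Okapi's idf factor goes negative for terms occurring in more than half of the documents, which is an inherent property of that scheme.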


2.4 Description of the Experiments

We evaluated all four weighting schemas for each POS tagging case on the SABANCI-METU Turkish Treebank [5, 6]. This results in twelve document-term matrices to be tested. The enumeration of the test treatments is shown in Table 3. Columns of the table show the POS tagging treatments: Case0 represents no tagging, where the weighting scheme considers only the terms (i.e. words). Case1 represents the case where the term (term_i) is tagged with its POS (tag_i). In Case2, the triple (tag_i-1, tag_i, term_i) represents the target term (term_i), its POS tag (tag_i) and the POS tag of the previous term (tag_i-1). In other words, the cases may be thought of as n-grams over the POS sequence: 0-gram for Case0, 1-gram for Case1 and 2-gram for Case2.
Table 3. Enumeration of test cases

         Case0     Case1             Case2
tf       term_i    (tag_i, term_i)   (tag_i-1, tag_i, term_i)
tfxidf   term_i    (tag_i, term_i)   (tag_i-1, tag_i, term_i)
Ltu      term_i    (tag_i, term_i)   (tag_i-1, tag_i, term_i)
Okapi    term_i    (tag_i, term_i)   (tag_i-1, tag_i, term_i)

3 Results and Discussions

Figure 1 shows the overall results for the factors of the four weighting schemas and the three POS tagging treatments. The effects of all treatments and weighting schemas on index term weighting are tested with the Friedman non-parametric test for statistical independence (the Friedman test statistic is the non-parametric equivalent of two-way ANOVA in the parametric case).
[Figure: bar chart of Meta-scores (range approximately 1.2650 to 1.3000) for the tf, tfxidf, Ltu and Okapi schemes under Case0, Case1 and Case2.]

Fig. 1. Meta-scores of all test treatments
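The Friedman test used for this analysis is available in SciPy. The sketch below shows the shape of the computation with made-up Meta-scores; the paper reports its per-treatment scores only graphically in Fig. 1, so the numbers here are purely illustrative placeholders.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Rows are blocks (POS tagging cases), columns are treatments (weighting schemas).
# These scores are illustrative placeholders, NOT the values from the paper.
scores = np.array([
    #  tfxidf   Ltu     Okapi
    [1.2930, 1.2760, 1.2700],   # Case0
    [1.2950, 1.2790, 1.2690],   # Case1
    [1.2970, 1.2820, 1.2680],   # Case2
])

# One sample array per treatment; the test ranks treatments within each block.
stat, p = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print(f"Friedman chi-square = {stat:.3f}, p-value = {p:.3f}")
```

Swapping the roles of rows and columns tests the other factor: blocking on schemes and treating the POS cases as treatments corresponds to the p = 0.36 comparison reported below.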


As seen in the figure, the standard tfxidf weighting scheme outperforms the other two schemas on the SABANCI-METU Turkish Treebank corpus. This is not the expected outcome when an English corpus is under consideration, and it may be due to the large vocabulary size and the insufficient size of the corpus for IR purposes. At a significance level of 0.10, tfxidf is significantly different from both the Ltu scheme (p-value of 0.08) and the Okapi scheme (p-value, again, of 0.08). In addition, the Ltu scheme is significantly different from the Okapi scheme with a p-value of 0.08. The three weighting schemas are significantly different from each other at a significance level of 0.05, with a p-value of 0.04, without the treatment effect (i.e. without the POS tagging effect). It is worth noting that Jin et al. [1] report close performance of the Ltu and Okapi weighting schemas on six different test collections for English in their original work; however, the Ltu weighting scheme performs much better than the Okapi weighting scheme for Turkish. Although there is no empirical result showing that any of these weighting schemas differ in performance when POS tag information is included, it would not be true to conclude that these weighting schemas are completely independent of POS tag information. It is statistically evident from the corpus that there is no significant difference among the three POS tagging cases, but as can be seen in Figure 1, the tfxidf, Ltu and also tf weighting schemas tend to increase their performance as the information gathered from the POS tag context increases. These results with POS tags clearly reveal that the corpus size is not sufficient to make confident decisions about the effect of POS information on retrieval performance. Nevertheless, there is an empirical tendency towards differentiation among the POS tagging treatments, with a p-value of 0.36.

4 Conclusion

In this paper, we examined the effect of POS tag information on the performance of Information Retrieval systems for Turkish with four different weighting schemas. We used the Meta-Scoring [1] function as the evaluation metric; this method avoids the need for human relevance judgments of queries against retrieved documents. Our results reveal that there is a significant difference between the weighting schemas, namely the standard tfxidf [2], the Ltu [3] and the Okapi [4], free from statistical fluctuation at a significance level of 0.05 (Friedman non-parametric test, p-value of 0.04). On the other hand, statistically speaking, the test corpus does not carry enough evidence to judge whether POS tag information is helpful or not. However, it would be false to say that retrieval performance is completely independent of POS tag information, because the null hypothesis of equal performance could be retained only with a moderate p-value (0.36), which is not strong evidence of independence.


References

1. Jin, R., Faloutsos, C., Hauptmann, A.G.: Meta-Scoring: Automatically Evaluating Term Weighting Schemes in IR without Precision-Recall. In: Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana (2001)
2. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. 1st edition. Addison-Wesley, England (1999)
3. Buckley, C., Singhal, A., Mitra, M.: New Retrieval Approaches Using SMART. In: Harman, D.K. (ed.): Proceedings of the Fourth Text Retrieval Conference (TREC-4), Gaithersburg (1996)
4. Robertson, S.E., Walker, S.: Okapi/Keenbow at TREC-8. In: Voorhees, E.M., Harman, D.K. (eds.): Proceedings of the Eighth Text Retrieval Conference (TREC-8), Gaithersburg (2000)
5. Oflazer, K., Say, B., Hakkani-Tür, D.Z., Tür, G.: Building a Turkish Treebank. In: Abeillé, A. (ed.): Building and Exploiting Syntactically-Annotated Corpora. Kluwer Academic Publishers (2003)
6. Atalay, N.B., Oflazer, K., Say, B.: The Annotation Process in the Turkish Treebank. In: Proceedings of the EACL Workshop on Linguistically Interpreted Corpora, Budapest, Hungary (2003)
7. Say, B., Zeyrek, D., Oflazer, K., Özge, Ü.: Development of a Corpus and a Treebank for Present-Day Written Turkish. In: Proceedings of the Eleventh International Conference of Turkish Linguistics (2002)
8. Kwok, K.L., Grunfeld, L., Xu, J.H.: TREC-6 English and Chinese Retrieval Experiments Using PIRCS. In: Voorhees, E.M., Harman, D.K. (eds.): Proceedings of the Sixth Text Retrieval Conference (TREC-6), Gaithersburg (1997)
