A Simple Method for Estimating Term Mutual Information


D. Cai and T.L. McCluskey
Abstract: The ability to formally analyze and automatically measure the statistical dependence of terms is a core problem in many areas of science. One of the commonly used tools for this purpose is the expected mutual information (MI) measure. However, it seems that MI methods have not achieved their potential. The main difficulty in using the MI of terms is obtaining the actual probability distributions estimated from training data, as the true distributions are invariably not known. This study focuses on that problem and proposes a novel but simple method for estimating the probability distributions. Estimation functions are introduced; the mathematical meaning of the functions is interpreted and verification conditions are discussed. Examples are provided to illustrate how applying the method can fail when the verification conditions are not satisfied. An extension of the method is also considered.

Index Terms: Information analysis and extraction, dependence and relatedness of terms, statistical semantic analysis.

1 INTRODUCTION

THE ability to formally analyze and automatically measure the statistical dependence (relatedness, proximity, association, similarity) of terms in textual documents is a core problem in many areas of science, such as feature extraction and selection, concept learning and clustering, document representation and query formulation, and text analysis and data mining. Solving this problem has been a technical barrier for a variety of practical mathematical applications. One of the commonly used tools of analysis and measurement is the expected mutual information (MI) measure drawn from information theory [10], [16]. Many studies have used the measure for a variety of tasks in, for instance, feature selection [2], [1], [11], [15], document classification [18], face image clustering [17], multimodality image registration [13], and information retrieval [6], [7], [8], [9], [14].

However, it seems that MI methods have not achieved their potential. The main problem we face in using the expected MI measure is obtaining actual probability distributions: the true distributions are invariably not known, and we have to estimate them from training data. This work explores techniques of estimation.

To set out this study clearly, let us first introduce the concept of a term state value distribution. A term is usually thought of as having the states present or absent in a document. Thus, for an arbitrary term $t$, it will be convenient to introduce a variable $X_t$ taking values from the set $\Omega = \{1, 0\}$, where $X_t = 1$ expresses that $t$ is present and $X_t = 0$ expresses that $t$ is absent; we denote the two states by $t$ and $\bar{t}$ when $x_t = 1, 0$, respectively. We call $\Omega$ a state value space, and each element $x_t \in \Omega$ a state value, of $X_t$. Similarly, for an arbitrary term pair $(t_i, t_j)$, we introduce a variable pair $(X_i, X_j)$ taking values from the set $\Omega^2 = \{(1,1), (1,0), (0,1), (0,0)\}$. We call $\Omega^2$ a state value space, and each element $(x_i, x_j) \in \Omega^2$ a state value pair, of $(t_i, t_j)$.

Let $D$ be a collection of documents (training data), and $V$ a vocabulary of terms used to index individual documents in $D$. Denote by $T_d \subseteq V$ the set of terms occurring in document $d \in D$. Thus, for each term $t$ occurring in $d$, its state value distribution is

$$\hat{p}_d(x_t) = p(X_t = x_t \mid d) \qquad (x_t \in \Omega)$$

(D. Cai and T.L. McCluskey are with the School of Computing and Engineering, University of Huddersfield, UK, HD1 3BE.)

Obviously, each term $t \in T_d$ is matched to a state value distribution, and there are in total $|T_d|$ state value distributions for document $d$.

There exists a statistical dependence between two terms $t_i$ and $t_j$ if the state value of one of them provides mutual information about the probability of the state value of the other. Losee [12] showed that there is a relationship between the frequencies (or probabilities) of terms and the MI of terms. Therefore, a term $t$ taking some state value (say, $x_t = 1$) should be looked upon as complex, because the other state value (say, $x_t = 0$) of $t$, and the state values of many other terms (i.e., all terms in $T_d - \{t\}$), may be dependent on this $x_t$. Mathematically, for two arbitrary terms $t_i, t_j \in T_d$, the expected mutual information [10] about the probabilities of the state value pair $(x_i, x_j)$ of the term pair $(t_i, t_j)$ can be expressed by

$$I(X_i, X_j) = H(X_i) - H(X_i \mid X_j) = H(X_j) - H(X_j \mid X_i) = \sum_{x_i, x_j \in \Omega} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}$$



where $H(X_i)$ is the entropy of $X_i$, measuring the uncertainty about $X_i$, and $H(X_i \mid X_j)$ is the conditional entropy of $X_i$, measuring the uncertainty about $X_i$ given knowledge of $X_j$. Thus, $I(X_i, X_j)$ measures the amount of information that $X_j$ provides about $X_i$, and vice versa.

The estimation of the distributions $p(x_i)$ and $p(x_i, x_j)$ required in $I(X_i, X_j)$ is crucial for effectively distinguishing potentially dependent term pairs from the many others, and it remains an open issue; it is therefore the main concern of our current study. In Section 2, we introduce estimation functions, interpret the mathematical meaning of the functions, and discuss verification conditions. In Section 3, we provide examples to clarify the idea of our method. Section 4 considers an extension of our method, and conclusions are drawn in Section 5. Some mathematical details are given in the Appendix.
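To make the quantity concrete, the following minimal Python sketch (ours, not part of the original paper) evaluates $I(X_i, X_j)$ from the four joint probabilities over $\Omega^2$, forming the marginals by summation and using natural logarithms as in the worked examples of Section 3; the function name expected_mi and the dictionary-based representation are illustrative choices only.

```python
import math

def expected_mi(joint):
    """Expected MI of a term pair from its joint state value distribution.

    joint maps each state value pair (x_i, x_j) in {0,1} x {0,1} to a
    probability; the four values are assumed non-negative and sum to 1.
    """
    # Marginal distributions p(x_i) and p(x_j), obtained by summing the joint.
    p_i = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    p_j = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    mi = 0.0
    for (x, y), p_xy in joint.items():
        if p_xy > 0.0:  # the summand vanishes as p(x_i, x_j) -> 0
            mi += p_xy * math.log(p_xy / (p_i[x] * p_j[y]))
    return mi
```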

2 ESTIMATION

In practical applications, the state value distributions must be estimated from training data. The estimation of the joint state value distribution, $\hat{p}(x_i, x_j)$, is the more complicated task, and it is thus the main concern of this section.

Let us start by considering a given document. Say we have a document $d$ with $T_d = \{t_{k_1}, t_{k_2}, \ldots, t_{k_m}\} \subseteq \{t_1, t_2, \ldots, t_n\} = V$, where $1 \le k_1 < k_2 < \cdots < k_m \le n$. In this study, we always assume that $2 < m = |T_d|$ (namely, each document has at least three distinct terms). Generally, if we denote by $f_d(t)$ the frequency of term $t$ in $d$ and by $\lambda_d = \sum_{t \in T_d} f_d(t)$ the length of $d$ then, for a given document $d$, the term occurrence frequency distribution is given by

$$\varphi_d(t) = \varphi(t \mid d) = \frac{f_d(t)}{\lambda_d} \qquad (t \in T_d)$$

which should not be confused with the term state value distribution $\hat{p}_d(x_t)$: the former is a distribution over the terms of $T_d$, whereas the latter is a distribution over the two state values of an individual term.

In order to constitute the state value distributions, for arbitrary terms $t, t_i, t_j \in T_d$, we introduce two estimation functions:

$$\hat{\varphi}_d(t) = \frac{f_d(t)}{\lambda_d} \qquad\qquad (1)$$

$$\hat{\psi}_d(t_i, t_j) = \frac{f_d(t_i)\, f_d(t_j)}{\Delta_d} \qquad\qquad (2)$$

Here, and throughout this study, we denote the denominator of $\hat{\psi}_d(t_i, t_j)$ by

$$\Delta_d = \sum_{\substack{t_{k_i},\, t_{k_j} \in T_d \\ i < j}} f_d(t_{k_i})\, f_d(t_{k_j})$$

and, for an arbitrary term $t_i \in T_d$, denote

$$\Delta_i = \sum_{\substack{t_{k_j},\, t_{k_l} \in T_d - \{t_i\} \\ j < l}} f_d(t_{k_j})\, f_d(t_{k_l})$$

Clearly $\Delta_i \ge 1$, as $|T_d| > 2$.

Clearly, $0 < \hat{\varphi}_d(t), \hat{\psi}_d(t_i, t_j) < 1$ for arbitrary $t, t_i, t_j \in T_d$. Then, for each term $t \in T_d$, from the function $\hat{\varphi}_d(t)$, define $\hat{p}(x_t)$ by

$$\hat{p}(X_t = 1) = \hat{p}(t) = \hat{\varphi}_d(t) \qquad\quad \hat{p}(X_t = 0) = \hat{p}(\bar{t}) = 1 - \hat{\varphi}_d(t)$$

which is a probability distribution over $\Omega$. To constitute $\hat{p}(x_i, x_j)$ from the function $\hat{\psi}_d(t_i, t_j)$, for given terms $t_i, t_j \in T_d$, define

$$\begin{aligned} \hat{p}(X_i = 1, X_j = 1) &= \hat{p}(t_i, t_j) = \hat{\psi}_d(t_i, t_j) \\ \hat{p}(X_i = 1, X_j = 0) &= \hat{p}(t_i, \bar{t}_j) = \hat{\varphi}_d(t_i) - \hat{\psi}_d(t_i, t_j) \\ \hat{p}(X_i = 0, X_j = 1) &= \hat{p}(\bar{t}_i, t_j) = \hat{\varphi}_d(t_j) - \hat{\psi}_d(t_i, t_j) \\ \hat{p}(X_i = 0, X_j = 0) &= \hat{p}(\bar{t}_i, \bar{t}_j) = 1 - \hat{\varphi}_d(t_i) - \hat{\varphi}_d(t_j) + \hat{\psi}_d(t_i, t_j) \end{aligned} \qquad (3)$$

Note that $\hat{p}(x_i, x_j)$ may not constitute a probability distribution. Next, we need to prove that, under some conditions,

$$\hat{p}(x_i, x_j) \qquad \big((x_i, x_j) \in \Omega^2\big) \qquad\qquad (4)$$

is a probability distribution, by Theorem 1 below. To prove Theorem 1, we need to introduce two lemmas. Detailed proofs are given in the Appendix.

Lemma 1. For an arbitrary term $t_i \in T_d$, we have $\Delta_d = f_d(t_i)\,[\lambda_d - f_d(t_i)] + \Delta_i$.

Lemma 2. For the functions $\hat{\varphi}_d(t)$ and $\hat{\psi}_d(t_i, t_j)$ given in (1) and (2), respectively, we have: (a) $\Delta_j \ge f_d(t_j)^2$ if and only if $\hat{\varphi}_d(t_i) \ge \hat{\psi}_d(t_i, t_j)$; (b) $\Delta_i \ge f_d(t_i)^2$ if and only if $\hat{\varphi}_d(t_j) \ge \hat{\psi}_d(t_i, t_j)$.

We are now ready to introduce Theorem 1 below. A detailed proof is given in the Appendix.

Theorem 1. For arbitrary terms $t_i, t_j \in T_d$, the expression $\hat{p}(x_i, x_j)$ given in (3) and (4) is a probability distribution over $\Omega^2$ if it satisfies two inequalities: a) $\Delta_i \ge f_d(t_i)^2$ and b) $\Delta_j \ge f_d(t_j)^2$.

The first lemma tells us that there exists a relationship between $\Delta_d$ and $\Delta_i$. The second lemma tells us how to verify the two inequalities (conditions) required in Theorem 1: $\Delta_i \ge f_d(t_i)^2$ and $\Delta_j \ge f_d(t_j)^2$ may be simply verified by $\hat{\varphi}_d(t_j) \ge \hat{\psi}_d(t_i, t_j)$ and $\hat{\varphi}_d(t_i) \ge \hat{\psi}_d(t_i, t_j)$, respectively.

The reasoning behind the estimate $\hat{p}(x_i, x_j)$ is rather intuitive. $\hat{p}(X_i = 1, X_j = 0)$ and $\hat{p}(X_i = 0, X_j = 1)$ are derived from two constraints:

$$\hat{p}(X_i = 1, X_j = 0) + \hat{p}(X_i = 1, X_j = 1) = \hat{p}(X_i = 1);$$
$$\hat{p}(X_i = 0, X_j = 1) + \hat{p}(X_i = 1, X_j = 1) = \hat{p}(X_j = 1);$$

which ensure that both $\hat{p}(x_i)$ and $\hat{p}(x_j)$ are marginal distributions of $\hat{p}(x_i, x_j)$. $\hat{p}(X_i = 0, X_j = 0)$ is derived from another constraint, namely that the four components sum to unity.
3 DISCUSSION
It should be emphasized that, in order to speak of the MI of terms, we must verify that the two arguments of $I(X_i, X_j)$ are probability distributions. For instance, in our method, they should satisfy the two inequalities given in Theorem 1. Let us look at the examples below, which will help to clarify the idea and make understandable the computation involved in all the above formulae.

It is worth explaining the derivation of $\hat{p}(X_i = 1, X_j = 1) = \hat{\psi}_d(t_i, t_j)$ in more detail. In practice, the estimation functions should be considered carefully and introduced meaningfully according to the specific application problem. Let us now explain the meaning of $\hat{\psi}_d(t_i, t_j)$ given in (2). It may be easiest to make the explanation through an $m \times m$ matrix. Suppose we are given a document $d$ represented by a $1 \times m$ (frequency) matrix

$$F_d = \big[f_d(t_{k_1}), f_d(t_{k_2}), \ldots, f_d(t_{k_m})\big] = \big[f_d(t_k)\big]_{1 \times m}$$

in which each element is a frequency satisfying $f_d(t) \ge 1$ when $t \in T_d$ and $f_d(t) = 0$ when $t \notin T_d$. The matrix product $F_d^{\mathrm{T}} F_d$ can then be written as

$$F_d^{\mathrm{T}} F_d = \big[f_d(t_{k_i})\, f_d(t_{k_j})\big]_{m \times m}$$

Generally, $[f_d(t_{k_i})\, f_d(t_{k_j})]_{m \times m}$, which is symmetric, is called the co-occurrence frequency matrix of terms concerning $d$. Hence,

$$\big[\hat{p}(X_i = 1, X_j = 1)\big]_{m \times m} = \frac{1}{\Delta_d}\,\big[f_d(t_{k_i})\, f_d(t_{k_j})\big]_{m \times m} = \big[\hat{\psi}_d(t_{k_i}, t_{k_j})\big]_{m \times m}$$

can be referred to as the normalized co-occurrence frequency matrix of terms concerning $d$. Consequently, $\hat{p}(X_i = 1, X_j = 1) = \hat{\psi}_d(t_i, t_j)$, for $i, j = 1, \ldots, m$, can be represented by an $m \times m$ matrix: its numerator, $f_d(t_i)\, f_d(t_j)$, characterizes the co-occurrence frequencies of $t_i$ and $t_j$ in document $d$; its denominator, $\Delta_d$, the sum of all possible numerators $f_d(t_i)\, f_d(t_j)$ for $i < j$, $i, j = 1, \ldots, m$, is a normalization factor for the characterization. Clearly, $\Delta_d$ is a constant for all term pairs $(t_i, t_j)$ occurring in a given document. Note that the assumption $|T_d| > 2$ ensures that there exists more than one non-zero element in the matrix, so that $[\hat{p}(X_i = 1, X_j = 1)] \ne [0]_{m \times m}$. Notice also that, because no two components of a term pair $(t_i, t_j)$ can be the same, the diagonal elements with $i = j$, corresponding to $\hat{p}(X_i = 1, X_i = 1)$ for $i = 1, \ldots, m$, should not be considered in our context. It is only for notational convenience that these elements are included in the matrix.

Example 1. Suppose $d = \{t_1, t_2, t_2, t_3, t_3, t_4, t_5, t_6\}$; then we have $T_d = \{t_1, t_2, t_3, t_4, t_5, t_6\}$, $\lambda_d = 8$, and $\Delta_d = 26$. Thus, for the term pair $(t_1, t_2)$, from (2) and (3), we have

$$\hat{p}(X_1 = 1, X_2 = 1) = \frac{f_d(t_1)\, f_d(t_2)}{\Delta_d} = \frac{1 \times 2}{26} \approx 0.0769 > 0$$

$$\hat{p}(X_1 = 1, X_2 = 0) = \frac{1}{8} - \frac{2}{26} \approx 0.0481 > 0 \qquad \hat{p}(X_1 = 0, X_2 = 1) = \frac{2}{8} - \frac{2}{26} \approx 0.1731 > 0$$

$$\hat{p}(X_1 = 0, X_2 = 0) = 1 - \frac{1}{8} - \frac{2}{8} + \frac{2}{26} \approx 0.7019 > 0$$

Then it follows immediately (using natural logarithms) that

$$I(X_1, X_2) = 0.0769 \log\frac{0.0769}{0.125 \times 0.25} + 0.0481 \log\frac{0.0481}{0.125 \times 0.75} + 0.1731 \log\frac{0.1731}{0.875 \times 0.25} + 0.7019 \log\frac{0.7019}{0.875 \times 0.75}$$
$$= 0.0693 - 0.0321 - 0.0405 + 0.0472 = 0.0439$$

Example 2. Suppose that we are given a document $d = \{t_1, t_2, t_2, t_2, t_3, t_4\}$. From this we have

$$\Delta_d = f_d(t_1) f_d(t_2) + f_d(t_1) f_d(t_3) + f_d(t_1) f_d(t_4) + f_d(t_2) f_d(t_3) + f_d(t_2) f_d(t_4) + f_d(t_3) f_d(t_4)$$
$$= 1 \times 3 + 1 \times 1 + 1 \times 1 + 3 \times 1 + 3 \times 1 + 1 \times 1 = 12$$

Thus, for instance, for the term pair $(t_1, t_2)$, we have $\hat{\psi}_d(t_1, t_2) = 3/12$, and

$$\hat{p}(X_1 = 1, X_2 = 0) = \hat{\varphi}_d(t_1) - \hat{\psi}_d(t_1, t_2) = \frac{1}{6} - \frac{3}{12} = -\frac{1}{12} < 0$$

from which we can conclude that $\hat{p}(x_1, x_2)$ is not a probability distribution, since $\hat{\varphi}_d(t_1) - \hat{\psi}_d(t_1, t_2) < 0$. Also, we can verify this in an alternative way:

$$\Delta_2 = f_d(t_1) f_d(t_3) + f_d(t_1) f_d(t_4) + f_d(t_3) f_d(t_4) = 1 + 1 + 1 = 3 < 9 = f_d(t_2)^2$$

That is, the first inequality given in Lemma 2 is not satisfied. Example 2 is thus a specific instance of failing to apply the estimation given in (1)-(4).

From the above two examples, we can see: (i) In order to compute term dependence, we must verify that both $\hat{\varphi}_d(t_i) \ge \hat{\psi}_d(t_i, t_j)$ and $\hat{\varphi}_d(t_j) \ge \hat{\psi}_d(t_i, t_j)$, or equivalently both $\Delta_j \ge f_d(t_j)^2$ and $\Delta_i \ge f_d(t_i)^2$, are satisfied simultaneously for each term pair considered. (ii) $\hat{\psi}_d(t_i, t_j)$ rapidly becomes smaller as documents become longer; it should thus not be a problem to satisfy the above two inequalities in practical applications.

In a practical application, we generally concentrate on the statistics of co-occurrence of terms. That is, the dependence with which we are really concerned is that at the state value pair $(x_i, x_j) = (1, 1)$ of the term pair $(t_i, t_j)$. In this case, what we need is to apply only the first item of (3) and, beyond the conditions given in Theorem 1, to verify the inequality

$$\hat{p}(X_i = 1, X_j = 1) = \hat{\psi}_d(t_i, t_j) > \hat{\varphi}_d(t_i)\, \hat{\varphi}_d(t_j) = \hat{p}(X_i = 1)\, \hat{p}(X_j = 1)$$

to ensure that $t_i$ and $t_j$ are highly dependent under their co-occurrence.
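Assuming the helper functions sketched earlier (stats, joint_state, is_distribution, expected_mi, all our own names) are in scope, the two examples can be replayed directly; this fragment is again our illustration, not code from the paper.

```python
# Example 1: Delta_d = 26; the pair (t1, t2) passes verification and
# yields I(X_1, X_2) of roughly 0.0439 (natural logarithms).
d1 = ["t1", "t2", "t2", "t3", "t3", "t4", "t5", "t6"]
f, lam, delta = stats(d1)
j = joint_state("t1", "t2", f, lam, delta)
print(delta, is_distribution(j), round(expected_mi(j), 4))  # 26 True 0.0439

# Example 2: Delta_d = 12; the pair (t1, t2) fails verification because
# p(X_1 = 1, X_2 = 0) = 1/6 - 3/12 = -1/12 < 0.
d2 = ["t1", "t2", "t2", "t2", "t3", "t4"]
f, lam, delta = stats(d2)
j = joint_state("t1", "t2", f, lam, delta)
print(delta, is_distribution(j), j[(1, 0)])  # 12 False -0.0833...
```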

4 EXTENSION

The method proposed in this study may be applicable to any quantitative document representation. That is, the estimation functions given in (1) and (2) can be applied to document representations not only via the frequency matrix, but also in a more general case, where each matrix element is a real number. More particularly, suppose each $d \in D$ can be expressed by a $1 \times m$ (weight) matrix

$$W_d = \big[w_d(t_{k_1}), w_d(t_{k_2}), \ldots, w_d(t_{k_m})\big] = \big[w_d(t_k)\big]_{1 \times m}$$

satisfying $w_d(t) > 0$ when $t \in T_d$ and $w_d(t) = 0$ when $t \notin T_d$. The $w_d(t)$ is called a weighting function, indicating the importance of term $t$ in representing document $d$. For instance, a widely used weighting function would be $w_d(t) = f_d(t) \log\frac{|D|}{n(t)}$, where $n(t)$ is the number of documents in $D$ in which $t$ occurs. Also, the method described in the previous sections is the special case where $w_d(t) = f_d(t)$ for $t \in T_d$.

With document representation by $w_d(t)$, let us continue to denote the length of document $d$ by $\lambda_d = \sum_{t \in T_d} w_d(t)$, and denote

$$\Delta_d = \sum_{\substack{t_{k_i},\, t_{k_j} \in T_d \\ i < j}} w_d(t_{k_i})\, w_d(t_{k_j}) \qquad\quad \Delta_i = \sum_{\substack{t_{k_j},\, t_{k_l} \in T_d - \{t_i\} \\ j < l}} w_d(t_{k_j})\, w_d(t_{k_l})$$

Then, for arbitrary terms $t, t_i, t_j \in T_d$, similar to the expressions given in (3) and (4), we may write the corresponding marginal distribution

$$\hat{p}(X_t = 1) = \frac{w_d(t)}{\lambda_d} = \hat{\varphi}_d(t) \qquad\quad \hat{p}(X_t = 0) = 1 - \hat{\varphi}_d(t) \qquad\qquad (5)$$

and joint distribution

$$\begin{aligned} \hat{p}(X_i = 1, X_j = 1) &= \frac{w_d(t_i)\, w_d(t_j)}{\Delta_d} = \hat{\psi}_d(t_i, t_j) \\ \hat{p}(X_i = 1, X_j = 0) &= \frac{w_d(t_i)}{\lambda_d} - \frac{w_d(t_i)\, w_d(t_j)}{\Delta_d} = \hat{\varphi}_d(t_i) - \hat{\psi}_d(t_i, t_j) \\ \hat{p}(X_i = 0, X_j = 1) &= \frac{w_d(t_j)}{\lambda_d} - \frac{w_d(t_i)\, w_d(t_j)}{\Delta_d} = \hat{\varphi}_d(t_j) - \hat{\psi}_d(t_i, t_j) \\ \hat{p}(X_i = 0, X_j = 0) &= 1 - \hat{\varphi}_d(t_i) - \hat{\varphi}_d(t_j) + \hat{\psi}_d(t_i, t_j) \end{aligned} \qquad (6)$$

Also, the verification conditions are given by the following lemmas and theorem. The proofs of Lemmas 3-4 and Theorem 2 are omitted here, as they are similar to the respective proofs of Lemmas 1-2 and Theorem 1.

Lemma 3. For an arbitrary term $t_i \in T_d$, we have $\Delta_d = w_d(t_i)\,[\lambda_d - w_d(t_i)] + \Delta_i$.

Lemma 4. For the functions $\hat{\varphi}_d(t)$ and $\hat{\psi}_d(t_i, t_j)$ given in (5) and (6), respectively, we have: (a) $\Delta_j \ge w_d(t_j)^2$ if and only if $\hat{\varphi}_d(t_i) \ge \hat{\psi}_d(t_i, t_j)$; (b) $\Delta_i \ge w_d(t_i)^2$ if and only if $\hat{\varphi}_d(t_j) \ge \hat{\psi}_d(t_i, t_j)$.

Theorem 2. The expression $\hat{p}(x_i, x_j)$ given in (6) is a probability distribution if it satisfies two inequalities: a) $\Delta_i \ge w_d(t_i)^2$ and b) $\Delta_j \ge w_d(t_j)^2$.

Obviously, $w_d(t)$ is the main component of the estimation functions $\hat{\varphi}_d(t)$ and $\hat{\psi}_d(t_i, t_j)$.
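A sketch of the weighted variant, under the same token-list assumption as our earlier fragments: the frequency $f_d(t)$ is simply replaced by a weight $w_d(t)$, here the tf-idf style weight mentioned above. The name weighted_joint_state and the arguments n_t (per-term document frequencies) and num_docs are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def weighted_joint_state(ti, tj, doc_tokens, n_t, num_docs):
    """Joint distribution (6) with w_d(t) = f_d(t) * log(|D| / n(t))."""
    f = Counter(doc_tokens)
    # Note: a term occurring in every document gets weight 0 under this
    # scheme, violating w_d(t) > 0; such terms would have to be dropped.
    w = {t: f[t] * math.log(num_docs / n_t[t]) for t in f}
    lam = sum(w.values())                                  # weighted length
    delta = sum(w[a] * w[b] for a, b in combinations(sorted(w), 2))
    phi_i, phi_j = w[ti] / lam, w[tj] / lam                # marginal (5)
    psi = w[ti] * w[tj] / delta                            # joint (6), first item
    return {(1, 1): psi,
            (1, 0): phi_i - psi,
            (0, 1): phi_j - psi,
            (0, 0): 1.0 - phi_i - phi_j + psi}
```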


As we all know, document representations, $w_d(t)$, play an essential role in determining effectiveness. The issue of the accuracy and validity of document representation has long been a crucial and open problem. It is beyond the scope of this paper to discuss the issue in greater detail. A detailed discussion of representation techniques may be found, for instance, in studies [3], [4].

5 CONCLUSION

It seems that MI methods have not achieved their potential for automatically measuring the statistical dependence of terms. The main problem in MI methods is obtaining the actual probability distributions estimated from training data. This study concentrated on that problem and proposed a novel but simple method for such measures. We introduced the estimation functions $\hat{\varphi}_d(t)$ and $\hat{\psi}_d(t_i, t_j)$, which may be used to capture the occurrence and co-occurrence information of terms and to define the distributions $\hat{p}(x_t)$ and $\hat{p}(x_i, x_j)$. We interpreted the mathematical meaning of the functions within practical application contexts. We discussed verification conditions in order to ensure that $\hat{p}(x_t)$ and $\hat{p}(x_i, x_j)$ are probability distributions under those conditions. We provided examples to clarify the idea of our method, to make understandable the computation involved in all the formulae and, in particular, to illustrate the possibility of failure in applying our method when the verification conditions are not satisfied. We considered a possible extension of our method and indicated that it is applicable to any quantitative document representation with a weighting function. The generality of the formal discussion means that our method can be applicable to many areas of science involving statistical semantic analysis of textual data.

APPENDIX

Lemma 1. For an arbitrary term $t_i \in T_d$, we have $\Delta_d = f_d(t_i)\,[\lambda_d - f_d(t_i)] + \Delta_i$.

Proof. Without loss of generality, suppose $T_d = \{t_1, t_2, \ldots, t_m\}$. (Otherwise, relabel the terms: the order of the elements in the set $T_d$ is immaterial, so we can always rewrite $T_d = \{t_1, t_2, \ldots, t_m\}$ with the indices renamed, and our discussion still holds.) Thus we have

$$\begin{aligned} \Delta_d &= f_d(t_1)\,[f_d(t_2) + \cdots + f_d(t_m)] + f_d(t_2)\,[f_d(t_3) + \cdots + f_d(t_m)] + \cdots + f_d(t_{m-1})\, f_d(t_m) \\ &= f_d(t_i)\,[f_d(t_1) + \cdots + f_d(t_{i-1}) + f_d(t_{i+1}) + \cdots + f_d(t_m)] + \sum_{\substack{j < l \\ j,\, l \ne i}} f_d(t_j)\, f_d(t_l) \\ &= f_d(t_i)\,[\lambda_d - f_d(t_i)] + \Delta_i \end{aligned}$$

Lemma 2. For the functions $\hat{\varphi}_d(t)$ and $\hat{\psi}_d(t_i, t_j)$ given in (1) and (2), we have: (a) $\Delta_j \ge f_d(t_j)^2$ if and only if $\hat{\varphi}_d(t_i) \ge \hat{\psi}_d(t_i, t_j)$; (b) $\Delta_i \ge f_d(t_i)^2$ if and only if $\hat{\varphi}_d(t_j) \ge \hat{\psi}_d(t_i, t_j)$.

Proof. We only prove (a); the proof of (b) is similar. Notice that $f_d(t_i) > 0$ and $\lambda_d > 0$. Thus, by Lemma 1, we have

$$\hat{\varphi}_d(t_i) - \hat{\psi}_d(t_i, t_j) = \frac{f_d(t_i)}{\lambda_d} - \frac{f_d(t_i)\, f_d(t_j)}{\Delta_d} \ge 0$$

if and only if $\Delta_d = f_d(t_j)\,[\lambda_d - f_d(t_j)] + \Delta_j \ge \lambda_d\, f_d(t_j)$, if and only if $\Delta_j \ge f_d(t_j)\, f_d(t_j) = f_d(t_j)^2$.

Theorem 1. The expression $\hat{p}(x_i, x_j)$ given in (3) and (4) is a probability distribution if a) $\Delta_i \ge f_d(t_i)^2$ and b) $\Delta_j \ge f_d(t_j)^2$.

Proof. First, $\hat{p}(X_i = 1, X_j = 1) > 0$, as $0 < \hat{\psi}_d(t_i, t_j) < 1$. Second, $\hat{p}(X_i = 1, X_j = 0) \ge 0$ and $\hat{p}(X_i = 0, X_j = 1) \ge 0$, as $\hat{\varphi}_d(t_i), \hat{\varphi}_d(t_j) \ge \hat{\psi}_d(t_i, t_j)$ by Lemma 2. Third, $\hat{p}(X_i = 0, X_j = 0) = [1 - \hat{\varphi}_d(t_i) - \hat{\varphi}_d(t_j)] + \hat{\psi}_d(t_i, t_j) > 0$, as $\hat{\varphi}_d(t_i) + \hat{\varphi}_d(t_j) < 1$ (each document has at least three distinct terms) and $\hat{\psi}_d(t_i, t_j) > 0$. Finally, $\sum_{(x_i, x_j) \in \Omega^2} \hat{p}(x_i, x_j) = 1$ can easily be seen from (3).

REFERENCES

[1] M. Ait Kerroum, A. Hammouch, and D. Aboutajdine, "Textural feature selection by joint mutual information based on Gaussian mixture model for multispectral image classification," Pattern Recognition Letters, vol. 31, no. 10, pp. 1168-1174, 2010.
[2] A. El Akadi, A. El Ouardighi, and D. Aboutajdine, "A powerful feature selection approach based on mutual information," International Journal of Computer Science and Network Security, vol. 8, pp. 116-121, 2008.
[3] D. Cai, "An information theoretic foundation for the measurement of discrimination information," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 9, pp. 1262-1273, 2010.
[4] D. Cai, "Determining semantic relatedness through the measurement of discrimination information using Jensen difference," International Journal of Intelligent Systems, vol. 24, no. 5, pp. 477-503, 2009.
[5] K. W. Church and P. Hanks, "Word association norms, mutual information, and lexicography," Journal of the American Society for Information Science, vol. 16, no. 1, pp. 22-29, 1990.
[6] H. Fang and C. X. Zhai, "Semantic term matching in axiomatic approaches to information retrieval," Proc. 29th Ann. International ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 115-122, 2006.
[7] S. Gauch, J. Wang, and S. M. Rachakonda, "A corpus analysis approach for automatic query expansion and its extension to multiple databases," ACM Trans. Information Systems, vol. 17, no. 3, pp. 250-269, 1999.
[8] M. Kim and K. Choi, "A comparison of collocation-based similarity measures in query expansion," Information Processing & Management, vol. 35, no. 1, pp. 19-30, 1999.
[10] S. Kullback, Information Theory and Statistics, New York: Wiley, 1959.
[11] H.-W. Liu, J.-G. Sun, L. Liu, and H.-J. Zhang, "Feature selection with dynamic mutual information," Pattern Recognition, vol. 42, pp. 1330-1339, 2009.
[12] R. M. Losee, Jr., "Term dependence: A basis for Luhn and Zipf models," Journal of the American Society for Information Science and Technology, vol. 52, no. 12, pp. 1019-1025, 2001.
[13] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, "Multimodality image registration by maximization of mutual information," IEEE Transactions on Medical Imaging, vol. 16, no. 2, pp. 187-198, 1997.
[14] R. Mandala, T. Tokunaga, and H. Tanaka, "Query expansion using heterogeneous thesauri," Information Processing & Management, vol. 36, no. 3, pp. 361-378, 2000.
[15] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[16] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, 1948.
[17] N. Vretos, V. Solachidis, and I. Pitas, "A mutual information based face clustering algorithm for movies," Proc. IEEE International Conf. Multimedia and Expo (ICME'06), pp. 1013-1016, 2006.
[18] G. Wang, F. H. Lochovsky, and Q. Yang, "Feature selection with conditional mutual information maximin in text categorization," Proc. ACM International Conf. Information and Knowledge Management, pp. 342-349, 2004.

Di Cai received her PhD from the Department of Computing Science at the University of Glasgow, UK. She is currently a research fellow in the School of Computing and Engineering at the University of Huddersfield, UK. Her main research interests include information extraction and retrieval, document classification and summarization, text mining and opinion mining, and emotion and sentiment analysis. She is a member of the IEEE and ACM.

Thomas Leo McCluskey is professor of software technology at the University of Huddersfield in the UK, and director of research for the University's School of Computing and Engineering. His research interests include software and knowledge engineering, domain modelling, planning and machine learning. His research group has developed a series of knowledge engineering aids which help in the formulation of structural and heuristic planning knowledge, ranging from interactive interfaces to fully automated learning tools.
