
WORD ASSOCIATION NORMS, MUTUAL INFORMATION, AND LEXICOGRAPHY

Kenneth Ward Church
Bell Laboratories, Murray Hill, NJ

Patrick Hanks
Collins Publishers, Glasgow, Scotland
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor.) We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). This paper will propose an objective measure based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.
1 MEANING AND ASSOCIATION
It is common practice in linguistics to classify words not
only on the basis of their meanings but also on the basis of
their co-occurrence with other words. Running through the
whole Firthian tradition, for example, is the theme that
"You shall know a word by the company it keeps" (Firth,
1957).
On the one hand, bank co-occurs with words and expressions
such as money, notes, loan, account, investment,
clerk, official, manager, robbery, vaults, working in a,
its actions, First National, of England, and so forth. On
the other hand, we find bank co-occurring with river,
swim, boat, east (and of course West and South, which
have acquired special meanings of their own), on top of
the, and of the Rhine. (Hanks 1987, p. 127)
The search for increasingly delicate word classes is not new.
In lexicography, for example, it goes back at least to the
"verb pat t erns" described in Hornby' s Advanced Learner's
Dictionary (first edition 1948). What is new is t hat facili-
ties for the computational storage and analysis of large
bodies of natural language have developed significantly in
recent years, so t hat it is now becoming possible to test and
apply informal assertions of this kind in a more rigorous
way, and to see what company our words do keep.
2 PRACTICAL APPLICATIONS
The proposed statistical description has a large number of
potentially important applications, including: (a) constrain-
ing the language model both for speech recognition and
optical character recognition (OCR), (b) providing disam-
biguation cues for parsing highly ambiguous syntactic struc-
tures such as noun compounds, conjunctions, and preposi-
tional phrases, (c) retrieving texts from large databases
(e.g. newspapers, patents), (d) enhancing the productivity
of computational linguists in compiling lexicons of lexico-
syntactic facts, and (e) enhancing the productivity of lexi-
cographers in identifying normal and conventional usage.
Consider the optical character recognizer (OCR) appli-
cation. Suppose that we have an OCR device as in Kahan et
al. (1987), and it has assigned about equal probability to
having recognized farm and form, where the context is
either:

(1) federal {farm / form} credit
(2) some {farm / form} of
The proposed association measure can make use of the fact
that farm is much more likely in the first context and form
is much more likely in the second to resolve the ambiguity.
Note that alternative disambiguation methods based on
syntactic constraints such as part of speech are unlikely to
help in this case since both form and farm are commonly
used as nouns.
3 WORD ASSOCIATION AND
PSYCHOLINGUISTICS
Word association norms are well known to be an important factor in psycholinguistic research, especially in the area of lexical retrieval. Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor.

Some results and implications are summarized from reaction-time experiments in which subjects either (a) classified successive strings of letters as words and nonwords, or (b) pronounced the strings. Both types of response to words (e.g. BUTTER) were consistently faster when preceded by associated words (e.g. BREAD) rather than unassociated words (e.g. NURSE). (Meyer et al. 1975, p. 98)

Much of this psycholinguistic research is based on empirical estimates of word association norms as in Palermo and Jenkins (1964), perhaps the most influential study of its kind, though extremely small and somewhat dated. This study measured 200 words by asking a few thousand subjects to write down a word after each of the 200 words to be measured. Results are reported in tabular form, indicating which words were written down, and by how many subjects, factored by grade level and sex. The word doctor, for example, is reported on pp. 98-100 to be most often associated with nurse, followed by sick, health, medicine, hospital, man, sickness, lawyer, and about 70 more words.
4 AN INFORMATION THEORETIC MEASURE
We propose an alternative measure, the association ratio, for measuring word association norms, based on the information theoretic concept of mutual information.1 The proposed measure is more objective and less costly than the subjective method employed in Palermo and Jenkins (1964). The association ratio can be scaled up to provide robust estimates of word association norms for a large portion of the language. Using the association ratio measure, the five words most associated with doctors are, in order: dentists, nurses, treating, treat, and hospitals.

What is "mutual information"? According to Fano (1961), if two points (words), x and y, have probabilities P(x) and P(y), then their mutual information, I(x, y), is defined to be

    \[
    I(x, y) \equiv \log_2 \frac{P(x, y)}{P(x)\,P(y)}
    \]
Informally, mutual information compares the probability of observing x and y together (the joint probability) with the probabilities of observing x and y independently (chance). If there is a genuine association between x and y, then the joint probability P(x, y) will be much larger than chance P(x) P(y), and consequently I(x, y) >> 0. If there is no interesting relationship between x and y, then P(x, y) ≈ P(x) P(y), and thus I(x, y) ≈ 0. If x and y are in complementary distribution, then P(x, y) will be much less than P(x) P(y), forcing I(x, y) << 0.
In our application, word probabilities P(x) and P(y) are estimated by counting the number of observations of x and y in a corpus, f(x) and f(y), and normalizing by N, the size of the corpus. (Our examples use a number of different corpora with different sizes: 15 million words for the 1987 AP corpus, 36 million words for the 1988 AP corpus, and 8.6 million tokens for the tagged corpus.) Joint probabilities, P(x, y), are estimated by counting the number of times that x is followed by y in a window of w words, fw(x, y), and normalizing by N.
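What follows is a minimal sketch of this estimation (ours, not the authors' code); the tokenized corpus, the helper name, and the default window size are illustrative assumptions.

    import math
    from collections import Counter

    def association_ratio(tokens, x, y, w=5):
        """Estimate I(x, y) = log2(P(x, y) / (P(x) P(y))) from a list of tokens.

        P(x) and P(y) are f(x)/N and f(y)/N; P(x, y) is fw(x, y)/N, where
        fw(x, y) counts how often x is followed by y within a window of w words.
        """
        N = len(tokens)
        freq = Counter(tokens)                    # f(x) for every word type
        f_xy = 0                                  # fw(x, y): x followed by y within the window
        for i, token in enumerate(tokens):
            if token == x:
                f_xy += tokens[i + 1:i + w].count(y)
        if f_xy <= 5:                             # the paper later ignores pairs with f(x, y) <= 5 as unstable
            return None
        return math.log2((f_xy / N) / ((freq[x] / N) * (freq[y] / N)))

On a large tokenized corpus, association_ratio(tokens, "doctors", "nurses") would produce the kind of score reported in Table 3.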
The window size parameter allows us to look at different scales. Smaller window sizes will identify fixed expressions (idioms such as bread and butter) and other relations that hold over short ranges; larger window sizes will highlight semantic concepts and other relationships that hold over larger scales.
Table 1 may help show the contrast.2 In fixed expressions, such as bread and butter and drink and drive, the words of interest are separated by a fixed number of words and there is very little variance. In the 1988 AP, it was found that the two words are always exactly two words apart whenever they are found near each other (within five words). That is, the mean separation is two, and the variance is zero.

Compounds also have very fixed word order (little variance), but the average separation is closer to one word rather than two. In contrast, relations such as man/woman are less fixed, as indicated by a larger variance in their separation. (The nearly zero value for the mean separation for man/women indicates the words appear about equally
Table 1. Mean and Variance of the Separation Between X and Y

Relation    Word x      Word y      Mean    Variance
Fixed       bread       butter      2.00    0.00
            drink       drive       2.00    0.00
Compound    computer    scientist   1.12    0.10
            United      States      0.98    0.14
Semantic    man         woman       1.46    8.07
            man         women      -0.12   13.08
Lexical     refraining  from        1.11    0.20
            coming      from        0.83    2.89
            keeping     from        2.14    5.53
often in either order.) Lexical relations come in several varieties. There are some like refraining from that are fairly fixed, others such as coming from that may be separated by an argument, and still others like keeping from that are almost certain to be separated by an argument.
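The separation statistics in Table 1 could be computed along the following lines (a sketch under our own assumptions about tokenization; the paper gives no code for this step):

    import statistics

    def separation_stats(tokens, x, y, w=5):
        """Mean and variance of the signed separation between x and y,
        counting co-occurrences within w words in either order
        (negative separations mean y precedes x)."""
        separations = []
        for i, token in enumerate(tokens):
            if token == x:
                lo, hi = max(0, i - w + 1), min(len(tokens), i + w)
                for j in range(lo, hi):
                    if j != i and tokens[j] == y:
                        separations.append(j - i)
        if len(separations) < 2:
            return None
        return statistics.mean(separations), statistics.variance(separations)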
The ideal window size is different in each case. For the remainder of this paper, the window size, w, will be set to five words as a compromise; this setting is large enough to show some of the constraints between verbs and arguments, but not so large that it would wash out constraints that make use of strict adjacency.3
Since the association ratio becomes unstable when the counts are very small, we will not discuss word pairs with f(x, y) ≤ 5. An improvement would make use of t-scores, and throw out pairs that were not significant. Unfortunately, this requires an estimate of the variance of f(x, y), which goes beyond the scope of this paper. For the remainder of this paper, we will adopt the simple but arbitrary threshold, and ignore pairs with small counts.
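The t-score filter is left unspecified here; one common formulation (an assumption on our part, not the authors' method) compares the observed joint count with its expectation under independence and keeps only pairs whose difference is several standard errors from zero:

    import math

    def t_score(f_xy, f_x, f_y, N):
        """Rough t-score for a word pair: observed joint probability minus the
        probability expected under independence, divided by an estimate of the
        standard error, sqrt(f_xy)/N. A sketch only."""
        observed = f_xy / N
        expected = (f_x / N) * (f_y / N)
        return (observed - expected) / (math.sqrt(f_xy) / N)

    # e.g. keep a pair only if t_score(f_xy, f_x, f_y, N) > 1.65 (roughly 95% confidence)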
Technically, the association ratio is different from mutual information in two respects. First, joint probabilities are supposed to be symmetric: P(x, y) = P(y, x), and thus mutual information is also symmetric: I(x, y) = I(y, x). However, the association ratio is not symmetric, since f(x, y) encodes linear precedence. (Recall that f(x, y) denotes the number of times that word x appears before y in the window of w words, not the number of times the two words appear in either order.) Although we could fix this problem by redefining f(x, y) to be symmetric (by averaging the matrix with its transpose), we have decided not to do so, since order information appears to be very interesting. Notice the asymmetry in the pairs in Table 2 (computed from 44 million words of 1988 AP text), illustrating a wide variety of biases ranging from sexism to syntax.
Second, one might expect f(x, y) ≤ f(x) and f(x, y) ≤ f(y), but the way we have been counting, this needn't be the case if x and y happen to appear several times in the window. For example, given the sentence, "Library workers were prohibited from saving books from this heap of ruins," which appeared in an AP story on April 1, 1988, f(prohibited) = 1 and f(prohibited, from) = 2. This problem can be fixed by dividing f(x, y) by w − 1 (which has the consequence of subtracting log2(w − 1) = 2 from our association ratio scores). This adjustment has the
Table 2. Asymmetry in 1988 AP Corpus (N = 44 million)
x y f(x, y) f(y, x)
doctors nurses 99 10
man woman 256 56
doctors lawyers 29 19
bread butter 15 1
save life 129 11
save money 187 11
save from 176 18
supposed to 1188 25
additional benefit of assuring that Σ f(x, y) = Σ f(x) = Σ f(y) = N.
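To spell out the adjustment (our restatement of the step just described, with w = 5 as elsewhere in the paper):

    \[
    I_{\mathrm{adj}}(x, y)
      = \log_2 \frac{f_w(x, y) / \big((w - 1) N\big)}{P(x)\,P(y)}
      = I(x, y) - \log_2(w - 1)
      = I(x, y) - 2 .
    \]

Since each position in the corpus opens a window over the w − 1 following positions, summing fw(x, y)/(w − 1) over all pairs recovers N (up to edge effects at the end of the corpus), which is the identity stated above.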
When I(x, y) is large, the association ratio produces very credible results not unlike those reported in Palermo and Jenkins (1964), as illustrated in Table 3. In contrast, when I(x, y) ≈ 0, the pairs are less interesting. (As a very rough rule of thumb, we have observed that pairs with I(x, y) > 3 tend to be interesting, and pairs with smaller I(x, y) are generally not. One can make this statement precise by calibrating the measure with subjective measures. Alternatively, one could make estimates of the variance and then make statements about confidence levels, e.g. with 95% confidence, P(x, y) > P(x) P(y).)
If I(x, y) << 0, we would predict that x and y are in complementary distribution. However, we are rarely able to observe I(x, y) << 0 because our corpora are too small (and our measurement techniques are too crude). Suppose, for example, that both x and y appear about 10 times per million words of text. Then, P(x) = P(y) = 10^-5 and chance is P(x) P(y) = 10^-10. Thus, to say that I(x, y) is much less than 0, we need to say that P(x, y) is much less than 10^-10, a statement that is hard to make with much confidence given the size of presently available corpora. In fact, we cannot (easily) observe a probability less than 1/N ≈ 10^-7, and therefore it is hard to know if I(x, y) is much less than chance or not, unless chance is very large. (In fact, the pair a . . . doctors in Table 3 appears significantly less often than chance. But to justify this statement, we need to compensate for the window size (which shifts the score downward by 2.0, e.g. from 0.96 down to −1.04), and we need to estimate the standard deviation, using a method such as Good (1953).4)
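To make the numbers explicit (a worked restatement of the argument in this paragraph, using the paper's own figures):

    \[
    P(x) = P(y) = 10^{-5}, \qquad
    P(x)\,P(y) = 10^{-10}, \qquad
    P_{\min}(x, y) \approx \tfrac{1}{N} \approx 10^{-7},
    \]
    \[
    \log_2 \frac{10^{-7}}{10^{-10}} \approx 10 ,
    \]

so the smallest joint probability we can actually observe already yields a score of roughly +10 above chance; establishing I(x, y) << 0 would require telling a true joint probability well below 10^-10 apart from a simple failure to co-occur, which a corpus of about 10^7 words cannot do.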
5 LEXICO-SYNTACTIC REGULARITIES
Although the psycholinguistic literature documents the significance of noun/noun word associations such as doctor/nurse in considerable detail, relatively little is said about
Table 3. Some Interesting Associations with "Doctor" in the
1987 AP Corpus (N = 15 million)
I(x, y) f(x, y) f(x) x f(y) y
11.3 12 111 honorary 621 doctor
11.3 8 1105 doctors 44 dentists
10.7 30 1105 doctors 241 nurses
9.4 8 1105 doctors 154 treating
9.0 6 275 examined 621 doctor
8.9 11 1105 doctors 317 treat
8.7 25 621 doctor 1407 bills
8.7 6 621 doctor 350 visits
8.6 19 1105 doctors 676 hospitals
8.4 6 241 nurses 1105 doctors
Some Uninteresting Associations with "Doctor"
0.96 6 621 doctor 73785 with
0.95 41 284690 a 1105 doctors
0.93 12 84716 is 1105 doctors
associations among verbs, function words, adjectives, and other non-nouns. In addition to identifying semantic relations of the doctor/nurse variety, we believe the association ratio can also be used to search for interesting lexico-syntactic relationships between verbs and typical arguments/adjuncts. The proposed association ratio can be viewed as a formalization of Sinclair's argument:
How common are the phrasal verbs with set? Set is particularly rich in making combinations with words like about, in, up, out, on, off, and these words are themselves very common. How likely is set off to occur? Both are frequent words [set occurs approximately 250 times in a million words and off occurs approximately 556 times in a million words] . . . [T]he question we are asking can be roughly rephrased as follows: how likely is off to occur immediately after set? . . . This is 0.00025 x 0.00055 [P(x) P(y)], which gives us the tiny figure of 0.0000001375 . . . The assumption behind this calculation is that the words are distributed at random in a text [at chance, in our terminology]. It is obvious to a linguist that this is not so, and a rough measure of how much set and off attract each other is to compare the probability with what actually happens . . . Set off occurs nearly 70 times in the 7.3 million word corpus [P(x, y) = 70/(7.3 x 10^6) >> P(x) P(y)]. That is enough to show its main patterning and it suggests that in currently-held corpora there will be found sufficient evidence for the description of a substantial collection of phrases . . . (Sinclair 1987b, pp. 151-152).
Using Sinclair's estimates P(set) ≈ 250 x 10^-6, P(off) ≈ 556 x 10^-6, and P(set, off) ≈ 70/(7.3 x 10^6), we would estimate the mutual information to be I(set; off) = log2 P(set, off)/(P(set) P(off)) ≈ 6.1. In the 1988 AP corpus (N = 44,344,077), we estimate P(set) ≈ 13,046/N, P(off) ≈ 20,693/N, and P(set, off) ≈ 463/N. Given these estimates, we would compute the mutual information to be I(set; off) ≈ 6.2.
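A quick check of these two figures (a sketch; the probabilities and counts are taken directly from the text):

    import math

    def mi(p_xy, p_x, p_y):
        # mutual information / association ratio: log2 of observed over chance
        return math.log2(p_xy / (p_x * p_y))

    # Sinclair's corpus (7.3 million words)
    print(mi(70 / 7.3e6, 250e-6, 556e-6))          # ~6.1

    # 1988 AP corpus
    N = 44_344_077
    print(mi(463 / N, 13_046 / N, 20_693 / N))     # ~6.2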
In this example, at least, the values seem to be fairly comparable across corpora. In other examples, we will see some differences due to sampling. Sinclair's corpus is a fairly balanced sample of (mainly British) text; the AP corpus is an unbalanced sample of American journalese.

This association between set and off is relatively strong; the joint probability is more than 2^6 = 64 times larger than chance. The other particles that Sinclair mentions have association ratios that can be seen in Table 4.
The first three, set up, set off, and set out , are clearly
Table 4. Some Phrasal Verbs in 1988 AP Corpus
(N = 44 million)
x y f(x) f(y) f(x, y) I(x; y)
set up 13,046 64,601 2713 7.3
set off 13,046 20,693 463 6.2
set out 13,046 47,956 301 4.4
set on 13,046 258,170 162 1.1
set in 13,046 739,932 795 1.8
set about 13,046 82,319 16 -0.6
associated; the last three are not so clear. As Sinclair suggests, the approach is well suited for identifying the phrasal verbs, at least in certain cases.
6 PREPROCESSING WITH A PART
OF SPEECH TAGGER
Phrasal verbs involving the preposition to raise an interesting problem because of the possible confusion with the infinitive marker to. We have found that if we first tag every word in the corpus with a part of speech using a method such as Church (1988), and then measure associations between tagged words, we can identify interesting contrasts between verbs associated with a following preposition to/in and verbs associated with a following infinitive marker to/to. (Part of speech notation is borrowed from Francis and Kučera (1982); in = preposition; to = infinitive marker; vb = bare verb; vbg = verb + ing; vbd = verb + ed; vbz = verb + s; vbn = verb + en.) The association ratio identifies quite a number of verbs associated in an interesting way with to; restricting our attention to pairs with a score of 3.0 or more, there are 768 verbs associated with the preposition to/in and 551 verbs with the infinitive marker to/to. The ten verbs found to be most associated before to/in are:
to/in: alluding/vbg, adhere/vb, amounted/vbn, relating/vbg, amounting/vbg, revert/vb, reverted/vbn, resorting/vbg, relegated/vbn

to/to: obligated/vbn, trying/vbg, compelled/vbn, enables/vbz, supposed/vbn, intends/vbz, vowing/vbg, tried/vbd, enabling/vbg, tends/vbz, tend/vb, intend/vb, tries/vbz
Thus, we see there is considerable leverage to be gained by preprocessing the corpus and manipulating the inventory of tokens.
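One way to realize this preprocessing step is sketched below; the tagger call and the word/tag token format are our assumptions, not a specification of the authors' tools.

    def tag_tokens(tagged_corpus):
        """Turn an iterable of (word, tag) pairs from a part of speech tagger
        into "word/tag" tokens, so that the same association ratio code can
        distinguish to/in (preposition) from to/to (infinitive marker)."""
        return [f"{word.lower()}/{tag}" for word, tag in tagged_corpus]

    # tokens = tag_tokens(run_pos_tagger(text))        # run_pos_tagger is hypothetical
    # association_ratio(tokens, "adhere/vb", "to/in")  # reuses the earlier sketch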
7 PREPROCESSING WITH A PARSER
Hindle (Church et al. 1989) has found it helpful to preprocess the input with the Fidditch parser (Hindle 1983a, 1983b) to identify associations between verbs and arguments, and postulate semantic classes for nouns on this basis. Hindle's method is able to find some very interesting associations, as Tables 5 and 6 demonstrate.

After running his parser over the 1988 AP corpus (44 million words), Hindle found N = 4,112,943 subject/verb/object (SVO) triples. The mutual information between a verb and its object was computed from these 4 million triples by counting how often the verb and its object were found in the same triple and dividing by chance. Thus, for example, disconnect/V and telephone/O have a joint probability of 7/N. In this case, chance is 84/N x 481/N because there are 84 SVO triples with the verb disconnect, and 481 SVO triples with the object telephone. The mutual information is log2 7N/(84 x 481) = 9.48. Similarly, the mutual information for drink/V beer/O is 9.9 = log2 29N/(660 x 195). (drink/V and beer/O are found in 660 and
Table 5. What Can You Drink?
Verb Object Mutual Info Joint Freq
drink/V martinis/O 12.6 3
drink/V cup_water/O 11.6 3
drink/V champagne/O 10.9 3
drink/V beverage/O 10.8 8
drink/V cup_coffee/O 10.6 2
drink/V cognac/O 10.6 2
drink/V beer/O 9.9 29
drink/V cup/O 9.7 6
drink/V coffee/O 9.7 12
drink/V toast/O 9.6 4
drink/V alcohol/O 9.4 20
drink/V wine/O 9.3 10
drink/V fluid/O 9.0 5
drink/V liquor/O 8.9 4
drink/V tea/O 8.9 5
drink/V milk/O 8.7 8
drink/V juice/O 8.3 4
drink/V water/O 7.2 43
drink/V quantity/O 7.1 4
195 SVO triples, respectively; they are found together in 29
of these triples).
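These two figures can be checked directly from the triple counts (a sketch; the counts are those given in the text):

    import math

    def svo_mi(joint, f_verb, f_obj, N):
        """Mutual information between a verb and an object over SVO triples:
        log2((joint/N) / ((f_verb/N) * (f_obj/N))) = log2(joint * N / (f_verb * f_obj))."""
        return math.log2(joint * N / (f_verb * f_obj))

    N = 4_112_943                                  # SVO triples in the 1988 AP corpus
    print(svo_mi(7, 84, 481, N))                   # disconnect/V, telephone/O -> ~9.48
    print(svo_mi(29, 660, 195, N))                 # drink/V, beer/O -> ~9.9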
This application of Hindle's parser illustrates a second example of preprocessing the input to highlight certain constraints of interest. For measuring syntactic constraints, it may be useful to include some part of speech information and to exclude much of the internal structure of noun phrases. For other purposes, it may be helpful to tag items and/or phrases with semantic labels such as *person*, *place*, *time*, *body part*, *bad*, and so on.
8 APPLICATIONS IN LEXICOGRAPHY
Large machine-readable corpora are only just now becom-
ing available to lexicographers. Up to now, lexicographers
have been reliant either on citations collected by human
Table 6. What Can You Do to a Telephone?
Verb Object Mutual Info Joint Freq
sit_by/V telephone/O 11.78 7
disconnect/V telephone/O 9.48 7
answer/V telephone/O 8.80 98
hang_up/V telephone/O 7.87 3
tap/V telephone/O 7.69 15
pick_up/V telephone/O 5.63 11
return/V telephone/O 5.01 19
be_by/V telephone/O 4.93 2
spot/V telephone/O 4.43 2
repeat/V telephone/O 4.39 3
place/V telephone/O 4.23 7
receive/V telephone/O 4.22 28
install/V telephone/O 4.20 2
be_on/V telephone/O 4.05 15
come_to/V telephone/O 3.63 6
use/V telephone/O 3.59 29
operate/V telephone/O 3.16 4
readers, which introduced an element of selectivity and so
inevitably distortion (rare words and uses were collected
but common uses of common words were not), or on small
corpora of only a million words or so, which are reliably
informative for only the most common uses of the few most
frequent words of English. (A million-word corpus such as
the Brown Corpus is reliable, roughly, for only some uses of
only some of the forms of around 4000 dictionary entries.
But standard dictionaries typically contain twenty times
this number of entries.)
The computational tools available for studying machine-
readable corpora are at present still rather primitive. These
are concordancing programs (see Figure 1), which are basically KWIC (key word in context; Aho et al. 1988) indexes with additional features such as the ability to extend the context, sort leftward as well as rightward, and so on. There is very little interactive software. In a typical situation in the lexicography of the 1980s, a lexicographer is given the concordances for a word, marks up the printout with colored pens to identify the salient senses, and then writes syntactic descriptions and definitions.
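A minimal sketch of such a KWIC index (our illustration, not the software the authors used; the context width and formatting are arbitrary):

    def kwic(tokens, keyword, width=6):
        """Print each occurrence of keyword with `width` words of context on
        either side, one concordance line per occurrence."""
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                print(f"{left:>50}  {keyword}  {right}")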
Although this technology is a great improvement on using human readers to collect boxes of citation index cards (the method Murray used in constructing The Oxford English Dictionary a century ago), it works well if there are no more than a few dozen concordance lines for a word, and only two or three main sense divisions. In analyzing a complex word such as take, save, or from, the lexicographer is trying to pick out significant patterns and subtle distinctions that are buried in literally thousands of concordance lines: pages and pages of computer printout. The unaided human mind simply cannot discover all the
[Figure 1: a KWIC-style concordance sample for "save". Each line shows left context, the keyword save, and right context, e.g. "the family hopes to save enough for a down payment on a house" and "GM executives say the shutdown will save the automaker $500 million a year in operating costs."]

Figure 1 Short Sample of the Concordance to
"save" from the AP 1987 Corpus.
significant patterns, let alone group them and rank them in order
of importance.
The AP 1987 concordance to save is many pages long; there are 666 lines for the base form alone, and many more for the inflected forms saved, saves, saving, and savings. In the discussion that follows, we shall, for the sake of simplicity, not analyze the inflected forms and we shall only look at the patterns to the right of save (see Table 7).
It is hard to know what is important in such a concor-
dance and what is not. For example, although it is easy to
see from the concordance selection in Figure 1 that the word "to" often comes before "save" and the word "the" often comes after "save," it is hard to say from examination
of a concordance alone whether either or both of these
co-occurrences have any significance.
Two examples will illustrate how the association ratio
measure helps make the analysis both quicker and more
accurate.
8.1 EXAMPLE 1: "SAVE . . . FROM"
The association ratios in Table 7 show that association norms apply to function words as well as content words. For example, one of the words significantly associated with save is from. Many dictionaries, for example Webster's Ninth New Collegiate Dictionary (Merriam Webster), make no explicit mention of from in the entry for save, although
Table 7. Words Often Co-Occurring to the Right of "Save"

I(x, y) f(x, y) f(x) x f(y) y
9.5 6 724 save 170 forests
9.4 6 724 save 180 $1.2
8.8 37 724 save 1697 lives
8.7 6 724 save 301 enormous
8.3 7 724 save 447 annually
7.7 20 724 save 2001 jobs
7.6 64 724 save 6776 money
7.2 36 724 save 4875 life
6.6 8 724 save 1668 dollars
6.4 7 724 save 1719 costs
6.4 6 724 save 1481 thousands
6.2 9 724 save 2590 face
5.7 6 724 save 2311 son
5.7 6 724 save 2387 estimated
5.5 7 724 save 3141 your
5.5 24 724 save 10880 billion
5.3 39 724 save 20846 million
5.2 8 724 save 4398 us
5.1 6 724 save 3513 less
5.0 7 724 save 4590 own
4.6 7 724 save 5798 world
4.6 7 724 save 6028 my
4.6 15 724 save 13010 them
4.5 8 724 save 7434 country
4.4 15 724 save 14296 time
4.4 64 724 save 61262 from
4.3 23 724 save 23258 more
4.2 25 724 save 27367 their
4.1 8 724 save 9249 company
4.1 6 724 save 7114 month
British learners' dictionaries do make specific mention of
from in connection with save. These learners' dictionaries pay more attention to language structure and collocation than do American collegiate dictionaries, and lexicogra-
phers trained in the British tradition are often fairly skilled
at spotting these generalizations. However, teasing out
such facts and distinguishing true intuitions from false
intuitions takes a lot of time and hard work, and there is a
high probability of inconsistencies and omissions.
Which other verbs typically associate with from, and where does save rank in such a list? The association ratio identified 1530 words that are associated with from; 911 of them were tagged as verbs. The first 100 verbs are:
refrain/vb, gleaned/vbn, stems/vbz, stemmed/vbd, stemming/vbg, ranging/vbg, stemmed/vbn, ranged/vbn, derived/vbn, ranged/vbd, extort/vb, graduated/vbd, barred/vbn, benefiting/vbg, benefitted/vbn, benefited/vbn, excused/vbd, arising/vbg, range/vb, exempts/vbz, suffers/vbz, exempting/vbg, benefited/vbd, prevented/vbd (7.0), seeping/vbg, barred/vbd, prevents/vbz, suffering/vbg, excluded/vbn, marks/vbz, profiting/vbg, recovering/vbg, discharged/vbn, rebounding/vbg, vary/vb, exempted/vbn, separate/vb, banished/vbn, withdrawing/vbg, ferry/vb, prevented/vbn, profit/vb, bar/vb, excused/vbn, bars/vbz, benefit/vb, emerges/vbz, emerge/vb, varies/vbz, differ/vb, removed/vbn, exempt/vb, expelled/vbn, withdraw/vb, stem/vb, separated/vbn, judging/vbg, adapted/vbn, escaping/vbg, inherited/vbn, differed/vbd, emerged/vbd, withheld/vbd, leaked/vbn, strip/vb, resulting/vbg, discourage/vb, prevent/vb, withdrew/vbd, prohibits/vbz, borrowing/vbg, preventing/vbg, prohibit/vb, resulted/vbd (6.0), preclude/vb, divert/vb, distinguish/vb, pulled/vbn, fell/vbn, varied/vbn, emerging/vbg, suffer/vb, prohibiting/vbg, extract/vb, subtract/vb, recover/vb, paralyzed/vbn, stole/vbd, departing/vbg, escaped/vbn, prohibited/vbn, forbid/vb, evacuated/vbn, reap/vb, barring/vbg, removing/vbg, stolen/vbn, receives/vbz.
Save . . . from is a good example for illustrating the advantages of the association ratio. Save is ranked 319th in this list, indicating that the association is modest, strong enough to be important (21 times more likely than chance), but not so strong that it would pop out at us in a concordance, or that it would be one of the first things to come to mind.
If the dictionary is going to list save . . . from, then, for consistency's sake, it ought to consider listing all of the more important associations as well. Of the 27 bare verbs (tagged 'vb') in the list above, all but seven are listed in Collins Cobuild English Language Dictionary as occurring with from. However, this dictionary does not note that vary, ferry, strip, divert, forbid, and reap occur with from. If the Cobuild lexicographers had had access to the proposed measure, they could possibly have obtained better coverage at less cost.
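A sketch of how such a list might be produced from scored pairs (the data layout and the tag test are our assumptions; the 3.0 threshold follows the one used above):

    def verbs_associated_with(scored_pairs, target="from/in", min_score=3.0):
        """scored_pairs: iterable of (x_tagged, y_tagged, score) tuples,
        e.g. ("refrain/vb", "from/in", 7.1). Returns the tagged verbs
        associated with `target`, best first."""
        verb_tags = ("/vb", "/vbd", "/vbg", "/vbn", "/vbz")
        hits = [(x, s) for x, y, s in scored_pairs
                if y == target and s >= min_score and x.endswith(verb_tags)]
        return [x for x, s in sorted(hits, key=lambda pair: -pair[1])]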
8.2 EXAMPLE 2: IDENTIFYING SEMANTIC CLASSES
Having established the relative importance of save . . . from, and having noted that the two words are rarely
adjacent, we would now like to speed up the labor-intensive task of categorizing the concordance lines. Ideally, we would like to develop a set of semi-automatic tools that would help a lexicographer produce something like Figure 2, which provides an annotated summary of the 65 concordance lines for save . . . from.5 The save . . . from pattern occurs in about 10% of the 666 concordance lines for save.

Traditionally, semantic categories have been only vaguely recognized, and to date little effort has been devoted to a systematic classification of a large corpus. Lexicographers have tended to use concordances impressionistically; semantic theorists, AI-ers, and others have concentrated on a few interesting examples, e.g. bachelor, and have not given much thought to how the results might be scaled up.

With this concern in mind, it seems reasonable to ask how well these 65 lines for save . . . from fit in with all other uses of save. A laborious concordance analysis was undertaken to answer this question. When it was nearing completion, we noticed that the tags that we were inventing to capture the generalizations could in most cases have been suggested by looking at the lexical items listed in the association ratio table for save. For example, we had failed to notice the significance of time adverbials in our analysis of save, and no dictionary records this. Yet it should be
save X from Y (65 concordance lines)
1. save PERSON from Y (23 concordance lines)
1.1 save PERSON from BAD (19 concordance lines)
    (Robert DeNiro) to save Indian tribes[PERSON] from genocide[DESTRUCT[BAD]] at the hands of
    "We wanted to save him[PERSON] from undue trouble[BAD] and loss[BAD] of money,"
    Murphy was sacrificed to save more powerful Democrats[PERSON] from harm[BAD].
    "God sent this man to save my five children[PERSON] from being burned to death[DESTRUCT[BAD]] and
    Pope John Paul II to "save us[PERSON] from sin[BAD]."
1.2 save PERSON from (BAD) LOC(ATION) (4 concordance lines)
    rescuers who helped save the toddler[PERSON] from an abandoned well[LOC] will be feted with a parade
    while attempting to save two drowning boys[PERSON] from a turbulent[BAD] creek[LOC] in Ohio[LOC]
2. save INST(ITUTION) from (ECON) BAD (27 concordance lines)
    member states to help save the EEC[INST] from possible bankruptcy[ECON][BAD] this year.
    should be sought "to save the company[CORP[INST]] from bankruptcy[ECON][BAD].
    law was necessary to save the country[NATION[INST]] from disaster[BAD].
    operation "to save the nation[NATION[INST]] from communism[BAD][POLITICAL].
    were not needed to save the system from bankruptcy[ECON][BAD].
    his efforts to save the world[INST] from the likes of Lothar and the Spider Woman
3. save ANIMAL from DESTRUCT(ION) (5 concordance lines)
    give them the money to save the dogs[ANIMAL] from being destroyed[DESTRUCT],
    program intended to save the giant birds[ANIMAL] from extinction[DESTRUCT],
UNCLASSIFIED (10 concordance lines)
    walnut and ash trees to save them from the axes and saws of a logging company.
    after the attack to save the ship from a terrible[BAD] fire, Navy reports concluded Thursday.
    certificates that would save shoppers[PERSON] anywhere from $50[MONEY][NUMBER] to $500[MONEY]

Figure 2 Some AP 1987 Concordance Lines to
"save . . . from," Roughly Sorted into Categories.
clear from the association ratio table above that annually and month6 are commonly found with save. More detailed inspection shows that the time adverbials correlate interestingly with just one group of save objects, namely those tagged [MONEY]. The AP wire is full of discussions of saving $1.2 billion per month; computational lexicography should measure and record such patterns if they are general, even when traditional dictionaries do not.
As another example illustrating how the association ratio tables would have helped us analyze the save concordance lines, we found ourselves contemplating the semantic tag ENV(IRONMENT) to analyze lines such as:

the trend to save the forests[ENV]
it's our turn to save the lake[ENV],
joined a fight to save their forests[ENV],
can we get busy to save the planet[ENV]?
If we had looked at the association ratio tables before labeling the 65 lines for save . . . from, we might have noticed the very large value for save . . . forests, suggesting that there may be an important pattern here. In fact, this pattern probably subsumes most of the occurrences of the "save [ANIMAL]" pattern noticed in Figure 2. Thus, these tables do not provide semantic tags, but they provide a powerful set of suggestions to the lexicographer for what needs to be accounted for in choosing a set of semantic tags.
It may be that everything said here about save and other words is true only of 1987 American journalese. Intuitively, however, many of the patterns discovered seem to be good candidates for conventions of general English. A future step would be to examine other more balanced corpora and test how well the patterns hold up.
9 CONCLUSIONS
We began this paper with the psycholinguistic notion of word association norm, and extended that concept toward the information theoretic definition of mutual information. This provided a precise statistical calculation that could be applied to a very large corpus of text to produce a table of associations for tens of thousands of words. We were then able to show that the table encoded a number of very interesting patterns ranging from doctor . . . nurse to save . . . from. We finally concluded by showing how the patterns in the association ratio table might help a lexicographer organize a concordance.
In point of fact, we actually developed these results in basically the reverse order. Concordance analysis is still extremely labor-intensive and prone to errors of omission. The ways that concordances are sorted don't adequately support current lexicographic practice. Despite the fact that a concordance is indexed by a single word, often lexicographers actually use a second word such as from or an equally common semantic concept such as a time adverbial to decide how to categorize concordance lines. In other words, they use two words to triangulate in on a word sense. This triangulation approach clusters concordance lines together into word senses based primarily on usage (distribu-
tional evidence), as opposed to intuitive notions of meaning. Thus, the question of what is a word sense can be addressed with syntactic methods (symbol pushing), and need not address semantics (interpretation), even though the inventory of tags may appear to have semantic values.

The triangulation approach requires "art." How does the lexicographer decide which potential cut points are "interesting" and which are merely due to chance? The proposed association ratio score provides a practical and objective measure that is often a fairly good approximation to the "art." Since the proposed measure is objective, it can be applied in a systematic way over a large body of material, steadily improving consistency and productivity.

But on the other hand, the objective score can be misleading. The score takes only distributional evidence into account. For example, the measure favors set . . . for over set . . . down; it doesn't know that the former is less interesting because its semantics are compositional. In addition, the measure is extremely superficial; it cannot cluster words into appropriate syntactic classes without an explicit preprocess such as Church's parts program or Hindle's parser. Neither of these preprocesses, though, can help highlight the "natural" similarity between nouns such as picture and photograph. Although one might imagine a preprocess that would help in this particular case, there will probably always be a class of generalizations that are obvious to an intelligent lexicographer, but lie hopelessly beyond the objectivity of a computer.

Despite these problems, the association ratio could be an important tool to aid the lexicographer, rather like an index to the concordances. It can help us decide what to look for; it provides a quick summary of what company our words do keep.
REFERENCES
Church, K. 1988 "A Stochastic Parts Program and Noun Phrase Parser
for Unrestricted Text," Second Conference on Applied Natural Lan-
guage Processing, Austin, TX.
Church, K.; Gale, W.; Hanks, P.; and Hindle, D. 1989 "Parsing, Word
Associations and Typical Predicate-Argument Relations," Interna-
tional Workshop on Parsing Technologies, CMU.
Fano, R. 1961 Transmission of Information: A Statistical Theory of
Communications. MIT Press, Cambridge, MA.
Firth, J. 1957 "A Synopsis of Linguistic Theory 1930-1955," in Studies
in Linguistic Analysis, Philological Society, Oxford; reprinted in Palmer,
F. (ed.) 1968 Selected Papers of J. R. Firth, Longman, Harlow.
Francis, W. and Kučera, H. 1982 Frequency Analysis of English Usage.
Houghton Mifflin Company, Boston, MA.
Good, I. J. 1953 The Population Frequencies of Species and the Estima-
tion of Population Parameters. Biometrika, Vol. 40, 237-264.
Hanks, P. 1987 "Definitions and Explanations," in J. Sinclair (ed.),
Looking Up: An Account of the COBUILD Project in Lexical Comput-
ing. Collins, London and Glasgow.
Hindle, D. 1983a "Deterministic Parsing of Syntactic Non-fluencies." In
Proceedings of the 23rd Annual Meeting of the Association for Compu-
tational Linguistics.
Hindle, D. 1983b "User Manual for Fidditch, a Deterministic Parser."
Naval Research Laboratory Technical Memorandum #7590-142.
Hornby, A. 1948 The Advanced Learner's Dictionary, Oxford University
Press, Oxford, U.K.
Jelinek, F. 1982. (personal communication)
Kahan, S.; Pavlidis, T.; and Baird, H. 1987 "On the Recognition of
Printed Characters of any Font or Size," IEEE Transactions PAMI,
274-287.
Meyer, D.; Schvaneveldt, R.; and Ruddy, M. 1975 "Loci of Contextual
Effects on Visual Word-Recognition," in P. Rabbitt and S. Dornic
(eds.), Attention and Performance V, Academic Press, New York.
Palermo, D. and Jenkins, J. 1964 "Word Association Norms." University
of Minnesota Press, Minneapolis, MN.
Sinclair, J.; Hanks, P.; Fox, G.; Moon, R.; and Stock, P. (eds.) 1987a
Collins Cobuild English Language Dictionary. Collins, London and
Glasgow.
Sinclair, J. 1987b "The Nature of the Evidence," in J. Sinclair (ed.),
Looking Up: An Account of the COBUILD Project in Lexical Comput-
ing. Collins, London and Glasgow.
Smadja, F. In press. "Microcoding the Lexicon with Co-Occurrence
Knowledge," in Zernik (ed.), Lexical Acquisition: Using On-Line Re-
sources to Build a Lexicon, MIT Press, Cambridge, MA.
NOTES
1. This statistic has also been used by the IBM speech group (Jelinek
1982) for constructing language models for applications in speech
recognition.
2. Smadja (in press) discusses the separation between collocates in a
very similar way.
3. This definition of fw(x, y) uses a rectangular window. It might be
interesting to consider alternatives (e.g. a triangular window or a
decaying exponential) that would weight words less and less as they
are separated by more and more words. Other windows are also
possible. For example, Hindle (Church et al. 1989) has used a
syntactic parser to select words in certain constructions of interest.
4. Although the Good-Turing Method (Good 1953) is more than 35
years old, it is still heavily cited. For example, Katz (1987) uses the
method in order to estimate trigram probabilities in the IBM speech
recognizer. The Good-Turing Method is helpful for trigrams that
have not been seen very often in the training corpus.
5. The last unclassified line . . . save shoppers anywhere from $50 . . .
raises interesting problems. Syntactic "chunking" shows that, in spite
of its co-occurrence of from with save, this line does not belong here.
An intriguing exercise, given the lookup table we are trying to
construct, is how to guard against false inferences such as that since
shoppers is tagged [PERSON], $50 to $500 must here count as either
BAD or a LOCATION. Accidental coincidences of this kind do not
have a significant effect on the measure, however, although they do
serve as a reminder of the probabilistic nature of the findings.
6. The word time itself also occurs significantly in the table, but on closer
examination it is clear that this use of time (e.g. to save time) counts
as something like a commodity or resource, not as part of a time
adjunct. Such are the pitfalls of lexicography (obvious when they are
pointed out).
