Corpus Linguistics (L615)

Application #2: Collocations

Markus Dickinson
Department of Linguistics, Indiana University
Spring 2013

Corpora for lexicography

- Can extract authentic & typical examples, with frequency information
- With sociolinguistic meta-data, can get an accurate description of usage and, with monitor corpora, its change over time
- Can complement intuitions about meanings
  - The study of loanwords, for example, can be bolstered by corpus studies

Lexical studies: Collocations & colligations

Collocations are characteristic co-occurrence patterns of two (or more) lexical items:
- They tend to occur with greater than random chance
- The meaning tends to be more than the sum of the parts

These are extremely hard to define by intuition:
- Pro: Corpora have been able to reveal connections previously unseen
- Con: It may not be clear what the theoretical basis of collocations is
- Pro & Con: how do they fit into grammar?

A colligation is a slightly different concept:
- the collocation of a node word with a particular class of words (e.g., determiners)

Colligations often create noise in a list of collocations:
- e.g., this house, because this is so common on its own, and determiners appear before nouns
- Thus, people sometimes use stop words to filter out non-collocations

Defining a collocation

People disagree on collocations:
- Intuition does not seem to be a completely reliable way to figure out what a collocation is
- Many collocations are overlooked: people notice unusual words & structures, but not ordinary ones
- What your collocations are depends on exactly how you calculate them

What a collocation is

Collocations are expressions of two or more words that are in some sense conventionalized as a group:
- strong tea (cf. ??powerful tea)
- international best practice
- kick the bucket

Importance of the context: "You shall know a word by the company it keeps" (Firth 1957)
- There is some notion that they are more than the sum of their parts
- There are lexical properties that more general syntactic properties do not capture

So, how do we practically define a collocation? ...

(This slide and the next 3 adapted from Manning and Schütze (1999), Foundations of Statistical Natural Language Processing)
Prototypical collocations

Prototypically, collocations meet the following criteria:
- Non-compositional: the meaning of kick the bucket is not composed of the meanings of its parts
- Non-substitutable: orange hair is just as accurate as red hair, but some don't say it
- Non-modifiable: often we cannot modify a collocation, even though we normally could modify one of those words: ??kick the red bucket

Compositionality tests

The previous properties are good tests, but hard to verify with corpus data. (At least) two tests we can use with corpora:
- Is the collocation translated word-by-word into another language?
  - e.g., the collocation make a decision is not translated literally into French
- Do the two words co-occur more frequently together than we would otherwise expect?
  - e.g., of the is frequent, but both words are frequent, so we might expect this

Kinds of collocations

Collocations come in different guises:
- Light verbs: the verb conveys very little meaning but must be the right one:
  - make a decision vs. *take a decision, take a walk vs. *make a walk
- Phrasal verbs: a main verb and particle combination, often translated as a single word:
  - to tell off, to call up
- Proper nouns: slightly different from the others, but each refers to a single idea (e.g., Brooks Brothers)
- Terminological expressions: technical terms that form a unit (e.g., hydraulic oil filter)

Semantic prosody & preference

Semantic prosody = "a form of meaning which is established through the proximity of a consistent series of collocates" (Louw 2000)
- Idea: you can tell the semantic prosody of a word by the types of words it frequently co-occurs with
- These are typically negative: e.g., peddle, ripe for, get oneself verbed
- This type of co-occurrence often leads to general semantic preferences
  - e.g., utterly, totally, etc. typically have a feature of absence or change of state

Collocation: from silly ass to lexical sets (Krishnamurthy 2000)

Firth 1957: "You shall know a word by the company it keeps"
- Collocational meaning is a syntagmatic type of meaning, not a conceptual one
  - e.g., in this framework, one of the meanings of night is the fact that it co-occurs with dark

Example: ass is associated with a particular set of adjectives (think of goose if you prefer):
- silly, obstinate, stupid, awful
- We can see a lexical set associated with this word

Notes on a collocation's definition (Krishnamurthy 2000)

We often look for words which are adjacent to make up a collocation, but this is not always true
- e.g., computers run, where the two words may only be in the same proximity

We can also speak of upward/downward collocations:
- downward: involves a more frequent node word A with a less frequent collocate B
- upward: a weaker relationship, tending to be more of a grammatical property

Lexical sets & collocations vary across genres, subcorpora, etc.
Corpus linguistics (Krishnamurthy 2000)

Where collocations fit into corpus linguistics:
1. Pattern recognition: recognize lexical and grammatical units
2. Frequency list generation: rank words
3. Concordancing: observe word behavior
4. Collocations: take concordancing a step further ...

(Slides 14-30 are based on Manning & Schütze (M&S) 1999)

Calculating collocations

Simplest approach: use frequency counts
- Two words appearing together a lot are a collocation

The problem is that we get lots of uninteresting pairs of function words (M&S 1999, table 5.1):

  C(w1, w2)   w1   w2
  80871       of   the
  58841       in   the
  26430       to   the
  21842       on   the
13 / 28 14 / 28

POS filtering

To remove frequent pairings which are uninteresting, we can use a POS filter (Justeson and Katz 1995)
- only examine word sequences which fit a particular part-of-speech pattern:
  A N, N N, A A N, A N N, N A N, N N N, N P N

  A N     linear function
  N A N   mean squared error
  N P N   degrees of freedom

- Crucially, all other sequences are removed:

  P D     of the
  M V V   has been

POS filtering (2)

Some results after tag filtering (M&S 1999, table 5.3):

  C(w1, w2)   w1       w2        Tag pattern
  11487       New      York      A N
  7261        United   States    A N
  5412        Los      Angeles   N N
  3301        last     year      A N

Fairly simple, but surprisingly effective
- Needs to be refined to handle verb-particle collocations
- Kind of inconvenient to write out the patterns you want
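A minimal sketch of this kind of tag filtering, assuming the corpus is already POS-tagged with the one-letter tags used on the slide (A = adjective, N = noun, P = preposition); the candidate n-grams here are made up for illustration:

```python
# Justeson & Katz-style POS filter (sketch): keep only candidate
# n-grams whose tag sequence matches one of the allowed patterns.
PATTERNS = {("A", "N"), ("N", "N"), ("A", "A", "N"), ("A", "N", "N"),
            ("N", "A", "N"), ("N", "N", "N"), ("N", "P", "N")}

def pos_filter(candidates):
    """candidates: iterable of n-grams, each a tuple of (word, tag) pairs."""
    return [ngram for ngram in candidates
            if tuple(tag for _, tag in ngram) in PATTERNS]

# Hypothetical tagged candidates for illustration
cands = [
    (("linear", "A"), ("function", "N")),               # A N   -> kept
    (("of", "P"), ("the", "D")),                        # P D   -> removed
    (("degrees", "N"), ("of", "P"), ("freedom", "N")),  # N P N -> kept
    (("has", "M"), ("been", "V")),                      # M V   -> removed
]
kept = pos_filter(cands)
```

Listing the patterns as an explicit set mirrors the slide's complaint: every pattern you care about has to be written out by hand.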

Determining strength of collocation

We want to compare the likelihood of two words next to each other being a chance event vs. being a surprise
- Do the two words appear next to each other more than we might expect, based on what we know about their individual frequencies?
- Is this an accidental pairing or not?
- The more data we have, the more confident we will be in our assessment of whether something is a collocation or not

We'll look at bigrams, but the techniques work for words within five words of each other, translation pairs, phrases, etc.

(Pointwise) Mutual Information

One way to see if two words are strongly connected is to compare:
- the probability of the two words appearing together if they are independent (p(w1)p(w2))
- the actual probability of the two words appearing together (p(w1 w2))

The pointwise mutual information is a measure to do this:

(1)  I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
Pointwise Mutual Information equation

Our probabilities (p(w1 w2), p(w1), p(w2)) are all basically calculated in the same way:

(2)  p(x) = C(x) / N

- N is the number of words in the corpus
- The number of bigrams ≈ the number of unigrams

(3)  I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
              = log [ (C(w1 w2)/N) / ((C(w1)/N)(C(w2)/N)) ]
              = log [ N * C(w1 w2) / (C(w1) C(w2)) ]

Mutual Information example

We want to know if Ayatollah Ruhollah is a collocation in a data set we have:
- C(Ayatollah) = 42
- C(Ruhollah) = 20
- C(Ayatollah Ruhollah) = 20
- N = 14,307,668

(4)  I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N)(20/N)) ]
                            = log2 [ N * 20 / (42 * 20) ]
                            ≈ 18.38

To see how good a collocation this is, we need to compare it to others
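The count-based form of the formula in (3) is easy to check in code; the counts below are the Ayatollah Ruhollah figures from the example:

```python
import math

def pmi(c12, c1, c2, n):
    """Pointwise mutual information from raw counts:
    log2(N * C(w1 w2) / (C(w1) * C(w2)))."""
    return math.log2(n * c12 / (c1 * c2))

# Counts from the slide's example
score = pmi(c12=20, c1=42, c2=20, n=14_307_668)
print(round(score, 2))  # about 18.38, matching equation (4)
```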

Problems for Mutual Information

The formula we have also has the following equivalencies:

(5)  I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
              = log [ P(w1|w2) / P(w1) ]
              = log [ P(w2|w1) / P(w2) ]

Mutual information tells us how much more information we have for a word, knowing the other word
- But a decrease in uncertainty isn't quite right ...

A few problems:
- Sparse data: infrequent bigrams for infrequent words get high scores
- Tends to measure independence (value of 0) better than dependence
- Doesn't account for how often the words do not appear together (M&S 1999, table 5.15)

Motivating contingency tables

What we can instead get at is: which bigrams are likely, out of a range of possibilities?

Looking at the Arthur Conan Doyle story A Case of Identity, we find the following possibilities for one particular bigram:
- sherlock followed by holmes
- sherlock followed by some word other than holmes
- some word other than sherlock preceding holmes
- two words: the first not being sherlock, the second not being holmes

These are all the relevant situations for examining this bigram

Contingency tables

We can count up these different possibilities and put them into a contingency table (or 2x2 table):

                 B = holmes   B ≠ holmes   Total
  A = sherlock        7             0          7
  A ≠ sherlock       39          7059       7098
  Total              46          7059       7105

The Total row and Total column are the marginals
- The values in this chart are the observed frequencies (f_o)

Observed bigram probabilities

Because each cell indicates a bigram, divide each of the cells by the total number of bigrams (7105) to get probabilities:

                 holmes    ¬holmes   Total
  sherlock       0.00099   0.0       0.00099
  ¬sherlock      0.00549   0.99353   0.99901
  Total          0.00647   0.99353   1.0

The marginal probabilities indicate the probabilities for a given word, e.g., p(sherlock) = 0.00099 and p(holmes) = 0.00647
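Tallying the four situations into the observed table can be sketched as below; the mini-text is invented for illustration (the real counts on the slide come from A Case of Identity):

```python
def contingency(tokens, w1, w2):
    """Observed 2x2 table for the bigram (w1, w2):
    [[both, w1 + other], [other + w2, neither]]."""
    a = b = c = d = 0
    for x, y in zip(tokens, tokens[1:]):
        if x == w1 and y == w2:
            a += 1   # sherlock followed by holmes
        elif x == w1:
            b += 1   # sherlock followed by some other word
        elif y == w2:
            c += 1   # some other word preceding holmes
        else:
            d += 1   # neither word in its position
    return [[a, b], [c, d]]

# Invented mini-text for illustration
text = "sherlock holmes said that sherlock holmes saw holmes".split()
table = contingency(text, "sherlock", "holmes")
```

The four cells sum to the number of bigrams in the text, which is what makes the marginals and probabilities on the next slides well defined.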
Expected bigram probabilities

If we assumed that sherlock and holmes were independent, i.e., the probability of one is unaffected by the probability of the other, we would get the following table:

                 holmes              ¬holmes             Total
  sherlock       0.00647 x 0.00099   0.99353 x 0.00099   0.00099
  ¬sherlock      0.00647 x 0.99901   0.99353 x 0.99901   0.99901
  Total          0.00647             0.99353             1.0

- This is simply p_e(w1 w2) = p(w1) p(w2)

Expected bigram frequencies

Multiplying by 7105 (the total number of bigrams) gives us the expected number of times we should see each bigram:

                 holmes   ¬holmes   Total
  sherlock        0.05       6.95      7
  ¬sherlock      45.95    7052.05   7098
  Total          46       7059      7105

- The values in this chart are the expected frequencies (f_e)

Pearson's chi-square test

The chi-square (χ²) test measures how far the observed values are from the expected values:

(6)  χ² = Σ (f_o - f_e)² / f_e

(7)  χ² = (7-0.05)²/0.05 + (0-6.95)²/6.95 + (39-45.95)²/45.95 + (7059-7052.05)²/7052.05
        = 966.05 + 6.95 + 1.05 + 0.006
        ≈ 974

If you look this up in a table, you'll see that it's unlikely to be chance (the critical value for 1 degree of freedom at p = 0.05 is only 3.84)

NB: The χ² test does not work well for rare events, i.e., f_e < 6

Working with collocations

The question is:
- What significant collocations are there that start with the word sweet?
- Specifically, what nouns tend to co-occur after sweet?

What do your intuitions say?

Next time, we will work on how to calculate collocations ...
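The expected-frequency and chi-square steps can be computed directly from the observed table, as in this sketch. One caveat: the slides round f_e to two decimals (giving roughly 974); with unrounded marginals the statistic comes out near 1075. Either way it is far beyond the 3.84 critical value, so the conclusion is the same:

```python
def chi_square(obs):
    """Pearson chi-square statistic for a 2x2 table of observed frequencies."""
    n = sum(sum(row) for row in obs)
    row_tot = [sum(row) for row in obs]
    col_tot = [sum(col) for col in zip(*obs)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            fe = row_tot[i] * col_tot[j] / n   # expected frequency f_e
            chi2 += (obs[i][j] - fe) ** 2 / fe
    return chi2

# Observed sherlock/holmes counts from the contingency-table slide
obs = [[7, 0], [39, 7059]]
print(chi_square(obs))  # about 1075; the slides' 974 comes from rounded f_e
```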
