Corpus Linguistics (L615)

Application #2: Collocations

Markus Dickinson
Department of Linguistics, Indiana University
Spring 2013

Corpora for lexicography

- Can extract authentic & typical examples, with frequency information
- With sociolinguistic meta-data, can get an accurate description of usage and, with monitor corpora, its change over time
- Can complement intuitions about meanings
  - The study of loanwords, for example, can be bolstered by corpus studies

Lexical studies: Collocations & colligations

Collocations are characteristic co-occurrence patterns of two (or more) lexical items:
- They tend to occur with greater than random chance
- The meaning tends to be more than the sum of the parts

These are extremely hard to define by intuition:
- Pro: Corpora have been able to reveal connections previously unseen
- Con: It may not be clear what the theoretical basis of collocations is
- Pro & Con: how do they fit into grammar?

A colligation is a slightly different concept:
- the collocation of a node word with a particular class of words (e.g., determiners)

Colligations often create noise in a list of collocations:
- e.g., this house, because this is so common on its own, and determiners appear before nouns
- Thus, people sometimes use stop words to filter out non-collocations

Defining a collocation

People disagree on collocations:
- Intuition does not seem to be a completely reliable way to figure out what a collocation is
- Many collocations are overlooked: people notice unusual words & structures, but not ordinary ones
- What your collocations are depends on exactly how you calculate them

What a collocation is

Collocations are expressions of two or more words that are in some sense conventionalized as a group:
- strong tea (cf. ??powerful tea)
- international best practice
- kick the bucket

Importance of the context: "You shall know a word by the company it keeps" (Firth 1957)
- There is some notion that they are more than the sum of their parts
- There are lexical properties that more general syntactic properties do not capture

So, how do we practically define a collocation? ...

(This slide and the next 3 adapted from Manning and Schütze (1999), Foundations of Statistical Natural Language Processing)
Prototypical collocations

Prototypically, collocations meet the following criteria:
- Non-compositional: the meaning of kick the bucket is not composed of the meanings of its parts
- Non-substitutable: orange hair is just as accurate as red hair, but some don't say it
- Non-modifiable: often we cannot modify a collocation, even though we normally could modify one of those words: ??kick the red bucket

Compositionality tests

The previous properties are good tests, but hard to verify with corpus data. (At least) two tests we can use with corpora:
- Is the collocation translated word-by-word into another language?
  - e.g., the collocation make a decision is not translated literally into French
- Do the two words co-occur more frequently together than we would otherwise expect?
  - e.g., of the is frequent, but both words are frequent, so we might expect this

Kinds of collocations

Collocations come in different guises:
- Light verbs: the verb conveys very little meaning but must be the right one:
  - make a decision vs. *take a decision, take a walk vs. *make a walk
- Phrasal verbs: a main verb and particle combination, often translated as a single word:
  - to tell off, to call up
- Proper nouns: slightly different from the others, but each refers to a single idea (e.g., Brooks Brothers)
- Terminological expressions: technical terms that form a unit (e.g., hydraulic oil filter)

Semantic prosody & preference

Semantic prosody = "a form of meaning which is established through the proximity of a consistent series of collocates" (Louw 2000)
- Idea: you can tell the semantic prosody of a word by the types of words it frequently co-occurs with
- These are typically negative: e.g., peddle, ripe for, get oneself verbed
- This type of co-occurrence often leads to general semantic preferences
  - e.g., utterly, totally, etc. typically have a feature of absence or change of state

Collocation: from silly ass to lexical sets (Krishnamurthy 2000)

Firth 1957: "You shall know a word by the company it keeps"
- Collocational meaning is a syntagmatic type of meaning, not a conceptual one
  - e.g., in this framework, one of the meanings of night is the fact that it co-occurs with dark

Example: ass is associated with a particular set of adjectives (think of goose if you prefer):
- silly, obstinate, stupid, awful
- We can see a lexical set associated with this word

Notes on a collocation's definition (Krishnamurthy 2000)

We often look for words which are adjacent to make up a collocation, but this is not always true
- e.g., computers run, where the two words may only be in the same proximity

We can also speak of upward/downward collocations:
- downward: involves a more frequent node word A with a less frequent collocate B
- upward: a weaker relationship, tending to be more of a grammatical property

Lexical sets & collocations vary across genres, subcorpora, etc.
Corpus linguistics (Krishnamurthy 2000)

Where collocations fit into corpus linguistics:
1. Pattern recognition: recognize lexical and grammatical units
2. Frequency list generation: rank words
3. Concordancing: observe word behavior
4. Collocations: take concordancing a step further ...

(Slides 14-30 are based on Manning & Schütze (M&S) 1999)

Calculating collocations

Simplest approach: use frequency counts
- Two words appearing together a lot are a collocation

The problem is that we get lots of uninteresting pairs of function words (M&S 1999, table 5.1):

  C(w1, w2)   w1   w2
  80871       of   the
  58841       in   the
  26430       to   the
  21842       on   the
13 / 28 14 / 28

POS filtering

To remove frequent pairings which are uninteresting, we can use a POS filter (Justeson and Katz 1995)
- only examine word sequences which fit a particular part-of-speech pattern:
  A N, N N, A A N, A N N, N A N, N N N, N P N

  A N     linear function
  N A N   mean squared error
  N P N   degrees of freedom

- Crucially, all other sequences are removed:

  P D     of the
  M V V   has been

POS filtering (2)

Some results after tag filtering (M&S 1999, table 5.3):

  C(w1, w2)   w1       w2        Tag pattern
  11487       New      York      A N
  7261        United   States    A N
  5412        Los      Angeles   N N
  3301        last     year      A N

Fairly simple, but surprisingly effective
- Needs to be refined to handle verb-particle collocations
- Kind of inconvenient to write out the patterns you want
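A minimal sketch of this kind of tag filtering, assuming the corpus is already POS-tagged with the one-letter tags used on the slide (A = adjective, N = noun, P = preposition); the candidate n-grams here are made up for illustration:

```python
# Justeson & Katz-style POS filter (sketch): keep only candidate
# n-grams whose tag sequence matches one of the allowed patterns.
PATTERNS = {("A", "N"), ("N", "N"), ("A", "A", "N"), ("A", "N", "N"),
            ("N", "A", "N"), ("N", "N", "N"), ("N", "P", "N")}

def pos_filter(candidates):
    """candidates: iterable of n-grams, each a tuple of (word, tag) pairs."""
    return [ngram for ngram in candidates
            if tuple(tag for _, tag in ngram) in PATTERNS]

# Hypothetical tagged candidates for illustration
cands = [
    (("linear", "A"), ("function", "N")),               # A N   -> kept
    (("of", "P"), ("the", "D")),                        # P D   -> removed
    (("degrees", "N"), ("of", "P"), ("freedom", "N")),  # N P N -> kept
    (("has", "M"), ("been", "V")),                      # M V   -> removed
]
kept = pos_filter(cands)
```

Listing the patterns as an explicit set mirrors the slide's complaint: every pattern you care about has to be written out by hand.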

Determining strength of collocation

We want to compare the likelihood of two words next to each other being a chance event vs. being a surprise
- Do the two words appear next to each other more than we might expect, based on what we know about their individual frequencies?
- Is this an accidental pairing or not?
- The more data we have, the more confident we will be in our assessment of whether something is a collocation or not

We'll look at bigrams, but the techniques work for words within five words of each other, translation pairs, phrases, etc.

(Pointwise) Mutual Information

One way to see if two words are strongly connected is to compare:
- the probability of the two words appearing together if they are independent (p(w1)p(w2))
- the actual probability of the two words appearing together (p(w1 w2))

The pointwise mutual information is a measure to do this:

(1)  I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
Pointwise Mutual Information equation

Our probabilities (p(w1 w2), p(w1), p(w2)) are all basically calculated in the same way:

(2)  p(x) = C(x) / N

- N is the number of words in the corpus
- The number of bigrams ≈ the number of unigrams

(3)  I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
              = log [ (C(w1 w2)/N) / ((C(w1)/N)(C(w2)/N)) ]
              = log [ N * C(w1 w2) / (C(w1) C(w2)) ]

Mutual Information example

We want to know if Ayatollah Ruhollah is a collocation in a data set we have:
- C(Ayatollah) = 42
- C(Ruhollah) = 20
- C(Ayatollah Ruhollah) = 20
- N = 14,307,668

(4)  I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N)(20/N)) ]
                            = log2 [ N * 20 / (42 * 20) ]
                            ≈ 18.38

To see how good a collocation this is, we need to compare it to others
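The count-based form of the formula in (3) is easy to check in code; the counts below are the Ayatollah Ruhollah figures from the example:

```python
import math

def pmi(c12, c1, c2, n):
    """Pointwise mutual information from raw counts:
    log2(N * C(w1 w2) / (C(w1) * C(w2)))."""
    return math.log2(n * c12 / (c1 * c2))

# Counts from the slide's example
score = pmi(c12=20, c1=42, c2=20, n=14_307_668)
print(round(score, 2))  # about 18.38, matching equation (4)
```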

Problems for Mutual Information

The formula we have also has the following equivalencies:

(5)  I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
              = log [ P(w1|w2) / P(w1) ]
              = log [ P(w2|w1) / P(w2) ]

Mutual information tells us how much more information we have for a word, knowing the other word
- But a decrease in uncertainty isn't quite right ...

A few problems:
- Sparse data: infrequent bigrams for infrequent words get high scores
- Tends to measure independence (value of 0) better than dependence
- Doesn't account for how often the words do not appear together (M&S 1999, table 5.15)

Motivating contingency tables

What we can instead get at is: which bigrams are likely, out of a range of possibilities?

Looking at the Arthur Conan Doyle story A Case of Identity, we find the following possibilities for one particular bigram:
- sherlock followed by holmes
- sherlock followed by some word other than holmes
- some word other than sherlock preceding holmes
- two words: the first not being sherlock, the second not being holmes

These are all the relevant situations for examining this bigram

Contingency tables

We can count up these different possibilities and put them into a contingency table (or 2x2 table):

                 B = holmes   B ≠ holmes   Total
  A = sherlock        7             0          7
  A ≠ sherlock       39          7059       7098
  Total              46          7059       7105

The Total row and Total column are the marginals
- The values in this chart are the observed frequencies (f_o)

Observed bigram probabilities

Because each cell indicates a bigram, divide each of the cells by the total number of bigrams (7105) to get probabilities:

                 holmes    ¬holmes   Total
  sherlock       0.00099   0.0       0.00099
  ¬sherlock      0.00549   0.99353   0.99901
  Total          0.00647   0.99353   1.0

The marginal probabilities indicate the probabilities for a given word, e.g., p(sherlock) = 0.00099 and p(holmes) = 0.00647
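Tallying the four situations into the observed table can be sketched as below; the mini-text is invented for illustration (the real counts on the slide come from A Case of Identity):

```python
def contingency(tokens, w1, w2):
    """Observed 2x2 table for the bigram (w1, w2):
    [[both, w1 + other], [other + w2, neither]]."""
    a = b = c = d = 0
    for x, y in zip(tokens, tokens[1:]):
        if x == w1 and y == w2:
            a += 1   # sherlock followed by holmes
        elif x == w1:
            b += 1   # sherlock followed by some other word
        elif y == w2:
            c += 1   # some other word preceding holmes
        else:
            d += 1   # neither word in its position
    return [[a, b], [c, d]]

# Invented mini-text for illustration
text = "sherlock holmes said that sherlock holmes saw holmes".split()
table = contingency(text, "sherlock", "holmes")
```

The four cells sum to the number of bigrams in the text, which is what makes the marginals and probabilities on the next slides well defined.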
Expected bigram probabilities

If we assumed that sherlock and holmes were independent, i.e., the probability of one is unaffected by the probability of the other, we would get the following table:

                 holmes              ¬holmes             Total
  sherlock       0.00647 x 0.00099   0.99353 x 0.00099   0.00099
  ¬sherlock      0.00647 x 0.99901   0.99353 x 0.99901   0.99901
  Total          0.00647             0.99353             1.0

- This is simply p_e(w1 w2) = p(w1) p(w2)

Expected bigram frequencies

Multiplying by 7105 (the total number of bigrams) gives us the expected number of times we should see each bigram:

                 holmes   ¬holmes   Total
  sherlock        0.05       6.95      7
  ¬sherlock      45.95    7052.05   7098
  Total          46       7059      7105

- The values in this chart are the expected frequencies (f_e)

Pearson's chi-square test

The chi-square (χ²) test measures how far the observed values are from the expected values:

(6)  χ² = Σ (f_o - f_e)² / f_e

(7)  χ² = (7-0.05)²/0.05 + (0-6.95)²/6.95 + (39-45.95)²/45.95 + (7059-7052.05)²/7052.05
        = 966.05 + 6.95 + 1.05 + 0.006
        ≈ 974

If you look this up in a table, you'll see that it's unlikely to be chance (the critical value for 1 degree of freedom at p = 0.05 is only 3.84)

NB: The χ² test does not work well for rare events, i.e., f_e < 6

Working with collocations

The question is:
- What significant collocations are there that start with the word sweet?
- Specifically, what nouns tend to co-occur after sweet?

What do your intuitions say?

Next time, we will work on how to calculate collocations ...
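The expected-frequency and chi-square steps can be computed directly from the observed table, as in this sketch. One caveat: the slides round f_e to two decimals (giving roughly 974); with unrounded marginals the statistic comes out near 1075. Either way it is far beyond the 3.84 critical value, so the conclusion is the same:

```python
def chi_square(obs):
    """Pearson chi-square statistic for a 2x2 table of observed frequencies."""
    n = sum(sum(row) for row in obs)
    row_tot = [sum(row) for row in obs]
    col_tot = [sum(col) for col in zip(*obs)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            fe = row_tot[i] * col_tot[j] / n   # expected frequency f_e
            chi2 += (obs[i][j] - fe) ** 2 / fe
    return chi2

# Observed sherlock/holmes counts from the contingency-table slide
obs = [[7, 0], [39, 7059]]
print(chi_square(obs))  # about 1075; the slides' 974 comes from rounded f_e
```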
