
What is a corpus?

A collection of words?
Is it a theory or methodology of
language?

Why use a corpus?


Large amounts of data tell us about
tendencies and what's normal or typical in
real-life language use.
Corpora also reveal instances of very rare
or exceptional cases that we wouldn't get
from looking at single texts or introspection.
Human researchers make mistakes and
are slow. Computers are much quicker and
more accurate.

Criteria in building a corpus


1. It must be a large body of text.
2. It needs to be representative of
language (or a genre of language).
3. It must be in machine-readable form (e.g.
.txt files on a computer).
4. It acts as a standard reference for
what's typical in language.
5. It is often annotated with additional linguistic
information, e.g. grammatical codes.

Annotation and mark-up


Corpus texts may be enriched with additional information to
ease analysis.
Note that this type of additional information may be called
mark-up, annotation, or tagging. All three terms are near
synonyms. Annotation usually refers to linguistic information
encoded in a corpus - however, the encoding is achieved using a
mark-up language.
Similarly, the annotation itself is usually undertaken by putting
so-called tags - short codes that indicate some linguistic feature - into
a text. Hence, while the terms can be separated, they can also be
used interchangeably!
One final note - an XML closing tag is marked with a forward slash
(e.g. </p>) rather than a backslash.

Some untagged text


Arrest warrant out for Clowes partner years
before collapse.
By Daniel John
A WARRANT for the arrest of the former
partner of Mr Peter Clowes was issued seven
years before his Barlow Clowes investment
empire collapsed, according to evidence
submitted to the Parliamentary Ombudsman.

Add tags for headers and paragraphs

<head type=MAIN>
Arrest warrant out for Clowes partner years before collapse.
</head>
<head type=BYLINE>
By Daniel John
</head>
<p>
A WARRANT for the arrest of the former partner of Mr Peter
Clowes was
issued seven years before his Barlow Clowes investment empire
collapsed, according to evidence submitted to the Parliamentary
Ombudsman.
</p>

Add sentence tags


<head type=MAIN>
<s n=001>Arrest warrant out for Clowes partner years before
collapse.
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr
Peter
Clowes was issued seven years before his Barlow Clowes investment
empire collapsed, according to evidence submitted to the
Parliamentary Ombudsman.
</p>

Change quotes to SGML entities


<head type=MAIN>
<s n=001>&bquo;Arrest warrant out for Clowes partner years
before collapse&equo;
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr
Peter Clowes was
issued seven years before his Barlow Clowes investment empire
collapsed,
according to evidence submitted to the Parliamentary Ombudsman.
</p>

Add tags for punctuation


<head type=MAIN>
<s n=001><c PUQ>&bquo;Arrest warrant out for Clowes<c PUN>
partner years
before collapse <c PUQ>&equo;
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter
Clowes
was issued seven years before his Barlow Clowes investment empire
collapsed, according to evidence submitted to the Parliamentary
Ombudsman <c PUN>.
</p>

Add grammatical codes to word units

<head type=MAIN>
<s n=001><c PUQ>&bquo;<w NN1>Arrest <w NN1>warrant <w AVP>out <w PRP>for
<w NP0>Clowes<c PUN> <w NN1>partner <w NN2>years <w PRP>before <w NN1>collapse
<c PUQ>&equo;<c PUN>.
</head>
<head type=BYLINE>
<s n=002><w PRP>By <w NP0>Daniel <w NP0>John
</head>
<p>
<s n=003><w AT1>A <w NN1>WARRANT <w PRP>for <w AT0>the <w NN1>arrest <w PRF>of
<w AT0>the <w DT0>former <w NN1>partner <w PRF>of <w NP0>Mr <w NP0>Peter
<w NP0>Clowes <w VBD>was <w VVN>issued <w CRD>seven <w NN2>years <w CJS>before
<w DPS>his <w NN1-NP0>Barlow <w NP0>Clowes <w NN1>investment <w NN1>empire
<w VVD>collapsed<c PUN>, <w PRP>according to <w NN1>evidence <w VVN>submitted
<w PRP>to <w AT0>the <w AJ0>Parliamentary <w NN1>Ombudsman<c PUN>.
</p>
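
Once a text carries word-level codes like those above, a program can read the codes back out. The following is a minimal sketch, assuming the simplified `<w TAG>word` and `<c TAG>punctuation` format shown on this slide; the function name `parse_tagged` and the regular expression are illustrative choices, not part of any corpus toolkit.

```python
import re

# A fragment copied from the tagged sentence on the slide.
tagged = "<w VVD>collapsed<c PUN>, <w PRP>according to <w NN1>evidence"

# Match <w TAG> or <c TAG>, then capture the text up to the next "<".
TOKEN = re.compile(r"<([wc]) ([A-Z0-9-]+)>([^<]*)")

def parse_tagged(text):
    """Return (token, tag) pairs from slide-style <w TAG>/<c TAG> mark-up."""
    return [(m.group(3).strip(), m.group(2)) for m in TOKEN.finditer(text)]

pairs = parse_tagged(tagged)
for token, tag in pairs:
    print(f"{token}\t{tag}")

# Example query: keep only the tokens tagged as singular common nouns (NN1).
nouns = [token for token, tag in pairs if tag == "NN1"]
```

This is the kind of search that annotation makes possible: a query for a grammatical code rather than a particular word form.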

Types of Corpora
1. Specialised corpus, e.g. restricted by
genre: the language of newspapers
time: 2005 to the present day
place: just texts published in China
2. General corpus: needs to be much
larger. E.g. the British
National Corpus (BNC) has about 100
million words of
spoken and written British English.

The BNC

Types of Corpora
3. Multilingual corpus, e.g. English and Spanish, or American
English and Indian English. http://ice-corpora.net/ICE/INDEX.HTM
4. Parallel corpus: exactly the same texts in translation, e.g.
English and Spanish. E.g. the CRATER corpus
http://catalog.elra.info/product_info.php?products_id=84
5. Learner corpus: language produced by people learning a
particular language. E.g. the International Corpus of
Learner English.
6. Historical or diachronic corpus, e.g. the Helsinki Corpus: 1.5
million words of texts from 700 AD to 1700 AD.
7. Monitor corpus: continually being added to, e.g. the Bank
of English
http://www.collins.co.uk/page/Wordbanks+Online

Frequency data, concordances and collocation

Frequencies
Your query "wash" returned 2415
matches in 952 different texts (in
97,626,093 words; freq: 24.74
instances per million words)
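
The per-million figure in the query result is a normalised frequency: the raw number of matches divided by the corpus size, scaled to one million words. A minimal sketch of that arithmetic, using the numbers reported above (the function name `per_million` is only illustrative):

```python
def per_million(hits, corpus_size):
    """Normalise a raw frequency to instances per million words."""
    return hits / corpus_size * 1_000_000

# Figures from the query result for "wash" shown above.
freq = per_million(2415, 97_626_093)
print(f"{freq:.2f} instances per million words")
```

Normalising in this way lets frequencies from corpora of different sizes be compared directly.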

Concordances, aka Key Word In Context (KWIC)

Concordance (sorted at 1L)
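
A concordance lists each occurrence of a search word with some context on either side, and sorting at 1L orders the lines by the word immediately to the left of the search word. The sketch below illustrates the idea on a made-up sentence (not real corpus data); the names `kwic` and `sort_1l` are hypothetical helpers, not functions of any concordancing tool.

```python
def kwic(tokens, node, width=3):
    """Return (left, node, right) context tuples for every occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_1l(lines):
    """Sort concordance lines by the first word to the left of the node (1L)."""
    return sorted(lines, key=lambda line: line[0][-1].lower() if line[0] else "")

# Invented example text, tokenised naively on whitespace.
text = "we wash the car on Sundays but they wash up after dinner and I wash the dog"
tokens = text.split()

for left, node, right in sort_1l(kwic(tokens, "wash")):
    print(f"{' '.join(left):>20}  {node}  {' '.join(right)}")
```

Sorting on the left or right context groups similar patterns together, which makes recurring phrases easier to spot by eye.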

Collocations
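
Collocation concerns which words tend to co-occur with a given word. One simple way to explore it, sketched here under the assumption of a fixed window of two words on either side, is to count every word that falls within that window around the search word. The sentence is invented, and `collocates` is an illustrative name; real corpus tools usually go beyond raw counts and apply a statistical association measure.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens either side of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

# Invented example text, tokenised naively on whitespace.
tokens = "wash your hands then wash the dishes and wash your face".split()
counts = collocates(tokens, "wash")

for word, n in counts.most_common():
    print(word, n)
```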

Corpora and Language Teaching
Textbooks
Dictionaries
Classroom Exercises
Tests
Learner Corpora

Limitations of Corpus Linguistics
It won't tell us if something is possible in a language, or
well-formed. E.g. is "he expired of heart disease" acceptable
English?
Any generalisations we make from corpus data can only be
deductions, not facts.
Corpora give us evidence, but not information or
explanations. Why do women say "wash" more than men?
Corpora give us language out of context, so no visual
information, e.g. pictures, fonts etc. And with spoken data,
no information on what the speakers look like, their behaviour or
body language.

Further Reading
McEnery, T. & Wilson, A. (2001) Corpus Linguistics.
Edinburgh: Edinburgh University Press. Chapter 1.
Hunston, S. (2002) Corpora in Applied Linguistics.
Cambridge: Cambridge University Press. Chapter 1.

Question 1
What is a corpus?
A theory of language.
A collection of texts stored on a
computer.
An electronic database similar to a
dictionary.
Any large collection of words such as
a collection of books, newspapers or
magazines.

Question 2
What is the main reason for using corpora?
Other methods of language analysis are not
reliable.
Computers can confirm our intuitions about
language.
Computers can help us discover interesting
patterns in language which would be difficult to
spot otherwise.
With corpora we can answer all research
questions about language.

Question 3
What is corpus annotation?
Adding an extra layer of information
to the text to allow for more
sophisticated searches.
Separating text into sentences.
Manual coding of text for parts of
speech.
Adding critical comments to a text.

Question 4
What is a specialised corpus?

A corpus that is used for historical language investigations.
A corpus that is composed of a large variety of genres.
A corpus that is used by language specialists.
A corpus that focuses on e.g. one type of genre, one
period, one place etc.

Question 5
Which of these is NOT a type of
corpus?
Multilingual corpus
Learner corpus
Diachronic corpus
Observer corpus

Question 6
What is the BNC?

A large general corpus of British English.
A corpus of different genres of English writing.
A large spoken corpus of British English.
A specialised corpus representing the language of
newspapers.

Question 7
Which of these statements is NOT
true about a monitor corpus?
It is frequently updated.
The Bank of English is an example of
a monitor corpus.
The BNC is an example of a monitor
corpus.
It is used to monitor rapid change in
language.

Question 8
What is a concordance?
Information about word frequencies normalised
per million words.
Listing of examples of a word searched in a
corpus with some context on the right and some
context on the left.
An alphabetical list of words that appear in a
text.
A list of words and their frequencies that can be
used for identifying important words in a text.

Question 9
What is collocation?
The tendency of speakers to talk over each
other.
The tendency of words to co-occur with one
another.
The tendency of words to appear in unique,
different contexts each time.
The tendency of sentences to create
meaning.

Question 10
What is a frequency distribution in a corpus?
Information about how frequent a word is in a
corpus.
Information about the frequency of use of a term
across a number of different texts, corpus
sections, speakers etc.
Information about how frequent a word is per
million words.
Sociolinguistic information about the gender of
the speakers that are represented in a corpus.

Brown and LOB


These corpora are sometimes referred to as snapshot corpora - their design is such that
they try to represent a broad range of genres of published, professionally authored
English. Their goal is to capture the language at one moment in time, hence the term
snapshot.
Of course, as with any snapshot there are things you see and things you do not see. So,
in this case, we are looking at professionally authored written English - not speech and
not writing of a more informal variety. We are also only looking at certain genres. As with
any snapshot, it was taken at a certain point of time in a certain place - Brown is America
in the early 1960s, LOB is the UK in the early 1960s. Such corpora are often used to
compare and contrast varieties of a language - in this case two varieties of English. They
can also be looked at on their own to explore either variety of English in its own right.
The Brown corpus is so named because it was developed at Brown University in the US.
LOB is an acronym, standing for Lancaster-Oslo-Bergen, the three universities that
collaborated to build that corpus.
Back to the snapshot metaphor! The two corpora can be compared because they are
composed in the same way - the subject is the same, if you like. They look at broadly the
same genres. Those genres are represented by similar numbers of similarly sized chunks of
data. Also, of course, the data was gathered in roughly the same time period.
The genres covered in the two corpora are outlined below. Note the letter code for each
genre - that is important, as it shows you which genre is associated with which file in the
corpus. Following the letter code is a description of the type of data in the category,
followed by two numbers in parentheses - the first is the number of chunks of data in
that category in Brown, the second is the number of chunks of data in that category in
