A collection of words?
Is it a theory or methodology of language?
Types of Corpora
1. Specialised corpus: e.g.
genre: the language of newspapers
time: 2005 to the present day
place: just texts published in China
2. General corpus: needs to be much larger. E.g. the British
National Corpus (BNC) has about 100 million words of
spoken and written British English:
The BNC
Types of Corpora
3. Multilingual corpus: e.g. English and Spanish, or American
English and Indian English. http://ice-corpora.net/ICE/INDEX.HTM
4. Parallel corpus: e.g. English and Spanish, with exactly the
same texts translated. E.g. the CRATER corpus
http://catalog.elra.info/product_info.php?products_id=84
5. Learner corpus: language use created by people learning a
particular language. E.g. the International Corpus of
Learner English.
6. Historical or diachronic corpus: e.g. the Helsinki Corpus, 1.5
million words of texts from 700 AD to 1700 AD.
7. Monitor corpus: continually being added to. E.g. the Bank
of English
http://www.collins.co.uk/page/Wordbanks+Online
Frequencies
Your query "wash" returned 2415
matches in 952 different texts (in
97,626,093 words; freq: 24.74
instances per million words)
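The normalised figure in the query result can be checked by hand. The sketch below is only an illustration of the arithmetic (the function name `per_million` is made up here, not part of any corpus tool): raw hits are divided by corpus size and scaled to one million words, so that corpora of different sizes can be compared.

```python
def per_million(hits: int, corpus_size: int) -> float:
    """Normalised frequency: occurrences per million words of corpus text."""
    return hits / corpus_size * 1_000_000


# Figures taken from the "wash" query result above.
freq = per_million(2415, 97_626_093)
print(f"{freq:.2f} instances per million words")  # prints 24.74 instances per million words
```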
Collocations
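As a rough illustration of the idea, collocates of a word can be found by counting which words occur near it. The sketch below is a minimal example under stated assumptions: the toy sentence is invented, the window of two words either side is an arbitrary choice, and the function name `collocates` is hypothetical. It uses raw counts only; real corpus tools usually rank collocates with a statistical association measure instead.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count words found within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Invented toy text; a real corpus would be millions of words.
text = "wash your hands then wash the dishes and wash your hands again"
result = collocates(text.split(), "wash")
print(result.most_common(3))
```

In this toy text "hands" turns up within two words of "wash" three times, more often than any other word, which is the kind of co-occurrence pattern a collocation search is designed to surface.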
Limitations of Corpus Linguistics
A corpus won't tell us whether something is possible in a
language, or well-formed. E.g. is "he expired of heart disease"
acceptable English?
Any generalisations we make from corpus data can only be
deductions, not facts.
Corpora give us evidence, but not information or
explanations. E.g. why do women say "wash" more than men?
Corpora give us language out of context, so there is no visual
information, e.g. pictures, fonts, etc. And with spoken data
there is no information on what the speakers look like, or on
their behaviour or body language.
Further Reading
McEnery, T. & Wilson, A. (2001) Corpus Linguistics.
Edinburgh: Edinburgh University Press. Chapter 1.
Hunston, S. (2002) Corpora in Applied Linguistics.
Cambridge: Cambridge University Press. Chapter 1.
Question 1
What is a corpus?
A theory of language.
A collection of texts stored on a computer.
An electronic database similar to a dictionary.
Any large collection of words such as a collection of books, newspapers or magazines.
Question 2
What is the main reason for using corpora?
Other methods of language analysis are not reliable.
Computers can confirm our intuitions about language.
Computers can help us discover interesting patterns in language which would be difficult to spot otherwise.
With corpora we can answer all research questions about language.
Question 3
What is corpus annotation?
Adding an extra layer of information to the text to allow for more sophisticated searches.
Separating text into sentences.
Manual coding of text for parts of speech.
Adding critical comments to a text.
Question 4
What is a specialised corpus?
Question 5
Which of these is NOT a type of corpus?
Multilingual corpus
Learner corpus
Diachronic corpus
Observer corpus
Question 6
What is the BNC?
Question 7
Which of these statements is NOT true about a monitor corpus?
It is frequently updated.
The Bank of English is an example of a monitor corpus.
The BNC is an example of a monitor corpus.
It is used to monitor rapid change in language.
Question 8
What is a concordance?
Information about word frequencies normalised per million words.
A listing of examples of a word searched for in a corpus, with some context on the left and some context on the right.
An alphabetical list of words that appear in a text.
A list of words and their frequencies that can be used for identifying important words in a text.
Question 9
What is collocation?
The tendency of speakers to talk over each other.
The tendency of words to co-occur with one another.
The tendency of words to appear in unique, different contexts each time.
The tendency of sentences to create meaning.
Question 10
What is a frequency distribution in a corpus?
Information about how frequent a word is in a corpus.
Information about the frequency of use of a term across a number of different texts, corpus sections, speakers, etc.
Information about how frequent a word is per million words.
Sociolinguistic information about the gender of the speakers that are represented in a corpus.