CORPUS AND LEXICOGRAPHY:
Safdar Hussain

The aim of this small research project is to present a corpus-based 'word list' of the most frequent words of Pakistani English; for this purpose, a corpus of Pakistani English newspaper texts has been used. This paper will also discuss issues related to data collection and data analysis of the Pakistani English corpus, and the problems faced in this project.

Introduction:
Most modern dictionaries of English depend upon the reliable empirical data provided by corpora. Corpus linguistics is a relatively new discipline, originating in the second half of the 20th century, when the first machine-readable corpora were compiled. The function of a corpus may vary according to the needs and demands of its users.
The use of a corpus has played an important role in modern dictionary making. A corpus contains a variety of electronically stored texts such as newspapers, books, etc. The example sentence for each word is extracted directly from the corpus, which shows how a word is really used. With the advent of the computational lexicon, the corpus has become indispensable.
We could not find a publicly available Urdu corpus with which to work, so we had to start our own in order to train and test machine-learning algorithms.
English is used in many regions of the world, and the variety spoken in Pakistan is regarded as a unique variety, which is called Pakistani English. The aim of this project is to compile a corpus of this variety at the first stage and then to extract a 'word list' of the most frequent words of Pakistani English.
Construction of the Corpus:
Authentic language data means data obtained from real language sources without tampering. Usually the data is stored in plain-text or XML form for corpus analysis. Issues of balance and representativeness are also important; these are tackled by proper sampling. For the Pakistani English
language there is currently no project like the British National Corpus (BNC 2000) or
the Bank of English (BOE 2000); therefore we must rely on the texts that are freely
accessible. Data for the present study was collected from online English newspapers published in Pakistan, which solves the issue of data authenticity. The data collected from the archives of newspaper websites was converted into plain-text format to be used for corpus analysis. Making the corpus representative of Pakistani English was hard to realize because of the lack of resources and time available for this research project. Given the small scale of this project, it was also hard to collect data from scanned copies of textbooks using OCR (optical character recognition). Data from Internet newspapers was therefore used.
In order to cover actual language, we have chosen the September 2008 issues of six newspapers: the Daily Times, the Dawn, the Frontier Post, the Nation, the News and the Post. There is no clear-cut classification of the articles; therefore, only exporting by date makes it possible to export all articles. In the context of this study, a corpus is understood as a machine-readable collection of texts.
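The date-based export described above can be sketched as follows. This is our own minimal illustration in Python; the archive URL pattern (`example.com/archive/YYYY-MM-DD.html`) is a hypothetical placeholder, not the pattern of the actual newspaper sites:

```python
from datetime import date, timedelta

def archive_urls(base, year, month):
    """Build one date-stamped archive URL per day of the given month.

    The URL pattern is hypothetical; real newspaper archives differ.
    """
    d = date(year, month, 1)
    urls = []
    while d.month == month:
        urls.append(f"{base}/archive/{d:%Y-%m-%d}.html")
        d += timedelta(days=1)
    return urls

# September 2008, the period sampled for this corpus
urls = archive_urls("https://example.com", 2008, 9)
```

Iterating over dates rather than article categories matches the observation that only export-by-date retrieves every article.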
This section describes the corpora created for this study. The main method of
preparing data for entry into a corpus was adaptation of material in electronic format.
The material was not readily available, other than from the Internet. Therefore,
Internet sources were used; appropriate data was downloaded and stripped of its
HTML formatting by creating a text file of it. All embedded computer instructional
text was removed. The data was then catalogued, arranged chronologically and
entered into the corpus in the form of the corresponding text files. All the daily newspapers included in the corpus have offline printed versions. Owing to the partial unavailability of online versions of weekly and monthly magazines, only daily newspapers have been taken into account for this study. The list of the websites from which the data was collected is given below.
[Figure: corpus composition by newspaper (partial data, types/raw tokens) — Dawn: 78,858 (24%); the News: 65,295 (21%)]
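The HTML-stripping step described above (removing markup and embedded script text, keeping only visible text) can be sketched with Python's standard-library parser; the class and function names here are our own illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def strip_html(html):
    """Return the visible text of an HTML page with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())

page = "<html><script>var x=1;</script><p>PM visits flood-hit areas.</p></html>"
text = strip_html(page)  # -> "PM visits flood-hit areas."
```

The output of such a function is the plain-text file that is then catalogued and entered into the corpus.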
This corpus is smaller than the major English corpora, but it seems to be large enough, taking into account our objective of creating a word list of Pakistani English. It is not claimed that the given corpus is perfectly balanced, but it is made up of the kind of texts that the potential users of our dictionary will have to deal with. Corpora are used to derive empirical knowledge about language, which can supplement, and frequently correct, native-speaker intuition.
The whole corpus was then tagged and lemmatised with AntConc. The result of this analysis was processed in order to restore the aspect of the original texts. We submitted the entire lemmatised corpus (51,845,143 words) to AntConc, a well-known text analysis tool, which was used to create a frequency list for the whole corpus. This frequency list has been corrected on some minor points. For example, frequent words written with hyphens that were split up during the lemmatisation process have been extracted from the original corpus and added to the list.
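The frequency-list step can be illustrated with a short sketch (our own Python illustration, not AntConc's actual implementation); note that hyphenated words are kept whole, matching the correction described above:

```python
import re
from collections import Counter

def frequency_list(text):
    """Count lowercase word tokens; hyphenated words stay single tokens."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(tokens)

freq = frequency_list("The court heard the case. The so-called case was adjourned.")
# freq.most_common(2) -> [('the', 3), ('case', 2)]
```

Sorting the counter by descending frequency yields exactly the kind of ranked word list reported in the next section.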
Results:
The data analysis with AntConc produced a word list of 89,843 word types out of 10,406,212 word tokens. The list of the top 1,000 most frequent word types is given below.
Limitations
There are several limitations to this study regarding the statistics of the corpus research. Urdu words occur quite frequently and regularly in Pakistani English newspapers, and it is not possible to take all the Urdu words into account in a 10.5-million-word corpus. Words that appear fewer than 10 times have therefore been neglected, keeping in view the scope of this study.
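The frequency cut-off described above can be sketched as follows (our own illustration; the threshold of 10 is the one stated in the text, and the sample words and counts are invented):

```python
from collections import Counter

def prune_rare(freq, min_count=10):
    """Drop entries that occur fewer than min_count times."""
    return Counter({w: c for w, c in freq.items() if c >= min_count})

# Invented sample counts: low-frequency Urdu borrowings fall below the cut-off
freq = Counter({"government": 150, "jirga": 12, "sahib": 3})
pruned = prune_rare(freq)  # keeps 'government' and 'jirga', drops 'sahib'
```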
Conclusion:
A corpus allows the lexicographer to make informed linguistic decisions about how to frame an entry and to analyse the lexical patterns associated with words in a more objective and consistent way. Caution is still required, because blindly following the corpus, no matter how carefully it may be constructed to represent the target language type accurately, can lead to oddities. We expect our motto, 'Corpus-based, but not corpus-bound', to hold good for many years to come.
References:
Leech, G. (1991) 'The State of the Art in Corpus Linguistics'. In K. Aijmer and B. Altenberg (eds), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman.