CORPUS AND LEXICOGRAPHY:
Safdar Hussain

The aim of this small research project is to present a corpus-based 'word list' of the most frequent words of Pakistani English; for this purpose, a corpus of Pakistani English newspaper texts has been used. This paper will also discuss issues related to data collection and data analysis of the Pakistani English corpus, and the problems faced in this project.

Introduction:
Most modern dictionaries of English depend upon the reliable empirical data provided by corpora. Corpus linguistics is a relatively new discipline, originating in the second half of the 20th century, when the first machine-readable corpora were compiled. The function of a corpus may vary according to the needs and demands of its users.
The use of a corpus has played an important role in modern dictionary making. A corpus contains a variety of electronically stored texts such as newspapers, books, etc. The example sentence for each word is extracted directly from the corpus, which shows how a word is really used. With the advent of the computational lexicon, the corpus has become indispensable.
We could not find a publicly available Urdu corpus with which to work, so we had to start our own in order to train and test machine-learning algorithms.
English is used in many regions of the world, and the variety spoken in Pakistan is regarded as a unique variety, which is called Pakistani English. The aim of this project is to compile a corpus of this variety at the first stage and then to extract a 'word list' of the most frequent words of Pakistani English.
Construction of the Corpus:
Authentic language data means data obtained from real language sources without tampering. Usually the data is stored in plain-text or XML form for corpus analysis. Issues of balance and representativeness are also important; these are tackled by proper sampling. For the Pakistani English
language there is currently no project like the British National Corpus (BNC 2000) or
the Bank of English (BOE 2000); therefore we must rely on the texts that are freely
accessible. Data for the present study was collected from online English newspapers published in Pakistan, which solves the issue of data authenticity. The data collected from the archives of newspaper websites was converted into plain-text format to be used for corpus analysis. Making the corpus representative of Pakistani English was hard to realize because of the lack of resources and time available for this research project. Given the small scale of this project, it was also hard to collect data from scanned copies of textbooks using OCR (optical character recognition). Data from Internet newspapers was therefore used.
In order to cover actual language, we have chosen the September 2008 issues of six newspapers: the Daily Times, the Dawn, the Frontier Post, the Nation, the News and the Post. There is no clear-cut classification of the articles; therefore, only exporting by date makes it possible to export all articles. In the context of this study, a corpus is understood as a machine-readable collection of texts.
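The date-based export described above can be sketched as follows. This is our own minimal illustration in Python; the archive URL pattern (`example.com/archive/YYYY-MM-DD.html`) is a hypothetical placeholder, not the pattern of the actual newspaper sites:

```python
from datetime import date, timedelta

def archive_urls(base, year, month):
    """Build one date-stamped archive URL per day of the given month.

    The URL pattern is hypothetical; real newspaper archives differ.
    """
    d = date(year, month, 1)
    urls = []
    while d.month == month:
        urls.append(f"{base}/archive/{d:%Y-%m-%d}.html")
        d += timedelta(days=1)
    return urls

# September 2008, the period sampled for this corpus
urls = archive_urls("https://example.com", 2008, 9)
```

Iterating over dates rather than article categories matches the observation that only export-by-date retrieves every article.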
This section describes the corpora created for this study. The main method of
preparing data for entry into a corpus was adaptation of material in electronic format.
The material was not readily available, other than from the Internet. Therefore,
Internet sources were used; appropriate data was downloaded and stripped of its
HTML formatting by creating a text file of it. All embedded computer instructional
text was removed. The data was then catalogued, arranged chronologically and
entered into the corpus in the form of the corresponding text files. All the daily newspapers included in the corpus have offline printed versions. Owing to the partial unavailability of online versions of weekly and monthly magazines, only daily newspapers have been taken into account for this study. The list of the websites from which the data was collected is given below.
[Figure: corpus composition by newspaper (partial data, types/raw tokens) — Dawn: 78,858 (24%); the News: 65,295 (21%)]
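The HTML-stripping step described above (removing markup and embedded script text, keeping only visible text) can be sketched with Python's standard-library parser; the class and function names here are our own illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def strip_html(html):
    """Return the visible text of an HTML page with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())

page = "<html><script>var x=1;</script><p>PM visits flood-hit areas.</p></html>"
text = strip_html(page)  # -> "PM visits flood-hit areas."
```

The output of such a function is the plain-text file that is then catalogued and entered into the corpus.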
This corpus is smaller than the major English corpora, but it seems to be large enough, taking into account our objective of creating a word list of Pakistani English. It is not claimed that the given corpus is perfectly balanced, but it is made up of the kind of texts that the potential users of our dictionary will have to deal with. Corpora are used to derive empirical knowledge about language, which can supplement, and frequently correct, native-speaker intuition.
The whole corpus was then tagged and lemmatised with AntConc. The result of this analysis was processed in order to restore the aspect of the original texts. We submitted the entire lemmatised corpus (51,845,143 words) to AntConc, a well-known text analysis tool, which was used to create a frequency list for the whole corpus. This frequency list has been corrected on some minor points. For example, frequent words written with hyphens that were split up during the lemmatisation process have been extracted from the original corpus and added to the list.
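The frequency-list step can be illustrated with a short sketch (our own Python illustration, not AntConc's actual implementation); note that hyphenated words are kept whole, matching the correction described above:

```python
import re
from collections import Counter

def frequency_list(text):
    """Count lowercase word tokens; hyphenated words stay single tokens."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(tokens)

freq = frequency_list("The court heard the case. The so-called case was adjourned.")
# freq.most_common(2) -> [('the', 3), ('case', 2)]
```

Sorting the counter by descending frequency yields exactly the kind of ranked word list reported in the next section.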
Results:
The data analysis with AntConc produced a word list of 89,843 word types out of 10,406,212 word tokens. The list of the top 1,000 most frequent word types is given below.
Limitations
There are several limitations to this study regarding the statistics of the corpus research. Urdu words occur quite frequently and regularly in Pakistani English newspapers, and it is not possible to take all the Urdu words into account in a 10.5-million-word corpus. Words that appear fewer than 10 times have therefore been neglected, keeping in view the scope of this study.
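The frequency cut-off described above can be sketched as follows (our own illustration; the threshold of 10 is the one stated in the text, and the sample words and counts are invented):

```python
from collections import Counter

def prune_rare(freq, min_count=10):
    """Drop entries that occur fewer than min_count times."""
    return Counter({w: c for w, c in freq.items() if c >= min_count})

# Invented sample counts: low-frequency Urdu borrowings fall below the cut-off
freq = Counter({"government": 150, "jirga": 12, "sahib": 3})
pruned = prune_rare(freq)  # keeps 'government' and 'jirga', drops 'sahib'
```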
Conclusion:
A corpus allows the lexicographer to make informed linguistic decisions about how to frame an entry and to analyse the lexical patterns associated with words in a more objective and consistent way. Caution is still required, because blindly following the corpus, no matter how carefully it may be constructed to represent the target language type accurately, can lead to oddities. We expect our motto, 'Corpus-based, but not corpus-bound', to hold good for many years to come.
References:
Leech, G. (1991) 'The State of the Art in Corpus Linguistics'. In K. Aijmer and B. Altenberg (eds), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman.