Exploring English with Online Corpora
An Introduction

Wendy Anderson and John Corbett

Palgrave Macmillan

Contents

List of Figures
List of Tables
Acknowledgements

1 Introducing Online Corpora
  Using a corpus
  Choosing your corpus
  The tradition of corpus analysis of English
  Five online corpora of English
  Analysing online corpora
  The organisation of this book

2 Interpreting Corpus Data
  Quantitative and qualitative analyses
  Representativeness
  Frequencies
  Normalisation of frequencies
  Mutual Information
  Other measures of collocation
  Key words
  Other statistical tests
  Summary

3 Exploring Lexis with Corpora
  What is a word in a corpus?
  Obtaining lexical information from a corpus
  Analysing lexical data
  The lexicogrammatical environment of words
  The semantic environment of words
  The wider social environment of words
  Summary

4 Exploring Grammar with Corpora
  Attitudes to grammar
  What is grammar?
  Word categories
  Exploring lexical items with a corpus
  Exploring grammatical items with a corpus
  Exploring phrases with a corpus
  From phrase to clause
  Complex sentences
  The grammar of speech and writing
  Delexicalised verbs
  Colligation
  Verb systems
  Data-driven grammar versus intuition
  Summary

5 Exploring Discourse with Corpora
  What is discourse?
  Using corpora to explore discourse
  Analysing spoken discourse
  Analysing written discourse
  Intertextuality
  Summary

6 Exploring Pronunciation with Corpora
  [...]
  The International Phonetic Alphabet
  Consonant sounds
  Vowel sounds
  [...]
  Accents of English online
  [...]
  The acquisition of [...]
  Intonation
  Summary

7 Contextualising Corpus Texts
  Text and co-text
  Text and context
  Metadata
  Corpora and sociolinguistics
  User-related variables
  Use-related variables
  Summary

8 Conclusion: Issues in the Use of Corpora in Teaching and Research
  Using corpora in teaching
  Developing your own research interests
  The future of online corpora

Appendix: Online Corpora
Bibliography
Glossary
Index

1 Introducing Online Corpora

This book has a number of intended readerships. It should appeal to

• students who are embarking on a formal course in English language, at upper school level or university;
• teachers who wish to know more about the technology available to support English Language education in schools;
• learners and teachers of English as a foreign language, who wish to explore how language is used in a vast quantity of 'authentic' texts; and
• general readers who are simply curious about how the English language works and about new methods of exploring this topic.

All these readers will be united by a common interest in the English language [...] in its many written and spoken forms. [...] no previous experience of the study of the English language. A substantial part of the book [...] on each of the basic issues that readers will encounter when they begin to explore how words work in English, how they form [...]. This introductory chapter [...] which will be discussed in greater detail in the chapters that follow. The glossary [...] Baker, Hardie and McEnery's Glossary of Corpus Linguistics (2006) [...] the present book [...].
Using a corpus

One of the most exciting developments in the exploration of English over the past 40 years has been the accumulation of vast electronic archives, or corpora, of written and spoken texts, stored on computers and manipulated easily and quickly by search programs. The availability of such corpora has changed the working practices of linguists, particularly those scholars interested in the meanings and patterns of words and phrases. Before the advent of electronic corpora, it was a long and arduous process to compile and search substantial bodies of data in order to confirm or challenge one's own intuitions as a language user. Nowadays vast bodies of data are available at the touch of a keyboard and the click of a mouse. Early beneficiaries of large, electronically searchable corpora of English were dictionary makers, who suddenly had hitherto inaccessible evidence for the meanings of words, and language learners, who also had access to materials constructed using corpus data that seemed to guarantee relevance and authenticity. Only in the last few years, however, have language corpora begun to be freely available online to the casual browser, the language learner and the novice student of linguistics.

Some years ago, Susan Hockey (2000, p. v) noted that 'The World Wide Web is [...] for looking at material but does not provide many tools for analysing and manipulating that material.' Since then, happily, the resources available on the Web have improved to the extent that students of language and linguistics can make considerable inroads into linguistic study solely by using freely available corpora, provided that they know where to look, have an appreciation of a few basic notions and know how to maximise the potential of language corpora with all their [...] and differences. So, as written and spoken corpora become available to an ever wider network of potential users, guidance is needed in the use of them to explore aspects of language. This book is designed to give that guidance.

The careful analysis of corpora can give insights into (a) how language is really used rather than how people think it is used and (b) what is common and typical use.
To see how true this is, you can try jotting down your immediate thoughts about the typical linguistic contexts of a word in real language and then compare yours with the sample evidence from a corpus such as the British National Corpus (BNC) or the Brigham Young University (BYU) Corpus of Contemporary American English, both of which we will introduce in greater detail later in this chapter. For example, what sort of nouns are modified by seedy? Now complete Task 1.

Task 1

1. Without reference to a dictionary, write down a definition of the adjective seedy and some typical examples of nouns that it describes.
2. On a computer, log on to the BNC interface at BYU (corpus.byu.edu/bnc).
3. In the Display section, click on List.
4. In the Search String section, type seedy in the Word(s) box.
5. The results will show you the number of occurrences of seedy in the 100 million words of the BNC.
6. Click on the word seedy in the Results section in order to see some of the contexts of usage. Note down some of the nouns that are modified by seedy, for example seedy [...].
7. To compare your results with 360 million words of recent American English, go to www.americancorpus.org and repeat Steps 3-6.

The results of these searches may confirm your expectations or surprise you. [...] might suggest shabbiness or [...]. Some of the examples from the corpus confirm such a definition. For example,

  Harold Macmillan's seedy and stagnant Britain

However, the majority of examples in the BNC suggest something that is sexually unsavoury about the term seedy. For example,

  seedy [...]
  seedy porn photos
  seedy abortionist
  rake's progress of late nights, seedy bars and relentless beer bellies
  the seedy world of prostitution
  the filmmaker's seedy little wife, pompous and [...]
  seedy northern beauty contest
  a seedy image of people who use porn
  [...] up her seedy career and wrote an exposé of the porn business

While by no means all the examples conform to this pattern of meaning (there is, for example, an innocent reference to British actor Will Hay's creation of the role of a seedy, blustering and ineffectual teacher, and even a reference to [...] seed-heads caught in [...]), many do.
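The kind of search described in Task 1 can also be approximated offline. The following is a minimal sketch in Python, not the method used by the BNC or BYU interfaces: it builds key-word-in-context (KWIC) lines for a search word, counts the word that immediately follows it, and converts a raw count into a rate per million words so that corpora of different sizes can be compared. The names `SAMPLE_TEXTS`, `tokenize`, `kwic`, `right_collocates` and `per_million` are illustrative assumptions, and the sample data are simply four of the phrases quoted in this section.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: four phrases quoted in this section.
# A real study would load full texts from a corpus instead.
SAMPLE_TEXTS = [
    "the seedy world of prostitution",
    "seedy porn photos",
    "seedy northern beauty contest",
    "a seedy image of people who use porn",
]


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def kwic(texts, node, width=3):
    """Return (left context, node, right context) for each hit of `node`."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == node:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines


def right_collocates(texts, node):
    """Count the word immediately to the right of each hit of `node`."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens[:-1]):
            if token == node:
                counts[tokens[i + 1]] += 1
    return counts


def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


if __name__ == "__main__":
    for left, node, right in kwic(SAMPLE_TEXTS, "seedy"):
        print(f"{left:>20}  {node}  {right}")
    print(right_collocates(SAMPLE_TEXTS, "seedy").most_common())
```

The word immediately to the right of seedy is only a rough stand-in for "the noun it modifies": in seedy northern beauty contest the next word is another adjective, so a fuller treatment would need part-of-speech information. The `per_million` helper is relevant to Step 7, since raw counts from a 100-million-word corpus and a 360-million-word corpus are not directly comparable.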
These instances of seedy are enough to suggest that in contemporary English the 'sexually unsavoury' element of the meaning of seedy is strong enough to carry into contexts in which it is not necessarily explicitly mentioned. For example,

  business friends who ran a seemingly dodgy or seedy little second-hand enterprise in some dark and [...]

This element of meaning has not yet, however, found its way into the online Oxford English Dictionary.

At this very basic level, then, checking your intuitions against corpus data can help you confirm or challenge your preconceptions about words, what they mean and how they are used. We build up our intuitions, of course, from long experience of language read and heard in linear fashion. We come across instances of words like seedy from time to time, perhaps once in a million words that we read or hear spoken. From each of these contexts we build up an evolving picture of what the word can mean in various contexts. By bringing a substantial number of these instances together in a single place, a corpus, we can become aware of patterns that remained below the surface of our consciousness. As John Sinclair, one of the pioneers of modern corpus linguistics, noted, 'The language looks rather different when you look at a lot of it at once' (Sinclair 1991, p. 100).

Choosing your corpus

There are some important issues that every corpus user should be aware of when choosing a corpus that will help you answer your own questions about language. We will return to some of them again in later pages. First of all, we need to consider what kind of collection of texts a corpus is and what differentiates it from other online textual resources.

The nature of a corpus

A corpus may be described quite simply as a body of texts; in fact, this is the literal meaning of the word. Most corpus linguists, however, prefer to be rather more specific and describe a corpus as a large, principled collection of texts, that is, one which has been collected for a purpose. This is now a widely accepted definition.
Even more precisely, a modern corpus is a sample of naturally occurring language, in electronic form, which has been designed to represent a language, language variety, register or genre. The key word here is 'designed', and this is what distinguishes a corpus from its close relative, the text archive. While an archive may have no predetermined structure and is not intended to represent something larger, a corpus is motivated, created with a linguistic purpose in mind. The Web as a whole may be used as a resource for linguistic exploration, and with the advent of online tools such as WebCorp, it has become much more straightforward to do so. Given its constantly changing size and nature, however, it is better viewed as a massive archive rather than a corpus. Nevertheless, it is certainly possible to create a corpus from material on the Web, by selecting texts according to particular criteria.

It is important to bear in mind the principled nature of corpora when using online resources for research. Since corpora are usually compiled with a purpose, it is necessary to match your needs as a corpus user against the [...] purposes of the corpus designers. Although many corpora do indeed contain vast quantities of text, [...]. The British National Corpus brings together 100 million words of text, [...] the 4 million words of the Scottish Corpus of Texts and Speech (SCOTS) and the 1.8 million words of the Michigan Corpus of Academic Spoken English (MICASE). However, [...]: the 100 million words of the TIME corpus consist entirely of texts taken from TIME magazine, and while this is a fascinating source of a particular type of journalese, it does not contain any spoken English, or other varieties of written English. On the other hand, the MICASE corpus contains exclusively spoken English, but only from the domain of academic genres such as university seminars and [...] lectures. Alongside its written documents, the SCOTS corpus contains 80,000 words of spoken-language data across a range of situations, from [...] to spontaneous child-parent interactions, but given its size it does not necessarily give you much linguistic data from each kind of situation.
Each corpus clearly has its uses, and you need to think about which one is most appropriate for your own explorations. Fortunately, most online corpora explicitly set out their design criteria and contents, so that you can assess their suitability for your purpose before you begin, and then carry out a small pilot investigation to see how searches work in practice.

This book contains a substantial number of tasks and activities that aim to use the most appropriate online resource or resources in each case. You should therefore find it easy to design your own follow-up studies and develop your own interests. It will normally be possible to attempt the same activity with a different corpus, or to use the same corpus but subtly modify the research question. This will give you a feel for the nature, design and strengths of the corpus you are using, and allow you to appreciate its (inevitable) weaknesses.

Representativeness

As hinted in the previous section, the issue of representativeness is crucial [...]. It is important to bear in mind that a corpus can only provide [...] the SCOTS corpus [...] in the UK in the eighteenth century [...] decade of writing, gender and age of writer and the level of formality of the letter.

Much of the time, however, the issue of representativeness is far from straightforward. It is particularly thorny if you want to make generalisations about a language as a whole. What are the right proportions of written language and spoken language? How do we deal with the fact that some texts are very widely read or heard, over either a long or a short period of time (for example, the Bible, the Queen's speech or the Presidential address, the fiction of J.K. Rowling, front-page newspaper articles), while others are read or heard by few people, or a certain type of people (for example, specialised scholarly monographs, [...], sermons in a small village church)? How can we make sure that the writing of people of all ages, genders, occupations, races, religions, types of upbringing and so on is fairly represented?

What about the sorts of language that are by nature private: personal diaries, intimate conversations, the words you mutter to yourself when working through the instructions to build a self-assembly bookcase? Can we justify not including them in a corpus? Most large corpora contain some ephemeral texts, spoken or written, that might seem odd out of context. On its own, for example, the following excerpt from the spoken section of the BNC might cause the casual user to question the principles of the designers:

  Mm mm. Mm. Mm. Mm, Mm, Ma ah ah! Daddy! Daddy! Hello! Dee dee dee, dee dee dee dee, dee dee dee dee dee, dee dee dee! What is she meant to think about Pauline? A yea yea yea yea yea yea! A, yea yea yea yea yea yea! Yea yea yea yea yea yea! Yea! Yea! Yea! Ooh ooh ooh ooh! Ooh ooh! Ooh! Ooh ooh! [...] Will you, shall I take her up tonight? If she'll let me. What? [BNC text KDG, conversation]
This spontaneous effusion in isolation might well tell us very little, but as part of a general, representative corpus, it can function as a legitimate thread in the larger tapestry of language.

Representativeness can simply be thought of as the inclusion in a corpus of a large number of texts in a large number of registers and genres. However, there do exist statistical measures of representativeness, which can guide the corpus designer, and these are discussed more fully in Chapter 2. Most general corpora try to capture a snapshot of different types of speaking and writing, from casual conversation to carefully composed legal documents. Further discussion of the nature of representativeness can be found in most introductory books (see the further reading at the end of this chapter), or the articles by Biber (1993) and Clear (1992).

Size

The issue of size is a separate but related issue to representativeness. For applications in lexicography, it is important that corpora should be large, usually tens if not hundreds of millions of words in size. O'Keeffe, McCarthy and Carter (2007, p. 4) observe that to obtain a sufficient range of prepositions that follow bargain, they needed to search a corpus of about 10 million words. Online corpora range from the 360 million words of the BYU Corpus of Contemporary American English (with more words being regularly added) to the 1.8 million words of MICASE. As O'Keeffe, McCarthy and Carter go on to state (ibid.),

  In terms of what constitutes a large or a small corpus, it depends on whether it is a spoken or a written corpus and what it is seeking to represent. For corpora of the spoken language, anything over a million words is considered to be large; for written corpora, anything below five million is quite small.

Nevertheless, a lot of interesting research can be done on the more frequent or 'core' words, constructions and features of a language with corpora of even a few tens of thousands of words. Size, along with design, is a feature of a corpus which should suit the purpose to which it is put. In the context of corpora for language learning, Chambers (2007, p. 9) notes that

  It is clear that, despite the corpus linguist's need for corpora to be as large as possible in support of the researcher's quest for the elusive quality of representativeness, several of those who focus on classroom applications and who have experience of working with learners have become aware of the usefulness of the small corpus.

Several online corpora are very large indeed, but many others are relatively small and precisely focused. Rather than necessarily having a large corpus, it is more important to have a good understanding of the nature of the corpus being used, and therefore of the level of faith which one can put in findings from it. A list of currently available free online corpora, with some details about their nature, is given in the Appendix.

Type of corpus

It should already be clear that the [...] of online corpora [...] can be diverse in nature. While it is not necessary to provide a comprehensive survey of corpus types here, it is useful to be aware of a few general distinctions.

One important distinction is between synchronic and diachronic corpora, that is, between corpora which contain texts from a particular time period (such as English from the 1990s) and seek, therefore, to provide a snapshot of language usage, and those which can be used to investigate language change over time, as would be the case for a corpus of personal correspondence between 1700 and 1900, for example, and as is the case for the TIME corpus, which spans most of the twentieth century. Diachronic corpora can be used to investigate neologisms, shifts in meaning or changes in grammatical constructions over time. A specific type of diachronic corpus is known as a monitor corpus, one which [...]
