
Semantic Variation in Idiolect and Sociolect: Corpus Linguistic Evidence from Literary Texts
Author(s): Max M. Louwerse
Source: Computers and the Humanities, Vol. 38, No. 2 (May, 2004), pp. 207-221
Published by: Springer
Stable URL: http://www.jstor.org/stable/30204935
Accessed: 27/05/2013 10:34

Springer is collaborating with JSTOR to digitize, preserve and extend access to Computers and the Humanities.


This content downloaded from 132.248.9.8 on Mon, 27 May 2013 10:34:35 AM All use subject to JSTOR Terms and Conditions

Computers and the Humanities 38: 207-221, 2004. © 2004 Kluwer Academic Publishers. Printed in the Netherlands.


Semantic Variation in Idiolect and Sociolect: Corpus Linguistic Evidence from Literary Texts


MAX M. LOUWERSE
Department of Psychology, Institute for Intelligent Systems, University of Memphis, 202 Psychology Building, Memphis, TN 38152, USA
E-mail: mlouwers@memphis.edu

Abstract. Idiolects are person-dependent similarities in language use. They imply that texts by one author show more similarities in language use than texts between authors. Sociolects, on the other hand, are group-dependent similarities in language use. They imply that texts by a group of authors, for instance in terms of gender or time period, share more similarities within a group than between groups. Although idiolects and sociolects are commonly used terms in the humanities, they have not been investigated a great deal from corpus and computational linguistic points of view. To test several idiolect and sociolect hypotheses a factorial combination was used of time period (Modernism, Realism), gender of author (male, female) and author (Eliot, Dickens, Woolf, Joyce) totaling 16 corresponding literary texts. In a series of corpus linguistic studies using Boolean and vector models, no conclusive evidence was found for the selected idiolect and sociolect hypotheses. In final analyses testing the semantics within each literary text, this lack of evidence was explained by the low homogeneity within a literary text.

Key words: author identification, coherence, computational linguistics, content analysis, corpus linguistics, idiolect, latent semantic analysis, literary period, sociolect

1. Introduction

Writers implicitly leave their signature in the document they write; groups of writers do the same. Idiolects are similarities in the language use of an individual, sociolects similarities in the language use of a community of individuals. Although various theoretical studies have discussed the notion of idiolects and sociolects (Eco, 1977; Lotman, 1977; Fokkema and Ibsch, 1987; Jakobson, 1987) and those theories are widely accepted in fields like literary criticism (Fokkema and Ibsch, 1987), semiotics (Eco, 1977; Sebeok, 1991) and sociolinguistics (Wardhaugh, 1998), hypotheses derived from those theories have not often been empirically tested. The present study will test some of these hypotheses, using different computational corpus linguistic methods.

2. Idiolects, Sociolects and Literary Periods

Both idiolect and sociolect depend on the linguistic code the writer uses. On top of this linguistic code other codes (e.g. narrative structures) can be built (Eco, 1977; Lotman, 1977; Jakobson, 1987). These complementary linguistic codes allow for texts to be culturalized. The best examples of these culturalized texts are artistic texts. These texts are thus secondary modeling systems made accessible by the primary (linguistic) modeling system. What is so special about aesthetic texts is that the author will try to deviate from currently accepted codes. By deviating from the norm texts become aesthetic. This way the deviation gradually becomes the norm of a group, and by deviating from the established norm new aesthetic texts will deviate (Martindale, 1990).

In practice it is very difficult to determine these multiple encodings. On the one hand, to determine the idiolect or sociolect from a literary text, one has to look at the complementary language codes. On the other, however, the product of the multiple modeling systems is just one linguistic system. Fokkema and Ibsch (1987) argue that although the text usually doesn't yield data about complementary language codes, we are likely to find differences in the language code by comparing texts with different complementary codes (e.g. the time period). In other words, on the one hand a top-down approach could analyze those texts that share certain aspects (e.g. time of first publication) and report their similarities. On the other hand, a bottom-up approach could compare linguistic codes of different texts, and report predictions about the idiolects and sociolects. The current study will use both.

We start with the top-down approach, following Fokkema and Ibsch's (1987) theory of Modernist conjectures. According to Fokkema and Ibsch historical developments change the way we think and hence will likely have an impact on the cultural system. For instance, historical events around WWI led to principal political changes and psychological and scientific depression. Similarly, WWII created another break in world history and in our thinking. It is therefore not surprising that Fokkema and Ibsch distinguish two literary periods on the basis of these historical breaks.
The first ranges from approximately 1850 to 1910 and is called Realism. The second ranges from approximately 1910 to 1940 and is called Modernism (see also Wellek and Warren, 1963). By analyzing a number of literary texts written during this 30-year time frame, Fokkema and Ibsch are able to define a Modernist code. This code is a selection of the syntactic, pragmatic, and semantic components of the linguistic and literary options the author has available. The semantic component receives by far most attention in their study. The Modernist semantic code consists of three central semantic fields: awareness, detachment and observation. These fields can be visualized as concentric circles that form a first semantic zone. The field awareness consists of words like awareness and consciousness. The semantic field of observation consists of words like observation, perception and window. Finally, detachment consists of words like depersonalization and departure. In addition to this first zone of semantic fields a second zone can be distinguished. This zone contains neutral semantic fields related to the idiolect of the author. A third zone, finally, contains semantic fields that are at the bottom of the Modernist semantic hierarchy, including economy, industry, nature, religion, agriculture. In addition, fields like criminality, psychology, science, sexuality and technology that were already present in pre-Modernist literature are expanded in Modernist texts. Throughout their study Fokkema and Ibsch show that literary texts written by authors in the period 1910-1940 share the pragmatic, syntactic and semantic components of the Modernist code.

The notion of Modernist code has various implications. First of all, it assumes that those texts written within the Modernist time frame (e.g. 1910-1940) share particular language features, including a prominent role for the selected semantic fields. Secondly, the notion of Modernist code implies that those literary texts written within a certain time frame share particular language features (period code). Thirdly, groups of authors share language features (what we earlier called sociolect) that could be defined in different ways: chronologically as Fokkema and Ibsch did, but other ways are also possible. For instance, we could group authors by gender. Finally, if groups of authors share language features, texts written by an individual author must share language features (what we earlier called idiolect).

Accordingly we can formulate four hypotheses: (1) an idiolect hypothesis that predicts that linguistic features in texts by one author should not significantly differ from each other, whereas those from texts by different authors should; (2) a sociolect-gender hypothesis that predicts that linguistic features of texts written by male authors should not significantly differ, but they should differ from texts written by female authors; (3) a sociolect-time hypothesis predicting that texts written within a particular time frame should not differ, but texts between time-frames should; (4) a Modernist-code hypothesis that predicts that Modernist texts should not only show homogeneity and differ from Realist texts, but they should also show a higher frequency of certain semantic fields. It needs to be kept in mind though that these hypotheses are stated according to a stringent criterion. For instance, it is of course possible for one author to shift in style between different periods (Watson, 1994). In the first experiment, the four hypotheses are tested using the frequency of semantic fields occurring in a series of literary texts.

3. Study 1: Semantic Field Comparisons Using a Boolean Model

Fokkema and Ibsch (1987) suggest a word frequency analysis to test the Modernist-code hypothesis. In our first study this generally accepted corpus linguistic method is used, by taking word frequency as a measure of semantic distinction. Such a method can be identified as a Boolean model (Baeza-Yates and Ribeiro-Neto, 1999). This model has very precise semantics using a binary decision criterion. It is the most commonly used method in content analysis and has been extensively used in corpus linguistics in general (Biber, 1988), in social psychology (Pennebaker, 2002) and in literary studies in particular (see Louwerse and Van Peer, 2002). The four hypotheses outlined in the previous section (the Modernist-code hypothesis, the sociolect-gender hypothesis, the sociolect-time hypothesis and the idiolect hypothesis) will be tested using the frequency of words in each of the semantic fields identified by Fokkema and Ibsch (1987).

3.1. MATERIALS

A total of sixteen texts were selected for the analysis following a 2 (literary period) x 2 (gender) x 4 (texts per author) design. The selection of authors followed Fokkema and Ibsch (1987). At the same time the choice of authors and texts was constrained by the availability of electronic versions of these texts (hence the focus on English texts only) and the preferred design (four texts from one author in each corresponding cell). Fokkema and Ibsch (1987, pp. 192, 203) consider George Eliot and Charles Dickens as representatives for Realist authors. For the literary period Modernism Virginia Woolf and James Joyce were selected (Fokkema and Ibsch, 1987, p. 10). Table I gives an overview of the sixteen texts classified by period and gender, indicating year of publication and number of words. Despite the various text archive initiatives (e.g. Project Gutenberg, The Oxford Text Archive, The Online Books Page) finding electronic versions of texts from authors discussed in Fokkema and Ibsch (1987) and finding four texts from each author remains a daunting task. Rather than being seen as the final complete set of corpora, the sixteen selected texts should be considered as a representative sample to study the relevant research questions.

Table I. Overview of 16 corpora used

Period      Gender   Author            Text                 Year of publication   Number of words
Realism     Female   George Eliot      Silas Marner         1861                   75,632
                                       Brother Jacob        1860                   20,863
                                       Middlemarch          1872                  322,594
                                       Mill on the Floss    1860                  214,441
            Male     Charles Dickens   Oliver Twist         1838                  162,025
                                       Tale of Two Cities   1859                  140,389
                                       David Copperfield    1850                  363,323
                                       Pickwick Papers      1836                  304,907
Modernism   Female   Virginia Woolf    Mrs. Dalloway        1925                   81,550
                                       The Waves            1931                   80,236
                                       Orlando              1928                   83,562
                                       To the Lighthouse    1927                   73,300
            Male     James Joyce       Exiles               1918                   31,067
                                       Dubliners            1914                   71,790
                                       Portrait*            1916                   90,086
                                       Ulysses              1922                  271,722

* Portrait is used as an abbreviation for Portrait of the Artist as a Young Man.

3.2. SEMANTIC FIELDS

All thirteen semantic fields Fokkema and Ibsch identify as characteristic for Modernist texts were used in this study: consciousness, observation, detachment, agriculture, criminality, economy, industry, nature, psychology, religion, science, sexuality and technology. Two graduate students in cognitive psychology populated the thirteen semantic fields with lemmata. A total of 592 lemmata were created from two sources. Roget's thesaurus was the source for the majority of the lemmata (59%). By selecting each of the semantic fields as a keyword in the thesaurus, large numbers of semantically related words were found. A second source was the WordNet database (41% of the lemmata), a large semantic network of nouns and verbs (Fellbaum, 1998). By using the label of the semantic field as a hypernym in WordNet, all related hyponyms were selected. Obviously, in a Boolean model where precise semantics is crucial the actual word form is essential and lemmata alone do not suffice. Therefore, for each of the 592 lemmata corresponding derivations and inflections were generated, resulting in a total of 1461 word forms.

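
The exact-match counting that a Boolean model implies can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the word-form lists, the tokenizer and the names `FIELD_FORMS`, `tokenize` and `boolean_field_counts` are assumptions made for illustration only. The sketch also shows why inflected forms have to be listed explicitly, since a form that is absent from the list ("observing") is simply not counted.

```python
import re
from collections import Counter

# Hypothetical miniature word-form lists (the study used 1461 forms in 13 fields).
FIELD_FORMS = {
    "observation": {"observe", "observes", "observed", "observation", "window", "windows"},
    "detachment": {"detachment", "departure", "departures"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def boolean_field_counts(text, field_forms):
    """Count the tokens that exactly match a listed word form, per semantic field."""
    tokens = Counter(tokenize(text))
    return {
        field: sum(tokens[form] for form in forms)
        for field, forms in field_forms.items()
    }


sample = "She observed the window. Observing it again, she thought of her departure."
counts = boolean_field_counts(sample, FIELD_FORMS)
print(counts)
```

In this toy sentence "observed" and "window" count towards observation, whereas "Observing" is missed because that inflection was not generated for the list.
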
3.3. RESULTS AND DISCUSSION

To account for different text sizes, a normalization procedure transformed the raw frequency to a basis per 1000 words of a text (Biber, 1988). The four hypotheses (idiolect, sociolect-gender, sociolect-time, and Modernist-code) were then tested on the frequency of the semantic fields in each of the sixteen texts. First, it needs to be established whether there are differences between all sixteen texts. If there are not, aggregating across authors, gender or time period would be futile. As predicted differences were found between the texts (H = 234.951, df = 15, p < 0.001, N = 23,376).

To test the idiolect-hypothesis, between-groups comparisons (texts between authors) as well as within-groups comparisons (texts by one author) were made. As predicted by the idiolect-hypothesis, the frequency of semantic fields indeed differed between authors (H = 18.49, df = 3, p < 0.001, N = 23,376). A Mann-Whitney U pairwise analysis, however, showed that this difference was due to comparing Dickens with Eliot, Woolf or Joyce (U > 0.01, Z = -3.361, p < 0.001, N = 11,688), whereas the idiolect hypothesis predicts that differences would occur between all authors. No significant differences were found between Eliot and Joyce, Eliot and Woolf or Woolf and Joyce. Identical results were obtained for those semantic fields limited to the first zone (awareness, observation, detachment). The predicted between-groups differences should be accompanied by a lack of within-group differences. However, within-groups comparison showed that texts by Eliot, Woolf and Joyce differed in frequency of semantic fields (all Hs > 34.47, df = 5,844, p < 0.001). Only the texts by Dickens confirmed the idiolect hypothesis with no differences in the frequency of the semantic fields, resulting in only very limited support for the idiolect-hypothesis.

Contrary to what was predicted by the sociolect-gender hypothesis, no differences were found between the female authors (Eliot, Woolf) and the male authors (Dickens, Joyce). Moreover, significant effects were found both within the male authors and female authors (H > 107.389, df = 7, p < 0.01, N = 11,688), suggesting a lack of homogeneity in gender and falsifying the sociolect-gender hypothesis. When the analysis only took into account the first zone semantic fields, an effect between the gender of authors was found (U = 0.01, Z = -2.550, p = 0.011, N = 9,184). Although this would support the sociolect-gender hypothesis, no homogeneity within male authors and female authors was found (H > 32.118, df = 7, p < 0.01, N = 4,592).
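
The normalization and the rank-based group comparison reported above can be sketched as follows. The sketch assumes that the reported H values are Kruskal-Wallis statistics, which the text does not state explicitly. The function names are hypothetical, the statistic is computed without a correction for ties, and no p-value is derived.

```python
def per_thousand(raw_count, text_length):
    """Normalize a raw frequency to a rate per 1000 words of text."""
    return 1000.0 * raw_count / text_length


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (without tie correction) for a list of samples."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


# Hypothetical normalized frequencies for three groups of texts.
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(per_thousand(5, 2000), kruskal_wallis_h(toy_groups))
```

With k groups the statistic is compared against a chi-square distribution with k - 1 degrees of freedom, which matches the df = 15 reported for the sixteen texts and df = 3 for the four authors.
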

For the sociolect-time hypothesis a difference was found between the Realist texts and the Modernist texts (U = 0.01, Z = -4.076, p < 0.001, N = 23,376).

Nevertheless, as with the sociolect-gender hypothesis, this support for the sociolect-hypothesis would only be meaningful if there is homogeneity in the frequency of the semantic fields between the texts within a period. But in both the Realist texts and the Modernist texts, differences between texts were found (H > 79.351, df = 7, p < 0.001, N = 11,688). The lack of homogeneity in Realist texts on the one hand and in Modernist texts on the other, falsifies the sociolect-time hypothesis.

Consequently, no conclusive support is found for the Modernist-code hypothesis, despite the fact that the significant difference between Realist texts and Modernist texts does show the predicted pattern. A higher frequency of the semantic fields is found in Modernist texts (Mean = 0.0023, SE = 0.001) than in Realist texts (Mean = 0.00247, SE = 0.001) but this pattern is supported by the Dickens and Woolf texts only and not by the other texts. In fact, the frequency of semantic fields in Eliot is almost as high as the frequency of fields in Woolf. Similarly, frequency of fields in Dickens is almost as high as the frequency of fields in Joyce.

Although Fokkema and Ibsch's hypotheses have strictly been followed, one could argue that the choice of the texts and contents of the semantic fields distorts the picture. To account for this possibility each of the sixteen texts was split into two halves and each of the halves were compared using a Wilcoxon Signed Ranks test. No significant difference was found for any of the texts, except for Eliot's Middlemarch (z = -4.47, p < 0.001; N = 1461), Dickens' Copperfield (z = -7.014, p < 0.001; N = 1461) and Joyce's Ulysses (z = -6.853, p < 0.001; N = 1461). There is the option of removing these texts from the analysis. However, given the importance of these texts for their respective categories, the importance of equal cell sizes and the difficulties in finding electronic versions of the required texts, we have to run the risk of making a Type II error in this study.

What can be concluded so far? Should all four hypotheses be abandoned because of a lack of evidence from the semantic field frequencies between the corpora? One problem in this study is the method. One of the obvious drawbacks of a Boolean model is its precise semantics (see Baeza-Yates and Ribeiro-Neto, 1999). The binary decision criterion means that if a word form is not found in the exact format as specified it will return a null result. It is feasible however that the semantic field is generally present in a paragraph rather than in the form of an exact string-match. The paragraph would then semantically approach the semantic field without a specific word matching the keyword. Similarly, a field might be present in the text but only by numbers of words loosely associated with the keywords used for the population of the semantic field. In other words, some kind of semantic grading scale is desirable. This is what is investigated in the second study.

4. Study 2: Semantic Field Comparisons Using a Vector Model

To overcome the limitations of binary decision making (Boolean model), degrees of similarities between the selected semantic fields and texts were measured using a vector model. One of the vector models commonly used in computational linguistics is latent semantic indexing (LSI), also called latent semantic analysis (LSA).
LSA is a statistical, corpus-based technique for representing world knowledge. It takes quantitative information about co-occurrences of words in paragraphs and sentences and translates this into an N-dimensional space. Generally, the term 'document' is used for these LSA units (paragraphs or sentences), but to avoid confusion in terminology, we will use 'text units' here. Thus, the input of LSA is a large co-occurrence matrix that specifies the frequency of each word in a text unit. LSA maps each text unit and word into a lower dimensional space by using singular value decomposition. This way, the initially extremely large co-occurrence matrix is typically reduced to about 300 dimensions. Each word now becomes a weighted vector on K dimensions. The semantic relationship between words can be estimated by taking the dot product (cosine) between two vectors. What is so special about LSA is that the semantic relatedness is not (only) determined by the relation between words, but also by the words that accompany a word (see Landauer and Dumais, 1997). In other words, terms like consciousness and mind will have a high cosine value (are semantically highly related) not because they occur in the same text units together, but because words that co-occur with one equally often co-occur with the other (see Landauer and Dumais, 1997; Landauer et al., 1998; Baeza-Yates and Ribeiro-Neto, 1999).

The method of statistically representing knowledge has proven to be useful in a range of studies. It has been used as an automated essay grader, comparing student essays with ideal essays (Landauer et al., 1998). Similarly, it has been used in intelligent tutoring systems, comparing student answers with ideal answers in tutorials (Graesser et al., 2000). LSA can measure the coherence between successive sentences (Foltz et al., 1998). It performs as well as students on TOEFL (test of English as a foreign language) tests (Landauer and Dumais, 1997) and can even be used for understanding metaphors (Kintsch, 2000). In this second study we therefore used the populated semantic fields and compared them not to the texts as in study 1, but to the semantic LSA spaces of those texts.

4.1. MATERIALS

The same sixteen texts from the authors Eliot, Dickens, Woolf and Joyce were used. For each text a semantic space was created using the default of 300 dimensions (see Graesser et al., 1999; for a most recent view see Hu et al., 2003). The weighting for the index terms was kept to the default log entropy. Similarly, the default feature of disregarding common words like functional items was used. The size of the text units was generally kept at paragraphs, except in the case for dialogs when lines were chosen as text unit size, with the size of each semantic space ranging from 600 text units to 1700 text units per text.
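
Two of the ingredients mentioned above, log-entropy weighting of the co-occurrence matrix and the cosine between two vectors, can be sketched in a few lines of Python. The sketch uses the common textbook formulation of log-entropy weighting, which is assumed rather than confirmed to be the variant used by the LSA software in this study. It deliberately omits the singular value decomposition step, and the names `log_entropy_weights` and `cosine` are hypothetical.

```python
import math


def log_entropy_weights(counts):
    """Apply log-entropy weighting to a term-by-unit count matrix.

    counts[i][j] is the frequency of term i in text unit j.
    Assumes at least two text units, so that log(n_units) is non-zero.
    """
    n_units = len(counts[0])
    weighted = []
    for row in counts:
        total = sum(row)
        # Global weight: 1 + sum_j p_ij * log(p_ij) / log(n), with p_ij = tf_ij / total.
        entropy = sum((c / total) * math.log(c / total) for c in row if c > 0)
        global_weight = 1.0 + entropy / math.log(n_units)
        # Local weight: log(1 + tf_ij).
        weighted.append([global_weight * math.log(1.0 + c) for c in row])
    return weighted


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy matrix: term 0 occurs in one unit only, term 1 is spread evenly over all units.
toy_counts = [[2, 0, 0], [1, 1, 1]]
toy_weighted = log_entropy_weights(toy_counts)
print(toy_weighted)
```

A term spread evenly over all text units receives a global weight of (approximately) zero, whereas a term concentrated in one unit keeps its full local weight; this is the sense in which the weighting discounts uninformative words before the dimensionality reduction.
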

4.2. SEMANTIC FIELDS

The same thirteen semantic fields were used as in the first study with the same population of lemmata (N= 592) and word forms (N= 1,461).

4.3. RESULTS AND DISCUSSION

After the LSA spaces per text were created, the 1,461 word forms for the thirteen semantic fields were compared with the LSA space, resulting in a cosine value between 0 and 1. The very large number of data points (i.e., number of word forms x the number of text units within each text) called for a more manageable analysis. Therefore, a sample of 65,000 data points per LSA output file was randomly selected using a simple random sampling technique.

The idiolect hypothesis predicted significant differences between texts from different authors, but no significant differences between the texts of one author. As predicted, between-author groups differed from each other (F(1, 1040000) = 31.82, p < 0.001), with cosine values highest for Eliot (Mean Cosine = 0.040, SD = 0.059) and Dickens (Mean Cosine = 0.040, SD = 0.044), lowest for Woolf (Mean Cosine = 0.012, SD = 0.056), with Joyce in between (Mean Cosine = 0.039, SD = 0.06). However, contrary to this prediction, texts written by Eliot showed significant differences between them (F(3, 260000) = 4.72, p < 0.003), as did texts by Dickens (F(3, 260000) = 10.16, p < 0.01) and Joyce (F(1, 260000) = 11.49, p < 0.001). Only the texts by Woolf seemed to be more homogeneous (F(1, 260000) = 2.61, p = 0.05).

The sociolect-gender hypothesis predicted that texts by male authors would differ from those by female authors, whereas no differences were predicted between texts within each of these two groups. Indeed, a difference was found between these two author groups (F(1, 1040000) = 5.15, p = 0.023), with higher cosine values for female authors (Mean Cosine = 0.039, SD = 0.050) than for male authors (Mean Cosine = 0.029, SD = 0.059). However, between the texts within each of the groups significant differences were also found (Male: F(1, 520000) = 62.28, p < 0.001; Female: F(1, 520000) = 31.23, p < 0.001).

The sociolect-period hypothesis predicted that no differences would be found between the texts within a period. Cosine values between texts of the Realist authors indeed did not show a difference (p = 0.5), but contrary to what was expected values between Modernist texts did show significant differences (F(1, 520000) = 8.67, p = 0.003). In addition, as predicted, differences between the two time periods were found (F(1, 1040000) = 89.16, p < 0.001). However, whereas the Modernist-code hypothesis predicted that the values for the semantic fields would be higher in the Modernist texts than in the Realist texts, an opposite effect is found with higher cosine values for the Realist texts (Mean Cosine = 0.040, SD = 0.051) than the Modernist texts (Mean Cosine = 0.026, SD = 0.059). In fact, this effect can be found for all possible interactions between the Realist and Modernist texts. Similar to the findings in the previous study identical results were found for the core set of three semantic fields (consciousness, observation, detachment) as for the overall set of semantic fields.
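
The data reduction step described above, drawing a simple random sample of cosine values from a much larger output file, can be sketched as follows. The data are synthetic, the sample size is scaled down from the 65,000 used in the study, and the function name `sample_cosines` and the fixed seed are assumptions added for reproducibility.

```python
import random
from statistics import mean


def sample_cosines(cosines, k, seed=0):
    """Draw a simple random sample (without replacement) of k cosine values."""
    if k >= len(cosines):
        return list(cosines)
    rng = random.Random(seed)  # fixed seed so that the sample can be reproduced
    return rng.sample(cosines, k)


# Hypothetical output file: one cosine per (word form, text unit) pair.
all_cosines = [((i * 37) % 100) / 1000.0 for i in range(10_000)]
subset = sample_cosines(all_cosines, 650)
print(len(subset), round(mean(subset), 4))
```

Sampling without replacement gives every data point the same chance of selection, so group means computed on the subset estimate the means of the full output file.
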

In this second study similar results were found as in study 1. Comparing the semantic fields to the LSA spaces of the texts rather than to the texts themselves allowed for a degree of similarity, but again results showed both between as well as within group differences. The only exception to this was the Realist texts not showing differences within the group itself. Although it is difficult to draw conclusions about the Modernist-code hypothesis without confirmation of the idiolect and sociolect hypotheses, the Modernist-code effect that was found did not match the prediction, with a higher average cosine value for Realist texts than for Modernist texts. In any case, no unambiguous evidence was found for any of the four hypotheses. This would suggest a lack of empirical support for the claims made by Fokkema and Ibsch (1987).

However, it is still possible that the idiolect and sociolect hypotheses hold and that only the Modernist-code hypothesis should be rejected, because of the selected semantic fields. In other words, we might still be able to find semantic similarities between groups of texts (idiolect, sociolect) but these similarities might not be contingent on the semantic fields. The idiolect and sociolect hypotheses may then be falsified by a particular selection of semantic fields, but not by the full semantic space of the texts. This option is what is explored in a third study.

5. Study 3: Between-Text Comparisons Using a Vector Model

Instead of comparing a predefined list of semantic fields to words in each of the corpora (study 1) or the semantic spaces of those corpora (study 2), LSA spaces of each text were compared with each other. In other words, each text unit (paragraph or sentence) in each text was compared with each text unit (paragraph or sentence) of another text, resulting in a cosine value for each comparison. The higher the cosine value, the more similar the text units are (ranging from 0 to 1). According to the idiolect hypothesis the semantic universes of texts by one author do not show differences, whereas the semantic universes of texts between authors do. Similarly, within-gender or within-time texts are expected not to differ, but between-gender or between-time texts are. Similarities in cosine values between texts indicate homogeneity of the content. In addition, high cosine values are indicators of semantic similarities.
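
The between-text comparison can be illustrated with a small Python sketch that pairs every text unit of one text with every text unit of another and averages the resulting cosines. It assumes that all text-unit vectors live in one shared vector space, which is a simplification made here and not a detail given in the study; the toy vectors and the names `mean_between_text_cosine` and `similarity_matrix` are hypothetical.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_between_text_cosine(units_a, units_b):
    """Mean cosine over every pairing of a unit from text A with a unit from text B."""
    values = [cosine(u, v) for u, v in product(units_a, units_b)]
    return sum(values) / len(values)


def similarity_matrix(texts):
    """Text-by-text table of mean cosines (16 x 16 cells for sixteen texts)."""
    names = list(texts)
    return {
        (a, b): mean_between_text_cosine(texts[a], texts[b])
        for a in names
        for b in names
    }


# Toy stand-ins for two texts, each a list of text-unit vectors.
toy_texts = {"x": [[1.0, 0.0], [1.0, 0.0]], "y": [[0.0, 1.0]]}
toy_matrix = similarity_matrix(toy_texts)
print(toy_matrix)
```

Author-matching and author-non-matching comparisons then correspond to different subsets of cells of this table, which can be averaged or tested against each other.
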

5.1. MATERIALS

The same LSA spaces of the sixteen texts were used as those created for the second study.

5.2. RESULTS AND DISCUSSION

In this study (LSA spaces of) texts were compared to other texts instead of to a word list as in the first studies, resulting in 256 (16 x 16) sets of cosines representing the semantic relationship between texts. A comparison of the author-matching texts with the author-non-matching texts showed differences in all four cases (All Fs (1, 6000000) ≥ 25,1250, p < 0.001). When the cosine values (indicating similarity in content) were compared per idiolect, texts by Dickens differed more between themselves than between texts from other authors. The same is true for texts by Joyce. In other words, only for half of the authors (Eliot and Woolf) the author-matching texts had higher average cosine values than the author-non-matching texts.

In order to test the sociolect-gender hypothesis, texts by male authors were compared with the texts by female authors. Differences between groups were found, suggesting evidence for the sociolect-gender hypothesis (F(1, 3640000) = 392989.6, p < 0.001). However, as with the unpredicted results in the idiolect hypothesis, significant differences were also found within each gender group (All Fs (1, 1820000) ≥ 31747.8, p < 0.001). Overall, texts by female authors had a higher average cosine value, suggesting a resemblance in content, than texts by male authors (female: Mean Cosine = 0.135, SD = 0.121; male: Mean Cosine = 0.058, SD = 0.103).

As predicted by the sociolect-period hypothesis, significant differences were found between Realist-matching texts versus Modernist-matching texts (F(1, 3640000) = 12246.763, p < 0.001). But again, unexpected differences were also found within each period (All Fs (1, 1820000) ≥ 1579.459, p < 0.001). Interestingly, average cosine values were higher for Realist texts than for Modernist texts (Realist: Mean Cosine = 0.107, SD = 0.128; Modernist: Mean Cosine = 0.093, SD = 0.110). This suggests that despite the fact that there are differences between the Realist texts, they are semantically more similar to each other than Modernist texts are.

In sum, for some authors (Eliot and Woolf) similarity in content can be found, supporting an idiolect hypothesis. For other authors (Dickens and Joyce) texts differ within one author. This finding is even more interesting when we look at the sociolect-gender hypothesis. Texts by female authors show more similarities than texts by male authors. Similarities in content were also found in the sociolect-time hypothesis: in both Realist and Modernist texts more similarities were found between the texts within a period than between periods. Furthermore, Modernist texts show a greater diversity when compared to each other than Realist texts, suggested by the lower cosine values for the former compared to the latter.

6. Study 4: Within-Text Comparisons Using a Vector Model

Up to now, we have found no conclusive evidence for the idiolect-hypothesis or either of the sociolect-hypotheses. Should we therefore abandon all four hypotheses? So far we have assumed that there is homogeneity in the semantics within a text. This has largely been supported by the within-text analysis when the two text halves of each text were compared. The question however is to what extent this assumption is correct. It might be the case that within a text there are differences between the semantics. If that is the case, the lack of evidence for the idiolect and sociolect hypotheses might be explained by the heterogeneity within a text. At the same time, the hypotheses can be tested by comparing homogeneity values by author, gender and period.

For this purpose an LSA analysis was carried out comparing each text unit (i.e. paragraph or sentence) to every other text unit (paragraph or sentence) within a text. If texts are generally semantically consistent (the content of the text units in the text is similar), higher cosine values will be found. Texts that differ in the semantics, and are therefore semantically inconsistent, will have lower cosine values.
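
A within-text homogeneity score of this kind can be sketched as the mean cosine over all unordered pairs of distinct text units in one text, followed by averaging per author, gender or period. The vectors and scores below are toy values, and the names `homogeneity` and `group_means` are hypothetical; the study's own procedure additionally sampled the cosine values before the analysis.

```python
import math
from itertools import combinations
from statistics import mean


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def homogeneity(units):
    """Mean cosine over all unordered pairs of distinct text units (needs >= 2 units)."""
    return mean(cosine(u, v) for u, v in combinations(units, 2))


def group_means(scores, groups):
    """Average the per-text homogeneity scores within each group label."""
    by_group = {}
    for text, score in scores.items():
        by_group.setdefault(groups[text], []).append(score)
    return {label: mean(values) for label, values in by_group.items()}


# Toy text with three units: two identical in direction, one orthogonal.
toy_units = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_scores = {"t1": 0.2, "t2": 0.4, "t3": 0.1}  # hypothetical per-text scores
toy_groups = {"t1": "A", "t2": "A", "t3": "B"}  # hypothetical author labels
print(homogeneity(toy_units), group_means(toy_scores, toy_groups))
```

A low score for a single text indicates that its text units differ considerably from one another, which is the kind of internal heterogeneity that could mask idiolect or sociolect effects in comparisons between texts.
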

6.1. MATERIALS

The same LSA spaces of the literary texts from the second and third studies were used.

6.2. RESULTS AND DISCUSSION

As in the previous study, the number of data points was reduced by randomly selecting 65,000 cosine values per text using a simple random sampling technique. An ANOVA comparing the idiolects showed a significant difference between the four authors (F(1, 1040000) = 2650.68, p < 0.001). Contrary to what was predicted, differences were also found between the texts for each of the authors (Eliot: F(3, 260000) = 1305.17, p < 0.001; Mean Cosine = 0.034, SD = 0.07; Dickens: F(3, 260000) = 645.62, p < 0.001; Mean Cosine = 0.020, SD = 0.06; Woolf: F(3, 260000) = 167.80, p < 0.001; Mean Cosine = 0.019, SD = 0.045; Joyce: F(3, 260000) = 2899.20, p < 0.001; Mean Cosine = 0.021, SD = 0.073). This again suggests no support for the idiolect hypothesis. In the LSA comparison between the texts of one author described in the previous study, most homogeneity was found in the texts by Eliot. This effect was replicated in the internal homogeneity analysis, suggested by the highest LSA cosine values.

For the sociolect-gender hypothesis a significant effect was found between gender (F(1, 1040000) = 1699.02, p < 0.001), with texts written by female authors having higher cosine values (Mean Cosine = 0.026, SD = 0.058) than those written by male authors (Mean Cosine = 0.020, SD = 0.068). In addition, differences were found for within-gender texts (female: F(1, 520000) = 1929.72, p < 0.001; male: F(1, 520000) = 1660.43, p < 0.001).
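The two steps named above, drawing a fixed-size simple random sample of cosine values per text and comparing groups with a one-way ANOVA, can be sketched in Python. This is an illustration under assumptions, not the procedure or software used in the study: the function names `sample_cosines` and `one_way_anova_f`, the seed, and the toy numbers are hypothetical, and the F statistic follows the textbook definition (between-group mean square divided by within-group mean square).

```python
import random

def sample_cosines(values, k, seed=0):
    """Simple random sample of k values, drawn without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(values, k)

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    # Assumes some within-group variance; otherwise this division fails.
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical toy groups, far smaller than the 65,000 values per text.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
toy_f, toy_df_between, toy_df_within = one_way_anova_f(toy_groups)

# Drawing 10 of 100 made-up values as a stand-in for the per-text sample.
toy_sample = sample_cosines([i / 100 for i in range(100)], 10)
```

With 16 texts and 65,000 sampled values each, the pooled data set holds 1,040,000 cosine values, which matches the second degrees-of-freedom figure reported for the between-author and between-gender comparisons.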


As for the sociolect-period hypothesis, differences were found between periods (F(1, 1040000) = 2563.28, p < 0.001), but also between the authors within a period (Realist: F(1, 520000) = 4788.1, p < 0.001; Modernist: F(1, 520000) = 61.08, p < 0.001). In the previous analysis we saw that Realist texts share more semantic concepts between them. Similarly, for the semantics between parts of the Realist texts cosine values are higher than for Modernist texts (Realist: Mean Cosine = 0.027, SD = 0.067; Modernist: Mean Cosine = 0.020, SD = 0.060).

An explanation for the results we have found in the previous studies might indeed lie in the internal semantic homogeneity. This analysis replicated the finding in Study 3 that Modernist texts seem to be more diverse than Realist texts. This is an important finding for corpus linguistic analyses of modern literary texts in general, but also for the validity of the Modernist-code hypothesis. If it is true that Modernist authors experiment more with their literary products (see Fokkema and Ibsch, 1987), then it is still possible to keep up a Modernist hypothesis: Certain semantic fields might still be more prominent in these texts. However, their overall frequency is low because Modernist texts miss the homogeneity Realist texts have.

7. Conclusion

We tested hypotheses initially brought forward by Fokkema and Ibsch (1987), who argued that selected authors use selected semantic fields. The word frequency of the contents of these fields would predict frequency patterns in idiolect and literary period. We tested idiolect, sociolect-gender, sociolect-time and Modernist-code hypotheses derived from this study using Boolean models and vector models. A total of 16 literary texts were used, balanced across author (Eliot, Dickens, Woolf, Joyce), gender (female, male) and literary period (Realism, Modernism). Two models were used to test these hypotheses, a binary Boolean model and a scaling vector model. Both methods are very common in the field of corpus linguistics (see Louwerse and Van Peer, 2002 for an overview).

Initial Boolean analyses suggested no evidence for any one of the four hypotheses, possibly because of the semantic fields that were selected and the Boolean method that was used. Results were replicated in a vector model using the semantic fields. A vector analysis, comparing the general content between the groups of texts and comparing the various parts within each text, showed that the semantic homogeneity in literary texts is an important confounding variable. Because of this, drawing conclusions from a literary text as a whole, rather than its parts, might be problematic. A vector model can partly solve this problem, by taking into account every part of the text. But drawing conclusions from semantic similarities within an author can be equally problematic, because authors tend to change their style and semantic space between texts. Similarly, semantic similarities within a literary period are difficult to determine because of the overall variations. As pointed out in the beginning of this study, the lack of internal homogeneity in one text, between texts and between authors can be explained by the (semantic) deviation from the norm the author tries to establish. These variations are exactly what make the idiolect and sociolect of literary texts unique, and are in fact what make those texts literary.

Acknowledgements

This research was partially supported by the National Science Foundation (SBR 9720314, REC 0106965, REC 0126265, ITR 0325428) and the Institute of Education Sciences (IES) (R3056020018-02). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the funding agencies.

Note

1 Whereas the Modernist-code hypothesis, sociolect-time and idiolect hypotheses are directly derived from Fokkema and Ibsch (1987), the sociolect-gender hypothesis is not. However, given the theory of a group code, a sociolect-gender hypothesis seems justified.

References
Baeza-Yates R., Ribeiro-Neto B. (eds.) (1999) Modern Information Retrieval. ACM Press, New York, 513 p.
Biber D. (1988) Variation Across Speech and Writing. Cambridge University Press, Cambridge, UK, 315 p.
Eco U. (1977) A Theory of Semiotics. Indiana University Press, Bloomington, 368 p.
Fellbaum C. (ed.) (1998) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 500 p.
Fokkema D., Ibsch E. (1987) Modernist Conjectures. A Mainstream in European Literature 1910-1940. Hurst, London, 330 p.
Foltz P.W., Kintsch W., Landauer T.K. (1998) The Measurement of Textual Coherence with Latent Semantic Analysis. Discourse Processes, 25, pp. 285-307.
Graesser A., Wiemer-Hastings P., Wiemer-Hastings K., Harter D., Person N., and the Tutoring Research Group (2000) Using Latent Semantic Analysis to Evaluate the Contributions of Students in Autotutor. Interactive Learning Environments, 8, pp. 149-169.
Hu X., Cai Z., Franceschetti D., Penumatsa P., Graesser A.C., Louwerse M.M., McNamara D.S. and the Tutoring Research Group (2003) LSA: The First Dimension and Dimensional Weighting. Proceedings of the 25th Annual Conference of the Cognitive Science Society. Erlbaum, Mahwah, NJ.
Jakobson R. (1987) Linguistics and Poetics. In Jakobson R. (ed.), Language in Literature. Harvard University Press, Cambridge, MA, pp. 62-94.


Kintsch W. (2000) Metaphor Comprehension: A Computational Theory. Psychonomic Bulletin and Review, 7, pp. 257-266.
Landauer T.K., Dumais S.T. (1997) A Solution to Plato's Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104, pp. 211-240.
Landauer T.K., Foltz P.W., Laham D. (1998) Introduction to Latent Semantic Analysis. Discourse Processes, 25, pp. 259-284.
Lotman J. (1977) The Structure of the Artistic Text. University of Michigan, Ann Arbor, 300 p.
Louwerse M.M., Van Peer W. (eds.) (2002) Thematics: Interdisciplinary Studies. John Benjamins, Amsterdam/Philadelphia, 430 p.
Martindale C. (1990) The Clockwork Muse. Basic Books, New York, 411 p.
Pennebaker J.W. (2002) What Our Words Can Say about Us: Towards a Broader Language Psychology. Psychological Science Agenda, 15, pp. 8-9.
Project Gutenberg, [http://www.ibiblio.org/gutenberg].
Sebeok T.A. (1991) A Sign Is Just a Sign. Indiana University Press, Bloomington, 178 p.
The Online Books Page, [http://onlinebooks.library.upenn.edu].
The Oxford Text Archive, [http://ota.ahds.ac.uk].
Wardhaugh R. (1998) An Introduction to Sociolinguistics. Blackwell, Oxford, 464 p.
Watson G. (1994) A Multidimensional Analysis of Style in Mudrooroo Nyoongah's Prose Works. Text, 14, pp. 239-285.
Wellek R., Warren A. (1963) Theory of Literature. Cape, London, 382 p.
