
SOCIAL AND LINGUISTIC STRUCTURE

Language structure is partly determined by social structure.

Gary Lupyan, University of Pennsylvania
Rick Dale, University of Memphis

Gary Lupyan 3401 Walnut Street, Suite 400A University of Pennsylvania Philadelphia, PA 19104

Tel: 917-843-4868 Fax: (215) 573-9247 Email: glupyan@gmail.com

Abstract

The languages of the world differ greatly both in their syntactic and morphological systems and in the social and ecological environments in which they exist. In the present work, we challenge the long-held assumption that language grammars are unrelated or only spuriously related to the social environments in which they are found (Chomsky, 1995) [1]. Based on a statistical analysis of 2,236 languages, we report strong relationships between linguistic factors related to morphological complexity and demographic/socio-historical factors such as the number of language users, geographic spread, and degree of language contact. The analyses suggest that languages spoken by large groups have simpler inflectional morphology than languages spoken by smaller groups, as measured on a variety of factors such as case systems and complexity of conjugations. Additionally, languages spoken by large groups are much more likely to use lexical strategies in place of inflectional morphology to encode evidentiality, negation, aspect, and possession. These results are explained using principles borrowed from evolutionary biology. Just as the divergence of species is due not only to genetic drift but also to adaptation to different niches (ecological speciation), so language structures appear to adapt to the environment (niche) in which they are being learned and used. As adults learn a language, features that are difficult for them to acquire are less likely to be passed on to subsequent learners. Languages used for communication in large groups that include adult learners appear to have been subjected to such selection. Conversely, the morphological complexity common to languages used in small groups increases redundancy and may facilitate language learning by infants.
The proposed Linguistic Niche Hypothesis has implications for answering the broad question of why languages differ in the way they do and also makes empirical predictions regarding language acquisition capacities of children versus adults.

Although the most populous languages are spoken by millions of people spread over vast geographic areas, most languages are spoken by relatively few individuals over comparatively small areas. The median number of speakers for the 6,912 languages catalogued by the Ethnologue is only 7,000, compared to the mean of over 828,000 [2]. Similarly, for the 2,236 languages in our sample (Figure 1), the median area over which a language is spoken is about the size of Luxembourg or San Diego, California (948 km²). The mean area is about the size of Austria or the US state of Maryland (33,795 km²). Languages also differ dramatically in the ratio of individuals who speak the language natively (L1 speakers) to those who learned it later in life (L2 speakers) (Table 1). Although there are numerous counter-examples (Supplementary Note 1), languages spoken by millions of people have a greater likelihood of coming into contact with other languages and of having numerous nonnative speakers compared to languages spoken by only a few thousand people. This is not surprising: a language spoken by more people is more likely to encompass a larger and more diverse area and include speakers from varying ethnic and linguistic backgrounds. Conversely, languages spoken by a thousand or even fewer individuals tend to be spoken in a highly circumscribed locale (Supplementary Note 2). Overall, languages with smaller speaker populations are more likely to be spoken by more socially cohesive groups [3] than languages that have millions of speakers.

Just as there are socio-historical and demographic differences among the world's languages, there are also vast differences among languages in grammar and syntax. For example, languages differ in the devices used to convey syntactic relations (who did what to whom). Some languages rely on a fixed word order (Subject-Verb-Object in the case of English), while other languages (e.g., German, Polish) allow much more flexibility in word order and rely on case markings to signal which noun fills the role of subject, object, etc. [4] More generally, languages
differ in the amount of information conveyed through inflectional morphology compared to the amount of information conveyed through non-grammatical devices such as word order and lexical constructions. Compare, for example, the morphological marking of aspect in Russian, "Ya vypil chai" (I PERFECTIVE+drank tea), with the English lexical strategy, "I finished drinking the tea". Other domains exhibiting such differences between lexical and morphological strategies include tense, aspect, evidentiality, negation, plurality, and expressions of possibility. Languages with richer morphological systems are said to be more overspecified [5-7]. For instance, of the languages that encode the past tense inflectionally, about 20% have past tenses that explicitly mark remoteness distinctions. For example, Yagua, a language of Peru, has inflections that differentiate 5 levels of remoteness. A verb denoting an event that happened only a few hours ago takes the suffix -jsiy; an event that happened a day previous to the utterance requires a different suffix, -jay; an event that occurred a week to a month ago requires a still different suffix, -siy, etc. [8]. Of course, languages without these grammatical distinctions can express them lexically, as in English: "I broke my foot a few years ago." On the other hand, when semantic distinctions are encoded grammatically, speakers are generally obligated to make them; hence, a sentence concerning the past will have its remoteness specified even when it may not be relevant to the discourse. In the English example above, speakers have the option to omit remoteness information, but are obligated to express the grammatically encoded past tense (which leaves remoteness to context). In Mandarin or Thai, which express both tense and remoteness lexically, speakers have the option of omitting the past tense entirely. Of the 222 languages in our corpus for which tense information is available, 40% do not encode past tense inflectionally [9].

The degree and specificity of inflectional encoding can reach astounding levels. In Karok, a language of Northwestern California, we find grammaticalized verbal suffixes for various types of containment: pa:-kirih "throw into fire", pa:-kurih "throw into water", pa:-ruprih "throw in through a solid" (the affixes are unrelated to the nouns water, fire, etc.) [10]. Clearly, such elaboration does not arise from communicative necessity. Researchers have long been puzzled by the reasons why some languages abound in such overspecification while others eschew it, particularly in cases of closely related languages. For example, in comparing English and German we find that where the surface structures of English and German contrast, English tends to leave more to context [6]; thus, German speakers are forced to make certain semantic distinctions which can regularly be left unspecified in English (ref. 6, p. 28). For example, German obligatorily specifies the direction of motion in the place adverbs here/there/where. Compare: hier/her; dort/hin; wo/wohin. English can specify direction using "to" and "from" ("where to" versus "where from"), but such specification is optional and is generally left to context [11]. Grammatical divergence between languages has typically been attributed to drift: as a population speaking an ancestral Germanic language splits into separate groups, their language gradually diverges, with one branch becoming English and the other German [12]. Such accounts do not explain why English came to shed much of its morphology while German retained it.

Attempts to establish relationships between social and linguistic structure date back at least a century [13,14]. Recent work has provided some support for the idea that extralinguistic factors (e.g., degree of ecological risk) play a role in some aspects of language, such as the varying levels of linguistic diversity found in different parts of the world [15,16]. More speculative accounts have argued that developments that affect language transmission over historical time can affect language structure itself [2,4,26,37]. It has been argued that languages with histories of adult
learning are morphologically simpler, less redundant, and more regular/transparent [2,6,11,12,13,37]. This argument has been made most forcefully and convincingly for Creole languages [17], but it has been speculated that any language learned by a substantial number of adults becomes simplified due to the "lousy language learning abilities of the human adult" [19]. The evidence for such linguistic simplification has been descriptive, consisting of selected examples and examinations of the grammatical inventories of a small number of languages [6,12,19,21]. Thus, at present, there is no convincing evidence of global relationships between linguistic structure and non-linguistic factors and no framework within which to understand such relationships. An additional limitation of previous work is that it fails to explain why morphological complexity and grammatical overspecification arise in the first place. That is, why aren't all languages as morphologically simple as those that have been argued to be heavily shaped by adult learning (e.g., English [11])? The present work aims to (1) establish whether non-spurious relationships exist between social and linguistic structure by using large-scale demographic and linguistic databases, and (2) provide a framework within which to understand the reported results: the Linguistic Niche Hypothesis, which proposes mechanisms for morphological complexity and its simplification and provides novel perspectives on linguistic diversity and language change.

In assessing the relationship between social and linguistic structure, it is useful to distinguish two main contexts in which languages are learned and used: the exoteric and the esoteric niches [3,23]. The exoteric linguistic niche contains languages with large numbers of speakers, thus requiring that the languages serve as interfaces for communication between strangers (Supplementary Note 3). Compared to speakers of esoteric languages, speakers of languages in the exoteric niche are more likely to (1) be nonnative speakers or have learned the language from nonnative speakers, and (2) use the language to speak to outsiders (individuals from different ethnic and/or linguistic backgrounds). The exoteric niche includes languages like English, Swahili, and Hindi, while the esoteric niche includes languages like Tatar, Elfdalian, and Algonquin. The analyses described below aim to test whether systematic relationships exist between grammars and the social contexts within which languages are spoken.

Methods

To assess relationships between social and linguistic structure, we constructed a dataset that combined social/demographic and typological information for 2,236 languages. The dataset was constructed by combining typological data from the World Atlas of Language Structures (WALS) [24] with the following demographic and ecological variables: speaker population, geographic spread, and number of linguistic neighbors. Because direct measures of the esotericity of communities are not available on a large scale, these demographic variables served as proxies. Speaker population data for each language were retrieved from the Ethnologue [2] and included the summed total of speakers in all the countries in which the language is spoken. Because nonnative speaker population estimates are unreliable and unavailable for most languages in our sample, our population estimates were conservative, including only native speakers, as reported by the Ethnologue. Populations of fewer than 50 speakers were set to 50. Area (km²) for each language was calculated from data provided by Global Mapping International [25]. These data contained boundary information (language polygons) for most of the languages in WALS. The area measure was the sum of all the geographic regions in which the language is spoken. We also used the boundary information to calculate degree of contact by counting the number of languages whose global polygons are contained in, overlapping with, or contacting a given language's polygons. For example, although English originates in the British Isles, the fact that it is spoken in North America and Australia means that its neighbors include the extant indigenous languages on those continents.

Selecting Typological Features for Analysis

Our analysis focused on typological factors most relevant to inflectional morphology, with particular emphasis on continuous variables such as the number of inflectional case markings or the inflectional synthesis of verbs (the number of different types of information that can be inflectionally encoded by verbal affixes), measured in categories per word [26]. An additional guide for feature selection was the ability to make a priori predictions about the level of morphological complexity of a given feature. For instance, plurality (feature 16) can be coded using prefixes, suffixes, some combination of the two, a plural word, a plural clitic, reduplication, or by using non-conventionalized lexical means. Clearly, languages that have morphological coding of plurality are more grammatically specified in this respect than languages that do not. We made no a priori predictions about the relative morphological complexity of prefixes versus suffixes versus reduplication. However, our analyses revealed that demographic factors in fact correlated strongly with prefixing versus suffixing strategies in a range of linguistic domains, and we include these additional analyses and their relation to the Linguistic Niche Hypothesis in the Supplementary Materials. Although our final corpus included 2,236 languages, no feature was defined for all the languages in the database. The results presented in Table 2 are based on a median of 218 languages per feature analyzed (range: 112-1,074).
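
As an illustration of the demographic preprocessing described in the Methods, the following Python sketch floors speaker populations at 50, log-transforms them, and counts linguistic neighbors. It is a minimal sketch under stated assumptions, not the procedure actually used: the base-10 logarithm, the function names, and the use of axis-aligned bounding boxes in place of the true language polygons are illustrative choices that do not come from the text.

```python
import math

MIN_POPULATION = 50  # floor stated in the Methods


def log_population(speakers):
    """Floor the speaker count at 50, then take log10 (the base is an assumption)."""
    return math.log10(max(speakers, MIN_POPULATION))


def boxes_touch(a, b):
    """True if two boxes (min_x, min_y, max_x, max_y) overlap, touch, or nest.

    Bounding boxes are a hypothetical stand-in for real language polygons.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def count_neighbors(name, regions):
    """Count other languages with any region touching any region of `name`.

    `regions` maps a language name to a list of bounding boxes, one per
    geographic region in which the language is spoken.
    """
    own = regions[name]
    return sum(
        1
        for other, boxes in regions.items()
        if other != name and any(boxes_touch(a, b) for a in own for b in boxes)
    )
```

Because a language's regions are pooled before counting, a language spoken on several continents accumulates neighbors from each of them, which mirrors the English example given in the Methods.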

Results

Table 2 shows the results of three models used to explore the relationships between typological features and measures of population, geographic spread, and degree of linguistic contact (Supplementary Note 4). For most (20/26) of the WALS features that were most relevant to inflectional morphology, demographic variables (population, area over which a language is spoken, and degree of linguistic contact) combined with geographic covariates (latitude/longitude) proved to be better predictors of the linguistic features than geographic location alone (Supplementary Note 5). The results provide overwhelming evidence against the null hypothesis that language structure is unrelated to demographic factors. Across a wide range of linguistic features, a systematic relationship between demographic and typological variables was found. Although the three demographic predictors are not independent (intercorrelations range from .5 to .6), including all three predictors helps to ensure that linguistic-demographic relationships are not spurious. We summarize the findings below (parenthetical numbers reference entries in Table 2). The Supplementary Materials include more detailed descriptions of the linguistic features. Compared to languages spoken in the esoteric niche (smaller population, smaller area, fewer linguistic neighbors), languages spoken in the exoteric niche:

1. Are more likely to be classified by typologists as isolating languages (those in which grammatical functions are fulfilled by markers not bound to the stem, e.g., modals, lexical items, or particles) than as concatenative/fusional languages (those in which grammatical markers show a greater degree of fusion to the stem, e.g., affixes and clitics) (1).

2. Contain fewer case markings (3), and have case systems with a higher degree of case syncretism (4) (further reducing the number of morphological distinctions). Nominative/accusative alignment is more prevalent than ergative/absolutive alignment (5).

3. Have fewer grammatical categories marked on the verb (6) and are less likely to have idiosyncratic verbal morphology such as verbal person markings that alternate between marking agent or patient depending on semantic context (7).

4. Are more likely to not possess noun/verb agreement or have agreement limited to agents (8) and are more likely to possess no person markings on adpositions (9). As with case markings, syncretism in noun/verb/adposition agreement is more common in languages spoken in the exoteric niche (10).

5. Are more likely to make possibility and evidential distinctions using lexical (e.g., verbal) constructions rather than using inflections such as affixes (11-13) and are more likely to conflate the two (semantically distinct) types of possibility (14).

6. (a) Are more likely to encode negation using analytical strategies (negative word) than using inflections (affixes) and are less likely to have idiosyncratic variations between word and affixation strategies (15). (b) Are more likely to have obligatory plural markers (16). For languages with optional markers, analytic (word) strategies are more common than inflectional strategies (affixes or clitics). (c) Are less likely to have a separate associative plural (e.g., "He and his friends") (17). (d) Are more likely to have a dedicated question particle (18).

7. (a) Are less likely to encode the future tense morphologically (19) or possess remoteness distinctions in the past tense (20). In contrast, languages spoken in the exoteric niche are somewhat more likely to mark the perfective/imperfective distinction in their morphology (21), although this relationship disappears when language geography is partialled out. (b) Are more likely to mark singular imperatives on verbs using inflections than to have no morphological markings for imperatives at all, but are less likely to contain more elaborate markings that differentiate between singular and plural imperatives (22). (c) Are less likely to have inflections that mark possession (23) and the optative mood (24).

8. (a) Are less likely to have definite and indefinite articles (25). If both are present, they are more likely to comprise lexical words than affixes.

9. Are less likely to communicate distance distinctions in demonstratives (26).

Figure 2 shows the relationship between population and the two quantitative measures of morphological complexity: number of case markings (feature 3) and inflectional synthesis of the verb (feature 6). Both relationships are significant as analyzed by a GLM: cases, F(1,228)=5.68, p=.018 (the relationship remains significant when zero-case languages are removed: F(1,132)=4.29, p=.040); inflectional synthesis of the verb, F(1,140)=20.90, p<.0005. The latter relationship is particularly striking when averaged by the largest language families (Figure 3a, Pearson r = .48) and by continents (Figure 3b, Pearson r = .96).

In a subsequent analysis, we constructed an overall complexity measure by adding up the number of features for which each language relies on lexical versus morphological coding and subtracting the total from 0 (Supplementary Note 6). There was a strong relationship between complexity and speaker population, F(1,1246)=71.20, p<.0005 (Figure 4). Languages with the most speakers were more likely to use lexical over morphological strategies. Restricting the analysis to the 6 largest language families revealed that the relationship held within each one (excepting the Australian family, which has a very small population range) (Figure 5). With few exceptions, the same patterns were observed whether population, area, or linguistic contact was used in the model. Overall, the population model provided the greatest predictive power.

Languages spoken in the exoteric niche (as indicated by larger speaker populations, greater geographical coverage, and greater degree of contact with other languages) had overall simpler inflectional systems, more frequently expressed semantic distinctions using lexical means, and were overall less grammatically specified. This was true both for quantitative grammatical measures, such as the number of different grammatical categories encoded by verbal inflections (feature 6) and case markings, and for qualitative grammatical types. For example, languages spoken in the exoteric niche were associated with a lack of conventional strategies for encoding semantic distinctions like situational/epistemic possibility, evidentiality, the optative, indefiniteness, the future tense, and both distance contrasts in demonstratives (consider the increasing rarity of the English "over yonder") and remoteness distinctions in the past tense.
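
The overall complexity measure described in the Results can be sketched in a few lines of Python. This is one reading of the description (count the features a language codes lexically, then subtract that count from 0); the exact procedure is given in Supplementary Note 6, which is not reproduced here, so the function name, the "lexical"/"morphological" labels, and the handling of missing features are assumptions made for illustration.

```python
def complexity_score(codings):
    """Return 0 minus the number of lexically coded features.

    `codings` maps a feature name to "lexical" or "morphological".
    Features with no data for a language are simply left out (an assumption).
    Scores closer to 0 indicate heavier reliance on morphological coding.
    """
    lexical_count = sum(1 for strategy in codings.values() if strategy == "lexical")
    return 0 - lexical_count


# Hypothetical language coded on three features, two of them lexically.
example = {"negation": "lexical", "future": "morphological", "evidentiality": "lexical"}
score = complexity_score(example)
```

Under this reading, a language that codes every available feature morphologically scores 0, and each lexically coded feature lowers the score by one.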

As noted above, semantic distinctions coded lexically are more likely to be optionally expressed than those coded inflectionally (e.g., lexical versus inflectional encoding of tense). Thus, languages that are less grammatically specified tend to rely more on extra-linguistic information such as pragmatics and context [12]. Reduced reliance on morphology also has the effect of increasing the transparency between word-forms and meanings (form-meaning compositionality) [3]. Consider the high occurrence of exceptions in the inflectionally marked past tense forms of English compared to the perfect regularity of the modally marked future tense. One reason for the inverse relationship between morphology and form-meaning compositionality is that inflections such as affixes are, by definition, phonologically bound to the stem, which increases opportunities for phonological compression and sound change to disrupt regular mappings between form and meaning. Thus, although it is logically possible to have complex inflectional morphology that is highly regular (frequently classified as agglutination), in practice, coarticulation, historical sound change, and other phonological/articulatory processes often subvert this regularity and lead to more idiosyncratic mappings [27-29]. We found that the relationship between exotericity and increased form-meaning compositionality holds not only for specific linguistic features like tense and evidentiality, but is also supported by the observation that languages in the exoteric niche are more likely to be classified by typologists as being isolating rather than concatenative or fusional [30].

The Linguistic Niche Hypothesis

These results provide strong evidence for the relationship between social structure and linguistic structure. What kinds of social and cognitive mechanisms could give rise to these relationships? The Linguistic Niche Hypothesis provides a framework in which to consider two central questions raised by the present analyses: (1) Why are languages spoken in the exoteric
niche morphologically simpler than languages spoken in the esoteric niche? (2) Why are languages spoken in the esoteric niche so morphologically complex, given that such a high level of specification seems unnecessary for communication? We propose that the level of morphological specification is a product of languages adapting to the learning constraints and the unique communicative needs of the speaker population. As a language spreads over a larger area (e.g., as a result of colonization) and is learned by a greater number of outside learners, complex morphological paradigms become simplified [11,17,19]. Complex morphological paradigms appear to present particular learning challenges for adult learners even when their native languages make use of similar paradigms [31]. This appeal to the learning constraints of adult learners as an explanation for morphological simplification has also been made in the descriptive analyses of Trudgill [20] and McWhorter (the interrupted transmission hypothesis) [7], which have previously been supported only by selected examples. With increased geographic spread and a greater speaker population, a language is more likely to be subjected to the learnability biases and limitations of adult learners. Linguistic change that facilitates adult second-language learning will accumulate over historical time (calculating the rate of change is an intriguing topic that is beyond the scope of the present work). It appears that such simplification of inflectional morphology and the accompanying increase in the transparency of form-to-meaning mapping [3] comprise a major type of such change (see Supplementary Materials for additional analyses). It is important to note that adult learners can affect the trajectory of a grammar even when they do not make up the majority of a language's population. As has been noted by Calvet [32], there is no automatic transmission of the
mother tongue from parents to offspring. For example, in a survey of 188 individuals in Senegal who listed Bambara as their native language, Bambara was the father's native language in 16%, the mother's in 19%, the native language of both parents in 26%, and the native language of neither parent in 39% [32]. It is thus common for children to receive input of what they consider to be their native language from nonnative learners. Vehicular languages like Bambara (as well as colonial languages like French in Gabon) are often dominant enough to impose themselves within families even when they are not the native language of the parents. Although children are learning these languages from a young age and are, in theory, fully capable of learning whatever inflectional system the language possesses, much of their input may come from nonnative speakers. Thus, whatever aspects of Bambara were difficult for the parents to learn are more likely to be passed on to the offspring in a revised form.

Many have commented on the puzzle of "baroque accretion" so common to languages [33]. We propose that the surface complexity of languages adapted to the esoteric niche may arise as a consequence of two sources of optimization: (1) a pressure to facilitate learning of the language by infants without regard for adult learnability, and (2) a pressure to provide an efficient way to relay subtle differences in meaning. We can formalize this intuition by using principles of minimum description length [34,35]. Consider the case of a speaker A producing a sentence S which has to be understood by a listener L. The description length (total cost) of S has three components:

The code-cost is the number of bits required to communicate the sentence (the production cost to A).
The model-cost is the number of bits required to specify how listener L extracts the meaning from sentence S.
The reconstruction-error is the number of bits required to repair any errors that occur when S, as communicated by A, is reconstructed by L.

Total Cost = Code-cost + Model-cost + Reconstruction-error

Let us assume that the reconstruction error is constant (Supplementary Note 7). Minimizing the code-cost increases the model-cost. To take an example from a familiar domain: one can reduce the size of a music file by compressing it, but decreasing its size in this way requires more powerful decoders to read the file. Reading a CD is far simpler than reading an MP3. Let the code-cost correspond to the surface-level grammatical specification. Thus, requiring speakers to specify tense, number, aspect, evidentiality, and mood on a verb, which we have shown to be more common in languages spoken in the esoteric niche (e.g., feature 6), corresponds to a greater code-cost. A decrease in the model-cost under such circumstances would suggest that morphological overspecification may increase redundancy (Supplementary Note 8) and, provided that infants benefit from such an increase, may simplify language acquisition [36,37].

Below, we provide independent evidence that languages spoken in the esoteric niche may be more informationally redundant than languages spoken in the exoteric niche. We obtained written translations of the Universal Declaration of Human Rights [38] for 103 languages that use the Roman alphabet. Redundancy was assessed by compressing the text files corresponding to each language using a standard ZIP algorithm (as implemented by WinRAR: www.rarlab.com). The ZIP algorithm works by building a dictionary of maximum-length text strings and reusing the entries as frequently as possible. So, for instance, the string "to be or not to be" can be compressed by 25% by storing the longest redundant string, "to be", in a dictionary and referencing it each time it occurs. Overall, we expected that greater morphological specification would lead to greater compressibility. For example, the strings
"walk" and "walked" can be compressed by storing "walk" and "ed" in a dictionary and referencing "ed" for any regularly inflected verb, producing a storage saving whenever an inflected verb occurs (of course, the addition of inflections can increase the overall size of the uncompressed document). In the absence of an inflectional past tense marker, no such saving occurs. Table 3 shows the obtained correlations between the measure of redundancy (compression ratio) and the demographic variables used in our main analysis. As shown in Figure 6, languages spoken by more people and/or over a larger area are less compressible than languages spoken by fewer people (Supplementary Note 9). Additional analyses that partial out the original file size and the number of unique and total words did not eliminate the negative relationship between exotericity and compressibility. To ensure that the redundancy differences arose from differences in morphological specification, we replaced each unique word with a unique number, e.g., "walk" and "walked" might be consistently replaced throughout the document by the numbers 12938 and 59843 [39]. This substitution retains the relationship between words (each repeated occurrence of 12938 can still be replaced by a reference to the dictionary entry), but not between morphemes. Following the substitution, there were no significant relationships between demographics and compressibility, suggesting that the original relationship was driven by morphology.

This analysis suggests that what appears to be baroque accretion [33] may make languages more informationally redundant. Insofar as redundancy provides infants with multiple cues and allows language acquisition to rely less on extralinguistic context, the greater redundancy common to languages in the esoteric niche may facilitate language acquisition by infants. Communication is frequently linguistically underspecified; adults may cope with such
underspecification more effectively than infants, and thus it is infants that would benefit most from linguistic redundancy [36,37,40]. The paradoxical prediction that morphological overspecification, while clearly difficult for adults, facilitates infant language acquisition remains to be empirically tested. The Supplementary Materials present some evidence that the most frequent typologies (e.g., case suffixes are much more widespread/frequent than case prefixes) correspond to those most easily learned by children, whereas typologies common to high-population (i.e., exoteric) languages are most learnable by adults.

We have argued that, depending on the number of speakers, geographic spread, and linguistic contact, languages are placed under different learnability and communication pressures. Languages spoken by millions of people over a diverse region are (1) under a greater pressure to be learnable by outsiders and (2) under a greater pressure to be understood by strangers (individuals with whom the speaker does not share much common ground). Languages appear to respond to these pressures by simplifying their morphology, increasing the productivity of existing grammatical patterns, and becoming more analytical. In becoming analytical, languages spoken in the exoteric niche increase their compositionality in that meanings of expressions can be determined from their composition, because the system approximates a one-to-one relationship between forms and meanings (ref. 3, p. 9). A language spoken by relatively few people over a small area is less subject to these same pressures. The idiomatic constructions and baroque accretion so common to languages are more likely to flourish in an environment composed exclusively of native speakers. Such constructions increase encoding redundancy, which may aid acquisition by first-language learners whose learning systems easily accommodate the increased morphosyntactic complexity.

A new perspective on linguistic diversity


The Linguistic Niche Hypothesis adds a new perspective to a question that has puzzled people for millennia: why are there so many different languages? One currently accepted answer is that as a population splits into several groups, dialect differences emerge and gradually render the languages mutually incomprehensible (ref. 30). This linguistic drift account is analogous to genetic drift in evolutionary biology. Crucially, biological speciation events are also produced by ecological speciation, in which genetic diversity between cohabitating populations increases as the populations adapt to different ecological niches (ref. 41). The present work suggests that languages may undergo a similar process of adaptation to a niche. On this view, linguistic diversity is not simply a product of passive drift, but also of active speciation as languages adapt either to a small, socially cohesive community of native speakers or to a large, diverse group that includes nonnative learners. The present level of morphological complexity in a language may thus be informative of the socio-historical context in which the language evolved.


Table 1

Language | L1 speakers (millions) | L2 speakers (millions) | %L1
Malay | 30 | 170 | .15
English | 330 | 812 | .29
French | 65 | 50 | .57
Amharic | 27 | 7 | .79
Abkhaz | 0.11 | .006 | .95
Siberian Yupik Eskimo | 0.001 | ~0 | ~1

Examples of native (L1) to non-native (L2) populations for several languages. Speaker counts are from ref. 2.
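The %L1 column in Table 1 is consistent with the proportion L1 / (L1 + L2). The short sketch below checks that reading against the tabulated speaker counts. The formula is an assumption inferred from the table values, not a definition given in the text, and the names `speakers` and `l1_share` are illustrative only.

```python
# Speaker counts in millions (L1, L2), copied from Table 1.
# Siberian Yupik Eskimo is omitted because its L2 count is given only as "~0".
speakers = {
    "Malay": (30, 170),
    "English": (330, 812),
    "French": (65, 50),
    "Amharic": (27, 7),
    "Abkhaz": (0.11, 0.006),
}


def l1_share(l1: float, l2: float) -> float:
    """Assumed definition of %L1: native speakers as a share of all speakers."""
    return l1 / (l1 + l2)


# Round to two decimals to match the precision used in the table.
shares = {name: round(l1_share(l1, l2), 2) for name, (l1, l2) in speakers.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}")
```

Under this assumed definition the computed shares match the %L1 values listed for the five languages.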


Table 2

Model predictors: Population (Log Speakers), Ling Contact (Log ling. neighbors), Area (Log km2).
Each row gives: Feature | Observed Pattern | Effect size. The significance markers for the model predictors follow each group in the order printed.

Morphological Type
1. Fusion of inflectional formatives (20) | Isolating > Concatenating | 17.69
2. Inflectional Morphology (26) | Little or None > Present | 37.26
Markers (features 1-2): ** ** / . .

Cases
3. Number of Cases (49) | Fewer Cases > More Cases | f²=.08
4. Case Syncretism (28) | Core/Non-Core Cases > Core Only > No Syncretism | 11.03
5. Alignment of Case markings of Full NPs (98) | Nom/Acc > Erg/Abs | 25.16
Markers (features 3-5): ** ** ** / * ** / * **

Verb Morphology
6. Inflectional Synthesis of the Verb (categories per word) (22) | Few Forms > Many Forms | f²=.15
7. Alignment of Verbal Person Marking (100) | Neutral Ergative=Accusative > Context Dependent | 31.78
Markers (features 6-7): ** ** / * * / * x

Agreement
8. Person Marking on Verbs (102) | None=Agent > Agent & Patient = Patient Only > Agent or Patient | 22.23
9. Person Marking on Adpositions (48) | None > Pronoun > Pronoun + Noun | 22.93
10. Syncretism in Verbal Person/Number Marking (29) | Syncretic > None | 8.09
Markers (features 8-10): ** ** ** / * * ** / * * *

Possibility and Evidentials
11. Situational Possibility (74) | Verbal > Morphological | 16.18
12. Epistemic Possibility (75) | Verbal > Morphological | 93.46
13. Overlap b/w Epistemic and Situational Possibility (76) | Situational/Epistemic Collapsed > Separate Markers | 75.50
14. Coding of Evidentiality (77) | No Gram. Evidentials > Gram. Evidentials | 9.33
Markers (features 11-14): ** ** ** * / ** ** ** . / * ** * x

Negation, Plurality, and Interrogatives
15. Coding of Negation (112) | Word > Affix Double Neg Particle Aux. Verb Word/Affix Variation | 30.76
16. Coding/Occurrence of Plurality (34) | Obligatory > Optional [word > affix/clitic] > None | 55.86
17. Associative Plural (36) | No assoc. Plural > Assoc. Plural | 3.74
18. Polar Question coding (92) | Question particle > No Question particle | 15.06
Markers (features 15-16): ** ** / ** ** / ** **

Tense, Possession, Aspect, and Mood
19. Future Tense (67) | No Morph > Morph. | 15.95
20. Past Tense (66) | Simple Past > No Morph Past > 2-3 Remoteness Dist. > 3+ Remoteness Dist. | 34.41
21. Perfective/Imperfective (65) | Morph. Distinction > No Morph Distinction | 4.50
22. Morphological Imperative (70) | Sing only > Not Morph. Marked Sing & Plural Sing. Syncretic with Plural | 26.52
23. Coding of Possessives (57) | No possessive affix > Possessive Affix | 30.53
24. Optative (73) | Not Marked > Morphologically Marked | 18.54
Markers (features 17-24): ** ** ** ** . ** ** . / . ** * * * x / . ** * * . x / ** ** / ** x

Articles and Demonstratives
25. Definite/Indefinite Articles (38-39) | None Both (Lexical) = Only Def. or Only Indef. Both (Affixes) | 23.52
26. Distance distinctions in demonstratives (41) | No distance contrasts > 2 Contrasts 2+ Contrasts | 13.83
Markers (features 25-26): . ** / ** . / . **

Effect Size is the log-likelihood ratio from a comparison of the intercept-only model with a model that predicts the feature values using the three demographic variables. For continuous measures (features 3 and 6), effect size is reported as Cohen's f².
   = Demographics and geographic location predict typology better than geographic location alone (Chi-sq model comparison, p<.05)
** = Reported pattern is significant (p<.05) after controlling for geographic location
* = Pattern no longer significant (p>.05) after controlling for geographic location
. = Consistent with the pattern reported, but not significant
x = Pattern after controlling for geographic covariates is non-significantly inconsistent with the pattern observed without controlling for geographic location.
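The two effect-size measures named in the note to Table 2 can be illustrated with a minimal sketch. It assumes the conventional forms: a likelihood-ratio statistic of 2 × (log-likelihood of the full model minus log-likelihood of the intercept-only model), and Cohen's f² = R² / (1 − R²). The source does not spell out either formula, so both are assumptions, and the function names are hypothetical.

```python
def likelihood_ratio(loglik_null: float, loglik_full: float) -> float:
    """Assumed form: 2 * (log-likelihood of full model - intercept-only model)."""
    return 2.0 * (loglik_full - loglik_null)


def cohens_f2(r_squared: float) -> float:
    """Cohen's f-squared from the proportion of variance explained (R^2)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return r_squared / (1.0 - r_squared)


def r2_from_f2(f2: float) -> float:
    """Inverse of cohens_f2: recover R^2 from a reported f-squared."""
    if f2 < 0.0:
        raise ValueError("f2 must be non-negative")
    return f2 / (1.0 + f2)


# Illustrative inputs only; the log-likelihoods are made up for the example.
print(likelihood_ratio(loglik_null=-110.0, loglik_full=-100.0))
print(round(r2_from_f2(0.08), 3))  # R^2 implied by an f-squared of .08
print(round(r2_from_f2(0.15), 3))  # R^2 implied by an f-squared of .15
```

Read this way, the f² values reported for features 3 and 6 would correspond to roughly 7% and 13% of variance explained, respectively.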


Table 3

Columns: Population (Log Speakers) | Ling Contact (Log ling. neighbors) | Area (Log km2)

Rows:
Total Words
Unique Words
Size in bytes
Compression Ratio (CR)
CR with size partialed out
CR with total and unique words partialed out
Control Condition: CR With Number Substitution

Correlations in the order printed: -.01; .15 (.14); -.17 (.10); .17 (.10); -.19 (.06); -.32 (<.0005); -.22 (.03); -.25 (.01); -.11; .04; -.23 (.02); -.12; -.17 (.09); -.10; -.56 (<.0005); -.43 (<.0005); -.53 (<.0005); -.13; -.01; -.12; -.03

Pearson correlations (and p-values) for the compression (ZIP) analysis of 103 text translations of the Universal Declaration of Human Rights.
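A compression ratio of the kind used in this analysis can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it uses Python's `zlib` (DEFLATE, the algorithm commonly used inside ZIP archives) and defines the ratio as compressed size divided by original size, so that lower values mean more redundant text. The toy strings are hypothetical and are not drawn from the translations.

```python
import random
import zlib


def compression_ratio(text: str, level: int = 9) -> float:
    """Compressed size divided by original size (lower = more redundant)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, level)) / len(raw)


# Toy inputs (hypothetical): one highly repetitive text and one text of
# seeded random letters of the same length.
redundant_text = "all human beings are born free and equal. " * 100
rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "
varied_text = "".join(rng.choice(alphabet) for _ in range(len(redundant_text)))

ratio_redundant = compression_ratio(redundant_text)
ratio_varied = compression_ratio(varied_text)
print(f"redundant: {ratio_redundant:.3f}, varied: {ratio_varied:.3f}")
```

With real translations, each text's ratio would then be correlated with the demographic variables; that step is omitted here because it depends on data not reproduced in this section.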


References
1. Chomsky, N. The Minimalist Program. 300 (The MIT Press: 1995).
2. Gordon, R.G. Ethnologue: Languages of the World, 15th Edition. 1272 (SIL International: 2005).
3. Wray, A. & Grace, G. The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form. Lingua 117, 543-578 (2007).
4. Greenberg, J.H. Universals of language. (MIT Press: 1966).
5. Dahl, Ö. The Growth and Maintenance of Linguistic Complexity. (John Benjamins Publishing Co: 2004).
6. Hawkins, J.A. A Comparative Typology of English and German: Unifying the Contrasts. (Univ of Texas Pr: 1986).
7. McWhorter, J. Language Interrupted: Signs of Non-Native Acquisition in Standard Language Grammars. 304 (Oxford University Press, USA: 2007).
8. Payne, D.L. & Payne, T.E. Yagua. Handbook of Amazonian Languages 2, 249-474 (1990).
9. Dahl, Ö. & Velupillai, V. The past tense. Haspelmath et al. (2005). at <http://wals.info/feature/description/66>
10. Bright, W. The Karok language. (University of California Press: 1957).
11. McWhorter, J. What happened to English? Diachronica 19, 217-272 (2002).
12. Crowley, T. An introduction to historical linguistics. (Oxford University Press New York: 1997).
13. Sapir, E. Language and Environment. American Anthropologist 14, 226-242 (1912).
14. Jakobson, R. Roman Jakobson--Selected Writings I: Phonological Studies. 678 (Mouton: 1929).
15. Nettle, D. Language Diversity in West Africa: An Ecological Approach. Journal of Anthropological Archaeology 15, 403-438 (1996).
16. Nettle, D. Linguistic Diversity. 184 (Oxford University Press, USA: 1999).
17. McWhorter, J. The world's simplest grammars are creole grammars. Linguistic Typology 5, 125-166 (2001).
18. Trudgill, P. Contact and isolation in linguistic change. Language change: Contributions to the study of its causes 43, 227-237 (1989).
19. Trudgill, P. Contact and simplification: Historical baggage and directionality in linguistic change. Linguistic Typology 5, 371-374 (2001).
20. Trudgill, P. Sociolinguistic Variation and Change. 197 (Georgetown University Press: 2002).
21. Trudgill, P. On dialect: Social and geographical perspectives. (Blackwell: Oxford, 1983).
22. Dahl, Ö. The Growth and Maintenance of Linguistic Complexity. 333 (John Benjamins Publishing Co: 2004).
23. Thurston, W. How exoteric languages build a lexicon: esoterogeny in West New Britain. Papers from the Fifth International Conference on Austronesian Linguistics 555-579.
24. Haspelmath, M. et al. The world atlas of language structures online. (Max Planck Digital Library: Munich).
25. Seamless Digital Chart of the World. at <http://www.gmi.org/>
26. Nichols, J. Linguistic Diversity in Space and Time. 374 (University Of Chicago Press: 1999).
27. Bybee, J.L. Morphology: A Study of the Relation Between Meaning and Form. (J. Benjamins: 1985).
28. Bybee, J.L., Perkins, R.D. & Pagliuca, W. The Evolution of Grammar: Tense, Aspect, and Modality in the Languages of the World. (University Of Chicago Press: 1994).
29. Dressler, W.U. Word formation as part of natural morphology. Leitmotifs in Natural Morphology 99-126 (1987).
30. Sapir, E. Language: An Introduction to the Study of Speech. (Dover Publications: 1921).
31. Klein, W. & Perdue, C. The Basic Variety (or: Couldn't natural languages be much simpler?). Second Language Research 13, 301-347 (1997).
32. Calvet, L. Towards an Ecology of World Languages. 304 (Polity: 2006).
33. Bickerton, D. Roots of Language. 351 (Karoma Publishers, Incorporated: 1985).
34. Rissanen, J. A universal prior for integers and estimation by minimum description length. Annals of Statistics 11, 416-431 (1983).
35. Zemel, R. & Hinton, G. Learning Population Codes by Minimizing Description Length. Neural Computation 7, 549-564 (1995).
36. Weighall, A.R. The kindergarten path effect revisited: Children's use of context in processing structural ambiguities. Journal of Experimental Child Psychology 99, 75-95 (2008).
37. Ackerman, B.P. The Understanding of Young Children and Adults of the Deictic Adequacy of Communications. Journal of Experimental Child Psychology 31, 256-70 (1981).
38. Universal Declaration of Human Rights. at <http://www.un.org/>
39. Juola, P. Measuring Linguistic Complexity: The Morphological Tier. Journal of Quantitative Linguistics 5, 206-213 (1998).
40. Dittmar, M. et al. German Children's Comprehension of Word Order and Case Marking in Causative Sentences. Child Development 79, 1152-1167 (2008).
41. Schneider, C.J. Natural selection and speciation. Proceedings of the National Academy of Sciences 97, 12398-12399 (2000).


Figure Captions.

Figure 1. Geographic distribution of the 2,236 languages included in the present study.

Figure 2. a: The relationship between population and the number of cases. b: The relationship between population and the number of categories per word. The regression lines are flanked by 95% CIs. The ranges on the x-axis correspond to the coding of these features in the World Atlas of Language Structures.

Figure 3. a: Categories per word (inflectional synthesis of the verb; feature 6 in Table 2) plotted against the mean number of speakers for the largest language families (those containing at least 32 languages). b: Inflectional synthesis of the verb collapsed by continent. The regression line is flanked by 95% CIs. Eurasia corresponds to the region 38° N to 71°20′ N / 29° E to 172° W.

Figure 4. Languages spoken by more people have simpler inflectional morphology. X-axis scores represent a measure of the use of lexical devices compared to the use of inflectional morphology. Symbols represent means; bars show 95% confidence intervals of the median. Bar width is proportional to sample size for each score.

Figure 5. The relationship between population and morphological complexity for the 6 largest language families in our sample. Interestingly, a number of the languages that lie far below the regression line are lingua francas (vehicular languages), e.g., Hausa, Bambara, and Oromo. The Padang dialect of Minangkabau (the second simplest Austronesian language by our measure) is also a lingua franca around West Sumatra, Indonesia.

Figure 6. Text translated into languages spoken by fewer people is more compressible (i.e., more redundant) than text translated into languages spoken by many people.
