Preliminary observations on author variation in mobile phone texting
by John Olsson, Director of the Forensic Linguistics Institute, UK

www.thetext.co.uk

forensicling2003@yahoo.co.uk

Abstract

This non-statistical study considers the degree of individual author variation in mobile
phone text authorship. Fifty-three texters donated 950 texts to a corpus. Each author
was tested for degree of variation displayed in regard to corpus-wide polymorphs such
as ‘u’, ‘you’, ‘yew’. It was found that 34 of the 53 texters exhibited some degree of
polymorphism, ranging from 2 per cent to 52 per cent (average 9 per cent). Although
polymorphous words make up only a small share of the total lexicon, about 5 per cent
of types, they account for approximately 20 per cent of all tokens
used. Although the degree of variation per texter is not high per se, it can, in some
instances, be sufficient to cause problems in forensics. It is therefore crucial that
forensic workers carry out prior variation testing on candidates in mobile phone text
inquiries. The study found that, of all the social groups observed, young females exhibit
more variation than any other group, and that variation decreases with age. It is also
observed that author variation is often ignored by forensic linguists. Recent work
which scarcely considers the topic is cited. The findings from this preliminary study
will be tested against further, larger studies based on corpora currently being built.

Introduction

The purpose of this preliminary study is to consider the question of author variation in
the context of mobile (or cell) phone texting. Specifically, the study examines the
extent to which the individual mobile phone texter varies with regard to style. For a
number of reasons – including the informality of the phone text medium, the generally
private nature of the content, the fact that even the best laid out phone keypads are not
always easy to use, and, among some social groups, the motivation to use an in-group
style – many texters use a variety of abbreviations and other non-conventional forms
in place of standard language tokens.

An important question in the forensic context is the extent to which individual texters
are consistent in their output. If it is discovered that texters are highly consistent with
regard to the forms they use, this would give forensic authorship analysis of mobile phone
texts great credence in courts. If, however, it turns out that texters are inconsistent
with regard to output then forensic authorship analysis could prove problematic in
some circumstances.

It should be noted that the corpus is relatively small in size with some contributors
having donated a disproportionately large number of texts. For this reason some
individual texters will in all likelihood be exerting an undue influence on the
polymorph population and to some extent on variation. One solution to this might
have been to restrict each contributor to a fixed number of texts. This, however, would
have meant the sacrifice of much valuable data. The long term answer will probably
simply be the expansion of the corpus to the extent that the capacity for any one
individual to influence the corpus will be restricted. That is why this is a ‘provisional’
analysis, a work in progress.

It should perhaps be emphasised at this point that the study is about author variation,
not authorship testing per se. Questions relating specifically to authorship
identification will be addressed in later studies.

Finally, it is worth remarking that this is not a statistical study, although quantitative
data are employed. However, interested readers may wish to take the data found in the
table and perform their own calculations.

Mobile phone texting

In order to send text messages the mobile phone texter (hereafter ‘texter’) selects the
message function on the handset, and then enters a message using the alphanumeric
keypad. On most phones each numeric key carries three or four letters.
For example, the number 2 key typically allows for entry of the letters ‘A’, ‘B’
and ‘C’. For the first letter in the sequence only one press is required to return the
desired letter, for the second two presses are required, while for the third three presses
are required. There is a space bar to enter spaces between words and there is usually a
specific key which, if pressed in sequence, will yield a given punctuation mark, such
as a comma, question mark and so on. Additionally, the texter can select the upper
case option for proper names while, typically, sentence initial letters automatically
default to the upper case.
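
The press-counting scheme described above can be sketched in a few lines of Python. This is only an illustration: the letter layout below is an assumption (the common arrangement in which ‘abc’ sits on the 2 key), it is not taken from the study, and the function name keypresses is hypothetical. Digits, spaces and punctuation are ignored.

```python
# Assumed letter layout of a common phone keypad; not taken from the study.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Invert the layout: each letter maps to (key, number of presses needed).
PRESSES = {
    letter: (key, position + 1)
    for key, letters in KEYPAD.items()
    for position, letter in enumerate(letters)
}


def keypresses(word: str) -> int:
    """Count the multi-tap presses needed for the letters of `word`.

    Characters that are not letters on the assumed layout are skipped.
    """
    return sum(PRESSES[ch][1] for ch in word.lower() if ch in PRESSES)


print(keypresses("you"), keypresses("u"))      # 8 2
print(keypresses("because"), keypresses("cos"))  # 16 10
```

On this assumed layout the abbreviated forms cost far fewer presses than the standard spellings, which is consistent with the ease-of-use motive mentioned in the Introduction.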

Most phones have the facility for dictionary building using the predictive text mode.
In this mode a pre-entered dictionary item, such as ‘you are’ would be yielded if the
texter entered a pre-selected sequence, such as ‘u r’. Given the relative difficulty of
adding items to the inbuilt dictionary and/or the labour involved in learning in-built
configurations, not to mention the wide range of items required to make such a
dictionary effective, few users opt for the predictive texting mode. In fact, of the 53
participants in the present study none had selected predictive mode and, as far as the
author is aware, no authorship cases involving predictive mode texting have been
presented to the courts in the UK, though this may not be true of other jurisdictions.

For the reasons previously stated many users opt for abbreviated forms, such as ‘u’ for
‘you’, ‘l8’ for ‘late’, ‘b’ for ‘be’ and so on. For many language tokens several options
are available. For example, in the present study four forms of the word ‘about’ were
presented in addition to its standard form: ‘abt’, ‘bout’, ‘bowt’ and ‘bt’, which was
also one of the forms for ‘but’. The word ‘and’ attracted four forms in addition to the
regular form, and these were ‘&’, ‘n’ (which also occurred in place of ‘in’), ‘nd’ and
‘un’. ‘Because’ occurred in six different ways – never once in its conventional form –
only appearing as ‘cause’, ‘caz’, ‘cos’, ‘coz’, ‘cuz’ and ‘cz’. ‘You’ (including with its
various contracted auxiliaries) occurs in exactly 20 different guises, including ‘u’,
‘ya’, ‘yew’, ‘yo’ (also a greeting), and ‘yhoo’. Some of these, given the fact that they
are not actually abbreviations, may occur for reasons of in-group identity or for
stylistic reasons.

The study

Mobile phone texts were obtained over a period of two months from 53 texters in the
vicinity of the author’s workplace and elsewhere. Participants were aware that a study
of mobile phone texts was being undertaken for forensic purposes. A phone number
was made available for people to forward their texts to, while some visited the
author’s premises and allowed an upload of material onto the author’s computer,
using various types of software. Several sets of texts were presented in hard copy
format. Participants were asked to send or bring only texts which had been sent
before the request was made and which were less than twelve months old.

Above, the reader will have observed that I have titled this analysis ‘Preliminary
observations on author variation in mobile phone texting’. My reason for this is
simple: my colleagues and I are in the process of collecting a larger corpus of mobile
phone texts. We have yet to collate and anonymise these. On completion of this work
we will be able to determine to what extent any conclusions in this report can be more
generally applied. At that stage a further study will take place with, it is intended,
considerably more detail than that provided in the present instance.

The participants

The participants ranged in age from 11 (with parental permission) to 70 years of age.
16 participants were aged 25 or less, 17 were between 26 and 40, a further 12 were
between 41 and 50 years of age, and the remaining 8 were between 51 and 70 (all
ages inclusive). 33 of the participants were female and 20 were male. There was a
variety of educational and social levels among the participants. All except two
participants were British Caucasians and native speakers of English; of the other two, one was an Irish
Caucasian (native English speaker) and the other a Chinese (Han) national (Putonghua
speaker). Participants were informed that they could anonymise their texts or,
alternatively, this would be done on their behalf.

The texts

Texters varied greatly in the number of texts they donated, with several donating just
one text while one texter donated 104 texts. The average number of texts donated was
17.92. In all 950 texts were received. Once received, those texts which had not been
anonymised were carefully redacted to protect the identities and locations of the
participants and their text interactants. All phone numbers were replaced by eleven
zeros (eleven being the number of digits in a UK telephone number, whether
terrestrial or mobile).

The motivation for the present study

The author has been researching variation in written language for some years, but the
motivation for the present study arose from a discussion in which the following
words were used by a colleague with reference to the language of mobile phone
texting: “It is not unusual to find variation in the same person’s language choices…
but…for most items most people use only one form most of the time”. This comment
struck me as being worthy of further analysis, especially as the colleague and I both
have an interest in forensic matters.

‘For most items most people use only one form most of the time’

Firstly, of course it is the case that ‘most’ items will be realised in ‘only one form
most of the time’. It would be far too labour and memory intensive for ‘most people’
to produce several forms of most of the language items they used. In all likelihood the
extent of both producer and receiver processing in such a case would defeat the goal
of ‘instant’ communication.

Secondly, it is a surprise that this phrase [1] occurred at all in the context of forensic
matters. At a purely practical level, it is meaningless: ‘most’ can mean as little as
‘fifty one percent’. Taking the statement to its logical conclusion, we could quite
reasonably interpret its meaning as ‘fifty one percent of people use only one form fifty
one percent of the time’. As the reader will appreciate, this equates to something like
twenty five percent by the multiplication rule. Even if, however, my colleague
actually meant something like ‘ninety percent of people use only one form ninety
percent of the time’, this is still not a forensically satisfactory statement since, again
by the multiplication rule, this would equate, informally, to something
approximating 80 percent consistency. Few courts would be prepared to convict if
such a low percentage were claimed for DNA or other physical evidence and, as far as
I know, no software yet exists that can (statistically) attribute mobile
phone text authorship even to this level.
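
The informal arithmetic behind these two figures can be checked with a short sketch. It assumes, as the informal reading above does, that the share of people and the share of the time can simply be multiplied together; the function name joint_consistency is hypothetical.

```python
def joint_consistency(share_of_people: float, share_of_time: float) -> float:
    """Multiply the two proportions, treating them as independent."""
    return share_of_people * share_of_time


low = joint_consistency(0.51, 0.51)   # 'most' read as 51 percent
high = joint_consistency(0.90, 0.90)  # 'most' read as 90 percent

print(f"{low:.0%} to {high:.0%}")  # 26% to 81%
```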

So, what my colleague appeared to be saying was that somewhere between 25% and
80% of the time we will find authorial consistency. I submit that this is saying very
little of forensic value. Nevertheless, I decided to test the extent to which texters are
consistent in their output.

[1] In the context of the same case, another colleague was interviewed on the BBC. He also claimed that
consistency was important in forensic authorship, citing a recent case in which the defendant allegedly
forged texts from a victim's phone. He quoted the defendant's texts as using 'me' (as possessive
pronoun) instead of 'my' "consistently". However, close examination of the defendant's texts reveals
that the defendant uses 'my' just as frequently in this capacity as he uses 'me'. It seems that adherents of
'consistency' are sometimes unable even to observe variation.

The experiment

It was first decided to calculate how many words in the corpus occurred in more than
one form. Aside from personal and place names and the combining of two words (e.g.
‘see you’ as ‘cu’, ‘cya’, etc), as well as the formation of clusters by punctuation, e.g.
‘cya.or’ in ‘cya.or cal me’, there were 295 types which occurred in more than one
form, producing a total of 894 tokens. Some occurred only twice, others several times
– examples of some of these were given earlier. Words which occur in more than one
form in the corpus have been labelled ‘polymorphs’. As seen from the earlier
examples, most polymorphs are of extremely common words, about 80% being
function words, mostly medium to high frequency.

The next question was how many of the 53 texters used at least two forms of one
polymorph in their texts. It will be appreciated that those with many texts will have
had the opportunity to use more polymorphs than those with fewer texts. Of the 53
participants only 19 did not use more than one form of any polymorph. In other words,
64 percent used more than one form of at least one word, with half of those producing
between 10 and 50 percent of all the possible polymorphs (in their sub-corpus) in
multiple forms. To clarify this point: a texter can use a word which is found (in the
corpus or elsewhere) in a variety of forms (e.g. ‘you’). Some texters use just one form
(not necessarily the standard form) of a particular polymorph such as ‘you’, others use
several different forms. What the experiment showed was that most texters use more
than one form of at least one word, while a number use more than one form of a
significant percentage of their possible polymorphs.

The youngest texter, who contributed 72 texts, displayed the highest variation: of 61
possible polymorphs occurring in her texts, 32 occurred in more than one form. This
texter produced no fewer than five forms of ‘you’, four forms of ‘yes’, three forms of
‘tomorrow’, and a number of other words in multiple forms. A 16-year-old male
participant, who contributed far fewer texts (21), had 87 possible polymorphs, of
which 18 occurred in more than one form, including four forms of ‘chicken’ (in the
phrase ‘chicken pox’). Nor was this degree of variation found exclusively in young
texters. An older texter (F, 44) produced three forms of ‘and’ and four forms of ‘have’.

Some texters produced two forms of a polymorph in just one text, e.g. one participant
(F, 62) contributed just one text, which contains both ‘u’ and ‘you’, while another
texter (F, 26) also contributed just one text in which she produces not only two forms
of ‘you’ in it (‘u’ and ‘ya’) but also two forms of ‘good’ (‘good’ and ‘gud’). The
results are presented in numerical form in the table below. The percentage of variation
for each author is calculated as the number of polymorphs that the author used in more
than one form divided by the number of polymorphous words occurring in that author’s
texts. An alternative method of measurement is
given below, after the table.
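
As a rough illustration of this calculation, the following Python sketch computes the ratio for one author. It is a sketch under stated assumptions, not the procedure actually used in the study: the STANDARD_FORM lookup is a hypothetical stand-in for the corpus-wide list of polymorphs, the sample texts are invented, and the function name variation is illustrative only.

```python
from collections import defaultdict

# Hypothetical lookup from surface form to standard word; a stand-in for
# the corpus-wide polymorph list, which is not reproduced in the paper.
STANDARD_FORM = {
    "u": "you", "ya": "you", "you": "you",
    "gud": "good", "good": "good",
    "cos": "because", "coz": "because",
    "l8": "late", "late": "late",
}


def variation(texts: list[str]) -> tuple[int, int, float]:
    """Return (possible polymorphs, many-form polymorphs, ratio) for one author."""
    forms_seen = defaultdict(set)
    for text in texts:
        for raw in text.lower().split():
            token = raw.strip(".,!?")
            word = STANDARD_FORM.get(token)
            if word is not None:
                forms_seen[word].add(token)
    possible = len(forms_seen)
    many_form = sum(1 for forms in forms_seen.values() if len(forms) > 1)
    ratio = many_form / possible if possible else 0.0
    return possible, many_form, ratio


# Invented sample texts, for illustration only.
sample = ["r u gud", "c ya l8", "are you coming cos im late"]
print(variation(sample))  # (4, 2, 0.5)
```

In this invented sample four polymorphous words occur (‘you’, ‘good’, ‘late’, ‘because’), and two of them (‘you’ and ‘late’) appear in more than one form, giving a ratio of 0.5.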

Table 1: Degree of variation found in a participant group of 53 texters

Author  Age  Gender  Total No  Total No  Total No  1 Form      Many Form   One/Many
Code                 of Texts  of Tokens of Types  Polymorphs  Polymorphs
1       44   F       104       1905      627       134         44          0.33
2       70   F       2         47        43        17          0           0.00
3       32   F       11        83        63        33          0           0.00
4       45   F       4         35        28        15          0           0.00
5       50   F       5         53        34        12          2           0.17
7       17   M       30        259       177       73          11          0.15
8       45   M       93        1125      393       124         9           0.07
9       16   M       21        428       231       87          18          0.21
10      25   M       4         52        44        18          0           0.00
11      32   M       9         236       116       47          3           0.06
13      25   F       2         32        30        16          0           0.00
15      35   F       1         58        54        26          1           0.04
22      42   F       26        202       147       56          6           0.11
24      46   F       85        934       339       109         25          0.23
26      25   F       39        766       367       116         22          0.19
27      22   F       2         66        55        22          1           0.05
28      37   F       13        245       149       54          4           0.07
32      40   F       42        460       270       82          13          0.16
33      35   F       6         109       73        26          0           0.00
35      18   F       9         191       118       54          4           0.07
36      45   F       3         31        26        6           1           0.17
37      26   F       1         32        30        9           3           0.33
38      35   F       1         25        21        7           1           0.14
39      38   F       6         130       97        37          0           0.00
40      50   F       1         18        14        4           0           0.00
41      20   F       22        309       172       67          7           0.10
42      25   F       5         92        65        25          2           0.08
44      23   M       7         78        64        35          4           0.11
45      32   M       6         99        69        37          1           0.03
46      32   F       41        905       330       134         3           0.02
48      45   M       1         24        23        10          0           0.00
49      56   M       27        230       133       42          4           0.10
50      11   F       72        627       285       61          32          0.52
51      15   F       3         35        30        15          2           0.13
52      26   F       8         119       72        32          4           0.13
60      17   M       1         34        30        17          0           0.00
61      40   M       32        249       152       58          2           0.03
63      39   M       2         15        10        5           0           0.00
65      36   M       3         38        32        14          0           0.00
66      58   F       1         43        38        25          0           0.00
67      25   M       48        749       307       105         11          0.10
69      70   M       2         17        16        6           0           0.00
70      45   M       2         51        42        21          0           0.00
71      16   F       94        2323      508       144         42          0.29
73      45   F       1         1         1         1           0           0.00
74      56   M       6         104       72        34          0           0.00
75      70   F       3         15        13        4           0           0.00
77      45   F       4         95        71        21          1           0.05
78      55   M       11        254       146       44          1           0.02
79      26   M       19        276       125       59          4           0.07
80      24   M       1         6         6         2           0           0.00
81      62   F       1         15        15        7           1           0.14
82      38   F       7         168       124       50          1           0.02

Results

The average proportion of multiple-form polymorphs was 0.09 (nine percent).
However, there was extensive variation (from 0 percent to 52 percent). Males appear
to exhibit markedly less variation than females (5 percent compared to 11 percent).
Variation broadly decreases with age, though not uniformly: texters aged 18 or less
exhibit 20% variation on average, those between 18 and 30 display 13%, those between
30 and 40 a mere 4%, those aged 40-50 9%, and those
between 50 and 70 years of age 7%. Hence young females show the highest variation in
their texts of all the sub-groups. There is almost no correlation between age of texter
and average length of text: in fact the correlation is slightly negative (-0.08).
Similarly, there is little or no connection between the texter’s gender and average
length of text. Of the 53 participants, 26 exhibit 5% or greater variation. Even those
who contributed 10 or fewer texts exhibit, by this calculation, an average of 5%
variation. It will be appreciated that the size of the corpus in the present study means
these results should be considered as provisional, pending further research.

It was also found that, as expected, variation tends to increase with the number of
texts in the author’s sub-corpus. There was thus a statistically significant positive
correlation (p<0.005, n = 53)
between the number of texts the author had sent and the degree of variation they
displayed. This can be confirmed from the table above. It follows that analysis
of a corpus which contained a relatively small number of texts per author would very
likely show variation to be relatively low. If the data in the above table are reliable
indicators of variation, then corpora containing low numbers of texts per author would
in all probability not represent the true state of author variation in the phone texting
medium.
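
Readers who wish to explore this relationship themselves could start from a sketch such as the following, which computes a Pearson correlation coefficient by hand. It uses only seven rows copied from Table 1 (authors 1, 2, 3, 8, 9, 50 and 71), so it illustrates the method and does not reproduce the figure reported for all 53 texters. The choice of Pearson's coefficient is an assumption, since the paper does not state which measure was used.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Number of texts and variation ratio for seven authors from Table 1.
texts = [104, 2, 11, 93, 21, 72, 94]
ratios = [0.33, 0.00, 0.00, 0.07, 0.21, 0.52, 0.29]

r = pearson(texts, ratios)
print(round(r, 2))
```

For this small subset the coefficient is positive, in line with the direction of the relationship described above.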

A more conservative measurement of individual variation than the one referred to
above is the number of many-form polymorphs used divided by the total number of
types (as in types/tokens) used. Although this yields a much lower percentage of
variation, it was found that 15 of the 53 texters exhibited variation of 5% or greater.
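
The difference between the two measures can be seen by applying both to a few rows of Table 1. The sketch below is illustrative only: it uses five rows copied from the table (authors 1, 8, 9, 50 and 71), takes the polymorph count column to be what the text calls ‘possible polymorphs’, and uses a hypothetical function name, measures.

```python
# (author code, types, possible polymorphs, many-form polymorphs), from Table 1.
ROWS = [
    (1, 627, 134, 44),
    (8, 393, 124, 9),
    (9, 231, 87, 18),
    (50, 285, 61, 32),
    (71, 508, 144, 42),
]


def measures(types: int, possible: int, many_form: int) -> tuple[float, float]:
    """Return (ratio as in Table 1, conservative ratio), rounded to 2 places."""
    return round(many_form / possible, 2), round(many_form / types, 2)


for code, types, possible, many_form in ROWS:
    print(code, *measures(types, possible, many_form))
# 1 0.33 0.07
# 8 0.07 0.02
# 9 0.21 0.08
# 50 0.52 0.11
# 71 0.29 0.08
```

The first figure in each line matches the final column of Table 1, while the second, more conservative, figure is considerably lower for every author shown.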

To give a rough idea of what all this might mean in the forensic context, the results
presented in this analysis suggest that if an expert were being very conservative,
he/she would have to inform the court that the chances of a correct identification
could, in some circumstances, be as low as 72% (i.e. 38/53), although it is possible –
again, depending on the circumstances – for it to be as high as 90%. Clearly, the texts
in each case will present specific levels of variation and some will present greater
difficulties with regard to possible identification than others.

Conclusion

In this analysis I have attempted to illustrate that individual author variation in mobile
phone texts may be greater than would be readily apparent. This is particularly
important in the forensic context, where people’s lives, liberty or reputations are at
issue.

I am not suggesting that forensic authorship analysis of mobile phone texts cannot be carried
out accurately and successfully, but I believe that very great caution is required when
making identifications for court purposes. Nobody would disagree that this should be
so in any forensic analysis.

More generally, I believe it is clear from this brief study that the question of author
variation is becoming increasingly important in the forensic linguistic context.
Manifestly, we need to know more about this phenomenon, so that we can be more
sure, when stating our observations and conclusions in forensic work, that we are
basing our judgements on the realities of language use. Up until now individual author
variation has largely been ignored by many forensic linguists and I believe this is
dangerous: dangerous for the justice system and dangerous for victims and defendants
alike. In case the reader might think I am exaggerating when referring to the lack of
attention given to the topic of individual author variation, I would refer to a statement
in a recent book on forensic linguistics.

The linguist approaches the problem of questioned authorship from the theoretical position
that every native speaker has their own distinct and individual version of the language they
speak and write, their own idiolect, and the assumption that this idiolect will manifest
itself through distinctive and idiosyncratic choices in speech and writing…

This implies that it should be possible to devise a method of linguistic fingerprinting, in
other words that the linguistic ‘impressions’ created by a given speaker should be usable,
just like a signature, to identify them. So far, however, practice is a long way behind
theory and no one has even begun to speculate about how much and what kind of data
would be needed to uniquely characterize an idiolect.

Coulthard and Johnson 2007: 161.

In this passage Coulthard (the author of this section of the book) appears to be
claiming that given sufficient data writers can be ‘linguistically fingerprinted’. All that
is required to expose this idiolect is sufficient data and, apparently, a method.
According to this claim each of us has an idiolect and – providing the above
conditions can be met – this idiolect could be demonstrated. However, the notion of
idiolect assumes an underlying consistency of data and, as William Labov noted over
thirty years ago, the notion of “homogenous data” in idiolects is highly questionable
(Labov, 1972: 192) and the problem of “the quantification of the dimension of style”
(1972: 245) remains. Although Labov’s comments are in relation to speech, my view
is that they are probably applicable to all language data.

As far as I am aware, Coulthard does not refer anywhere in his book to author
variation specifically and shows no sign of having studied the subject in any depth.
Yet he pronounces that we each have an idiolect – which is no more than a theoretical
construct, based on the notion of linguistic variation at the group level which may be
extended to linguistic variation at the individual level. It is of some concern that
trainee forensic linguists may now also adopt the position that ‘idiolect’ is a given,
rather than simply a notion, or a convenient way of describing what is no more than a
theoretical possibility. It is surely the case that in any inquiry we need to look
systematically at the variation displayed by each author candidate and compare this
(perhaps in some measurable way) with the degree to which candidates are
individually distinctive in their choices. We would then also test each claim we make
about distinctiveness by the use of general and specialist corpora and, where
feasible, Internet searches.

Hence, while it is literally true that ‘for most items most people use only one form
most of the time’ in itself this statement does not mean much in the forensic context.
In the experiment reported here, while only about five per cent of items are actually
polymorphs they represent around 20% of all words used. At this level the variation
displayed is more than capable of confounding an identification analysis. Moreover,
given that it is almost predictable which tokens will turn out to be polymorphs, and
given the endless inventiveness of human beings when in contact with language, it is
apparent from the results given above that not only are individuals subject to some
kind of ‘internal’ instability of form when using polymorphs, but that polymorphs in
the phone texting environment are themselves subject to various forms of social
change, including of course diachronic change. It is suggested that Bakhtin’s notion of
heteroglossia is particularly useful when considering variation in mobile phone
texting: whereas ordered, standardised forms of language are evidence of centripetal
social forces, centrifugal forces influence non-standard forms (see Morson and
Emerson, 1990). This effectively means that variation itself may be unstable and
hence unpredictable.

This makes it doubly critical that, in any forensic analysis involving authorship of
phone texts, linguists look closely at the degree of variation displayed by
individual authors before carrying out comparison tests. They will then be in a better
position to assign a more accurate level of probability to their results.

As a first step, it may be helpful to consider that forensic phoneticians now look at the
issue of ‘consistency’ and they balance this against ‘distinctiveness’. Hence, in a
given case if a suspect’s voice yields (relatively) consistent data, and is sufficiently
distinctive as a voice, this greatly enhances the ability to make a valid comparison.
How is it that forensic linguists are not thinking like this?

References

Coulthard, M. & Johnson, A. 2007. An Introduction to Forensic Linguistics:
Language in Evidence. London: Routledge.

Labov, W. 1972. Sociolinguistic Patterns. Philadelphia, Pennsylvania: University of
Pennsylvania Press.

Morson, G.S. & Emerson, C. 1990. Mikhail Bakhtin: Creation of a Prosaics. Palo
Alto, California: Stanford University Press.
