Editors
Marcelo Dascal, Tel Aviv University
Raymond W. Gibbs, Jr., University of California at Santa Cruz
Jan Nuyts, University of Antwerp
Volume 23
Discourse, Vision, and Cognition
by Jana Holšanová
Discourse, Vision, and Cognition
Jana Holšanová
Lund University
Holšanová, Jana.
Discourse, vision, and cognition / Jana Holšanová.
p. cm. (Human Cognitive Processing, issn 1387-6724; v. 23)
Includes bibliographical references and index.
1. Discourse analysis. 2. Oral communication. 3. Psycholinguistics. 4. Visual
communication. I. Title.
P302.H635 2008
401'.41--dc22 2007049098
isbn 978 90 272 2377 7 (Hb; alk. paper)
Preface xi
chapter 1
Segmentation of spoken discourse 1
1. Characteristics of spoken discourse 1
1.1 Transcribing spoken discourse 3
2. Segmentation of spoken discourse and cognitive rhythm 4
2.1 Verbal focus and verbal superfocus 6
3. Segmentation rules 10
4. Perception of discourse boundaries 12
4.1 Focus and superfocus 13
4.2 Other discourse units 15
5. Conclusion 16
chapter 2
Structure and content of spoken picture descriptions 19
1. Taxonomy of foci 19
1.1 Presentational function 21
1.1.1 substantive foci 21
1.1.2 substantive foci with categorisation difficulties 21
1.1.3 substantive list of items 22
1.1.4 summarising foci 22
1.1.5 localising foci 23
1.2 Orientational function 24
1.2.1 evaluative foci 24
1.2.2 expert foci 24
1.3 Organisational function 25
1.3.1 interactive foci 25
1.3.2 introspective foci & metacomments 26
chapter 3
Description coherence & connection between foci 39
1. Picture description: Transitions between foci 39
1.1 Means of bridging the foci 42
1.1.1 Discourse markers 43
1.1.2 Loudness and voice quality 45
1.1.3 Localising expressions 46
2. Illustrations of description coherence: Spontaneous description & drawing 46
2.1 Transitions between foci 50
2.2 Semantic, rhetorical and sequential aspects of discourse 51
2.3 Means of bridging the foci 52
3. Conclusion 53
chapter 4
Variations in picture description 55
1. Two different styles 55
1.1 The static description style 57
1.2 The dynamic description style 59
1.3 Cognitive, experiential and contextual factors 64
chapter 5
Multimodal sequential method and analytic tool 79
1. Picture viewing and picture description: Two windows to the mind 80
1.1 Synchronising verbal and visual data 81
1.2 Focusing attention: The spotlight metaphor 83
1.3 How it all began 84
2. Simultaneous description with eye tracking 85
2.1 Characteristics of the picture: Complex depicted scene 86
2.2 Characteristics of the spoken picture description: Discourse level 87
2.3 Multimodal score sheets 88
2.3.1 Picture viewing 89
2.3.2 Picture description 91
3. Multimodal sequential method 94
4. Conclusion 98
chapter 6
Temporal correspondence between verbal and visual data 99
1. Multimodal configurations and units of visual and verbal data 100
1.1 Configurations within a focus 101
1.1.1 Perfect temporal and semantic match 101
1.1.2 Delay between the visual and the verbal part 101
1.1.3 Triangle configuration 102
1.1.4 N-to-1 mappings 104
1.1.5 N-to-1 mappings, during pauses 104
1.1.6 N-to-1 mappings, rhythmic re-examination pattern 105
1.2 Configurations within a superfocus 106
1.2.1 Series of perfect matches 106
1.2.2 Series of delays 107
1.2.3 Series of triangles 109
chapter 7
Semantic correspondence between verbal and visual data 125
1. Semantic correspondence 126
1.1 Object-location relation 128
1.2 Object-path relation 128
1.3 Object-attribute relation 129
1.4 Object-activity relation 129
2. Levels of specificity and categorisation 130
3. Spatial, semantic and mental groupings 131
3.1 Grouping concrete objects on the basis of spatial proximity 131
3.2 Grouping multiple concrete objects on the basis of categorical proximity 132
3.3 Grouping multiple concrete objects on the basis of the composition 133
3.4 Mental zooming out, recategorising the scene 133
3.5 Mental grouping of concrete objects on the basis of similar traits and activities 134
3.6 Mental grouping of concrete objects on the basis of an abstract scenario 135
4. Discussion 137
4.1 The priming study 138
4.2 Dimensions of picture viewing 142
4.3 Functions of eye movements 143
4.4 The role of eye fixation patterns 143
4.5 Language production and language planning 146
5. Conclusion 148
chapter 8
Picture viewing, picture description and mental imagery 151
1. Visualisation in discourse production and discourse comprehension 152
2. Mental imagery and descriptive discourse 157
chapter 9
Concluding chapter 171
References 179
Author index 193
Subject index 197
Preface
Verbal data (spoken language descriptions) and visual data (the contents of the
attentional spotlight) are used as two windows to the mind. Both kinds of data are
indirect sources that shed light on the underlying cognitive processes.
It is, of course, impossible to directly uncover our cognitive processes. If we
want to learn about how the mind works, we have to do it indirectly, via overt
manifestations. The central question is: Can spoken language descriptions and
eye movement protocols, in concert, elucidate covert mental processes? To an-
swer this question, we will proceed in the following steps:
Chapter 1 presents a segmentation of spoken discourse and defines units of
speech (verbal focus and verbal superfocus) expressing the contents of active
consciousness and providing a complex and subtle window on the mind.
Chapter 2 focuses on the structure and content of spoken picture descrip-
tions. It describes and illustrates various types of foci and superfoci extracted
from picture descriptions in various settings. Chapter 3 takes a closer look at
how speakers create coherence when connecting the subsequent steps in their
picture descriptions. Chapter 4 discusses individual description styles.
While the first four chapters of this book deal with characteristics of picture
descriptions in different settings, in the remaining four chapters, the perspec-
tive has been broadened to that of picture viewing. These chapters explore the
connection between spoken descriptive discourse, picture viewing and mental
imagery. The discourse segmentation methodology is therefore extended into
a multimodal scoring technique for picture description and picture viewing,
leading up to an analysis of correspondence between verbal and visual data.
Chapter 5 deals with methodological questions, and the focus is on sequen-
tial and processual aspects of picture viewing and picture description. The read-
er gets acquainted with the multimodal method and the analytical tools that are
used when studying the correspondence between verbal and visual data.
In Chapters 6 and 7, the multimodal method is used to compare the con-
tent of the visual focus of attention (specifically clusters of visual fixations) and
the content of the verbal focus of attention (specifically verbal foci and super-
foci) in order to find out whether there is correspondence in units of picture
viewing and simultaneous picture description. Both temporal and semantic re-
lations between the verbal and visual data are investigated. Finally, clusters on
different levels of the discourse hierarchy are connected to certain functional
sequences in the visual data.
Chapter 8 focuses on the issue of visualisations in discourse production
and discourse comprehension and presents studies on mental imagery associ-
ated with picture viewing and picture description.
The concluding Chapter 9 looks back on the most important issues and
findings in the book and mentions implications of the multimodal approach
for other fields of research, including evaluation of design, users' interaction
with multiple representations, multimodal systems, etc.
This book addresses researchers with a background in linguistics, psycho-
linguistics, psychology, cognitive science and computer science. The book is
also of interest to scholars working in the applied area of usability and in the
interdisciplinary field concerned with cognitive systems involved in language
use and vision.
Acknowledgements
I wish to thank Professor Wallace Chafe, University of California, Santa Barbara, for in-
spiration, encouragement and useful suggestions during my work. I also ben-
efited from discussions and criticism raised by doctoral students, research-
ers and guest researchers during the eye tracking seminars at the Humanities
Laboratory and Cognitive Science Department at Lund University: Thanks to
Richard Andersson, Lenisa Brandão, Philip Diderichsen, Marcus Nyström and
Jaana Simola. Several colleagues have contributed by reading and commenting
on the manuscript. In particular, I want to thank Roger Johansson and Ken-
neth Holmqvist for their careful reading and criticism. I also wish to thank the
anonymous reviewer for her/his constructive suggestions. Finally, thanks are
due to my family: to our parents, my husband and my children Fredrik and
Annika.
The work was supported by the post-doctoral fellowship grant VR 2002
6308 from the Swedish Research Council.
chapter 1
Segmentation of spoken discourse
When we listen to spontaneous speech, it is easy to think that the utterances form
a continuous stream of coherent thought. Not until we write down and analyse
the speech do we realise that it consists of a series of small units. The speech
flow is segmented, containing many repetitions, stops, hesitations and pauses.
Metaphorically one could say that speech progresses in small discontinuous steps.
In fact, it is the listeners who creatively fill in what is missing or filter out what is
superfluous and in this way construe continuous speech. The stepwise structure
of speech can give us many clues about cognitive processes. It suggests units for
planning, production, perception and comprehension. In other words, the flow of
speech reflects the flow of thoughts.
language terms (Linell 1994:2) and include such features as pauses, hesitations,
unintelligible speech, interruptions, restarts, repeats, corrections and listener
feedback. All these features are typical of speech and give us additional in-
formation about the speaker and the context. The first example stems from a
data set on picture descriptions in Swedish and illustrates some of the typical
features of spoken discourse.
Example 1
0851 ehmm till höger på bilden
ehmm to the right in the picture
0852 så finns de/ ja juste,
there are/ oh yes,
0853 framför framför trädet
in front in front of the tree
0854 så e det . gräs,
so there is . grass,
0855 eh och till höger om det här trädet
eh and to the right of this tree
0856 så har vi återigen en liten åker-täppa eller jord,
so again we have a little field or piece of soil,
As we turn our ideas into speech, the match between what we want to say
and what we actually do say is rarely a perfect one. The stepwise production of
real-time spoken discourse is associated with pauses, stutterings, hesitations,
contaminations, slips of the tongue, speech errors, false-starts, and verbatim
repetitions. Within psycholinguistic research, such dysfluencies in speech have
been used as a primary source of data since they allow us insights into the actu-
al process of language production. Such dysfluencies have been explored
in order to reveal planning, execution and monitoring activities on different
levels of discourse (Garrett 1980; Linell 1982; Levelt 1983; Strömqvist 1996).
Goldman-Eisler (1968) has shown that pauses and hesitation reflect plan-
ning and execution in tasks of various cognitive complexity. Pausing and hesi-
tation phenomena are more frequent in demanding tasks like evaluating (as
opposed to simply describing), as well as at important cognitive points of tran-
sition when new or vital pieces of information appear. Also, choice points at
ideational boundaries are associated with a decrease in speech fluency.
Errors in speech production give us some idea of the units we use. Slips of
the tongue can appear either on a low level of planning and production, as an
anticipation of a following sound ("he dropped his cuff of coffee"), or on higher
levels of planning and production ("Thin this slicely" instead of "Slice this thinly";
Kess 1992). Garrett (1975, 1980) suggests that speech errors provide evidence
for two levels of planning: semantic planning across clause boundaries (on a
functional level) and grammatical planning within the clause range (on a po-
sitional level). Along similar lines, Levelt (1989) assumes macroplanning (i.e.
elaboration of a communicative goal) and microplanning (decisions about the
topic or focus of the utterance etc.). Linell (1982) distinguishes two phases of
utterance production: the construction of an utterance plan (a decision about
the semantic and formal properties of the utterance) and the execution of an ut-
terance plan (the pronunciation of the words) (see also Clark & Clark 1977).
presentation. The first two numbers identify the informant (04), the following two
or three digits number the speech units in order. The transcribed speech
is not adapted to the syntax and punctuation of written language. Also, it in-
cludes forms that do not exist in written language (uhu, mhm, ehh), except
perhaps in instant messaging. The data has been translated into spoken Eng-
lish. Compared to the orthography of written language, some spoken English
language forms are used in the transcript (unless the written pronunciation is
used): he's, there's, don't. For easier reading, Table 1 summarises the symbols
that are used to represent speech and events in the transcription.
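The numbering scheme described above can be sketched as a small helper; this is an illustrative function written for this summary, not part of the original study's tooling:

```python
def parse_unit_id(unit_id: str) -> dict:
    """Split a transcript line number such as '0851' into its parts:
    the first two digits identify the informant, the remaining two or
    three digits number the speech unit."""
    return {"informant": unit_id[:2], "unit": unit_id[2:]}

print(parse_unit_id("0851"))   # {'informant': '08', 'unit': '51'}
print(parse_unit_id("04102"))  # {'informant': '04', 'unit': '102'}
```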
elements are called: idea units, intonation units, information units, informa-
tion packages, phrasing units, clauses, sentences, utterances etc. Many theories
within the research on discourse structure agree that we can focus on only
small pieces of information at a time, but there are different opinions about what
size these information units should have. In the following, I will define two
units of spoken discourse and formulate segmentation rules. But first, let us
look at the next example.
Example 2 illustrates the segmentation of spoken discourse. In a spoken
description of a picture (cf. Figure 1), one aspect is selected and focused on at a
time and the description proceeds in brief spurts. These brief units that speak-
ers concentrate on one at a time express the focus of active consciousness. They
contain one primary accent, a coherent intonation pattern, and they are often
preceded by a hesitation, a pause or a discourse marker signalling a change of
focus. Note that the successive ideas expressed in units 0404–0410 are in six
cases marked and connected by discourse markers och då, sen så, sen, så,
då (and then, then, and then, and so, and then).
Example 2
0402 de e Pettson och katten Findus i olika skepnader då, ja
well it's Pettson and Findus the cat in different versions,
0403 och Pettson gräver
and Pettson is digging
0404 . och då har han hittat nånting som han håller i handen,
. and he's found something that he's holding in his hand,
Several verbal foci are clustered into a more complex unit called verbal superfo-
cus. A verbal superfocus is a coherent chunk of speech that extends beyond the verbal
focus and is prosodically finished. This larger discourse segment, typically a
longer utterance, consists of several foci connected by the same thematic aspect
and has a sentence-final prosodic pattern (often a falling intonation). A new
superfocus is typically preceded by a long pause and a hesitation, which reflects
the process of refocusing attention from one picture area to another. Alterna-
tively, transitions between superfoci are made with the aid of discourse mark-
ers, acceleration, voice quality, tempo and loudness. An inserted description
or comment uttered in another voice quality (creaky voice or dialect-imitating
voice), deviating from its surroundings, can stretch over several verbal foci and
form a superfocus. The acoustic features simplify the perception of superfocus
borders. The referents often remain the same throughout the superfocus, only
. Chafe (1979, 1980) found significantly longer pauses and more hesitation signals at the
borders between superfoci than at the borders between verbal foci.
some properties change. When it comes to the size of superfoci, they often
correspond to long utterances. Superfoci can be conceived of as new complex
units of thought. According to Chafe (1994), these larger units of discourse
(called centres of interest) represent thematic units of speech that are guided
by our experience, intellect and judgements, and thus correspond roughly to
scripts or schemata (Schank & Abelson 1977).
Superfoci are in turn parts of even larger units of speech, called discourse
topics. In our descriptive discourse, examples of such larger units would be
units guided by the picture composition (in the middle, on the left, on the
right, in the background) or units guided by the conventions in descriptive
discourse (impression, general overview-detailed description, description-evalu-
ation). In Chafe's terminology, these bigger units are called basic level topics or
discourse topics, verbalising the content of the semiactive consciousness. We
usually keep track of the main ideas expressed in these larger units. Table 2
summarises all the building blocks in the hierarchy of discourse production:
foci, superfoci and discourse topics.
Let me briefly mention other suggested units of speech and thought. But-
terworth (1975) and Beattie (1980) speak about hesitant and fluent phases
in speech production. Butterworth asked informants to segment monologic
discourse into idea units that were assumed to represent informal intuitions
about the semantic structuring of the monologues. The main result of his stud-
ies was that borders between idea units coincided with the hesitations in the
speech production, suggesting that the successive ideas are planned during
these periods and formulated during the subsequent fluent periods (Butter-
worth 1975:81, 83). Beattie (1980) found a cyclic arrangement of hesitant
and fluent phases in the flow of language.
Schilperoord & Sanders (1997) analyse cyclic pausing patterns in discourse
production and suggest that there is a cognitive rhythm in our discourse seg-
mentation that reflects the gross hierarchical distinction between global and
local transitions. Yet another tradition in psycholinguistics is to speak about
information units or information packages and look at how information is pack-
aged and received in small portions that are digestible, in a certain rhythm that
gives the recipient regular opportunities for feedback (Strömqvist 1996, 1998,
et al. 2004; Berman & Slobin 1994).
Next, I will specify the set of rules that I used for segmentation of spoken
descriptive discourse.
Table 2. The discourse hierarchy and the building blocks in free picture description

Discourse topics (comparable to paragraphs)
Examples: Composition; Evaluations; Associations; Impressions; Impression of the whole picture

Superfoci (long utterances, clauses dealing with the same topic)
Example (Superfocus 1): Pettson is first digging in the field and then he's sowing quite big white seeds and then the cat starts watering these particular seeds and it ends up with the fellow eh raking this field,

Foci (phrases, clauses, noun phrases grouped on semantic or functional grounds, short utterances)
Examples (Focus 1 etc.): in the background; on the left; eh Pettson is digging; in the middle is a tree;
3. Segmentation rules
The discussions on the segmentation of speech have often dealt with the dif-
ficulty of clarifying and exactly delimiting such units. Chafe (1987) mentions the
functional properties of the smallest unit in discourse: An intonation unit is a
1. When dividing the transcript into units, the primary (focal) accent, togeth-
er with a coherent intonation pattern (falling or rising pitch contour), is
the dominant and most decisive feature for segment borders.
2. Apart from the two above-mentioned features, segment borders can be
recognised by pauses and hesitation that appear on the boundary between
two verbal foci.
3. Additional strength is added if these features are supported by changes in
loudness, voice quality, tempo and acceleration (especially for the segmen-
tation of larger chunks of speech).
4. A verbal focus is a unit that is seldom broken up internally by pauses. But
when rules 1–3 are fulfilled and there is a short pause in the middle of a
unit, where the speaker looks for the correct word or has trouble with pro-
nunciation or wording, it is still considered to be one unit.
around in the air there are . a number of insects flying
but it's getting close to the genre of com/ comic strips,
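As a rough illustration only, the prosodic cues in rules 1–2 can be sketched as a simple boundary detector over annotated words. The Word attributes and the 0.2-second pause threshold are assumptions made for this sketch, not values from the study:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    focal_accent: bool = False   # rule 1: carries the primary (focal) accent
    contour_end: bool = False    # rule 1: a coherent pitch contour ends here
    pause_after: float = 0.0     # rule 2: following pause/hesitation, in seconds

def segment_into_foci(words, pause_threshold=0.2):
    """Place a focus border where a focal accent and a completed
    intonation contour coincide, reinforced by a following pause."""
    foci, current = [], []
    for w in words:
        current.append(w.text)
        if w.focal_accent and w.contour_end and w.pause_after >= pause_threshold:
            foci.append(" ".join(current))
            current = []
    if current:                  # rule 4: a trailing stretch stays one unit
        foci.append(" ".join(current))
    return foci

words = [
    Word("och"), Word("Pettson"),
    Word("gräver", focal_accent=True, contour_end=True, pause_after=0.4),
    Word("och"), Word("då"), Word("har"), Word("han"),
    Word("hittat", focal_accent=True, contour_end=True, pause_after=0.3),
]
print(segment_into_foci(words))
# ['och Pettson gräver', 'och då har han hittat']
```

A short pause inside a unit (rule 4) simply fails the boundary test, so the unit stays whole, mirroring the rule as stated.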
. The data were collected at Lund University and at UCSB. I would like to thank John
W. Du Bois and Wallace Chafe for their help with this study.
All authentic border markings of the Swedish and non-Swedish groups have been
compared and related to the different types of borders coded by the experts.
First, the aim of the study was to find out whether discourse borders are more
easily recognised at the higher discourse level of verbal superfoci than at the
lower discourse level of verbal foci. The hypothesis was that the boundaries of
a superfocus (a unit comparable to long utterances, higher in the discourse
hierarchy) will be perceived as heavier and more final and thus will be more
easily identified than a focus. The agreement on focus boundaries, superfo-
cus boundaries, and non-boundaries by Swedish and non-Swedish informants
is illustrated in Figure 3. As we can see, the hypothesis has been confirmed:
Swedish and non-Swedish informants agree more on superfocus boundaries
than on focus boundaries. The agreement on focus boundaries reached a level
of 45 percent in the Swedish and 59 percent in the non-Swedish group, whereas
the agreement on superfocus boundaries reached 78 percent in the Swedish
and 74 percent in the non-Swedish group. Apart from agreement on different
kinds of boundaries, we can also consider listeners' agreement on what was not
a boundary.
Figure 3. Agreement (in percent) among the Swedish and non-Swedish groups on focus boundaries, superfocus boundaries and non-boundaries.
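The agreement figures reported above can be read as simple percent agreement between listener markings and expert-coded boundary positions. The following computation is illustrative only, with made-up positions rather than the study's data:

```python
def percent_agreement(marked_positions, coded_positions):
    """Share of expert-coded positions (indices between speech units)
    that a listener also marked, expressed in percent."""
    if not coded_positions:
        return 0.0
    hits = sum(1 for p in coded_positions if p in marked_positions)
    return 100.0 * hits / len(coded_positions)

# Hypothetical example: a listener marks borders after units 2, 5 and 9,
# while the experts coded superfocus borders after units 2, 5, 9 and 12.
print(percent_agreement({2, 5, 9}, [2, 5, 9, 12]))  # 75.0
```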
In the previous part, the reader got to know that the interplay of prosodic,
acoustic and semantic criteria facilitates the recognition of boundaries in de-
scriptive discourse. It should be noted, however, that there are argumentative
parts of descriptive discourse where the rhetorical build-up plays a role in
the discourse segmentation. These units are often based on the relation event-
cause or state-cause. Another type of unit is the question-answer pair (and
other types of adjacency pairs). In dialogic discourse, listeners' feedback and
laughter play an important role in confirming segmentation borders. Also,
non-verbal actions, such as body language, gestures etc., can be important for
recognising the start/end of a unit or action (cf. Holsanova 2001, Chapter 5).
In a study on radio debates, Korolija (1998) reached the conclusion that listen-
ers most easily recognise leitmotifs or macro topics.
5. Conclusion
chapter 2
Structure and content of spoken picture descriptions
This chapter will focus on the structure and content of spoken picture descrip-
tions. First, I will describe and illustrate various types of units (foci and su-
perfoci) extracted from an off-line picture description (i.e. descriptions from
memory) in an interactive setting. Second, I will compare the distribution of
the different foci in this setting to data from other studies.
1. Taxonomy of foci
Let me first introduce a transcript extract from a study where informants de-
scribed the picture off-line in an interactive setting (Holsanova 2001:7ff.).
Twelve informants had each examined a complex picture for a period of time
(1 minute). Then the picture was removed and they described the contents of
the picture to a listener. Let us look at an extract from this study (Example 1).
One informant in the second half of his description mentions the tree with
birds in it.
Example 1
0637 in the middle there was a tree
0638 the lower . part of the tree was crooked
0639 there were . birds
0640 it looked like a happy picture,
0641 everybody was pleased and happy,
0642 eh and (2 s) the fields were brown,
0643 eh it was this kind of topsoil,
0644 it wasn't sand,
0645 and then we had . it was in this tree,
0646 there were . it was like a . metaphor of . the average Swede there,
0647 that you have a happy home
0648 the birds are flying
0649 the birds were by the way characteristic,
0650 they were stereotyped pictures of birds,
0651 big bills
0652 Xxx human traits
As we can see in this example, the informant describes the birds and at the
same time evaluates the atmosphere of the picture (lines 0640–0641), describes
the quality of the soil (lines 0642–0644), formulates a similarity by using a
metaphor (lines 0646–0648) and finally characterises some of the elements de-
picted (lines 0649–0652). He is not only expressing ideas about states, events
and referents in the picture but also evaluating these in a number of ways. This
can be generalised for all picture descriptions in this study: The informants do
not only talk about WHAT they saw in the picture (the referents, states and
events) but also HOW they appeared to them. In other words, informants also
express their attitudes towards the content; they evaluate and comment on dif-
ferent aspects of the scene.
The way we create meaning from our experience and describe it to others
can be understood in connection to our general communicative ability. Ac-
cording to Lemke (1998:6), when we make meaning we always simultaneously
construct a presentation of some state of affairs, orient to this presentation
and orient to others, and in doing so create an organised structure of relat-
ed elements. The presentational, orientational and organisational functions
. These elements of spoken descriptions reflect the fact that our consciousness is to a
large extent made up of experiences of perceptions and actions, accompanied by emotions,
opinions and attitudes etc. In addition to perceptions, actions and evaluations, there are
sometimes even introspections or meta-awareness (Chafe 1994:31).
Let me start with the presentational function, which includes substantive, sum-
marising and localising foci. In these units of speech, speakers introduce us to a
scene, to scene elements (animate and inanimate, concrete and abstract) and
their relations, to processes and to circumstances.
to be a sort of/ I don't remember what is it called, there are those things hanging
there, sort of seed/ seed things in the tree). Often, this focus type constitutes a
whole superfocus.
Another common pattern is that the speaker uses an abstract name for activi-
ties (cultivating process) that s/he then elaborates within the following cluster
of verbal foci (list) as in Example 4.
Example 4
0508 and then one can see . different parts/ eh well, steps in the . cultivating process sum
0509 he's digging, subst list
0510 he's raking subst list
0511 then he's sowing, subst list
The relation between the summarising focus (1102) and the subsequent sub-
stantive foci is characterised as a part-whole relation or as conceptual depen-
dence (Langacker 1987:306). Sowing here subsumes a schema for an action: a
person (1103–1105) and a cat (1109) are sowing and performing several activi-
ties (1106–1111) in a specific temporal order (first, and then, it ends).
The next types of units, the evaluative and expert foci, belong to the orienta-
tional function. Here, speakers evaluate the scene by expressing their attitude
to the content, to the listener or even to the listener's attitude. The emphasis on
different elements and the way they are evaluated contributes indirectly to the
speaker's positioning. For example, the speaker can position herself/himself as
an expert.
Table 1. Evaluative foci and transcription examples

i. Speaker's reaction to the picture on the whole
Example: there are very many things in the picture

ii. Speaker's categorisation of the different picture elements
Example: eh . the birds are . of indefinite sort, they are fantasy birds, I would think

iii. Speaker's attitude to the presentation of the different picture elements
Example: Pettson's facial expression is fascinating, although he seems to be quite anonymous he reflects a sort of emotional expression I think

iv. Speaker's comments on the visual properties of a picture element
Example: the daffodils are enlarged, almost tree-like, they are the wrong size in relation to a real daffodil

v. Speaker's comments on the interaction of the characters in the picture
Example: Findus the cat helps Pettson as much as he can, they sow the seeds very carefully place them one after the other it appears
but also, at a higher level, the genre and painting technique of the picture. By
doing this, they show their expertise. In interpersonal or social terms, the
speakers indirectly position themselves towards the listener.
Example 6
--> 1015 the actual picture is painted in water colour,
--> 1016 so it has/ eh . the picture is not/
--> 1017 it is naturalistic in the motif (theme)
--> 1018 but it's getting close to the genre of com/ comic strips,
Expert foci include explanations and motivations where the informants show
their expertise in order to rationalise and motivate their statements or interpre-
tations. Such explanatory units are often delivered as embedded side sequences
(superfoci) and marked by a change in tempo or acceleration. The comment is
started at a faster speed, and the tempo then decreases with the transition back
to the exposition. The explicative comment can also be marked by a change in
voice quality, as in the following example, where it is formulated in a quiet
voice (cf. Chapter 3, Section 1.1.2). In Example 7, the expert knowledge of the
participant is based on a story from a children's book and the experience of
reading aloud to children.
Example 7
0112 And Findus,
0113 he has sown his meatball,
--> 0114 (lower voice) you can see it on a little picture,
--> 0115 this kind of stick with a label
--> 0116 he has sown his meatball under it
0117 (higher voice) and then there is Findus there all the time,
The last group of units, including the interactive, introspective and metatextual
units, constitutes the third, organisational function.
Table 2. Introspective foci & metacomments and transcription examples

i. Informants think aloud
Example: let's see, have I missed something? I'll think for a moment: Pettson is the old guy and Findus is the cat

ii. Informants comment on the memory process
Example: ehm, it's starting to ebb away now; I don't remember his name; eh actually, I don't know what the fellow in the middle was doing

iii. Informants make procedural comments
Example: let's see if I continue to look at the background; I have described the left hand side and the middle

iv. Informants express the picture content on a textual metalevel, referring to what they had already said or not said about the picture
Example: I don't know if I said how it was composed

v. Informants ask rhetorical questions and reveal steps of planning in their spoken presentation through an inner monologue
Example: what more is there to say about it?; what more happened?

vi. Informants think aloud in a dialogic form: one person systematically posed questions to himself, and then immediately answered them by repeating the question as a full sentence. The speaker's metatextual foci pointed either forward, toward his next focus, or backward
Example: eh what more can I say, I can say something about his clothes; eh well, what more can I say, one can possibly say something about the number of such Pettson characters

vii. Informants recapitulate what they have said or summarise the presented portions of the picture or the picture as a whole
Example: eh I mentioned the insects/ I mentioned the small animals; so that's what's happening here; that's what I can say; that was my reflection
evaluative and expert foci, representing the orientational (or attitudinal) func-
tion. In these types of foci informants express their opinions about various
aspects of the picture contents and comment on the picture genre from the
viewpoint of their own special knowledge. Finally, speakers used interactive,
introspective and metatextual foci, serving the organisational function. They
were thinking aloud and referring to their own spoken presentation on a
textual metalevel. The informants used each focus type either as a single verbal
focus or as a superfocus consisting either of multiple foci of the same
type (subst list) or a combination of different focus types (e.g. sum and list,
subst and loc, subst and eval etc.). Apart from reporting WHAT they saw,
the informants also focused on HOW the picture appeared to them. In other
words, the informants were involved in both categorising and interpreting
activities.
28 Discourse, Vision and Cognition
In the above section, a thorough content analysis of the picture descriptions was
presented. The question is, however, whether the developed taxonomy would
hold up in other settings. For instance, would the repertoire of focus types vary
depending on whether the informants described the picture in the presence or
in the absence of listeners? Would there be any influence on the taxonomy and
distribution of foci in a narrative task where the informant has to tell a story
about what happens in the picture? Would other types of foci constitute the
discourse when informants describe the picture either off-line (from memory)
or on-line (while viewing it)? Last but not least, what does
the structure and content of descriptions look like in a spontaneous conversa-
tion? Do the same types of foci occur and is their distribution the same?
In the following section, we will discuss the above-mentioned types of foci
and their distribution in different settings by comparing the presented tax-
onomy with data from different studies on picture description. My starting
point for this discussion will be the data from the off-line descriptions in an
interactive setting reported above and from three additional studies that I have
conducted later on. One of the studies concerns simultaneous verbal descrip-
tion: the informants describe the complex picture while, at the same time, their
visual behaviour is registered by an eye tracker (cf. Chapter 5, Section 2). The
next study concerns off-line description after a spatial task: the informants de-
scribe the same picture from memory while looking at a white board in front of
them. Previously, they conducted a spatially oriented task on mental imagery
(cf. Chapter 4, Section 3.1). Yet another study is concerned with simultaneous
(on-line) description with a narrative priming: the informants describe the
same picture while looking at it, and the task is to tell a story about what is
happening in the picture (cf. Chapter 4, Section 3.2). Apart from these four data
sets, I will also briefly comment on the structure and content of spontaneous
descriptions in conversation (Holsanova 2001:148ff.).
Before comparing the structure and content of picture descriptions from
different settings on the basis of focus types, let me start with some assump-
tions.
i. The repertoire of focus types is basically the same and can thus be gener-
alised across different settings.
ii. However, the proportion of focus types in descriptions from different set-
tings will vary.
Chapter 2. Structure and content of spoken picture descriptions 29
iii. In the off-line description, informants have inspected the picture before-
hand, have gained a certain distance from its contents and describe it from
memory. The off-line setting will thus promote a summarising and inter-
preting kind of description and contain a large proportion of sum, intro-
spect and meta. The metacomments will mainly concern recall. Finally,
the informants will find opportunity to judge and evaluate the picture, and
the proportion of eval will be high.
iv. In the on-line description, informants have the original picture in front
of them and describe it simultaneously. Particularities and details will be-
come important, which in turn will be reflected in a high proportion of
subst, list foci and superfoci. The real-time constraint on spoken produc-
tion is often associated with cognitive effort. This will lead to a rather high
proportion of subst foci with categorisation difficulties (subst cat.diff.).
The picture serves as an aid for memory and the describers will not need to
express uncertainty about the appearance, position or activity of the refer-
ents. Thus, the description formulated in this setting will not contain many
modifications. If there are going to be any metacomments (meta), they
will not concern recall but rather introspection and impressions. The fact
that the informants have the picture in front of them will promote a more
spatially oriented description with a high proportion of loc.
v. In the interactive setting, informants will position themselves as experts
(expert) and use more interactive foci (interact). The presence of a
listener will stimulate a thorough description of the picture with many
subst foci and bring up more ideas and aspects which will cause a longer
description (length). The informants will probably express their uncer-
tainty in front of their audience, which can affect the number of epistemic
and modifying expressions. Finally, the presence of a listener will contrib-
ute to a higher proportion of foci with orientational and organisational
functions.
vi. In the eye tracking condition, informants will probably feel more observed
and might interpret the task as a memory test. In order to show that they
remember a lot, they will group several picture elements together and use
a high proportion of sum. The fact that they are looking at a white board
in front of them will promote a more static, spatially focused description
with many local relations (loc). The description will be shorter than in the
on-line condition where the picture is available and during the interaction
condition where the listener is present (length).
Figure 1. The proportion of substantive foci (subst) and substantive foci with cat-
egorisation difficulties (subst cat.diff.) in four studies on picture description.
Figure 2. The proportion of subst list foci and summarising foci (sum) in four stud-
ies on picture description.
way, F = 6.428, p = .0012, Tukey HSD test). It can partly be explained by the
temporal delay between picture viewing and picture description: informants who
inspected the picture beforehand gained distance from the particularities in the
picture contents and were involved in summarising and interpreting activities.
The other explanation is the fact that the grouping of several picture elements
can be an effective way of covering all the important contents of the picture.
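The proportions behind these comparisons are simple relative frequencies: the number of foci of a given type divided by the total number of coded foci in a description. A minimal sketch, with an invented coding of one short description (the tags follow the taxonomy above; the data do not come from the reported studies):

```python
from collections import Counter

# Hypothetical focus-type coding of one short picture description.
# Tags follow the taxonomy above (subst, loc, sum, eval, subst list, meta).
foci = ["subst", "subst", "loc", "sum", "subst", "eval", "subst list",
        "meta", "subst", "sum", "loc", "subst"]

counts = Counter(foci)
total = len(foci)
proportions = {tag: round(n / total, 2) for tag, n in counts.items()}
print(proportions)
```

Tallies of this kind, computed per informant and per setting, are presumably what underlie the bar charts in the figures.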
Figure 4. The proportion of evaluative foci (eval), and expert foci (expert) in four
studies on picture description.
[Charts: the proportion of introspective and metatextual foci, and of interactive foci, in four studies on picture description; a further chart compares description length across the four studies.]
depended on the explanations of the instructor, on the questions and answers
of the drawer, and on their agreement about the size of the picture elements,
their proportions, spatial relations and so on.
3. Conclusion
The aim of this chapter has been twofold: (a) to introduce a taxonomy of foci
that constituted the structure and content of the off-line picture descriptions
and (b) to compare types of foci and their distribution with picture descrip-
tions produced in a number of different settings.
Concerning the developed taxonomy, seven different types of foci have
been identified serving three main discourse functions. The informants were
involved both in categorising and interpreting activities. Apart from describing
the picture content in terms of states, events and referents in an ideational or
presentational function, informants naturally integrated their attitudes, feel-
ings and evaluations, serving the interpersonal function. In addition, they also
related the described elements and used various resources to create coherence.
This can be associated with the organisational function of discourse. Substan-
tive, summarising and localising foci were typically used for presentation of
picture contents. Attitudinal meaning was expressed in evaluative and expert
foci. A group of interactive, introspective and metatextual foci served the
regulatory or organising function. Informants thought aloud, making comments
on memory processes, on steps of planning, on procedural aspects of
their spoken presentation etc. In sum, besides reporting about WHAT they saw
on the picture, the informants also focused on HOW the picture appeared to
them and why. Three additional sets of data were collected in order to find out
how general these focus types are: whether a narrative instruction or a spatial
priming in off-line and on-line condition (with or without eye tracking) would
influence the structure and the content of picture description and the distribu-
tion of various types of foci.
Concerning types of foci and their distribution in four different settings, we
can conclude that the repertoire of foci was basically the same across the differ-
ent settings, with some modifications. The expert foci, where speakers express
their judgements on the basis of their experience and knowledge, were only
found in the off-line condition. This suggests that the off-line condition gives
more freedom and opportunities for the speakers to show their expertise and,
indirectly, to position themselves. Further, the subgroup of substantive foci,
The previous chapter dealt with structure and content of picture descriptions,
in particular with the taxonomy of foci. In this chapter, we will take a closer
look at how speakers create coherence when connecting the subsequent steps
in their descriptions. First, we will discuss transitions between foci and the
verbal and non-verbal methods that informants use when creating coherent
descriptions in an interactive setting. Second, description coherence and con-
nections between foci will be illustrated by a spontaneous description of a vi-
sual environment accompanied by drawing.
The results in the first part of this chapter are based on the analysis of the off-
line picture descriptions with a listener (cf. Chapter 2, Section 1 and Holsanova
2001:38ff.). As was mentioned in connection with discourse segmentation in
Chapter 1, hesitation and short pauses often appear at the beginning of verbal
foci (eh . in this picture there are also a number of . minor . figures/ eh fantasy
– Pauses and hesitations are significantly longer when speakers move to another superfocus than when they move internally between foci within the same superfocus. Pauses between superfoci are on average 2.53 seconds long, whereas internal pauses between foci measure only 0.53 seconds on average (p = .0009, one-sided t-test). This is consistent with the results of psycholinguistic research on pauses and hesitations in speech (Goldman-Eisler 1968) showing that cognitive complexity increases at the choice points of ideational boundaries and is associated with a decrease in speech fluency. This phenomenon also applies to the process of writing (cf. Strömqvist 1996; Strömqvist et al. 2004), where the longest pauses appear at the borders of large units of discourse.
– Pauses and hesitations get longer when speakers change position in the picture (i.e. from the left to the middle) compared to when they describe one and the same picture area. This result is consistent with the studies of subjects' use of mental imagery demonstrating that the time to scan a visual image increases linearly with the length of the scanned path (Kosslyn 1978; Finke 1989). Kosslyn looked at links between the distance in the pictorial representation and the distance reflected in mental scanning. Mental imagery and visualisations associated with picture descriptions will be discussed further in Chapter 8, Section 2.
– Pauses get longer when the speaker moves from a presentation of picture elements on a concrete level (three birds) to a digression at a higher level of abstraction (let me mention the composition) and then back again.
– The longest pauses in my data appear towards the end of the description, at the transition to personal interaction, when the description is about to be concluded.
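The superfocus/focus pause comparison above can be replicated on any set of measured pause durations with a pooled two-sample t statistic. A minimal sketch with invented pause values, chosen only to mirror the reported means of 2.53 s and 0.53 s; this is not the author's data or analysis script:

```python
import math
from statistics import mean, variance

def one_sided_t(a, b):
    """Student's two-sample t statistic (pooled variance), H1: mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical pause durations in seconds (means match the reported 2.53 / 0.53)
superfocus_pauses = [2.1, 3.0, 2.8, 2.4, 2.6, 2.3]   # at superfocus boundaries
internal_pauses   = [0.4, 0.6, 0.5, 0.7, 0.5, 0.5]   # within a superfocus

t = one_sided_t(superfocus_pauses, internal_pauses)
print(round(t, 2))   # a large positive t supports longer boundary pauses
```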
Let us have a closer look at some explanations. I propose that there are three as-
pects contributing to the mental distance between foci. The first one is the the-
matic distance from the surrounding linguistic context in which the movement
Chapter 3. Description coherence & connection between foci 41
takes place. As mentioned earlier, the picture contents (the referents, states and
events) are mainly described in substantive and summarising foci and super-
foci. Thus, this aspect is connected to the presentational function of discourse.
There are two possibilities: the transition between foci can either take place
within one superfocus or between two superfoci. For example, speakers can de-
scribe within one and the same cluster of verbal foci all the articles of clothing
worn by Pettson or all the activities of the birds in the tree. Alternatively, while
introducing new referents, the speakers can jump from one area to another in
their description (from the tree in the middle to the cows in the background).
Since the superfoci are characterised by a common thematic aspect (cf. Chap-
ter 1, Section 2.1), it is easier to track referential and semantic relations within
a superfocus than to jump to another topic. Thus, when the thematic aspect
changes and the speaker starts on a new superfocus, a larger mental distance
must be bridged (cf. Kosslyn 1978).
The second aspect is the thematic distance from the surrounding pictorial
context. As was concluded earlier, describers do not only focus on a pure de-
scription of states, events and referents but also add evaluative comments on
the basis of their associations, knowledge and expertise. Because of that, we
have to count on a varying degree of proximity to the picture elements or, in
other words, with a varying extent of freedom of interpretation. This aspect
is associated with the change from a presentational to an orientational (or at-
titudinal) function. A change of focus from a concrete picture element to the
general characteristics of the painting technique or to an expertise on children's
book illustrations means not only a change of topic, but also a shift between the
concrete picture world and the world outside of the picture.
Example 1
0737 and you can also see on a new twig small animals
0738 that then run off probably with a daffodil bulb
0739 on a . wheelbarrow
0740 in the bottom left corner
--> 0741 eh I don't know if I said how it was composed,
0742 that is it had a background which consisted of two halves in the
foreground,
0743 (quickly) one on the left and one on the right,
--> 0744 (slowly) ehh . and . eh oh yes, Findus . hes watering
In Example 1, we can follow the informant moving from a description at a
concrete level, when speaking about small animals (0737–0740) towards a
Apart from pauses and hesitations at the borders of speech units, we can find
various other bridging cues, such as discourse markers, changes of loudness
and voice quality, as well as stressed localisations.
In the data from the off-line picture descriptions with a listener, we can find
many discourse markers that fulfil various functions: they reflect the planning
process of the speaker, help the speaker guide the listener's attention and signal
relations between the different portions of discourse. Grosz & Sidner (1986)
use the summary term cue phrases, i.e. explicit markers and lexical phrases
that together with intonation give a hint to the listener that the discourse struc-
ture has changed.
Example 2
0513 its bright green colours . and bright sky
--> 0514 then we have the cat
0515 that helps with the watering
0516 and sits waiting for the . seeds to sprout
0517 and he also chases two spiders
--> 0518 and then we have some little character in/down in the ... left corner,
0519 that take away an onion,
0521 I don't know what (hh) sort of character
--> 0522 then we have some funny birds . in the tree
Figure 1. A paratactic transition closes the referential frame of the earlier segment
and opens a new segment.
A transition that closes the current referential frame and opens a new segment
is called a paratactic transition. According to Redeker (2006), a paratactic se-
quential relation is a transition between segments that follow each other on the
same level, i.e. a preplanned list of topics or actions. The attentional markers
introducing such a transition have consequences for the linear segments with
respect to their referential availability. The semantic function of the paratac-
tic transitions is to close the current segment and its discourse referents and
thereby activate a new focus space.
Example 3
0112 and Findus,
0113 he has sown his meatball,
--> 0114 (lower voice) you can see it on a little picture,
--> 0115 this kind of peg/stick with a label
--> 0116 he has sown his meatball under it
0117 (higher voice) and then there is Findus there all the time,
In Example 3, the speaker is making a digression from the main track of de-
scription (substantive foci 01120113) to a comment, pronounced in a lower
voice (01140116). She then returns to the main track by changing the volume
of her voice and by using the marker and then (0117). This kind of transition
can be schematised as follows (Figure 2).
A transition that hides the referential frame from the previous segment in
order to embed another segment, and later returns to the previous segment is
called a hypotactic transition. According to Redeker (2006), hypotactic sequen-
tial relations are those leading into or out of a commentary, correction, para-
phrase, digression, or interruption segment. Again, the attentional markers
introducing such a transition have consequences for the embedded segments
with respect to their referential availability. The hypotactic transitions signal
an embedded segment, which keeps the earlier referents available at an earlier
level. Such an embedded segment usually starts with a so-called push marker
or next-segment marker (Redeker 2006) (such as that is, I mean, I guess, by the
way) and finishes with a so-called pop marker or end-of-segment marker (but
Figure 2. A hypotactic transition hides the referential frame from the previous seg-
ment, while embedding another segment, and later returning to the previous segment.
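Redeker's push and pop markers map naturally onto a stack of focus spaces, in the spirit of Grosz & Sidner's attentional state. The following toy model is my own illustration, not an implementation from the source; the class and method names are invented:

```python
# Toy model of referential availability under paratactic vs. hypotactic
# transitions: a stack of focus spaces, each holding its discourse referents.
class FocusStack:
    def __init__(self):
        self.stack = [set()]

    def mention(self, referent):
        self.stack[-1].add(referent)

    def paratactic(self):
        """Close the current segment and open a sibling: old referents are gone."""
        self.stack.pop()
        self.stack.append(set())

    def push(self):
        """Hypotactic 'push marker': embed a segment; parent referents stay open."""
        self.stack.append(set())

    def pop(self):
        """'Pop marker': leave the embedded segment, return to the parent."""
        self.stack.pop()

    def available(self):
        return set().union(*self.stack)

fs = FocusStack()
fs.mention("Findus")
fs.push()                        # e.g. "(lower voice) you can see it on a little picture"
fs.mention("stick with a label")
print(fs.available())            # Findus still available inside the digression
fs.pop()                         # "(higher voice) and then ..."
fs.paratactic()                  # e.g. "then we have the cat"
print(fs.available())            # earlier referents are closed off
```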
In a recent study, Bangerter & Clark (2003) examine the coordination of joint activities in dialogue and distinguish between vertical and horizontal transitions. Participants signal these transitions with the help of various project markers, such as uh-huh, m-hm, yeah, okay, or all right.
634(B) .. mhm
635(A) 1.22 then we have .. eh 1.06 where <DRAWS> subst list 1c
the faucet comes out
636(A) 1.97 and then we have .. all the other <DRAWS> subst list 1d
.. furnishings here,
637(A) 1.53 we have= usually a=eh <DRAWS> subst list 2
638(A) 4.00 vanity they call it there, naming
639(A) 1.04 where . the washbasin is built design
in
640(B) mhm
641(A) but it's a 1.09 a .. piece, a part subst
of the washbasin, detail, design
642(B) 0.62 mhm
643(A) and it sits .. on the counter itself, <SHOWS> loc
644(A) which means that if the water arg
overflows
645(A) then you have to try to .. force it over cause (if-then)
the rim, back again,
646(B) <IRONIC> 1.00 which is very eval (attitude)
natural .. for us
647(A) 1.09 so the washbasin is actually loc
resting upon it,
648(A) so if you have a cross-section sum
here,
649(B) m[hm]
650(A) [here we] have the very counter <DRAWS> subst list
651(C) mhm
652(A) then we have the washbasin,
653(A) it goes up here like this <DRAWS> subst
654(A) and down, <DRAWS>
655(A) 1.67 and up, <DRAWS>
656(A) 1.24 of course when the water <SHOWS> arg 1
already has come over here conseq. (if-then)
657(A) then it won't go back again,
658(B) m[hm]
659(A) [y]ou have to force it over,
660(A) .. and these here <SHOWS> arg 2
661(A) 1.04 eh=if the caulking under here is conseq. (if-then)
not entirely new and perfect
Figure 3. Figure 4.
calls it the room. The bathtub is introduced in 630 and refocused in 671. How
do the interlocutors handle referential availability? How do they know where
they are and what the speaker is referring to? How can speakers and listeners
retain nominal and pronominal reference for such a long time?
This is a continuation of my general question posed in Chapter 1, whether
our attentional resources enable us only to focus on one activated idea at a time
or if we, at the same time, can keep track of the larger units of discourse. Linde
(1979) and Grosz & Sidner (1986) suggest that the interlocutors can do so by
simultaneously focusing on a higher and a lower level of abstraction.
The use of the same item for accomplishing these two types of reference
suggests that, in discourse, attention is actually focused on at least two levels
simultaneously – the particular node of the discourse under construction and,
also, the discourse as a whole. Thus, if the focus of attention indicates where
we are, we are actually at two places at once. In fact, it is likely that the number
is considerably greater than two, particularly in more complicated discourse
types. (Linde 1979:351)
others' gaze behaviour, mimics, pointing gestures etc. We must not forget that
the speaker's gaze (at the drawing, at his own gestures, at the artefacts in the
environment) can affect the attention of the listeners. Furthermore, the draw-
ing that is created step by step in the course of the verbal description is visible
to all interlocutors. The speaker, as well as the others in the conversation, can
interact with the drawing; they can point to it and refer to things and relations
non-verbally. Thus, the drawing represents a useful tool for answering where
are we now? and functions as a storage of referents or as an external memory
aid for the interlocutors. This also means that the interlocutors do not have to
keep all the referents in their minds, nor always mention them explicitly. Apart
from that, the drawing has been used as a support for visualisation and as an
expressive way of underlining what is being said. Finally, it serves as a rep-
resentation of a whole construction problem discussed in the conversation. I
suggest that the common/joint focus of attention is created partly via language,
partly by the non-verbal actions in the visually shared environment. For these
reasons, the discourse coherence becomes a situated and distributed activity
(cf. Gernsbacher & Givón 1995; Gedenryd 1998:201f.). This may be the reason
why we use drawings to help our listeners understand complex ideas.
is very natural for us) before the speaker refocuses the localisation of the wash-
basin (647). In other words, what we follow are the hypotactic (embedded) and
the paratactic (linear) relations between the verbal foci.
This can be applied to our spontaneous description. Although the semantic as-
pect seems to be salient, not all verbal foci are part of the semantic hierarchy on
the content level. Apart from descriptions of bathroom objects (door, bathtub,
faucet, other furnishings), their properties and their placement – even
argumentative, interactive and evaluative foci are woven into the description.
The schematic figure illustrates the complexity in the thicket of spoken de-
scriptions. Despite this brushwood structure and many small jumps between
different levels, the speaker manages to guide the listeners' attention, lead them
through the presentation and signal focus changes so that the description
appears fluent and coherent.
Which means are used to signal focus changes and to create coherence in the
data from spontaneous descriptions? The answer is that verbal, prosodic/acous-
tic and non-verbal means are all used. The most common verbal means are dis-
course markers (anyway, but anyway, so, and so, to start with) that are used for
reconnecting between superfoci (for an overview see Holsanova 1997a:24f.).
A lack of explicit lexical markers can be compensated by a clear (contrastive)
intonation. Talking in a louder voice usually means a new focus, while talking
in a lower voice indicates an embedded side comment. A stressed rhythmic
focus combined with a synchronous drawing (updownup) makes the listener
attend to the drawn objects. Prosody and the acoustic quality thus give us im-
portant clues about the interpretation of the embedding and coherence of an
utterance.
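Identifying such reconnecting discourse markers in a transcript amounts to a longest-prefix match against a marker list. A toy sketch with invented transcript lines, mixing the markers listed above with the "then we have" pattern from Example 2 (this is not the author's coding procedure):

```python
import re

# Abbreviated, illustrative marker list (see the overview cited above).
MARKERS = ("but anyway", "anyway", "and so", "so", "to start with", "then we have")

lines = [
    "then we have the cat",
    "that helps with the watering",
    "but anyway the tree is full of birds",
]

def leading_marker(line):
    """Return the longest marker at the start of a focus, or None."""
    for m in sorted(MARKERS, key=len, reverse=True):   # longest match first
        if re.match(rf"{re.escape(m)}\b", line):
            return m
    return None

print([leading_marker(l) for l in lines])
```

Matching the longest marker first keeps "but anyway" from being misread as plain "anyway"; the word boundary `\b` keeps "so" from matching inside words like "something".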
Last but not least, interlocutors often use deictic expressions (like this, and
this here) in combination with non-verbal actions (drawing, pointing, gestur-
ing). Deictic means help move the listener's attention to the new objects in the
listener's immediate perceptual space (cf. attention movers in Holmqvist &
Holsanova 1997). Demonstrative pronouns are sometimes used to draw atten-
tion to the corresponding (subsequent) gesture. Attention is directed in two
ways: the speakers are structuring their speech and also guiding the listeners'
attention.
Deictic gestures in spatial descriptions can co-occur with referential expressions and
anchor the referent in space on an abstract level. Alternatively, they work on a concrete
level and make a clear reference to the actual object by mapping out the spatial relationship
explicitly (cf. Gullberg 1999).
3. Conclusion
In this chapter, we have focused on how speakers connect the subsequent steps
in their description and thereby create discourse coherence. We discussed dif-
ferent degrees of mental distance between steps of description and various
means that informants use to connect these descriptive units of talk. To summa-
rise, discontinuities, such as pauses and hesitations, appear within and between
foci, but are largest at the transition between foci and superfoci. Both speakers
and listeners must reorient themselves at transitions. I proposed that the mental
leaps between foci are influenced by three factors: (a) the thematic distance to
the surrounding linguistic context, (b) the thematic distance to the surrounding
pictorial context, and (c) the performative distance to the descriptive discourse.
In other words, the discontinuity is dependent on which (internal or external)
worlds the speaker moves between. If the speaker stays within one thematic
superfocus, the effort at transitions will not be very big. Conversely, the lower
the degree of thematic closeness, the larger the mental leaps. If the speaker
moves between a description of a concrete picture element and a comment
on the painting technique, i.e. between a presentational and an orientational
function, larger reorientations will be required. The biggest hesitations and the
longest pauses are found at the transitions where the speaker steps out of the
description and turns to the metatextual and interactional aspects of the com-
municative situation (fulfilling the organisational discourse function).
I also concluded that transitions between the sequential steps in the picture
descriptions are marked by lexical and prosodic means. Focus transitions are
often initiated by pauses, hesitations and discourse markers. Discourse mark-
ers are more or less conscious signals that reveal the structuring of the speech
and introduce smaller and larger steps in the description. Speakers may re-
focus on something they have already said or point forward or backward in
the discourse. Such markers have consequences for the linear (paratactic) and
the embedded (hypotactic) segments with respect to referential availability.
Moreover, speakers use prosodic means such as changes in loudness or voice
quality to create transitions. Finally, in the process of focusing and refocusing,
speakers use stressed localising expressions to bridge the foci.
In a spontaneous description formulated outside the laboratory, the hier-
archical structure contains many small jumps between different levels. Verbal,
prosodic/acoustic and non-verbal means are all used to signal focus changes
and to create coherence. Although the semantic aspect seems to be salient,
apart from descriptions of objects, their properties and their placement, many
argumentative and evaluative foci are woven into the description. Despite the
complexity, the speaker manages to guide the listeners attention and to lead
them through the presentation. The interlocutors seem to retain nominal and
pronominal references for quite a long time. There are several explanations for
this phenomenon: a) simultaneously focusing on both a higher and a lower
level of abstraction, b) switching between active and semiactive information
and c) using situation awareness and mutual visual access (e.g. by observing
each others pointing, gazing and drawing).
In the spontaneous description, the linguistic and cognitive structuring of
the description is situationally anchored. The interlocutors make use of both
verbal means and non-verbal means of focusing and refocusing. Thus, joint fo-
cus of attention is created partly through language, partly through non-verbal
actions in the visually shared environment. The drawing then takes on the
function of referent storage, an external memory aid for the interlocutors. The
interlocutors look for and provide feedback, and a continuous mutual adaptation
takes place throughout the description (Strömqvist 1998).
In the current chapter, we took a closer look at how speakers create co-
herence when connecting the subsequent steps in their descriptions. The next
chapter will be devoted to different description styles.
chapter 4
Variations in picture description
At this point, I would like to remind the reader of our discussion regarding
the distribution of different types of foci in Chapter 2. Already there, it was
assumed that a description rich in localisations would have a tendency to be
more static, whereas a description rich in narrative aspects would tend to be
more dynamic. In the results of the analysis of the picture descriptions from
Example 3
0859 and then you can see the same man as on the left of the picture
0860 in three different . positions
0861 on the very right
0862 . ehh, there he stands and turns his face to the left,
0863 and digs with the same spade,
0864 and I think that he has his left arm down on the spade,
0865 his right hand further down on the spade,
0866 eh and then he has his left foot on the spade,
to the right of this figure
0868 the same man is standing
0869 turning his face to the right
The informants using the static style describe the objects in great detail. They
give a precise number of picture elements, state their colour, geometric form
and position. They deliver a detailed specification of the objects and, when
enumerating the picture elements, they mainly use nouns:
Example 4
0616 eh the farmers were dressed in the same way
0617 they had . dark boots
0618 . pants
0619 light shirt
0620 such a armless . armless . jacket, or what should I say
0621 hat light hat
0622 beard
0623 glasses
0624 eh a pronounced nose,
expressions also indirectly mediate the speaker's way of moving about in the
discourse. In the static description style, focusing and refocusing on picture
elements is done by localising expressions in combination with stress, loudness
and voice quality (cf. Chapter 3, Section 1.1.2). Few or no discourse markers
were used in this function.
The different phases are introduced by using temporal verbs (starts), temporal
adverbs (then, and then, later on), and temporal subordinate clauses (when he's
ready). The successive phases can also be introduced using temporal
prepositions (from the moment when he's digging to the moment when he's
raking and sowing).
Some speakers are particularly aware of the temporal order and correct
themselves when they describe a concluded phase in the past tense:
Example 8
0711 eh on the one half of it
0712 one can see . eh Pettson
--> 0713 when he's . digging in the field
--> 0712 and/ or he's done the digging of the field,
0715 when he's sitting and looking at the soil,
Informants have also noticed differences in time between various parts of the
picture. The informant in the next example has analysed and compared both
sides of the picture and presented evidence for the difference in time:
Example 9
--> 1151 . in a way it feels that the left hand side is . the spring side
--> 1152 because . on the right hand side there is some . raspberry thicket .
or something,
--> 1153 that seems to have come much longer
--> 1154 than the daffodils on the left hand side,
The dynamic quality is achieved not only by the use of temporal verbs (starts
with, ends with) and temporal adverbs (first, then, later on), but also by a
frequent use of motion verbs in the active voice (digs, sows, waters, rakes, sings,
flies, whips, runs away, hunts).
Also the frequent use of so-called pseudo-coordinations (constructions
like fågeln sitter och ruvar på ägg; the bird is sitting and brooding on the eggs)
contributes to the rhythm and dynamic character of the description (for details,
cf. Holsanova 1999a:56, 2001:56f.).
Example 10
--> 0404 then . he continues to dig
0405 and then he rakes
0406 with some help from Findus the cat
--> 0407 and . then he lies on his knees and . sows . seeds
0408 and then Findus the cat helps him with some sort of ingenious
watering device
--> 0409 LAUGHS and then Findus the cat is lying and resting,
0410 and he jumps around in the grass
0411 xxx among ants,
0412 and then a little character in the lower left corner
0413 with an onion and a wheelbarrow
0414 a small character
0415 and then in the tree there are bird activities
0416 one is tidying up the nesting box
--> 0417 one is standing and singing
--> 0418 and one is lying and brooding on the eggs,
0419 then you can see some flowers
0420 whether it's . yellow anemone or yellow star-of-Bethlehem or
something
--> 0421 and . cows go grazing in the pasture,
In the dynamic description style, speakers do not give spatial perception
the same weight as temporal perception. Thus, we do not find precise
localisations, and spatial expressions are rare. The few localising expressions
that the informants use are rather vague (in the air, around in the picture,
at strategically favourable places, in the corners, in a distance).
Another distinguishing feature is that discourse markers are used to focus
and refocus the picture elements, and to bridge and connect them (cf. Chapter 3,
Section 1.1). Last but not least, the difference in description style was also
reflected on the content level, in the number of perceived (and reported)
characters. The informants with the prototypical dynamic style perceived
one Pettson (and one cat) figure at several moments in time, whereas the static
describers often perceived multiple figures in different positions.
How are these results related to the focus types presented in Chapter 2? One
could assume that localising foci and substantive foci would be characteristic
of the static style, whereas evaluative foci, introspective foci and a substantive
listing of items (mentioning different activities) would be typical of a dynamic
style. A t-test showed, however, that localising foci (loc) were not the most
important predictors of the static style. They were frequently present, but
instead expert foci and meta foci reached significance as typical of the
static style (p = 0.03). Concerning the dynamic style, evaluative foci were quite
dominant but not significant. list of items was typical and close to significant
(p = 0.06). Both the temporal aspect and the dynamic verbs, which turned out
to be typical of a dynamic description style, were part of the substantive and
Table 2. The distribution of the most important linguistic variables in the twelve
descriptions (off-line interact). Columns: Subject No., # foci, style, # and % of
there is, # and % of spatial expressions, # and % of temporal expressions, # and
% of dynamic verbs. [data rows not recoverable]
summarising foci. Table 1 summarises the most important features in the two
description styles.
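The t-test comparison above can be sketched in a few lines of code. The following is only a minimal illustration, not the study's actual computation: the per-informant focus counts below are invented for demonstration, and Welch's unpooled-variance variant of the test is assumed.

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical counts of expert/meta foci per description, grouped by
# predominant style (invented numbers, NOT the study's data).
static_group = [6, 7, 5, 8]       # e.g. the four static describers
dynamic_group = [1, 2, 0, 2, 1]   # e.g. the five dynamic describers

t, df = welch_t(static_group, dynamic_group)
print(round(t, 2), round(df, 1))
```

The t statistic is then compared against the t distribution with the computed degrees of freedom to obtain a p-value, as in the chapter's reported tests.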
When it comes to the frequency and distribution of these two styles, the
dynamic description style predominated in five informants (P1, P4, P5, P7
and P11), whereas the static description style was dominant in four
informants (P3, P6, P8 and P9). The descriptions of the three remaining
informants (P2, P10, P12) showed a mixture of a dynamic and a static style.
They usually began in the dynamic style but, after a while, they turned to lists
and enumerations, focusing on spatial aspects, interpretative comments and
comparisons. Thus, the style at the end of their descriptions was closer to the
static description style.
In order to substantiate the characteristics of the two description styles,
I have quantified some of the most important linguistic aspects mentioned
above. Table 2 shows the distribution of the linguistic variables in the twelve
descriptions from memory. The informants are grouped according to the pre-
dominating description style.
The overview in Table 2 encourages a factor analysis, a way to summarise
data and explain variation in data (see the two screen dumps in Figure 1). It
might be interesting to find out which variables are grouped together. Since my
data were not very extensive, the factor analysis has only an exploratory
character. Let me only briefly mention the results. Data from eleven informants
were used as a basis for the factor analysis. One informant was excluded from
the analysis because of her age. Ten variables have been analysed: the length of
the whole description, the length of the free description, number of foci,
number of temporal expressions, number of spatial expressions, refocusing
with discourse markers.
What are the possible explanations for these two styles found in the picture
descriptions? One observation concerns gender differences. If we compare the
results in Table 2 with the informants' characteristics, we can see a preliminary
tendency for women to have a more dynamic and for men to have a more static
description style. However, more studies are needed to confirm this pattern.
Another possibility would be the difference between visual thinkers and verbal
thinkers (Holsanova 1997b), which will be discussed in the next section.
Yet another possibility is that these description styles are picture-specific,
in contrast to descriptions based on other sources of perception. It could be
that some of the results are due to the fact that the informants either verbalise
the picture as a representation or focus on the content of the represented scene.
The question whether these results are also valid for scene description and
event description in general has to be tested empirically. Another possibility is
that this particular picture, with its repetitive figures, may have affected the
way it was described.
Furthermore, an additional source of explanation is the non-linguistic and
contextual aspects that might to some extent have influenced what people
focus on during picture viewing, what they remember afterwards and what they
describe verbally (and how). For instance, previous knowledge of the picture,
of the genre, of the book, of the characters or of the story may be the most criti-
cal factors for the distinction of the two description styles, since not knowing
the genre and/or characters may lead to a less dynamic description. One could
expect that if the informants have read the book (to themselves or to their
children), the picture will remind them of the story and the style will become
dynamic and narrative. If the informants know the Pettson and Findus
characters from Sven Nordqvist's books, films, TV programmes, computer
games, calendars etc., they can easily identify the activities that these characters are
usually involved in, and the description will become dynamic. Finally, if the
informants know the particular story from the children's book, they can switch
to a story-telling mode and deliver a dynamic description with narrative
elements. On the other hand, some informants may follow the instruction
more strictly and, on the basis of their discourse genre knowledge, formulate a
rather static picture description.
Here is some evidence from the data. In fact, ten informants mentioned the
characters Pettson and Findus by name (P1, P2, P4, P5, P7, P8, P9, P10, P11,
P12). Two of the informants even seem to know the story, since they included
details about the meatballs that Findus has planted. Three informants
recognised and explicitly mentioned the source where the picture comes from
or the illustrated children's book genre (P3, P9, P10). The only informant who
does not mention either the genre or the characters is P6. If this is because he
does not know them, this could have influenced his way of describing the scene.
Table 3 summarises various cognitive, experiential and contextual factors
that might have played a role.
So if the complex picture primed the knowledge of the story book, and
if this knowledge of the story book was the critical factor that influenced the
Table 3. Overview of cognitive, experiential and contextual factors that might
have influenced the descriptions

Cognitive, experiential and contextual factors / Example
Scene schema knowledge / semantic characteristics of the scene: rural
landscape, gardening in the spring
Pictorial genre knowledge / children's book illustration
Knowledge of the characters / old guy Pettson and his cat Findus
Particular knowledge of the book / -
Particular knowledge of the story / -
Particular knowledge of the picture / -
Discourse genre knowledge / spoken picture description
Informant's way of remembering things / -
Informant's background and interests / fauna and flora
Informant's expertise / on painting techniques, farming, gardening etc.
Informant's associations / activities that Pettson and Findus usually are
involved in; spring, harmony
Informant's linguistic and cultural background / language-specific ways of
classifying and structuring scenes and events
The interactional setting / description to a specific listener
dynamic (narrative) style of the picture description, then we can assume the
following:
- a narrative priming will increase the dynamic elements in the picture
descriptions;
- a spatial priming will, on the contrary, increase the static elements in the
picture description.
We will test these hypotheses and describe the effects of spatial and narrative
priming later on in this chapter. Before that, let us turn to the second section,
which will be devoted to the discussion of individual differences in general and
verbal and visual thinkers in particular.
Quite often, the distinction between verbal and visual thinkers is made in psy-
chology, pedagogy, linguistics and cognitive sciences. The question is whether
we can draw parallels between the dynamic and the static picture description
style on the one hand and the verbal and visual thinkers on the other. In the
following, I will discuss the two extracted styles from different theoretical and
empirical perspectives: from studies on individual differences, experiments on
remembering and theories about information retrieval and storage.
The results concerning the two description styles are supported by Grow's
(1996) study on text-writing problems. Grow has analysed the written essays
of students and divided them into verbal and visual thinkers on the basis of
how they express themselves in written language. He points out some general
problems that visual thinkers have when expressing themselves in written lan-
guage. According to Grow, visual thinkers have trouble organising expository
prose because their preferred way of thinking is fundamentally different from
that of verbal thinkers. Visual thinkers do not focus on the words but rather
think in pictures and in non-verbal dimensions such as lines, colours, texture,
balance and proportion. They therefore have trouble expressing themselves in
writing, i.e. breaking down ideas that turn up simultaneously into a linear or-
der of smaller units, as required by language. They also have trouble presenting
clear connections between these units. The fact that visual thinkers let several
elements pop up at the same time, without marking the relation between them,
means that it is up to the listener to draw conclusions, interpret and connect
these elements. In contrast, verbal thinkers analyse, compare, relate and
evaluate things all the time. Visual thinkers often list things without taking a
position on the issues, and do not order them or present them as events. The
description becomes a static one. Furthermore, the ability of visual thinkers to
dramatise and build up a climax is weak. They do not build up dynamics and
do not frame the description in a context. Verbal thinkers linearise and
dramatise more easily.
The features of the static picture description style noted in my study closely
resemble the general features that, according to Grow, visual thinkers exhibit
when producing written texts. On the one hand, we have a dynamic,
rhythmic and therefore very lively style, where relations between ideas are
explicitly signalled using discourse markers, close to Grow's verbal thinkers. On
the other hand, the static character of the picture description style, with its
perceptual dominance of spatial relations, where the picture is divided into
fields and many details are mentioned but no explicit connection is established
between them, resembles the visual thinker. Despite the difference in medium
(written vs. spoken language), it is therefore easy to draw parallels between the
dynamic and the static picture description style on the one hand and the verbal
and visual thinkers on the other.
Concerning information retrieval, Paivio (1971a, b, 1986) suggests that hu-
mans use two distinct codes in order to store and retrieve information. In his
dual code theory, Paivio (1986:53f.) assumes that cognition is served by two
modality-specific symbolic systems that are structurally and functionally dis-
tinct: the imagery system (specialised for representation and processing of non-
verbal information) and the verbal system (specialised for language). Currently,
the cognitive styles of the visualisers and the verbalisers have been characterised
as individual preferences for attending to and processing visual versus verbal
information (Jonassen & Grabowski 1993:191; cited in Kozhevnikov et al.
2002). While visualisers rely primarily on imagery processes, verbalisers prefer
to process information by verbal-logical means. According to the current re-
search (Baddeley 1992; Baddeley & Lieberman 1980), working memory consists
of a central executive (controlling attention) and two specialised subsystems: a
phonological loop (responsible for processing verbal information) and a visuo-
spatial sketchpad (responsible for processing visuospatial information).
Since the stimulus picture in the study has been described off-line, let me
also mention classic works on memory. Bartlett (1932:110f.) distinguishes be-
tween visualiser and vocaliser when reporting the results of his experiments on
remembering. According to his observations, visualisers primarily memorise
individual objects, group objects based on likeness of form and, sometimes,
even use secondary associations to describe the remembered objects. Vocalis-
ers, on the other hand, prefer verbal-analytic strategies: their descriptions are
influenced by naming, they use economic classifications for groups of objects
and rely much more on analogies and secondary associations (it reminds me
of so and so). They also frequently describe relations between objects. When
speaking about verbal-analytic strategies, Bartlett mentions the possibility of
distinguishing between verbalisers and vocalisers, but does not give any ex-
amples from his data.
Nevertheless, there is still the possibility of a mixed type of description, us-
ing both the verbal and the visuospatial type of code. Krutetskii (1976; cited in
Kozhevnikov et al. 2002:50) studied strategies in mathematical problem solv-
ing and distinguished between three types of individual strategies on the basis
of performance: the analytic type (using the verbal-logical modes), the geo-
metric type (using imagery) and the harmonic type (using both codes).
Individual differences are in focus also in Kozhevnikov et al. (2002), who
revise the visualiser-verbaliser dimension and suggest a more fine-grained
distinction between spatial and iconic visualisers. In a problem-solving task,
they collected evidence for these two types of visualisers. While the spatial
visualisers in a schematic interpretation focus on the location of objects
and on spatial relations between objects, the iconic visualisers in a pictorial
interpretation focus on high vividness and visual details like shape, size,
colour and brightness. This finding is consistent with neurophysiological
evidence for two functionally and anatomically independent pathways: one
concerned with object vision and the other with spatial vision (Ungerleider &
Mishkin 1982; Mishkin et al. 1983). Moreover, the recent research on mental
imagery suggests that imagery ability consists of distinct visual and spatial
components (Kosslyn 1995).
If we want to apply the above-mentioned distinction between spatial and
iconic visualisers to my data, we could conclude the following: apart from
associating the static description style with the visualisers and the dynamic
description style with the verbalisers, one could possibly find evidence for even
more fine-grained preferences: verbal, spatial and iconic description styles. For
instance, when we look at the distribution of different types of foci (cf. Chapter
2, Section 1) and the categorising devices on the linguistic surface, the domi-
nance of localising foci (loc) could be taken as one characteristic of the spatial
visual style, whereas the dominance of substantive (subst) and evaluative foci
(eval) describing the colour, size and shape of individual objects could serve
as an indicator of the iconic visual style. Finally, the dominance of associa-
tions (the dragonfly reminds me of an aeroplane, the three birds in the tree are
like an average Swedish Svensson-family), mental groupings and interpretations
(it starts with the spring on the left) and economic summarising classifications
(this is a typical Swedish landscape with people sowing) would be the typical
features of a verbal description style.
In this last section, we will compare the results from the interactive setting
with these two sets of data on picture descriptions in order to find out whether
the style of picture descriptions can be affected by spatial and narrative prim-
ing. The first set of data consists of 12 off-line picture descriptions with spatial
priming. The second data set consists of 15 on-line picture descriptions with
narrative priming.
Hypothesis 1 is that the spatial priming will promote a more static style
of picture description. Spatial priming will be reflected in a large propor-
tion of existential constructions, such as there is, a high number of spatial
expressions and other characteristics of the static description style. Spatial
priming will also lower the number of dynamic verbs and temporal expres-
sions in the descriptions.
Hypothesis 2 is that narrative priming will promote a more dynamic style
of picture description. Narrative priming will result in a large proportion
of temporal expressions, a high number of dynamic verbs and other char-
acteristics of the dynamic description style. Narrative priming will also
lower the number of spatial expressions and existential constructions.
The first set of data consists of picture descriptions with (indirect) spatial
priming. Twelve informants (six men and six women), matched by age, viewed
the same picture for 30 seconds. Afterwards, the picture was covered by a white
board and the informants were asked to describe it off-line, from memory. Their
eye movements were measured both during picture viewing and during picture
description. All picture descriptions were self-paced, lasting on average 1 min-
ute and 55 seconds. The instruction was as follows: I will show you a picture, you
can look at it for a while and then I will ask you to describe it verbally. You are free
to describe it as you like and you can go on describing until you feel done.
The spatial priming consisted of the following step: before this description
task, all informants listened to a pre-recorded static scene description that was
systematically structured according to the scene composition and contained a
large proportion of there is-constructions and numerous spatial descriptions
(There is a large green spruce at the centre of the picture. There is a bird sitting
at the top of the spruce. To the left of the spruce, and at the far left in the picture,
there is a yellow house with a black tin roof and white corners. ...).
This description lasted for about 2 minutes and created an indirect priming for
the current task.
Would such a suggested way of describing a picture have effects on the
way the informants subsequently described the stimulus picture? In particular,
would such an indirect spatial priming influence the occurrence of the static
description style? And if so, were all informants primed to the same extent?
When looking closely at the data, we can see that most descriptions were
affected by spatial priming but that some descriptions still contained dynamic
parts. Example 11 illustrates a dynamic part of a picture description, and
Examples 12 and 13 demonstrate static picture descriptions in this setting.
Example 11
0401 Ok,
0402 I see the good old guy Pettson and Findus
där ser jag den gode gamle Pettson och Findus
0403 eh how they are digging and planting in a garden.
Eh hur de gräver och planterar i en trädgård,
0404 it seems to be a spring day
det är en vårdag tydligen
0405 the leaves and the flowers have already come out
löven blommorna har kommit ut
Example 12
0601 Ehm in the middle to the left Pettson is standing
Hm i mitten till vänster står Pettson
0602 And looking at something he has in his hand
och tittar på nånting han har i sin hand
0603 in the middle of the garden
mitt i trädgårdslandet
0604 And it is green all around
så är det grönt runtomkring
0605 Some birds on the left in the corner
några fåglar till vänster i hörnet
0606 And then in the middle
sen så i mitten
0607 Stands a tree
står ett träd
0608 with three birds and a flower
med tre fåglar och en blomma
Table 4. The overall number and proportion of linguistic parameters in off-line
interact compared to off-line + ET (spatial priming).

Linguistic indicators     off-line, interact   off-line + spatial priming   Sign.
1 there is                180 (30%)            110 (35%)                    NS
2 spatial expressions     131 (21%)            104 (33%)                    NS
3 temporal expressions     69 (11%)             15 (5%)                     *
4 dynamic verbs           185 (30%)            104 (33%)                    NS
Number of foci            610                  311                          *
Average duration          2 min 50 sec         1 min 55 sec                 *
Mean # foci               50.83                25.92                        *
Example 13
1201 ehhaa on the picture you can see a man
ehhaa på bilden ser man en man
1202 with a vest and hat and beard and glasses,
med en väst hatt skägg glasögon,
1203 Eh the same man in four different eh what should I call it situations,
Eh samma man i fyra olika eh . vad ska man säga situationer,
[Figure: two bar charts of linguistic parameters 1-4 ('there is', spatial
expressions, temporal expressions, dynamic verbs) in the two settings]
The second set of data consists of picture descriptions with narrative priming.
Fifteen informants (six men and nine women), matched by age, described the
same picture. This time, they were given a narrative priming and described it
on-line. The description was self-paced, lasting on average 2 minutes and 30
seconds. The task was to tell a story about what happens in the picture.
Would a narrative priming influence the occurrence of a dynamic de-
scription style in on-line descriptions? The on-line picture descriptions were
analysed thoroughly. The result was that many dynamic descriptions were pro-
duced and that the dynamic aspects were strengthened (cf. Table 5).
Table 5. Overall number and proportion of linguistic indicators for two styles in:
off-line interactive setting and on-line description with narrative priming.

Linguistic indicators     off-line, interact   on-line, narrative priming   Sign.
there is                  180 (30%)             28 (9%)                     *
spatial expressions       131 (21%)             38 (12.5%)                  *
temporal expressions       69 (11%)            172 (57%)                    *
dynamic verbs             185 (30%)             52 (17%)                    *
Number of foci            610                  302                          *
Average duration          2 min 50 sec         2 min 30 sec                 NS
Mean # foci               52.83                20.13                        *
The use of temporal expressions (temporal verbs and adverbs) and dynamic
verbs were two of the relevant indicators of a dynamic style. What is most
striking when comparing these two data sets is the dramatic increase of
temporal expressions and the decrease of existential constructions (there is) in
descriptions with narrative priming (cf. Figure 3): 57 percent of all foci contain
temporal verbs or adverbs, compared with only 11 percent in the interactive
setting (p = 0.0056). Only 9 percent of all foci contain a there is-construction,
compared to 30 percent in the interactive setting (p = 0.0005). As assumed, the
proportion of spatial expressions was significantly lower in the narrative
setting than in the interactive one (p = 0.038). Contrary to our expectations,
the number of dynamic verbs dropped.
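As a rough check, the difference in there is-proportions reported above (180 of 610 foci versus 28 of 302) can be recomputed with a two-proportion z-test. This is only an illustrative sketch: the chapter's own significance tests were presumably computed per informant rather than on pooled counts, so the numbers will not match the reported p-values exactly.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z statistic for the difference between two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 'there is' foci: off-line interactive (180/610) vs. narrative priming (28/302)
z = two_prop_z(180, 610, 28, 302)
print(round(z, 2))   # a |z| this large corresponds to p < 0.001 (two-sided)
```

The same function applied to the spatial-expression counts (131/610 vs. 38/302) likewise yields a clearly significant difference, in line with the pattern the tables report.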
It can be concluded that the narrative priming did influence the way the
informants described the picture. In particular, the narrative priming mostly
[Figure 3: proportions of 'there is', spatial expressions, temporal expressions
and dynamic verbs in the off-line interactive setting and the on-line setting
with narrative priming]
enhanced the temporal dynamics in the description. The effect is even stronger
than that of the indirect spatial priming. Also, the use of there is and spatial
expressions dropped in this setting. The reason for this might be that the
informants did not feel bound to the picture. Some of the informants created a
completely new story out of the picture elements, far from the actual
constellation. Another possible explanation can be found in double priming: if
the informants recognised the characters, the book or the story, they were
narratively primed both by the book and by the instruction. These two factors
might have supported each other in tandem.
Furthermore, compared to the dynamic style in the off-line condition, we
find more types of narrative elements here: the informants frequently use tem-
poral specifications, they connect the foci by means of discourse markers, and
they use projected speech (direct and indirect quotation of the figures). Some
of the speakers even include a real story with a punch line and a typical story-
ending (cf. Example 14).
Example 14
0712 so. . Pettson starts to loosen up the soil
så Petterson börjar och luckrar upp jorden
0713 and then he finds something weird,
och då hittar han något konstigt,
0714 he looks at it . properly,
han tittar . ordentligt,
0715 and . it is a seed that has probably flown here from somewhere,
och. det är ett frö som nog har flugit hit från någonstans,
0716 he has never seen such a seed,
. han har aldrig sett ett sånt frö,
0717 even if he's a good gardener,
även om han är en kn/ en duktig trädgårdsmästare,
0718 And has seen most of it,
och varit med om det mesta,
0719 so he becomes extremely curious,
så han blir jättenyfiken,
0720 he . shouts to the little cat:
han . ropar till lilla katten:
0721 come and take a waterhose here
ta nu en vattenslang här
0722 take it at once/ I have to plant this seed,
ta genast/ jag måste genast plantera den här frön,
The aim of this section was to compare the results concerning static and
dynamic styles with other data. Spatial and narrative priming had effects on the
description style. The spatial priming resulted above all in a larger number of
localisations, significantly fewer temporal expressions and significantly shorter
descriptions. The character of the descriptions shifted towards a static style.
The narrative priming gave rise to a significantly larger proportion of temporal
expressions and a significant drop in spatial expressions and existential
constructions of the type there is/there are. In particular, narrative priming
enhanced the temporal dynamics in the description. The effect was even
stronger than that of the indirect spatial priming.
Questions that remain to be answered are the following: Does the presence
of a listener influence the occurrence of these two styles? If the two styles were
not found in other types of off-line description, there would be reason to think
that they are conversationally determined styles that only occur in face-to-face
interaction. To test this, we have to find out whether these styles appear in
off-line monological picture descriptions.
4. Conclusion
In the twelve descriptions, two dominant styles were identified: the static and
the dynamic description style. Attending to spatial relations is dominant in
the static description style where the picture is decomposed into fields that are
then described systematically, with a variety of terms for spatial relations. In
the course of the description, informants establish an elaborate set of referen-
tial frames that are used for localisations. They give a precise number of picture
elements, stating their colour, geometric form and position. Apart from spatial
expressions, the typical features of the static description style are frequent use
of nouns, existential constructions (there is, it is, it was), auxiliary or posi-
tion verbs and passive voice. Focusing and refocusing on picture elements is
Chapter 4. Variations in picture description 77
In Chapters 1–4, the reader has been presented with the characteristics of the
spoken picture description. In the following four chapters, I will broaden the
perspective and explore the connection between spoken descriptive discourse,
visual discovery of the picture and mental imagery. The current chapter deals
with methodological questions, and the focus is on sequential and processual
aspects of picture viewing and picture description. First, spoken language and
vision will be treated as two windows to the mind and the focusing of attention
will be conceived of with the help of a spotlight metaphor. Second, a new eye
tracking study will be described and the multimodal score sheet will be intro-
duced as a tool for analysis of temporal and semantic correspondence between
verbal and visual data. Third, a multimodal sequential method suitable for a
detailed dynamic comparison of verbal and visual data will be described.
In psychology and cognitive science, spoken language has been used to exter-
nalise mental processes during different tasks in the form of verbal protocols or
think-aloud protocols (Ericsson & Simon 1980), where subjects are asked to
verbally report sequences of thought during different tasks. Verbal protocols
have been extensively used to reveal steps in reasoning, decision-making and
problem-solving processes and applied as a tool for design and usability test-
ing. Linguists have also tried to access the mind through spoken descriptive
discourse (Linde & Labov 1975) and through a consciousness-based analysis
of narrative discourse (Chafe 1996).
It has been argued that eye movements reflect human thought processes, since
it is easy to determine which elements attract the observer's eye (and thought),
in which order and how often. But eye movements reveal these covert pro-
cesses only to a certain extent. We only gain information about which area a
fixation landed on, not about what level the viewer was focusing on (was it
the contents, the format, or the colour of the inspected area?). An
area may be fixated visually for different purposes: in order to identify a picture
element, to compare certain traits of an object with traits of another area, in
order to decide whether the momentary inferences about a picture element are
true or in order to check details on a higher level of abstraction. Verbal foci
and superfoci include descriptions of objects and locations, but also attitudes,
impressions, motives and interpretations. A simultaneous verbal picture de-
scription gives us further insights into the cognitive processes.
There are several possible ways to use the combination of verbal and visual
data. In recent studies within the so-called visual world paradigm, eye move-
ment tracking has been used as a tool in psycholinguistic studies on object
recognition and naming and on reading and language comprehension (for an
overview see Meyer & Dobel 2003 and Griffin 2004). The static pictorial stimu-
li were drawings of one, two or three objects and the linguistic levels concerned
single words and referring expressions, nouns, pronouns, simple noun phrases
("the cat and the chair") and only to a very small extent utterances, for instance
"the angel is next to the camel", "the cowboy gives the hat to the clown". The
hypothesis behind this line of research is that eye movements are closely time-
locked to the speech stream and that eye movements are tightly coupled with
. The relation between language and mind is also discussed by Gärdenfors (1996).
The way to uncover the idea unit goes via spoken language in action and via
the process of visual focusing.
The human ability to focus attention on a smaller part of the visual field has
been discussed in the literature and likened to a spotlight (Posner 1980; Theeuwes
1993; Olshausen & Koch 1995) or to a zoom-lens (Findlay & Walker 1998).
"(…) the locus of directed attention in visual space is thought of as having great-
er illumination than areas to which attention is not directed or areas from which
attention has been removed" (Theeuwes 1993: 95). The current consensus is
that "the spotlight of attention turns off at one location and then on at another"
(Mozer & Sitton 1998: 369). The explanations for such a spotlight differ, though.
Traditionally, attention is viewed as a limited mental resource that constrains
cognitive processing. In an alternative view, the concept is viewed in terms of
the functional requirements of the current task (Allport 1989).
What we attend to during the visual perception and the spoken language
description can be conceived of with the help of a spotlight metaphor, which
intuitively provides a notion of limitation and focus. Actually, the spotlight
comes from the inner world, from the mind. It is the human ability to visually
and verbally focus attention on one part of the information flow at a time. Here,
this spotlight is transformed into both a verbal and a visual beam (cf. Figure 1).
The picture elements fixated are visually in the focus of a spotlight and
embedded in a context. The spotlight moves to the next area that pops up from
the periphery and will be in focus for a while. If we focus our concentration
and eye movements on a point, we mostly also divert our attention to that
point. By using a sequential visual fixation analysis, we can follow the path
Chafe has also suggested that language and vision have similar properties: both
proceed in brief spurts and have a focus and a periphery (Chafe 1980, 1994,
1996). As far as I know, this is the first eye tracking study exploring
the couplings between vision and discourse by using a complex picture. (And it
took almost 25 years before the topic was taken up again.) Comparing patterns
during the visual scanning of the picture and during the verbal description of
the picture is a very useful method that provides sustainable results. It helps us
in answering the question whether there is a unifying system integrating vision
and language, and it enriches the research about the nature of human attention,
vision, discourse and consciousness.
The type of scene is, of course, a very important factor influencing both the
form and content of verbal picture descriptions and the eye movement patterns
during picture viewing. Let me therefore stop for a while at the chosen complex
picture, characterise it in more detail and relate it to a scene classification within
scene perception research. Henderson & Hollingworth (1999) define a scene as
"a semantically coherent (and often nameable) human-scaled view of a real-
world environment comprising background elements and multiple discrete ob-
jects arranged in a spatially licensed manner" (Henderson & Ferreira 2004: 5).
The chosen illustration can be characterised by the following features:
The aim of this study has been to conduct a qualitative sequential analysis
of the temporal and semantic relations between clusters of visual and verbal
data (cf. Chapter 6, Section 1 and Chapter 7, Section 1). For each person,
eye movement data have been transformed to a visual flow on a timeline and
the objects that have been fixated in the scene have been labelled. The spoken
language descriptions have been transformed from the transcript observation
format to a verbal flow on a timeline and the borders of foci and superfoci have
been marked. As a result of these transformations, a multimodal time-coded
score sheet with different streams can be created.
A multimodal time-coded score sheet is a format suitable for synchronising
and analysing visual and verbal data (for details see Holsanova 2001:99f.). In
comparison with ELAN, developed at the Max Planck Institute in Nijmegen, and
other transcription tools with tiers, timelines and tags, I have built in the analysis
of foci and superfoci, which is unique. Figure 3 illustrates what happens visually
and verbally during a description of the three pictures of the old man Pettson,
who is involved in various activities on the right hand side of the picture.
As we can see in Figure 3, the score sheet contains two different streams:
it shows the visual behaviour (objects fixated visually during description on
line 1; thin line = short fixation duration; thick box = long fixation) and verbal
behaviour (verbal idea units on line 2), synchronised over time. Since we start
from the descriptive discourse level, units of discourse are marked and cor-
related with the visual behaviour. Simple bars mark the borders of verbal foci
(expressing the conscious focus of attention) and double bars mark the borders
of verbal superfoci (thematic clusters of foci that form more complex units of
thought). On line 3, we find the coding of superfocus types (summarising,
substantive, evaluative etc.). With the help of this new analytic format, we
can examine what is in the visual and verbal attentional spotlight at a particu-
lar moment: Configurations of verbal and visual clusters can be extracted and
contents in the focused verbal idea flow and the visual fixation clusters can be
compared. This score sheet makes it possible to analyse what is happening dur-
ing preceding, simultaneous and following fixations when a larger idea is de-
veloped and formulated. Since the score sheet also includes types of superfoci,
it is possible to track the functional distribution of the extracted verbal and
visual patterns. This topic will be pursued in detail in Chapter 6.
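To make the structure of such a score sheet concrete, the following Python sketch (all labels, times and names are invented for illustration, not taken from the study) represents the visual stream of fixation clusters and the verbal stream of foci as intervals on one shared timeline, so that the visual and verbal "spotlight" at any moment can be queried:

```python
from dataclasses import dataclass

@dataclass
class Interval:
    start: float   # seconds from description onset
    end: float
    label: str     # fixated object, or the wording of a verbal focus
    kind: str = "" # for verbal units: focus type, e.g. "sum", "subst", "list"

# Hypothetical fragment of a score sheet: a visual stream of fixation
# clusters and a verbal stream of foci, synchronised on one timeline.
visual = [Interval(0.0, 0.4, "Pettson 1"), Interval(0.4, 1.3, "Pettson 2"),
          Interval(1.3, 2.1, "Pettson 3")]
verbal = [Interval(0.5, 2.0, "there are three versions of old guy Pettson", "sum")]

def spotlight(t, stream):
    """Return the label in the (visual or verbal) spotlight at time t."""
    return next((i.label for i in stream if i.start <= t < i.end), None)

print(spotlight(1.0, visual), "|", spotlight(1.0, verbal))
# → Pettson 2 | there are three versions of old guy Pettson
```

Extracting configurations of verbal and visual clusters then amounts to comparing which intervals in the two streams overlap in time.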
The multimodal time-coded score sheets are suitable for an analysis of pro-
cessual aspects of picture viewing and picture description (and perhaps even for
the dynamics of perception and action in general). So for example, instead of
analysing the result of the picture discovery of three versions of Pettson in the
form of a fixation pattern (Figure 4, Section 2.3.1) and the result of the verbal
picture description in form of a transcript (Example 1, Section 2.3.2), we are able
to visualise and analyse the process of picture viewing and picture description on
a time-coded score sheet (as seen in Figure 3). Let me explain this in detail.
The fixation pattern in Figure 4 shows the objects and areas that have been fixated by the viewer. This is, however, a static
pattern since it does not exactly visualise when and in what order they were fix-
ated. The circles in Figure 4 indicate the position and duration of the fixations,
the diameter of each fixation being proportional to its duration. The lines con-
necting fixations represent saccades. The white circle in the lower right corner
is a reference point: it represents the diameter of a one-second fixation.
Let me at this point briefly mention some basic information about eye move-
ments. Eye gaze fulfils many important functions in communication: Apart
from being an important source of information during social interaction, gaze
is related to verbal and non-verbal action (Griffin 2004). The reason why people
move their eyes is "to bring a particular portion of the visible field of view into
high resolution so that we may see in fine detail whatever is at the central direc-
tion of gaze" (Duchowski 2003: 3). Eye movements during picture viewing
consist of two temporal phases: fixations and saccades. Fixations are stops, or
periods of time when the point of regard is relatively still. The average fixation
duration varies according to the activity we are involved in. Table 1 (based on
Rayner 1992; Solso 1994;
Henderson & Hollingworth 1999 and my own data) shows some examples of
average fixation duration during different viewing activities.
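The raw eye tracking signal is only a stream of gaze positions; fixations have to be recovered from it. One common approach, sketched below in Python, is dispersion-based clustering: consecutive samples count as one fixation as long as they stay within a small spatial window. This is a generic sketch, not necessarily the algorithm used by the eye tracker in this study, and the thresholds are illustrative:

```python
def detect_fixations(samples, max_dispersion=30, min_duration=3):
    """Group consecutive gaze samples into fixations (dispersion-threshold sketch).

    samples: list of (x, y) gaze positions, one per time step.
    max_dispersion: max (x-range + y-range) in pixels within one fixation.
    min_duration: minimum number of samples for a fixation.
    Returns (start_index, end_index_exclusive) pairs.
    """
    fixations, start = [], 0
    while start < len(samples):
        end = start + 1
        # Grow the window while the samples stay spatially close together.
        while end < len(samples):
            xs = [p[0] for p in samples[start:end + 1]]
            ys = [p[1] for p in samples[start:end + 1]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            end += 1
        if end - start >= min_duration:
            fixations.append((start, end))
        start = end
    return fixations
```

Everything between two detected fixations is then treated as a saccade.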
The jumps between stopping points are called saccades. During saccades,
the eyes move at a relatively rapid rate to reorient the point of vision from one
spatial position to another. Saccades are very short, usually lasting from 20 to 50
ms. It is during fixations that we acquire useful information, whereas our vi-
sion is suppressed and we are essentially blind during saccades (Henderson &
Hollingworth 1998, 1999; Hoffman 1998). Usually, three types of visual acuity
are distinguished: foveal vision which encompasses a visual angle of only about
1–2 degrees, parafoveal vision which encompasses a visual angle of up to 10
Chapter 5. Multimodal sequential method and analytic tool 91
Example 1. Transcript sample Three versions of Pettson on the right (uttered dur-
ing picture discovery illustrated in Figure 4).
0123 (1s) eeh to the right 0:50 sum + loc
0124 there are three versions of (1s) old guy Pettson 0:54
0125 who is working the land, 0:56
0126 he is digging in the soil 0:59 subst. list
0127 he is (1s) eh raking 1:02
0128 and then he is sowing, 1:06
The spoken picture description has been recorded, transcribed and translated
from Swedish into English. The transcript of the spoken description is detailed
(Example 1). It includes verbal features, prosodic features (such as intonation,
rhythm, tempo, pauses, stress, voice quality, loudness), and non-verbal features
(such as laughter). It also contains hesitations, interruptions, restarts and other
features that are typical of speech and that give us additional information about
the speaker and the situational context. Each numbered line represents a new
verbal focus expressing the content of active consciousness. Several verbal foci
are clustered into superfoci (for example the summarising superfocus 0123–0125
or a list of items 0126–0128; see Chapter 1, Section 2.1 for definitions and
details).
However, both the visual and the verbal outcomes above, i.e. the fixation
plot in Figure 4 and the transcript for three versions of Pettson in Example 1,
are static. In order to follow the dynamics of picture viewing and picture de-
scription, we need to synchronise these two data outcomes in time. Let us have
a look at the same sequence in the new sequential format (Figure 6) where we
zoom in on what is happening in the visual and verbal streams. The boxes of
different shading and different length with labels on the top line represent ob-
jects that were fixated visually. On the bottom line, we find the verbal foci and
superfoci including pauses, stress etc. Both the verbal and the visual streams
are correlated on a time-line.
With the help of the multimodal sequential method, we are able to extract
several schematic configurations or patterns both within a focus and within a
superfocus as a result of a detailed analysis of the temporal and semantic rela-
tions (cf. Holsanova 2001). Some of them can be seen in Figure 6, for example
an n-to-one configuration between the visual and the verbal part in the second
focus when the viewer is introducing the three pictures. Notice that there is a
large delay (or latency) between the visual fixation of the sowing Pettson (dur-
ing the second verbal focus) and the verbal mention of this particular picture
(in the sixth focus). In fact, the sowing Pettson is not locally fixated, in parallel
Figure 6. Schematic configuration on the multimodal time-coded score sheet: Three versions of Pettson on the right, with connections between verbal and visual foci and with markings of foci and superfoci.
to the verbal description within the same focus. Its mention is based on a
previous fixation on this object, during the summarising focus "there are three
versions of (1s) old guy Pettson". The ability to keep track of this referent gives us
some hints about the capacity of the working memory. For more details about
the temporal and semantic relations, see Chapters 6 and 7.
Figure 7. Multimodal time-coded score sheet of keystroke data and eye tracking data
design and human factors are therefore yet another area where this method
could be applied when testing how nuclear power plant operators react in
a simulated scenario, measuring situation awareness of operators in airport
towers, and evaluating computer interfaces or assessing architectural and in-
terface design (Lahtinen 2005). A further possible application lies within the
educational context (test of language development, text-picture integration,
etc.). It would also be possible to add a layer for gestures and for drawing in
order to reveal practitioners' expert knowledge by analysing several verbal and
non-verbal activities: designers sketching and verbally explaining a structure
for a student, archaeologists visually scanning and verbally describing structures
on a site while simultaneously pointing and gazing at them, or radiologists
visually scanning images and verbally describing the anomalies in an exami-
nation report. All these multimodal actions involve several mental processes
including analytic phases, planning, survey phases, constructive phases, moni-
toring phases, evaluative phases, editing phases and revision phases. It would
be interesting to synchronize different streams of behaviour (verbal, visual,
gestural, other non-verbal) in order to investigate the natural segmentation of
action into functional phases or episodes, in order to get to know more about
individual strategies and patterns and about the distribution of the underlying
mental processes.
Finally, multimodal score sheets can also be applied within scene percep-
tion and reasoning. The selection of informative regions in an image or in a
scene is guided both by bottom-up and top-down processes such as internal
states, memory, tasks and expectations (Yarbus 1967). Recorded eye move-
ments of human subjects with simultaneous verbal descriptions of the scene
can reveal conceptualisations in the human scene analysis that, in turn, can be
compared to system performance (Schill 2005; Schill et al. 2001).
The advantages when using a multimodal method in applied areas are
threefold: it gives more detailed answers about cognitive processes and the
ongoing creation of meaningful units, it reveals the rationality behind the in-
formants' behaviour (how they behave and why, what expectations and associa-
tions they have) and it gives us insights about users' attitudes towards certain
layout solutions (what is good or bad, what is easy or difficult etc.). In short, the
sequential multimodal method can be successfully used for a dynamic analysis
of perception and action in general.
4. Conclusion
chapter 6
Temporal correspondence between verbal and visual data
In our everyday life, we often look ahead, pre-planning our actions. People
look ahead when they want to reach something, identify a label, open a
bottle. When playing golf, rugby, cricket, chess or football, the players usually
do not follow the track of the moving object but rather fixate the expected
future position where the object should land (Tadahiko & Tomohisa 2004;
Kiyoshi et al. 2004). Piano players and singers read the next passage in the
score and their eyes are ahead of their hands and voices (Bersus 2002;
Goolsby 1994; Pollatsek & Rayner 1990; Sloboda 1974; Young 1971). Last
but not least, copy-typers' eyes are ahead of their fingers (Butsch 1932; Inhoff
& Gordon 1997). In sum, predictions and anticipatory visual fixations are
frequent in various types of activities (Kowler 1996). The question is how
picture viewing and picture description are coordinated in time. Do we al-
ways look ahead at a picture element before describing it verbally or can the
verbal description of an object be simultaneous with visual scanning? Does it
even happen that eye movements follow speech?
By now, the reader has been presented with different types of foci and superfo-
ci, with the sequential qualitative method and the analytic tool, and with the
multimodal time-coded score sheet, all of which have been described in the previous
chapters. In the following two chapters, I will use this method to compare the
content of the visual focus of attention (specifically clusters of visual fixations)
and the content of the verbal focus of attention (specifically verbal foci and su-
perfoci). Both temporal and semantic correspondence between the verbal and
visual data will be investigated.
In this chapter, I will primarily concentrate on temporal relations. Is the
visual signal always simultaneous with the verbal one? Is the order of the ob-
jects focused on visually identical with the order of objects focused on ver-
bally? Can we find a comparable unit in visual and verbal data? It is, of course,
Let us start with the inventory of configurations extracted from the data.
The multimodal time-coded score sheet presented in the previous chapter
showed connections between verbal and visual foci in detail. Here, the vari-
ous configurations will be presented in a more simplified, schematic way. In
this simplified version of the score, there are only two streams: the visual and
the verbal behaviour. The discrete units compared (boxes on the timeline)
represent the objects that at a certain moment are in the focus of visual and
verbal attention.
Chapter 6. Temporal correspondence between verbal and visual data 101
Since we have deduced from Chafe (1980) that the verbal focus is a candidate
for a unit of comparison between verbal and visual data, we will start on the
focus level. In particular, we will compare the verbal focus (an idea about the
picture element that is conceived of as central at a certain point in time and
delimited by prosodic, acoustic and semantic features) with the temporally si-
multaneous visual focus (a cluster of visual fixations directed onto a discrete
object in the scene).
typical of certain parts of the descriptions, especially of the list of items and
substantive foci (subst). One further characteristic is that this configuration
usually does not appear alone, within one focus, but rather as a series, as a part
of a superfocus (see configuration 1.2.2).
Eye-voice latency (or eye-voice span) has been reported in a number of
psycholinguistic studies using eye tracking, but the average duration differs in
different activities. It has, for instance, been observed that the eye-voice latency
during reading aloud is 750 ms and during object naming, 900 ms (Griffin &
Bock 2000; Griffin 2004). As we have seen in picture viewing and picture de-
scription, the eye-voice latency in list foci lasts for about 2000–3000 ms. In
our mental imagery studies (Johansson et al. 2005, 2006), the eye-voice latency
was on average 2100 ms during the scene description and approximately 300
ms during the retelling of it (cf. Chapter 8, Section 2.1). The maximum value
across all subjects was 5000 ms in both phases.
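Once the onset of the first fixation cluster on an object and the onset of its verbal mention have both been time-stamped, latencies of this kind reduce to a subtraction per object. A minimal Python sketch (the object labels and onset times are invented, chosen only to fall roughly in the 2000–3000 ms range reported above):

```python
def eye_voice_latency(fixation_onsets, speech_onsets):
    """Latency (ms) between first fixation on an object and its verbal mention.

    Both arguments map object labels to onset times in ms from trial start.
    Positive values mean the eye reached the object before the voice did.
    """
    return {obj: speech_onsets[obj] - fixation_onsets[obj]
            for obj in speech_onsets if obj in fixation_onsets}

# Illustrative (invented) onsets:
fix = {"stone": 1200, "tree": 4000}
speech = {"stone": 3700, "tree": 6100}
print(eye_voice_latency(fix, speech))  # {'stone': 2500, 'tree': 2100}
```

Averaging these per-object values over a trial gives the kind of mean latency figures cited in the text.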
There are interesting parallels in other areas of non-verbal behaviour, such
as gesturing, signing or using a pen. The finding that the verbal account lags
behind the visual fixations can be compared to Kendon's (1980), Naughton's
(1996) and Kita's (1990) results, which reveal that both spontaneous gesturing
and signed language precede their spoken lexical analogues during communi-
cation. The eye-hand latency in copy-typing ranges between 200 and 900 ms
(Inhoff & Gordon 1997). Also, when users interact with a multimodal system,
pen input precedes their speech by one or two seconds (Oviatt et al. 1997).
The question that arises here is why the latency in picture viewing and pic-
ture description is so long compared to studies in reading and object naming.
What does it measure? Meyer and Dobel (2003: 268) write: "When speakers
produce sentences expressing relationships between entities, this formulation
phase may be preceded by a conceptualization or appraisal phase during which
speakers aim to understand the event, assign roles to the event participants and
possibly, select a verb." Conceptualisation and formulation activities during ut-
terance production might be one explanation, but the length of the lag indi-
cates that there are more complex cognitive processes going on at a discourse
level. Other possible explanations for the length of the lag will be discussed in
detail in section three of this chapter.
configuration in Figure 3 encompasses two visual foci and one verbal focus. It
has been formulated as a part of a superfocus consisting of two foci: "in front of
the tree there is a stone". The size of the visual fixation cluster is limited by the
size of the verbal focus (< 2 sec). If the cluster is longer than that, then the fixa-
tions exceeding the focus border belong to the next focus.
The first visual fixation on the object seems to be a preparatory one. The
refixation on the same object then occurs simultaneously with the verbal de-
scription of this object. The two white boxes in between are anticipatory fixation
clusters on other objects that are going to be mentioned in the next verbal
focus. The observer may at the time be looking ahead and pre-planning (cf.
Velichkovsky, Pomplun & Rieser 1995). Thus, it seems like a large proportion
of the categorisation and interpretation activities has already taken place dur-
ing the first phase of viewing (the first visual fixation cluster) whereas the later
refixation simultaneous with the formulation of "stone" occurs in order to
increase the saliency of "stone" during speech production. Anticipatory fixa-
tions on a forthcoming referent have also been shown in auditory sen-
tence comprehension (Tanenhaus et al. 2000). In the above configuration, we
see a typical example of anticipation in on-line discourse production: the visual
cluster on white objects interferes temporally and semantically with the current
focus (more about semantic relations will be said in the following chapter).
The triangle configuration is typical of substantive verbal foci (subst) and
localising foci (loc) and represents 17 percent of all presentational foci. As we
will see later (in configuration 1.2.3), it usually represents only a part of a su-
perfocus and is often intertwined with fixations from other foci. Already in this
configuration we can see that a 1:1 correspondence between verbal and visual
foci will not hold. The process of attentional focusing and refocusing on picture
elements seems to be more complicated than that.
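Given label sequences for the objects fixated and the objects mentioned within one focus, configurations of this kind can be approximated with a few rules. The Python sketch below is my own rough operationalisation for illustration, not the book's coding scheme; the labels, rules and the reading of "triangle" as fixation, detour, refixation are assumptions:

```python
def classify_configuration(visual_labels, verbal_labels):
    """Rough, illustrative classifier for verbal/visual configurations
    within one focus.

    visual_labels: objects fixated during the focus, in order.
    verbal_labels: objects mentioned in the focus, in order.
    """
    v, w = set(visual_labels), set(verbal_labels)
    if len(w) == 1:
        only = next(iter(w))
        if visual_labels and all(x == only for x in visual_labels):
            # One object fixated and mentioned: 1 or n fixation clusters.
            return "perfect match" if len(visual_labels) == 1 else "n-to-1"
        if visual_labels and visual_labels[0] == only and visual_labels[-1] == only:
            # Fixation, detour to other objects, refixation of the same object.
            return "triangle"
        # Mentioned object not fixated within this focus (fixated earlier).
        return "delay" if only not in v else "n-to-1"
    return "n-to-n"

print(classify_configuration(["stone", "bush", "stone"], ["stone"]))  # → triangle
```

A real analysis would of course also use the timing of the clusters (as the score sheet does), not just their labels.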
The next higher unit in the hierarchy to consider is the superfocus (a larger co-
herent chunk of speech consisting of several verbal foci). Configurations 1.2.1–
1.2.6 below exemplify more complex patterns extracted from the data within a
superfocus, i.e. the sequences of foci delimited by double bars on the score.
suggests that the observer has not analysed these areas thoroughly the first
time s/he entered this area (i.e. during the summarising overview). Instead,
s/he returns to them and analyses them in detail later on.
One of the varieties of this configuration can be seen in Figure 9. The
speaker is describing three birds doing different things in a summarising focus
and continues with a list: "one is sitting on its eggs, the other is singing and the
third female bird is beating a rug or something". The visual dwelling on the third
bird is very long, consisting of inspections of various details of the bird's ap-
pearance that lead to the categorisation "she-bird". The delay configurations are
combined with one 1-to-n configuration.
During the summarising (sum) and list foci in on-line descriptions, in-
formants were inspecting one and the same area of interest twice, but on two
different levels of specificity. This finding about two levels of viewing is fully
consistent with the initial global and the subsequent local phase hypothesised by
Buswell (1935), who identified two general patterns of perception:
One of these consists of a general survey in which the eye moves with a se-
ries of relatively short pauses over the main portions of the picture. A second
type of pattern was observed in which series of fixations, usually longer in
duration, are concentrated over small areas of the picture, evidencing detailed
examination of those sections. While many exceptions occur, it is apparent
that the survey type of perception generally characterises the early part of an
examination of a picture, whereas the more detailed study, when it occurs,
usually appears later. (Buswell 1935:142)
Pauses, hesitation, vagueness and modifications (those things, sort of), met-
acomments, corrections and verbatim repetitions are typical of this type of
verbal superfocus. In other words, there are multiple visual and verbal activi-
ties. More specifically, multiple visual fixation clusters are accompanied by lin-
guistic alternation and dysfluencies.
Apart from configurations within the superfocus, correspondence could
be found on an even higher level of discourse, namely the discourse topics.
Figure 14. Frequency distribution of the configuration types in free on-line picture
descriptions (n-to-1: 37%, delay: 30%, triangles: 17%, n-to-n: 11%, perfect match: 5%).
Two describers were guided by the composition of the picture and moved
systematically (both visually and verbally) from the left to the right, from the
foreground to the background (Holsanova 2001: 117f.). The discourse topic can
also be based on impressions, associations or evaluations in connection to the
scene.
connected to one verbal focus, in particular in summarising foci. The foci are
often intertwined, which causes a partial lack of qualitative correspondence
between the visual and verbal foci. The configurations where multiple visual
foci are connected to one verbal focus and intertwined with each other are well
integrated into a larger unit, into a superfocus. Explanations for intertwined
configurations can be found in discourse planning. If one visual fixation clus-
ter is connected to one verbal focus, there is usually a latency of about 2–3
seconds between them, in particular in lists of items. This delay configuration
is in turn a part of a larger unit, the surrounding superfocus, and the length
of the delay reflects cognitive processes on a discourse level. Looking back at
Chafes (1980) suggestions, the conclusion we can draw from this comparison
is that the verbal focus about one object does not always closely correspond to
a single visual focus, i.e. a visual fixation cluster on this object. Also, the order
in which the picture elements are focused on differs partially. The hypothesis
that there are comparable units in the visual scanning of the picture and the
simultaneous spoken language description of it can still be confirmed, but
on a higher level. It is the superfocus rather than the focus that seems to be
the suitable unit of comparison between visual and verbal data, since in most
cases the superfocus represents an entity that delimits separable clusters of
visual and verbal data.
We will now turn to the second section of this chapter, which focuses on
the functional distribution of the extracted configurations.
foci can contain evaluative and localising aspects. The patterns seem to vary
according to the different aspects included.
However, when we look at the summarising foci (sum), list of items (list)
and foci with categorisation difficulties (cat. diff.), the configurations appear
regularly across all informants and can be systematically correlated with certain types of verbal activities.
[Figure: bar chart of the frequency of configuration types (n-to-1, delay, triangles, n-to-n, perfect match) for each focus type: sum, subst, list, loc, cat. diff., eval.]
For summarising foci (sum), the n-to-1 configuration
appears to be typical. This coupling is quite plausible: The information about
objects, states and events is acquired visually by multiple fixation clusters on
several objects in the scene, both during pauses and during the verbal descrip-
tion, and is summarised verbally, usually in one verbal focus.
For list of items, the delay configuration dominates. This can be explained
by a counting-like behaviour: the informants have inspected the objects during
a summarising focus and are checking the items off the list. Since they are de-
scribing these objects in detail now, they have to check the appearance, relation
and activity of the listed objects. This categorisation, interpretation and formulation cost mental effort, which is reflected in the delay configuration.
Triangles are typical of localising foci (loc). Within one verbal focus highlighting an idea about an object's location in the scene (e.g. in front of the tree
there is a stone), the viewers visually inspect the stone-area before mentioning
stone. During this first inspection, the inspected object functions as a pars pro
toto, representing the 'in front of the tree' location (for more details about this
semantic relation, see the following chapter). After that, the informants inspect
some other objects in the scene (which will be mentioned in the following
foci), and finally devote the last visual inspection to the stone again, while simultaneously naming it. The mental activities seem to be divided into a preparatory phase, when most of the categorisation and interpretation work is done,
and a formulation phase during the refixation.
Categorisation difficulties are exclusively connected to the n-to-n cluster
type. Concerning the characteristics of the verbal and visual
stream, multiple verbal and visual foci are typical of this type of activity, where
visual fixations were either short/medium or very long (intensive scanning).
For the verbal stream, repetitive sequences, lexical or syntactic alternation
and sometimes even hyperarticulation are characteristic. When determining
a certain kind of object, activity or relation, the effort to name and categorise is associated with problem solving and cognitive load. This effort is often manifested by four or five different verbal descriptions of the same thing
within a superfocus.
In short, tendencies towards reoccurring multimodal integration patterns
were found in certain types of verbal superfoci during free picture descriptions. The results from the identified typology of multimodal clusters and their
functional distribution can give us hints about certain types of mental activities. If further research confirms this coupling, these multimodal clusters can
receive a predictive value.
In the middle is a tree with one with three birds doing different things; one is sitting
on its eggs, the other is singing and the third female bird is beating a rug or something.
Figure 17. The complex visual display and linguistic production in the current stud-
ies on descriptive discourse.
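To make the typology concrete, the decision rules implicit in the discussion above can be sketched as a toy classifier. Everything below is illustrative: the function name, the per-superfocus summary inputs and the 2-second latency cut-off are my own assumptions, not part of the study.

```python
def classify_configuration(n_visual, n_verbal, latency_s, refixated=False):
    """Toy decision rules for the configuration types discussed in the text.

    Inputs summarise one superfocus: number of visual fixation clusters,
    number of verbal foci, eye-voice latency in seconds, and whether the
    named object was refixated after intervening fixations. The 2-second
    latency cut-off and the rule order are invented for illustration.
    """
    if n_visual > 1 and n_verbal > 1:
        return "n-to-n"
    if n_visual > 1 and n_verbal == 1:
        return "triangle" if refixated else "n-to-1"
    if n_visual == 1 and n_verbal == 1:
        return "delay" if latency_s >= 2.0 else "perfect match"
    return "unclassified"

# A summarising focus: several clusters feeding one verbal focus.
print(classify_configuration(n_visual=4, n_verbal=1, latency_s=1.0))  # n-to-1
# A list item named some 2-3 seconds after the inspection.
print(classify_configuration(n_visual=1, n_verbal=1, latency_s=2.5))  # delay
```

In practice the labels would of course be assigned from the full fixation and speech records rather than from such coarse summaries.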
4. Conclusion
i. The verbal and the visual signals were not always simultaneous. Visual
scanning was also done during pauses in the verbal description and at the
beginning of the sessions, before the verbal description had started.
ii. The visual focus was often ahead of speech production (objects were visually focused on before being described). If one visual fixation was connected to one verbal focus, there was usually a delay of approximately 2–3
seconds between them, in particular in list foci. This latency was due to
conceptualisation, planning and formulation of a free picture description
on a discourse level, which affected the speech-to-gaze alignment and prolonged the eye-voice latency.
iii. Visual focus could, however, also follow speech production, so that a visual
fixation cluster on an object could appear after the describer had mentioned it. The describer was probably monitoring and checking his statement against the visual account.
iv. Areas and objects were frequently re-examined, which resulted either in
multiple visual foci or in both multiple visual and verbal foci. As we will
see in the following chapter, a refixation on one and the same object could
be associated with different ideas.
v. One visual fixation usually did not match one verbal focus and a perfect
overlap was very rare. Some of the inspected objects were not mentioned
at all in the verbal description; some of them were not labelled as a discrete
entity but instead included later, on a higher level of abstraction (there are
flying objects).
vi. Often, several visual fixations were connected to one verbal focus, in particular in summarising foci (sum).
vii. The order of objects focused visually and objects focused verbally was not
always the same, due to the fact that in the course of one verbal focus, preparatory glances were cast and visual fixation clusters landed on new
objects long before these were described verbally.
viii. Multiple visual foci were intertwined and well integrated into a larger unit,
into a superfocus. This was due to discourse coherence and was related to
cognitive processes on the discourse level.
ix. Comparable units could be found; however, not on a focus level. In most
cases, the superfocus represented the entity that delimited separable clus-
ters of visual and verbal data. I have therefore suggested that attentional
superfoci rather than foci are the suitable units of comparison between
verbal and visual data.
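The eye-voice latency in point (ii) is, operationally, a difference between two time-stamped streams. A minimal sketch, with an invented data format and invented timestamps:

```python
def eye_voice_latency(fixations, mentions):
    """Latency (s) from first fixation cluster on an object to its mention.

    `fixations` maps object -> onset of the first fixation cluster on it;
    `mentions` maps object -> onset of the word naming it. Both in seconds
    from trial start. Data format and all values are invented.
    """
    return {obj: mentions[obj] - fixations[obj]
            for obj in mentions if obj in fixations}

# Hypothetical onsets for three objects in one description.
fixations = {"stone": 4.1, "bird_1": 9.0, "bird_2": 10.2}
mentions  = {"stone": 6.8, "bird_1": 11.3, "bird_2": 13.1}

latencies = eye_voice_latency(fixations, mentions)
mean_latency = sum(latencies.values()) / len(latencies)
print(f"mean eye-voice latency: {mean_latency:.1f} s")
```

With these made-up numbers the mean lands in the 2–3 second range reported for list foci; real data would require aligning fixation clusters, not single fixations, with verbal foci.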
We can conclude that many of the observations are related to cognitive pro-
cesses on the discourse level. The main result concerns similarities between
units in our visual and verbal information processing. In connection to the
traditional units of speech discussed in the literature, the empirical evidence
suggests that the superfocus expressed in longer utterances (or similar larger
discourse units) plays a crucial role and is the basic unit of information pro-
cessing. Apart from that, as the results from my perception study (Chapter 1)
indicate, the superfocus is easier to identify and both cognitively and commu-
nicatively relevant. I was also able to demonstrate the lag between the aware-
ness of an idea and the verbalisation of that idea: when describing a list of
items in free descriptive discourse, the latency between a visual focus and
a verbal focus was longer than the latency in psycholinguistic studies on a
phrase or clause level. This insight may be valuable for linguists but may also
enrich research about the nature of human attention and consciousness. We
will return to the issue of speech production and planning in the following
chapter.
The second section of this chapter maintains that a certain type of fixation
pattern reflects a certain type of mental activity. I analysed how often each type
of pattern appears in association with each type of focus and presented the
frequency distribution of cluster types as a function of verbal activity type. It
shows that this correspondence is present in summarising foci (sum), lists of
items (list), localising foci (loc) and substantive foci with categorisation dif-
ficulties (cat. diff). Finally, the current eye tracking study has been discussed
in comparison with psycholinguistic studies. Differences have been found in
the characteristics of the visual displays used and in the linguistic description
produced.
The current chapter dealt with temporal relations between verbal and vi-
sual data. The following chapter will concern semantic relations between these
two sorts of data.
chapter 7
Semantic correspondence
between verbal and visual data
It is not self-evident that we all agree on what we see, even though we look at
the same picture. Ambiguous pictures are only one example of how our per-
ception can differ. The reason behind these individual differences is that our
way of identifying objects in a scene can be triggered by our expectations,
interests, intentions, previous knowledge, context or instructions we get. The
picture content that visitors at an art gallery extract during picture viewing,
for example, often does not coincide with the title that the artist has
formulated. Current theories in visual perception stress the cognitive basis
of art and scene perception: we 'think art' as much as we 'see art' (cf. Solso
1994). Thus, observers perceive the picture on different levels of specificity,
group the elements in a particular way and interpret both WHAT they see
and HOW the picture appears to them. All this is reflected in the process of
picture viewing and picture description.
The previous chapter dealt with temporal relations in picture viewing and pic-
ture descriptions, particularly with the configurations of verbal and visual data,
the unit of comparison and the distribution of the various configuration types
as a function of verbal activity types. In this chapter, we will use the multi-
modal method and the analytic tools described in Chapter 5 in order to com-
pare the contents of verbal and visual clusters. First, we will take a closer look
at the semantic relations between visual and verbal data from on-line picture
descriptions. Second, we will review the results concerning levels of specificity.
Third, we will focus on spatial proximity and mental groupings that appear
during picture description. Fourth, we will discuss the generalisability of our results by presenting results of a priming study conducted with twelve additional
informants. We will also discuss the consequences of the studies for viewing
aspects, the role of eye movements, fixation patterns, and language production
and language planning.
1. Semantic correspondence
In the first section, we will be looking at the content of the visual and verbal
units. In picture viewing and picture description, the observers direct their
visual fixations towards certain objects in the scene. They also focus on objects
from the scene in their spoken language descriptions. With respect to semantic
synchrony, we can ask what type of information is focused on visually when,
for instance, someone describes the tree and the three birds in it. Does she
fixate all the units that the summary is built upon? Is there a one-to-one relationship between the visual and the verbal foci? It is also interesting to consider
whether the order of the described items is identical with the order in which
they were scanned visually.
Clusters of visual fixations on an object can be caused by different perceptual and cognitive processes. The spoken language description can help us
to reveal which features observers are focusing on. Let us look again at the
relation between the visual and the verbal focus in the example presented in
Figure 1: In front of the tree which is curved is a stone.
In the visual stream, the object stone is fixated three times: twice in the
triangle configuration and once in the perfect match configuration (cf.
Chapter 6). If we compare these configurations with the verbal stream, we
discover that the relation between the visual and the verbal focus is different. While their relation is one of semantic identity in the perfect match
(stone = stone), this is not the case in the triangle configuration. Here, the
concrete object stone is viewed from another perspective; the focus is not
on the object itself, but rather on its location (stone = in front of the tree). In
the perfect match configuration, eye movements are pointing at a concrete
object. In the triangle case, on the other hand, there is an indirect semantic
relation between vision and spoken discourse. The observer's eyes are pointing at a concrete object in the scene but, as the verbal description reveals,
the observer is mentally zooming out and focusing on the position
of the object.
This latter relation can be compared with the figure-ground (trajector-landmark) relation in cognitive semantics (Lakoff 1987; Langacker 1987,
1991; Holmqvist 1993). According to cognitive semantics, our concepts of the
world are largely based on image-schemata, embodied structures that help
us to understand new experiences (cf. Chapter 8, Section 1). For instance,
in the relation in front of, the trajector (TR), the location of the object, is
considered to be the most salient element and should thus be focused by the
observer/describer. In fact, when saying in front of the tree, the informant
actually looks at the trajector (= stone) and directs his visual attention to it
(Figure 2). In other words, the saliency in the TR role of the schema co-occurs
with the visual (and mental) focus on the stone, while the stone itself is verbalised much later. It would be interesting to conduct new experiments in order
to verify this finding from spontaneous picture description for other relations
within cognitive semantics.
Nuyts (1996) is right when he notes that Chafe's concept of human cognition is much more dynamic than issues discussed within mainstream cognitive linguistics. Chafe is more process-oriented whereas mainstream cognitive
linguists are more representation-oriented, studying mental structures and is-
sues of a more strictly conceptual-semantic nature. However, as this example
shows, much could be gained by introducing the theoretical concepts from
cognitive semantics, such as landmark, trajector, container, prototype, figure-
ground, source-goal, centre-periphery, image schema etc. as explanatory de-
vices on the processual level.
[Figure: image schema with trajector (TR) and landmark (LM).]
In example (a), the cows are partly viewed as a concrete object and partly as
a location. This time, the location is not expressed with the help of another
concrete object in the scene but in terms of picture composition (in the back-
ground).
In (b), the observer is not fixating one object in the middle as a representation of the location, as we would expect. Instead, the middle is created by
comparing or contrasting the two halves of the scene. The observer delimits
the spatial location with the aid of multiple visual fixations on different objects.
He is moving his eyes back and forth between the two halves of the picture,
fixating similar objects on the left and on the right (Findus 1 and Findus 2, soil
in the left field and soil in the right field) on a horizontal line. After that, he
follows the vertical line of the tree in the middle, fixating the bottom and the
top of it. This dynamic sequence suggests that the observer is doing a cross
with his eyes, as if he were trying to delimit the exact centre of the scene. The
semantic relation between the verbal and the visual foci is implicit; the discrete
objects are conceived of as pars pro toto representations of the two halves of the
picture, suggesting the location of the middle.
Example (c)
(c) 0415 and behind the tree
0416 there go . probably cows grazing
0417 down towards a . stone fence or something
and, finally, the cows are refixated while the perceived direction or path of their
movement is mentioned (down towards a stone fence).
In addition to the object-location relation, the focus can also be on the ob-
ject-attribute relation. During evaluations, observers check the details (form,
colour, contours, size) but also match the concrete, extracted features with the
expected features of a prototypical/similar object: something that looks like a
telephone line or telephone poles . that is far too small in relation to the humans.
They compare objects both inside the picture world and outside of it. By using
metaphors and similes from other domains they compare the animal world
with the human one: one bird looks very human; there's a dragonfly like a double aeroplane.
To summarise the findings concerning the relation between the visual and
verbal foci we can state the following: A visual fixation on an object can mean
that:
When dealing with the specific semantic relationship between the content of
the verbal and the visual spotlight, we can then ask: Is it the object as such that
is visually focused on, or is it some of its attributes (form, size, colour, contours)? Is it the object as a whole that is focused on, or does the observer zoom
in and analyse the details of an object on a finer scale of resolution? Or is the
observer rather mentally zooming out to a more abstract level (weird cat, cat,
animal, strange thing, something)?
As for the verbal categorisation, the main figure, Pettson, can be described
as a person, a man, an old guy, a farmer, or he can be called by his name. His
appearance can be described (a weird guy), his clothes (wearing a hat), the ac-
tivity he is involved in can be specified (he is digging) and this activity can be
evaluated (he is digging frenetically). Thus, we can envision a description on dif-
ferent levels of specificity and with different degrees of creativity and freedom.
The tendencies that could be extracted from the data are the following: In-
formants start either with a specific categorisation which is then verbally modi-
fied (in a potato field or something; the second bird is screaming or something like
that) or, more often, with a vague categorisation, a filler, followed by a speci-
fication: Pettson has found something he is looking at he is looking at the soil.
In other words, a general characteristic of an object is successively replaced by
more specific guesses during the visual inspection: he is standing there looking
at something maybe a stone in his hand that he has dug up.
Informants also use introspection and report on mental states, as in the
following example. After one minute of picture viewing and picture descrip-
tion, the informant verbalises the following idea: when I think about it, it seems
as if there were in fact two different fields, one can interpret it as if they were in
two different fields these persons here (cf. Section 3.4).
In the following, I will show that not only scene-inherent concrete objects or
meaningful groups of objects are focused on, but that new mental groupings
are also created along the way. Let us turn to the aspect of spatial proximity.
Another type of cluster that was perceived as a meaningful unit in the scene
was hills at the horizon. The observer's eyes followed the horizontal line, filling
in links between objects. This cluster was probably a compositionally guided
cluster. The observer is zooming out, scanning picture elements on a compo-
sitional level. This type of cluster is still quite close to the suggested, designed
or scene-inherent meaningful groupings.
What we do not see and know on the basis of eye movements alone is how the
observer perceives the objects on different occasions. This can be traced thanks
to the method combining picture viewing and picture description.
In the case of two different fields, the objects that are refixated represent a
bigger (compositional) portion of the picture and support the observer's reconceptualisation. By mentally zooming out, he discovers an inferential boundary
between parts of the picture that he has not perceived before. The scene origi-
nally perceived in terms of one field has become two fields as the observer gets
more and more acquainted with the picture. This example illustrates the process
during which the observer's perception of the picture unfolds dynamically.
We are now moving further away from the scene-inherent spatial proximity
and approaching the next type of clusters constructed by the observers. This
time, the cluster is an example of an active mental grouping of concrete objects based on an extraction of similar traits and activities (cf. Figure 6, Flying insects). The prerequisite for this kind of grouping is a high level of active
processing. Despite the fact that the objects are distributed across the whole
scene and appear quite differently, they are perceived as a unit because of
the identified common denominator. The observer is mentally zooming out
and creating a unit relatively independent of the suggested meaningful units
in the scene. The eye movements mimic the describer's functional grouping of
objects.
In a number of cases, especially in later parts of the observation, the clusters are
based on thematic aspects. The next cluster lacks spatial proximity (Figure 7).
The observer is verbalising his impression about the picture content: it looks
like early summer. This abstract concept is not identical with one or several
concrete objects compositionally clustered in the scene, and visual fixations
are not guided by spatial proximity. The previous scanning of the scene has
led the observer to an indirect conclusion about the season of the year. In the
visual fixation pattern, we can see large saccades across the whole picture composition. It is obviously a cluster based on a mental coupling of concrete objects,
their parts or attributes (such as flowers, foliage, plants, leaves, colours) on a
higher level of abstraction.
Since the concept early summer is not directly identical with one or sev-
eral concrete objects in the scene, the semantic relation between verbal and
visual foci is inferred. The objects are concrete indices of a complex (abstract)
concept. In addition, the relation between the spoken description and the vi-
sual depiction is not a categorical one, associated with object identification.
Instead, the observer is in a verbal superfocus, formulating how the picture
appears to him on an abstract level. Afterwards, during visual rescanning, the
observer is searching again for concrete objects and their parts as crucial indi-
cators of this abstract scenario. By refocusing these elements, the observer is
in a way collecting evidence for his statement. In other words, he is checking
whether the object characteristics in the concrete scene match the symptoms
of the described scenario. Concrete objects can be viewed differently on differ-
ent occasions as a result of our mental zooming in and out. We have the ability
to look at a single concrete object and simultaneously zoom out and speak
about an abstract concept or about the picture as a whole. In terms of
creativity and freedom, this type of mental grouping shows a high degree of
active processing.
These clusters are comparable to Yarbus' (1967) task-dependent clusters
where
[…] in response to the instruction 'estimate the material circumstances of
the family shown in the picture', the observer paid particular attention to the
women's clothing and the furniture (the armchair, stool, tablecloth). In response to the instruction 'give the ages of the people shown in the picture',
all attention was concentrated on their faces. In response to the instruction
'surmise what the family was doing before the arrival of the unexpected visitor', the observer directed his attention particularly to the objects arranged on
the table, the girl's and woman's hands […]. After the instruction 'remember
clothes worn by the people in the picture', their clothing was examined. The
instruction 'remember position of the people and objects in the room' caused
the observer to examine the whole room and all the objects. […] Finally, the
instruction 'estimate how long the unexpected visitor had been away from
the family' caused the observer to make particularly intensive movements of
the eyes between the faces of the children and the face of the person entering
the room. In this case he was undoubtedly trying to find the answer by studying the expression on the faces and trying to determine whether the children
recognised the visitor or not. (Yarbus 1967:192–193)
Although the informants in my study did not receive any specific instructions,
their description spontaneously resulted in such kinds of functionally determined clusters. These clusters were not experimenter-elicited, as in Yarbus'
case, but naturally occurring (correlational). It was the describers themselves
who created such abstract concepts and scenarios, which in turn provoked
a distributed visual search for corresponding significant details in the scene.
Clusters of this type have temporal proximity but no spatial proximity. They
are clearly top-down guided and appear spontaneously in the later parts of the
observation/description.
These findings have consequences for viewing dimensions (Section 4.2) and
for the function of eye movements (Section 4.3). It should be noted that the spoken
language description may influence eye movements in many respects. Nevertheless, it is not unlikely that patterns based on evaluation and general impression do appear even in free visual scene perception. They may be a natural part
of interpreting scenes and validating the interpretation. I can also imagine that
evaluative patterns can be involved in preference tasks (when comparing two
or more pictures), or in a longer examination of a kitchen scene resulting in
the idea: it looks as if somebody has left it in a hurry. If we want to extract such clusters on the basis of eye movements alone, the problem is that the
spoken language description is needed in order to identify them. If we combine
both modalities, we are better able to detect these kinds of complex clusters.
4. Discussion
The question arises whether the eye movement patterns connected to abstract
concepts (such as early summer) are specific to the task of verbally describing
a picture, or whether they also appear during perception of the same speech.
Are these eye movement patterns caused by the process of having to systemati-
cally structure speech for presentation? Are they limited to situations where
the picture description is generated simultaneously with picture viewing? Do
these patterns appear for speakers only? Or can similar eye movement patterns
be elicited even for viewers who observe the picture after they have listened to
a verbal description of the scene? In order to answer these questions, we con-
ducted a priming study that will be presented in Section 4.1. If we find similar
eye movement patterns even for listeners, this would mean that these patterns
are not connected to the effort of planning a speech but rather to the semantics
of the viewed scene.
The question under investigation was whether we could prime the occurrence
of similar viewing patterns by presenting spoken utterances about picture con-
tent before picture onset.
I would like to thank Richard Andersson for his help with the study.
Figure 8a. Flying insects: Listeners' scanpaths (03, 04, 07 and 09). Can be compared
to the speaker's original scanpath in Figure 6.
Figure 8b. Telephone line: Listeners' scanpaths (03, 04, 07, and 09). Area of interest is
marked with a rectangle on the right.
Figure 8c. Early summer: Listeners' scanpaths (03, 04, 07, and 09). Can be compared
to the speaker's original scanpath in Figure 7.
the picture. This would indicate a) that we will find similar eye movement pat-
terns within the group of listeners and b) that the eye movement produced by
the listeners will be similar to the clusters produced by the original speaker.
In the analysis, I concentrate on visual clusters caused by utterances 1–6
(birds in the tree, two different fields, early summer, telephone line, three birds
in the lilies, flying insects), since I have the original visual patterns to compare
with. It is difficult to statistically measure similarity of scanpaths. It is even
with. It is difficult to statistically measure similarity of scanpaths. It is even
more difficult to compare dynamic patterns, to measure similarity between
spatial AND temporal configurations and to quantitatively capture tempo-
ral sequences of fixations. To my knowledge, no optimal measure has been
found yet.
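One classical workaround, not used in this study and offered here only as an illustration, is the string-edit approach: each fixation is coded by the area of interest it lands in, and two scanpaths are compared by Levenshtein distance over the resulting label sequences. The AOI codes below are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two AOI-label sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def scanpath_similarity(a, b):
    """Normalised similarity in [0, 1]: 1 = identical AOI sequences."""
    longest = max(len(a), len(b))
    return 1 - edit_distance(a, b) / longest if longest else 1.0

# Hypothetical AOI codings: T = tree, B = bird, S = stone, P = Pettson.
speaker  = ["T", "B", "B", "S", "T", "P"]
listener = ["T", "B", "S", "T", "P"]
print(round(scanpath_similarity(speaker, listener), 2))  # 0.83
```

As the text notes, such a measure flattens fixation durations and treats all substitutions alike, so it captures only part of the spatio-temporal similarity.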
Figure 8a–c illustrates scanpath similarity of three statistically significant
scenarios expressed in utterances 3, 4 and 6 (flying insects, telephone line and
early summer) for four randomly chosen listeners (03, 04, 07 and 09).
For the scanpaths elicited by the utterances 3, 4 and 6 (flying insects, tele-
phone line and early summer), I calculated visual fixations on a number of
areas of interest that were semantically important for the described scenario.
The concept early summer does not explicitly say which concrete objects in
the scene should be fixated, but I defined areas of interest indicating the sce-
nario: flowers, leaves, foliage, plants etc., that have also been fixated by the
original speaker. The aim was to show that the listeners' visual fixation patterns within relevant areas of interest were significantly better than chance.
Then the expected value for the time spent on the relevant area, proportional
to its size, was compared with the actual value. A t-test was conducted, telling
us whether the informants looked at the relevant areas of interest significantly
more than chance would predict.
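The chance-baseline logic just described can be sketched as follows. The dwell proportions, the AOI size and the hand-rolled one-sample t statistic are all illustrative assumptions; the study's actual computation may differ in detail.

```python
import math

def one_sample_t(observed, expected_mean):
    """One-sample t statistic: do observed values exceed expected_mean?

    Returns the t statistic and degrees of freedom; the p-value would be
    looked up in a t table (or via scipy.stats, if available)."""
    n = len(observed)
    mean = sum(observed) / n
    var = sum((x - mean) ** 2 for x in observed) / (n - 1)
    t = (mean - expected_mean) / math.sqrt(var / n)
    return t, n - 1

# Hypothetical data: proportion of viewing time each of twelve listeners
# spent inside the 'flying insects' areas of interest (values invented).
aoi_area_fraction = 0.12           # AOIs cover 12% of the picture
dwell_proportions = [0.31, 0.27, 0.40, 0.22, 0.35, 0.29,
                     0.33, 0.26, 0.38, 0.24, 0.30, 0.28]

t, df = one_sample_t(dwell_proportions, aoi_area_fraction)
print(f"t({df}) = {t:.2f}")  # a large positive t: more looking than chance
```

The chance baseline here is the AOI's share of the picture area; dwelling on the AOIs far above that share is what the significant p-values in the next paragraph reflect.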
The visual scanpaths among the twelve listeners were rather similar. Whereas the concepts two fields and birds in the lilies were not significant, in the cases
of flying insects (p = 0.002) and telephone line (p = 0.001), we got significant
results for all listeners. For early summer, the results were partially significant
(large flowers on the left, p = 0.01; flowers left of the tree, p = 0.05). We can
thus conclude that a number of these utterances elicited similar eye movement
patterns even for the group of listeners. This implies that the eye movement
patterns are not restricted to the process of planning and structuring a verbal
description of the scene but are rather connected to the scene semantics. These
results are in line with studies by Noton and Stark (1971a, 1971b, 1971c) who
showed that subjects tend to fixate regions of special interest according to cer-
tain scanpaths.
The results concerning spatial, semantic and mental groupings (this chapter,
Section 3) can be interpreted in terms of viewing dimensions. Studying
simultaneous picture viewing and picture description can help us to understand
the dynamics of the ongoing perception process and, in this way, contribute to
the area at the intersection of artistic and cognitive theories (cf. Holsanova
2006). Current theories of visual perception stress the cognitive basis of art
and scene perception: we "think" art as much as we "see" art (Solso 1994). Our
way of perceiving objects in a scene can be triggered by our expectations, our
interests, intentions, previous knowledge, context or instructions. In his book
Visual Thinking, Arnheim writes: "cognitive operations called thinking are not
the privilege of mental processes above and beyond perception but the essential
ingredients of perception itself. I am referring to such operations as active
exploration, selection, grasping of essentials, simplification, abstraction,
analysis and synthesis, completion, correction, comparison, problem solving, as
well as combining, separating, putting in context" (Arnheim 1969: 13). He draws
the conclusion that visual perception "is not a passive recording of stimulus
material but an active concern of the mind" (Arnheim 1969: 37).
My data confirm the view of active perception: it is not only the recognition
of objects that matters but also how the picture appears to the viewers. Verbal
descriptions include the quality of experience, subjective content and
descriptions of mental states. Viewers report about (i) referents, states and
events, (ii) colours, sizes and attributes, and (iii) compositional aspects, and
they (iv) mentally group the perceived objects into more abstract entities,
(v) compare picture elements, (vi) express attitudes and associations, and
(vii) report about mental states. Thanks to the verbal descriptions, several
viewing dimensions can be distinguished: a content aspect, a quality aspect, a
compositional aspect and a mental aspect.
Chapter 7. Semantic correspondence between verbal and visual data 143
The results of the comparison of verbal and visual data reported in Chapters
6 and 7 also have consequences for the current discussion about the functions
of eye movements. Griffin (2004) gives an overview of the function of
speech-related gazes in communication and language production. The
psycholinguistic eye-tracking studies reported there involve simple visual
stimuli and language production at the phrase or sentence level. However, by
studying potentially meaningful sequences or combinations of eye movements on
complex scenes and simultaneous picture description at a discourse level, we can
extend the list of eye movement functions as follows (cf. Table 3).
Table 3. Functions of eye movement patterns extracted from free on-line
descriptions of a complex picture.

Counting-like gazes. Aid in sequencing in language production when producing
quantifiers, before uttering plural nouns. Example: "three birds", "three . eh
four men digging in the field" (list foci).

Gazes in message planning. Shifting the gaze to the next object to be named;
aid in planning of the next focus; aid in interpretation of picture regions.
Operate on the level of the focus, superfocus and discourse topic (subst, sum,
eval foci).

Gazes reflecting categorising difficulties. Gazing at objects while preparing
how to categorise them within a superfocus; gazes at objects or parts of objects
until the description is retrieved, even during dysfluencies. These gazes
reflect the allocation of mental resources (cat.diff. foci).

Monitoring gazes. Speakers sometimes return their gazes to objects after
mentioning them, in a way that suggests that they are evaluating their
utterances; re-fixations (sum foci).

Comparative gazes (reparative gazes or re-categorisation gazes). Aid in
interpretations; reparation/self-editing of content, changing point of view.
Example: "two fields" (subst foci, meta, eval).

Organisational gazes. Organisation of information, structuring of discourse,
choice of discourse topics. Example: "Now I have described the left hand side"
(introspect, meta foci).

Ideational gazes. Occur when referring to objects, when using a modifier, when
describing an object's location or action (subst, loc foci).

Summarising gazes. Fixating multiple objects that share a common activity or a
common taxonomic category. Example: "flying objects" (sum foci).

Abstract, inferential gaze-production link. Reflected in a speaker's gaze on
flowers and leaves before suggesting that a scene depicts "early summer" (subst,
sum foci).
These results are consistent with Yarbus' (1967) conclusion that eye movements
occur in cycles and that observers return to the same picture elements several
times, and with Noton & Stark (1971b), who coined the term scanpath to describe
the sequential (repetitive) viewing patterns of particular regions of the image.
According to these authors, a coherent picture of the visual scene is
constructed piecemeal through the assembly of serially viewed regions of
interest. In the light of my visual and verbal data, however, it is important to
point out that a refixation on an object can mean something else than the first
fixation. A fixation on one and the same object can correspond to several
different mental contents. This finding also confirms the claim that meaning
relies on our ability to conceptualise the same object or situation in different
ways (Casad 1995: 23).
Fixation patterns can reveal more than single fixations. However, we still
need some aid, some kind of referential framework, in order to infer what ideas
and thoughts these fixations and scanpaths correspond to. As Viviani (1990) and
Ballard et al. (1996) pointed out, there is an interpretation problem: we need
to relate the overt structure of eye scanning patterns to underlying internal
cognitive states. The fixation itself does not indicate what properties of an
object in a scene have been acquired. Usually, the task is used to constrain and
interpret fixations and scanpaths on the objects in the scene. For instance,
Yarbus' (1967) instructions ("give the ages of the people shown in the picture"
etc.) resulted in different scanpaths and allowed a functional interpretation of
the eye movement patterns. This showed which pieces of information had been
considered relevant for the specific task and were therefore extracted by the
informants. However, as we have seen in the analysis of semantic relations, we
can also find similar spontaneous patterns in free picture description, without
there being a specific viewing task. In this case, the informants attempt to
formulate a coherent description of the scene, and their spontaneous verbal
description may be viewed as a source of top-down control. The question is
whether the task offers enough explanation for the visual behaviour or whether
the verbal description is the optimal source of explanation for the functional
interpretation of eye movement patterns. In certain respects, the combination of
visual scanpaths and verbal foci can reveal more about the ongoing cognitive
processes. If we focus on the discourse level and include different types of
foci and superfoci (cf. Chapter 2, Section 1), we can get more information about
the informants' motivations, impressions, attitudes and (categorisation)
problems (cf. Chapter 6, Section 3).
The question arises how the scanning and description processes develop.
Concerning the temporal aspect, we can ask: Does the thought always come before
speech? Do we plan our speech globally, beforehand? Or do we plan locally, on an
associative basis? Do we monitor and check our formulations afterwards? Several
answers can be found in the literature. Levelt (1989) assumes planning on two
levels: macroplanning (i.e. elaboration of a communicative goal) and
microplanning (decisions about the topic or focus of the utterance etc.). Bock &
Levelt (1994, 2004) maintain that speakers outline clause-sized messages before
they begin to sequentially prepare the words they will utter. Linell (1982)
distinguishes between two phases of utterance production: the construction of an
utterance plan (the decision about semantic and formal properties of the
utterance) and the execution of an utterance plan (the pronunciation of the
words). He also suggests a theory that represents a compromise between Wilhelm
Wundt's model of complete explicit thought and Hermann Paul's associative model.
According to Wilhelm Wundt's holistic view (Linell 1982; Blumenthal 1970),
the speaker starts with a global idea (Gesamtvorstellung) that is later analysed
part-by-part and sequentially organised into an utterance. Applied to the
process of visual scanning and verbal description, the observer would have a
global idea of the picture as a whole, as well as of the speech genre 'picture
description'. The observer would then decide what contents she would express
verbally and in what order: whether she would describe the central part first,
the left and the right part of the picture later on, and the foreground and the
background last, or whether she would start from the left and continue to the
right. If such a procedure is followed systematically, the whole visual and
verbal focusing process would be guided by a top-down principle (e.g. by picture
composition). This idea would then be linearised, verbalised and specified in a
stepwise fashion. Evidence against such a holistic approach comes from
hesitations, pauses and repetitions, which reveal that the utterance has not
been planned as a whole beforehand. Evidence supporting this pre-structured way
of description, on the other hand, is reflected in the combination of
summarising foci (sum) followed by a substantive list of items (list).
According to Hermann Paul's associative view, utterance production is a more
synthetic process in which concepts (expressed in words and phrases) are
successively strung together by association processes (Linell 1982: 1). Applied
to our case, the whole procedure of picture viewing and picture description
would develop step by step, each new focus being triggered associatively by the
preceding one.
In the light of my data from free descriptive discourse, not every focus and
superfocus is planned in the same way and to the same extent. We find
(a) evidence for conceptualisation and advanced planning on a discourse level,
in particular in sequences where the latency between the visual examination and
the speech production is very long. We also find evidence for (b) more
associative processes, in particular in sequences where the speakers start
executing their description before they have counted all the instances or
planned the details of the whole utterance production ('I see three eh . four
Pettsons doing different things', 'there are one three birds doing different
things'). Finally, we find evidence for (c) monitoring activities, where the
speakers afterwards check the expressed concrete or abstract concept against the
visual encounter. I therefore agree with Linell's view that the communicative
intentions may be partly imprecise or vague from the start and become gradually
more structured, enriched, precise and conscious through the verbalisation
process.
5. Conclusion
The aim of the first section has been to look more closely at the semantic
relations between verbal and visual clusters. My point of departure has been
complex ideas expressed as verbal foci and verbal superfoci in free simultaneous
spoken language descriptions, and processed eye movement data from viewing a
complex picture, both aligned in time and displayed on the multimodal score
sheets. Using a sequential method, I have compared the contents of the verbal
and visual spotlights and thereby also shed light on the underlying cognitive
processes. Three aspects have been analysed in detail: the semantic
correspondence, the level of specificity and the spatial proximity in connection
with the creation of new mental units.
The semantic relations between the objects focused on visually and described
verbally were often implicit or inferred. They varied between object-object,
object-location, object-path, object-activity and object-attribute relations.
Informants were not only judging the objects' size, form, location,
prototypicality, similarity and function, but also formulating their impressions
and associations. It has been suggested that the semantic correspondence between
verbal and visual foci is comparable on the level of larger units and sequences,
such as the superfocus.
The combination of visual and verbal data showed that objects were focused
on and conceptualised on different levels of specificity. The dynamics of the
observers' on-line considerations could be followed, ranging from vague
categorisations of picture elements, over comments on one's own expertise,
mentions of relevant extracted features and formulations of more specific
guesses about an object category, to evaluations. Objects' locations and
attributes were part of these on-line considerations as well.
I have pointed out the relevance of the combination of visual and verbal data
for the delimitation of viewing aspects, for the role of eye movements and
fixation patterns, and for the area of language production and planning.
This chapter has been concerned with the semantic correspondence between
verbal and visual data. In the following chapter, I will present studies on
mental imagery associated with picture viewing and picture description.
chapter 8
Picture viewing, picture description and mental imagery
That we can conceptualise the same objects and scenes in different ways has
been demonstrated in Chapter 7. Semantic relations between visual and verbal
foci in descriptive discourse ('in front of the tree is a stone') were explained
by introducing some of the theoretical concepts from cognitive semantics, such
as landmark, trajector, container, prototype, figure-ground, source-goal,
centre-periphery, image schema etc. (Chapter 7). The question is whether
speakers and listeners think in images during discourse production and discourse
understanding.
('squeeze more out of them'). When speakers use these formulations in
communication, they evoke mental images in the hearers, and these images serve
as an important resource for mutual understanding (cf. Table 1).
Holmqvist (1993) makes use of image schemas when he describes discourse
understanding in terms of evolving mental images. Speakers' descriptive
discourse contains concepts that appear as schematic representations and
establish patterns of understanding. When speakers want to describe a complex
visual idea, e.g. about a scene, a navigation route or an apartment layout (cf.
Labov & Linde 1975; Taylor & Tversky 1992), they have, depending on their goals,
to organise the information so that their partner can understand it (Levelt
1981). By uttering ideas, speakers evoke images in the minds of the listeners,
the consciousness of the speaker and the listeners gets synchronised, and the
listeners co-construct the meanings.
In face-to-face spontaneous conversation, this process has a more dynamic
and cooperative character (Clark 1996). The partners try to achieve joint
attention, formulate complementary contributions and interactively adjust their
visualisations. Quite often, the partners draw simultaneously with their verbal
descriptions. Sketches and drawings are external spatial-topological
representations reflecting the conceptualisation of reality and serving as an
aid for our memory (Tversky 1999). The partners must create a balance between
what is being said and what is being drawn. The utterances and non-verbal
actions (such as drawing and gesturing) can be conceived of as instructions to
the listeners about how to change the meaning, how something is perceived, how
one thinks or feels, or what one wants to do with something that is currently in
the conscious focus (Linell 2005). Drawing plays an important role in
descriptive discourse (Chapter 3, Section 2), as the following example
illustrates:
Example 1
here is the whole spectrum,
here is
much money
and very good quality,
Mhmh
they do a good job,
but they know it costs a little more
to do a good job,
Mhm
... then we have down here we have
the fellows who come from Italy
and all those countries,
they spend the money quickly
and they don't care,
... [mhm]
so we have more or less
Scandinavians and Scots up here
(...)
now we can build roads and all this
stuff.
(...)
.. then they trade down here,
.... mhm
.... and of course when . these have
been given work enough times
We claimed that the consciousness of the speaker and of the listeners is
synchronised, that they create a joint attention focus and co-construct meaning.
This is achieved by lexical markers of verbal foci, by drawing and by pointing.
Let us first look at how the movement of a conscious focus of attention is
reflected in the verbal description. The speaker marks the topic shifts and
transitions between verbal foci in the unfolding description with the help of
discourse markers (cf. Chapter 3, Section 1.1.1; Holmqvist & Holsanova 1997).
For instance, 'then' and 'and then' mark a progression within the same
superfocus, whereas 'and now' signals moving to a new superfocus. 'So' marks
moving back to a place already described, preparing the listener for a general
summary, and the expressions 'and now it's like this', 'and then they do like
this' serve as a link to the following context, preparing the listener for a
more complex explication to follow.
The describer guides the listener's attention by lexical markers like 'then
down here', meaning: now we are going to move the focus (regulation), we are
moving it to this particular place (direction/location, deixis), and we are
going to stay in this neighbourhood for some time (planning/prediction).
Chapter 8. Picture viewing, picture description and mental imagery 157
As we have seen earlier, image schemata such as 'in front of' are similar to
the eye movement patterns during a spoken picture description (Chapter 7). This
indicates that cognitive schemata might also be important for the speakers
themselves. On the other hand, some informants report that they are verbal and
not visual thinkers, and research on visual-spatial skills has shown that there
are individual differences (Hegarty & Waller 2004, 2006; Chapter 4, Sections 1
and 2). In order to verify the assumption that we use our ability to create
pictures in our minds, we conducted a series of studies on mental imagery during
picture description. The results of these studies contribute to our
understanding of how speakers connect eye movements, visualisations and spoken
discourse to a mental image.
The reader might ask: What is mental imagery and what is it good for? Finke
(1989: 2) defines mental imagery as 'the mental invention or recreation of an
experience that in at least some respects resembles the experience of actually
perceiving an object or an event, either in conjunction with, or in the absence
of, direct sensory stimulation' (cf. also Finke & Shepard 1986).
In the first study, twelve informants (six female and six male students at
Lund University) listened to a pre-recorded spoken scene description and later
retold it from memory. The goal of this study was to extend the previous
findings (Demarais & Cohen 1998; Spivey & Geng 2001; Spivey, Tyler, Richardson &
Young 2000) in two respects. First, instead of only studying simple directions,
we focused on the complexity of the spatial relations (expressions like 'at the
centre', 'at the top', 'between', 'above', 'in front of', 'to the far right',
'on top of', 'below', 'to the left of'). Second, apart from measuring eye
movements during the listening phase, we added a retelling phase where the
subjects were asked to freely retell the described scene from memory. Eye
movements were measured during both phases. To our knowledge, these aspects had
not been studied before. In addition, we collected ratings of the vividness of
imagery during both the listening and the retelling phase and asked the subjects
whether they usually imagine things in pictures or in words.
The pre-recorded description was the following (here translated into
English):

Imagine a two-dimensional picture. At the centre of the picture, there is a
large green spruce. At the top of the spruce a bird is sitting. To the left of
the spruce, and to the far left in the picture, there is a yellow house with a
black tin roof and white corners. The house has a chimney on which a bird is
sitting. To the right of the large spruce, and to the far right in the picture,
there is a tree, which is as high as the spruce. The leaves of the tree are
coloured in yellow and red. Above the tree, at the top of the picture, a bird is
flying. Between the spruce and the tree, there is a man in blue overalls, who is
raking leaves. In front of the spruce, the house, the tree and the man, i.e.
below them in the picture, there is a long red fence, which runs from the
picture's left side to the picture's right side. At the left side of the
picture, a bike is leaning against the fence, and just to the right of the bike
there is a yellow mailbox. On top of the mailbox a cat is sleeping. In front of
the fence, i.e. below the fence in the picture, there is a road, which leads
from the picture's left side to the picture's right side. On the road, to the
right of the mailbox and the bike, a black-haired girl is bouncing a ball. To
the right of the girl, a boy wearing a red cap is sitting and watching her. To
the far right on the road, a lady wearing a big red hat is walking with books
under her arm. To the left of her, on the road, a bird is eating a worm.
. The initial Swedish verb phrase was 'Föreställ dig' ('imagine'), which is
neutral with respect to the modality (image or word) of thinking.
Figure 4. iView analysis of the first 67 seconds for one subject. (A) 0-19 sec:
the spruce and the bird at its top. (B) 19-32 sec: the house to the left of the
spruce, with a bird on top of the chimney. (C) 32-52 sec: the tree to the right
of the house and the spruce. (D) 52-67 sec: the man between the spruce and the
tree, and the fence in front of them, running from left to right.
Spatial schematics for the objects in the pre-recorded description can be seen
in Figure 3. The experiment consisted of two main phases, one listening phase
in which the subjects listened to the verbal description, and one retelling phase
in which the participants retold the description they had listened to in their
own words. Eye movements were recorded both while subjects listened to the
spoken description and while they retold it.
1. When an eye movement is moving from one object to another during the
description or the retelling, it must move in the correct direction.
2. In the listening phase, the eye movement from one position to another must
appear within 5 seconds after the object is mentioned in the description.
3. In the retelling phase, the eye movement from one position to another must
appear within 5 seconds before or after the subject mentions the object.
The key difference between global and local correspondence is that global
correspondence requires fixations to take place at the categorically correct
spatial position relative to the whole eye-tracking pattern, whereas local
correspondence only requires that the eyes move in the correct direction between
two consecutive objects in the description. Examples and schematics of this can
be seen in Figures 5 and 6. 'No correspondence' was coded if neither the
criteria for local correspondence nor those for global correspondence were
fulfilled (typically, when the eyes did not move or moved in the wrong
direction).
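The three-way coding can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' actual coding software: positions are (x, y) points, "correct direction" is taken as an angle below 90 degrees between the expected and observed movement vectors, and the coarse grid standing in for "categorically correct spatial position" is an invented simplification.

```python
import math

def direction(a, b):
    """Unit vector pointing from position a to position b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy) or 1.0
    return (dx / norm, dy / norm)

def classify(prev_fix, curr_fix, prev_obj, curr_obj, grid_cell_of=None):
    """Return 'global', 'local' or 'none' for one object-to-object move."""
    if prev_fix == curr_fix:                 # eyes did not move at all
        return "none"
    want = direction(prev_obj, curr_obj)     # direction in the described scene
    got = direction(prev_fix, curr_fix)      # direction the eyes actually moved
    # Correct direction: angle between the vectors below 90 deg (dot product > 0).
    if want[0] * got[0] + want[1] * got[1] <= 0:
        return "none"
    # Global correspondence additionally demands that the fixation lands in the
    # categorically correct region (here: the same cell of a coarse grid).
    if grid_cell_of and grid_cell_of(curr_fix) == grid_cell_of(curr_obj):
        return "global"
    return "local"

# Example: the next object lies to the right; the eyes also move rightward but
# land in the wrong grid cell, so only local correspondence is credited.
cell = lambda p: (p[0] // 100, p[1] // 100)
print(classify((50, 50), (120, 60), (0, 0), (300, 0), cell))  # prints "local"
```

The design mirrors the asymmetry in the text: the local criterion only inspects the movement vector, while the global criterion also inspects where the movement ends relative to the overall pattern.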
For a few subjects, some eye movements were re-centred and shrunk into a
smaller area (thus yielding more local correspondence). However, the majority of
eye movements kept the same proportions during the listening phase and the
retelling phase. A comparison of one and the same person's eye movement patterns
during the listening and retelling phases can be seen in Figure 7.
Figure 5. (A) Example of mostly global correspondences. (B) Example of mostly
local correspondences. (C) Example of no correspondences at all.
Figure 7. Comparison of one person's eye movement patterns during the listening
(A) and retelling (B) phases.
Results for correct eye movements were significant during both the listening
and the retelling phase, in both the local and the global correspondence coding.
When listening to the pre-recorded scene description (and looking at a white
board), 54.8 percent of the eye movements were correct in the global
correspondence coding and 64.3 percent in the local correspondence coding. In
the retelling phase, more than half of all objects mentioned had correct eye
movements according to the conservative global correspondence criteria (55.2
percent; p = 0.004). Resizing effects, i.e. informants shrinking, enlarging or
stretching the image, were quite common during picture description. It was also
common that informants re-centred the image from time to time, thus yielding
local correspondence. When re-centring and resizing of the image were allowed
for, as with local correspondence, almost three quarters of all objects had
correct eye movements (74.8 percent; p = 0.0012). The subjects' spatial pattern
of eye movements was highly consistent with the original spatial arrangement.
In the second study, we asked another twelve informants (six female and six
male students from Lund University) to look at a complex picture for a while and
then describe it from memory. We chose Sven Nordqvist's (1990) picture again as
a complex visual stimulus. The study consisted of two main phases, a viewing
phase in which the informants inspected the stimulus picture and a description
phase in which the participants described this picture from memory in their own
words while looking at a white screen. Eye movements were recorded during both
phases. At the beginning of the viewing phase, each informant received the
following instructions:
You will see a picture. We want you to study the picture as thoroughly as
possible and to describe it afterwards.
The picture was shown for about 30 seconds and was then covered by a white
screen. The following description phase was self-paced: the informants usually
took 1-2 minutes to describe the picture. After the session, the informants were
asked to rate the vividness of their visualisation during the viewing and the
retelling phase on a scale ranging from 1 to 5. They were also asked to assess
whether they usually imagine things in pictures or in words.
The descriptions were transcribed in order to analyse which picture elements
were mentioned and when. The eye movements were then analysed according to
objects derived from the descriptions. For instance, when an informant
formulated the following superfocus,
01:20 And ehhh to the left in the picture
01:23 there are large daffodils,
01:26 it looks like there were also some animals there perhaps,
we would expect the informant to move her eyes towards the left part of the
white screen during the first focus. Then it would be plausible to inspect the
referent of the second focus (the daffodils). Finally, we could expect the
informant to dwell for some time within the 'daffodil area' on the white screen,
searching for the animals (three birds, in fact) that were sitting there in the
stimulus picture.
The following criteria were applied in the analysis in order to judge whether
correct eye movements occurred. Eye movements were considered correct in local
correspondence when they moved from one position to another in the correct
direction within a certain time interval. Eye movements were considered correct
in global correspondence when moving from one position to another and finishing
in a position that was spatially correct relative to the whole eye-tracking
pattern of the informant (for a detailed description of our method, cf.
Johansson et al. 2005, 2006). We tested the significance of the difference
between the number of correct eye movements and the number of correct movements
expected by chance.
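A comparison of this kind against a chance baseline can be sketched, for instance, as a binomial test. This is an illustrative assumption: the study's actual chance model is described in Johansson et al. (2005, 2006), and the counts and chance probability below are invented.

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k correct moves."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented example: 28 of 40 coded eye movements were 'correct', while a
# randomly directed movement would satisfy the criteria with probability 0.25.
n_moves, n_correct, p_chance = 40, 28, 0.25
p_value = binom_sf(n_correct, n_moves, p_chance)
print(f"p = {p_value:.2e}")  # far below .05: better than chance
```

Observing 28 correct movements where chance predicts about 10 yields a vanishingly small p-value, which is the logic behind calling such proportions significantly above chance.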
Our results were significant both in the local correspondence coding (74.8
percent correct eye movements, p = 0.0015) and in the global correspondence
coding (54.9 percent correct eye movements, p = 0.0051). The results suggest
that informants visualise the spatial configuration of the scene as a support
for their descriptions from memory. The effect we measured is strong: more than
half of all picture elements mentioned had correct eye movements according to
the conservative global correspondence criteria, and allowing for re-centring
and resizing of the image, as with local correspondence, almost three quarters
of all picture elements had correct eye movements. Our data indicate that eye
movements are driven by the mental record of the object positions and that
spatial locations are to a high degree preserved when describing a complex
picture from memory.
Despite the fact that the majority of the subjects had appropriate imagery
patterns, we found no correlation between the subjects' ratings of their own
visualisations and the degree of correct eye movements, either for the viewing
phase or for the retelling phase. The subjects' assessments of whether they
usually think in words or pictures were distributed across four possibilities:
(a) words, (b) pictures, (c) a combination of words and pictures, (d) no guess.
Again, a simple correlation analysis showed no correlation between these
assessments and the degree of correct eye movements, either for the viewing or
for the retelling phase. One possible interpretation
. The effect was equally strong for verbal elicitations (when the informants
listened to a verbal, pre-recorded scene description instead of viewing a
picture) and could also be found in complete darkness (cf. Johansson et al.
2006).
Figure 8. One and the same informant: viewing phase (A) and description phase (B).
might be that people in general are not aware of which mental modality they
are thinking in.
Overall, there was good similarity between the data from the viewing and the
description phases, as can be seen in Figure 8.
According to Kosslyn (1994), distance, location and orientation of the
mental image can be represented in the visual buffer, and it is possible to shift
attention to certain parts or aspects of it. Laeng & Teodorescu (2001) interpret
their results as a confirmation that eye movements play a functional role during
image generation. Mast and Kosslyn (2002) propose, similarly to Hebb (1968),
that eye movements are stored as spatial indexes that are used to arrange the
parts of the image correctly. Our results can be interpreted as further evidence
that eye movements play a functional role in visual mental imagery and that
eye movements indeed are stored as spatial indexes that are used to arrange the
different parts correctly when a mental image is generated.
There are, however, alternative interpretations. Researchers within the
embodied view claim that instead of relying on a mental image, we use features
in the external environment. An imagined scene can then be projected onto those
external features, and any storing of the whole scene internally would be
unnecessary. Ballard et al. (1996, 1997) suggest that informants leave behind
deictic pointers to locations of the scene in the environment, which may later
be perceptually accessed when needed. Pylyshyn (2001) has developed a somewhat
similar approach to support propositional representations and speaks about
visual indices (cf. also Spivey et al. 2004).
Another alternative account is the perceptual activity theory, suggesting that instead of storing images, we store a continually updated and refined set of procedures or schemas that specify how to direct our attention in different situations (Thomas 1999). In this view, a perceptual experience consists of an ongoing exploratory activity rather than a static internal picture.
4. Conclusion
This chapter has dealt with mental imagery and external visualisations in con-
nection with descriptive discourse. As we have seen, in a naturally occurring
conversation, external visualisations help the partners to achieve a joint focus
of attention and to coordinate and adjust their mental images during mean-
ing-making. External visual representations such as drawings are central to
learning and reasoning processes. They can be manipulated, changed and are
subject to negotiations. The partners can work with patterns and exemplars
standing for abstract concepts. Apart from the spatial domain, drawings can be used for other conceptual domains: the non-spatial domain (time, money), the abstract domain (contrast, intensity, quality), the dynamic domain (stages in a process), etc.
However, as we have seen, mental imagery and inner visualisations of
different kinds are also important for the speakers themselves. In a study on
picture viewing, picture description and mental imagery, a significant similar-
ity was found between (a) the eye movement patterns during picture viewing
and (b) those produced during picture description (when the informants were
looking at a white screen). The eye movements closely reflected the content and
the spatial relations of the original picture, suggesting that the informants cre-
ated some sort of mental image as an aid for their descriptions from memory.
Apart from that, even verbal descriptions engaged mental imagery and elicited
eye movements that reflect spatiality (Johansson et al. 2005, 2006).
Mental imagery and mental models are useful for educational methods
and learning strategies. In the area of visuo-spatial learning and problem-solv-
ing, it is recommended to use those external spatial-analogical representations
(charts, geographical layouts, diagrams, etc.) that closely correspond to the users' mental models.
Our ability to picture something mentally is also relevant for design and
human-computer interaction, since humans interact with systems and objects
based on how they believe the system works or how the objects should be used.
The issue of usability is thus tightly connected to the extent to which external
representations correspond to our visualisations of how things function.
For a user's interaction with a format containing multiple representations (texts, photos, drawings, maps, diagrams and graphics), it is important that the message is structured in a coherent way, so that the user has no difficulty conceptualising, processing and integrating information from different sources with her own visualisation and experience (Holsanova et al., forthc.). However, we
Concluding chapter
I hope that, by now, the reader can see the advantages of combining discourse
analysis with cognitively oriented research and eye movement tracking. I also
hope that I have convincingly shown how spoken descriptive discourse and eye
movement measurements can, in concert, elucidate covert mental processes.
This concluding chapter looks back on the most important issues and find-
ings in the book and mentions some implications of the multimodal approach
for other fields of research. The way speakers segment discourse and create
global and local transitions reflects a certain cognitive rhythm in discourse
production. The flow of speech reflects the flow of thoughts. In Chapter 1, I
defined the most important units of spoken descriptive discourse that reflect human attention: the verbal focus and the verbal superfocus. I showed that listeners'
intuition about discourse boundaries and discourse segmentation is facilitated
when the interplay of prosodic and acoustic criteria is further confirmed by
semantic criteria and lexical markers. Also, it is easier for listeners to identify
boundaries at the higher levels of discourse, such as the superfocus.
The way we create meaning from our experience and describe it to others
can be understood in connection with our general communicative ability: We
partly talk about WHAT we experienced by selecting certain referents, states
and events and by grouping and organising them in a certain way, but we also
express our attitudes and relate to HOW these referents, states and events ap-
peared to us. A taxonomy of foci reflecting these different categorising and inter-
preting activities that the speakers are involved in was developed in Chapter 2.
Seven different types of foci have been identified, serving three main discourse
functions. Substantive, summarising and localising foci are typically used for
presentation of picture contents, attitudinal meaning is expressed in evaluative
and expert foci, and a group of interpersonal, introspective and meta-textual
foci serves the regulatory and organising function.
This taxonomy of foci could be generalised to different settings but the dis-
tribution of foci varied. For instance, the summarising foci dominated in a set-
ting where the picture was described from memory, whereas substantive foci
dominated in simultaneous descriptions in a narrative setting. When spatial
aspects of the scene were focused on, the proportion of localising foci was
significantly higher. Furthermore, an interactive setting promoted a high pro-
portion of evaluative and expert foci in a situation where the informants were
expressing their attitudes to the picture content, making judgements about the
picture as a whole, about properties of the picture elements and about relations
between picture elements. Introspective and meta-textual foci were also more
frequent in a situation where the listener was present.
In spoken descriptions, we can only focus our attention on one particular
aspect at a time, and the information flow is divided into small units of speech.
These segments are either linear or embedded. It happens that we make a digression, a step aside from the main track of our thoughts, and spend some time on comments, but we usually succeed in coming back to the main track, finishing the previous topic and starting on a new one. Sometimes, we must mentally
reorient at transitions between segments. In the process of meaning making,
both speakers and listeners try to connect these units into a coherent whole.
In Chapter 3, I showed how speakers connect the subsequent steps in their
description and thereby create discourse coherence. Discourse markers reveal
the structuring of the speech, introduce smaller and larger steps in the descrip-
tion and mark the linear (paratactic) and the embedded (hypotactic) segments
in discourse. Also, there are different degrees of mental distance between steps
of description, reflected in discontinuities at the transition between foci and
superfoci. This phenomenon has been interpreted in terms of the internal or
external worlds the speaker moves between. The largest hesitations and the
longest pauses were found at the transitions where the speaker steps out of the
description and turns to the meta-textual and interactional aspects.
An analysis of a spontaneous description with drawing, where the speaker is trying to achieve a certain visualisation effect for the listeners, has shown that the interlocutors, despite the complexity of the hierarchical structure,
- can retain nominal and pronominal references for quite a long time,
- can simultaneously focus on a higher and a lower level of abstraction,
- can handle multiple discourse-mediated representations of visually present and mentally imagined objects,
- can attend to the same objects with another idea in mind.
The dissociation between the visual and mental representations as well as the
simultaneous handling of multiple discourse-mediated representations on dif-
ferent levels of abstraction is made possible (a) by the partners switching be-
tween active and semiactive information, (b) by joint attention and (c) by the
use of mutual visual access (e.g. by observing each other's pointing, gazing and drawing). The drawing as an external visual representation fulfils many functions in addition to the spoken discourse: it serves as a referent storage, an external memory aid for the interlocutors, a basis for the visualisation of imaginary events and scenarios, and a representation of the whole topic of the conversation. The fact that partners in a conversation can handle multiple discourse-mediated representations of visually present and mentally imagined objects and scenarios contributes to the theory of mind.
Different individuals focus on different aspects in their picture descrip-
tions. In Chapter 4, I characterised and exemplified the two dominant styles
found in the data and discussed various cognitive, experiential and contextual
factors that might have given rise to these styles. Whereas attending to spatial
relations is dominant in the static description style where the picture is de-
composed into fields that are then described systematically, with a variety of
terms for spatial relations, attending to the flow of time is the dominant pat-
tern in the dynamic description style, where the informants primarily focus
on temporal relations and dynamic events in the picture, talk about steps of
a process, successive phases, and a certain order. The quality of the dynamic
style is achieved by a frequent use of temporal verbs, temporal adverbs and
motion verbs in an active voice. Discourse markers are often used to focus and
refocus on the picture elements, and to interconnect them. Apart from that,
the informants seem to follow a narrative schema: the descriptions start with
an introduction of the main characters, their involvement in various activities
and a description of the scene. The extracted description styles are further dis-
cussed in the framework of studies on individual differences and remember-
ing. Connections are drawn to studies about visual and verbal thinkers and
spatial and iconic visualisers. I also showed that spatial and narrative priming
has effects on the description style. Spatial priming leads to a larger number of
localisations, significantly fewer temporal expressions and significantly shorter
static descriptions, whereas narrative priming mostly enhances the temporal
dynamics in the description.
The first four chapters focused on various characteristics of picture descrip-
tions in different settings and built up a basis for a broader comparison be-
tween picture descriptions, picture viewing and mental imagery presented in
Chapters 5–8.
The multimodal method and the analytical tool, the multimodal time-coded score sheet, were introduced in Chapter 5. Complex ideas formulated in
the course of descriptive discourse were synchronised with fixation patterns
during visual inspection of the complex picture. Verbal and visual data have
been used as two windows to the mind. With the help of this method, I was
able to synchronise visual and verbal behaviour over time, follow and com-
pare the content of the attentional spotlights on different discourse levels,
and extract clusters in the visual and verbal flow. The method has been used
when studying temporal and semantic correspondence between verbal and
visual data in Chapters 6 and 7. By incorporating different types of foci and
superfoci in the analysis, we can follow the eye gaze patterns during specific
mental activities.
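The kind of synchronisation described above can be illustrated with a small sketch. The following Python fragment is a deliberately minimal, hypothetical simplification of the multimodal time-coded score sheet (the data structures and function names are my own assumptions, not the book's actual tool): it pairs each verbal focus with the visual fixation clusters that overlap it in time.

```python
# Minimal sketch of multimodal time-coded alignment: each verbal focus
# (an utterance segment with start/end times) is paired with the fixation
# clusters that overlap it in time. All structures here are hypothetical
# simplifications of the score sheet described in the text.

def overlaps(a_start, a_end, b_start, b_end):
    """True if the two time intervals overlap."""
    return a_start < b_end and b_start < a_end

def align(verbal_foci, fixation_clusters):
    """Pair each verbal focus with temporally overlapping fixation clusters.

    verbal_foci: list of (start_ms, end_ms, description)
    fixation_clusters: list of (start_ms, end_ms, object_label)
    Returns a list of (description, [object_label, ...]) rows.
    """
    rows = []
    for v_start, v_end, description in verbal_foci:
        hits = [label for f_start, f_end, label in fixation_clusters
                if overlaps(v_start, v_end, f_start, f_end)]
        rows.append((description, hits))
    return rows

# Example: the speaker mentions the house while fixating it, and already
# glances ahead to the tree before mentioning it (a "preparatory glance").
verbal = [(0, 1200, "there is a house"), (1200, 2400, "and a tree")]
visual = [(100, 900, "house"), (800, 2300, "tree")]
print(align(verbal, visual))
# → [('there is a house', ['house', 'tree']), ('and a tree', ['tree'])]
```

Note how the sketch reproduces, in miniature, the phenomenon discussed in Chapters 6 and 7: temporal overlap without semantic correspondence, as when a fixation on the tree falls within the verbal focus on the house.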
Chapter 6 focused on temporal relations. From a free description of a
complex scene, I extracted configurations from verbal and visual data on a
focus and superfocus level. As a result, I found complex patterns of eye gaze and
speech.
The first question to be answered concerned temporal simultaneity be-
tween the visual and verbal signal. I found that the verbal and the visual sig-
nals were not always simultaneous. The visual focus was often ahead of speech
production. This latency was due to conceptualisation, planning and formu-
lation of a free picture description on a discourse level, which affected the
speech-to-gaze alignment and prolonged the eye-voice latency. Visual focus
could, however, also follow speech (i.e. a visual fixation cluster on an object
could appear after the describer had mentioned it). In these cases, the describ-
er was probably monitoring and checking his statement against the visual ac-
count. In some instances, there was temporal simultaneity between the verbal
and visual signals but no semantic correspondence (when informants dur-
ing a current verbal focus directed preparatory glances towards objects to
be described later on).
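The eye-voice latency discussed here can be operationalised very simply. As a sketch, under the assumption that we have, for each mentioned object, the onset of its first fixation cluster and the onset of its verbal mention (the function name and data are hypothetical):

```python
def eye_voice_latency(fixation_onsets, mention_onsets):
    """Latency (ms) between first fixation on an object and its mention.

    Positive values mean the eyes were ahead of speech (planning and
    formulation); negative values mean the mention preceded the fixation
    (e.g. monitoring or checking a statement against the picture).
    """
    return {obj: mention_onsets[obj] - fixation_onsets[obj]
            for obj in mention_onsets if obj in fixation_onsets}

# The eyes reach the "house" 700 ms before it is named; the "bird" is
# fixated only after being mentioned (checking against the picture).
fix = {"house": 300, "bird": 2500}
speech = {"house": 1000, "bird": 2100}
print(eye_voice_latency(fix, speech))  # {'house': 700, 'bird': -400}
```

The sign of the latency thus distinguishes the two cases described in the text: eyes ahead of speech versus speech checked against the visual account.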
The second question concerned the order of the objects focused on visu-
ally and verbally. The empirical results showed that the order of objects fixated
may, but need not always be the same as the order in which the objects were
introduced within the verbal focus or superfocus. For instance, some of the
inspected objects were not mentioned at all in the verbal description, some
of them were not labelled as discrete entities but were instead included later, on
a higher level of abstraction. In the course of one verbal focus, preparatory
glances were passed and visual fixation clusters landed on new objects, long
before these were described verbally. Also, areas and objects were frequently
re-examined and a re-fixation on one and the same object could be associated
with different ideas.
and, afterwards, checked it against the visual encounter. I therefore agree with
Linell's (1982) view that the communicative intentions may be partly imprecise
or vague from the start and become gradually more structured, enriched, pre-
cise and conscious through the verbalisation process.
Chapter 8 was concerned with the role of mental imagery and external
visualisations in descriptive discourse. In a naturally occurring conversation,
external visualisations help the partners to achieve a joint focus of attention
and to coordinate and adjust their mental images during meaning-making.
However, as we have seen, inner visualisations and, in particular, mental im-
ages, are also important for the speakers themselves. In a study of picture
viewing, picture description and mental imagery, a significant similarity was
found between (a) the eye movement patterns during picture viewing and (b)
those produced during picture description (when the picture was removed
and the informants were looking at a white screen). The eye movements close-
ly reflected the content and the spatial relations of the original picture, sug-
gesting that the informants created a sort of mental image as an aid for their
descriptions from memory. Eye movements were thus not dependent on a
present visual scene but on a mental record of the scene. In addition, even
verbal scene descriptions evoked mental images and elicited eye movements
that reflect spatiality.
Let me finally mention some implications for other fields of research. The
multimodal method and the integration patterns discovered can be applied for
different purposes. It is currently being implemented in a project concerning on-line written picture descriptions, where we analyse the verbal and visual flow to get an enhanced picture of the writer's attention processes (Andersson
et al. 2006). Apart from that, there are many interesting applications of integra-
tion patterns within evaluation of design, interaction with multimodal inter-
active systems and learning. The integration patterns discovered in our visual
and verbal behaviour can contribute to the development of a new generation
of multimodal interactive systems (Oviatt 1999). In addition, we would be able
to make a diagnosis about the current user activity and predictions about their
next move, their choices and decisions (Bertel 2007). In consequence, we could
use this information on-line to support users' individual problem-solving strategies and preferred ways of interacting. The advantages of using a multimodal method are threefold: it gives more detailed answers about cognitive processes and the ongoing creation of meaningful units, it reveals the rationality behind the informants' behaviour (how they behave and why, what
expectations and associations they have) and it gives us insights about users' attitudes towards different solutions (what is good or bad, what is easy or difficult
etc.). In short, the sequential multimodal method can be successfully used for
a dynamic analysis of perception and action in general.
References
Aijmer, K. (1988). Now may we have a word on this: The use of now as a discourse particle. In M. Kytö, O. Ihalainen & M. Rissanen (Eds.), Corpus Linguistics, Hard and Soft. Proceedings of the Eighth International Conference on English Language Research on Computerized Corpora, 15–33.
Aijmer, K. (2002). English Discourse Particles. Evidence from a corpus. Studies in Corpus
Linguistics. John Benjamins: Amsterdam.
Allport, A. (1989). Visual attention. In M. I. Posner (Ed.), Foundations of Cognitive Science. Cambridge, MA: MIT Press, 631–682.
Allwood, J. (1996). On Wallace Chafe's 'How consciousness shapes language'. Pragmatics & Cognition, 4(1), 1996. Special issue on language and consciousness, 55–64.
Andersson, B., Dahl, J., Holmqvist, K., Holsanova, J., Johansson, V., Karlsson, H., Strömqvist, S., Tufvesson, S., & Wengelin, Å. (2006). Combining keystroke logging with eye tracking. In L. Van Waes, M. Leijten & C. Neuwirth (Eds.), Writing and Digital Media. Elsevier (North Holland), 166–172.
Arnheim, R. (1969). Visual Thinking, Berkeley, University of California Press, CA.
Baddeley, A. & Lieberman, K. (1980). Spatial working memory. In R. Nickerson (Ed.), Attention and performance (Vol. VIII, pp. 521–539). Hillsdale, NJ: Lawrence Erlbaum
Associates, Inc.
Baddeley, A. (1992). Is working memory working? The fifteenth Bartlett lecture. The Quarterly Journal of Experimental Psychology, 44A, 1–31.
Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20, 1311–1328.
Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (1996). Deictic Codes for the Em-
bodiment of Cognition. CUP: Cambridge.
Bangerter, A. & Clark, H. H. (2003). Navigating joint projects with dialogue. Cognitive Science, 27, 195–225.
Barsalou, L. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22, 577–660.
Bartlett, F. C. (1932, reprinted 1997). Remembering. Cambridge: Cambridge University Press.
Beattie, G. (1980). Encoding units in spontaneous speech. In H. W. Dechert & M. Raupach
(Eds.), Temporal variables in speech, pp. 131–143. Mouton: The Hague.
Berlyne, D. E. (1971). Aesthetics and psychobiology. New York: Appleton-Century-Crofts.
Berman, R. A. & Slobin, D. I. (1994). Relating events in narrative. A crosslinguistic develop-
mental study. Hillsdale, New Jersey: Lawrence Erlbaum.
Berséus, P. (2002). Eye movement in prima vista singing and vocal text reading. Master's thesis in Cognitive Science, Lund University. http://www.sol.lu.se/humlab/eyetracking/Studentpapers/PerBerseus.pdf
Clark, H. H. (1992). Arenas of Language Use. The University of Chicago press: Chicago.
Clark, H. H. (1996). Using Language. Cambridge University Press: Cambridge.
Cooper, R. M. (1974). The control of eye fixations by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6, 84–107.
Crystal, D. (1975). The English tone of voice. London: St. Martin.
De Graef, P. (1992). Scene-context effects and models of real-world perception. In K. Rayner
(Ed.), Eye movements and visual cognition: Scene perception and reading, 243–259. New
York: Springer-Verlag.
Demarais, A. & Cohen, B. H. (1998). Evidence for image-scanning eye movements during
transitive inference. Biological Psychology, 49, 229–247.
Diderichsen, Philip (2001). Visual Fixations, Attentional Detection, and Syntactic Perspective. An experimental investigation of the theoretical foundations of Russell Tomlin's fish film design. Lund University Cognitive Studies 84.
Duchowski, Andrew T. (2003). Eye Tracking Methodology: Theory and Practice. Springer-
Verlag, London, UK.
Engel, D., Bertel, S. & Barkowsky, T. (2005). Spatial Principles in Control of Focus in Reasoning with Mental Representations, Images, and Diagrams. Spatial Cognition IV, 181–203.
Ericsson, K. A. & Simon, H. A. (1980). Verbal Reports as Data. Psychological Review; 87:
215–251.
Findlay, J. M., & Walker, R. (1999). A model of saccadic eye movement generation based
on parallel processing and competitive inhibition. Behavioral and Brain Sciences 22:
661–674. Cambridge University Press.
Finke, R. A. & Shepard, R. N. (1986). Visual functions of mental imagery. In K. R. Boff, L.
Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance. New
York: Wiley.
Finke, R.A. (1989). Principles of Mental Imagery. Massachusetts Institute of Technology:
Bradford books.
Firbas, J. (1992). Functional Sentence Perspective in Written and Spoken Communication.
Cambridge: Cambridge University Press.
Gärdenfors, P. (1996). Speaking about the inner environment. In S. Allén (Ed.), Of thoughts and words: The relation between language and mind. Proceedings of Nobel Symposium 92, Stockholm 1994. Imperial College Press, 143–151.
Gärdenfors, Peter (2000). Conceptual Spaces: The Geometry of Thought. MIT Press: Cambridge, MA.
Garrett, M. (1980). Levels of processing in sentence production. In B. Butterworth (Ed.), Language Production. London: Academic Press, 177–220.
Garrett, M. (1975). The Analysis of Sentence Production. In Bower, G. (Ed.) Psychology of
Learning and Motivation, Vol. 9. New York: Academic Press, 133–177.
Gedenryd, H. (1998). How designers work. Making sense of authentic cognitive activities.
Lund University Cognitive Studies 75: Lund.
Gentner, D., & Stevens, A. L. (Eds.). (1983). Mental models. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Hayhoe, M. M. (2004). Advances in relating eye movements and cognition. Infancy, 6(2),
pp. 267–274.
Hebb, D. O. (1968). Concerning imagery. Psychological Review, 75, 466–477.
Hegarty, M. (1992). Mental animation: Inferring motion from static displays of mechani-
cal systems. Journal of Experimental Psychology: Learning, Memory and Cognition, 18,
1084–1102.
Hegarty, M. (2004). Diagrams in the mind and in the world: Relations between internal and
external visualizations. In A. Blackwell, K. Marriott & A. Shimojima (Eds.), Diagrammatic Representation and Inference. Lecture Notes in Artificial Intelligence 2980 (1–13).
Berlin: Springer-Verlag.
Hegarty, M. & Waller, D. (2004). A dissociation between mental rotation and perspective-taking spatial abilities. Intelligence, 32, 175–191.
Hegarty, M. & Waller, D. (2006). Individual differences in spatial abilities. In P. Shah & A.
Miyake (Eds.). Handbook of Visuospatial Thinking. Cambridge University Press.
Henderson, J. M. & Hollingworth, A. (1998). Eye Movements during Scene Viewing. An
Overview. In Underwood, G. W. (Ed.), Eye Guidance in Reading and Scene Perception,
269–293. Oxford: Elsevier.
Henderson, J. M. & Hollingworth, A. (1999). High Level Scene Perception. Annual Review
of Psychology, 50, 243–271.
Henderson, J. M. (1992). Visual attention and eye movement control during reading and
picture viewing. In K. Rayner (Ed.) Eye movements and Visual Cognition. New York:
Springer Verlag.
Henderson, J. M. & Ferreira, F. (Eds.). (2004). The integration of language, vision, and action:
Eye movements and the visual world. New York: Psychology Press.
Herskovits, A. (1986). Language and Spatial Cognition. An Interdisciplinary Study of the
Prepositions in English. Cambridge University Press: Cambridge.
Hoffman, J. E. (1998). Visual Attention and Eye Movements. In Pashler, H. (Ed.). (1998).
Attention. Psychology Press: UK, 119–153.
Holmqvist, K., Holmberg, N., Holsanova, J., Törning, J. & Engwall, B. (2006). Reading Information Graphics: Eyetracking studies with Experimental Conditions. In J. Errea (Ed.), Malofiej Yearbook of Infographics. Society for News Design (SND-E), Navarra University, Pamplona, Spain, pp. 54–61.
Holmqvist, K. & Holsanova, J. (1997). Reconstruction of focus movements in spoken dis-
course. In Liebert, W., Redeker, G. & Waugh, L. (Eds.), Discourse and Perspective in Cognitive Linguistics. Benjamins: Amsterdam, 223–246.
Holmqvist, K. (1993). Implementing Cognitive Semantics. Image schemata, valence accommodation and valence suggestion for AI and computational linguistics. Lund University
Cognitive Studies 17.
Holmqvist, K., Holsanova, J., Barthelson, M. & Lundqvist, D. (2003). Reading or scanning?
A study of newspaper and net paper reading. In Hyönä, J., Radach, R. & Deubel, H. (Eds.), The mind's eye: Cognitive and applied aspects of eye movement research (657–670). Elsevier
Science Ltd.
Holsanova, J., Holmberg, N. & Holmqvist, K. (forthc.). Integration of Text and Information
Graphics in Newspaper Reading. Lund University Cognitive Studies 125.
Horne, M., Hansson, P., Bruce G., Frid, J. & Filipson, M. (1999). Discourse Markers and the
Segmentation of Spontaneous Speech: The case of Swedish men 'but/and/so'. Working Papers 47, 123–140. Dept. of Linguistics, Lund University.
Huber, S., & Kirst, H. (2004). When is the ball going to hit the ground? Duration estimates,
eye movements, and mental imagery of object motion. Journal of Experimental Psychology: Human Perception and Performance, Vol. 30, No. 3, 431–444.
Inhoff, A. W. & Gordon, A. M. (1997). Eye Movements and Eye-Hand Coordination During
Typing. Current Directions in Psychological Science, Vol. 6(6), 1997. American Psychological Society: Cambridge University Press, 153–157.
Johansson, R., Holsanova, J. & Holmqvist, K. (2005). What Do Eye Movements Reveal
About Mental Imagery? Evidence From Visual And Verbal Elicitations. In Bara, B. G.,
Barsalou, L., Bucciarelli, M. (Eds.), Proceedings of the 27th Annual Conference of the
Cognitive Science Society, pp. 1054–1059. Mahwah, NJ: Erlbaum.
Johansson, R., Holsanova, J. & Holmqvist, K. (2006). Pictures and spoken descriptions elicit
similar eye movements during mental imagery, both in light and in complete darkness.
Cognitive Science 30: 6 (pp. 1053–1079). Lawrence Erlbaum.
Johansson, R., Holsanova, J. & Holmqvist, K. (2005). Spatial frames of reference in an interactive setting. In Tenbrink, Bateman & Coventry (Eds.), Proceedings of the Workshop on Spatial Language and Dialogue, Hanse-Wissenschaftskolleg, Delmenhorst, Germany, October 23–25, 2005.
Johnson-Laird, P. N. (1983). Comprehension as the Construction of Mental Models, Philo-
sophical Transactions of the Royal Society of London. Series B, Biological Sciences, Vol.
295, No. 1077, 353–374.
Jonassen, D. & Grabowski, B. (1993). Handbook of individual differences, learning, and in-
struction. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Juola, J. F., Bouwhuis, D. G., Cooper, E. E. & Warner, C. B. (1991). Control of Attention around the Fovea. Journal of Experimental Psychology: Human Perception and Performance, 17(1): 125–141.
Just, M. A. & Carpenter, P. A. (1976). Eye fixations and cognitive processes. Cognitive Psychology, 8, 441–480.
Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87, 329–354.
Kahneman, D. (1973). Attention and Effort. Prentice Hall, Inc.: Englewood Cliffs, New Jer-
sey.
Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance. In
Key, M. (Ed.), The Relationship of Verbal and Nonverbal Communication, Mouton: The
Hague, 207–227.
Kess, J. F. (1992). Psycholinguistics. Psychology, Linguistics and the Study of Natural Lan-
guage. Benjamins: Amsterdam/Philadelphia.
Kintsch, W. & van Dijk, T. A. (1983). Strategies of discourse comprehension. New York: Aca-
demic.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95, 163–182.
Kita, S., & Özyürek, A. (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal?: Evidence for an interface representation of spatial thinking and speaking. Journal of Memory and Language, 48, 16–32.
Kita, S. (1990). The temporal relationship between gesture and speech: A study of Japanese-English bilinguals. Unpublished master's thesis. Department of Psychology, University
of Chicago.
Naito, K., Katoh, T. & Fukuda, T. (2004). Expertise and position of line of sight in golf putting. Perceptual and Motor Skills, 99, 163–170.
Korolija, N. (1998). Episodes in Talk. Constructing coherence in multiparty conversation.
Linköping Studies in Arts and Science 171. Linköping University.
Kosslyn, S. (1994). Image and Brain. Cambridge, Mass. The MIT Press.
Kosslyn, S. (1978). Measuring the visual angle of the mind's eye. Cognitive Psychology, 10, 356–389.
Kosslyn, S. (1980). Image and Mind. Harvard University Press. Cambridge, Mass. and Lon-
don, England.
Kosslyn, S. M. (1995). Mental imagery. In S. M. Kosslyn & D.N. Osherson (Eds.), Visual
cognition: An invitation to cognitive science (Vol. 2, pp. 267–296). Cambridge, MA: MIT
Press.
Kowler, E. (1996). Cogito Ergo Moveo: Cognitive Control of Eye Movement. In Landy M. S.,
Maloney, L. T. & Paul, M. (Eds.), Exploratory vision: The Active Eye, 51–77.
Kozhevnikov, M., Hegarty, M. & Mayer, R. E. (2002). Revising the Visualizer-Verbalizer Dimension: Evidence for Two Types of Visualizers. Cognition and Instruction, 20(1), 47–77.
Krutetskii, V. A. (1976). The psychology of mathematical abilities in school children. Chicago:
University of Chicago Press.
Labov, W. & Waletzky, J. (1973). Erzählanalyse: Mündliche Versionen persönlicher Erfahrung. In J. Ihwe (Ed.), Literaturwissenschaft und Linguistik, Bd. 2. Frankfurt/M.: Fischer-Athenäum, 78–126.
Laeng, Bruno & Teodorescu, Dinu-Stefan (2002). Eye scanpaths during visual imagery re-
enact those of perception of the same visual scene. Cognitive Science 2002, Vol. 26, No.
2: 207–231.
Lahtinen, S. (2005). Which one do you prefer and why? Think aloud! In Proceedings of Join-
ing Forces, International Conference on Design Research. UIAH, Helsinki, Finland.
Lakoff, G. & Johnson, M. (1980). Metaphors we live by. Chicago: University of Chicago
Press.
Lakoff, G. (1987). Women, fire, and dangerous things: what categories reveal about the mind.
The University of Chicago Press: Chicago, IL.
Lang, E., Carstensen, K.-U. & Simmons, G. (1991). Modelling Spatial Knowledge on a Linguistic Basis: Theory – Prototype – Integration. (Lecture Notes on Artificial Intelligence
481) Springer-Verlag: Berlin, Heidelberg, New York.
Langacker, R. (1987). Foundations of Cognitive Grammar. Volume 1. Stanford University
Press: Stanford.
Langacker, R. (1991). Foundations of Cognitive Grammar. Volume 2. Stanford University
Press: Stanford.
Lemke, J. (1998). Multiplying meaning: Visual and verbal semiotics in scientific text. In J. Martin & R. Veel (Eds.), Reading Science. London: Routledge.
Levelt, W. J. M. (1981). The speaker's linearization problem. Philosophical Transactions of the Royal Society of London B, 295, 305–315.
Levelt, W. J. M. (1983). Monitoring and self-repair in speech. Cognition, 14, 41–104.
Levelt, W. J. M. (1989). Speaking: From intention to articulation. MIT Press, Bradford Books: Cambridge, MA.
Lévy-Schoen, A. (1969). Détermination et latence de la réponse oculomotrice à deux stimulus. L'Année Psychologique, 69, 373–392.
Lévy-Schoen, A. (1974). Le champ d'activité du regard: données expérimentales. L'Année Psychologique, 74, 43–66.
Linde, C. & Labov, W. (1975). Spatial networks as a site for the study of language and thought. Language, 51, 924–939.
Linde, C. (1979). Focus of attention and the choice of pronouns in discourse. In Talmy Givón (Ed.), Syntax and Semantics, Volume 12: Discourse and Syntax. Academic Press: New York, San Francisco, London, 337–354.
Linell, P. (1982). Speech errors and the grammatical planning of utterances: Evidence from Swedish. In Koch, W., Platzack, C. & Totties, G. (Eds.), Textstrategier i tal och skrift. Almqvist & Wiksell International: Stockholm, 134–151.
Linell, P. (1994). Transkription av tal och samtal. Arbetsrapporter från Tema Kommunikation 1994: 9. Linköpings universitet.
Linell, P. (2005). En dialogisk grammatik? In Anward, J. & Nordberg, B. (Eds.), Samtal och grammatik. Studentlitteratur, 231–315.
Loftus, G. R. & Mackworth, N. H. (1978). Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance, 4, 565–572.
Lucy, J. A. (1992). Language diversity and thought: A reformulation of the linguistic relativity hypothesis. Cambridge: Cambridge University Press.
Mackworth, N. H. & Morandi, A. J. (1967). The gaze selects informative details within pictures. Perception and Psychophysics, 2, 547–552.
Mast, F. W. & Kosslyn, S. M. (2002). Eye movements during visual mental imagery. Trends in Cognitive Sciences, 6(7).
Mathesius, V. (1939). O takzvaném aktuálním členění větném. Slovo a slovesnost, 5, 171–174. Also as: On information-bearing structure of the sentence. In K. Susumo (Ed.), 1975. Harvard: Harvard University, 467–480.
Meyer, A. S. & Dobel, C. (2003). Application of eye tracking in speech production research. In Hyönä, J. & Deubel, H. (Eds.), The Mind's Eye: Cognitive and applied aspects of eye movement research. Elsevier Science Ltd, 253–272.
Meulen, F. F. van der, Meyer, A. S. & Levelt, W. J. M. (2001). Eye movements during the production of nouns and pronouns. Memory & Cognition, 29, 512–521.
Mishkin, M., Ungerleider, L. G. & Macko, K. A. (1983). Object vision and spatial vision: Two cortical pathways. Trends in Neurosciences, 6, 414–417.
Mozer, M. C. & Sitton, M. (1998). Computational modelling of spatial attention. In Pashler, H. (Ed.), Attention. Psychology Press: UK, 341–393.
Naughton, K. (1996). Spontaneous gesture and sign: A study of ASL signs co-occurring with speech. In Messing, L. (Ed.), Proceedings of the workshop on the integration of gesture in language and speech. University of Delaware, 125–134.
Nordqvist, S. (1990). Kackel i trädgårdslandet. Opal.
Noton, D. & Stark, L. (1971a). Eye movements and visual perception. Scientific American, 224, 34–43.
Noton, D. & Stark, L. (1971b). Scanpaths in saccadic eye movements while viewing and recognizing patterns. Vision Research, 11, 929.
Noton, D. & Stark, L. (1971c). Scanpaths in eye movements during perception. Science, 171, 308–311.
Nuyts, J. (1996). Consciousness in language. Pragmatics & Cognition, 4(1) (Special issue on language and consciousness), 153–180.
Olshausen, B. A. & Koch, C. (1995). Selective Visual Attention. In Arbib, M. A. (Ed.), The handbook of brain theory and neural networks. Cambridge, MA: MIT Press, 837–840.
Oviatt, S. L. (1999). Ten Myths of Multimodal Interaction. Communications of the ACM, 42(11), 74–81.
Paivio, A. (1971). Imagery and Verbal Processes. Hillsdale, N.J.: Erlbaum.
Paivio, A. (1986). Mental representation: A dual coding approach. New York: Oxford University Press.
Paivio, A. (1991a). Dual Coding Theory: Retrospect and current status. Canadian Journal of Psychology, 45(3), 255–287.
Paivio, A. (1991b). Images in Mind. Harvester Wheatsheaf: New York, London.
Pollatsek, A. & Rayner, K. (1990). Eye movements, the eye-hand span, and the perceptual span in sight-reading of music. Current Directions in Psychological Science, 49–53.
Posner, M. I. (1980). Orienting of attention. Quarterly Journal of Experimental Psychology, 32, 3–25.
Prince, E. (1981). Toward a Taxonomy of Given-New Information. In Cole, P. (Ed.), Radical
pragmatics. New York: Academic Press.
Pylyshyn, Z. W. (2001). Visual indexes, preconceptual objects, and situated vision. Cognition, 80(1/2), 127–158.
Quasthoff, U. (1979). Verzögerungsphänomene, Verknüpfungs- und Gliederungssignale in Alltagsargumentationen und Alltagserzählungen. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache. Walter de Gruyter: Berlin, New York, 39–57.
Qvarfordt, P. (2004). Eyes on Multimodal Interaction. Linköping Studies in Science and Technology No. 893. Department of Computer and Information Science, Linköpings universitet.
Rayner, K. (Ed.). (1992). Eye movements and visual cognition: scene perception and reading.
New York: Springer-Verlag.
Redeker, G. (1990). Ideational and pragmatic markers of discourse structure. Journal of Pragmatics, 14, 367–381.
Redeker, G. (1991). Linguistic markers of discourse structure. Review article. Linguistics, 29, 139–172.
Redeker, G. (2000). Coherence and structure in text and discourse. In William Black & Harry Bunt (Eds.), Abduction, Belief and Context in Dialogue: Studies in Computational Pragmatics (233–263). Amsterdam: Benjamins.
Spivey, M. J., Tyler, M., Richardson, D. C. & Young, E. (2000). Eye movements during comprehension of spoken scene descriptions. Proceedings of the Twenty-second Annual Meeting of the Cognitive Science Society, 487–492. Erlbaum: Mahwah, NJ.
Stenström, A.-B. (1989). Discourse Signals: Towards a Model of Analysis. In H. Weydt (Ed.), Sprechen mit Partikeln. Walter de Gruyter: Berlin, New York, 561–574.
Strohner, H. (1996). Resolving Ambiguous Descriptions through Visual Information. In Representations and Processes between Vision and NL, Proceedings of the 12th European Conference on Artificial Intelligence, Budapest, Hungary 1996.
Strömqvist, S. (1996). Discourse Flow and Linguistic Information Structuring: Explorations in Speech and Writing. Gothenburg Papers in Theoretical Linguistics 78.
Strömqvist, S. (1998). Lite om språk, kommunikation, och tänkande. In T. Bäckman, O. Mortensen, E. Raanes & E. Østli (Eds.), Kommunikation med døvblindblivne. Dronninglund: Forlaget Nordpress, 13–20.
Strömqvist, S. (2000). A Note on Pauses in Speech and Writing. In Aparici, M. (Ed.), Developing literacy across genres, modalities and languages, Vol. 3. Universitat de Barcelona, 211–224.
Suwa, M., Tversky, B., Gero, J. & Purcell, T. (2001). Seeing into sketches: Regrouping parts encourages new interpretations. In J. S. Gero, B. Tversky & T. Purcell (Eds.), Visual and Spatial Reasoning in Design II. Key Centre of Design Computing and Cognition, University of Sydney, Australia, 207–219.
Tadahiko, F. & Nagano, T. (2004). Visual search strategies of soccer players in one-to-one defensive situation on the field. Perceptual and Motor Skills, 99, 968–974.
Tanenhaus, M. K., Magnuson, J. S., Dahan, D. & Chambers, C. (2000). Eye movements and lexical access in spoken-language comprehension: Linking hypothesis between fixations and linguistic processing. Journal of Psycholinguistic Research, 29(6), 557–580.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M. & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632–1634.
Taylor, H. A. & Tversky, B. (1992). Description and depiction of environments. Memory and Cognition, 20, 483–496.
Theeuwes, J. (1993). Visual selective attention: A theoretical analysis. Acta Psychologica, 83, 93–154.
Theeuwes, J., Kramer, A. F., Hahn, S. & Irwin, D. (1998). Our eyes do not always go where we want them to go: Capture of the eyes by new objects. Psychological Science, 9, 379–385.
Thomas, N. J. T. (1999). Are theories of imagery theories of imagination? An active perception approach to conscious mental content. Cognitive Science, 23(2), 207–245.
Tomlin, R. S. (1995). Focal attention, voice, and word order: An experimental, cross-linguistic study. In P. Downing & M. Noonan (Eds.), Word Order in Discourse. Amsterdam: John Benjamins, 517–554.
Tomlin, R. S. (1997). Mapping Conceptual Representations into Linguistic Representations: The Role of Attention in Grammar. In J. Nuyts & E. Pederson (Eds.), With Language in Mind. Cambridge: CUP, 162–189.
Tversky, B. (1999). What does drawing reveal about thinking? In J. S. Gero & B. Tversky (Eds.), Visual and spatial reasoning in design. Sydney, Australia: Key Centre of Design Computing and Cognition, 93–101.
Tversky, B., Franklin, N., Taylor, H. A. & Bryant, D. J. (1994). Spatial mental models from descriptions. Journal of the American Society for Information Science, 45(9), 656–668.
Ullman, S. (1996). High-level vision: Object recognition and visual cognition. Cambridge, MA: MIT Press.
Ungerleider, L. G. & Mishkin, M. (1982). Two cortical visual systems. In D. J. Ingle, M. A. Goodale & R. W. J. Mansfield (Eds.), Analysis of visual behavior. MIT Press.
Underwood, G. & Everatt, J. (1992). The role of eye movements in reading: Some limitations of the eye-mind assumption. In E. Chekaluk & K. R. Llewellyn (Eds.), The Role of Eye Movements in Perceptual Processes. Elsevier Science Publishers B. V., Advances in Psychology, Amsterdam, Vol. 88, 111–169.
van Donzel, M. E. (1997). Perception of discourse boundaries and prominence in spontaneous Dutch speech. Working Papers, 46, Lund University, Department of Linguistics, 5–23.
van Donzel, M. E. (1999). Prosodic Aspects of Information Structure in Discourse. LOT, Netherlands Graduate School of Linguistics. Holland Academic Graphics: The Hague.
Velichkovsky, B. M. (1995). Communicating attention: Gaze-position transfer in cooperative problem solving. Pragmatics and Cognition, 3(2), 199–222.
Velichkovsky, B., Pomplun, M. & Rieser, J. (1996). Attention and Communication: Eye-Movement-Based Research Paradigms. In W. H. Zangemeister, H. S. Stiehl & C. Freksa (Eds.), Visual Attention and Cognition (125–154). Amsterdam, Netherlands: Elsevier Science.
Viviani, P. (1990). Eye movements in visual search: Cognitive, perceptual, and motor control aspects. In E. Kowler (Ed.), Eye movements and their role in Visual and Cognitive Processes. Reviews of Oculomotor Research V4. Amsterdam: Elsevier Science B. V., 353–383.
Yarbus, A. L. (1967). Eye movements and vision (1st Russian edition, 1965). New York: Plenum Press.
Young, L. J. (1971). A study of the eye-movements and eye-hand temporal relationships of successful and unsuccessful piano sight-readers while piano sight-reading. Doctoral dissertation, Indiana University. RSD721341.
Zwaan, R. A. & Radvansky, G. A. (1998). Situation models in language comprehension and memory. Psychological Bulletin, 123, 162–185.
Author index
Strömqvist 3, 9, 15, 40, 54, 80, 179, 190
Suwa 154, 190

T
Tadahiko 99, 186, 190
Tanenhaus 103, 190
Tärning 183
Taylor 153, 180, 190–191
Teodorescu 158, 166, 186
Theeuwes 80, 83, 190
Thomas 166, 181, 190
Tomlin 82, 180–181, 190
Tufvesson 179
Tversky 48, 58, 153, 190–191
Tyler 159, 190

U
Ullman 191
Umkehrer 189
Underwood 80, 183, 191
Ungerleider 69, 187, 191

V
van der Meulen 119, 187
van Dijk 169, 185
van Donzel 15, 191
Velichkovsky 103, 191
Viviani 145, 191

W
Waletzky 59, 186
Walker 80, 83, 181
Waller 157, 183
Warner 185
Wengelin 179
Wiebe 189

Y
Yarbus 80, 87, 97, 136, 145, 191
Young 99, 159, 190–191

Z
Zetzsche 189
Zwaan 169, 191
Subject index
digression 39, 40, 42, 44, 45, 172
discourse analysis 171
discourse boundaries 12, 14, 15, 17, 171, 191
discourse
  coherence 50, 53, 122, 172
  comprehension 151–152, 169, 185
  hierarchy 10, 13, 100
  level(s) 13, 15, 88, 102, 115, 120–123, 143, 145, 147, 174, 176
  markers 5, 8, 11–12, 14, 16, 42–43, 46, 51–53, 59, 61–62, 67, 75, 156
  operators 43
  production 2, 9, 16, 103, 151–152, 171, 176
  segmentation 1, 9, 11, 15–16, 39, 171
  topics 9, 113, 121, 144
discourse-mediated mental representations 49, 149, 176
discourse-mediated representations 172
distribution of foci 28, 33, 35, 38, 171
drawing 38–39, 46, 49–50, 52, 54, 97, 151, 153–154, 156–157, 159, 161–166, 168, 172–173, 176–177, 191
dual code theory 68
dynamic 4, 30, 32, 55–67, 69, 70–77, 79, 84, 94, 97, 127–128, 141, 153, 157, 168, 173, 178
  description style 56, 59, 61–64, 69, 70, 73, 76–77, 173
  motion verbs 62, 64
  verbs 58, 62, 70, 72, 74
dysfluencies 3, 16, 21, 80, 113, 144

E
embedded segments 44
errors 3, 187
evaluation 9, 10, 50, 137, 139, 149, 175, 177
evaluative foci 24, 27, 33–34, 36, 52, 54, 62, 69, 109, 116, 143
evaluative tool 94
events 4, 16, 20–21, 27, 30, 37, 41, 57, 59, 62, 66–67, 77, 86, 117–118, 120–121, 142–143, 152, 158, 171, 173, 179
execution 3, 146
existential constructions 58, 70, 74, 76
experiential factors 64–66, 87, 173
experiment 82, 84, 161
expert foci 24, 27, 33–38, 62, 171–172
external
  memory aid 50, 54, 153, 173
  visual representations 49
  visualisations 157, 167–168, 177
eye fixation patterns 143
eye gaze 96, 161, 174
eye movement(s) 70, 80–84, 87, 90, 97, 99, 106, 116, 125–126, 129, 134–135, 137, 143, 145, 150, 157–158, 180–183, 185, 188–189
  function of 143, 176
  patterns 86, 137–138, 141, 144–145, 147, 149, 157–158, 163, 168, 176–177
  protocol 167, 175
eye tracker 28, 85
eye tracking 29–33, 35, 37–38, 40, 79, 85, 87, 95, 98, 100, 102, 116, 120, 123, 143, 151, 158, 162, 167, 169, 179, 187
eye voice latencies 161
eye-gaze patterns 96
eye-mind assumption 191
eye-voice latency 102, 122, 174
eye-voice span 102

F
feedback 2, 9, 15, 39, 50–51, 54
figure-ground 127
fixation(s) 89–91, 94, 99, 101, 103, 106, 135, 141, 144–145, 161–162, 175, 181, 190
  duration 88, 90
  pattern 89, 123, 125, 135, 141, 143, 150, 173
flow of speech 1, 171
flow of thought 1, 80, 171
focus 1, 3, 5–7, 11–13, 15–16, 19, 21–23, 26–28, 30, 32–33, 36, 37, 39, 41, 44–46, 48–50, 52–53, 55–57, 61–62, 65, 67–69, 77, 79, 83–85, 87, 92, 94, 100–104, 106–109, 114–116, 118, 121–123, 125–127, 129, 144–147, 151, 153–154, 156–157, 164–165, 172–176, 183, 189
  of active thought 3
  of attention 7, 49
  of thought 3
focusing 49, 50, 53–54, 59, 62–63, 79, 81, 87, 101, 103, 126–127
free description 63, 87, 121, 174–175
functional clusters 175
functional distribution
  of configuration types 100
  of multimodal patterns 116
functions of eye movements 143

G
gaze 50, 82, 84, 90, 96, 122, 144, 174, 187
  behaviour 50
  pattern 96, 161, 174
geometric type 68
gestures 3, 15, 46, 50, 52, 97, 151
global correspondence 161–163, 165
global thematic units 16

  integration patterns 119, 175
  method 77, 97, 125, 173, 177–178
  score sheet 79, 86, 97, 148
  scoring technique xii
  sequential method 6, 79, 84, 92, 94, 98
  system 102
  time-coded score sheets 89
multiple external representations 167
multiple representations 49, 167–169
mutual visual access 54, 173

N
narrative 28, 30, 32–35, 37–38, 55, 59, 65–66, 69–70, 73–77, 81, 171, 173, 179–180
narrative priming 28, 30, 33, 38, 55, 66, 70, 73–74, 76–77, 173
narrative schema 173
narrative setting 30, 34–35, 74, 171
non-verbal 3, 15–16, 39, 46, 50, 52–54, 67–68, 97, 102, 153
  actions 50, 52, 54
non-verbally 50
n-to-1 mappings 112, 117
n-to-n mappings 113, 117, 119

O
object-activity 129, 148, 175
object-attribute 129, 148, 175
object-location 128–129, 148, 175
object-object 148, 175
object-path 148, 175
off-line 19, 21, 26–29, 31–33, 35, 37–39, 43, 63, 68, 70, 72–76
off-line picture description 19, 21, 27, 37, 39, 43, 70
on-line 28–33, 35, 37–38, 40, 70, 73–74, 85, 96, 101, 103–104, 108–109, 113–114, 116, 125, 144, 148, 177
on-line picture description 40, 70, 73, 114, 116, 125
on-line writing 96
organisational function 20–21, 25, 27, 29, 33–37, 42, 87, 116, 121
organising function 37, 171
orientational function 24, 42, 53, 116
overt 80, 145
overt attention 80

P
parallel component model 51
paratactic segments 53, 172
paratactic transition 44
pauses 1–3, 8, 11, 15–16, 21, 39–40, 42, 53, 92, 96, 104–105, 108–109, 116–118, 122, 146, 172
pausing 9
perceive 12, 14, 125, 131
perception 1, 7–8, 12, 14, 16, 65, 80, 83–84, 86–87, 89, 90, 92, 94, 96–97, 108, 123, 125, 131, 134, 137, 142, 149, 151–152, 158, 178, 180–181, 184, 186, 188, 190
  of discourse boundaries 12, 191
perfect match 101, 106, 114, 126
performative distance from the descriptive discourse 42
phases 3, 9, 57, 59, 77, 90, 97, 102, 146, 159, 161, 164, 166, 173
phrase 6–7, 15, 120, 123, 143
phrasing unit(s) 5, 7
picture description 2, 6, 10, 12, 17, 19, 20, 22, 28, 30–40, 42, 46, 53, 55–57, 59, 64–67, 69, 70–73, 76–77, 79–82, 85–87, 89–90, 92, 94, 96, 98–102, 104, 109, 119–122, 125–128, 131, 133–134, 137, 139, 142–143, 145–146, 149, 150–151, 157, 164, 168, 173–174, 176–177, 184
picture viewing 6, 12, 32, 65, 70, 77, 79, 85–86, 89–90, 92, 94, 98–99, 102, 121, 125–126, 128, 131, 133–134, 137, 139, 142–143, 146, 149–150, 168, 173, 176–177, 183–184, 187
planning 1, 3, 16, 26–27, 37, 43, 80, 82, 97, 99, 103, 115, 120–123, 138, 141, 144, 146–147, 156, 158, 174, 176, 180, 187
pointing 50, 52, 54, 97, 126–127, 154, 156–157, 173
prediction 156
preparatory glances 122, 174
presentational function 21, 23, 26, 33, 36–37, 41, 116
priming study 125, 138, 149, 176
problem-solving 68, 81, 168, 177
prosodic criteria 11
proximity principle 131
psycholinguistic studies 81, 87, 102, 120–121, 123, 143
psycholinguistics 9, 180
psychology 66, 81, 131, 180, 186–187

R
re-categorisational gazes 176
re-conceptualisation 149, 175
re-examine 122, 174
referent storage 54, 173
referential availability 44, 49, 53
referents 8, 20–21, 27, 29–30, 36–37, 41, 44, 50, 107, 121, 142–143, 154, 171
re-fixate 175–176
re-fixation 174
refocus 45–46, 53, 61, 77, 87, 173
region informativeness 87, 91
regulatory function 25
relevance principle 79
remembering 66–69, 77, 173
reorientation 42
retrospective 167

S
saccades 90, 135, 149, 161
saliency principle 79, 131
scanpaths 139, 140–142, 145, 186
scene 19–21, 24–25, 32, 36, 42, 59, 65–66, 70, 77, 79, 86–88, 90–91, 94, 97, 101–102, 106, 114, 116, 118, 120–121, 125–128, 131–139, 141–142, 144–145, 149, 153, 158–159, 163, 165–166, 172–173, 176–177, 188, 190
scene perception 86, 90, 97, 116, 125, 142, 188
scene semantics 141, 176
segmentation 1, 5, 9, 10–12, 14–16, 97
segmentation rules 1, 5, 11
semantic 11, 14–17, 51, 79, 98–100, 131, 137, 142, 148, 150, 171, 174–175
  correspondence 79, 98–100, 148, 150, 174–175
  criteria 11, 14–17, 171
  groupings 131, 137, 142
semantic, rhetorical and sequential aspects of discourse 51
semiactive foci 7
semiactive information 49, 54, 172
sentences 1, 5, 14–15, 102
sequential processual 79
sequential steps 46, 53
series of delays 107
series of n-to-1 mappings 111
series of n-to-n mappings 113, 117
series of perfect matches 106
series of triangles 109–110
similarity principle 131
simultaneous 28, 30–31, 34, 38, 81–82, 86, 89, 94, 97–101, 103, 115, 122, 139, 142–143, 148, 161, 171–172, 174, 176
simultaneous description with eye tracking 34
simultaneous verbal description 28, 94, 97
situation awareness 49, 54, 97
spatial and temporal correspondence 161
spatial expressions 23, 58, 61–63, 70, 72, 74–76
spatial groupings 125, 131, 137, 142
spatial perception 61–62
spatial priming 37–38, 66, 69–73, 75–77
spatial proximity 125, 131–132, 134–137, 148–149, 167, 176
spatial relations 23, 27, 30, 37, 52, 56–58, 64, 67–69, 76, 143, 157–159, 168, 173, 177
spatial visualiser(s) 68
speakers 5, 14–16, 21–25, 27, 35, 37, 39–42, 45–46, 49, 52–55, 60–61, 73, 75, 77, 82, 102, 137, 139, 142, 146, 148, 151–153, 157, 167–168, 171–172, 176–177
specification 130, 149, 175
speech 1–4, 6, 8–11, 15–16, 21, 26–27, 38–40, 42–43, 45, 51–53, 75, 80–82, 84, 92, 94, 99, 102–103, 106, 116, 121–123, 137–138, 143, 146–148, 172, 174–176, 179–182, 185–188, 191
speech and thought 9
speech unit 4, 6, 42
spoken discourse 1–5, 12, 16, 127, 154, 157, 173, 183
spoken language 6–7, 12, 16, 43, 67, 79–84, 86, 88, 94, 98, 115, 126, 137, 148, 167, 181, 190
spontaneous conversation 16, 28, 38, 151, 153
spontaneous description and drawing 46
spontaneous descriptive discourse 36, 154
spontaneous drawing 48, 157
spotlight 79, 83–84, 94, 130, 154
states 4, 11, 20–21, 27, 30, 37, 41, 51, 86, 97, 117–118, 121, 131, 142–143, 145, 171
static 29, 32, 55–59, 62–67, 69–73, 76–77, 81, 90, 92, 173, 183
  description style 55–57, 59, 63–64, 69–72, 76, 173
steps in picture description 53
storage of referents 50, 153
structure 1, 5, 14–15, 17, 19–20, 28, 30, 37–39, 43, 46, 51–53, 55, 94, 96–97, 109, 121, 137, 145, 161, 172, 188
structure of spoken picture descriptions 17, 19, 38
substantive foci 21–23, 27, 30–31, 33, 36–38, 44, 62, 102, 104, 113, 116, 123, 171, 175
substantive foci with categorisation difficulties 21, 30–31, 33, 38, 113, 116, 123
substantive list of items 22, 146
summarising foci 22, 24, 27, 31, 38, 41, 63, 114, 117, 122–123, 146, 171
summarizing gazes 176
superfocus 1, 8, 12–14, 22, 27, 40–41, 53, 89, 92, 100, 102–104, 106–107, 109, 113–115, 119, 121–123, 133, 144, 147–148, 156, 164, 171, 174–176
support for visualisation 50, 153
symmetry principle 131

T
task-dependent cluster 136
taxonomy 4, 28, 30, 37, 39, 171
taxonomy of foci 30, 37, 39, 171
taxonomic proximity 131
temporal 61–63, 70, 72–77, 99, 121, 123, 125, 157, 173–174, 186, 191
  dynamics 75, 76, 77, 173
  expressions 62–63, 70, 72–74, 76–77, 173
  perception 61–62
  relations 77, 99, 121, 123, 125, 157, 173–174, 186, 191
  simultaneity 174
temporal correspondence 85
thematic distance 40–41
  from surrounding linguistic context 40
  from surrounding pictorial context 41
think-aloud protocols 81, 87
thought processes 81, 82
timeline 88, 100
topical episodes 16
trajector-landmark 127
transcribing 3
transcript 4, 6, 11–12, 19, 46, 82, 88–89, 91–92
transcription 1, 3–4, 12, 16, 88
  symbols 4
transition(s) 3, 25, 40–41, 44–45, 50, 53, 156, 172
  between foci 39
triangle configuration 102
two different styles 56
two windows to the mind 6, 79, 84, 94, 98, 174
types 21, 26–28, 33, 35–37, 46, 55, 69, 87, 89, 96, 99, 121, 145, 171, 174
  of foci 21, 26–28, 33, 35–37, 46, 55, 69, 87, 96, 99, 145, 171, 174
  of superfoci 89, 121
typology of foci 38

U
underlying cognitive processes 6, 82, 94, 98, 148
underlying mental processes 97
unit of comparison 100–101, 115, 125
units in spoken descriptive discourse 1
usability 81, 167, 168
utterance 3–4, 8, 43, 45, 52, 102, 120, 139, 146–148, 176, 185
utterances 1, 5, 9–10, 13–15, 81, 123, 138–139, 141, 144, 147, 149, 153, 175–176, 187

V
variations in picture description 55
verbal and visual clusters 89, 125, 148
verbal and visual protocols 84
verbal behaviour 88, 98, 100, 174, 177, 182
verbal foci 8, 11–13, 15, 21–23, 34, 39, 41, 51–52, 82, 84, 86–87, 89, 92, 94, 99, 103, 106–107, 109, 113, 115–117, 122, 126, 130, 145, 148, 152, 156, 176
verbal focus 1, 6–8, 16, 21, 27, 82, 84, 92, 99–106, 109, 114–115, 117–118, 122–123, 126, 146, 154, 171, 174–175
verbal focus of attention 84, 99
verbal protocols 81, 87, 167
verbal stream 92, 101, 116–117, 119, 126
verbal superfoci 12, 13, 15, 89, 115, 119, 148
verbal superfocus 1, 6, 8, 16, 21, 113, 132, 135, 171
verbal thinkers 64, 67, 77, 151, 173
verbalisation process 131, 148, 177
verbalisers 68–69
viewing dimensions 137, 142
viewing patterns 138, 145, 149, 176
vision 69, 79, 82, 84–85, 90–91, 94, 127, 176, 182–183, 186–189, 191
visual access 49
visual and cognitive processing 80
visual behaviour 2, 28, 88–89, 96, 145
visual displays 80, 120, 123
visual fixation cluster 89, 103, 107, 115, 122, 174
visual foci 84, 93, 100, 103, 115–116, 119, 122, 128, 135, 148, 175–176
visual focus 68, 83–84, 99, 101, 106, 115, 122–123, 154, 174–175
visual focus of attention 84, 99
visual inspection 118, 120, 131, 174
visual paths 91
visual representations 149, 168, 176
visual scene 46, 137, 145, 147, 177, 180, 186
visual stream 92, 101, 119, 126
visual thinkers 55, 64, 66–67, 151, 157, 182
visualisation(s) 40, 49, 151, 153, 157, 164–165, 167–168, 172–173, 177
visualiser 68
visualisers 68–69
visually present objects 92
vocaliser 68

W
windows to the mind 6, 79, 84, 94, 98, 174

Z
zooming in 101
zooming out 127, 130, 132–134
In the series Human Cognitive Processing the following titles have been published thus far or
are scheduled for publication: