Editors
Marcelo Dascal, Tel Aviv University
Raymond W. Gibbs, Jr., University of California at Santa Cruz
Jan Nuyts, University of Antwerp
Volume 23
Discourse, Vision, and Cognition
by Jana Holšanová
Discourse, Vision, and Cognition
Jana Holšanová
Lund University
Holšanová, Jana.
Discourse, vision, and cognition / Jana Holšanová.
p. cm. (Human Cognitive Processing, issn 1387-6724; v. 23)
Includes bibliographical references and index.
1. Discourse analysis. 2. Oral communication. 3. Psycholinguistics. 4. Visual
communication. I. Title.
P302.H635 2008
401'.41--dc22 2007049098
isbn 978 90 272 2377 7 (Hb; alk. paper)
Preface xi
chapter 1
Segmentation of spoken discourse 1
1. Characteristics of spoken discourse 1
1.1 Transcribing spoken discourse 3
2. Segmentation of spoken discourse and cognitive rhythm 4
2.1 Verbal focus and verbal superfocus 6
3. Segmentation rules 10
4. Perception of discourse boundaries 12
4.1 Focus and superfocus 13
4.2 Other discourse units 15
5. Conclusion 16
chapter 2
Structure and content of spoken picture descriptions 19
1. Taxonomy of foci 19
1.1 Presentational function 21
1.1.1 substantive foci 21
1.1.2 substantive foci with categorisation difficulties 21
1.1.3 substantive list of items 22
1.1.4 summarising foci 22
1.1.5 localising foci 23
1.2 Orientational function 24
1.2.1 evaluative foci 24
1.2.2 expert foci 24
1.3 Organisational function 25
1.3.1 interactive foci 25
1.3.2 introspective foci & metacomments 26
chapter 3
Description coherence & connection between foci 39
1. Picture description: Transitions between foci 39
1.1 Means of bridging the foci 42
1.1.1 Discourse markers 43
1.1.2 Loudness and voice quality 45
1.1.3 Localising expressions 46
2. Illustrations of description coherence: Spontaneous description & drawing 46
2.1 Transitions between foci 50
2.2 Semantic, rhetorical and sequential aspects of discourse 51
2.3 Means of bridging the foci 52
3. Conclusion 53
chapter 4
Variations in picture description 55
1. Two different styles 55
1.1 The static description style 57
1.2 The dynamic description style 59
1.3 Cognitive, experiential and contextual factors 64
chapter 5
Multimodal sequential method and analytic tool 79
1. Picture viewing and picture description: Two windows to the mind 80
1.1 Synchronising verbal and visual data 81
1.2 Focusing attention: The spotlight metaphor 83
1.3 How it all began 84
2. Simultaneous description with eye tracking 85
2.1 Characteristics of the picture: Complex depicted scene 86
2.2 Characteristics of the spoken picture description: Discourse level 87
2.3 Multimodal score sheets 88
2.3.1 Picture viewing 89
2.3.2 Picture description 91
3. Multimodal sequential method 94
4. Conclusion 98
chapter 6
Temporal correspondence between verbal and visual data 99
1. Multimodal configurations and units of visual and verbal data 100
1.1 Configurations within a focus 101
1.1.1 Perfect temporal and semantic match 101
1.1.2 Delay between the visual and the verbal part 101
1.1.3 Triangle configuration 102
1.1.4 N-to-1 mappings 104
1.1.5 N-to-1 mappings, during pauses 104
1.1.6 N-to-1 mappings, rhythmic re-examination pattern 105
1.2 Configurations within a superfocus 106
1.2.1 Series of perfect matches 106
1.2.2 Series of delays 107
1.2.3 Series of triangles 109
chapter 7
Semantic correspondence between verbal and visual data 125
1. Semantic correspondence 126
1.1 Object-location relation 128
1.2 Object-path relation 128
1.3 Object-attribute relation 129
1.4 Object-activity relation 129
2. Levels of specificity and categorisation 130
3. Spatial, semantic and mental groupings 131
3.1 Grouping concrete objects on the basis of spatial proximity 131
3.2 Grouping multiple concrete objects on the basis of categorical proximity 132
3.3 Grouping multiple concrete objects on the basis of the composition 133
3.4 Mental zooming out, recategorising the scene 133
3.5 Mental grouping of concrete objects on the basis of similar traits and activities 134
3.6 Mental grouping of concrete objects on the basis of an abstract scenario 135
4. Discussion 137
4.1 The priming study 138
4.2 Dimensions of picture viewing 142
4.3 Functions of eye movements 143
4.4 The role of eye fixation patterns 143
4.5 Language production and language planning 146
5. Conclusion 148
chapter 8
Picture viewing, picture description and mental imagery 151
1. Visualisation in discourse production and discourse comprehension 152
2. Mental imagery and descriptive discourse 157
chapter 9
Concluding chapter 171
References 179
Author index 193
Subject index 197
Preface
Verbal data (spoken language descriptions) and visual data (the contents of the
attentional spotlight) are used as two windows to the mind. Both kinds of data are
indirect sources that shed light on the underlying cognitive processes.
It is, of course, impossible to directly uncover our cognitive processes. If we
want to learn about how the mind works, we have to do it indirectly, via overt
manifestations. The central question is: Can spoken language descriptions and
eye movement protocols, in concert, elucidate covert mental processes? To an-
swer this question, we will proceed in the following steps:
Chapter 1 presents a segmentation of spoken discourse and defines units of
speech (verbal focus and verbal superfocus) expressing the contents of active
consciousness and providing a complex and subtle window on the mind.
Chapter 2 focuses on the structure and content of spoken picture descrip-
tions. It describes and illustrates various types of foci and superfoci extracted
from picture descriptions in various settings. Chapter 3 takes a closer look at
how speakers create coherence when connecting the subsequent steps in their
picture descriptions. Chapter 4 discusses individual description styles.
While the first four chapters of this book deal with characteristics of picture
descriptions in different settings, in the remaining four chapters, the perspec-
tive has been broadened to that of picture viewing. These chapters explore the
connection between spoken descriptive discourse, picture viewing and mental
imagery. The discourse segmentation methodology is therefore extended into
a multimodal scoring technique for picture description and picture viewing,
leading up to an analysis of correspondence between verbal and visual data.
Chapter 5 deals with methodological questions, and the focus is on sequen-
tial and processual aspects of picture viewing and picture description. The read-
er gets acquainted with the multimodal method and the analytical tools that are
used when studying the correspondence between verbal and visual data.
In Chapters 6 and 7, the multimodal method is used to compare the con-
tent of the visual focus of attention (specifically clusters of visual fixations) and
the content of the verbal focus of attention (specifically verbal foci and super-
foci) in order to find out whether there is correspondence in units of picture
viewing and simultaneous picture description. Both temporal and semantic re-
lations between the verbal and visual data are investigated. Finally, clusters on
different levels of the discourse hierarchy are connected to certain functional
sequences in the visual data.
Chapter 8 focuses on the issue of visualisations in discourse production
and discourse comprehension and presents studies on mental imagery associ-
ated with picture viewing and picture description.
The concluding Chapter 9 looks back on the most important issues and
findings in the book and mentions implications of the multimodal approach
for other fields of research, including evaluation of design, users' interaction
with multiple representations, multimodal systems, etc.
This book addresses researchers with a background in linguistics, psycho-
linguistics, psychology, cognitive science and computer science. The book is
also of interest to scholars working in the applied area of usability and in the
interdisciplinary field concerned with cognitive systems involved in language
use and vision.
Acknowledgements
I wish to thank Professor Wallace Chafe, University of California, Santa Barbara, for in-
spiration, encouragement and useful suggestions during my work. I also ben-
efited from discussions and criticism raised by doctoral students, research-
ers and guest researchers during the eye tracking seminars at the Humanities
Laboratory and Cognitive Science Department at Lund University: Thanks to
Richard Andersson, Lenisa Brandão, Philip Diderichsen, Marcus Nyström and
Jaana Simola. Several colleagues have contributed by reading and commenting
on the manuscript. In particular, I want to thank Roger Johansson and Ken-
neth Holmqvist for their careful reading and criticism. I also wish to thank the
anonymous reviewer for her/his constructive suggestions. Finally, thanks are
due to my family: to our parents, my husband and my children Fredrik and
Annika.
The work was supported by the post-doctoral fellowship grant VR 2002
6308 from the Swedish Research Council.
chapter 1
Segmentation of spoken discourse
When we listen to spontaneous speech, it is easy to think that the utterances form
a continuous stream of coherent thought. Not until we write down and analyse
the speech do we realise that it consists of a series of small units. The speech
flow is segmented, containing many repetitions, stops, hesitations and pauses.
Metaphorically one could say that speech progresses in small discontinuous steps.
In fact, it is the listeners who creatively fill in what is missing or filter out what is
superfluous and in this way construe continuous speech. The stepwise structure
of speech can give us many clues about cognitive processes. It suggests units for
planning, production, perception and comprehension. In other words, the flow of
speech reflects the flow of thoughts.
language terms (Linell 1994:2) and include such features as pauses, hesitations,
unintelligible speech, interruptions, restarts, repeats, corrections and listener
feedback. All these features are typical of speech and give us additional in-
formation about the speaker and the context. The first example stems from a
data set on picture descriptions in Swedish and illustrates some of the typical
features of spoken discourse.
Example 1
0851 ehmm till höger på bilden
ehmm to the right in the picture
0852 så finns de/ ja juste,
there are/ oh yes,
0853 framför framför trädet
in front in front of the tree
0854 så e det . gräs,
so there is . grass,
0855 eh och till höger om det här trädet
eh and to the right of this tree
0856 så har vi återigen en liten åker-täppa eller jord,
so again we have a little field or piece of soil,
As we turn our ideas into speech, the match between what we want to say
and what we actually do say is rarely a perfect one. The stepwise production of
real-time spoken discourse is associated with pauses, stutterings, hesitations,
contaminations, slips of the tongue, speech errors, false-starts, and verbatim
repetitions. Within psycholinguistic research, such dysfluencies in speech have
been used as a primary source of data since they allow us insights into the actu-
al process of language production. Such dysfluencies have been explored
in order to reveal planning, execution and monitoring activities on different
levels of discourse (Garrett 1980; Linell 1982; Levelt 1983; Strömqvist 1996).
Goldman-Eisler (1968) has shown that pauses and hesitation reflect plan-
ning and execution in tasks of various cognitive complexity. Pausing and hesi-
tation phenomena are more frequent in demanding tasks like evaluating (as
opposed to simply describing), as well as at important cognitive points of tran-
sition when new or vital pieces of information appear. Also, choice points at
ideational boundaries are associated with a decrease in speech fluency.
Errors in speech production give us some idea of the units we use. Slips of
the tongue can appear either on a low level of planning and production, as an
anticipation of a following sound ("he dropped his cuff of coffee"), or on higher
levels of planning and production ("Thin this slicely" instead of "Slice this thinly";
Kess 1992). Garrett (1975, 1980) suggests that speech errors provide evidence
for two levels of planning: semantic planning across clause boundaries (on a
functional level) and grammatical planning within the clause range (on a po-
sitional level). Along similar lines, Levelt (1989) assumes macroplanning (i.e.
elaboration of a communicative goal) and microplanning (decisions about the
topic or focus of the utterance etc.). Linell (1982) distinguishes two phases of
utterance production: the construction of an utterance plan (a decision about
the semantic and formal properties of the utterance) and the execution of an ut-
terance plan (the pronunciation of the words) (see also Clark & Clark 1977).
presentation. The first two numbers identify the informant (04), the following two
or three digits number the speech units in order. The transcribed speech
is not adapted to the syntax and punctuation of written language. Also, it in-
cludes forms that do not exist in written language (uhu, mhm, ehh), except
perhaps in instant messaging. The data has been translated into spoken Eng-
lish. Compared to the orthography of written language, some spoken English
language forms are used in the transcript (unless the written pronunciation is
used): he's, there's, don't. For easier reading, Table 1 summarises the symbols
that are used to represent speech and events in the transcription.
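The numbering scheme described above can be sketched as a small helper; this is an illustrative function written for this summary, not part of the original study's tooling:

```python
def parse_unit_id(unit_id: str) -> dict:
    """Split a transcript line number such as '0851' into its parts:
    the first two digits identify the informant, the remaining two or
    three digits number the speech unit."""
    return {"informant": unit_id[:2], "unit": unit_id[2:]}

print(parse_unit_id("0851"))   # {'informant': '08', 'unit': '51'}
print(parse_unit_id("04102"))  # {'informant': '04', 'unit': '102'}
```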
elements are called: idea units, intonation units, information units, informa-
tion packages, phrasing units, clauses, sentences, utterances etc. Many theories
within the research on discourse structure agree that we can focus on only
small pieces of information at a time, but there are different opinions about what
size these information units should have. In the following, I will define two
units of spoken discourse and formulate segmentation rules. But first, let us
look at the next example.
Example 2 illustrates the segmentation of spoken discourse. In a spoken
description of a picture (cf. Figure 1), one aspect is selected and focused on at a
time and the description proceeds in brief spurts. These brief units that speak-
ers concentrate on one at a time express the focus of active consciousness. They
contain one primary accent, a coherent intonation pattern, and they are often
preceded by a hesitation, a pause or a discourse marker signalling a change of
focus. Note that the successive ideas expressed in units 0404–0410 are in six
cases marked and connected by discourse markers och då, sen så, sen, så,
då (and then, then, and then, and so, and then).
Example 2
0402 de e Pettson och katten Findus i olika skepnader då, ja
well it's Pettson and Findus the cat in different versions,
0403 och Pettson gräver
and Pettson is digging
0404 . och då har han hittat nånting som han håller i handen,
. and he's found something that he's holding in his hand,
Several verbal foci are clustered into a more complex unit called verbal superfo-
cus. A verbal superfocus is a coherent chunk of speech that extends beyond the verbal
focus and is prosodically finished. This larger discourse segment, typically a
longer utterance, consists of several foci connected by the same thematic aspect
and has a sentence-final prosodic pattern (often a falling intonation). A new
superfocus is typically preceded by a long pause and a hesitation, which reflects
the process of refocusing attention from one picture area to another. Alterna-
tively, transitions between superfoci are made with the aid of discourse mark-
ers, acceleration, voice quality, tempo and loudness. An inserted description
or comment uttered in another voice quality (creaky voice or dialect-imitating
voice), deviating from its surroundings, can stretch over several verbal foci and
form a superfocus. The acoustic features simplify the perception of superfocus
borders. The referents often remain the same throughout the superfocus, only
. Chafe (1979, 1980) found significantly longer pauses and more hesitation signals at the
borders between superfoci than at the borders between verbal foci.
some properties change. When it comes to the size of superfoci, they often
correspond to long utterances. Superfoci can be conceived of as new complex
units of thought. According to Chafe (1994), these larger units of discourse
(called centres of interest) represent thematic units of speech that are guided
by our experience, intellect and judgements, and thus correspond roughly to
scripts or schemata (Schank & Abelson 1977).
Superfoci are in turn parts of even larger units of speech, called discourse
topics. In our descriptive discourse, examples of such larger units would be
units guided by the picture composition (in the middle, on the left, on the
right, in the background) or units guided by the conventions in descriptive
discourse (impression, general overview-detailed description, description-evalu-
ation). In Chafe's terminology, these bigger units are called basic level topics or
discourse topics, verbalising the content of the semiactive consciousness. We
usually keep track of the main ideas expressed in these larger units. Table 2
summarises all the building blocks in the hierarchy of discourse production:
foci, superfoci and discourse topics.
Let me briefly mention other suggested units of speech and thought. But-
terworth (1975) and Beattie (1980) speak about hesitant and fluent phases
in speech production. Butterworth asked informants to segment monologic
discourse into idea units that were assumed to represent informal intuitions
about the semantic structuring of the monologues. The main result of his stud-
ies was that borders between idea units coincided with the hesitations in the
speech production, suggesting that the successive ideas are planned during
these periods and formulated during the subsequent fluent periods (Butter-
worth 1975:81, 83). Beattie (1980) found a cyclic arrangement of hesitant
and fluent phases in the flow of language.
Schilperoord & Sanders (1997) analyse cyclic pausing patterns in discourse
production and suggest that there is a cognitive rhythm in our discourse seg-
mentation that reflects the gross hierarchical distinction between global and
local transitions. Yet another tradition in psycholinguistics is to speak about
information units or information packages and look at how information is pack-
aged and received in small portions that are digestible, in a certain rhythm that
gives the recipient regular opportunities for feedback (Strömqvist 1996, 1998,
et al. 2004; Berman & Slobin 1994).
Next, I will specify the set of rules that I used for segmentation of spoken
descriptive discourse.
Table 2. The discourse hierarchy and the building blocks in free picture description

Discourse topics (comparable to paragraphs)
Examples: Composition; Evaluations; Associations; Impressions; Impression of the whole picture

Superfoci (long utterances, clauses dealing with the same topic)
Example (Superfocus 1): Pettson is first digging in the field and then he's sowing quite big white seeds and then the cat starts watering these particular seeds and it ends up with the fellow eh raking this field,

Foci (phrases, clauses, noun phrases grouped on semantic or functional grounds, short utterances)
Examples (Focus 1 etc.): in the background; on the left; eh Pettson is digging; in the middle is a tree;
3. Segmentation rules
The discussions on the segmentation of speech have often dealt with the dif-
ficulty of clarifying and exactly delimiting such units. Chafe (1987) mentions the
functional properties of the smallest unit in discourse: An intonation unit is a
1. When dividing the transcript into units, the primary (focal) accent, togeth-
er with a coherent intonation pattern (falling or rising pitch contour), is
the dominant and most decisive feature for segment borders.
2. Apart from the two above-mentioned features, segment borders can be
recognised by pauses and hesitation that appear on the boundary between
two verbal foci.
3. Additional strength is added if these features are supported by changes in
loudness, voice quality, tempo and acceleration (especially for the segmen-
tation of larger chunks of speech).
4. A verbal focus is a unit that is seldom broken up internally by pauses. But
when rules 1–3 are fulfilled and there is a short pause in the middle of a
unit, where the speaker looks for the correct word or has trouble with pro-
nunciation or wording, it is still considered to be one unit.
around in the air there are . a number of insects flying
but it's getting close to the genre of com/ comic strips,
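As a rough illustration only, the prosodic cues in rules 1–2 can be sketched as a simple boundary detector over annotated words. The Word attributes and the 0.2-second pause threshold are assumptions made for this sketch, not values from the study:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    focal_accent: bool = False   # rule 1: carries the primary (focal) accent
    contour_end: bool = False    # rule 1: a coherent pitch contour ends here
    pause_after: float = 0.0     # rule 2: following pause/hesitation, in seconds

def segment_into_foci(words, pause_threshold=0.2):
    """Place a focus border where a focal accent and a completed
    intonation contour coincide, reinforced by a following pause."""
    foci, current = [], []
    for w in words:
        current.append(w.text)
        if w.focal_accent and w.contour_end and w.pause_after >= pause_threshold:
            foci.append(" ".join(current))
            current = []
    if current:                  # rule 4: a trailing stretch stays one unit
        foci.append(" ".join(current))
    return foci

words = [
    Word("och"), Word("Pettson"),
    Word("gräver", focal_accent=True, contour_end=True, pause_after=0.4),
    Word("och"), Word("då"), Word("har"), Word("han"),
    Word("hittat", focal_accent=True, contour_end=True, pause_after=0.3),
]
print(segment_into_foci(words))
# ['och Pettson gräver', 'och då har han hittat']
```

A short pause inside a unit (rule 4) simply fails the boundary test, so the unit stays whole, mirroring the rule as stated.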
. The data were collected at Lund University and at UCSB. I would like to thank John
W. Du Bois and Wallace Chafe for their help with this study.
All authentic border markings of the Swedish and non-Swedish groups have been
compared and related to the different types of borders coded by the experts.
First, the aim of the study was to find out whether discourse borders are more
easily recognised at the higher discourse level of verbal superfoci than at the
lower discourse level of verbal foci. The hypothesis was that the boundaries of
a superfocus (a unit comparable to long utterances, higher in the discourse
hierarchy) will be perceived as heavier and more final and thus will be more
easily identified than a focus. The agreement on focus boundaries, superfo-
cus boundaries, and non-boundaries by Swedish and non-Swedish informants
is illustrated in Figure 3. As we can see, the hypothesis has been confirmed:
Swedish and non-Swedish informants agree more on superfocus boundaries
than on focus boundaries. The agreement on focus boundaries reached a level
of 45 percent in the Swedish and 59 percent in the non-Swedish group, whereas
the agreement on superfocus boundaries reached 78 percent in the Swedish
and 74 percent in the non-Swedish group. Apart from agreement on different
kinds of boundaries, we can also consider listeners' agreement on what was not
a boundary.
Figure 3. Agreement (in percent) among the Swedish and non-Swedish groups on focus boundaries, superfocus boundaries and non-boundaries.
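The agreement figures reported above can be read as simple percent agreement between listener markings and expert-coded boundary positions. The following computation is illustrative only, with made-up positions rather than the study's data:

```python
def percent_agreement(marked_positions, coded_positions):
    """Share of expert-coded positions (indices between speech units)
    that a listener also marked, expressed in percent."""
    if not coded_positions:
        return 0.0
    hits = sum(1 for p in coded_positions if p in marked_positions)
    return 100.0 * hits / len(coded_positions)

# Hypothetical example: a listener marks borders after units 2, 5 and 9,
# while the experts coded superfocus borders after units 2, 5, 9 and 12.
print(percent_agreement({2, 5, 9}, [2, 5, 9, 12]))  # 75.0
```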
In the previous part, the reader got to know that the interplay of prosodic,
acoustic and semantic criteria facilitates the recognition of boundaries in de-
scriptive discourse. It should be noted, however, that there are argumentative
parts of descriptive discourse where the rhetorical build-up plays a role in
the discourse segmentation. These units are often based on the relation event-
cause or state-cause. Another type of unit is the question-answer pair (and
other types of adjacency pairs). In dialogic discourse, listeners' feedback and
laughter play an important role in confirming segmentation borders. Also,
non-verbal actions, such as body language, gestures etc., can be important for
recognising the start/end of a unit or action (cf. Holsanova 2001, Chapter 5).
In a study on radio debates, Korolija (1998) reached the conclusion that listen-
ers most easily recognise leitmotifs or macro topics.
5. Conclusion
chapter 2
Structure and content of spoken picture descriptions
This chapter will focus on the structure and content of spoken picture descrip-
tions. First, I will describe and illustrate various types of units (foci and su-
perfoci) extracted from an off-line picture description (i.e. descriptions from
memory) in an interactive setting. Second, I will compare the distribution of
the different foci in this setting to data from other studies.
1. Taxonomy of foci
Let me first introduce a transcript extract from a study where informants de-
scribed the picture off-line in an interactive setting (Holsanova 2001:7ff.).
Twelve informants had each examined a complex picture for a period of time
(1 minute). Then the picture was removed and they described the contents of
the picture to a listener. Let us look at an extract from this study (Example 1).
One informant in the second half of his description mentions the tree with
birds in it.
Example 1
0637 in the middle there was a tree
0638 the lower . part of the tree was crooked
0639 there were . birds
0640 it looked like a happy picture,
0641 everybody was pleased and happy,
0642 eh and (2 s) the fields were brown,
0643 eh it was this kind of topsoil,
0644 it wasn't sand,
0645 and then we had . it was in this tree,
0646 there were . it was like a . metaphor of . the average Swede there,
0647 that you have a happy home
0648 the birds are flying
0649 the birds were by the way characteristic,
0650 they were stereotyped pictures of birds,
0651 big bills
0652 Xxx human traits
As we can see in this example, the informant describes the birds and at the
same time evaluates the atmosphere of the picture (lines 0640–0641), describes
the quality of the soil (lines 0642–0644), formulates a similarity by using a
metaphor (lines 0646–0648) and finally characterises some of the elements de-
picted (lines 0649–0652). He is not only expressing ideas about states, events
and referents in the picture but also evaluating these in a number of ways. This
can be generalised for all picture descriptions in this study: The informants do
not only talk about WHAT they saw in the picture (the referents, states and
events) but also HOW they appeared to them. In other words, informants also
express their attitudes towards the content; they evaluate and comment on dif-
ferent aspects of the scene.
The way we create meaning from our experience and describe it to others
can be understood in connection to our general communicative ability. Ac-
cording to Lemke (1998:6), when we make meaning we always simultaneously
construct a presentation of some state of affairs, orient to this presentation
and orient to others, and in doing so create an organised structure of relat-
ed elements. The presentational, orientational and organisational functions
. These elements of spoken descriptions reflect the fact that our consciousness is to a
large extent made up of experiences of perceptions and actions, accompanied by emotions,
opinions and attitudes etc. In addition to perceptions, actions and evaluations, there are
sometimes even introspections or meta-awareness (Chafe 1994:31).
Let me start with the presentational function, which includes substantive, sum-
marising and localising foci. In these units of speech, speakers introduce us to a
scene, to scene elements (animate and inanimate, concrete and abstract) and
their relations, to processes and to circumstances.
to be a sort of/ I don't remember what is it called, there are those things hanging
there, sort of seed/ seed things in the tree). Often, this focus type constitutes a
whole superfocus.
Another common pattern is that the speaker uses an abstract name for activi-
ties (cultivating process) that s/he then elaborates within the following cluster
of verbal foci (list) as in Example 4.
Example 4
0508 and then one can see . different parts/ eh well, steps in the . cultivating process sum
0509 he's digging, subst list
0510 he's raking subst list
0511 then he's sowing, subst list
The relation between the summarising focus (1102) and the subsequent sub-
stantive foci is characterised as a part-whole relation or as conceptual depen-
dence (Langacker 1987:306). Sowing here subsumes a schema for an action: a
person (1103–1105) and a cat (1109) are sowing and performing several activi-
ties (1106–1111) in a specific temporal order (first, and then, it ends).
The next types of units, the evaluative and expert foci, belong to the orienta-
tional function. Here, speakers evaluate the scene by expressing their attitude
to the content, to the listener or even to the listener's attitude. The emphasis on
different elements and the way they are evaluated contributes indirectly to the
speaker's positioning. For example, the speaker can position herself/himself as
an expert.
Table 1. Evaluative foci and transcription examples

i. Speaker's reaction to the picture on the whole
Example: there are very many things in the picture

ii. Speaker's categorisation of the different picture elements
Example: eh . the birds are . of indefinite sort, they are fantasy birds, I would think

iii. Speaker's attitude to the presentation of the different picture elements
Example: Pettson's facial expression is fascinating, although he seems to be quite anonymous he reflects a sort of emotional expression I think

iv. Speaker's comments on the visual properties of a picture element
Example: the daffodils are enlarged, almost tree-like, they are the wrong size in relation to a real daffodil

v. Speaker's comments on the interaction of the characters in the picture
Example: Findus the cat helps Pettson as much as he can, they sow the seeds very carefully place them one after the other it appears
but also, at a higher level, the genre and painting technique of the picture. By
doing this, they show their expertise. In interpersonal or social terms, the
speakers indirectly position themselves towards the listener.
Example 6
--> 1015 the actual picture is painted in water colour,
--> 1016 so it has/ eh . the picture is not/
--> 1017 it is naturalistic in the motif (theme)
--> 1018 but it's getting close to the genre of com/ comic strips,
Expert foci include explanations and motivations where the informants show
their expertise in order to rationalise and motivate their statements or interpre-
tations. Such explanatory units are often delivered as embedded side sequences
(superfoci) and marked by a change in tempo or acceleration. The comment is
started at a faster speed, and the tempo then decreases with the transition back
to the exposition. The explicative comment can also be marked by a change in
voice quality, as in the following example, where it is formulated in a quiet
voice (cf. Chapter 3, Section 1.1.2). In Example 7, the expert knowledge of the
participant is based on a story from a children's book and the experience of
reading aloud to children.
Example 7
0112 And Findus,
0113 he has sown his meatball,
--> 0114 (lower voice) you can see it on a little picture,
--> 0115 this kind of stick with a label
--> 0116 he has sown his meatball under it
0117 (higher voice) and then there is Findus there all the time,
The last group of units, including the interactive, introspective and metatextual
units, constitutes the third, organisational function.
Table 2. Introspective foci & metacomments and transcription examples

i. Informants think aloud
Example: let's see, have I missed something? I'll think for a moment: Pettson is the old guy and Findus is the cat

ii. Informants comment on the memory process
Example: ehm, it's starting to ebb away now; I don't remember his name; eh actually, I don't know what the fellow in the middle was doing

iii. Informants make procedural comments
Example: let's see if I continue to look at the background; I have described the left hand side and the middle

iv. Informants express the picture content on a textual metalevel, referring to what they had already said or not said about the picture
Example: I don't know if I said how it was composed

v. Informants ask rhetorical questions and reveal steps of planning in their spoken presentation through an inner monologue
Example: what more is there to say about it?; what more happened?

vi. Informants think aloud in a dialogic form: one person systematically posed questions to himself, and then immediately answered them by repeating the question as a full sentence. The speaker's metatextual foci pointed either forward, toward his next focus, or backward
Example: eh what more can I say, I can say something about his clothes; eh well, what more can I say, one can possibly say something about the number of such Pettson characters

vii. Informants recapitulate what they have said or summarise the presented portions of the picture or the picture as a whole
Example: eh I mentioned the insects/ I mentioned the small animals; so that's what's happening here; that's what I can say; that was my reflection
evaluative and expert foci, representing the orientational (or attitudinal) func-
tion. In these types of foci informants express their opinions about various
aspects of the picture contents and comment on the picture genre from the
viewpoint of their own special knowledge. Finally, speakers used interactive,
introspective and metatextual foci, serving the organisational function. They
were thinking aloud and referring to their own spoken presentation on a
textual metalevel. The informants used each focus type either as a single verbal
focus or as a superfocus consisting either of multiple foci of the same
type (subst list) or a combination of different focus types (e.g. sum and list,
subst and loc, subst and eval etc.). Apart from reporting WHAT they saw,
the informants also focused on HOW the picture appeared to them. In other
words, the informants were involved in both categorising and interpreting
activities.
28 Discourse, Vision and Cognition
In the above section, a thorough content analysis of the picture descriptions was
presented. The question is, however, whether the developed taxonomy would
hold up in other settings. For instance, would the repertoire of focus types vary
depending on whether the informants described the picture in the presence or
in the absence of listeners? Would there be any influence on the taxonomy and
distribution of foci in a narrative task where the informant has to tell a story
about what happens in the picture? Would other types of foci constitute the
discourse when informants describe the picture either off-line (from memory)
or on-line (while viewing it)? Last but not least, what does
the structure and content of descriptions look like in a spontaneous conversa-
tion? Do the same types of foci occur and is their distribution the same?
In the following section, we will discuss the above-mentioned types of foci
and their distribution in different settings by comparing the presented tax-
onomy with data from different studies on picture description. My starting
point for this discussion will be the data from the off-line descriptions in an
interactive setting reported above and from three additional studies that I have
conducted later on. One of the studies concerns simultaneous verbal descrip-
tion: the informants describe the complex picture while, at the same time, their
visual behaviour is registered by an eye tracker (cf. Chapter 5, Section 2). The
next study concerns off-line description after a spatial task: the informants de-
scribe the same picture from memory while looking at a white board in front of
them. Previously, they conducted a spatially oriented task on mental imagery
(cf. Chapter 4, Section 3.1). Yet another study is concerned with simultaneous
(on-line) description with a narrative priming: the informants describe the
same picture while looking at it, and the task is to tell a story about what is
happening in the picture (cf. Chapter 4, Section 3.2). Apart from these four data
sets, I will also briefly comment on the structure and content of spontaneous
descriptions in conversation (Holsanova 2001:148ff.).
Before comparing the structure and content of picture descriptions from
different settings on the basis of focus types, let me start with some assump-
tions.
i. The repertoire of focus types is basically the same and can thus be gener-
alised across different settings.
ii. However, the proportion of focus types in descriptions from different set-
tings will vary.
Chapter 2. Structure and content of spoken picture descriptions 29
iii. In the off-line description, informants have inspected the picture before-
hand, have gained a certain distance from its contents and describe it from
memory. The off-line setting will thus promote a summarising and inter-
preting kind of description and contain a large proportion of sum, intro-
spect and meta. The metacomments will mainly concern recall. Finally,
the informants will find opportunity to judge and evaluate the picture, and
the proportion of eval will be high.
iv. In the on-line description, informants have the original picture in front
of them and describe it simultaneously. Particularities and details will be-
come important, which in turn will be reflected in a high proportion of
subst, list foci and superfoci. The real-time constraint on spoken produc-
tion is often associated with cognitive effort. This will lead to a rather high
proportion of subst foci with categorisation difficulties (subst cat.diff.).
The picture serves as an aid for memory and the describers will not need to
express uncertainty about the appearance, position or activity of the refer-
ents. Thus, the description formulated in this setting will not contain many
modifications. If there are going to be any metacomments (meta), they
will not concern recall but rather introspection and impressions. The fact
that the informants have the picture in front of them will promote a more
spatially oriented description with a high proportion of loc.
v. In the interactive setting, informants will position themselves as experts
(expert) and use more interactive foci (interact). The presence of a
listener will stimulate a thorough description of the picture with many
subst foci and bring up more ideas and aspects which will cause a longer
description (length). The informants will probably express their uncer-
tainty in front of their audience, which can affect the number of epistemic
and modifying expressions. Finally, the presence of a listener will contrib-
ute to a higher proportion of foci with orientational and organisational
functions.
vi. In the eye tracking condition, informants will probably feel more observed
and might interpret the task as a memory test. In order to show that they
remember a lot, they will group several picture elements together and use
a high proportion of sum. The fact that they are looking at a white board
in front of them will promote a more static, spatially focused description
with many local relations (loc). The description will be shorter than in the
on-line condition where the picture is available and during the interaction
condition where the listener is present (length).
Figure 1. The proportion of substantive foci (subst) and substantive foci with cat-
egorisation difficulties (subst cat.diff.) in four studies on picture description.
Figure 2. The proportion of subst list foci and summarising foci (sum) in four stud-
ies on picture description.
way, F = 6.428, p = .0012, Tukey HSD test). It can partly be explained by the
temporal delay between picture viewing and picture description: informants who
inspected the picture beforehand gained distance from the particularities in the
picture contents and were involved in summarising and interpreting activities.
The other explanation is the fact that the grouping of several picture elements
can be an effective way of covering all the important contents of the picture.
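The proportions behind these comparisons are simple relative frequencies: the number of foci of a given type divided by the total number of coded foci in a description. A minimal sketch, with an invented coding of one short description (the tags follow the taxonomy above; the data do not come from the reported studies):

```python
from collections import Counter

# Hypothetical focus-type coding of one short picture description.
# Tags follow the taxonomy above (subst, loc, sum, eval, subst list, meta).
foci = ["subst", "subst", "loc", "sum", "subst", "eval", "subst list",
        "meta", "subst", "sum", "loc", "subst"]

counts = Counter(foci)
total = len(foci)
proportions = {tag: round(n / total, 2) for tag, n in counts.items()}
print(proportions)
```

Tallies of this kind, computed per informant and per setting, are presumably what underlie the bar charts in the figures.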
Figure 4. The proportion of evaluative foci (eval), and expert foci (expert) in four
studies on picture description.
[Charts: the proportion of introspective and metatextual foci, and of interactive foci, in four studies on picture description; a further chart compares description length across the four studies.]
depended on the explanations of the instructor, on the questions and answers
of the drawer, and on their agreement about the size of the picture elements,
their proportions, spatial relations and so on.
3. Conclusion
The aim of this chapter has been twofold: (a) to introduce a taxonomy of foci
that constituted the structure and content of the off-line picture descriptions
and (b) to compare types of foci and their distribution with picture descrip-
tions produced in a number of different settings.
Concerning the developed taxonomy, seven different types of foci have
been identified serving three main discourse functions. The informants were
involved both in categorising and interpreting activities. Apart from describing
the picture content in terms of states, events and referents in an ideational or
presentational function, informants naturally integrated their attitudes, feel-
ings and evaluations, serving the interpersonal function. In addition, they also
related the described elements and used various resources to create coherence.
This can be associated with the organisational function of discourse. Substan-
tive, summarising and localising foci were typically used for presentation of
picture contents. Attitudinal meaning was expressed in evaluative and expert
foci. A group of interactive, introspective and metatextual foci served the
regulatory or organising function. Informants thought aloud, making comments
on memory processes, on steps of planning, on procedural aspects of
their spoken presentation etc. In sum, besides reporting about WHAT they saw
on the picture, the informants also focused on HOW the picture appeared to
them and why. Three additional sets of data were collected in order to find out
how general these focus types are: whether a narrative instruction or a spatial
priming in off-line and on-line condition (with or without eye tracking) would
influence the structure and the content of picture description and the distribu-
tion of various types of foci.
Concerning types of foci and their distribution in four different settings, we
can conclude that the repertoire of foci was basically the same across the differ-
ent settings, with some modifications. The expert foci, where speakers express
their judgements on the basis of their experience and knowledge, were only
found in the off-line condition. This suggests that the off-line condition gives
more freedom and opportunities for the speakers to show their expertise and,
indirectly, to position themselves. Further, the subgroup of substantive foci,
The previous chapter dealt with structure and content of picture descriptions,
in particular with the taxonomy of foci. In this chapter, we will take a closer
look at how speakers create coherence when connecting the subsequent steps
in their descriptions. First, we will discuss transitions between foci and the
verbal and non-verbal methods that informants use when creating coherent
descriptions in an interactive setting. Second, description coherence and con-
nections between foci will be illustrated by a spontaneous description of a vi-
sual environment accompanied by drawing.
The results in the first part of this chapter are based on the analysis of the off-
line picture descriptions with a listener (cf. Chapter 2, Section 1 and Holsanova
2001:38ff.). As was mentioned in connection with discourse segmentation in
Chapter 1, hesitation and short pauses often appear at the beginning of verbal
foci (eh . in this picture there are also a number of . minor . figures/ eh fantasy
– Pauses and hesitations are significantly longer when speakers move to another superfocus than when they move internally between foci within the same superfocus. Pauses between superfoci are on average 2.53 seconds long, whereas internal pauses between foci measure only 0.53 seconds on average (p = .0009, one-sided t-test). This is consistent with the results of psycholinguistic research on pauses and hesitations in speech (Goldman-Eisler 1968) showing that cognitive complexity increases at the choice points of ideational boundaries and is associated with a decrease in speech fluency. This phenomenon also applies to the process of writing (cf. Strömqvist 1996; Strömqvist et al. 2004), where the longest pauses appear at the borders of large units of discourse.
– Pauses and hesitations get longer when speakers change position in the picture (i.e. from the left to the middle) compared to when they describe one and the same picture area. This result is consistent with the studies of subjects' use of mental imagery demonstrating that the time to scan a visual image increases linearly with the length of the scanned path (Kosslyn 1978; Finke 1989). Kosslyn looked at links between the distance in the pictorial representation and the distance reflected in mental scanning. Mental imagery and visualisations associated with picture descriptions will be discussed further in Chapter 8, Section 2.
– Pauses get longer when the speaker moves from a presentation of picture elements on a concrete level (three birds) to a digression at a higher level of abstraction (let me mention the composition) and then back again.
– The longest pauses in my data appear towards the end of the description, at the transition to personal interaction, when the description is about to be concluded.
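The superfocus/focus pause comparison above can be replicated on any set of measured pause durations with a pooled two-sample t statistic. A minimal sketch with invented pause values, chosen only to mirror the reported means of 2.53 s and 0.53 s; this is not the author's data or analysis script:

```python
import math
from statistics import mean, variance

def one_sided_t(a, b):
    """Student's two-sample t statistic (pooled variance), H1: mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical pause durations in seconds (means match the reported 2.53 / 0.53)
superfocus_pauses = [2.1, 3.0, 2.8, 2.4, 2.6, 2.3]   # at superfocus boundaries
internal_pauses   = [0.4, 0.6, 0.5, 0.7, 0.5, 0.5]   # within a superfocus

t = one_sided_t(superfocus_pauses, internal_pauses)
print(round(t, 2))   # a large positive t supports longer boundary pauses
```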
Let us have a closer look at some explanations. I propose that there are three as-
pects contributing to the mental distance between foci. The first one is the the-
matic distance from the surrounding linguistic context in which the movement
Chapter 3. Description coherence & connection between foci 41
takes place. As mentioned earlier, the picture contents (the referents, states and
events) are mainly described in substantive and summarising foci and super-
foci. Thus, this aspect is connected to the presentational function of discourse.
There are two possibilities: the transition between foci can either take place
within one superfocus or between two superfoci. For example, speakers can de-
scribe within one and the same cluster of verbal foci all the articles of clothing
worn by Pettson or all the activities of the birds in the tree. Alternatively, while
introducing new referents, the speakers can jump from one area to another in
their description (from the tree in the middle to the cows in the background).
Since the superfoci are characterised by a common thematic aspect (cf. Chap-
ter 1, Section 2.1), it is easier to track referential and semantic relations within
a superfocus than to jump to another topic. Thus, when the thematic aspect
changes and the speaker starts on a new superfocus, a larger mental distance
must be bridged (cf. Kosslyn 1978).
The second aspect is the thematic distance from the surrounding pictorial
context. As was concluded earlier, describers do not only focus on a pure de-
scription of states, events and referents but also add evaluative comments on
the basis of their associations, knowledge and expertise. Because of that, we
have to count on a varying degree of proximity to the picture elements or, in
other words, with a varying extent of freedom of interpretation. This aspect
is associated with the change from a presentational to an orientational (or at-
titudinal) function. A change of focus from a concrete picture element to the
general characteristics of the painting technique or to an expertise on children's
book illustrations means not only a change of topic, but also a shift between the
concrete picture world and the world outside of the picture.
Example 1
0737 and you can also see on a new twig small animals
0738 that then run off probably with a daffodil bulb
0739 on a . wheelbarrow
0740 in the bottom left corner
--> 0741 eh I don't know if I said how it was composed,
0742 that is it had a background which consisted of two halves in the
foreground,
0743 (quickly) one on the left and one on the right,
--> 0744 (slowly) ehh . and . eh oh yes, Findus . hes watering
In Example 1, we can follow the informant moving from a description at a
concrete level, when speaking about small animals (0737–0740) towards a
Apart from pauses and hesitations at the borders of speech units, we can find
various other bridging cues, such as discourse markers, changes of loudness
and voice quality, as well as stressed localisations.
In the data from the off-line picture descriptions with a listener, we can find
many discourse markers that fulfil various functions: they reflect the planning
process of the speaker, help the speaker guide the listener's attention and signal
relations between the different portions of discourse. Grosz & Sidner (1986)
use the summary term cue phrases, i.e. explicit markers and lexical phrases
that together with intonation give a hint to the listener that the discourse struc-
ture has changed.
Example 2
0513 its bright green colours . and bright sky
--> 0514 then we have the cat
0515 that helps with the watering
0516 and sits waiting for the . seeds to sprout
0517 and he also chases two spiders
--> 0518 and then we have some little character in/down in the ... left corner,
0519 that take away an onion,
0521 I don't know what (hh) sort of character
--> 0522 then we have some funny birds . in the tree
Figure 1. A paratactic transition closes the referential frame of the earlier segment
and opens a new segment.
A transition that closes the current referential frame and opens a new segment
is called a paratactic transition. According to Redeker (2006), a paratactic se-
quential relation is a transition between segments that follow each other on the
same level, i.e. a preplanned list of topics or actions. The attentional markers
introducing such a transition have consequences for the linear segments with
respect to their referential availability. The semantic function of the paratac-
tic transitions is to close the current segment and its discourse referents and
thereby activate a new focus space.
Example 3
0112 and Findus,
0113 he has sown his meatball,
--> 0114 (lower voice) you can see it on a little picture,
--> 0115 this kind of peg/stick with a label
--> 0116 he has sown his meatball under it
0117 (higher voice) and then there is Findus there all the time,
In Example 3, the speaker is making a digression from the main track of de-
scription (substantive foci 01120113) to a comment, pronounced in a lower
voice (01140116). She then returns to the main track by changing the volume
of her voice and by using the marker and then (0117). This kind of transition
can be schematised as follows (Figure 2).
A transition that hides the referential frame from the previous segment in
order to embed another segment, and later returns to the previous segment is
called a hypotactic transition. According to Redeker (2006), hypotactic sequen-
tial relations are those leading into or out of a commentary, correction, para-
phrase, digression, or interruption segment. Again, the attentional markers
introducing such a transition have consequences for the embedded segments
with respect to their referential availability. The hypotactic transitions signal
an embedded segment, which keeps the earlier referents available at an earlier
level. Such an embedded segment usually starts with a so-called push marker
or next-segment marker (Redeker 2006) (such as that is, I mean, I guess, by the
way) and finishes with a so-called pop marker or end-of-segment marker (but
Figure 2. A hypotactic transition hides the referential frame from the previous seg-
ment, while embedding another segment, and later returning to the previous segment.
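Redeker's push and pop markers map naturally onto a stack of focus spaces, in the spirit of Grosz & Sidner's attentional state. The following toy model is my own illustration, not an implementation from the source; the class and method names are invented:

```python
# Toy model of referential availability under paratactic vs. hypotactic
# transitions: a stack of focus spaces, each holding its discourse referents.
class FocusStack:
    def __init__(self):
        self.stack = [set()]

    def mention(self, referent):
        self.stack[-1].add(referent)

    def paratactic(self):
        """Close the current segment and open a sibling: old referents are gone."""
        self.stack.pop()
        self.stack.append(set())

    def push(self):
        """Hypotactic 'push marker': embed a segment; parent referents stay open."""
        self.stack.append(set())

    def pop(self):
        """'Pop marker': leave the embedded segment, return to the parent."""
        self.stack.pop()

    def available(self):
        return set().union(*self.stack)

fs = FocusStack()
fs.mention("Findus")
fs.push()                        # e.g. "(lower voice) you can see it on a little picture"
fs.mention("stick with a label")
print(fs.available())            # Findus still available inside the digression
fs.pop()                         # "(higher voice) and then ..."
fs.paratactic()                  # e.g. "then we have the cat"
print(fs.available())            # earlier referents are closed off
```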
In a recent study, Bangerter & Clark (2003) examine the coordination of joint activities in dialogue and distinguish between vertical and horizontal transitions. Participants signal these transitions with the help of various project markers, such as uh-huh, m-hm, yeah, okay, or all right.
634(B) .. mhm
635(A) 1.22 then we have .. eh 1.06 where <DRAWS> subst list 1c
the faucet comes out
636(A) 1.97 and then we have .. all the other <DRAWS> subst list 1d
.. furnishings here,
637(A) 1.53 we have= usually a=eh <DRAWS> subst list 2
638(A) 4.00 vanity they call it there, naming
639(A) 1.04 where . the washbasin is built design
in
640(B) mhm
641(A) but it's a 1.09 a .. piece, a part subst
of the washbasin, detail, design
642(B) 0.62 mhm
643(A) and it sits .. on the counter itself, <SHOWS> loc
644(A) which means that if the water arg
overflows
645(A) then you have to try to .. force it over cause (if-then)
the rim, back again,
646(B) <IRONIC> 1.00 which is very eval (attitude)
natural .. for us
647(A) 1.09 so the washbasin is actually loc
resting upon it,
648(A) so if you have a cross-section sum
here,
649(B) m[hm]
650(A) [here we] have the very counter <DRAWS> subst list
651(C) mhm
652(A) then we have the washbasin,
653(A) it goes up here like this <DRAWS> subst
654(A) and down, <DRAWS>
655(A) 1.67 and up, <DRAWS>
656(A) 1.24 of course when the water <SHOWS> arg 1
already has come over here conseq. (if-then)
657(A) then it won't go back again,
658(B) m[hm]
659(A) [y]ou have to force it over,
660(A) .. and these here <SHOWS> arg 2
661(A) 1.04 eh=if the caulking under here is conseq. (if-then)
not entirely new and perfect
Figure 3. Figure 4.
calls it the room. The bathtub is introduced in 630 and refocused in 671. How
do the interlocutors handle referential availability? How do they know where
they are and what the speaker is referring to? How can speakers and listeners
retain nominal and pronominal reference for such a long time?
This is a continuation of my general question posed in Chapter 1, whether
our attentional resources enable us only to focus on one activated idea at a time
or if we, at the same time, can keep track of the larger units of discourse. Linde
(1979) and Grosz & Sidner (1986) suggest that the interlocutors can do so by
simultaneously focusing on a higher and a lower level of abstraction.
The use of the same item for accomplishing these two types of reference
suggests that, in discourse, attention is actually focused on at least two levels
simultaneously – the particular node of the discourse under construction and,
also, the discourse as a whole. Thus, if the focus of attention indicates where
we are, we are actually at two places at once. In fact, it is likely that the number
is considerably greater than two, particularly in more complicated discourse
types. (Linde 1979:351)
others' gaze behaviour, mimics, pointing gestures etc. We must not forget that
the speaker's gaze (at the drawing, at his own gestures, at the artefacts in the
environment) can affect the attention of the listeners. Furthermore, the draw-
ing that is created step by step in the course of the verbal description is visible
to all interlocutors. The speaker, as well as the others in the conversation, can
interact with the drawing; they can point to it and refer to things and relations
non-verbally. Thus, the drawing represents a useful tool for answering where
are we now? and functions as a storage of referents or as an external memory
aid for the interlocutors. This also means that the interlocutors do not have to
keep all the referents in their minds, nor always mention them explicitly. Apart
from that, the drawing has been used as a support for visualisation and as an
expressive way of underlining what is being said. Finally, it serves as a rep-
resentation of a whole construction problem discussed in the conversation. I
suggest that the common/joint focus of attention is created partly via language,
partly by the non-verbal actions in the visually shared environment. For these
reasons, the discourse coherence becomes a situated and distributed activity
(cf. Gernsbacher & Givón 1995; Gedenryd 1998:201f.). This may be the reason
why we use drawings to help our listeners understand complex ideas.
is very natural for us) before the speaker refocuses the localisation of the wash-
basin (647). In other words, what we follow are the hypotactic (embedded) and
the paratactic (linear) relations between the verbal foci.
This can be applied to our spontaneous description. Although the semantic as-
pect seems to be salient, not all verbal foci are part of the semantic hierarchy on
the content level. Apart from descriptions of bathroom objects (door, bathtub,
faucet, other furnishings), their properties and their placement – even
argumentative, interactive and evaluative foci are woven into the description.
The schematic figure illustrates the complexity in the thicket of spoken de-
scriptions. Despite this brushwood structure and many small jumps between
different levels, the speaker manages to guide the listeners' attention, lead them
through the presentation and signal focus changes so that the description
appears fluent and coherent.
Which means are used to signal focus changes and to create coherence in the
data from spontaneous descriptions? The answer is that verbal, prosodic/acous-
tic and non-verbal means are all used. The most common verbal means are dis-
course markers (anyway, but anyway, so, and so, to start with) that are used for
reconnecting between superfoci (for an overview see Holsanova 1997a:24f.).
A lack of explicit lexical markers can be compensated by a clear (contrastive)
intonation. Talking in a louder voice usually means a new focus, while talking
in a lower voice indicates an embedded side comment. A stressed rhythmic
focus combined with a synchronous drawing (updownup) makes the listener
attend to the drawn objects. Prosody and the acoustic quality thus give us im-
portant clues about the interpretation of the embedding and coherence of an
utterance.
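Identifying such reconnecting discourse markers in a transcript amounts to a longest-prefix match against a marker list. A toy sketch with invented transcript lines, mixing the markers listed above with the "then we have" pattern from Example 2 (this is not the author's coding procedure):

```python
import re

# Abbreviated, illustrative marker list (see the overview cited above).
MARKERS = ("but anyway", "anyway", "and so", "so", "to start with", "then we have")

lines = [
    "then we have the cat",
    "that helps with the watering",
    "but anyway the tree is full of birds",
]

def leading_marker(line):
    """Return the longest marker at the start of a focus, or None."""
    for m in sorted(MARKERS, key=len, reverse=True):   # longest match first
        if re.match(rf"{re.escape(m)}\b", line):
            return m
    return None

print([leading_marker(l) for l in lines])
```

Matching the longest marker first keeps "but anyway" from being misread as plain "anyway"; the word boundary `\b` keeps "so" from matching inside words like "something".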
Last but not least, interlocutors often use deictic expressions (like this, and
this here) in combination with non-verbal actions (drawing, pointing, gestur-
ing). Deictic means help move the listener's attention to the new objects in the
listener's immediate perceptual space (cf. attention movers in Holmqvist &
Holsanova 1997). Demonstrative pronouns are sometimes used to draw atten-
tion to the corresponding (subsequent) gesture. Attention is directed in two
ways: the speakers are structuring their speech and also guiding the listeners'
attention.
Deictic gestures in spatial descriptions can co-occur with referential expressions and
anchor the referent in space on an abstract level. Alternatively, they work on a concrete
level and make a clear reference to the actual object by mapping out the spatial relationship
explicitly (cf. Gullberg 1999).
3. Conclusion
In this chapter, we have focused on how speakers connect the subsequent steps
in their description and thereby create discourse coherence. We discussed dif-
ferent degrees of mental distance between steps of description and various
means that informants use to connect these descriptive units of talk. To summa-
rise, discontinuities, such as pauses and hesitations, appear within and between
foci, but are largest at the transition between foci and superfoci. Both speakers
and listeners must reorient themselves at transitions. I proposed that the mental
leaps between foci are influenced by three factors: (a) the thematic distance to
the surrounding linguistic context, (b) the thematic distance to the surrounding
pictorial context, and (c) the performative distance to the descriptive discourse.
In other words, the discontinuity is dependent on which (internal or external)
worlds the speaker moves between. If the speaker stays within one thematic
superfocus, the effort at transitions will not be very big. Conversely, the lower
the degree of thematic closeness, the larger the mental leaps. If the speaker
moves between a description of a concrete picture element and a comment
on the painting technique, i.e. between a presentational and an orientational
function, larger reorientations will be required. The biggest hesitations and the
longest pauses are found at the transitions where the speaker steps out of the
description and turns to the metatextual and interactional aspects of the com-
municative situation (fulfilling the organisational discourse function).
I also concluded that transitions between the sequential steps in the picture
descriptions are marked by lexical and prosodic means. Focus transitions are
often initiated by pauses, hesitations and discourse markers. Discourse mark-
ers are more or less conscious signals that reveal the structuring of the speech
and introduce smaller and larger steps in the description. Speakers may re-
focus on something they have already said or point forward or backward in
the discourse. Such markers have consequences for the linear (paratactic) and
the embedded (hypotactic) segments with respect to referential availability.
Moreover, speakers use prosodic means such as changes in loudness or voice
quality to create transitions. Finally, in the process of focusing and refocusing,
speakers use stressed localising expressions to bridge the foci.
In a spontaneous description formulated outside the laboratory, the hier-
archical structure contains many small jumps between different levels. Verbal,
prosodic/acoustic and non-verbal means are all used to signal focus changes
and to create coherence. Although the semantic aspect seems to be salient,
apart from descriptions of objects, their properties and their placement, many
argumentative and evaluative foci are woven into the description. Despite the
complexity, the speaker manages to guide the listeners attention and to lead
them through the presentation. The interlocutors seem to retain nominal and
pronominal references for quite a long time. There are several explanations for
this phenomenon: a) simultaneously focusing on both a higher and a lower
level of abstraction, b) switching between active and semiactive information
and c) using situation awareness and mutual visual access (e.g. by observing
each others pointing, gazing and drawing).
In the spontaneous description, the linguistic and cognitive structuring of
the description is situationally anchored. The interlocutors make use of both
verbal means and non-verbal means of focusing and refocusing. Thus, joint fo-
cus of attention is created partly through language, partly through non-verbal
actions in the visually shared environment. The drawing then takes on the
function of referent storage, an external memory aid for the interlocutors. The
interlocutors look for and provide feedback, and a continuous mutual adaptation
takes place throughout the description (Strömqvist 1998).
In the current chapter, we took a closer look at how speakers create co-
herence when connecting the subsequent steps in their descriptions. The next
chapter will be devoted to different description styles.
chapter 4
Variations in picture description
At this point, I would like to remind the reader of our discussion regarding
the distribution of different types of foci in Chapter 2. Already there, it was
assumed that a description rich in localisations would have a tendency to be
more static, whereas a description rich in narrative aspects would tend to be
more dynamic. In the results of the analysis of the picture descriptions from
Example 3
0859 and then you can see the same man as on the left of the picture
0860 in three different . positions
0861 on the very right
0862 . ehh, there he stands and turns his face to the left,
0863 and digs with the same spade,
0864 and I think that he has his left arm down on the spade,
0865 his right hand further down on the spade,
0866 eh and then he has his left foot on the spade,
to the right of this figure
0868 the same man is standing
0869 turning his face to the right
The informants using the static style describe the objects in great detail. They
give a precise number of picture elements, state their colour, geometric form
and position. They deliver a detailed specification of the objects and, when
enumerating the picture elements, they mainly use nouns:
Example 4
0616 eh the farmers were dressed in the same way
0617 they had . dark boots
0618 . pants
0619 light shirt
0620 such a armless . armless . jacket, or what should I say
0621 hat light hat
0622 beard
0623 glasses
0624 eh a pronounced nose,
expressions also indirectly mediate the speaker's way of moving about in the
discourse. In the static description style, focusing and refocusing on picture
elements is done by localising expressions in combination with stress, loudness
and voice quality (cf. Chapter 3, Section 1.1.2). Few or no discourse markers
were used in this function.
The different phases are introduced by using temporal verbs (starts), temporal
adverbs (then, and then, later on), and temporal subordinate clauses (when he's
ready). The successive phases can also be introduced using temporal
prepositions (from the moment when he's digging to the moment when he's
raking and sowing).
Some speakers are particularly aware of the temporal order and correct
themselves when they describe a concluded phase in the past tense:
Example 8
0711 eh on the one half of it
0712 one can see . eh Pettson
--> 0713 when he's . digging in the field
--> 0712 and/ or he's done the digging of the field,
0715 when he's sitting and looking at the soil,
Informants have also noticed differences in time between various parts of the
picture. The informant in the next example has analysed and compared both
sides of the picture and presented evidence for the difference in time:
Example 9
--> 1151 . in a way it feels that the left hand side is . the spring side
--> 1152 because . on the right hand side there is some . raspberry thicket .
or something,
--> 1153 that seems to have come much longer
--> 1154 than the daffodils on the left hand side,
The dynamic quality is achieved not only by the use of temporal verbs (starts
with, ends with) and temporal adverbs (first, then, later on), but also by a
frequent use of motion verbs in the active voice (digs, sows, waters, rakes, sings,
flies, whips, runs away, hunts).
Also the frequent use of so-called pseudo-coordinations (constructions
like fågeln sitter och ruvar på ägg; the bird is sitting and brooding on the eggs)
contributes to the rhythm and dynamic character of the description (for details,
cf. Holsanova 1999a:56, 2001:56f.).
Example 10
--> 0404 then . he continues to dig
0405 and then he rakes
0406 with some help from Findus the cat
--> 0407 and . then he lies on his knees and . sows . seeds
0408 and then Findus the cat helps him with some sort of ingenious
watering device
--> 0409 LAUGHS and then Findus the cat is lying and resting,
0410 and he jumps around in the grass
0411 xxx among ants,
0412 and then a little character in the lower left corner
0413 with an onion and a wheelbarrow
0414 a small character
0415 and then in the tree there are bird activities
0416 one is tidying up the nesting box
--> 0417 one is standing and singing
--> 0418 and one is lying and brooding on the eggs,
0419 then you can see some flowers
0420 whether it's . yellow anemone or yellow star-of-Bethlehem or
something
--> 0421 and . cows go grazing in the pasture,
In the dynamic description style, speakers do not give spatial perception
the same weight as temporal perception. Thus, we do not find precise
localisations, and spatial expressions are rare. The few localising expressions
that the informants use are rather vague (in the air, around in the picture,
at strategically favourable places, in the corners, in a distance).
Another distinguishing feature is that discourse markers are used to focus
and refocus the picture elements, and to bridge and connect them (cf. Chapter 3,
Section 1.1). Last but not least, the difference in description style was also
reflected on the content level, in the number of perceived (and reported)
characters. The informants with the prototypical dynamic style perceived
one Pettson (and one cat) figure at several moments in time, whereas the static
describers often perceived multiple figures in different positions.
How are these results related to the focus types presented in Chapter 2? One
could assume that localising foci and substantive foci would be characteristic
of the static style, whereas evaluative foci, introspective foci and a substantive
listing of items (mentioning different activities) would be typical of a dynamic
style. A t-test showed, however, that localising foci (loc) were not the most
important predictors of the static style. They were frequently present, but
instead expert foci and meta foci reached significance as typical of the
static style (p = 0.03). Concerning the dynamic style, evaluative foci were quite
dominant but not significant. list of items was typical and close to significant
(p = 0.06). Both the temporal aspect and the dynamic verbs, which turned out
to be typical of a dynamic description style, were part of the substantive and
Table 2. The distribution of the most important linguistic variables in the twelve
descriptions (off-line interact). Columns: Subject No., # foci, style, # and % of
there is, # and % of spatial expressions, # and % of temporal expressions, # and
% of dynamic verbs. [data rows not recoverable]
summarising foci. Table 1 summarises the most important features in the two
description styles.
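The t-test comparison above can be sketched in a few lines of code. The following is only a minimal illustration, not the study's actual computation: the per-informant focus counts below are invented for demonstration, and Welch's unpooled-variance variant of the test is assumed.

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical counts of expert/meta foci per description, grouped by
# predominant style (invented numbers, NOT the study's data).
static_group = [6, 7, 5, 8]       # e.g. the four static describers
dynamic_group = [1, 2, 0, 2, 1]   # e.g. the five dynamic describers

t, df = welch_t(static_group, dynamic_group)
print(round(t, 2), round(df, 1))
```

The t statistic is then compared against the t distribution with the computed degrees of freedom to obtain a p-value, as in the chapter's reported tests.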
When it comes to the frequency and distribution of these two styles, the
dynamic description style predominated in five informants (P1, P4, P5, P7
and P11), whereas the static description style was dominant in four
informants (P3, P6, P8 and P9). The descriptions of the three remaining
informants (P2, P10, P12) showed a mixture of a dynamic and a static style.
They usually began in the dynamic style but, after a while, they turned to lists
and enumerations, focusing on spatial aspects, interpretative comments and
comparisons. Thus, the style at the end of their descriptions was closer to the
static description style.
In order to substantiate the characteristics of the two description styles,
I have quantified some of the most important linguistic aspects mentioned
above. Table 2 shows the distribution of the linguistic variables in the twelve
descriptions from memory. The informants are grouped according to the pre-
dominating description style.
The overview in Table 2 encourages a factor analysis, a way to summarise
data and explain variation in data (see the two screen dumps in Figure 1). It
might be interesting to find out which variables are grouped together. Since my
data were not very extensive, the factor analysis has only an exploratory
character. Let me only briefly mention the results. Data from eleven informants
were used as a basis for the factor analysis. One informant was excluded from
the analysis because of her age. Ten variables have been analysed: the length of
the whole description, the length of the free description, number of foci,
number of temporal expressions, number of spatial expressions, refocusing
with discourse markers.
What are the possible explanations for these two styles found in the picture
descriptions? One observation concerns gender differences. If we compare the
results in Table 2 with the informants' characteristics, we can see a preliminary
tendency for women to have a more dynamic and for men to have a more static
description style. However, more studies are needed to confirm this pattern.
Another possibility would be the difference between visual thinkers and verbal
thinkers (Holsanova 1997b), which will be discussed in the next section.
Yet another possibility is that these description styles are picture-specific,
in contrast to descriptions based on other sources of perception. It could be
that some of the results are due to the fact that the informants either verbalise
the picture as a representation or focus on the content of the represented scene.
The question whether these results are also valid for scene description and
event description in general has to be tested empirically. Another possibility is
that this particular picture, with its repetitive figures, may have affected the
way it was described.
Furthermore, an additional source of explanation is the non-linguistic and
contextual aspects that might to some extent have influenced what people
focus on during picture viewing, what they remember afterwards and what they
describe verbally (and how). For instance, previous knowledge of the picture,
of the genre, of the book, of the characters or of the story may be the most criti-
cal factors for the distinction of the two description styles, since not knowing
the genre and/or characters may lead to a less dynamic description. One could
expect that if the informants have read the book (to themselves or to their
children), the picture will remind them of the story and the style will become
dynamic and narrative. If the informants know the Pettson and Findus
characters from Sven Nordqvist's books, films, TV programmes, computer
games, calendars etc., they can easily identify the activities that these characters are
usually involved in, and the description will become dynamic. Finally, if the
informants know the particular story from the children's book, they can switch
to a story-telling mode and deliver a dynamic description with narrative
elements. On the other hand, some informants may follow the instruction
more strictly and, on the basis of their discourse genre knowledge, formulate a
rather static picture description.
Here is some evidence from the data. In fact, ten informants mentioned the
characters Pettson and Findus by name (P1, P2, P4, P5, P7, P8, P9, P10, P11,
P12). Two of the informants even seem to know the story, since they included
details about the meatballs that Findus has planted. Three informants
recognised and explicitly mentioned the source where the picture comes from
or the illustrated children's book genre (P3, P9, P10). The only informant who
does not mention either the genre or the characters is P6. If this is because he
does not know them, this could have influenced his way of describing the scene.
Table 3 summarises various cognitive, experiential and contextual factors
that might have played a role.
So if the complex picture primed the knowledge of the story book, and
if this knowledge of the story book was the critical factor that influenced the
Table 3. Overview of cognitive, experiential and contextual factors that might
have influenced the descriptions

Cognitive, experiential and contextual factors / Example
Scene schema knowledge / semantic characteristics of the scene: rural
landscape, gardening in the spring
Pictorial genre knowledge / children's book illustration
Knowledge of the characters / old guy Pettson and his cat Findus
Particular knowledge of the book / -
Particular knowledge of the story / -
Particular knowledge of the picture / -
Discourse genre knowledge / spoken picture description
Informant's way of remembering things / -
Informant's background and interests / fauna and flora
Informant's expertise / on painting techniques, farming, gardening etc.
Informant's associations / activities that Pettson and Findus usually are
involved in; spring, harmony
Informant's linguistic and cultural background / language-specific ways of
classifying and structuring scenes and events
The interactional setting / description to a specific listener
dynamic (narrative) style of the picture description, then we can assume the
following:
- a narrative priming will increase the dynamic elements in the picture
descriptions;
- a spatial priming will, on the contrary, increase the static elements in the
picture description.
We will test these hypotheses and describe the effects of spatial and narrative
priming later on in this chapter. Before that, let us turn to the second section,
which will be devoted to the discussion of individual differences in general and
verbal and visual thinkers in particular.
Quite often, the distinction between verbal and visual thinkers is made in psy-
chology, pedagogy, linguistics and cognitive sciences. The question is whether
we can draw parallels between the dynamic and the static picture description
style on the one hand and the verbal and visual thinkers on the other. In the
following, I will discuss the two extracted styles from different theoretical and
empirical perspectives: from studies on individual differences, experiments on
remembering and theories about information retrieval and storage.
The results concerning the two description styles are supported by Grow's
(1996) study on text-writing problems. Grow has analysed the written essays
of students and divided them into verbal and visual thinkers on the basis of
how they express themselves in written language. He points out some general
problems that visual thinkers have when expressing themselves in written lan-
guage. According to Grow, visual thinkers have trouble organising expository
prose because their preferred way of thinking is fundamentally different from
that of verbal thinkers. Visual thinkers do not focus on the words but rather
think in pictures and in non-verbal dimensions such as lines, colours, texture,
balance and proportion. They therefore have trouble expressing themselves in
writing, i.e. breaking down ideas that turn up simultaneously into a linear or-
der of smaller units, as required by language. They also have trouble presenting
clear connections between these units. The fact that visual thinkers let several
elements pop up at the same time, without marking the relation between them,
means that it is up to the listener to draw conclusions, interpret and connect
these elements. In contrast, verbal thinkers analyse, compare, relate and
evaluate things all the time. Visual thinkers often list things without taking a
position on the issues, and do not order them or present them as events. The
description becomes a static one. Furthermore, the ability of visual thinkers to
dramatise and build up a climax is weak. They do not build up dynamics and
do not frame the description in a context. Verbal thinkers linearise and
dramatise more easily.
The features of the static picture description style noted in my study closely
resemble the general features that, according to Grow, visual thinkers exhibit
when producing written texts. On the one hand, we have a dynamic,
rhythmic and therefore very lively style, where relations between ideas are
explicitly signalled using discourse markers, close to Grow's verbal thinkers. On
the other hand, the static character of the picture description style, with its
perceptual dominance of spatial relations, where the picture is divided into
fields and many details are mentioned but no explicit connection is established
between them, resembles the visual thinker. Despite the difference in medium
(written vs. spoken language), it is therefore easy to draw parallels between the
dynamic and the static picture description style on the one hand and the verbal
and visual thinkers on the other.
Concerning information retrieval, Paivio (1971a, b, 1986) suggests that hu-
mans use two distinct codes in order to store and retrieve information. In his
dual code theory, Paivio (1986:53f.) assumes that cognition is served by two
modality-specific symbolic systems that are structurally and functionally dis-
tinct: the imagery system (specialised for representation and processing of non-
verbal information) and the verbal system (specialised for language). Currently,
the cognitive styles of the visualisers and the verbalisers have been characterised
as individual preferences for attending to and processing visual versus verbal
information (Jonassen & Grabowski 1993:191; cited in Kozhevnikov et al.
2002). While visualisers rely primarily on imagery processes, verbalisers prefer
to process information by verbal-logical means. According to the current re-
search (Baddeley 1992; Baddeley & Lieberman 1980), working memory consists
of a central executive (controlling attention) and two specialised subsystems: a
phonological loop (responsible for processing verbal information) and a visuo-
spatial sketchpad (responsible for processing visuospatial information).
Since the stimulus picture in the study has been described off-line, let me
also mention classic works on memory. Bartlett (1932:110f.) distinguishes be-
tween visualiser and vocaliser when reporting the results of his experiments on
remembering. According to his observations, visualisers primarily memorise
individual objects, group objects based on likeness of form and, sometimes,
even use secondary associations to describe the remembered objects. Vocalis-
ers, on the other hand, prefer verbal-analytic strategies: their descriptions are
influenced by naming, they use economic classifications for groups of objects
and rely much more on analogies and secondary associations (it reminds me
of so and so). They also frequently describe relations between objects. When
speaking about verbal-analytic strategies, Bartlett mentions the possibility of
distinguishing between verbalisers and vocalisers, but does not give any ex-
amples from his data.
Nevertheless, there is still the possibility of a mixed type of description, us-
ing both the verbal and the visuospatial type of code. Krutetskii (1976; cited in
Kozhevnikov et al. 2002:50) studied strategies in mathematical problem solv-
ing and distinguished between three types of individual strategies on the basis
of performance: the analytic type (using the verbal-logical modes), the geo-
metric type (using imagery) and the harmonic type (using both codes).
Individual differences are in focus also in Kozhevnikov et al. (2002), who
revise the visualiser-verbaliser dimension and suggest a more fine-grained
distinction between spatial and iconic visualisers. In a problem-solving task,
they collected evidence for these two types of visualisers. While the spatial
visualisers in a schematic interpretation focus on the location of objects
and on spatial relations between objects, the iconic visualisers in a pictorial
interpretation focus on high vividness and visual details like shape, size,
colour and brightness. This finding is consistent with neurophysiological
evidence for two functionally and anatomically independent pathways: one
concerned with object vision and the other with spatial vision (Ungerleider &
Mishkin 1982; Mishkin et al. 1983). Moreover, the recent research on mental
imagery suggests that imagery ability consists of distinct visual and spatial
components (Kosslyn 1995).
If we want to apply the above-mentioned distinction between spatial and
iconic visualisers to my data, we could conclude the following: apart from
associating the static description style with the visualisers and the dynamic
description style with the verbalisers, one could possibly find evidence for even
more fine-grained preferences: verbal, spatial and iconic description styles. For
instance, when we look at the distribution of different types of foci (cf. Chapter
2, Section 1) and the categorising devices on the linguistic surface, the domi-
nance of localising foci (loc) could be taken as one characteristic of the spatial
visual style, whereas the dominance of substantive (subst) and evaluative foci
(eval) describing the colour, size and shape of individual objects could serve
as an indicator of the iconic visual style. Finally, the dominance of associa-
tions (the dragonfly reminds me of an aeroplane, the three birds in the tree are
like an average Swedish Svensson-family), mental groupings and interpretations
(it starts with the spring on the left) and economic summarising classifications
(this is a typical Swedish landscape with people sowing) would be the typical
features of a verbal description style.
In this last section, we will compare the results from the interactive setting
with these two sets of data on picture descriptions in order to find out whether
the style of picture descriptions can be affected by spatial and narrative prim-
ing. The first set of data consists of 12 off-line picture descriptions with spatial
priming. The second data set consists of 15 on-line picture descriptions with
narrative priming.
Hypothesis 1 is that the spatial priming will promote a more static style
of picture description. Spatial priming will be reflected in a large propor-
tion of existential constructions, such as there is, a high number of spatial
expressions and other characteristics of the static description style. Spatial
priming will also lower the number of dynamic verbs and temporal expres-
sions in the descriptions.
Hypothesis 2 is that narrative priming will promote a more dynamic style
of picture description. Narrative priming will result in a large proportion
of temporal expressions, a high number of dynamic verbs and other char-
acteristics of the dynamic description style. Narrative priming will also
lower the number of spatial expressions and existential constructions.
The first set of data consists of picture descriptions with (indirect) spatial
priming. Twelve informants (six men and six women), matched by age, viewed
the same picture for 30 seconds. Afterwards, the picture was covered by a white
board and the informants were asked to describe it off-line, from memory. Their
eye movements were measured both during picture viewing and during picture
description. All picture descriptions were self-paced, lasting on average 1 min-
ute and 55 seconds. The instruction was as follows: I will show you a picture, you
can look at it for a while and then I will ask you to describe it verbally. You are free
to describe it as you like and you can go on describing until you feel done.
The spatial priming consisted of the following step: before this description
task, all informants listened to a pre-recorded static scene description that was
systematically structured according to the scene composition and contained a
large proportion of there is-constructions and numerous spatial descriptions
(There is a large green spruce at the centre of the picture. There is a bird sitting
at the top of the spruce. To the left of the spruce, and at the far left in the picture,
there is a yellow house with a black tin roof and white corners. ...).
This description lasted for about 2 minutes and created an indirect priming for
the current task.
Would such a suggested way of describing a picture have effects on the
way the informants subsequently described the stimulus picture? In particular,
would such an indirect spatial priming influence the occurrence of the static
description style? And if so, were all informants primed to the same extent?
When looking closely at the data, we can see that most descriptions were
affected by spatial priming but that some descriptions still contained dynamic
parts. Example 11 illustrates a dynamic part of a picture description, and
Examples 12 and 13 demonstrate static picture descriptions in this setting.
Example 11
0401 Ok,
0402 I see the good old guy Pettson and Findus
där ser jag den gode gamle Pettson och Findus
0403 eh how they are digging and planting in a garden.
Eh hur de gräver och planterar i en trädgård,
0404 it seems to be a spring day
det är en vårdag tydligen
0405 the leaves and the flowers have already come out
löven blommorna har kommit ut
Example 12
0601 Ehm in the middle to the left Pettson is standing
Hm i mitten till vänster står Pettson
0602 And looking at something he has in his hand
och tittar på nånting han har i sin hand
0603 in the middle of the garden
mitt i trädgårdslandet
0604 And it is green all around
så är det grönt runtomkring
0605 Some birds on the left in the corner
några fåglar till vänster i hörnet
0606 And then in the middle
sen så i mitten
0607 Stands a tree
står ett träd
0608 with three birds and a flower
med tre fåglar och en blomma
Table 4. The overall number and proportion of linguistic parameters in off-line
interact compared to off-line + ET (spatial priming).

Linguistic indicators     off-line, interact   off-line + spatial priming   Sign.
1 there is                180 (30%)            110 (35%)                    NS
2 spatial expressions     131 (21%)            104 (33%)                    NS
3 temporal expressions     69 (11%)             15 (5%)                     *
4 dynamic verbs           185 (30%)            104 (33%)                    NS
Number of foci            610                  311                          *
Average duration          2 min 50 sec         1 min 55 sec                 *
Mean # foci               50.83                25.92                        *
Example 13
1201 ehhaa on the picture you can see a man
ehhaa på bilden ser man en man
1202 with a vest and hat and beard and glasses,
med en väst hatt skägg glasögon,
1203 Eh the same man in four different eh what should I call it situations,
Eh samma man i fyra olika eh . vad ska man säga situationer,
[Figure: two bar charts of linguistic parameters 1-4 ('there is', spatial
expressions, temporal expressions, dynamic verbs) in the two settings]
The second set of data consists of picture descriptions with narrative priming.
Fifteen informants (six men and nine women), matched by age, described the
same picture. This time, they were given a narrative priming and described it
on-line. The description was self-paced, lasting on average 2 minutes and 30
seconds. The task was to tell a story about what happens in the picture.
Would a narrative priming influence the occurrence of a dynamic de-
scription style in on-line descriptions? The on-line picture descriptions were
analysed thoroughly. The result was that many dynamic descriptions were pro-
duced and that the dynamic aspects were strengthened (cf. Table 5).
Table 5. Overall number and proportion of linguistic indicators for two styles in:
off-line interactive setting and on-line description with narrative priming.

Linguistic indicators     off-line, interact   on-line, narrative priming   Sign.
there is                  180 (30%)             28 (9%)                     *
spatial expressions       131 (21%)             38 (12.5%)                  *
temporal expressions       69 (11%)            172 (57%)                    *
dynamic verbs             185 (30%)             52 (17%)                    *
Number of foci            610                  302                          *
Average duration          2 min 50 sec         2 min 30 sec                 NS
Mean # foci               52.83                20.13                        *
The use of temporal expressions (temporal verbs and adverbs) and dynamic
verbs were two of the relevant indicators of a dynamic style. What is most
striking when comparing these two data sets is the dramatic increase of
temporal expressions and the decrease of existential constructions (there is) in
descriptions with narrative priming (cf. Figure 3): 57 percent of all foci contain
temporal verbs or adverbs, compared with only 11 percent in the interactive
setting (p = 0.0056). Only 9 percent of all foci contain a there is-construction,
compared to 30 percent in the interactive setting (p = 0.0005). As assumed, the
proportion of spatial expressions was significantly lower in the narrative
setting than in the interactive one (p = 0.038). Contrary to our expectations,
the number of dynamic verbs dropped.
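As a rough check, the difference in there is-proportions reported above (180 of 610 foci versus 28 of 302) can be recomputed with a two-proportion z-test. This is only an illustrative sketch: the chapter's own significance tests were presumably computed per informant rather than on pooled counts, so the numbers will not match the reported p-values exactly.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z statistic for the difference between two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 'there is' foci: off-line interactive (180/610) vs. narrative priming (28/302)
z = two_prop_z(180, 610, 28, 302)
print(round(z, 2))   # a |z| this large corresponds to p < 0.001 (two-sided)
```

The same function applied to the spatial-expression counts (131/610 vs. 38/302) likewise yields a clearly significant difference, in line with the pattern the tables report.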
It can be concluded that the narrative priming did influence the way the
informants described the picture. In particular, the narrative priming mostly
[Figure 3: proportions of 'there is', spatial expressions, temporal expressions
and dynamic verbs in the off-line interactive setting and the on-line setting
with narrative priming]
enhanced the temporal dynamics in the description. The effect is even stronger
than that of the indirect spatial priming. Also, the use of there is and spatial
expressions dropped in this setting. The reason for this might be that the
informants did not feel bound to the picture. Some of the informants created a
completely new story out of the picture elements, far from the actual
constellation. Another possible explanation can be found in double priming: if
the informants recognised the characters, the book or the story, they were
narratively primed both by the book and by the instruction. These two factors
might have supported each other in tandem.
Furthermore, compared to the dynamic style in the off-line condition, we
find more types of narrative elements here: the informants frequently use tem-
poral specifications, they connect the foci by means of discourse markers, and
they use projected speech (direct and indirect quotation of the figures). Some
of the speakers even include a real story with a punch line and a typical story-
ending (cf. Example 14).
Example 14
0712 so. . Pettson starts to loosen up the soil
så Petterson börjar och luckrar upp jorden
0713 and then he finds something weird,
och då hittar han något konstigt,
0714 he looks at it . properly,
han tittar . ordentligt,
0715 and . it is a seed that has probably flown here from somewhere,
och. det är ett frö som nog har flugit hit från någonstans,
0716 he has never seen such a seed,
. han har aldrig sett ett sånt frö,
0717 even if he's a good gardener,
även om han är en kn/ en duktig trädgårdsmästare,
0718 And has seen most of it,
och varit med om det mesta,
0719 so he becomes extremely curious,
så han blir jättenyfiken,
0720 he . shouts to the little cat:
han . ropar till lilla katten:
0721 come and take a waterhose here
ta nu en vattenslang här
0722 take it at once/ I have to plant this seed,
ta genast/ jag måste genast plantera den här frön,
The aim of this section was to compare the results concerning static and
dynamic styles with other data. Spatial and narrative priming had effects on the
description style. The spatial priming resulted above all in a larger number of
localisations, significantly fewer temporal expressions and significantly shorter
descriptions. The character of the descriptions shifted towards a static style.
The narrative priming gave rise to a significantly larger proportion of temporal
expressions and a significant drop in spatial expressions and existential
constructions of the type there is/there are. In particular, narrative priming
enhanced the temporal dynamics in the description. The effect was even
stronger than that of the indirect spatial priming.
Questions that remain to be answered are the following: Does the presence
of a listener influence the occurrence of these two styles? If the two styles were
not found in other types of off-line description, there would be reason to think
that they are conversationally determined styles that only occur in face-to-face
interaction. To test this, we have to find out whether these styles appear in
off-line monological picture descriptions.
4. Conclusion
In the twelve descriptions, two dominant styles were identified: the static and
the dynamic description style. Attending to spatial relations is dominant in
the static description style where the picture is decomposed into fields that are
then described systematically, with a variety of terms for spatial relations. In
the course of the description, informants establish an elaborate set of referen-
tial frames that are used for localisations. They give a precise number of picture
elements, stating their colour, geometric form and position. Apart from spatial
expressions, the typical features of the static description style are frequent use
of nouns, existential constructions (there is, it is, it was), auxiliary or posi-
tion verbs and passive voice. Focusing and refocusing on picture elements is
Chapter 4. Variations in picture description 77
In Chapters 1–4, the reader has been presented with the characteristics of the
spoken picture description. In the following four chapters, I will broaden the
perspective and explore the connection between spoken descriptive discourse,
visual discovery of the picture and mental imagery. The current chapter deals
with methodological questions, and the focus is on sequential and processual
aspects of picture viewing and picture description. First, spoken language and
vision will be treated as two windows to the mind and the focusing of attention
will be conceived of with the help of a spotlight metaphor. Second, a new eye
tracking study will be described and the multimodal score sheet will be intro-
duced as a tool for analysis of temporal and semantic correspondence between
verbal and visual data. Third, a multimodal sequential method suitable for a
detailed dynamic comparison of verbal and visual data will be described.
In psychology and cognitive science, spoken language has been used to exter-
nalise mental processes during different tasks in the form of verbal protocols or
think-aloud protocols (Ericsson & Simon 1980), where subjects are asked to
verbally report sequences of thought during different tasks. Verbal protocols
have been extensively used to reveal steps in reasoning, decision-making and
problem-solving processes and applied as a tool for design and usability test-
ing. Linguists have also tried to access the mind through spoken descriptive
discourse (Linde & Labov 1975) and through a consciousness-based analysis
of narrative discourse (Chafe 1996).
It has been argued that eye movements reflect human thought processes, since
it is easy to determine which elements attract the observer's eye (and thought),
in which order and how often. But eye movements reveal these covert pro-
cesses only to a certain extent. We only gain information about which area a
fixation landed on, not about what level the viewer was focusing on (was it
the contents, the format, or the colour of the inspected area?). An
area may be fixated visually for different purposes: in order to identify a picture
element, to compare certain traits of an object with traits of another area, in
order to decide whether the momentary inferences about a picture element are
true or in order to check details on a higher level of abstraction. Verbal foci
and superfoci include descriptions of objects and locations, but also attitudes,
impressions, motives and interpretations. A simultaneous verbal picture de-
scription gives us further insights into the cognitive processes.
There are several possible ways to use the combination of verbal and visual
data. In recent studies within the so-called visual world paradigm, eye move-
ment tracking has been used as a tool in psycholinguistic studies on object
recognition and naming and on reading and language comprehension (for an
overview see Meyer & Dobel 2003 and Griffin 2004). The static pictorial stimu-
li were drawings of one, two or three objects and the linguistic levels concerned
single words and referring expressions, nouns, pronouns, simple noun phrases
("the cat and the chair") and only to a very small extent utterances, for instance
"the angel is next to the camel", "the cowboy gives the hat to the clown". The
hypothesis behind this line of research is that eye movements are closely time-
locked to the speech stream and that eye movements are tightly coupled with
. The relation between language and mind is also discussed by Gärdenfors (1996).
The way to uncover the idea unit goes via spoken language in action and via
the process of visual focusing.
The human ability to focus attention on a smaller part of the visual field has
been discussed in the literature and likened to a spotlight (Posner 1980; Theeuwes
1993; Olshausen & Koch 1995) or to a zoom-lens (Findlay & Walker 1998).
"(…) the locus of directed attention in visual space is thought of as having great-
er illumination than areas to which attention is not directed or areas from which
attention has been removed" (Theeuwes 1993: 95). The current consensus is
that "the spotlight of attention turns off at one location and then on at another"
(Mozer & Sitton 1998: 369). The explanations for such a spotlight differ, though.
Traditionally, attention is viewed as a limited mental resource that constrains
cognitive processing. In an alternative view, the concept is viewed in terms of
the functional requirements of the current task (Allport 1989).
What we attend to during the visual perception and the spoken language
description can be conceived of with the help of a spotlight metaphor, which
intuitively provides a notion of limitation and focus. Actually, the spotlight
comes from the inner world, from the mind. It is the human ability to visually
and verbally focus attention on one part of the information flow at a time. Here,
this spotlight is transformed into both a verbal and a visual beam (cf. Figure 1).
The picture elements fixated are visually in the focus of a spotlight and
embedded in a context. The spotlight moves to the next area that pops up from
the periphery and will be in focus for a while. If we focus our concentration
and eye movements on a point, we mostly also divert our attention to that
point. By using a sequential visual fixation analysis, we can follow the path
Chafe has also suggested that language and vision have similar properties: both
proceed in brief spurts and have a focus and a periphery (Chafe 1980, 1994,
1996). As far as I know, this is the first eye tracking study exploring
the couplings between vision and discourse by using a complex picture. (And it
took almost 25 years before the topic was taken up again.) Comparing patterns
during the visual scanning of the picture and during the verbal description of
the picture is a very useful method that provides sustainable results. It helps us
in answering the question whether there is a unifying system integrating vision
and language, and it enriches the research about the nature of human attention,
vision, discourse and consciousness.
The type of scene is, of course, a very important factor influencing both the
form and content of verbal picture descriptions and the eye movement patterns
during picture viewing. Let me therefore stop for a while at the chosen complex
picture, characterise it in more detail and relate it to a scene classification within
scene perception research. Henderson & Hollingworth (1999) define a scene as
"a semantically coherent (and often nameable) human-scaled view of a real-
world environment comprising background elements and multiple discrete ob-
jects arranged in a spatially licensed manner" (Henderson & Ferreira 2004: 5).
The chosen illustration can be characterised by the following features:
The aim of this study has been to conduct a qualitative sequential analysis
of the temporal and semantic relations between clusters of visual and verbal
data (cf. Chapter 6, Section 1 and Chapter 7, Section 1). For each person,
eye movement data have been transformed to a visual flow on a timeline and
the objects that have been fixated in the scene have been labelled. The spoken
language descriptions have been transformed from the transcript observation
format to a verbal flow on a timeline and the borders of foci and superfoci have
been marked. As a result of these transformations, a multimodal time-coded
score sheet with different streams can be created.
A multimodal time-coded score sheet is a format suitable for synchronising
and analysing visual and verbal data (for details see Holsanova 2001:99f.). In
comparison with ELAN, developed at the Max Planck Institute in Nijmegen, and
other transcription tools with tiers, timelines and tags, I have built in the analysis
of foci and superfoci, which is unique. Figure 3 illustrates what happens visually
and verbally during a description of the three pictures of the old man Pettson,
who is involved in various activities on the right hand side of the picture.
As we can see in Figure 3, the score sheet contains two different streams:
it shows the visual behaviour (objects fixated visually during description on
line 1; thin line = short fixation duration; thick box = long fixation) and verbal
behaviour (verbal idea units on line 2), synchronised over time. Since we start
from the descriptive discourse level, units of discourse are marked and cor-
related with the visual behaviour. Simple bars mark the borders of verbal foci
(expressing the conscious focus of attention) and double bars mark the borders
of verbal superfoci (thematic clusters of foci that form more complex units of
thought). On line 3, we find the coding of superfocus types (summarising,
substantive, evaluative etc.). With the help of this new analytic format, we
can examine what is in the visual and verbal attentional spotlight at a particu-
lar moment: Configurations of verbal and visual clusters can be extracted and
contents in the focused verbal idea flow and the visual fixation clusters can be
compared. This score sheet makes it possible to analyse what is happening dur-
ing preceding, simultaneous and following fixations when a larger idea is de-
veloped and formulated. Since the score sheet also includes types of superfoci,
it is possible to track the functional distribution of the extracted verbal and
visual patterns. This topic will be pursued in detail in Chapter 6.
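To make the structure of such a score sheet concrete, the following Python sketch (all labels, times and names are invented for illustration, not taken from the study) represents the visual stream of fixation clusters and the verbal stream of foci as intervals on one shared timeline, so that the visual and verbal "spotlight" at any moment can be queried:

```python
from dataclasses import dataclass

@dataclass
class Interval:
    start: float   # seconds from description onset
    end: float
    label: str     # fixated object, or the wording of a verbal focus
    kind: str = "" # for verbal units: focus type, e.g. "sum", "subst", "list"

# Hypothetical fragment of a score sheet: a visual stream of fixation
# clusters and a verbal stream of foci, synchronised on one timeline.
visual = [Interval(0.0, 0.4, "Pettson 1"), Interval(0.4, 1.3, "Pettson 2"),
          Interval(1.3, 2.1, "Pettson 3")]
verbal = [Interval(0.5, 2.0, "there are three versions of old guy Pettson", "sum")]

def spotlight(t, stream):
    """Return the label in the (visual or verbal) spotlight at time t."""
    return next((i.label for i in stream if i.start <= t < i.end), None)

print(spotlight(1.0, visual), "|", spotlight(1.0, verbal))
# → Pettson 2 | there are three versions of old guy Pettson
```

Extracting configurations of verbal and visual clusters then amounts to comparing which intervals in the two streams overlap in time.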
The multimodal time-coded score sheets are suitable for an analysis of pro-
cessual aspects of picture viewing and picture description (and perhaps even for
the dynamics of perception and action in general). So for example, instead of
analysing the result of the picture discovery of three versions of Pettson in the
form of a fixation pattern (Figure 4, Section 2.3.1) and the result of the verbal
picture description in form of a transcript (Example 1, Section 2.3.2), we are able
to visualise and analyse the process of picture viewing and picture description on
a time-coded score sheet (as seen in Figure 3). Let me explain this in detail.
The fixation pattern in Figure 4 shows the objects and areas that have been fixated by the viewer. This is, however, a static
pattern since it does not exactly visualise when and in what order they were fix-
ated. The circles in Figure 4 indicate the position and duration of the fixations,
the diameter of each fixation being proportional to its duration. The lines con-
necting fixations represent saccades. The white circle in the lower right corner
is a reference point: it represents the diameter of a one-second fixation.
Let me at this point briefly mention some basic information about eye move-
ments. Eye gaze fulfils many important functions in communication: Apart
from being an important source of information during social interaction, gaze
is related to verbal and non-verbal action (Griffin 2004). The reason why people
move their eyes is "to bring a particular portion of the visible field of view into
high resolution so that we may see in fine detail whatever is at the central direc-
tion of gaze" (Duchowski 2003: 3). Eye movements during picture viewing
consist of two temporal phases: fixations and saccades. Fixations are stops, or
periods of time when the point of regard is relatively still. The average fixation
duration varies according to the activity we are involved in. Table 1 (based on
Rayner 1992; Solso 1994;
Henderson & Hollingworth 1999 and my own data) shows some examples of
average fixation duration during different viewing activities.
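The raw eye tracking signal is only a stream of gaze positions; fixations have to be recovered from it. One common approach, sketched below in Python, is dispersion-based clustering: consecutive samples count as one fixation as long as they stay within a small spatial window. This is a generic sketch, not necessarily the algorithm used by the eye tracker in this study, and the thresholds are illustrative:

```python
def detect_fixations(samples, max_dispersion=30, min_duration=3):
    """Group consecutive gaze samples into fixations (dispersion-threshold sketch).

    samples: list of (x, y) gaze positions, one per time step.
    max_dispersion: max (x-range + y-range) in pixels within one fixation.
    min_duration: minimum number of samples for a fixation.
    Returns (start_index, end_index_exclusive) pairs.
    """
    fixations, start = [], 0
    while start < len(samples):
        end = start + 1
        # Grow the window while the samples stay spatially close together.
        while end < len(samples):
            xs = [p[0] for p in samples[start:end + 1]]
            ys = [p[1] for p in samples[start:end + 1]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            end += 1
        if end - start >= min_duration:
            fixations.append((start, end))
        start = end
    return fixations
```

Everything between two detected fixations is then treated as a saccade.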
The jumps between stopping points are called saccades. During saccades,
the eyes move at a relatively rapid rate to reorient the point of vision from one
spatial position to another. Saccades are very short, usually lasting from 20 to 50
ms. It is during fixations that we acquire useful information, whereas our vi-
sion is suppressed and we are essentially blind during saccades (Henderson &
Hollingworth 1998, 1999; Hoffman 1998). Usually, three types of visual acuity
are distinguished: foveal vision which encompasses a visual angle of only about
1–2 degrees, parafoveal vision which encompasses a visual angle of up to 10
Chapter 5. Multimodal sequential method and analytic tool 91
Example 1. Transcript sample Three versions of Pettson on the right (uttered dur-
ing picture discovery illustrated in Figure 4).
0123 (1s) eeh to the right 0:50 sum + loc
0124 there are three versions of (1s) old guy Pettson 0:54
0125 who is working the land, 0:56
0126 he is digging in the soil 0:59 subst. list
0127 he is (1s) eh raking 1:02
0128 and then he is sowing, 1:06
The spoken picture description has been recorded, transcribed and translated
from Swedish into English. The transcript of the spoken description is detailed
(Example 1). It includes verbal features, prosodic features (such as intonation,
rhythm, tempo, pauses, stress, voice quality, loudness), and non-verbal features
(such as laughter). It also contains hesitations, interruptions, restarts and other
features that are typical of speech and that give us additional information about
the speaker and the situational context. Each numbered line represents a new
verbal focus expressing the content of active consciousness. Several verbal foci
are clustered into superfoci (for example the summarising superfocus 0123–0125
or a list of items 0126–0128; see Chapter 1, Section 2.1 for definitions and
details).
However, both the visual and the verbal outcomes above, i.e. the fixation
plot in Figure 4 and the transcript for three versions of Pettson in Example 1,
are static. In order to follow the dynamics of picture viewing and picture de-
scription, we need to synchronise these two data outcomes in time. Let us have
a look at the same sequence in the new sequential format (Figure 6) where we
zoom in on what is happening in the visual and verbal streams. The boxes of
different shading and different length with labels on the top line represent ob-
jects that were fixated visually. On the bottom line, we find the verbal foci and
superfoci including pauses, stress etc. Both the verbal and the visual streams
are correlated on a time-line.
With the help of the multimodal sequential method, we are able to extract
several schematic configurations or patterns both within a focus and within a
superfocus as a result of a detailed analysis of the temporal and semantic rela-
tions (cf. Holsanova 2001). Some of them can be seen in Figure 6, for example
an n-to-one configuration between the visual and the verbal part in the second
focus when the viewer is introducing the three pictures. Notice that there is a
large delay (or latency) between the visual fixation of the sowing Pettson (dur-
ing the second verbal focus) and the verbal mention of this particular picture
(in the sixth focus). In fact, the sowing Pettson is not locally fixated, in parallel
Figure 6. Schematic configuration on the multimodal time-coded score sheet: Three versions of Pettson on the right, with connections between verbal and visual foci and with markings of foci and superfoci.
to the verbal description within the same focus. Its mention is based on a
previous fixation on this object, during the summarising focus "there are three
versions of (1s) old guy Pettson". The ability to keep track of this referent gives us
some hints about the capacity of the working memory. For more details about
the temporal and semantic relations, see Chapters 6 and 7.
Figure 7. Multimodal time-coded score sheet of keystroke data and eye tracking data
design and human factors are therefore yet another area where this method
could be applied when testing how nuclear power plant operators react in
a simulated scenario, measuring situation awareness of operators in airport
towers, and evaluating computer interfaces or assessing architectural and in-
terface design (Lahtinen 2005). A further possible application lies within the
educational context (test of language development, text-picture integration,
etc.). It would also be possible to add a layer for gestures and for drawing in
order to reveal practitioners' expert knowledge by analysing several verbal and
non-verbal activities: designers sketching and verbally explaining a structure
for a student, archaeologists visually scanning and verbally describing structures
on a site while simultaneously pointing and gazing at them, or radiologists
visually scanning images and verbally describing the anomalies in an exami-
nation report. All these multimodal actions involve several mental processes
including analytic phases, planning, survey phases, constructive phases, moni-
toring phases, evaluative phases, editing phases and revision phases. It would
be interesting to synchronize different streams of behaviour (verbal, visual,
gestural, other non-verbal) in order to investigate the natural segmentation of
action into functional phases or episodes, in order to get to know more about
individual strategies and patterns and about the distribution of the underlying
mental processes.
Finally, multimodal score sheets can also be applied within scene percep-
tion and reasoning. The selection of informative regions in an image or in a
scene is guided both by bottom-up and top-down processes such as internal
states, memory, tasks and expectations (Yarbus 1967). Recorded eye move-
ments of human subjects with simultaneous verbal descriptions of the scene
can reveal conceptualisations in the human scene analysis that, in turn, can be
compared to system performance (Schill 2005; Schill et al. 2001).
The advantages when using a multimodal method in applied areas are
threefold: it gives more detailed answers about cognitive processes and the
ongoing creation of meaningful units, it reveals the rationality behind the in-
formants' behaviour (how they behave and why, what expectations and associa-
tions they have) and it gives us insights about users' attitudes towards certain
layout solutions (what is good or bad, what is easy or difficult etc.). In short, the
sequential multimodal method can be successfully used for a dynamic analysis
of perception and action in general.
4. Conclusion
chapter 6
Temporal correspondence between verbal and visual data
In our everyday life, we often look ahead, pre-planning our actions. People
look ahead when they want to reach something, identify a label, open a
bottle. When playing golf, rugby, cricket, chess or football, the players usually
do not follow the track of the moving object but rather fixate the expected
future position where the object should land (Tadahiko & Tomohisa 2004;
Kiyoshi et al. 2004). Piano players and singers read the next passage in the
score and their eyes are ahead of their hands and voices (Bersus 2002;
Goolsby 1994; Pollatsek & Rayner 1990; Sloboda 1974; Young 1971). Last
but not least, copy-typers' eyes are ahead of their fingers (Butsch 1932; Inhoff
& Gordon 1997). In sum, predictions and anticipatory visual fixations are
frequent in various types of activities (Kowler 1996). The question is how
picture viewing and picture description are coordinated in time. Do we al-
ways look ahead at a picture element before describing it verbally or can the
verbal description of an object be simultaneous with visual scanning? Does it
even happen that eye movements follow speech?
By now, the reader has been presented with different types of foci and superfo-
ci, with the sequential qualitative method and the analytic tool, and with the
multimodal time-coded score sheet, all of which have been described in the previous
chapters. In the following two chapters, I will use this method to compare the
content of the visual focus of attention (specifically clusters of visual fixations)
and the content of the verbal focus of attention (specifically verbal foci and su-
perfoci). Both temporal and semantic correspondence between the verbal and
visual data will be investigated.
In this chapter, I will primarily concentrate on temporal relations. Is the
visual signal always simultaneous with the verbal one? Is the order of the ob-
jects focused on visually identical with the order of objects focused on ver-
bally? Can we find a comparable unit in visual and verbal data? It is, of course,
Let us start with the inventory of configurations extracted from the data.
The multimodal time-coded score sheet presented in the previous chapter
showed connections between verbal and visual foci in detail. Here, the vari-
ous configurations will be presented in a more simplified, schematic way. In
this simplified version of the score, there are only two streams: the visual and
the verbal behaviour. The discrete units compared (boxes on the timeline)
represent the objects that at a certain moment are in the focus of visual and
verbal attention.
Chapter 6. Temporal correspondence between verbal and visual data 101
Since we have deduced from Chafe (1980) that the verbal focus is a candidate
for a unit of comparison between verbal and visual data, we will start on the
focus level. In particular, we will compare the verbal focus (an idea about the
picture element that is conceived of as central at a certain point in time and
delimited by prosodic, acoustic and semantic features) with the temporally si-
multaneous visual focus (a cluster of visual fixations directed onto a discrete
object in the scene).
typical of certain parts of the descriptions, especially of the list of items and
substantive foci (subst). One further characteristic is that this configuration
usually does not appear alone, within one focus, but rather as a series, as a part
of a superfocus (see configuration 1.2.2).
Eye-voice latency (or eye-voice span) has been reported in a number of
psycholinguistic studies using eye tracking, but the average duration differs in
different activities. It has, for instance, been observed that the eye-voice latency
during reading aloud is 750 ms and during object naming, 900 ms (Griffin &
Bock 2000; Griffin 2004). As we have seen in picture viewing and picture de-
scription, the eye-voice latency in list foci lasts for about 2000–3000 ms. In
our mental imagery studies (Johansson et al. 2005, 2006), the eye-voice latency
was on average 2100 ms during the scene description and approximately 300
ms during the retelling of it (cf. Chapter 8, Section 2.1). The maximum value
across all subjects was 5000 ms in both phases.
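Once the onset of the first fixation cluster on an object and the onset of its verbal mention have both been time-stamped, latencies of this kind reduce to a subtraction per object. A minimal Python sketch (the object labels and onset times are invented, chosen only to fall roughly in the 2000–3000 ms range reported above):

```python
def eye_voice_latency(fixation_onsets, speech_onsets):
    """Latency (ms) between first fixation on an object and its verbal mention.

    Both arguments map object labels to onset times in ms from trial start.
    Positive values mean the eye reached the object before the voice did.
    """
    return {obj: speech_onsets[obj] - fixation_onsets[obj]
            for obj in speech_onsets if obj in fixation_onsets}

# Illustrative (invented) onsets:
fix = {"stone": 1200, "tree": 4000}
speech = {"stone": 3700, "tree": 6100}
print(eye_voice_latency(fix, speech))  # {'stone': 2500, 'tree': 2100}
```

Averaging these per-object values over a trial gives the kind of mean latency figures cited in the text.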
There are interesting parallels in other areas of non-verbal behaviour, such
as gesturing, signing or using a pen. The finding that the verbal account lags
behind the visual fixations can be compared to Kendon's (1980), Naughton's
(1996) and Kita's (1990) results, which reveal that both spontaneous gesturing
and signed language precede their spoken lexical analogues during communi-
cation. The eye-hand latency in copy-typing ranges between 200 and 900 ms
(Inhoff & Gordon 1997). Also, when users interact with a multimodal system,
pen input precedes their speech by one or two seconds (Oviatt et al. 1997).
The question that arises here is why the latency in picture viewing and pic-
ture description is so long compared to studies in reading and object naming.
What does it measure? Meyer and Dobel (2003: 268) write: "When speakers
produce sentences expressing relationships between entities, this formulation
phase may be preceded by a conceptualization or appraisal phase during which
speakers aim to understand the event, assign roles to the event participants and
possibly, select a verb." Conceptualisation and formulation activities during ut-
terance production might be one explanation, but the length of the lag indi-
cates that there are more complex cognitive processes going on at a discourse
level. Other possible explanations for the length of the lag will be discussed in
detail in section three of this chapter.
configuration in Figure 3 encompasses two visual foci and one verbal focus. It
has been formulated as a part of a superfocus consisting of two foci: "in front of
the tree there is a stone". The size of the visual fixation cluster is limited by the
size of the verbal focus (< 2 sec). If the cluster is longer than that, then the fixa-
tions exceeding the focus border belong to the next focus.
The first visual fixation on the object seems to be a preparatory one. The
refixation on the same object then occurs simultaneously with the verbal de-
scription of this object. The two white boxes in between are anticipatory fixation
clusters on other objects that are going to be mentioned in the next verbal
focus. The observer may at the time be looking ahead and pre-planning (cf.
Velichkovsky, Pomplun & Rieser 1995). Thus, it seems like a large proportion
of the categorisation and interpretation activities has already taken place dur-
ing the first phase of viewing (the first visual fixation cluster) whereas the later
refixation simultaneous with the formulation of "stone" occurs in order to
increase the saliency of "stone" during speech production. Anticipatory fixa-
tions on a forthcoming referent have also been shown in auditory sen-
tence comprehension (Tanenhaus et al. 2000). In the above configuration, we
see a typical example of anticipation in on-line discourse production: the visual
cluster on white objects interferes temporally and semantically with the current
focus (more about semantic relations will be said in the following chapter).
The triangle configuration is typical of substantive verbal foci (subst) and
localising foci (loc) and represents 17 percent of all presentational foci. As we
will see later (in configuration 1.2.3), it usually represents only a part of a su-
perfocus and is often intertwined with fixations from other foci. Already in this
configuration we can see that a 1:1 correspondence between verbal and visual
foci will not hold. The process of attentional focusing and refocusing on picture
elements seems to be more complicated than that.
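Given label sequences for the objects fixated and the objects mentioned within one focus, configurations of this kind can be approximated with a few rules. The Python sketch below is my own rough operationalisation for illustration, not the book's coding scheme; the labels, rules and the reading of "triangle" as fixation, detour, refixation are assumptions:

```python
def classify_configuration(visual_labels, verbal_labels):
    """Rough, illustrative classifier for verbal/visual configurations
    within one focus.

    visual_labels: objects fixated during the focus, in order.
    verbal_labels: objects mentioned in the focus, in order.
    """
    v, w = set(visual_labels), set(verbal_labels)
    if len(w) == 1:
        only = next(iter(w))
        if visual_labels and all(x == only for x in visual_labels):
            # One object fixated and mentioned: 1 or n fixation clusters.
            return "perfect match" if len(visual_labels) == 1 else "n-to-1"
        if visual_labels and visual_labels[0] == only and visual_labels[-1] == only:
            # Fixation, detour to other objects, refixation of the same object.
            return "triangle"
        # Mentioned object not fixated within this focus (fixated earlier).
        return "delay" if only not in v else "n-to-1"
    return "n-to-n"

print(classify_configuration(["stone", "bush", "stone"], ["stone"]))  # → triangle
```

A real analysis would of course also use the timing of the clusters (as the score sheet does), not just their labels.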
The next higher unit in the hierarchy to consider is the superfocus (a larger co-
herent chunk of speech consisting of several verbal foci). Configurations 1.2.1–
1.2.6 below exemplify more complex patterns extracted from the data within a
superfocus, i.e. the sequences of foci delimited by double bars on the score.
suggests that the observer has not analysed these areas thoroughly the first
time s/he entered this area (i.e. during the summarising overview). Instead,
s/he returns to them and analyses them in detail later on.
One of the varieties of this configuration can be seen in Figure 9. The
speaker is describing three birds doing different things in a summarising focus
and continues with a list: "one is sitting on its eggs, the other is singing and the
third female bird is beating a rug or something". The visual dwelling on the third
bird is very long, consisting of inspections of various details of the bird's ap-
pearance that lead to the categorisation "she-bird". The delay configurations are
combined with one 1-to-n configuration.
During the summarising (sum) and list foci in on-line descriptions, in-
formants were inspecting one and the same area of interest twice, but on two
different levels of specificity. This finding about two levels of viewing is fully
consistent with the initial global and the subsequent local phase hypothesised by
Buswell (1935), who identified two general patterns of perception:
One of these consists of a general survey in which the eye moves with a se-
ries of relatively short pauses over the main portions of the picture. A second
type of pattern was observed in which series of fixations, usually longer in
duration, are concentrated over small areas of the picture, evidencing detailed
examination of those sections. While many exceptions occur, it is apparent
that the survey type of perception generally characterises the early part of an
examination of a picture, whereas the more detailed study, when it occurs,
usually appears later. (Buswell 1935:142)
Pauses, hesitation, vagueness and modifications (those things, sort of), met-
acomments, corrections and verbatim repetitions are typical of this type of
verbal superfocus. In other words, there are multiple visual and verbal activi-
ties. More specifically, multiple visual fixation clusters are accompanied by lin-
guistic alternation and dysfluencies.
Apart from configurations within the superfocus, correspondence could
be found on an even higher level of discourse, namely the discourse topics.
Figure 14. Frequency distribution of the configuration types in free on-line picture
descriptions (n-to-1: 37%, delay: 30%, triangles: 17%, n-to-n: 11%, perfect match: 5%).
Two describers were guided by the composition of the picture and moved
systematically (both visually and verbally) from the left to the right, from the
foreground to the background (Holsanova 2001: 117f.). The discourse topic can
also be based on impressions, associations or evaluations in connection to the
scene.
connected to one verbal focus, in particular in summarising foci. The foci are
often intertwined, which causes a partial lack of qualitative correspondence
between the visual and verbal foci. The configurations where multiple visual
foci are connected to one verbal focus and intertwined with each other are well
integrated into a larger unit, into a superfocus. Explanations for intertwined
configurations can be found in discourse planning. If one visual fixation clus-
ter is connected to one verbal focus, there is usually a latency of about 2–3
seconds between them, in particular in lists of items. This delay configuration
is in turn a part of a larger unit, the surrounding superfocus, and the length
of the delay reflects cognitive processes on a discourse level. Looking back at
Chafes (1980) suggestions, the conclusion we can draw from this comparison
is that the verbal focus about one object does not always closely correspond to
a single visual focus, i.e. a visual fixation cluster on this object. Also, the order
in which the picture elements are focused on differs partially. The hypothesis
that there are comparable units in the visual scanning of the picture and the
simultaneous spoken language description of it can still be confirmed, but
on a higher level. It is the superfocus rather than the focus that seems to be
the suitable unit of comparison between visual and verbal data, since in most
cases the superfocus represents an entity that delimits separable clusters of
visual and verbal data.
We will now turn to the second section of this chapter, which focuses on
the functional distribution of the extracted configurations.
foci can contain evaluative and localising aspects. The patterns seem to vary
according to the different aspects included.
However, when we look at the summarising foci (sum), list of items (list)
and foci with categorisation difficulties (cat. diff.), the configurations appear
regularly across all informants and can be systematically correlated with certain types of verbal activities.
[Figure: bar chart of the frequency of configuration types (n-to-1, delay, triangles, n-to-n, perfect match) for each focus type: sum, subst, list, loc, cat. diff., eval.]
For summarising foci (sum), the n-to-1 configuration
appears to be typical. This coupling is quite plausible: The information about
objects, states and events is acquired visually by multiple fixation clusters on
several objects in the scene, both during pauses and during the verbal descrip-
tion, and is summarised verbally, usually in one verbal focus.
For list of items, the delay configuration dominates. This can be explained
by a counting-like behaviour: the informants have inspected the objects during
a summarising focus and are checking the items off the list. Since they are de-
scribing these objects in detail now, they have to check the appearance, relation
and activity of the listed objects. This categorisation, interpretation and formulation cost mental effort, which is reflected in the delay configuration.
Triangles are typical of localising foci (loc). Within one verbal focus highlighting an idea about an object's location in the scene (e.g. in front of the tree
there is a stone), the viewers visually inspect the stone-area before mentioning
stone. During this first inspection, the inspected object functions as a pars pro
toto, representing the 'in front of the tree' location (for more details about this
semantic relation, see the following chapter). After that, the informants inspect
some other objects in the scene (which will be mentioned in the following
foci), and finally devote the last visual inspection to the stone again, while simultaneously naming it. The mental activities seem to be divided into a preparatory phase, when most of the categorisation and interpretation work is done,
and a formulation phase during the refixation.
Categorisation difficulties are exclusively connected to the n-to-n cluster
type. Concerning the characteristics of the verbal and visual
stream, multiple verbal and visual foci are typical of this type of activity, where
visual fixations were either short/medium or very long (intensive scanning).
For the verbal stream, repetitive sequences, lexical or syntactic alternation
and sometimes even hyperarticulation are characteristic. When determining
a certain kind of object, activity or relation, the effort to name and categorise is associated with problem solving and cognitive load. This effort is often manifested by four or five different verbal descriptions of the same thing
within a superfocus.
In short, tendencies towards reoccurring multimodal integration patterns
were found in certain types of verbal superfoci during free picture descriptions. The results from the identified typology of multimodal clusters and their
functional distribution can give us hints about certain types of mental activities. If further research confirms this coupling, these multimodal clusters can
receive a predictive value.
In the middle is a tree with one with three birds doing different things; one is sitting
on its eggs, the other is singing and the third female bird is beating a rug or something.
Figure 17. The complex visual display and linguistic production in the current stud-
ies on descriptive discourse.
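To make the typology concrete, the decision rules implicit in the discussion above can be sketched as a toy classifier. Everything below is illustrative: the function name, the per-superfocus summary inputs and the 2-second latency cut-off are my own assumptions, not part of the study.

```python
def classify_configuration(n_visual, n_verbal, latency_s, refixated=False):
    """Toy decision rules for the configuration types discussed in the text.

    Inputs summarise one superfocus: number of visual fixation clusters,
    number of verbal foci, eye-voice latency in seconds, and whether the
    named object was refixated after intervening fixations. The 2-second
    latency cut-off and the rule order are invented for illustration.
    """
    if n_visual > 1 and n_verbal > 1:
        return "n-to-n"
    if n_visual > 1 and n_verbal == 1:
        return "triangle" if refixated else "n-to-1"
    if n_visual == 1 and n_verbal == 1:
        return "delay" if latency_s >= 2.0 else "perfect match"
    return "unclassified"

# A summarising focus: several clusters feeding one verbal focus.
print(classify_configuration(n_visual=4, n_verbal=1, latency_s=1.0))  # n-to-1
# A list item named some 2-3 seconds after the inspection.
print(classify_configuration(n_visual=1, n_verbal=1, latency_s=2.5))  # delay
```

In practice the labels would of course be assigned from the full fixation and speech records rather than from such coarse summaries.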
4. Conclusion
i. The verbal and the visual signals were not always simultaneous. Visual
scanning was also done during pauses in the verbal description and at the
beginning of the sessions, before the verbal description had started.
ii. The visual focus was often ahead of speech production (objects were visually focused on before being described). If one visual fixation was connected to one verbal focus, there was usually a delay of approximately 2–3
seconds between them, in particular in list foci. This latency was due to
conceptualisation, planning and formulation of a free picture description
on a discourse level, which affected the speech-to-gaze alignment and prolonged the eye-voice latency.
iii. Visual focus could, however, also follow speech production, so that a visual
fixation cluster on an object could appear after the describer had mentioned it. The describer was probably monitoring and checking his statement against the visual account.
iv. Areas and objects were frequently re-examined, which resulted either in
multiple visual foci or in both multiple visual and verbal foci. As we will
see in the following chapter, a refixation on one and the same object could
be associated with different ideas.
v. One visual fixation usually did not match one verbal focus and a perfect
overlap was very rare. Some of the inspected objects were not mentioned
at all in the verbal description; some of them were not labelled as a discrete
entity but instead included later, on a higher level of abstraction (there are
flying objects).
vi. Often, several visual fixations were connected to one verbal focus, in particular in summarising foci (sum).
vii. The order of objects focused visually and objects focused verbally was not
always the same, due to the fact that in the course of one verbal focus, preparatory glances were cast and visual fixation clusters landed on new
objects long before these were described verbally.
viii. Multiple visual foci were intertwined and well integrated into a larger unit,
into a superfocus. This was due to discourse coherence and was related to
cognitive processes on the discourse level.
ix. Comparable units could be found; however, not on a focus level. In most
cases, the superfocus represented the entity that delimited separable clus-
ters of visual and verbal data. I have therefore suggested that attentional
superfoci rather than foci are the suitable units of comparison between
verbal and visual data.
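The eye-voice latency in point (ii) is, operationally, a difference between two time-stamped streams. A minimal sketch, with an invented data format and invented timestamps:

```python
def eye_voice_latency(fixations, mentions):
    """Latency (s) from first fixation cluster on an object to its mention.

    `fixations` maps object -> onset of the first fixation cluster on it;
    `mentions` maps object -> onset of the word naming it. Both in seconds
    from trial start. Data format and all values are invented.
    """
    return {obj: mentions[obj] - fixations[obj]
            for obj in mentions if obj in fixations}

# Hypothetical onsets for three objects in one description.
fixations = {"stone": 4.1, "bird_1": 9.0, "bird_2": 10.2}
mentions  = {"stone": 6.8, "bird_1": 11.3, "bird_2": 13.1}

latencies = eye_voice_latency(fixations, mentions)
mean_latency = sum(latencies.values()) / len(latencies)
print(f"mean eye-voice latency: {mean_latency:.1f} s")
```

With these made-up numbers the mean lands in the 2–3 second range reported for list foci; real data would require aligning fixation clusters, not single fixations, with verbal foci.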
We can conclude that many of the observations are related to cognitive pro-
cesses on the discourse level. The main result concerns similarities between
units in our visual and verbal information processing. In connection to the
traditional units of speech discussed in the literature, the empirical evidence
suggests that the superfocus expressed in longer utterances (or similar larger
discourse units) plays a crucial role and is the basic unit of information pro-
cessing. Apart from that, as the results from my perception study (Chapter 1)
indicate, the superfocus is easier to identify and both cognitively and commu-
nicatively relevant. I was also able to demonstrate the lag between the aware-
ness of an idea and the verbalisation of that idea: when describing a list of
items in free descriptive discourse, the latency between a visual focus and
a verbal focus was longer than the latency in psycholinguistic studies on a
phrase or clause level. This insight may be valuable for linguists but may also
enrich research about the nature of human attention and consciousness. We
will return to the issue of speech production and planning in the following
chapter.
The second section of this chapter maintains that a certain type of fixation
pattern reflects a certain type of mental activity. I analysed how often each type
of pattern appears in association with each type of focus and presented the
frequency distribution of cluster types as a function of verbal activity type. It
shows that this correspondence is present in summarising foci (sum), lists of
items (list), localising foci (loc) and substantive foci with categorisation dif-
ficulties (cat. diff). Finally, the current eye tracking study has been discussed
in comparison with psycholinguistic studies. Differences have been found in
the characteristics of the visual displays used and in the linguistic description
produced.
The current chapter dealt with temporal relations between verbal and vi-
sual data. The following chapter will concern semantic relations between these
two sorts of data.
chapter 7
Semantic correspondence
between verbal and visual data
It is not self-evident that we all agree on what we see, even though we look at
the same picture. Ambiguous pictures are only one example of how our per-
ception can differ. The reason behind these individual differences is that our
way of identifying objects in a scene can be triggered by our expectations,
interests, intentions, previous knowledge, context or instructions we get. The
picture content that visitors at an art gallery extract during picture viewing,
for example, often does not coincide with the title that the artist has
formulated. Current theories in visual perception stress the cognitive basis
of art and scene perception: we 'think art' as much as we 'see art' (cf. Solso
1994). Thus, observers perceive the picture on different levels of specificity,
group the elements in a particular way and interpret both WHAT they see
and HOW the picture appears to them. All this is reflected in the process of
picture viewing and picture description.
The previous chapter dealt with temporal relations in picture viewing and pic-
ture descriptions, particularly with the configurations of verbal and visual data,
the unit of comparison and the distribution of the various configuration types
as a function of verbal activity types. In this chapter, we will use the multi-
modal method and the analytic tools described in Chapter 5 in order to com-
pare the contents of verbal and visual clusters. First, we will take a closer look
at the semantic relations between visual and verbal data from on-line picture
descriptions. Second, we will review the results concerning levels of specificity.
Third, we will focus on spatial proximity and mental groupings that appear
during picture description. Fourth, we will discuss the generalisability of our results by presenting results of a priming study conducted with twelve additional
informants. We will also discuss the consequences of the studies for viewing
aspects, the role of eye movements, fixation patterns, and language production
and language planning.
1. Semantic correspondence
In the first section, we will be looking at the content of the visual and verbal
units. In picture viewing and picture description, the observers direct their
visual fixations towards certain objects in the scene. They also focus on objects
from the scene in their spoken language descriptions. With respect to semantic
synchrony, we can ask what type of information is focused on visually when,
for instance, someone describes the tree and the three birds in it. Does she
fixate all the units that the summary is built upon? Is there a one-to-one relationship between the visual and the verbal foci? It is also interesting to consider
whether the order of the described items is identical with the order in which
they were scanned visually.
Clusters of visual fixations on an object can be caused by different perceptual and cognitive processes. The spoken language description can help us
to reveal which features observers are focusing on. Let us look again at the
relation between the visual and the verbal focus in the example presented in
Figure 1: In front of the tree which is curved is a stone.
In the visual stream, the object stone is fixated three times: twice in the
triangle configuration and once in the perfect match configuration (cf.
Chapter 6). If we compare these configurations with the verbal stream, we
discover that the relation between the visual and the verbal focus is different. While their relation is one of semantic identity in the perfect match
(stone = stone), this is not the case in the triangle configuration. Here, the
concrete object stone is viewed from another perspective; the focus is not
on the object itself, but rather on its location (stone = in front of the tree). In
the perfect match configuration, eye movements are pointing at a concrete
object. In the triangle case, on the other hand, there is an indirect semantic
relation between vision and spoken discourse. The observer's eyes are pointing at a concrete object in the scene but, as the verbal description reveals,
the observer is mentally zooming out and focusing on the position
of the object.
This latter relation can be compared with the figure-ground (trajector-landmark) relation in cognitive semantics (Lakoff 1987; Langacker 1987,
1991; Holmqvist 1993). According to cognitive semantics, our concepts of the
world are largely based on image-schemata, embodied structures that help
us to understand new experiences (cf. Chapter 8, Section 1). For instance,
in the relation in front of, the trajector (TR), the location of the object, is
considered to be the most salient element and should thus be focused by the
observer/describer. In fact, when saying in front of the tree, the informant
actually looks at the trajector (= stone) and directs his visual attention to it
(Figure 2). In other words, the saliency in the TR role of the schema co-occurs
with the visual (and mental) focus on the stone, while the stone itself is verbalised much later. It would be interesting to conduct new experiments in order
to verify this finding from spontaneous picture description for other relations
within cognitive semantics.
Nuyts (1996) is right when he notes that Chafe's concept of human cognition is much more dynamic than issues discussed within mainstream cognitive linguistics. Chafe is more process-oriented whereas mainstream cognitive
linguists are more representation-oriented, studying mental structures and is-
sues of a more strictly conceptual-semantic nature. However, as this example
shows, much could be gained by introducing the theoretical concepts from
cognitive semantics, such as landmark, trajector, container, prototype, figure-
ground, source-goal, centre-periphery, image schema etc. as explanatory de-
vices on the processual level.
[Figure: image schema with trajector (TR) and landmark (LM).]
In example (a), the cows are partly viewed as a concrete object and partly as
a location. This time, the location is not expressed with the help of another
concrete object in the scene but in terms of picture composition (in the back-
ground).
In (b), the observer is not fixating one object in the middle as a representation of the location, as we would expect. Instead, the middle is created by
comparing or contrasting the two halves of the scene. The observer delimits
the spatial location with the aid of multiple visual fixations on different objects.
He is moving his eyes back and forth between the two halves of the picture,
fixating similar objects on the left and on the right (Findus 1 and Findus 2, soil
in the left field and soil in the right field) on a horizontal line. After that, he
follows the vertical line of the tree in the middle, fixating the bottom and the
top of it. This dynamic sequence suggests that the observer is doing a cross
with his eyes, as if he were trying to delimit the exact centre of the scene. The
semantic relation between the verbal and the visual foci is implicit; the discrete
objects are conceived of as pars pro toto representations of the two halves of the
picture, suggesting the location of the middle.
Example (c)
(c) 0415 and behind the tree
0416 there go . probably cows grazing
0417 down towards a . stone fence or something
and, finally, the cows are refixated while the perceived direction or path of their
movement is mentioned (down towards a stone fence).
In addition to the object-location relation, the focus can also be on the ob-
ject-attribute relation. During evaluations, observers check the details (form,
colour, contours, size) but also match the concrete, extracted features with the
expected features of a prototypical/similar object: something that looks like a
telephone line or telephone poles . that is far too small in relation to the humans.
They compare objects both inside the picture world and outside of it. By using
metaphors and similes from other domains they compare the animal world
with the human one: one bird looks very human; there's a dragonfly like a double aeroplane.
To summarise the findings concerning the relation between the visual and
verbal foci we can state the following: A visual fixation on an object can mean
that:
When dealing with the specific semantic relationship between the content of
the verbal and the visual spotlight, we can then ask: Is it the object as such that
is visually focused on, or is it some of its attributes (form, size, colour, contours)? Is it the object as a whole that is focused on, or does the observer zoom
in and analyse the details of an object on a finer scale of resolution? Or is the
observer rather mentally zooming out to a more abstract level (weird cat, cat,
animal, strange thing, something)?
As for the verbal categorisation, the main figure, Pettson, can be described
as a person, a man, an old guy, a farmer, or he can be called by his name. His
appearance can be described (a weird guy), his clothes (wearing a hat), the ac-
tivity he is involved in can be specified (he is digging) and this activity can be
evaluated (he is digging frenetically). Thus, we can envision a description on dif-
ferent levels of specificity and with different degrees of creativity and freedom.
The tendencies that could be extracted from the data are the following: In-
formants start either with a specific categorisation which is then verbally modi-
fied (in a potato field or something; the second bird is screaming or something like
that) or, more often, with a vague categorisation, a filler, followed by a speci-
fication: Pettson has found something he is looking at he is looking at the soil.
In other words, a general characteristic of an object is successively replaced by
more specific guesses during the visual inspection: he is standing there looking
at something maybe a stone in his hand that he has dug up.
Informants also use introspection and report on mental states, as in the
following example. After one minute of picture viewing and picture descrip-
tion, the informant verbalises the following idea: when I think about it, it seems
as if there were in fact two different fields, one can interpret it as if they were in
two different fields these persons here (cf. Section 3.4).
In the following, I will show that not only scene-inherent concrete objects or
meaningful groups of objects are focused on, but that new mental groupings
are also created along the way. Let us turn to the aspect of spatial proximity.
Another type of cluster that was perceived as a meaningful unit in the scene
was hills at the horizon. The observer's eyes followed the horizontal line, filling
in links between objects. This cluster was probably a compositionally guided
cluster. The observer is zooming out, scanning picture elements on a compo-
sitional level. This type of cluster is still quite close to the suggested, designed
or scene-inherent meaningful groupings.
What we do not see and know on the basis of eye movements alone is how the
observer perceives the objects on different occasions. This can be traced thanks
to the method combining picture viewing and picture description.
In the case of two different fields, the objects that are refixated represent a
bigger (compositional) portion of the picture and support the observer's reconceptualisation. By mentally zooming out, he discovers an inferential boundary
between parts of the picture that he has not perceived before. The scene origi-
nally perceived in terms of one field has become two fields as the observer gets
more and more acquainted with the picture. This example illustrates the process
during which the observer's perception of the picture unfolds dynamically.
We are now moving further away from the scene-inherent spatial proximity
and approaching the next type of clusters constructed by the observers. This
time, the cluster is an example of an active mental grouping of concrete objects based on an extraction of similar traits and activities (cf. Figure 6, Flying insects). The prerequisite for this kind of grouping is a high level of active
processing. Despite the fact that the objects are distributed across the whole
scene and appear quite differently, they are perceived as a unit because of
the identified common denominator. The observer is mentally zooming out
and creating a unit relatively independent of the suggested meaningful units
in the scene. The eye movements mimic the describer's functional grouping of
objects.
In a number of cases, especially in later parts of the observation, the clusters are
based on thematic aspects. The next cluster lacks spatial proximity (Figure 7).
The observer is verbalising his impression about the picture content: it looks
like early summer. This abstract concept is not identical with one or several
concrete objects compositionally clustered in the scene, and visual fixations
are not guided by spatial proximity. The previous scanning of the scene has
led the observer to an indirect conclusion about the season of the year. In the
visual fixation pattern, we can see large saccades across the whole picture composition. It is obviously a cluster based on a mental coupling of concrete objects,
their parts or attributes (such as flowers, foliage, plants, leaves, colours) on a
higher level of abstraction.
Since the concept early summer is not directly identical with one or sev-
eral concrete objects in the scene, the semantic relation between verbal and
visual foci is inferred. The objects are concrete indices of a complex (abstract)
concept. In addition, the relation between the spoken description and the vi-
sual depiction is not a categorical one, associated with object identification.
Instead, the observer is in a verbal superfocus, formulating how the picture
appears to him on an abstract level. Afterwards, during visual rescanning, the
observer is searching again for concrete objects and their parts as crucial indi-
cators of this abstract scenario. By refocusing these elements, the observer is
in a way collecting evidence for his statement. In other words, he is checking
whether the object characteristics in the concrete scene match the symptoms
of the described scenario. Concrete objects can be viewed differently on differ-
ent occasions as a result of our mental zooming in and out. We have the ability
to look at a single concrete object and simultaneously zoom out and speak
about an abstract concept or about the picture as a whole. In terms of
creativity and freedom, this type of mental grouping shows a high degree of
active processing.
These clusters are comparable to Yarbus' (1967) task-dependent clusters
where
[…] in response to the instruction 'estimate the material circumstances of
the family shown in the picture', the observer paid particular attention to the
women's clothing and the furniture (the armchair, stool, tablecloth). In response to the instruction 'give the ages of the people shown in the picture',
all attention was concentrated on their faces. In response to the instruction
'surmise what the family was doing before the arrival of the unexpected visitor', the observer directed his attention particularly to the objects arranged on
the table, the girl's and woman's hands […]. After the instruction 'remember
clothes worn by the people in the picture', their clothing was examined. The
instruction 'remember position of the people and objects in the room' caused
the observer to examine the whole room and all the objects. […] Finally, the
instruction 'estimate how long the unexpected visitor had been away from
the family' caused the observer to make particularly intensive movements of
the eyes between the faces of the children and the face of the person entering
the room. In this case he was undoubtedly trying to find the answer by studying the expression on the faces and trying to determine whether the children
recognised the visitor or not. (Yarbus 1967:192–193)
Although the informants in my study did not receive any specific instructions,
their description spontaneously resulted in such kinds of functionally determined clusters. These clusters were not experimenter-elicited, as in Yarbus'
case, but naturally occurring (correlational). It was the describers themselves
who created such abstract concepts and scenarios, which in turn provoked
a distributed visual search for corresponding significant details in the scene.
Clusters of this type have temporal proximity but no spatial proximity. They
are clearly top-down guided and appear spontaneously in the later parts of the
observation/description.
These findings have consequences for viewing dimensions (Section 4.2) and
for the function of eye movements (Section 4.3). It should be noted that the spoken
language description may influence eye movements in many respects. Nevertheless, it is not unlikely that patterns based on evaluation and general impression do appear even in free visual scene perception. They may be a natural part
of interpreting scenes and validating the interpretation. I can also imagine that
evaluative patterns can be involved in preference tasks (when comparing two
or more pictures), or in a longer examination of a kitchen scene resulting in
the idea: it looks as if somebody has left it in a hurry. If we want to extract such clusters on the basis of eye movements alone, the problem is that the
spoken language description is needed in order to identify them. If we combine
both modalities, we are better able to detect these kinds of complex clusters.
4. Discussion
The question arises whether the eye movement patterns connected to abstract
concepts (such as early summer) are specific to the task of verbally describing
a picture, or whether they also appear during perception of the same speech.
Are these eye movement patterns caused by the process of having to systemati-
cally structure speech for presentation? Are they limited to situations where
the picture description is generated simultaneously with picture viewing? Do
these patterns appear for speakers only? Or can similar eye movement patterns
be elicited even for viewers who observe the picture after they have listened to
a verbal description of the scene? In order to answer these questions, we con-
ducted a priming study that will be presented in Section 4.1. If we find similar
eye movement patterns even for listeners, this would mean that these patterns
are not connected to the effort of planning a speech but rather to the semantics
of the viewed scene.
The question under investigation was whether we could prime the occurrence
of similar viewing patterns by presenting spoken utterances about picture con-
tent before picture onset.
I would like to thank Richard Andersson for his help with the study.
Figure 8a. Flying insects: Listeners' scanpaths (03, 04, 07 and 09). Can be compared
to the speaker's original scanpath in Figure 6.
Figure 8b. Telephone line: Listeners' scanpaths (03, 04, 07, and 09). Area of interest is
marked with a rectangle on the right.
Figure 8c. Early summer: Listeners' scanpaths (03, 04, 07, and 09). Can be compared
to the speaker's original scanpath in Figure 7.
the picture. This would indicate a) that we will find similar eye movement pat-
terns within the group of listeners and b) that the eye movement produced by
the listeners will be similar to the clusters produced by the original speaker.
In the analysis, I concentrate on visual clusters caused by utterances 1–6
(birds in the tree, two different fields, early summer, telephone line, three birds
in the lilies, flying insects), since I have the original visual patterns to compare
with. It is difficult to statistically measure similarity of scanpaths. It is even
with. It is difficult to statistically measure similarity of scanpaths. It is even
more difficult to compare dynamic patterns, to measure similarity between
spatial AND temporal configurations and to quantitatively capture tempo-
ral sequences of fixations. To my knowledge, no optimal measure has been
found yet.
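One classical workaround, not used in this study and offered here only as an illustration, is the string-edit approach: each fixation is coded by the area of interest it lands in, and two scanpaths are compared by Levenshtein distance over the resulting label sequences. The AOI codes below are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two AOI-label sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def scanpath_similarity(a, b):
    """Normalised similarity in [0, 1]: 1 = identical AOI sequences."""
    longest = max(len(a), len(b))
    return 1 - edit_distance(a, b) / longest if longest else 1.0

# Hypothetical AOI codings: T = tree, B = bird, S = stone, P = Pettson.
speaker  = ["T", "B", "B", "S", "T", "P"]
listener = ["T", "B", "S", "T", "P"]
print(round(scanpath_similarity(speaker, listener), 2))  # 0.83
```

As the text notes, such a measure flattens fixation durations and treats all substitutions alike, so it captures only part of the spatio-temporal similarity.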
Figure 8a–c illustrates scanpath similarity of three statistically significant
scenarios expressed in utterances 3, 4 and 6 (flying insects, telephone line and
early summer) for four randomly chosen listeners (03, 04, 07 and 09).
For the scanpaths elicited by the utterances 3, 4 and 6 (flying insects, tele-
phone line and early summer), I calculated visual fixations on a number of
areas of interest that were semantically important for the described scenario.
The concept early summer does not explicitly say which concrete objects in
the scene should be fixated, but I defined areas of interest indicating the sce-
nario: flowers, leaves, foliage, plants etc., that have also been fixated by the
original speaker. The aim was to show that the listeners' visual fixation patterns within relevant areas of interest were significantly better than chance.
Then the expected value for the time spent on the relevant area, proportional
to its size, was compared with the actual value. A t-test was conducted, telling
us whether the informants looked at the relevant areas of interest significantly
more than chance would predict.
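The chance-baseline logic just described can be sketched as follows. The dwell proportions, the AOI size and the hand-rolled one-sample t statistic are all illustrative assumptions; the study's actual computation may differ in detail.

```python
import math

def one_sample_t(observed, expected_mean):
    """One-sample t statistic: do observed values exceed expected_mean?

    Returns the t statistic and degrees of freedom; the p-value would be
    looked up in a t table (or via scipy.stats, if available)."""
    n = len(observed)
    mean = sum(observed) / n
    var = sum((x - mean) ** 2 for x in observed) / (n - 1)
    t = (mean - expected_mean) / math.sqrt(var / n)
    return t, n - 1

# Hypothetical data: proportion of viewing time each of twelve listeners
# spent inside the 'flying insects' areas of interest (values invented).
aoi_area_fraction = 0.12           # AOIs cover 12% of the picture
dwell_proportions = [0.31, 0.27, 0.40, 0.22, 0.35, 0.29,
                     0.33, 0.26, 0.38, 0.24, 0.30, 0.28]

t, df = one_sample_t(dwell_proportions, aoi_area_fraction)
print(f"t({df}) = {t:.2f}")  # a large positive t: more looking than chance
```

The chance baseline here is the AOI's share of the picture area; dwelling on the AOIs far above that share is what the significant p-values in the next paragraph reflect.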
The visual scanpaths among the twelve listeners were rather similar. Whereas the concepts two fields and birds in the lilies were not significant, in the cases
of flying insects (p = 0.002) and telephone line (p = 0.001), we got significant
results for all listeners. For early summer, the results were partially significant
(large flowers on the left, p = 0.01; flowers left of the tree, p = 0.05). We can
thus conclude that a number of these utterances elicited similar eye movement
patterns even for the group of listeners. This implies that the eye movement
patterns are not restricted to the process of planning and structuring a verbal
description of the scene but are rather connected to the scene semantics. These
results are in line with studies by Noton and Stark (1971a, 1971b, 1971c) who
showed that subjects tend to fixate regions of special interest according to cer-
tain scanpaths.
The results concerning spatial, semantic and mental groupings (this chapter,
Section 3) can be interpreted in terms of viewing dimensions. Studying
simultaneous picture viewing and picture description can help us to understand
the dynamics of the ongoing perception process and, in this way, contribute to
the area at the intersection of artistic and cognitive theories (cf. Holsanova
2006). Current theories of visual perception stress the cognitive basis of art
and scene perception: we "think" art as much as we "see" art (Solso 1994). Our
way of perceiving objects in a scene can be triggered by our expectations, our
interests, intentions, previous knowledge, context or instructions. In his book
Visual Thinking, Arnheim writes: "cognitive operations called thinking are not
the privilege of mental processes above and beyond perception but the essential
ingredients of perception itself. I am referring to such operations as active
exploration, selection, grasping of essentials, simplification, abstraction,
analysis and synthesis, completion, correction, comparison, problem solving, as
well as combining, separating, putting in context" (Arnheim 1969: 13). He draws
the conclusion that visual perception "is not a passive recording of stimulus
material but an active concern of the mind" (Arnheim 1969: 37).
My data confirm the view of active perception: it is not only the recognition
of objects that matters but also how the picture appears to the viewers. Verbal
descriptions include the quality of experience, subjective content and
descriptions of mental states. Viewers report about (i) referents, states and
events, (ii) colours, sizes and attributes, and (iii) compositional aspects, and
they (iv) mentally group the perceived objects into more abstract entities,
(v) compare picture elements, (vi) express attitudes and associations, and
(vii) report about mental states. Thanks to the verbal descriptions, several
viewing dimensions can be distinguished: a content aspect, a quality aspect, a
compositional aspect and a mental aspect.
Chapter 7. Semantic correspondence between verbal and visual data 143
The results of the comparison of verbal and visual data reported in Chapters
6 and 7 also have consequences for the current discussion about the functions
of eye movements. Griffin (2004) gives an overview of the function of
speech-related gazes in communication and language production. The
psycholinguistic eye-tracking studies reported there involve simple visual
stimuli and language production at the phrase or sentence level. However, by
studying potentially meaningful sequences or combinations of eye movements on
complex scenes and simultaneous picture description at a discourse level, we can
extend the list of eye movement functions as follows (cf. Table 3).
Table 3. Functions of eye movement patterns extracted from free on-line
descriptions of a complex picture.

Counting-like gazes. Aid in sequencing in language production when producing
quantifiers, before uttering plural nouns. Example: "three birds", "three . eh
four men digging in the field" (list foci).

Gazes in message planning. Shifting the gaze to the next object to be named;
aid in planning of the next focus; aid in interpretation of picture regions.
Operate on the level of the focus, superfocus and discourse topic (subst, sum,
eval foci).

Gazes reflecting categorising difficulties. Gazing at objects while preparing
how to categorise them within a superfocus; gazes at objects or parts of objects
until the description is retrieved, even during dysfluencies. These gazes
reflect the allocation of mental resources (cat.diff. foci).

Monitoring gazes. Speakers sometimes return their gazes to objects after
mentioning them, in a way that suggests that they are evaluating their
utterances; re-fixations (sum foci).

Comparative gazes (reparative gazes or re-categorisation gazes). Aid in
interpretations; reparation/self-editing of content, changing point of view.
Example: "two fields" (subst foci, meta, eval).

Organisational gazes. Organisation of information, structuring of discourse,
choice of discourse topics. Example: "Now I have described the left hand side"
(introspect, meta foci).

Ideational gazes. Occur when referring to objects, when using a modifier, when
describing an object's location or action (subst, loc foci).

Summarising gazes. Fixating multiple objects that share a common activity or a
common taxonomic category. Example: "flying objects" (sum foci).

Abstract, inferential gaze-production link. Reflected in a speaker's gaze on
flowers and leaves before suggesting that a scene depicts "early summer" (subst,
sum foci).
These results are consistent with Yarbus' (1967) conclusion that eye movements
occur in cycles and that observers return to the same picture elements several
times, and with Noton & Stark (1971b), who coined the term scanpath to describe
the sequential (repetitive) viewing patterns of particular regions of the image.
According to these authors, a coherent picture of the visual scene is
constructed piecemeal through the assembly of serially viewed regions of
interest. In the light of my visual and verbal data, however, it is important to
point out that a refixation on an object can mean something else than the first
fixation. A fixation on one and the same object can correspond to several
different mental contents. This finding also confirms the claim that meaning
relies on our ability to conceptualise the same object or situation in different
ways (Casad 1995: 23).
Fixation patterns can reveal more than single fixations. However, we still
need some aid, some kind of referential framework, in order to infer what ideas
and thoughts these fixations and scanpaths correspond to. As Viviani (1990) and
Ballard et al. (1996) pointed out, there is an interpretation problem: we need
to relate the overt structure of eye scanning patterns to underlying internal
cognitive states. The fixation itself does not indicate what properties of an
object in a scene have been acquired. Usually, the task is used to constrain and
interpret fixations and scanpaths on the objects in the scene. For instance,
Yarbus' (1967) instructions ("give the ages of the people shown in the picture"
etc.) resulted in different scanpaths and allowed a functional interpretation of
the eye movement patterns. This showed which pieces of information had been
considered relevant for the specific task and were therefore extracted by the
informants. However, as we have seen in the analysis of semantic relations, we
can also find similar spontaneous patterns in free picture description, without
there being a specific viewing task. In this case, the informants attempt to
formulate a coherent description of the scene, and their spontaneous verbal
description may be viewed as a source of top-down control. The question is
whether the task offers enough explanation for the visual behaviour or whether
the verbal description is the optimal source of explanation for the functional
interpretation of eye movement patterns. In certain respects, the combination of
visual scanpaths and verbal foci can reveal more about the ongoing cognitive
processes. If we focus on the discourse level and include different types of
foci and superfoci (cf. Chapter 2, Section 1), we can get more information about
the informants' motivations, impressions, attitudes and (categorisation)
problems (cf. Chapter 6, Section 3).
The question arises how the scanning and description processes develop.
Concerning the temporal aspect, we can ask: Does the thought always come before
speech? Do we plan our speech globally, beforehand? Or do we plan locally, on an
associative basis? Do we monitor and check our formulations afterwards? Several
answers can be found in the literature. Levelt (1989) assumes planning on two
levels: macroplanning (i.e. elaboration of a communicative goal) and
microplanning (decisions about the topic or focus of the utterance etc.). Bock &
Levelt (1994, 2004) maintain that speakers outline clause-sized messages before
they begin to sequentially prepare the words they will utter. Linell (1982)
distinguishes between two phases of utterance production: the construction of an
utterance plan (the decision about semantic and formal properties of the
utterance) and the execution of an utterance plan (the pronunciation of the
words). He also suggests a theory that represents a compromise between Wilhelm
Wundt's model of complete explicit thought and Hermann Paul's associative model.
According to Wilhelm Wundt's holistic view (Linell 1982; Blumenthal 1970),
the speaker starts with a global idea (Gesamtvorstellung) that is later analysed
part-by-part and sequentially organised into an utterance. Applied to the
process of visual scanning and verbal description, the observer would have a
global idea of the picture as a whole, as well as of the speech genre 'picture
description'. The observer would then decide what contents she would express
verbally and in what order: whether she would describe the central part first,
the left and the right part of the picture later on, and the foreground and the
background last, or whether she would start from the left and continue to the
right. If such a procedure is followed systematically, the whole visual and
verbal focusing process would be guided by a top-down principle (e.g. by picture
composition). This idea would then be linearised, verbalised and specified in a
stepwise fashion. Evidence against such a holistic approach comes from
hesitations, pauses and repetitions, which reveal that the utterance has not
been planned as a whole beforehand. Evidence supporting this pre-structured way
of description, on the other hand, is reflected in the combination of
summarising foci (sum) followed by a substantive list of items (list).
According to Hermann Paul's associative view, utterance production is a more
synthetic process in which concepts (expressed in words and phrases) are
successively strung together by association processes (Linell 1982: 1). Applied
to our case, the whole procedure of picture viewing and picture description
would develop step by step, each new focus being triggered associatively by the
preceding one.
In the light of my data from free descriptive discourse, not every focus and
superfocus is planned in the same way and to the same extent. We find
(a) evidence for conceptualisation and advanced planning on a discourse level,
in particular in sequences where the latency between the visual examination and
the speech production is very long. We also find evidence for (b) more
associative processes, in particular in sequences where the speakers start
executing their description before they have counted all the instances or
planned the details of the whole utterance production ('I see three eh . four
Pettsons doing different things', 'there are one three birds doing different
things'). Finally, we find evidence for (c) monitoring activities, where the
speakers afterwards check the expressed concrete or abstract concept against the
visual encounter. I therefore agree with Linell's view that the communicative
intentions may be partly imprecise or vague from the start and become gradually
more structured, enriched, precise and conscious through the verbalisation
process.
5. Conclusion
The aim of the first section has been to look more closely at the semantic
relations between verbal and visual clusters. My point of departure has been
complex ideas expressed as verbal foci and verbal superfoci in free simultaneous
spoken language descriptions, and processed eye movement data from viewing a
complex picture, both aligned in time and displayed on the multimodal score
sheets. Using a sequential method, I have compared the contents of the verbal
and visual spotlights and thereby also shed light on the underlying cognitive
processes. Three aspects have been analysed in detail: the semantic
correspondence, the level of specificity and the spatial proximity in connection
with the creation of new mental units.
The semantic relations between the objects focused on visually and described
verbally were often implicit or inferred. They varied between object-object,
object-location, object-path, object-activity and object-attribute relations.
Informants were not only judging the objects' size, form, location,
prototypicality, similarity and function, but also formulating their impressions
and associations. It has been suggested that the semantic correspondence between
verbal and visual foci is comparable on the level of larger units and sequences,
such as the superfocus.
The combination of visual and verbal data showed that objects were focused
on and conceptualised on different levels of specificity. The dynamics of the
observers' on-line considerations could be followed, ranging from vague
categorisations of picture elements, over comments on one's own expertise,
mentions of relevant extracted features and formulations of more specific
guesses about an object category, to evaluations. Objects' locations and
attributes were part of these on-line considerations as well.
I have pointed out the relevance of the combination of visual and verbal data
for the delimitation of viewing aspects, for the role of eye movements and
fixation patterns, and for the area of language production and planning.
This chapter has been concerned with the semantic correspondence between
verbal and visual data. In the following chapter, I will present studies on
mental imagery associated with picture viewing and picture description.
chapter 8
Picture viewing, picture description and mental imagery
That we can conceptualise the same objects and scenes in different ways has
been demonstrated in Chapter 7. Semantic relations between visual and verbal
foci in descriptive discourse ('in front of the tree is a stone') were explained
by introducing some of the theoretical concepts from cognitive semantics, such
as landmark, trajector, container, prototype, figure-ground, source-goal,
centre-periphery, image schema etc. (Chapter 7). The question is whether
speakers and listeners think in images during discourse production and discourse
understanding.
('squeeze more out of them'). When speakers use these formulations in
communication, they evoke mental images in the hearers, and these images serve
as an important resource for mutual understanding (cf. Table 1).
Holmqvist (1993) makes use of image schemas when he describes discourse
understanding in terms of evolving mental images. Speakers' descriptive
discourse contains concepts that appear as schematic representations and
establish patterns of understanding. When speakers want to describe a complex
visual idea, e.g. about a scene, a navigation route or an apartment layout (cf.
Labov & Linde 1975; Taylor & Tversky 1992), they have, depending on their goals,
to organise the information so that their partner can understand it (Levelt
1981). By uttering ideas, speakers evoke images in the minds of the listeners,
the consciousness of the speaker and the listeners gets synchronised, and the
listeners co-construct the meanings.
In face-to-face spontaneous conversation, this process has a more dynamic
and cooperative character (Clark 1996). The partners try to achieve joint
attention, formulate complementary contributions and interactively adjust their
visualisations. Quite often, the partners draw simultaneously with their verbal
descriptions. Sketches and drawings are external spatial-topological
representations reflecting the conceptualisation of reality and serving as an
aid for our memory (Tversky 1999). The partners must create a balance between
what is being said and what is being drawn. The utterances and non-verbal
actions (such as drawing and gesturing) can be conceived of as instructions to
the listeners about how to change the meaning, how something is perceived, how
one thinks or feels, or what one wants to do with something that is currently in
the conscious focus (Linell 2005). Drawing plays an important role in
descriptive discourse (Chapter 3, Section 2), as the following example
illustrates:
Example 1
here is the whole spectrum,
here is
much money
and very good quality,
Mhmh
they do a good job,
but they know it costs a little more
to do a good job,
Mhm
... then we have down here we have
the fellows who come from Italy
and all those countries,
they spend the money quickly
and they don't care,
... [mhm]
so we have more or less
Scandinavians and Scots up here
(...)
now we can build roads and all this
stuff.
(...)
.. then they trade down here,
.... mhm
.... and of course when . these have
been given work enough times
We claimed that the consciousness of the speaker and of the listeners is
synchronised, that they create a joint attention focus and co-construct meaning.
This is achieved by lexical markers of verbal foci, by drawing and by pointing.
Let us first look at how the movement of a conscious focus of attention is
reflected in the verbal description. The speaker marks the topic shifts and
transitions between verbal foci in the unfolding description with the help of
discourse markers (cf. Chapter 3, Section 1.1.1; Holmqvist & Holsanova 1997).
For instance, 'then' and 'and then' mark a progression within the same
superfocus, whereas 'and now' signals moving to a new superfocus. 'So' marks
moving back to a place already described, preparing the listener for a general
summary, and the expressions 'and now it's like this', 'and then they do like
this' serve as a link to the following context, preparing the listener for a
more complex explication to follow.
The describer guides the listener's attention by lexical markers like 'then
down here', meaning: now we are going to move the focus (regulation), we are
moving it to this particular place (direction/location, deixis), and we are
going to stay in this neighbourhood for some time (planning/prediction).
Chapter 8. Picture viewing, picture description and mental imagery 157
As we have seen earlier, image schemata such as 'in front of' are similar to
the eye movement patterns during a spoken picture description (Chapter 7). This
indicates that cognitive schemata might also be important for the speakers
themselves. On the other hand, some informants report that they are verbal and
not visual thinkers, and research on visual-spatial skills has shown that there
are individual differences (Hegarty & Waller 2004, 2006; Chapter 4, Sections 1
and 2). In order to verify the assumption that we use our ability to create
pictures in our minds, we conducted a series of studies on mental imagery during
picture description. The results of these studies contribute to our
understanding of how speakers connect eye movements, visualisations and spoken
discourse to a mental image.
The reader might ask: What is mental imagery and what is it good for? Finke
(1989: 2) defines mental imagery as 'the mental invention or recreation of an
experience that in at least some respects resembles the experience of actually
perceiving an object or an event, either in conjunction with, or in the absence
of, direct sensory stimulation' (cf. also Finke & Shepard 1986).
In the first study, twelve informants (six female and six male students at
Lund University) listened to a pre-recorded spoken scene description and later
retold it from memory. The goal of this study was to extend the previous
findings (Demarais & Cohen 1998; Spivey & Geng 2001; Spivey, Tyler, Richardson &
Young 2000) in two respects. First, instead of only studying simple directions,
we focused on the complexity of the spatial relations (expressions like 'at the
centre', 'at the top', 'between', 'above', 'in front of', 'to the far right',
'on top of', 'below', 'to the left of'). Second, apart from measuring eye
movements during the listening phase, we added a retelling phase where the
subjects were asked to freely retell the described scene from memory. Eye
movements were measured during both phases. To our knowledge, these aspects had
not been studied before. In addition, we collected ratings of the vividness of
imagery during both the listening and the retelling phase and asked the subjects
whether they usually imagine things in pictures or in words.
The pre-recorded description was the following (here translated into
English):

Imagine a two-dimensional picture. At the centre of the picture, there is a
large green spruce. At the top of the spruce a bird is sitting. To the left of
the spruce, and to the far left in the picture, there is a yellow house with a
black tin roof and white corners. The house has a chimney on which a bird is
sitting. To the right of the large spruce, and to the far right in the picture,
there is a tree, which is as high as the spruce. The leaves of the tree are
coloured in yellow and red. Above the tree, at the top of the picture, a bird is
flying. Between the spruce and the tree, there is a man in blue overalls, who is
raking leaves. In front of the spruce, the house, the tree and the man, i.e.
below them in the picture, there is a long red fence, which runs from the
picture's left side to the picture's right side. At the left side of the
picture, a bike is leaning against the fence, and just to the right of the bike
there is a yellow mailbox. On top of the mailbox a cat is sleeping. In front of
the fence, i.e. below the fence in the picture, there is a road, which leads
from the picture's left side to the picture's right side. On the road, to the
right of the mailbox and the bike, a black-haired girl is bouncing a ball. To
the right of the girl, a boy wearing a red cap is sitting and watching her. To
the far right on the road, a lady wearing a big red hat is walking with books
under her arm. To the left of her, on the road, a bird is eating a worm.
. The initial Swedish verb phrase was 'Föreställ dig' ('imagine'), which is
neutral with respect to the modality (image or word) of thinking.
Figure 4. iView analysis of the first 67 seconds for one subject. (A) 0-19 sec:
the spruce and the bird at its top. (B) 19-32 sec: the house to the left of the
spruce, with a bird on top of the chimney. (C) 32-52 sec: the tree to the right
of the house and the spruce. (D) 52-67 sec: the man between the spruce and the
tree, and the fence in front of them, running from left to right.
Spatial schematics for the objects in the pre-recorded description can be seen
in Figure 3. The experiment consisted of two main phases, one listening phase
in which the subjects listened to the verbal description, and one retelling phase
in which the participants retold the description they had listened to in their
own words. Eye movements were recorded both while subjects listened to the
spoken description and while they retold it.
1. When an eye movement is moving from one object to another during the
description or the retelling, it must move in the correct direction.
2. In the listening phase, the eye movement from one position to another must
appear within 5 seconds after the object is mentioned in the description.
3. In the retelling phase, the eye movement from one position to another must
appear within 5 seconds before or after the subject mentions the object.
The key difference between global and local correspondence is that global
correspondence requires fixations to take place at the categorically correct
spatial position relative to the whole eye-tracking pattern, whereas local
correspondence only requires that the eyes move in the correct direction between
two consecutive objects in the description. Examples and schematics of this can
be seen in Figures 5 and 6. 'No correspondence' was coded if neither the
criteria for local correspondence nor those for global correspondence were
fulfilled (typically, when the eyes did not move or moved in the wrong
direction).
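The three-way coding can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' actual coding software: positions are (x, y) points, "correct direction" is taken as an angle below 90 degrees between the expected and observed movement vectors, and the coarse grid standing in for "categorically correct spatial position" is an invented simplification.

```python
import math

def direction(a, b):
    """Unit vector pointing from position a to position b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy) or 1.0
    return (dx / norm, dy / norm)

def classify(prev_fix, curr_fix, prev_obj, curr_obj, grid_cell_of=None):
    """Return 'global', 'local' or 'none' for one object-to-object move."""
    if prev_fix == curr_fix:                 # eyes did not move at all
        return "none"
    want = direction(prev_obj, curr_obj)     # direction in the described scene
    got = direction(prev_fix, curr_fix)      # direction the eyes actually moved
    # Correct direction: angle between the vectors below 90 deg (dot product > 0).
    if want[0] * got[0] + want[1] * got[1] <= 0:
        return "none"
    # Global correspondence additionally demands that the fixation lands in the
    # categorically correct region (here: the same cell of a coarse grid).
    if grid_cell_of and grid_cell_of(curr_fix) == grid_cell_of(curr_obj):
        return "global"
    return "local"

# Example: the next object lies to the right; the eyes also move rightward but
# land in the wrong grid cell, so only local correspondence is credited.
cell = lambda p: (p[0] // 100, p[1] // 100)
print(classify((50, 50), (120, 60), (0, 0), (300, 0), cell))  # prints "local"
```

The design mirrors the asymmetry in the text: the local criterion only inspects the movement vector, while the global criterion also inspects where the movement ends relative to the overall pattern.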
For a few subjects, some eye movements were re-centred and shrunk into a
smaller area (thus yielding more local correspondence). However, the majority of
eye movements kept the same proportions during the listening phase and the
retelling phase. A comparison of one and the same person's eye movement patterns
during the listening and retelling phases can be seen in Figure 7.
Figure 5. (A) Example of mostly global correspondences. (B) Example of mostly
local correspondences. (C) Example of no correspondences at all.
Figure 7. Comparison of one person's eye movement patterns during the listening
(A) and retelling (B) phases.
Results for correct eye movements were significant during both the listening
and the retelling phase, in both the local and the global correspondence coding.
When listening to the pre-recorded scene description (and looking at a white
board), 54.8 percent of the eye movements were correct in the global
correspondence coding and 64.3 percent in the local correspondence coding. In
the retelling phase, more than half of all objects mentioned had correct eye
movements according to the conservative global correspondence criteria (55.2
percent; p = 0.004). Resizing effects, i.e. informants shrinking, enlarging or
stretching the image, were quite common during picture description. It was also
common that informants re-centred the image from time to time, thus yielding
local correspondence. When re-centring and resizing of the image were allowed
for, as with local correspondence, almost three quarters of all objects had
correct eye movements (74.8 percent; p = 0.0012). The subjects' spatial pattern
of eye movements was highly consistent with the original spatial arrangement.
In the second study, we asked another twelve informants (six female and six
male students from Lund University) to look at a complex picture for a while and
then describe it from memory. We chose Sven Nordqvist's (1990) picture again as
a complex visual stimulus. The study consisted of two main phases, a viewing
phase in which the informants inspected the stimulus picture and a description
phase in which the participants described this picture from memory in their own
words while looking at a white screen. Eye movements were recorded during both
phases. At the beginning of the viewing phase, each informant received the
following instructions:
You will see a picture. We want you to study the picture as thoroughly as
possible and to describe it afterwards.
The picture was shown for about 30 seconds and was then covered by a white
screen. The following description phase was self-paced: the informants usually
took 1-2 minutes to describe the picture. After the session, the informants were
asked to rate the vividness of their visualisation during the viewing and the
retelling phase on a scale ranging from 1 to 5. They were also asked to assess
whether they usually imagine things in pictures or in words.
The descriptions were transcribed in order to analyse which picture elements
were mentioned and when. The eye movements were then analysed according to
objects derived from the descriptions. For instance, when an informant
formulated the following superfocus,
01:20 And ehhh to the left in the picture
01:23 there are large daffodils,
01:26 it looks like there were also some animals there perhaps,
we would expect the informant to move her eyes towards the left part of the
white screen during the first focus. Then it would be plausible to inspect the
referent of the second focus (the daffodils). Finally, we could expect the
informant to dwell for some time within the 'daffodil area' on the white screen,
searching for the animals (three birds, in fact) that were sitting there in the
stimulus picture.
The following criteria were applied in the analysis in order to judge whether
correct eye movements occurred. Eye movements were considered correct in local
correspondence when they moved from one position to another in the correct
direction within a certain time interval. Eye movements were considered correct
in global correspondence when moving from one position to another and finishing
in a position that was spatially correct relative to the whole eye-tracking
pattern of the informant (for a detailed description of our method, cf.
Johansson et al. 2005, 2006). We tested the significance of the difference
between the number of correct eye movements and the number of correct movements
expected by chance.
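A comparison of this kind against a chance baseline can be sketched, for instance, as a binomial test. This is an illustrative assumption: the study's actual chance model is described in Johansson et al. (2005, 2006), and the counts and chance probability below are invented.

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k correct moves."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented example: 28 of 40 coded eye movements were 'correct', while a
# randomly directed movement would satisfy the criteria with probability 0.25.
n_moves, n_correct, p_chance = 40, 28, 0.25
p_value = binom_sf(n_correct, n_moves, p_chance)
print(f"p = {p_value:.2e}")  # far below .05: better than chance
```

Observing 28 correct movements where chance predicts about 10 yields a vanishingly small p-value, which is the logic behind calling such proportions significantly above chance.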
Our results were significant both in the local correspondence coding (74.8
percent correct eye movements, p = 0.0015) and in the global correspondence
coding (54.9 percent correct eye movements, p = 0.0051). The results suggest
that informants visualise the spatial configuration of the scene as a support
for their descriptions from memory. The effect we measured is strong: more than
half of all picture elements mentioned had correct eye movements according to
the conservative global correspondence criteria, and allowing for re-centring
and resizing of the image, as with local correspondence, almost three quarters
of all picture elements had correct eye movements. Our data indicate that eye
movements are driven by the mental record of the object positions and that
spatial locations are to a high degree preserved when describing a complex
picture from memory.
Despite the fact that the majority of the subjects had appropriate imagery
patterns, we found no correlation between the subjects' ratings of their own
visualisations and the degree of correct eye movements, either for the viewing
phase or for the retelling phase. The subjects' assessments of whether they
usually think in words or pictures were distributed across four possibilities:
(a) words, (b) pictures, (c) a combination of words and pictures, (d) no guess.
Again, a simple correlation analysis showed no correlation between these
assessments and the degree of correct eye movements, either for the viewing or
for the retelling phase. One possible interpretation
. The effect was equally strong for verbal elicitations (when the informants
listened to a verbal, pre-recorded scene description instead of viewing a
picture) and could also be found in complete darkness (cf. Johansson et al.
2006).
Figure 8. One and the same informant: viewing phase (A) and description phase (B).
might be that people in general are not aware of which mental modality they
are thinking in.
Overall, there was good similarity between the data from the viewing and the
description phases, as can be seen in Figure 8.
According to Kosslyn (1994), distance, location and orientation of the
mental image can be represented in the visual buffer, and it is possible to shift
attention to certain parts or aspects of it. Laeng & Teodorescu (2001) interpret
their results as a confirmation that eye movements play a functional role during
image generation. Mast and Kosslyn (2002) propose, similarly to Hebb (1968),
that eye movements are stored as spatial indexes that are used to arrange the
parts of the image correctly. Our results can be interpreted as further evidence
that eye movements play a functional role in visual mental imagery and that
eye movements indeed are stored as spatial indexes that are used to arrange the
different parts correctly when a mental image is generated.
There are, however, alternative interpretations. Researchers within the
embodied view claim that instead of relying on a mental image, we use features
in the external environment. An imagined scene can then be projected onto those
external features, and any storing of the whole scene internally would be
unnecessary. Ballard et al. (1996, 1997) suggest that informants leave behind
deictic pointers to locations of the scene in the environment, which may later
be perceptually accessed when needed. Pylyshyn (2001) has developed a somewhat
similar approach to support propositional representations and speaks about
visual indices (cf. also Spivey et al. 2004).
Another alternative account is the perceptual activity theory, suggesting that instead of storing images, we store a continually updated and refined set of procedures or schemas that specify how to direct our attention in different situations (Thomas 1999). In this view, a perceptual experience consists of an ongoing exploratory activity rather than a static internal picture.
4. Conclusion
This chapter has dealt with mental imagery and external visualisations in con-
nection with descriptive discourse. As we have seen, in a naturally occurring
conversation, external visualisations help the partners to achieve a joint focus
of attention and to coordinate and adjust their mental images during mean-
ing-making. External visual representations such as drawings are central to
learning and reasoning processes. They can be manipulated, changed and are
subject to negotiations. The partners can work with patterns and exemplars
standing for abstract concepts. Apart from the spatial domain, drawings can be used for other conceptual domains: the non-spatial domain (time, money), the abstract domain (contrast, intensity, quality), the dynamic domain (stages in a process), etc.
However, as we have seen, mental imagery and inner visualisations of
different kinds are also important for the speakers themselves. In a study on
picture viewing, picture description and mental imagery, a significant similar-
ity was found between (a) the eye movement patterns during picture viewing
and (b) those produced during picture description (when the informants were
looking at a white screen). The eye movements closely reflected the content and
the spatial relations of the original picture, suggesting that the informants cre-
ated some sort of mental image as an aid for their descriptions from memory.
Apart from that, even verbal descriptions engaged mental imagery and elicited
eye movements that reflect spatiality (Johansson et al. 2005, 2006).
Mental imagery and mental models are useful for educational methods
and learning strategies. In the area of visuo-spatial learning and problem-solv-
ing, it is recommended to use those external spatial-analogical representations
(charts, geographical layouts, diagrams, etc.) that closely correspond to the users' mental models.
Our ability to picture something mentally is also relevant for design and
human-computer interaction, since humans interact with systems and objects
based on how they believe the system works or how the objects should be used.
The issue of usability is thus tightly connected to the extent to which external
representations correspond to our visualisations of how things function.
For a user's interaction with a format containing multiple representations (texts, photos, drawings, maps, diagrams and graphics), it is important that the message is structured in a coherent way, so that the user has no difficulty conceptualising, processing and integrating information from different sources with her own visualisation and experience (Holsanova et al., forthc.). However, we
Concluding chapter
I hope that, by now, the reader can see the advantages of combining discourse
analysis with cognitively oriented research and eye movement tracking. I also
hope that I have convincingly shown how spoken descriptive discourse and eye
movement measurements can, in concert, elucidate covert mental processes.
This concluding chapter looks back on the most important issues and find-
ings in the book and mentions some implications of the multimodal approach
for other fields of research. The way speakers segment discourse and create
global and local transitions reflects a certain cognitive rhythm in discourse
production. The flow of speech reflects the flow of thoughts. In Chapter 1, I
defined the most important units of spoken descriptive discourse that reflect human attention: the verbal focus and the verbal superfocus. I showed that listeners'
intuition about discourse boundaries and discourse segmentation is facilitated
when the interplay of prosodic and acoustic criteria is further confirmed by
semantic criteria and lexical markers. Also, it is easier for listeners to identify
boundaries at the higher levels of discourse, such as the superfocus.
The way we create meaning from our experience and describe it to others
can be understood in connection with our general communicative ability: We
partly talk about WHAT we experienced by selecting certain referents, states
and events and by grouping and organising them in a certain way, but we also
express our attitudes and relate to HOW these referents, states and events ap-
peared to us. A taxonomy of foci reflecting these different categorising and inter-
preting activities that the speakers are involved in was developed in Chapter 2.
Seven different types of foci have been identified, serving three main discourse
functions. Substantive, summarising and localising foci are typically used for
presentation of picture contents, attitudinal meaning is expressed in evaluative
and expert foci, and a group of interpersonal, introspective and meta-textual
foci serves the regulatory and organising function.
This taxonomy of foci could be generalised to different settings but the dis-
tribution of foci varied. For instance, the summarising foci dominated in a set-
ting where the picture was described from memory, whereas substantive foci
dominated in simultaneous descriptions in a narrative setting. When spatial
aspects of the scene were focused on, the proportion of localising foci was
significantly higher. Furthermore, an interactive setting promoted a high pro-
portion of evaluative and expert foci in a situation where the informants were
expressing their attitudes to the picture content, making judgements about the
picture as a whole, about properties of the picture elements and about relations
between picture elements. Introspective and meta-textual foci were also more
frequent in a situation where the listener was present.
In spoken descriptions, we can only focus our attention on one particular
aspect at a time, and the information flow is divided into small units of speech.
These segments are either linear or embedded. It happens that we make a digression, a step aside from the main track of our thoughts, and spend some time on comments, but we usually succeed in coming back to the main track, finishing the previous topic and starting on a new one. Sometimes, we must mentally
reorient at transitions between segments. In the process of meaning making,
both speakers and listeners try to connect these units into a coherent whole.
In Chapter 3, I showed how speakers connect the subsequent steps in their
description and thereby create discourse coherence. Discourse markers reveal
the structuring of the speech, introduce smaller and larger steps in the descrip-
tion and mark the linear (paratactic) and the embedded (hypotactic) segments
in discourse. Also, there are different degrees of mental distance between steps
of description, reflected in discontinuities at the transition between foci and
superfoci. This phenomenon has been interpreted in terms of the internal or
external worlds the speaker moves between. The largest hesitations and the
longest pauses were found at the transitions where the speaker steps out of the
description and turns to the meta-textual and interactional aspects.
An analysis of a spontaneous description with drawing, where the speaker is trying to achieve a certain visualisation effect for the listeners, has shown that the interlocutors, despite the complexity of the hierarchical structure,
- can retain nominal and pronominal references for quite a long time,
- can simultaneously focus on a higher and a lower level of abstraction,
- can handle multiple discourse-mediated representations of visually present and mentally imagined objects,
- can attend to the same objects with another idea in mind.
The dissociation between the visual and mental representations as well as the
simultaneous handling of multiple discourse-mediated representations on dif-
ferent levels of abstraction is made possible (a) by the partners switching be-
tween active and semiactive information, (b) by joint attention and (c) by the
use of mutual visual access (e.g. by observing each other's pointing, gazing and drawing). The drawing as an external visual representation fulfils many functions in addition to the spoken discourse: it serves as a referent storage, an external memory aid for the interlocutors, a basis for the visualisation of imaginary events and scenarios, and a representation of the whole topic of the conversation. The fact that partners in a conversation can handle multiple discourse-mediated representations of visually present and mentally imagined objects and scenarios contributes to the theory of mind.
Different individuals focus on different aspects in their picture descrip-
tions. In Chapter 4, I characterised and exemplified the two dominant styles
found in the data and discussed various cognitive, experiential and contextual
factors that might have given rise to these styles. Whereas attending to spatial
relations is dominant in the static description style where the picture is de-
composed into fields that are then described systematically, with a variety of
terms for spatial relations, attending to the flow of time is the dominant pat-
tern in the dynamic description style, where the informants primarily focus
on temporal relations and dynamic events in the picture, talk about steps of
a process, successive phases, and a certain order. The quality of the dynamic
style is achieved by a frequent use of temporal verbs, temporal adverbs and
motion verbs in an active voice. Discourse markers are often used to focus and
refocus on the picture elements, and to interconnect them. Apart from that,
the informants seem to follow a narrative schema: the descriptions start with
an introduction of the main characters, their involvement in various activities
and a description of the scene. The extracted description styles are further dis-
cussed in the framework of studies on individual differences and remember-
ing. Connections are drawn to studies about visual and verbal thinkers and
spatial and iconic visualisers. I also showed that spatial and narrative priming
has effects on the description style. Spatial priming leads to a larger number of
localisations, significantly fewer temporal expressions and significantly shorter
static descriptions, whereas narrative priming mostly enhances the temporal
dynamics in the description.
The first four chapters focused on various characteristics of picture descrip-
tions in different settings and built up a basis for a broader comparison be-
tween picture descriptions, picture viewing and mental imagery presented in
Chapters 5–8.
The multimodal method and the analytical tool, the multimodal time-coded score sheet, were introduced in Chapter 5. Complex ideas formulated in
the course of descriptive discourse were synchronised with fixation patterns
during visual inspection of the complex picture. Verbal and visual data have
been used as two windows to the mind. With the help of this method, I was
able to synchronise visual and verbal behaviour over time, follow and com-
pare the content of the attentional spotlights on different discourse levels,
and extract clusters in the visual and verbal flow. The method has been used
when studying temporal and semantic correspondence between verbal and
visual data in Chapters 6 and 7. By incorporating different types of foci and
superfoci in the analysis, we can follow the eye gaze patterns during specific
mental activities.
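The kind of synchronisation described above can be illustrated with a small sketch. The following Python fragment is a deliberately minimal, hypothetical simplification of the multimodal time-coded score sheet (the data structures and function names are my own assumptions, not the book's actual tool): it pairs each verbal focus with the visual fixation clusters that overlap it in time.

```python
# Minimal sketch of multimodal time-coded alignment: each verbal focus
# (an utterance segment with start/end times) is paired with the fixation
# clusters that overlap it in time. All structures here are hypothetical
# simplifications of the score sheet described in the text.

def overlaps(a_start, a_end, b_start, b_end):
    """True if the two time intervals overlap."""
    return a_start < b_end and b_start < a_end

def align(verbal_foci, fixation_clusters):
    """Pair each verbal focus with temporally overlapping fixation clusters.

    verbal_foci: list of (start_ms, end_ms, description)
    fixation_clusters: list of (start_ms, end_ms, object_label)
    Returns a list of (description, [object_label, ...]) rows.
    """
    rows = []
    for v_start, v_end, description in verbal_foci:
        hits = [label for f_start, f_end, label in fixation_clusters
                if overlaps(v_start, v_end, f_start, f_end)]
        rows.append((description, hits))
    return rows

# Example: the speaker mentions the house while fixating it, and already
# glances ahead to the tree before mentioning it (a "preparatory glance").
verbal = [(0, 1200, "there is a house"), (1200, 2400, "and a tree")]
visual = [(100, 900, "house"), (800, 2300, "tree")]
print(align(verbal, visual))
# → [('there is a house', ['house', 'tree']), ('and a tree', ['tree'])]
```

Note how the sketch reproduces, in miniature, the phenomenon discussed in Chapters 6 and 7: temporal overlap without semantic correspondence, as when a fixation on the tree falls within the verbal focus on the house.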
Chapter 6 focused on temporal relations. From a free description of a
complex scene, I extracted configurations from verbal and visual data on a
focus and superfocus level. As a result, I found complex patterns of eye gaze and
speech.
The first question to be answered concerned temporal simultaneity be-
tween the visual and verbal signal. I found that the verbal and the visual sig-
nals were not always simultaneous. The visual focus was often ahead of speech
production. This latency was due to conceptualisation, planning and formu-
lation of a free picture description on a discourse level, which affected the
speech-to-gaze alignment and prolonged the eye-voice latency. Visual focus
could, however, also follow speech (i.e. a visual fixation cluster on an object
could appear after the describer had mentioned it). In these cases, the describ-
er was probably monitoring and checking his statement against the visual ac-
count. In some instances, there was temporal simultaneity between the verbal
and visual signals but no semantic correspondence (when informants dur-
ing a current verbal focus directed preparatory glances towards objects to
be described later on).
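The eye-voice latency discussed here can be operationalised very simply. As a sketch, under the assumption that we have, for each mentioned object, the onset of its first fixation cluster and the onset of its verbal mention (the function name and data are hypothetical):

```python
def eye_voice_latency(fixation_onsets, mention_onsets):
    """Latency (ms) between first fixation on an object and its mention.

    Positive values mean the eyes were ahead of speech (planning and
    formulation); negative values mean the mention preceded the fixation
    (e.g. monitoring or checking a statement against the picture).
    """
    return {obj: mention_onsets[obj] - fixation_onsets[obj]
            for obj in mention_onsets if obj in fixation_onsets}

# The eyes reach the "house" 700 ms before it is named; the "bird" is
# fixated only after being mentioned (checking against the picture).
fix = {"house": 300, "bird": 2500}
speech = {"house": 1000, "bird": 2100}
print(eye_voice_latency(fix, speech))  # {'house': 700, 'bird': -400}
```

The sign of the latency thus distinguishes the two cases described in the text: eyes ahead of speech versus speech checked against the visual account.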
The second question concerned the order of the objects focused on visu-
ally and verbally. The empirical results showed that the order of objects fixated
may, but need not always be the same as the order in which the objects were
introduced within the verbal focus or superfocus. For instance, some of the
inspected objects were not mentioned at all in the verbal description, some
of them were not labelled as discrete entities but were instead included later, on
a higher level of abstraction. In the course of one verbal focus, preparatory
glances were passed and visual fixation clusters landed on new objects, long
before these were described verbally. Also, areas and objects were frequently
re-examined and a re-fixation on one and the same object could be associated
with different ideas.
and, afterwards, checked it against the visual encounter. I therefore agree with
Linell's (1982) view that the communicative intentions may be partly imprecise
or vague from the start and become gradually more structured, enriched, pre-
cise and conscious through the verbalisation process.
Chapter 8 was concerned with the role of mental imagery and external
visualisations in descriptive discourse. In a naturally occurring conversation,
external visualisations help the partners to achieve a joint focus of attention
and to coordinate and adjust their mental images during meaning-making.
However, as we have seen, inner visualisations and, in particular, mental im-
ages, are also important for the speakers themselves. In a study of picture
viewing, picture description and mental imagery, a significant similarity was
found between (a) the eye movement patterns during picture viewing and (b)
those produced during picture description (when the picture was removed
and the informants were looking at a white screen). The eye movements close-
ly reflected the content and the spatial relations of the original picture, sug-
gesting that the informants created a sort of mental image as an aid for their
descriptions from memory. Eye movements were thus not dependent on a
present visual scene but on a mental record of the scene. In addition, even
verbal scene descriptions evoked mental images and elicited eye movements
that reflect spatiality.
Let me finally mention some implications for other fields of research. The
multimodal method and the integration patterns discovered can be applied for
different purposes. It is currently being implemented in a project concerning on-line written picture descriptions, where we analyse the verbal and visual flow to get an enhanced picture of the writer's attention processes (Andersson
et al. 2006). Apart from that, there are many interesting applications of integra-
tion patterns within evaluation of design, interaction with multimodal inter-
active systems and learning. The integration patterns discovered in our visual
and verbal behaviour can contribute to the development of a new generation
of multimodal interactive systems (Oviatt 1999). In addition, we would be able
to make a diagnosis about the current user activity and predictions about their
next move, their choices and decisions (Bertel 2007). In consequence, we could
use this information on-line to support users' individual problem-solving strategies and preferred ways of interacting. The advantages of using a multimodal method are threefold: it gives more detailed answers about cognitive processes and the ongoing creation of meaningful units, it reveals the rationality behind the informants' behaviour (how they behave and why, what
expectations and associations they have) and it gives us insights about users' attitudes towards different solutions (what is good or bad, what is easy or difficult
etc.). In short, the sequential multimodal method can be successfully used for
a dynamic analysis of perception and action in general.
References
Aijmer, K. (1988). Now may we have a word on this: The use of now as a discourse particle. In M. Kytö, O. Ihalainen & M. Rissanen (Eds.), Corpus Linguistics, Hard and Soft. Proceedings of the Eighth International Conference on English Language Research on Computerized Corpora, 15–33.
Aijmer, K. (2002). English Discourse Particles. Evidence from a corpus. Studies in Corpus
Linguistics. John Benjamins: Amsterdam.
Allport, A. (1989). Visual attention. In M. I. Posner (Ed.), Foundations of Cognitive Science. Cambridge, MA: MIT Press, 631–682.
Allwood, J. (1996). On Wallace Chafe's 'How consciousness shapes language'. Pragmatics & Cognition, 4(1), 1996. Special issue on language and consciousness, 55–64.
Andersson, B., Dahl, J., Holmqvist, K., Holsanova, J., Johansson, V., Karlsson, H., Strömqvist, S., Tufvesson, S., & Wengelin, Å. (2006). Combining keystroke logging with eye tracking. In L. Van Waes, M. Leijten & C. Neuwirth (Eds.), Writing and Digital Media. Elsevier (North Holland), 166–172.
Arnheim, R. (1969). Visual Thinking, Berkeley, University of California Press, CA.
Baddeley, A. & Lieberman, K. (1980). Spatial working memory. In R. Nickerson (Ed.), Attention and performance (Vol. VIII, pp. 521–539). Hillsdale, NJ: Lawrence Erlbaum
Associates, Inc.
Baddeley, A. (1992). Is working memory working? The fifteenth Bartlett lecture. The Quarterly Journal of Experimental Psychology, 44A, 1–31.
Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20, 1311–1328.
Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (1996). Deictic Codes for the Em-
bodiment of Cognition. CUP: Cambridge.
Bangerter, A. & Clark, H. H. (2003). Navigating joint projects with dialogue. Cognitive Science, 27, 195–225.
Barsalou, L. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22, 577–660.
Bartlett, F. C. (1932, reprinted 1997). Remembering. Cambridge: Cambridge University Press.
Beattie, G. (1980). Encoding units in spontaneous speech. In H. W. Dechert & M. Raupach
(Eds.), Temporal variables in speech, pp. 131–143. Mouton: The Hague.
Berlyne, D. E. (1971). Aesthetics and psychobiology. New York: Appleton-Century-Crofts.
Berman, R. A. & Slobin, D. I. (1994). Relating events in narrative. A crosslinguistic develop-
mental study. Hillsdale, New Jersey: Lawrence Erlbaum.
Berséus, P. (2002). Eye movement in prima vista singing and vocal text reading. Master's thesis in Cognitive Science, Lund University. http://www.sol.lu.se/humlab/eyetracking/Studentpapers/PerBerseus.pdf
Clark, H. H. (1992). Arenas of Language Use. The University of Chicago press: Chicago.
Clark, H. H. (1996). Using Language. Cambridge University Press: Cambridge.
Cooper, R. M. (1974). The control of eye fixations by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6, 84–107.
Crystal, D. (1975). The English tone of voice. London: St. Martin.
De Graef, P. (1992). Scene-context effects and models of real-world perception. In K. Rayner
(Ed.), Eye movements and visual cognition: Scene perception and reading, 243–259. New
York: Springer-Verlag.
Demarais, A. & Cohen, B. H. (1998). Evidence for image-scanning eye movements during
transitive inference. Biological Psychology, 49, 229–247.
Diderichsen, Philip (2001). Visual Fixations, Attentional Detection, and Syntactic Perspective. An experimental investigation of the theoretical foundations of Russell Tomlin's fish film design. Lund University Cognitive Studies 84.
Duchowski, Andrew T. (2003). Eye Tracking Methodology: Theory and Practice. Springer-
Verlag, London, UK.
Engel, D., Bertel, S. & Barkowsky, T. (2005). Spatial Principles in Control of Focus in Reasoning with Mental Representations, Images, and Diagrams. Spatial Cognition IV, 181–203.
Ericsson, K. A. & Simon, H. A. (1980). Verbal Reports as Data. Psychological Review; 87:
215–251.
Findlay, J. M., & Walker, R. (1999). A model of saccadic eye movement generation based
on parallel processing and competitive inhibition. Behavioral and Brain Sciences 22:
661–674. Cambridge University Press.
Finke, R. A. & Shepard, R. N. (1986). Visual functions of mental imagery. In K. R. Boff, L.
Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance. New
York: Wiley.
Finke, R.A. (1989). Principles of Mental Imagery. Massachusetts Institute of Technology:
Bradford books.
Firbas, J. (1992). Functional Sentence Perspective in Written and Spoken Communication.
Cambridge: Cambridge University Press.
Gärdenfors, P. (1996). Speaking about the inner environment. In S. Allén (Ed.), Of thoughts and words: The relation between language and mind. Proceedings of Nobel Symposium 92, Stockholm 1994. Imperial College Press, 143–151.
Gärdenfors, Peter (2000). Conceptual Spaces: The Geometry of Thought. MIT Press: Cambridge, MA.
Garrett, M. (1980). Levels of processing in sentence production. In B. Butterworth (Ed.), Language Production. London: Academic Press, 177–220.
Garrett, M. (1975). The Analysis of Sentence Production. In Bower, G. (Ed.) Psychology of
Learning and Motivation, Vol. 9. New York: Academic Press, 133–177.
Gedenryd, H. (1998). How designers work. Making sense of authentic cognitive activities.
Lund University Cognitive Studies 75: Lund.
Gentner, D., & Stevens, A. L. (Eds.). (1983). Mental models. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Hayhoe, M. M. (2004). Advances in relating eye movements and cognition. Infancy, 6(2),
pp. 267–274.
Hebb, D. O. (1968). Concerning imagery. Psychological Review, 75, 466–477.
Hegarty, M. (1992). Mental animation: Inferring motion from static displays of mechani-
cal systems. Journal of Experimental Psychology: Learning, Memory and Cognition, 18,
1084–1102.
Hegarty, M. (2004). Diagrams in the mind and in the world: Relations between internal and
external visualizations. In A. Blackwell, K. Marriott & A. Shimojima (Eds.), Diagrammatic Representation and Inference. Lecture Notes in Artificial Intelligence 2980 (1–13).
Berlin: Springer-Verlag.
Hegarty, M. & Waller, D. (2004). A dissociation between mental rotation and perspective-taking spatial abilities. Intelligence, 32, 175–191.
Hegarty, M. & Waller, D. (2006). Individual differences in spatial abilities. In P. Shah & A.
Miyake (Eds.). Handbook of Visuospatial Thinking. Cambridge University Press.
Henderson, J. M. & Hollingworth, A. (1998). Eye Movements during Scene Viewing. An
Overview. In Underwood, G. W. (Ed.), Eye Guidance in Reading and Scene Perception,
269–293. Oxford: Elsevier.
Henderson, J. M. & Hollingworth, A. (1999). High Level Scene Perception. Annual Review
of Psychology, 50, 243–271.
Henderson, J. M. (1992). Visual attention and eye movement control during reading and
picture viewing. In K. Rayner (Ed.) Eye movements and Visual Cognition. New York:
Springer Verlag.
Henderson, J. M. & Ferreira, F. (Eds.). (2004). The integration of language, vision, and action:
Eye movements and the visual world. New York: Psychology Press.
Herskovits, A. (1986). Language and Spatial Cognition. An Interdisciplinary Study of the
Prepositions in English. Cambridge University Press: Cambridge.
Hoffman, J. E. (1998). Visual Attention and Eye Movements. In Pashler, H. (Ed.). (1998).
Attention. Psychology Press: UK, 119–153.
Holmqvist, K., Holmberg, N., Holsanova, J., Törning, J. & Engwall, B. (2006). Reading Information Graphics: Eyetracking studies with Experimental Conditions. In J. Errea (Ed.), Malofiej Yearbook of Infographics. Society for News Design (SND-E), Navarra University, Pamplona, Spain, pp. 54–61.
Holmqvist, K. & Holsanova, J. (1997). Reconstruction of focus movements in spoken dis-
course. In Liebert, W., Redeker, G. & Waugh, L. (Eds.), Discourse and Perspective in Cognitive Linguistics. Benjamins: Amsterdam, 223–246.
Holmqvist, K. (1993). Implementing Cognitive Semantics. Image schemata, valence accommodation and valence suggestion for AI and computational linguistics. Lund University
Cognitive Studies 17.
Holmqvist, K., Holsanova, J., Barthelson, M. & Lundqvist, D. (2003). Reading or scanning?
A study of newspaper and net paper reading. In Hyönä, J., Radach, R. & Deubel, H. (Eds.), The mind's eye: Cognitive and applied aspects of eye movement research (657–670). Elsevier
Science Ltd.
Holsanova, J., Holmberg, N. & Holmqvist, K. (forthc.). Integration of Text and Information
Graphics in Newspaper Reading. Lund University Cognitive Studies 125.
Horne, M., Hansson, P., Bruce G., Frid, J. & Filipson, M. (1999). Discourse Markers and the
Segmentation of Spontaneous Speech: The case of Swedish men 'but/and/so'. Working Papers 47, 123–140. Dept. of Linguistics, Lund University.
Huber, S., & Kirst, H. (2004). When is the ball going to hit the ground? Duration estimates,
eye movements, and mental imagery of object motion. Journal of Experimental Psychology: Human Perception and Performance, Vol. 30, No. 3, 431–444.
Inhoff, A. W. & Gordon, A. M. (1997). Eye Movements and Eye-Hand Coordination During
Typing. Current Directions in Psychological Science, Vol. 6(6), 1997. American Psychological Society: Cambridge University Press, 153–157.
Johansson, R., Holsanova, J. & Holmqvist, K. (2005). What Do Eye Movements Reveal
About Mental Imagery? Evidence From Visual And Verbal Elicitations. In Bara, B. G.,
Barsalou, L., Bucciarelli, M. (Eds.), Proceedings of the 27th Annual Conference of the
Cognitive Science Society, pp. 1054–1059. Mahwah, NJ: Erlbaum.
Johansson, R., Holsanova, J. & Holmqvist, K. (2006). Pictures and spoken descriptions elicit
similar eye movements during mental imagery, both in light and in complete darkness.
Cognitive Science 30: 6 (pp. 1053–1079). Lawrence Erlbaum.
Johansson, R., Holsanova, J. & Holmqvist, K. (2005). Spatial frames of reference in an interactive setting. In Tenbrink, Bateman & Coventry (Eds.), Proceedings of the Workshop on Spatial Language and Dialogue, Hanse-Wissenschaftskolleg, Delmenhorst, Germany, October 23–25, 2005.
Johnson-Laird, P. N. (1983). Comprehension as the Construction of Mental Models, Philo-
sophical Transactions of the Royal Society of London. Series B, Biological Sciences, Vol.
295, No. 1077, 353–374.
Jonassen, D. & Grabowski, B. (1993). Handbook of individual differences, learning, and in-
struction. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Juola, J. F., Bouwhuis, D. G., Cooper, E. E. & Warner, C. B. (1991). Control of Attention around the Fovea. Journal of Experimental Psychology: Human Perception and Performance, 17(1): 125–141.
Just, M. A. & Carpenter, P. A. (1976). Eye fixations and cognitive processes. Cognitive Psychology, 8, 441–480.
Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87, 329–354.
Kahneman, D. (1973). Attention and Effort. Prentice Hall, Inc.: Englewood Cliffs, New Jer-
sey.
Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance. In
Key, M. (Ed.), The Relationship of Verbal and Nonverbal Communication, Mouton: The
Hague, 207–227.
Kess, J. F. (1992). Psycholinguistics. Psychology, Linguistics and the Study of Natural Lan-
guage. Benjamins: Amsterdam/Philadelphia.
Kintsch, W. & van Dijk, T. A. (1983). Strategies of discourse comprehension. New York: Aca-
demic.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95, 163–182.
Kita, S., & Özyürek, A. (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal?: Evidence for an interface representation of spatial thinking and speaking. Journal of Memory and Language, 48, 16–32.
Kita, S. (1990). The temporal relationship between gesture and speech: A study of Japanese-English bilinguals. Unpublished master's thesis. Department of Psychology, University
of Chicago.
Naito, K., Katoh, T. & Fukuda, T. (2004). Expertise and position of line of sight in golf putting. Perceptual and Motor Skills, 99, 163–170.
Korolija, N. (1998). Episodes in Talk. Constructing coherence in multiparty conversation.
Linköping Studies in Arts and Science 171. Linköping University.
Kosslyn, S. (1994). Image and Brain. Cambridge, Mass. The MIT Press.
Kosslyn, S. (1978). Measuring the visual angle of the mind's eye. Cognitive Psychology, 10, 356–389.
Kosslyn, S. (1980). Image and Mind. Harvard University Press. Cambridge, Mass. and Lon-
don, England.
Kosslyn, S. M. (1995). Mental imagery. In S. M. Kosslyn & D.N. Osherson (Eds.), Visual
cognition: An invitation to cognitive science (Vol. 2, pp. 267–296). Cambridge, MA: MIT
Press.
Kowler, E. (1996). Cogito Ergo Moveo: Cognitive Control of Eye Movement. In Landy M. S.,
Maloney, L. T. & Paul, M. (Eds.), Exploratory vision: The Active Eye, 51–77.
Kozhevnikov, M., Hegarty, M. & Mayer, R. E. (2002). Revising the Visualizer-Verbalizer Dimension: Evidence for Two Types of Visualizers. Cognition and Instruction, 20(1), 47–77.
Krutetskii, V. A. (1976). The psychology of mathematical abilities in school children. Chicago:
University of Chicago Press.
Labov, W. & Waletzky, J. (1973). Erzählanalyse: Mündliche Versionen persönlicher Erfahrung. In J. Ihwe (Ed.), Literaturwissenschaft und Linguistik, Bd. 2. Frankfurt/M.: Fischer-Athenäum, 78–126.
Laeng, Bruno & Teodorescu, Dinu-Stefan (2002). Eye scanpaths during visual imagery re-
enact those of perception of the same visual scene. Cognitive Science 2002, Vol. 26, No.
2: 207–231.
Lahtinen, S. (2005). Which one do you prefer and why? Think aloud! In Proceedings of Join-
ing Forces, International Conference on Design Research. UIAH, Helsinki, Finland.
Lakoff, G. & Johnson, M. (1980). Metaphors we live by. Chicago: University of Chicago
Press.
Lakoff, G. (1987). Women, fire, and dangerous things: what categories reveal about the mind.
The University of Chicago Press: Chicago, IL.
Lang, E., Carstensen, K.-U. & Simmons, G. (1991). Modelling Spatial Knowledge on a Linguistic Basis: Theory – Prototype – Integration. (Lecture Notes on Artificial Intelligence
481) Springer-Verlag: Berlin, Heidelberg, New York.
Langacker, R. (1987). Foundations of Cognitive Grammar. Volume 1. Stanford University
Press: Stanford.
Langacker, R. (1991). Foundations of Cognitive Grammar. Volume 2. Stanford University
Press: Stanford.
Lemke, J. (1998). Multiplying meaning: Visual and verbal semiotics in scientific text. In J. Martin & R. Veel (Eds.), Reading Science. London: Routledge.
Levelt, W. J. M. (1981). The speaker's linearization problem. Philosophical Transactions of the Royal Society of London B, 295, 305–315.
Levelt, W. J. M. (1983). Monitoring and self-repair in speech. Cognition, 14, 41–104.
Levelt, W. J. M. (1989). Speaking: From intention to articulation. MIT Press, Bradford Books: Cambridge, MA.
Lévy-Schoen, A. (1969). Détermination et latence de la réponse oculomotrice à deux stimulus. L'Année Psychologique, 69, 373–392.
Lévy-Schoen, A. (1974). Le champ d'activité du regard: données expérimentales. L'Année Psychologique, 74, 43–66.
Linde, C. & Labov, W. (1975). Spatial networks as a site for the study of language and thought. Language, 51, 924–939.
Linde, C. (1979). Focus of attention and the choice of pronouns in discourse. In Talmy Givón (Ed.), Syntax and Semantics, Volume 12: Discourse and Syntax. Academic Press: New York, San Francisco, London, 337–354.
Linell, P. (1982). Speech errors and the grammatical planning of utterances: Evidence from Swedish. In Koch, W., Platzack, C. & Totties, G. (Eds.), Textstrategier i tal och skrift. Almqvist & Wiksell International: Stockholm, 134–151.
Linell, P. (1994). Transkription av tal och samtal. Arbetsrapporter från Tema Kommunikation 1994: 9. Linköpings universitet.
Linell, P. (2005). En dialogisk grammatik? In Anward, J. & Nordberg, B. (Eds.), Samtal och grammatik. Studentlitteratur, 231–315.
Loftus, G. R. & Mackworth, N. H. (1978). Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance, 4, 565–572.
Lucy, J. A. (1992). Language diversity and thought: A reformulation of the linguistic relativity hypothesis. Cambridge: Cambridge University Press.
Mackworth, N. H. & Morandi, A. J. (1967). The gaze selects informative details within pictures. Perception and Psychophysics, 2, 547–552.
Mast, F. W. & Kosslyn, S. M. (2002). Eye movements during visual mental imagery. Trends in Cognitive Sciences, 6(7).
Mathesius, V. (1939). O takzvaném aktuálním členění větném. Slovo a slovesnost, 5, 171–174. Also as: On information-bearing structure of the sentence. In K. Susumo (Ed.), 1975. Harvard: Harvard University, 467–480.
Meyer, A. S. & Dobel, C. (2003). Application of eye tracking in speech production research. In Hyönä, J. & Deubel, H. (Eds.), The Mind's Eye: Cognitive and applied aspects of eye movement research. Elsevier Science Ltd, 253–272.
Meulen, F. F. van der, Meyer, A. S. & Levelt, W. J. M. (2001). Eye movements during the production of nouns and pronouns. Memory & Cognition, 29, 512–521.
Mishkin, M., Ungerleider, L. G. & Macko, K. A. (1983). Object vision and spatial vision: Two cortical pathways. Trends in Neurosciences, 6, 414–417.
Mozer, M. C. & Sitton, M. (1998). Computational modelling of spatial attention. In Pashler, H. (Ed.), Attention. Psychology Press: UK, 341–393.
Naughton, K. (1996). Spontaneous gesture and sign: A study of ASL signs co-occurring with speech. In Messing, L. (Ed.), Proceedings of the workshop on the integration of gesture in language and speech. University of Delaware, 125–134.
Nordqvist, S. (1990). Kackel i trädgårdslandet. Opal.
Noton, D. & Stark, L. (1971a). Eye movements and visual perception. Scientific American, 224, 34–43.
Noton, D. & Stark, L. (1971b). Scanpaths in saccadic eye movements while viewing and recognizing patterns. Vision Research, 11, 929.
Noton, D. & Stark, L. (1971c). Scanpaths in eye movements during perception. Science, 171, 308–311.
Nuyts, J. (1996). Consciousness in language. Pragmatics & Cognition, 4(1) (Special issue on language and consciousness), 153–180.
Olshausen, B. A. & Koch, C. (1995). Selective Visual Attention. In Arbib, M. A. (Ed.), The handbook of brain theory and neural networks. Cambridge, MA: MIT Press, 837–840.
Oviatt, S. L. (1999). Ten Myths of Multimodal Interaction. Communications of the ACM, 42(11), 74–81.
Paivio, A. (1971). Imagery and Verbal Processes. Hillsdale, N.J.: Erlbaum.
Paivio, A. (1986). Mental representation: A dual coding approach. New York: Oxford University Press.
Paivio, A. (1991a). Dual Coding Theory: Retrospect and current status. Canadian Journal of Psychology, 45(3), 255–287.
Paivio, A. (1991b). Images in Mind. Harvester Wheatsheaf: New York, London.
Pollatsek, A. & Rayner, K. (1990). Eye movements, the eye-hand span, and the perceptual span in sight-reading of music. Current Directions in Psychological Science, 49–53.
Posner, M. I. (1980). Orienting of attention. Quarterly Journal of Experimental Psychology, 32, 3–25.
Prince, E. (1981). Toward a Taxonomy of Given-New Information. In Cole, P. (Ed.), Radical
pragmatics. New York: Academic Press.
Pylyshyn, Z. W. (2001). Visual indexes, preconceptual objects, and situated vision. Cognition, 80(1/2), 127–158.
Quasthoff, U. (1979). Verzögerungsphänomene, Verknüpfungs- und Gliederungssignale in Alltagsargumentationen und Alltagserzählungen. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache. Walter de Gruyter: Berlin, New York, 39–57.
Qvarfordt, P. (2004). Eyes on Multimodal Interaction. Linköping Studies in Science and Technology No. 893. Department of Computer and Information Science, Linköpings universitet.
Rayner, K. (Ed.). (1992). Eye movements and visual cognition: scene perception and reading.
New York: Springer-Verlag.
Redeker, G. (1990). Ideational and pragmatic markers of discourse structure. Journal of Pragmatics, 14, 367–381.
Redeker, G. (1991). Linguistic markers of discourse structure. Review article. Linguistics, 29, 139–172.
Redeker, G. (2000). Coherence and structure in text and discourse. In William Black & Harry Bunt (Eds.), Abduction, Belief and Context in Dialogue: Studies in Computational Pragmatics (233–263). Amsterdam: Benjamins.
Spivey, M. J., Tyler, M., Richardson, D. C. & Young, E. (2000). Eye movements during comprehension of spoken scene descriptions. Proceedings of the Twenty-second Annual Meeting of the Cognitive Science Society, 487–492. Erlbaum: Mahwah, NJ.
Stenström, A.-B. (1989). Discourse Signals: Towards a Model of Analysis. In H. Weydt (Ed.), Sprechen mit Partikeln. Walter de Gruyter: Berlin, New York, 561–574.
Strohner, H. (1996). Resolving Ambiguous Descriptions through Visual Information. In Representations and Processes between Vision and NL, Proceedings of the 12th European Conference on Artificial Intelligence, Budapest, Hungary 1996.
Strömqvist, S. (1996). Discourse Flow and Linguistic Information Structuring: Explorations in Speech and Writing. Gothenburg Papers in Theoretical Linguistics 78.
Strömqvist, S. (1998). Lite om språk, kommunikation, och tänkande. In T. Bäckman, O. Mortensen, E. Raanes & E. Østli (Eds.), Kommunikation med døvblindblivne. Dronninglund: Forlaget Nordpress, 13–20.
Strömqvist, S. (2000). A Note on Pauses in Speech and Writing. In Aparici, M. (Ed.), Developing literacy across genres, modalities and languages, Vol. 3. Universitat de Barcelona, 211–224.
Suwa, M., Tversky, B., Gero, J. & Purcell, T. (2001). Seeing into sketches: Regrouping parts encourages new interpretations. In J. S. Gero, B. Tversky & T. Purcell (Eds.), Visual and Spatial Reasoning in Design II. Key Centre of Design Computing and Cognition, University of Sydney, Australia, 207–219.
Tadahiko, F. & Nagano, T. (2004). Visual search strategies of soccer players in one-to-one defensive situation on the field. Perceptual and Motor Skills, 99, 968–974.
Tanenhaus, M. K., Magnuson, J. S., Dahan, D. & Chambers, C. (2000). Eye movements and lexical access in spoken-language comprehension: Linking hypothesis between fixations and linguistic processing. Journal of Psycholinguistic Research, 29(6), 557–580.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M. & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632–1634.
Taylor, H. A. & Tversky, B. (1992). Description and depiction of environments. Memory and Cognition, 20, 483–496.
Theeuwes, J. (1993). Visual selective attention: A theoretical analysis. Acta Psychologica, 83, 93–154.
Theeuwes, J., Kramer, A. F., Hahn, S. & Irwin, D. (1998). Our eyes do not always go where we want them to go: Capture of the eyes by new objects. Psychological Science, 9, 379–385.
Thomas, N. J. T. (1999). Are theories of imagery theories of imagination? An active perception approach to conscious mental content. Cognitive Science, 23(2), 207–245.
Tomlin, R. S. (1995). Focal attention, voice, and word order: An experimental, cross-linguistic study. In P. Downing & M. Noonan (Eds.), Word Order in Discourse. Amsterdam: John Benjamins, 517–554.
Tomlin, R. S. (1997). Mapping Conceptual Representations into Linguistic Representations: The Role of Attention in Grammar. In J. Nuyts & E. Pederson (Eds.), With Language in Mind. Cambridge: CUP, 162–189.
Tversky, B. (1999). What does drawing reveal about thinking? In J. S. Gero & B. Tversky (Eds.), Visual and spatial reasoning in design. Sydney, Australia: Key Centre of Design Computing and Cognition, 93–101.
Tversky, B., Franklin, N., Taylor, H. A. & Bryant, D. J. (1994). Spatial mental models from descriptions. Journal of the American Society for Information Science, 45(9), 656–668.
Ullman, S. (1996). High-level vision: Object recognition and visual cognition. Cambridge, MA: MIT Press.
Ungerleider, L. G. & Mishkin, M. (1982). Two cortical visual systems. In D. J. Ingle, M. A. Goodale & R. W. J. Mansfield (Eds.), Analysis of visual behavior. MIT Press.
Underwood, G. & Everatt, J. (1992). The role of eye movements in reading: Some limitations of the eye-mind assumption. In E. Chekaluk & K. R. Llewellyn (Eds.), The Role of Eye Movements in Perceptual Processes. Elsevier Science Publishers B. V., Advances in Psychology, Amsterdam, Vol. 88, 111–169.
van Donzel, M. E. (1997). Perception of discourse boundaries and prominence in spontaneous Dutch speech. Working Papers, 46, Lund University, Department of Linguistics, 5–23.
van Donzel, M. E. (1999). Prosodic Aspects of Information Structure in Discourse. LOT, Netherlands Graduate School of Linguistics. Holland Academic Graphics: The Hague.
Velichkovsky, B. M. (1995). Communicating attention: Gaze-position transfer in cooperative problem solving. Pragmatics and Cognition, 3(2), 199–222.
Velichkovsky, B., Pomplun, M. & Rieser, J. (1996). Attention and Communication: Eye-Movement-Based Research Paradigms. In W. H. Zangemeister, H. S. Stiehl & C. Freksa (Eds.), Visual Attention and Cognition (125–154). Amsterdam, Netherlands: Elsevier Science.
Viviani, P. (1990). Eye movements in visual search: Cognitive, perceptual, and motor control aspects. In E. Kowler (Ed.), Eye movements and their role in Visual and Cognitive Processes. Reviews of Oculomotor Research V4. Amsterdam: Elsevier Science B. V., 353–383.
Yarbus, A. L. (1967). Eye movements and vision (1st Russian edition, 1965). New York: Plenum Press.
Young, L. J. (1971). A study of the eye-movements and eye-hand temporal relationships of successful and unsuccessful piano sight-readers while piano sight-reading. Doctoral dissertation, Indiana University. RSD721341.
Zwaan, R. A. & Radvansky, G. A. (1998). Situation models in language comprehension and memory. Psychological Bulletin, 123, 162–185.
Author index
Strömqvist 3, 9, 15, 40, 54, 80, 179, 190
Suwa 154, 190

T
Tadahiko 99, 186, 190
Tanenhaus 103, 190
Tärning 183
Taylor 153, 180, 190–191
Teodorescu 158, 166, 186
Theeuwes 80, 83, 190
Thomas 166, 181, 190
Tomlin 82, 180–181, 190
Tufvesson 179
Tversky 48, 58, 153, 190–191
Tyler 159, 190

U
Ullman 191
Umkehrer 189
Underwood 80, 183, 191
Ungerleider 69, 187, 191

V
van der Meulen 119, 187
van Dijk 169, 185
van Donzel 15, 191
Velichkovsky 103, 191
Viviani 145, 191

W
Waletzky 59, 186
Walker 80, 83, 181
Waller 157, 183
Warner 185
Wengelin 179
Wiebe 189

Y
Yarbus 80, 87, 97, 136, 145, 191
Young 99, 159, 190–191

Z
Zetzsche 189
Zwaan 169, 191
Subject index
digression 39, 40, 42, 44, 45, 172
discourse analysis 171
discourse boundaries 12, 14, 15, 17, 171, 191
discourse
  coherence 50, 53, 122, 172
  comprehension 151–152, 169, 185
  hierarchy 10, 13, 100
  level(s) 13, 15, 88, 102, 115, 120–123, 143, 145, 147, 174, 176
  markers 5, 8, 11–12, 14, 16, 42–43, 46, 51–53, 59, 61–62, 67, 75, 156
  operators 43
  production 2, 9, 16, 103, 151–152, 171, 176
  segmentation 1, 9, 11, 15–16, 39, 171
  topics 9, 113, 121, 144
discourse-mediated mental representations 49, 149, 176
discourse-mediated representations 172
distribution of foci 28, 33, 35, 38, 171
drawing 38–39, 46, 49–50, 52, 54, 97, 151, 153–154, 156–157, 159, 161–166, 168, 172–173, 176–177, 191
dual code theory 68
dynamic 4, 30, 32, 55–67, 69, 70–77, 79, 84, 94, 97, 127–128, 141, 153, 157, 168, 173, 178
  description style 56, 59, 61–64, 69, 70, 73, 76–77, 173
  motion verbs 62, 64
  verbs 58, 62, 70, 72, 74
dysfluencies 3, 16, 21, 80, 113, 144

E
embedded segments 44
errors 3, 187
evaluation 9, 10, 50, 137, 139, 149, 175, 177
evaluative foci 24, 27, 33–34, 36, 52, 54, 62, 69, 109, 116, 143
evaluative tool 94
events 4, 16, 20–21, 27, 30, 37, 41, 57, 59, 62, 66–67, 77, 86, 117–118, 120–121, 142–143, 152, 158, 171, 173, 179
execution 3, 146
existential constructions 58, 70, 74, 76
experiential factors 64–66, 87, 173
experiment 82, 84, 161
expert foci 24, 27, 33–38, 62, 171–172
external
  memory aid 50, 54, 153, 173
  visual representations 49
  visualisations 157, 167–168, 177
eye fixation patterns 143
eye gaze 96, 161, 174
eye movement(s) 70, 80–84, 87, 90, 97, 99, 106, 116, 125–126, 129, 134–135, 137, 143, 145, 150, 157–158, 180–183, 185, 188–189
  function of 143, 176
  patterns 86, 137–138, 141, 144–145, 147, 149, 157–158, 163, 168, 176–177
  protocol 167, 175
eye tracker 28, 85
eye tracking 29–33, 35, 37–38, 40, 79, 85, 87, 95, 98, 100, 102, 116, 120, 123, 143, 151, 158, 162, 167, 169, 179, 187
eye voice latencies 161
eye-gaze patterns 96
eye-mind assumption 191
eye-voice latency 102, 122, 174
eye-voice span 102

F
feedback 2, 9, 15, 39, 50–51, 54
figure-ground 127
fixation(s) 89–91, 94, 99, 101, 103, 106, 135, 141, 144–145, 161–162, 175, 181, 190
  duration 88, 90
  pattern 89, 123, 125, 135, 141, 143, 150, 173
flow of speech 1, 171
flow of thought 1, 80, 171
focus 1, 3, 5–7, 11–13, 15–16, 19, 21–23, 26–28, 30, 32–33, 36, 37, 39, 41, 44–46, 48–50, 52–53, 55–57, 61–62, 65, 67–69, 77, 79, 83–85, 87, 92, 94, 100–104, 106–109, 114–116, 118, 121–123, 125–127, 129, 144–147, 151, 153–154, 156–157, 164–165, 172–176, 183, 189
  of active thought 3
  of attention 7, 49
  of thought 3
focusing 49, 50, 53–54, 59, 62–63, 79, 81, 87, 101, 103, 126–127
free description 63, 87, 121, 174–175
functional clusters 175
functional distribution
  of configuration types 100
  of multimodal patterns 116
functions of eye movements 143

G
gaze 50, 82, 84, 90, 96, 122, 144, 174, 187
  behaviour 50
  pattern 96, 161, 174
geometric type 68
gestures 3, 15, 46, 50, 52, 97, 151
global correspondence 161–163, 165
global thematic units 16

  integration patterns 119, 175
  method 77, 97, 125, 173, 177–178
  score sheet 79, 86, 97, 148
  scoring technique xii
  sequential method 6, 79, 84, 92, 94, 98
  system 102
  time-coded score sheets 89
multiple external representations 167
multiple representations 49, 167–169
mutual visual access 54, 173

N
narrative 28, 30, 32–35, 37–38, 55, 59, 65–66, 69–70, 73–77, 81, 171, 173, 179–180
narrative priming 28, 30, 33, 38, 55, 66, 70, 73–74, 76–77, 173
narrative schema 173
narrative setting 30, 34–35, 74, 171
non-verbal 3, 15–16, 39, 46, 50, 52–54, 67–68, 97, 102, 153
  actions 50, 52, 54
non-verbally 50
n-to-1 mappings 112, 117
n-to-n mappings 113, 117, 119

O
object-activity 129, 148, 175
object-attribute 129, 148, 175
object-location 128–129, 148, 175
object-object 148, 175
object-path 148, 175
off-line 19, 21, 26–29, 31–33, 35, 37–39, 43, 63, 68, 70, 72–76
off-line picture description 19, 21, 27, 37, 39, 43, 70
on-line 28–33, 35, 37–38, 40, 70, 73–74, 85, 96, 101, 103–104, 108–109, 113–114, 116, 125, 144, 148, 177
on-line picture description 40, 70, 73, 114, 116, 125
on-line writing 96
organisational function 20–21, 25, 27, 29, 33–37, 42, 87, 116, 121
organising function 37, 171
orientational function 24, 42, 53, 116
overt 80, 145
overt attention 80

P
parallel component model 51
paratactic segments 53, 172
paratactic transition 44
pauses 1–3, 8, 11, 15–16, 21, 39–40, 42, 53, 92, 96, 104–105, 108–109, 116–118, 122, 146, 172
pausing 9
perceive 12, 14, 125, 131
perception 1, 7–8, 12, 14, 16, 65, 80, 83–84, 86–87, 89, 90, 92, 94, 96–97, 108, 123, 125, 131, 134, 137, 142, 149, 151–152, 158, 178, 180–181, 184, 186, 188, 190
  of discourse boundaries 12, 191
perfect match 101, 106, 114, 126
performative distance from the descriptive discourse 42
phases 3, 9, 57, 59, 77, 90, 97, 102, 146, 159, 161, 164, 166, 173
phrase 6–7, 15, 120, 123, 143
phrasing unit(s) 5, 7
picture description 2, 6, 10, 12, 17, 19, 20, 22, 28, 30–40, 42, 46, 53, 55–57, 59, 64–67, 69, 70–73, 76–77, 79–82, 85–87, 89–90, 92, 94, 96, 98–102, 104, 109, 119–122, 125–128, 131, 133–134, 137, 139, 142–143, 145–146, 149, 150–151, 157, 164, 168, 173–174, 176–177, 184
picture viewing 6, 12, 32, 65, 70, 77, 79, 85–86, 89–90, 92, 94, 98–99, 102, 121, 125–126, 128, 131, 133–134, 137, 139, 142–143, 146, 149–150, 168, 173, 176–177, 183–184, 187
planning 1, 3, 16, 26–27, 37, 43, 80, 82, 97, 99, 103, 115, 120–123, 138, 141, 144, 146–147, 156, 158, 174, 176, 180, 187
pointing 50, 52, 54, 97, 126–127, 154, 156–157, 173
prediction 156
preparatory glances 122, 174
presentational function 21, 23, 26, 33, 36–37, 41, 116
priming study 125, 138, 149, 176
problem-solving 68, 81, 168, 177
prosodic criteria 11
proximity principle 131
psycholinguistic studies 81, 87, 102, 120–121, 123, 143
psycholinguistics 9, 180
psychology 66, 81, 131, 180, 186–187

R
re-categorisational gazes 176
re-conceptualisation 149, 175
re-examine 122, 174
referent storage 54, 173
referential availability 44, 49, 53
referents 8, 20–21, 27, 29–30, 36–37, 41, 44, 50, 107, 121, 142–143, 154, 171
re-fixate 175–176
re-fixation 174
refocus 45–46, 53, 61, 77, 87, 173
region informativeness 87, 91
regulatory function 25
relevance principle 79
remembering 66–69, 77, 173
reorientation 42
retrospective 167

S
saccades 90, 135, 149, 161
saliency principle 79, 131
scanpaths 139, 140–142, 145, 186
scene 19–21, 24–25, 32, 36, 42, 59, 65–66, 70, 77, 79, 86–88, 90–91, 94, 97, 101–102, 106, 114, 116, 118, 120–121, 125–128, 131–139, 141–142, 144–145, 149, 153, 158–159, 163, 165–166, 172–173, 176–177, 188, 190
scene perception 86, 90, 97, 116, 125, 142, 188
scene semantics 141, 176
segmentation 1, 5, 9, 10–12, 14–16, 97
segmentation rules 1, 5, 11
semantic 11, 14–17, 51, 79, 98–100, 131, 137, 142, 148, 150, 171, 174–175
  correspondence 79, 98–100, 148, 150, 174–175
  criteria 11, 14–17, 171
  groupings 131, 137, 142
semantic, rhetorical and sequential aspects of discourse 51
semiactive foci 7
semiactive information 49, 54, 172
sentences 1, 5, 14–15, 102
sequential processual 79
sequential steps 46, 53
series of delays 107
series of n-to-1 mappings 111
series of n-to-n mappings 113, 117
series of perfect matches 106
series of triangles 109–110
similarity principle 131
simultaneous 28, 30–31, 34, 38, 81–82, 86, 89, 94, 97–101, 103, 115, 122, 139, 142–143, 148, 161, 171–172, 174, 176
simultaneous description with eye tracking 34
simultaneous verbal description 28, 94, 97
situation awareness 49, 54, 97
spatial and temporal correspondence 161
spatial expressions 23, 58, 61–63, 70, 72, 74–76
spatial groupings 125, 131, 137, 142
spatial perception 61–62
spatial priming 37–38, 66, 69–73, 75–77
spatial proximity 125, 131–132, 134–137, 148–149, 167, 176
spatial relations 23, 27, 30, 37, 52, 56–58, 64, 67–69, 76, 143, 157–159, 168, 173, 177
spatial visualiser(s) 68
speakers 5, 14–16, 21–25, 27, 35, 37, 39–42, 45–46, 49, 52–55, 60–61, 73, 75, 77, 82, 102, 137, 139, 142, 146, 148, 151–153, 157, 167–168, 171–172, 176–177
specification 130, 149, 175
speech 1–4, 6, 8–11, 15–16, 21, 26–27, 38–40, 42–43, 45, 51–53, 75, 80–82, 84, 92, 94, 99, 102–103, 106, 116, 121–123, 137–138, 143, 146–148, 172, 174–176, 179–182, 185–188, 191
speech and thought 9
speech unit 4, 6, 42
spoken discourse 1–5, 12, 16, 127, 154, 157, 173, 183
spoken language 6–7, 12, 16, 43, 67, 79–84, 86, 88, 94, 98, 115, 126, 137, 148, 167, 181, 190
spontaneous conversation 16, 28, 38, 151, 153
spontaneous description and drawing 46
spontaneous descriptive discourse 36, 154
spontaneous drawing 48, 157
spotlight 79, 83–84, 94, 130, 154
states 4, 11, 20–21, 27, 30, 37, 41, 51, 86, 97, 117–118, 121, 131, 142–143, 145, 171
static 29, 32, 55–59, 62–67, 69–73, 76–77, 81, 90, 92, 173, 183
  description style 55–57, 59, 63–64, 69–72, 76, 173
steps in picture description 53
storage of referents 50, 153
structure 1, 5, 14–15, 17, 19–20, 28, 30, 37–39, 43, 46, 51–53, 55, 94, 96–97, 109, 121, 137, 145, 161, 172, 188
structure of spoken picture descriptions 17, 19, 38
substantive foci 21–23, 27, 30–31, 33, 36–38, 44, 62, 102, 104, 113, 116, 123, 171, 175
substantive foci with categorisation difficulties 21, 30–31, 33, 38, 113, 116, 123
substantive list of items 22, 146
summarising foci 22, 24, 27, 31, 38, 41, 63, 114, 117, 122–123, 146, 171
summarizing gazes 176
superfocus 1, 8, 12–14, 22, 27, 40–41, 53, 89, 92, 100, 102–104, 106–107, 109, 113–115, 119, 121–123, 133, 144, 147–148, 156, 164, 171, 174–176
support for visualisation 50, 153
symmetry principle 131

T
task-dependent cluster 136
taxonomy 4, 28, 30, 37, 39, 171
taxonomy of foci 30, 37, 39, 171
taxonomic proximity 131
temporal 61–63, 70, 72–77, 99, 121, 123, 125, 157, 173–174, 186, 191
  dynamics 75, 76, 77, 173
  expressions 62–63, 70, 72–74, 76–77, 173
  perception 61–62
  relations 77, 99, 121, 123, 125, 157, 173–174, 186, 191
  simultaneity 174
temporal correspondence 85
thematic distance 40–41
  from surrounding linguistic context 40
  from surrounding pictorial context 41
think-aloud protocols 81, 87
thought processes 81, 82
timeline 88, 100
topical episodes 16
trajector-landmark 127
transcribing 3
transcript 4, 6, 11–12, 19, 46, 82, 88–89, 91–92
transcription 1, 3–4, 12, 16, 88
  symbols 4
transition(s) 3, 25, 40–41, 44–45, 50, 53, 156, 172
  between foci 39
triangle configuration 102
two different styles 56
two windows to the mind 6, 79, 84, 94, 98, 174
types 21, 26–28, 33, 35–37, 46, 55, 69, 87, 89, 96, 99, 121, 145, 171, 174
  of foci 21, 26–28, 33, 35–37, 46, 55, 69, 87, 96, 99, 145, 171, 174
  of superfoci 89, 121
typology of foci 38

U
underlying cognitive processes 6, 82, 94, 98, 148
underlying mental processes 97
unit of comparison 100–101, 115, 125
units in spoken descriptive discourse 1
usability 81, 167, 168
utterance 3–4, 8, 43, 45, 52, 102, 120, 139, 146–148, 176, 185
utterances 1, 5, 9–10, 13–15, 81, 123, 138–139, 141, 144, 147, 149, 153, 175–176, 187

V
variations in picture description 55
verbal and visual clusters 89, 125, 148
verbal and visual protocols 84
verbal behaviour 88, 98, 100, 174, 177, 182
verbal foci 8, 11–13, 15, 21–23, 34, 39, 41, 51–52, 82, 84, 86–87, 89, 92, 94, 99, 103, 106–107, 109, 113, 115–117, 122, 126, 130, 145, 148, 152, 156, 176
verbal focus 1, 6–8, 16, 21, 27, 82, 84, 92, 99–106, 109, 114–115, 117–118, 122–123, 126, 146, 154, 171, 174–175
verbal focus of attention 84, 99
verbal protocols 81, 87, 167
verbal stream 92, 101, 116–117, 119, 126
verbal superfoci 12, 13, 15, 89, 115, 119, 148
verbal superfocus 1, 6, 8, 16, 21, 113, 132, 135, 171
verbal thinkers 64, 67, 77, 151, 173
verbalisation process 131, 148, 177
verbalisers 68–69
viewing dimensions 137, 142
viewing patterns 138, 145, 149, 176
vision 69, 79, 82, 84–85, 90–91, 94, 127, 176, 182–183, 186–189, 191
visual access 49
visual and cognitive processing 80
visual behaviour 2, 28, 88–89, 96, 145
visual displays 80, 120, 123
visual fixation cluster 89, 103, 107, 115, 122, 174
visual foci 84, 93, 100, 103, 115–116, 119, 122, 128, 135, 148, 175–176
visual focus 68, 83–84, 99, 101, 106, 115, 122–123, 154, 174–175
visual focus of attention 84, 99
visual inspection 118, 120, 131, 174
visual paths 91
visual representations 149, 168, 176
visual scene 46, 137, 145, 147, 177, 180, 186
visual stream 92, 101, 119, 126
visual thinkers 55, 64, 66–67, 151, 157, 182
visualisation(s) 40, 49, 151, 153, 157, 164–165, 167–168, 172–173, 177
visualiser 68
visualisers 68–69
visually present objects 92
vocaliser 68

W
windows to the mind 6, 79, 84, 94, 98, 174

Z
zooming in 101
zooming out 127, 130, 132–134
In the series Human Cognitive Processing the following titles have been published thus far or
are scheduled for publication: