2, 1996
INTRODUCTION
Spoken utterances have characteristics that their written counterparts do not,
and it is a reasonable assumption that these characteristics have consequences for perceptual processing. As interest in the auditory processing of
spoken sentences increases, it becomes important to understand these characteristics. The past several decades have seen the emergence of theories
We thank the following for discussions of specific points or for their comments on
portions of earlier drafts, which improved the paper substantially: Ann Bradlow, Ronnie
Cann, Miriam Eckert, Merrill Garrett, Caroline Heycock, Pat Keating, Sharon Manuel,
Janet Nicol, Lisa Selkirk, Mark Steedman, and two anonymous reviewers.
1 Speech Communication Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139.
2 Department of Linguistics, University of Edinburgh, Edinburgh, United Kingdom EH8 9LL.
3 Address all correspondence to Stephanie Shattuck-Hufnagel, Speech Communication
Group, Research Laboratory of Electronics, 36-511 MIT, 77 Massachusetts Avenue,
Cambridge, Massachusetts 02139, or stef@speech.mit.edu.
0090-6905/96/0300-0193$09.50/0
Shattuck-Hufnagel and Turk
that attempt to account for patterns of intonation, timing and even variations
in segmental implementation in spoken utterances, in terms of a hierarchy
of prosodic constituents and prominences. This proposed prosodic structure
is separate from the morphosyntactic structure of the utterance, although
influenced by it. Several years ago, we set out to educate ourselves on the
literature that forms the basis of these claims: the linguistic theories that lay
out a hierarchy of prosodic constituents and prominences, the phonological
arguments that support the theoretical claims, and the quantitative behavioral
evidence that tests their relevance for speech and sentence processing models.
Our purpose is to summarize what we have learned in these three areas, in a
way that will be useful to others investigating spoken language processing.
Among the major lessons we have absorbed are:
• The morphosyntactic hierarchy influences the signal indirectly, via the constraints it imposes on the choices that the speaker makes among the prosodic
possibilities for a given utterance;
• These prosodic choices are also influenced by many other factors;
• For this reason, the prosody of a particular utterance of a sentence cannot
be predicted reliably from the text alone; thus, it is necessary to determine
the prosodic structure that the speaker actually used for each particular
spoken utterance;
• Current prosodic theory and transcription systems provide the tools to specify this structure, at least in a preliminary way that has already led to many
useful insights.
The view that prosodic structure arises via the prosodic component of
the grammar, is influenced by a number of extrasyntactic factors, and (in
concert with the segmental specification of the words of the utterance) directly determines the acoustic-phonetic realization of an utterance, raises a
number of critical issues. These include how to formally represent the prosodic aspects of phonological structure, how these structures influence the
phonetic dimensions of F0, duration, amplitude and segmental quality, how
the various factors that influence the speaker's choice of prosodic structure
for a given utterance are weighted, and the mechanism by which they interact. In this tutorial we will not provide definitive answers to these questions, but will review information that is useful in thinking about them. We
begin in Section 2 with the claim that the organizational structure of an
utterance is not identical to its syntactic structure. Section 3 is a summary
and comparison of several proposed prosodic hierarchies in the literature,
along with evidence for each prosodic element. Section 4 presents evidence
that bears specifically on the hierarchical arrangement of the constituents,
including the results of studies that directly compare the efficacy of syntactic
and prosodic structure in accounting for acoustic-phonetic measures of sp0-
A Prosody Tutorial
same prosodic context. Others use the term to refer to the phonological organization of segments into higher-level constituents and to the pattern of
relative prominences within these constituents. For example, proposed prosodic constituent hierarchies include such elements as intonational phrases,
prosodic phrases, prosodic words, clitic groups, metrical feet, etc., and proposed hierarchies of relative prominence include such contrasts as nuclear vs.
pre-nuclear pitch accents at the phrasal level, and full-vowel syllables vs.
reduced syllables at the lexical level. A third class of definition merges the
phonetic and phonological aspects of prosody, including both the higher level
organization, with its constituent boundaries and prominences, and the phonetic reflexes of this organization in the pattern of F0, duration, amplitude
and segment quality/reduction within an utterance. The evidence laid out in
this paper and elsewhere has convinced us that an acceptable definition should
include the notion of prosodic structure as an abstract entity, associated with
a separate component of the grammar, and that this component must integrate
various types of information to determine the appropriate prosodic shape of
a spoken utterance. As a working definition, we specify prosody as both (1)
acoustic patterns of F0, duration, amplitude, spectral tilt, and segmental reduction, and their articulatory correlates, that can be best accounted for by
reference to higher-level structures, and (2) the higher-level structures that
best account for these patterns. We subscribe to Beckman's (1996) observation that prosody is "a complex grammatical structure that must be parsed in
its own right"; it is "the organizational structure of speech."
2.1. General Correspondence Between Syntax and Prosody
A natural early hypothesis in generative grammar, and in the psycholinguistic studies that grew out of it in the 1960's and '70's, was that the
structural constituents of spoken utterances correspond to those predicted by
the syntax. Many studies showed that major acoustic phonetic phenomena,
such as intonational boundaries, preboundary lengthening and pausing, tend
to occur at major syntactic boundaries, and that some aspects of phrasal
prominence patterns can also be predicted from morphosyntactic properties
(Brown & Miron, 1971; Goldman-Eisler, 1972; Klatt, 1976; Cooper & Paccia-Cooper, 1980; Chomsky and Halle, 1968).
Further evidence of the syntax-prosody link is provided by the fact that
some syntactic ambiguities can be disambiguated by the placement of spoken boundaries (Lehiste, 1974; Lehiste et al., 1976; Price et al., 1991). This
is particularly true for bracketing ambiguities like:
(1) (old) (men and women) vs. (old men) (and women),
(2) (a + b) (c) vs. (a) + (b c)
(for evidence see e.g., Streeter 1978), although less true for other types of
ambiguity involving grammatical relations or grammatical function, such as
the infamous:
(3) Visiting firemen can be a nuisance.
(4) The shooting of the hunters was terrible.
The ability of speakers to disambiguate some forms of syntactic ambiguity
by prosodic means shows that syntax imposes some constraints on prosodic
structure.
Another line of evidence comes from the fact that some syntactic constituents obligatorily require a particular intonational constituent, e.g., parentheticals, tags, and nonrestrictive relatives and other appositives are
produced with their own intonational phrase. Selkirk (1978, 1981:137) gives
examples such as
(5) In Pakistan, Tuesday, which is a weekday, is, Jane said, a holiday.
The first, third, and fifth word strings between punctuation marks in
this example are obligatory intonational phrases. Similarly, some prosodic
structures are ruled out for certain syntactic forms, as shown by the unacceptability of examples like (6c), which indicate that surface syntax imposes
constraints on the way an utterance is organized.
(6) (a)
(b)
(c)
(d)
Finally, the form class of a word, traditionally regarded as a syntactic dimension, has been shown to affect its durational pattern (Sereno and Jongman 1995) as well as its prominence pattern (Kelly 1989).
Early studies of sentence processing, most carried out in English,
amassed a substantial body of evidence supporting the effects of surface
syntactic structure on various measures of perceptual processing; for a summary see Fodor, Bever and Garrett (1974). For example, Ladefoged and
Broadbent (1960) showed that listeners misperceive the location of a click
presented simultaneously with speech, in a way that suggests resistance to
interruption of constituents. Garrett et al. (1966) extended this method to
show that major syntactic boundaries influence the perceived location of the
click even when prosodic cues are neutralized.
Although initial evidence seemed to support the claim that syntactic
structure has a direct effect on the phonological and acoustic-phonetic shape
of an utterance, as well as on other forms of language behavior, as investigators began to examine larger corpora of utterances actually produced by
speakers, they found notable discrepancies with the results predicted by syntax. As the examples in the following section show, traditional morphosyntax is not isomorphic with the organizational structure of spoken utterances.
2.2. Discrepancies Between Syntactic and Prosodic Structure
These examples show that speakers have options for the prosodic treatment
of a given syntactic structure. This observation itself suggests that syntax
does not entirely determine prosody. Even more compelling, however, is the
fact that some of these options divide an utterance into constituents which
appear to violate surface syntactic structure.
(ii) Some Well-Formed Prosodic Choices Appear To Violate Syntactic
Structure. One way to derive a set of prosodic options would be by stipulating a condition which parses a spoken string exhaustively into any of the
set of well-formed syntactic constituents. However, if a parsing constraint
exists, it must be weakened to allow for the possibility that at least some of
the prosodic constituents do not correspond to well-formed syntactic constituents. A notorious example is the sentence:
(16) Sesame Street is brought to you by the Children's Television Workshop.
In utterances of this sentence, the 'by' can be grouped either with its following noun phrase (forming two syntactically well-motivated phrases):
(16a) Sesame Street is brought to you, by the Children's Television Workshop,
or with the preceding phrase,
(16b) Sesame Street is brought to you by, the Children's Television Workshop
in which case the left constituent does not correspond to a well-motivated
constituent in traditional syntactic theories (see Section 2.3.).
Gee & Grosjean (1983) provide a further illustration of this point. They
measured pause duration in slowed speech, and interpreted the results in
terms of hierarchical performance structures. In these structures, function
words were often grouped with adjacent content words, even in cases where
there was a major syntactic constituent boundary between the function word
and the following content word (as in the case of a subject pronoun followed
by a verb). For example, speakers' pauses divided the clause:
(17a) He brought out the objections . . .
into the two constituents:
(17b) He brought out | the objections . . .
which do not correspond to the two major surface syntactic phrases of subject NP and VP. Such examples suggest that the organizational boundaries
in an utterance do not delineate the syntactic constituents of the underlying
sentence in a rigid way, either by consistently marking the same types of
syntactic boundaries or by consistently parsing the word string into well-formed syntactic constituents of various types.
Syntax also cannot provide an account of the effects of rate, length,
and symmetry on constituent boundary location, illustrated by the examples
in the next section.
[Figure 1: tone sequences for the underlying representation (UR) and for slow and faster renditions of a Mandarin Chinese phrase. 3 = low; 2 = rising. Rule: 3 -> 2 / __ 3 in same constituent.]
Fig. 1. Comparison of tones produced in slow and fast renditions of this phrase in
Mandarin Chinese. The rule changing Tone 3 to Tone 2 when followed by a 3 in the
same constituent applied to Old but not to Li in the slow version, indicating a constituent
boundary after Li, but to both Old and Li in fast speech (but not to buys), indicating a
constituent boundary after buys (after Cheng, 1970, 1973).
that turns an underlying low tone into a rising tone when followed by a low
tone in the same constituent. This rule operates on old in slow speech but
on both old and Li in more rapid speech, showing that buys must be in the
same constituent as Li in the rapid rendition. Jun (1993) reports parallel
phenomena in Korean.
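The way this rule diagnoses constituent boundaries can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule is applied simultaneously to the underlying tones within each constituent, every word is taken to carry underlying Tone 3, and the grouping of the words after buys is hypothetical. The function name and the list-of-lists representation are ours, not Cheng's.

```python
def apply_third_tone_sandhi(constituents):
    """Apply the rule 3 -> 2 / __ 3 inside each constituent.

    `constituents` is a list of lists of underlying tone numbers.
    Assumption: the rule applies simultaneously to the underlying form,
    so a Tone 3 surfaces as Tone 2 whenever the next tone in the SAME
    constituent is an underlying Tone 3.
    """
    surface = []
    for tones in constituents:
        out = []
        for i, tone in enumerate(tones):
            followed_by_low = i + 1 < len(tones) and tones[i + 1] == 3
            out.append(2 if tone == 3 and followed_by_low else tone)
        surface.append(out)
    return surface


# Hypothetical groupings of a five-word phrase, all underlyingly Tone 3.
# Slow speech: boundary after the second word (Li).
slow = apply_third_tone_sandhi([[3, 3], [3, 3, 3]])
# Faster speech: boundary after the third word (buys).
fast = apply_third_tone_sandhi([[3, 3, 3], [3, 3]])

print(slow)  # [[2, 3], [2, 2, 3]]
print(fast)  # [[2, 2, 3], [2, 3]]
```

In the slow grouping the second word keeps Tone 3 because no Tone 3 follows it inside its constituent; in the faster grouping it surfaces as Tone 2, which is the pattern the text uses to locate the boundary after buys.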
Another factor that seems to play a role is symmetry, or a balance
between the length of subconstituents of an utterance. For example, Gee and
Grosjean (1983) and Grosjean et al. (1979) report that talkers placed a performance constituent boundary in the middle of a syntactic constituent when
this resulted in a more equal partition of the elements of the utterance. That
is, the length of the prosodic subparts of a spoken utterance was determined
in part not by the syntactic structure, but by a tendency to divide the spoken
utterance into equal parts.
Another demonstration that syntax cannot predict all aspects of prosody
comes from non-syntactically-structured word lists. Talkers can 'prosodify'
such lists; that is, they naturally group words into pause-delimited constituents (Suci, 1967), and some factor(s) other than syntax must be determining
these prosodic patterns. However, Suci's further finding that intertalker variability is substantially less for syntactically organized utterances than for
word lists suggests that, even though syntactic structure does not predict all
aspects of prosody, it nonetheless provides a powerful constraint on the
range of ways in which a sentence can be prosodically treated by a speaker.
Some of the discrepancies with syntactic structure that arise in prosody
also displayed themselves in studies of other forms of language behavior.
For example, Martin (1970) found that subjects asked to parse visually presented sentences into 'natural' subgroups produced hierarchical structures,
but did not always parse the sentences following traditional syntactic lines,
showing "differences between prescriptive and subjective sentence organization" (p. 159). In particular, while sentences are thought to follow a S
(VO) syntactic structure in English, subjects often gave (SV) O parsings.
This result is reminiscent of the performance structures obtained by Gee and
Grosjean (1983) from the pause duration patterns observed in slowed speech.
In 2.1. and 2.2. we have seen that, despite a general correspondence
between the surface syntactic structure of a sentence and the prosody of an
utterance of that sentence, there are many discrepancies between the two
sets of structures. It appears that extrasyntactic factors as well as syntactic
factors influence the speaker's choice of prosodic shape for an utterance.
What, then, is the best way to characterize the relationship between syntax
and prosody? This question is currently the topic of lively debate and extensive research; we address it briefly in the final portion of this section.
Summary
In sum, although syntax imposes certain constraints on prosody, other
factors must be invoked to account for which aspects of syntax can and
cannot be violated, which aspects can and cannot be signalled, and which
aspects the speaker chooses to signal or not to signal in a particular utterance, as well as to account for elements of spoken utterances for which
syntax does not provide a prediction. The postulated prosodic component of
the grammar provides a candidate for the organizational structure underlying
spoken utterances that should help provide answers to these questions, and
a locus for the integration of factors from various components that have an
effect on the shape of this structure for any given utterance. In Section 3,
we describe the theory of the prosodic hierarchy.
3. THEORIES OF THE PROSODIC HIERARCHY OF
With the advent of modern prosodic theories, beginning with Liberman's (1975) and Liberman and Prince's (1977) seminal work on metrical
theory, and with the integration of metrical theory and intonational theory
in a single component of the grammar, it became possible to consider the
role of specific proposals for prosodic structure. We will review four well-known prosodic hierarchies proposed by Nespor and Vogel (1986), Hayes
(1989), Beckman and Pierrehumbert (1986) and Pierrehumbert and Beckman
(1988), and Selkirk (1978; 1986 and to appear). These theories (Fig. 2) share
the view that prosodic structure consists of a hierarchy of labelled constituents, and they give an idea of the range of possibilities in the literature.
Additional important views, such as those of Halle and Vergnaud (1987),
Gussenhoven (1984), Jun (1993), Hayes (1995), and others will be referred
[Figure 2: side-by-side diagrams of proposed prosodic hierarchies, with columns including Selkirk and Beckman and Pierrehumbert, and levels including Utterance, Intonational Phrase, Accentual Phrase, Syllable, and Mora.]
Fig. 2. Prosodic constituent hierarchies from the literature; additional important theories,
such as those of Halle and Vergnaud (1987), Liberman (1975), Liberman and Prince
(1977), Gussenhoven (1988) and others are discussed in the text.
[Figure 3: prosodic tree with nodes labelled MaP/Int.IP, MiP/CG, MiP, Pwd, and F over an utterance that includes the words "Massachusetts Supreme".]
Fig. 3. Prosodic constituent boundaries for the utterance illustrated acoustically in Fig.
4a; digitized examples of critical utterances from the paper are available via anonymous
ftp from lexic.mit.edu.
While the Strict Layering Hypothesis appears to hold in the great majority of cases, it is thought that some prosodic constituents may occasionally
directly dominate constituents two or even three levels down in the hierarchy, e.g., Prosodic Words may directly dominate Feet, 1 level down, along
with (optionally) Syllables, 2 levels down (Inkelas, 1989; Selkirk, to appear
and references). Selkirk (to appear) presents a case for the Minor Phrase
directly dominating a Prosodic Word as well as a Syllable (see Section
3.2.3.). Likewise, in some views, constituents are thought to directly dominate other constituents of the same type: e.g., Ladd (1986) argues that sentences which contain two or more Intonational Phrases have their own
intonation contour which is more than just the sum of the two Intonational
Phrase contours. Ladd proposes that an Intonational Phrase can directly dominate another Intonational Phrase, so that Intonational Phrases can be nested
recursively. Also, a Prosodic Word can dominate another Prosodic Word in
Selkirk (to appear) and in Inkelas (1989).
The exceptions to Strict Layering cited above suggest that this principle
is not an exceptionless rule, but instead may be at most a strong tendency.
In order to allow for the types of cases where Strict Layering does not appear
to apply, the Strict Layering Hypothesis has been broken down into a set of
four violable constraints within an optimality theory framework (Ito and
Mester, 1992); see 2.3. The four constraints are listed below as they appear
in Selkirk (to appear: 3).
(a) Headedness: A constituent of level Cj in the Prosodic Hierarchy must
dominate a constituent of level Cj-1 (i.e., of the next level down).
(b) Layeredness: A constituent of level Cj in the Prosodic Hierarchy may not
dominate a constituent of level Cj + n (i.e. of a higher level).
(c) Exhaustivity: A constituent of level Cj in the Prosodic Hierarchy may not
dominate a constituent of level Cj - (1 + n) (i.e., of more than one level
down). This constraint would be violated in the case that a Prosodic Word
dominates a Syllable, for example.
(d) Nonrecursivity: A constituent of level Cj in the Prosodic Hierarchy may
not dominate a constituent of the same level Cj.
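To make the four constraints concrete, the following Python sketch checks a small prosodic tree against them. It rests on several assumptions that are ours rather than Selkirk's: the ordering of levels in `LEVELS` is an illustrative composite, trees are written as `(label, children)` tuples, Layeredness, Exhaustivity and Nonrecursivity are checked on immediate domination, Headedness is satisfied by a constituent of the next level down at any depth, and childless nodes are treated as unexpanded.

```python
# Assumed top-to-bottom ordering of levels; the labels are illustrative only.
LEVELS = ["Utt", "IP", "MaP", "MiP", "Pwd", "Foot", "Syl"]
RANK = {label: depth for depth, label in enumerate(LEVELS)}


def descendants(node):
    """Yield the labels of every node dominated by `node`, at any depth."""
    _, children = node
    for child in children:
        yield child[0]
        yield from descendants(child)


def violations(node):
    """Return a list of (constraint, parent_label, child_label) tuples."""
    label, children = node
    rank = RANK[label]
    found = []
    # Headedness: an expanded constituent must dominate a constituent of
    # the next level down. Childless nodes count as unexpanded here.
    if children and rank + 1 < len(LEVELS):
        if LEVELS[rank + 1] not in set(descendants(node)):
            found.append(("Headedness", label, None))
    for child in children:
        child_rank = RANK[child[0]]
        if child_rank < rank:          # child is higher in the hierarchy
            found.append(("Layeredness", label, child[0]))
        elif child_rank == rank:       # child is of the same level
            found.append(("Nonrecursivity", label, child[0]))
        elif child_rank > rank + 1:    # child is more than one level down
            found.append(("Exhaustivity", label, child[0]))
        found.extend(violations(child))
    return found


# A Minor Phrase that directly dominates a Syllable and a Prosodic Word.
skipping = ("MiP", [("Syl", []),
                    ("Pwd", [("Foot", [("Syl", []), ("Syl", [])])])])
# An Intonational Phrase that directly dominates two Intonational Phrases.
nested = ("IP", [("IP", [("MaP", [])]), ("IP", [("MaP", [])])])

print(violations(skipping))  # [('Exhaustivity', 'MiP', 'Syl')]
print(violations(nested))    # two ('Nonrecursivity', 'IP', 'IP') entries
```

Under these assumptions, the level-skipping tree violates only Exhaustivity and the nested tree violates only Nonrecursivity, in line with the observation that these are the two constraints that appear to be violable.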
While Headedness and Layeredness are not known to be violated, Exhaustivity and Nonrecursivity do appear to be violable. (19) shows a violation of Exhaustivity proposed in Selkirk (to appear); (20) shows a violation
of Nonrecursivity proposed in Ladd (1986).
(19)
        MiP
       /   \
     Syl   Pwd
           /  \
         Syl  Syl
(20)
        MP
       /  \
     MP    MP
Prosodic constituents have been defined in several different frameworks: in terms of the domains of phonological rules (e.g., The Utterance
is the domain of application of /r/ epenthesis in British English, Nespor and
Vogel, 1986), in terms of intonation (e.g., an Intermediate Intonational
Phrase in English is the span of a coherent intonation contour that includes
a certain pattern of phonological tonal elements, Beckman and Pierrehumbert, 1986), and in terms of rhythmic prominence (e.g., a Foot in English
consists of a full-vowel syllable followed by the unstressed syllables in the
same word). Some theorists view prosodic constituents as members of a
single hierarchy, and for presentation purposes we will adopt this view (see
Gussenhoven, 1992, for an alternative position). Recently, studies have appeared which explicitly test the notion that, e.g., intonationally defined constituents are the domain of application of phonological rules; see Jun (1993).
In this section we discuss each proposed constituent in the prosodic
hierarchy, as summarized in Fig. 2. For each constituent (Utterance, Intonational Phrase, Phonological Phrase, Clitic Group, Foot, Syllable, and
Mora), we will present the definitions that have been proposed and some of
the evidence that has been invoked. Evidence that supports the generally
hierarchical structure of spoken utterances, without relying on the detailed
assumptions of a particular theoretical proposal, is reserved for Section 4.1.
3.2.1. The Utterance
The Utterance has been proposed as the largest unit in the prosodic
hierarchy: It is the largest span of application of phonological rules (Selkirk,
1978, 1980; Nespor & Vogel, 1986; Hayes, 1989) and its boundaries are
It is generally agreed that the Intonational Phrase is a prosodic constituent which is intonationally defined, and is the domain of a perceptually
coherent intonational contour, or tune. According to Pierrehumbert (1980),
the Intonational Phrase contains a specified sequence of phonological elements: Nuclear Pitch Accent followed by a Phrase Accent and a Boundary
Tone (additional Prenuclear Pitch Accents are optional). Her theory differs
from previous formulations in several ways, notably in its sparse phonological specifications of the intonationally significant elements of the utterance,
and its restriction to tone level targets of High and Low, with neither Mid
tone targets, nor target movements such as Rise and Fall; apparent rises and
falls are captured by appropriate sequences of High and Low elements (see
below and Section 3.3. for a more complete summary).
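The required sequence of elements can be illustrated with a small well-formedness check. This is a simplified sketch, not Pierrehumbert's grammar: we assume the common labelling convention in which pitch accents are starred (H*, L*), phrase accents carry a hyphen (H-, L-), and boundary tones carry a percent sign (H%, L%), and we ignore bitonal accents and any other elements. The function name `parse_tune` is hypothetical.

```python
PITCH_ACCENTS = ("H*", "L*")     # assumed labels; bitonal accents are omitted
PHRASE_ACCENTS = ("H-", "L-")
BOUNDARY_TONES = ("H%", "L%")


def parse_tune(labels):
    """Split a tone-label sequence into its parts, or return None if ill-formed.

    A well-formed tune here is: zero or more prenuclear pitch accents,
    one nuclear pitch accent, one phrase accent, one boundary tone.
    """
    if len(labels) < 3:
        return None
    *accents, phrase_accent, boundary_tone = labels
    if not all(accent in PITCH_ACCENTS for accent in accents):
        return None
    if phrase_accent not in PHRASE_ACCENTS or boundary_tone not in BOUNDARY_TONES:
        return None
    return {
        "prenuclear": accents[:-1],   # every pitch accent before the last one
        "nuclear": accents[-1],       # the last pitch accent is the nuclear one
        "phrase_accent": phrase_accent,
        "boundary_tone": boundary_tone,
    }


print(parse_tune(["H*", "H*", "L-", "L%"]))
print(parse_tune(["L-", "L%"]))  # None: no nuclear pitch accent
```

The sketch encodes only the ordering requirement stated above; how these elements are realized as F0 targets is a separate question.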
man & Pierrehumbert, 1986), and its left edge has been described as the
locus for Early Accent Placement within the word in American English
(Horne, 1990; Shattuck-Hufnagel, 1988, 1992a, 1995; Shattuck-Hufnagel et
al., 1994) as well as for glottalization of vowel-onset words (Pierrehumbert
& Talkin, 1992; Dilley et al., 1994; Dilley and Shattuck-Hufnagel, 1995).
The Full Intonational Phrase permits glottalization of vowel-onset words at
its left edge, while the Intermediate Intonational Phrase does not (Dilley et
al., under revision).
A number of other theories define the Intonational Phrase in other ways,
including the British approach (associated with O'Connor and Arnold and
with Halliday) which defines the phrase in terms of the structural units of
head, nucleus and tail, and the Dutch approach developed by 't Hart, Collier
and colleagues which specifies a small set of rises, falls and level tones for
each language, some of which are prominence-lending. For further discussion of these views, see O'Connor and Arnold (1961), Halliday (1967),
Cruttenden (1986) especially Chapter 3, and 't Hart and Collier (1975) summarized in 't Hart et al. (1990). Additional systems have been developed
by Pike (1945), Chafe (1980) and others.
3.2.3. The Phonological Phrase
(23)
[Tree diagram: syntactic structure (IP, NP, VP, DP, AP, AdvP) over the sentence below, with Major Phrase (Maj Phr) and Minor Phrase (Min Phr) bracketing beneath it.]
[[Blue]AP [aphids]N]NP [[completely]AP [covered]V [[the]Det [new]AP [buds]N]NP.
at
cannot be reduced.
        MaP
       /   \
    MiP     MiP
     |       |
    Pwd     Pwd
     |       |
    Foot    Foot
     |       |
    syl     syl
.... look    at ....
Whereas Selkirk's Major Phrases are proposed to align with the edges
of entire maximal projections which are not lexically governed, her Minor
Phrase is proposed to align with either the left or right edge of heads of
maximal projections which are not lexically governed. The Minor Phrase
"groups together a phrasal head and adjacent modifiers and functional elements, as in Det Adj Noun sequences in Italian, English, French Modern
Greek (see Nespor & Vogel, 1986; Selkirk, 1986)" (Selkirk, to appear: 20).
As shown in the above example, the Minor Phrase in English aligns with
the right edge of heads of non-lexically governed maximal projections.
In other theories (Selkirk, 1978; Nespor & Vogel, 1986; and Hayes,
1989), only a single Phonological Phrase is proposed. In some cases, the
proposed Phonological Phrase corresponds most closely with Selkirk's Major Phrase, and in other cases, the Phonological Phrase corresponds more
closely to the Minor Phrase. It remains to be seen whether both a large and
a small Phonological Phrase are used in all languages.
A correspondence between the intonationally defined Intermediate Intonational Phrase (Beckman & Pierrehumbert, 1986) and a syntactically constrained constituent such as the Major Phrase has been suggested for
Japanese (Selkirk and Tateishi, 1988, 1991), Bengali (Hayes and Lahiri,
1991) and Korean (Jun, 1993). Selkirk and Tateishi (1988, 1991) found that
this domain coincides with the left edge of a maximal projection (which
aligns with a Major Phrase boundary). Hayes and Lahiri (1991) and Jun
(1993) suggest a correspondence between the intonational phrase and the
domain of segmental phonological phenomena which occur in constituents
roughly the size of the syntactically constrained Major Phrase.
Selkirk and Tateishi (1988, 1991) and Selkirk (to appear) suggest that
the Minor Phrase corresponds to the Accentual Phrase in Japanese (a unit
equal in size to or larger than a prosodic word in which no more than one
pitch accent occurs, Beckman and Pierrehumbert, 1986). Jun (1993) presents
evidence consistent with this view for Korean. In English it is less likely
that the Minor Phrase can be defined intonationally; it may be the case that
different types of constraints/definitions are appropriate for similar-sized
units in different languages.
3.2.4 Clitic Group
The Clitic Group contains at most one content word with (optionally)
adjacent monosyllabic function words (clitics). There is ample evidence that
monosyllabic function words (including prepositions, determiners, pronouns,
auxiliaries, modals, complementizers, and conjunctions) are realized quite
differently from content words (nouns, verbs, adjectives, some adverbs) in
continuous speech. In particular, while content words always bear some type
of lexical stress, function words often, but not always, surface in a reduced
form which is phonologically related to the full form: 'is' can surface as
'[əz]'; 'him' can surface as '[əm]'; 'have' can surface as '[v]', etc. (see
Selkirk, 1984; Inkelas and Zec, 1993). In addition, these function words
often appear to be closely grouped with an adjacent content word. For example, Grosjean et al. (1979) and Gee and Grosjean (1983) found that
talkers produced a much shorter pause between they and offered in They
offered . . . than between John and asked in John asked . . . . The Clitic Group
is proposed to account for this close linking and potential for reduction.
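The grouping described here can be sketched as a simple procedure. The sketch makes assumptions that the theories themselves do not agree on: function words are attached to the following content word, and any function words left over at the end are attached to the preceding group. The set of function words is supplied by the caller, and the function name is hypothetical.

```python
def clitic_groups(words, function_words):
    """Group words so that each group holds at most one content word.

    Assumption: function words attach to the FOLLOWING content word;
    function words left over at the end join the preceding group.
    """
    groups, pending = [], []
    for word in words:
        pending.append(word)
        if word.lower() not in function_words:
            groups.append(pending)   # a content word closes the group
            pending = []
    if pending:                      # trailing function words
        if groups:
            groups[-1].extend(pending)
        else:
            groups.append(pending)
    return groups


FUNCTION_WORDS = {"they", "the", "a", "is", "him"}  # illustrative subset

print(clitic_groups(["They", "offered"], FUNCTION_WORDS))  # [['They', 'offered']]
print(clitic_groups(["John", "asked"], FUNCTION_WORDS))    # [['John'], ['asked']]
```

With this attachment direction, They offered forms a single group while John asked forms two, matching the pause pattern reported above. A theory that attaches function words to the preceding content word would need the opposite direction.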
The close grouping of function words with adjacent content words has
been described in several different ways. In Selkirk (to appear) and Inkelas
and Zec (1993), the Phonological Phrase (Minor Phrase in Selkirk's theory)
and Prosodic Word are used to group function words with either following
or preceding content words (see Section 3.2.3 and 3.2.5). In Hayes (1989)
and Nespor and Vogel (1986), on the other hand, a function word is always
grouped into a Clitic Group with an adjacent content word. Theories differ
as to the exact definition of the term 'clitic' and as to whether function
words group with following or preceding content words (see discussions in
Nespor and Vogel, 1986; Hayes, 1989; and Inkelas & Zec, 1993).
Selkirk's Minor Phrase provides some of the functions of the Clitic
Group proposed by Nespor and Vogel (1986) and Hayes (1989). As described in the following section, the Clitic Group contains at most one content word with (optionally) adjacent function words. According to Selkirk
(to appear: 19), the Minor Phrase can also serve to group function words
with (an) adjacent content word(s), as shown in (25):
(25)
        MiP
       /   \
     Syl   Pwd
      |     |
      |    Foot
      |    /  \
      |  Syl  Syl
     the sto - ry
The function word's lack of Prosodic Word status accounts for the fact
that a function word in this position is reduced: It is not the head of a foot,
and thus cannot be stressed.
Hayes presents phonological arguments for this constituent, citing English rules of /v/-deletion and palatalization (from Selkirk, 1972) as rules
which operate within the Clitic Group domain, e.g.:
(26) Will you {save me} a seat?
(27) {Is Sheila} coming?
In Hayes' view, the /v/ in {save me} is deleted and the /z/ in {is Sheila}
can be palatalized, since {save me} and {Is Sheila} are Clitic Groups. In the
following examples, /v/-deletion does not apply because {save} and {Mom}
are two separate Clitic Groups, and in {Carla's} {shower}, the /z/ is not
palatalized, since {Carla's} and {shower} are separate Clitic Groups.
(28) {Save} {Mom}
(29) {Carla's} {shower}
Intuitions may differ on the palatalizability of this /z/; again, arguments from
text are not compelling, if this phrase can be produced optionally as one or
two prosodic constituents, with the /z/ palatalized only when it is produced
as one.
3.2.5. The Prosodic Word
While Hayes (1989), Selkirk (1978, to appear), Beckman and Pierrehumbert (1986), and Nespor and Vogel (1986) all include the Prosodic Word
in their respective hierarchies, definitions for this constituent differ in several
key respects. Namely, they differ in how function and content words are
parsed into Prosodic Words, and also in how different types of morphemes
are parsed into Prosodic Words.
Prosodic Word (a): Function vs. Content Word Distinction
words are Prosodic Words which form Clitic Groups with adjacent content
words. In Selkirk's (to appear) and Inkelas and Zec's (1993) accounts, on
the other hand, content words, but not function words, are parsed into Prosodic Words at the output of the lexical component of the grammar. According to Selkirk (to appear), function words which procliticize are attached
to Minor Phrases during the postlexical component of the grammar (example
25); unstressed function words which encliticize at the postlexical level adjoin to a Super Prosodic Word which dominates the function word and the
Prosodic Word dominating the preceding content word (as in the phrase
need him, where him is reduced):
(30) [PWd [PWd ... need] him ...]
Note that when the him is not reduced, it has a different structure, like
the one in (31). When function words are Major-Phrase-final in English (and
are not what Inkelas and Zec (1993) would call 'clitics'), they form Prosodic
Words on their own and as a result surface in strong form (e.g., That's the
one to look for):
(31) [MaP [MiP [PWd ... look]] [MiP [PWd [F for ...]]]]
longer acoustic closure duration and longer aspiration duration (VOT) than
word-medial stops in the same lexical stress environment. Similarly, Krakow
(1989) showed different patterns of labial and velic articulation for word-initial vs. word-medial /m/; the difference in articulatory kinematics for
word-initial vs. word-medial nasals was quantitative, rather than qualitative,
when the consonants were both in the same position in their respective
syllables (i.e. syllable-initial).
Although a range of constituents in the hierarchy have been shown to
exhibit pre-boundary lengthening (Ladd and Campbell, 1991; Wightman et
al., 1992), it is still unclear whether a Prosodic Word boundary per se induces such lengthening.
3.2.6. The Foot
The term 'Foot' has been used in the literature to refer to two distinct
types of units. The Foot referred to most often by generative phonologists
is what we will call the Within-Word Foot, sometimes called the Rhythmic
Stress Foot. In English, this Foot contains at most one lexically-stressed
syllable, followed by zero or one (or in some formulations two) reduced
syllables, and may not extend beyond a lexical word boundary. In studies
of the role of the Foot in determining lexical stress in the word (Kiparsky,
1979; Hayes, 1981; Halle and Vergnaud, 1987), the 'word' referred to is the
lexical word, and so this constituent might be called a Within-Lexical-Word
Foot.
The Within-Word Foot differs from the Abercrombian, Cross-Word-Boundary Foot (Abercrombie 1965, 1973), which can combine fragments
of adjacent lexical words and is thought to extend in most cases from one
pitch-accented syllable to just before the next one. For example, Abercrombie (1973: 11) gives:
(32) | Know then thy- | -self, pre- | -sume not | God to | scan | ^ |,
where the Foot represented by | -self, pre- | crosses at least a lexical word
boundary, if not other constituent boundaries as well.
[Tree diagram for digest it, with PWd and Ft nodes.]
The mora is a unit smaller than the syllable whose existence is uncontroversial in languages such as Japanese, but somewhat more controversial
in other languages such as English. It is dominated by the syllable and
dominates a segment (either a vowel or consonant) or segments. A syllable
contains at least one mora and normally contains no more than two. For
example, in Japanese, a commonly cited moraic language, C0V and V syllables are considered monomoraic, whereas C0VN (N = nasal consonant)
and C0VQ (Q = the first member of a geminate consonant) syllables are
considered bimoraic (McCawley, 1968; Otake et al., 1993).
(33) [σ [μ ho] [μ ...]]
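The moraic classification of Japanese syllable shapes described above can be expressed as a small function. This is a minimal sketch under stated assumptions: the function names and the encoding of syllables as strings over C, V, N and Q are hypothetical conveniences, and only the four shapes mentioned in the text are covered (long vowels, for instance, are not handled).

```python
import re

# Zero or more onset consonants, one vowel, and an optional moraic
# nasal (N) or first half of a geminate (Q).
_SHAPE = re.compile(r"C*V([NQ]?)")


def mora_count(shape):
    """Return 1 for C0V and V syllables, 2 for C0VN and C0VQ syllables."""
    match = _SHAPE.fullmatch(shape)
    if match is None:
        raise ValueError(f"unsupported syllable shape: {shape!r}")
    # The onset never contributes a mora; only N or Q adds a second one.
    return 2 if match.group(1) else 1


def word_mora_count(shapes):
    """Sum the mora counts of a word given as a list of syllable shapes."""
    return sum(mora_count(shape) for shape in shapes)


print(mora_count("CV"))                 # 1
print(mora_count("CVN"))                # 2
print(word_mora_count(["CVQ", "CV"]))   # 3
```

Because the onset is matched but ignored, the sketch also reflects the point that mora count depends on the rime rather than on the onset.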
quantity sensitive, syllables which contain long vowels are considered heavy
and tend to attract stress. In some languages, syllables which contain VC
rimes are also considered heavy, whereas in other languages, syllables with
VC rimes are considered light. Hyman (1985, as cited in Kenstowicz, 1994)
argues that light syllables are monomoraic, while heavy syllables are bimoraic. The mono- vs. bi-moraic nature of the syllable thus depends on
properties of the syllable rime, and does not appear to be sensitive to properties of the syllable onset. We refer the reader to McCawley (1968), Port
et al. (1987), Katada (1990) and Otake et al. (1993) for discussions of and
evidence for this unit, and to Hayes (1989a) for a theory which implicates
the mora in explanations of phonemic vowel length distinctions (quantity
sensitivity) in various languages not traditionally thought to make use of
this constituent.
Pierrehumbert and Beckman (1988) report that certain aspects of F0
slope in a Japanese Accentual Phrase are negatively correlated with the
number of morae the phrase contains.
Summary
This review of the constituents proposed in several prosodic hierarchies
suggests that (a) there are substantial disagreements about the nature and
definition of the constituents, particularly at levels between the syllable and
the intonational phrase, and (b) there is also a modicum of common ground,
particularly at the intonational and sublexical levels. There may also be
differences across languages: It is unclear to what extent speakers of different languages make use of all the units in the hierarchy. Despite the lack of
universal agreement on the nature of the constituents in the hierarchy and
on whether all prosodic constituents belong in the same hierarchy, a number
of investigators have begun to test the hypothesis that one or another of the
proposed constituents plays a role in the representations that guide human
language behavior. We will summarize some of their findings in Section 4.
First, however, we turn to the second of the two major aspects of prosodic
structure: the pattern of relative prominences of an utterance. In addition to
proposals for the hierarchy of constituents, the prosodic literature contains
a number of proposals for the hierarchy of prominences, which we review
in the following section.
3.3. The Hierarchical Organization of Prosodic Prominences
greater or lesser degrees of it. He suggested that there were only two kinds
of prominence contrasts: on the one hand, full vowels vs. reduced vowels,
and on the other, pitch-accented full vowels vs. non-pitch-accented full vowels (where a Pitch Accent is an intonationally-cued phrase-level prominence). In his view, only full vowels are candidates for the phrase-level
prominence conveyed by Pitch Accents, and the placement of Pitch Accents
on certain of the full-vowel syllables of an Intonational Phrase is governed
primarily by the speaker's tendency to place one accent as early as possible
in the phrase and another as late as possible. This two-accents-per-phrase
view is embodied most persuasively for American English by the typical
intonation contour of short simple declarative statements like The sky is blue.
This contour is also described in terms of the F0 'hat pattern' in the IPO
system developed for Dutch (see 't Hart et al., 1990) and extended to American English by Maeda (1974). Bolinger's view suggests that primary and
secondary word stress differ not in degree or type of articulatory or acoustic
prominence, but in the instructions they provide for the placement of pitch
accent. That is, in English, a phrase-final accent occurs on the main-stress
full vowel of its word, with phrase-initial accent possible on a secondary
stress syllable that precedes the main stress, in words like antique or Massachusetts.
Bolinger's view is compatible with the four-level prominence system
described by Vanderslice and Ladefoged (1972) and Ladefoged (1975). Ladefoged contrasted (a) Nuclear Pitch Accented syllables, (b) Accented
stressed syllables, (c) non-accented Stressed syllables and (d) Reduced syllables. The major difference is that Ladefoged's 'Accented' syllables were
not defined intonationally and so his analysis did not provide a specific
account of Prenuclear Pitch Accents that occur before the intonationally-defined Nuclear Pitch Accent of an Intonational Phrase.
Evidence for Elements in the Prosodic Prominence Hierarchy
A critical prediction made by recent theories of the prosodic prominence hierarchy is that different types or levels of prominence are signalled
by different dominant acoustic cues. Beckman and Edwards (1990, 1994)
propose that Stressed syllables are distinguished from Reduced syllables by
quality, duration, and possibly amplitude, while Nuclear Pitch Accented syllables are distinguished from non-accented Full-Vowel syllables by an F0
marker. Stevens (1994) has reported that, for speakers of American English,
the glottal excitation waveform differs for the three categories of Nuclear
Accented, post-nuclear Full-Vowel (i.e., non-accented) and Reduced syllables. This result can be interpreted as support for the claim that these three
types of prominence correspond to different levels in the hierarchy. Sluijter
(1995) presents evidence that speakers of Dutch and American English distinguish between Pitch-Accented (i.e. focussed) and non-accented Full-Vowel syllables via F0 levels. More surprisingly, from the point of view of
theories that predict only a 4-way prominence contrast, she reports that
speakers of both languages distinguish primary from secondary stressed
vowels even in non-Pitch-Accented contexts, via the relative level of energy
at high frequencies. In general, the results of quantitative studies support the
view that prosodic prominence is not a single parameter, but that there are
different types or levels of prosodic prominence, each associated with a different
dominant acoustic cue or set of cues.
Another line of evidence distinguishing types or levels in the prosodic
prominence hierarchy is the finding of Shattuck-Hufnagel et al. (1994) that
Nuclear Pitch Accents almost invariably occur on the lexically-main-stressed
syllable of their word. In contrast (as predicted by several theories), Prenuclear Pitch Accents may occur on a Full-Vowel syllable earlier in the word,
as in such phrases as:
(34) the MASsachusetts legisLAtion
where upper case indicates a Pitch Accented syllable. Beckman and colleagues (1990) present similar results for phrases like
(35) CHinese anTIQUES.
Apparently, constraints on the within-word placement of Nuclear Pitch Accents differ from those on Prenuclear Accents, further supporting the theoretical distinction between these two types or levels of prominence.
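The contrast between Nuclear and Prenuclear accent placement can be made concrete with a short sketch. The function name, the three stress labels, and the syllabification and stress marking of Massachusetts below are illustrative assumptions, not part of any of the cited analyses; the sketch also treats the Nuclear constraint as categorical, whereas the finding reported above is that it holds almost invariably.

```python
def accent_sites(syllables, nuclear):
    """Return the indices of syllables that may carry a Pitch Accent.

    syllables: list of (text, stress) pairs, where stress is one of
        'primary', 'secondary' (both full vowels) or 'reduced'.
    nuclear: True for a Nuclear Pitch Accent, False for a Prenuclear one.
    """
    main = next(i for i, (_, stress) in enumerate(syllables)
                if stress == "primary")
    if nuclear:
        # Nuclear accents fall on the lexically main-stressed syllable.
        return [main]
    # Prenuclear accents may also fall on an earlier full-vowel syllable.
    return [i for i, (_, stress) in enumerate(syllables[:main + 1])
            if stress != "reduced"]


# Assumed marking: secondary stress on the first syllable, main stress
# on the third, reduced vowels elsewhere.
massachusetts = [("mas", "secondary"), ("sa", "reduced"),
                 ("chu", "primary"), ("setts", "reduced")]

print(accent_sites(massachusetts, nuclear=True))   # [2]
print(accent_sites(massachusetts, nuclear=False))  # [0, 2]
```

Index 0 in the Prenuclear case corresponds to the early accent in (34), while index 2 is the only site available to a Nuclear accent.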
3.4. Linking the Prominence and Constituent Hierarchies
This theory draws much of its power from its integration of the hierarchy of constituents with the hierarchy of relative prominences. In addition,
Beckman and Edwards provide evidence from articulatory studies to support
some of their claims (Edwards et al., 1991). However, the proposal leaves
several levels in the prosodic constituent hierarchy without well-defined
heads, and suggests no specific constituent for which Prenuclear Pitch Accents could serve as heads, at least in American English.
One final point about intonational prominence: the term 'stress' has
often been used to refer to prominence at any level in the hierarchy, e.g. to
both lexical stress and phrasal prominence. Since recent investigations suggest that the dominant acoustic cues to lexical and phrasal prominence are
different, it is useful to define the term 'stress' whenever it is used, and to
employ the term 'prominence' to refer to the generic quality shared by all
levels.
In Section 3 we have laid out several theoretical proposals for prosodic
constituent and prominence hierarchies, which provide an alternative description for the structural organization of utterances, beyond traditional
morphosyntactic structure. In addition, we have sampled the kinds of evidence that have been used to argue for each prosodic component. We turn
now to more general evidence for the relevance of prosodic structure in
speech production and perception.
4. FURTHER EVIDENCE FOR PROSODIC STRUCTURE
We begin with a summary of some of the methods that have been used
to evaluate prosodic theories (4.1.), before turning to examples of evidence
for the generally hierarchical nature of the organization of spoken utterances
(4.2.), and of explicit empirical comparisons between syntactic and prosodic
accounts of utterance organization (4.3.).
4.1. Types of Evidence for Prosodic Elements
(1992) and Dilley et al. (1994) looked at vowel onset glottalization at the
left edge of intonational constituents, Ladd and Campbell (1991) looked at
preboundary lengthening at the right edge of intonational constituents, and
Turk and Sawusch (1995) examined the domain of lengthening associated
with pitch accents.
Evidence for the psychological reality of prosodic constituents also
comes from studies of perception, memory, and other aspects of language
behavior. Several laboratories have employed unit monitoring tasks to test
the hypothesis that the initial perceptual organization of spoken utterances
occurs in terms of prosodic constituents, and that these constituents may
differ across languages (Mehler et al., 1981; Cutler et al., 1986; and Otake
et al., 1993). Another type of evidence comes from listeners' preferences
for interruption points at constituent boundaries rather than within constituents. Several studies suggest that listeners find passages which have pauses
artificially inserted within constituents unnatural (Pilon, 1981; Wakefield et al., 1974). Gerken et al. (1994) suggest that the units whose integrity infants prefer may be prosodic constituents, although a critical
comparison of infant preferences for uninterrupted syntactic vs. prosodic
constituents has not yet been reported. Finally, the method of unit extraction
has been employed, in which the infant is familiarized with a list of spoken
words and then presented with a story that either does or does not contain
those words. The degree to which the infant prefers to listen to the story
containing the familiar words is taken as a measure of the ability to extract
those words as units (Jusczyk & Aslin, 1995). Work is in progress to determine whether the infant is extracting morphosyntactic or prosodic units.
The evaluation of phonological, acoustic-phonetic and other behavioral
evidence that bears on the psychological reality of the constituents of the
prosodic hierarchy is still in its infancy. However, taken together, currently available evidence provides considerable support for the claim that speakers
make active use of prosodic elements in the production of spoken utterances,
and that systematic variations in the phonetic realization of phonemic segments and features depend at least in part on prosodic structure. What is
the evidence that these individual prosodic constituents are organized into a
hierarchical structure, and that this hierarchy provides a better account of
the observable facts of continuous speech than does syntax?
4.2. Evidence for the Generally Hierarchical Nature of the
Organizational Structure of Spoken Utterances
boundary lengthening increases for final boundaries at increasingly higher-level prosodic constituents (Wightman et al., 1992; Ladd and Campbell,
1991). Gussenhoven and Rietveld (1992) showed that listeners are sensitive
to this variation. Other studies, focusing on constituent-initial rather than
constituent-final phenomena, show that articulation of constituent-initial consonants
is in some sense stronger than in other positions, and that this marker increases for initial segments at increasingly higher-level prosodic constituents
(Cooper, 1994; Dilley et al., 1994, under revision; Krakow, 1989; Fougeron
and Keating, 1995 for English; Fougeron, 1996 for French). Still other studies provide evidence for the distinction between two adjacent levels in the
constituent hierarchy, such as Dilley et al. (1994, under revision) for the
Full vs. Intermediate Intonational Phrase in English, and Jun (1993) for the
Accentual Phrase vs. Intonational Phrase in Korean. Taken together, these
results provide strong evidence for a hierarchical organizational structure in
spoken utterances, and provide suggestive evidence that this structure corresponds to that in the prosodic hierarchy. Evidence would be more persuasive, however, if it came from direct comparisons of the predictions made
by prosodic and syntactic structure. In the next section, we describe several
studies which provide just such comparisons.
4.3. Prosodic vs. Syntactic Structure as the Organizational Principle
of Spoken Utterances
A number of studies reviewed above support the notion that the organizational structures underlying spoken utterances correspond to prosodic
rather than syntactic hierarchies. The most compelling evidence for this
claim, however, would be provided by studies that directly contrast the predictions of prosodic and syntactic structures for the same utterances. Only
a few investigations have taken this tack. Gee and Grosjean (1983) compared
their performance structures with syntactic hierarchies as predictors of pausing behavior in the same utterances. As noted above, several key differences
emerged between syntactically motivated predictions and the prosodic facts.
In the performance structures, function words (e.g. pronoun subjects) were
separated from content words by only the shortest pauses, although a syntactic major boundary divided them. Also, pauses tended to bisect sentences
into approximately equal halves: performance structures thus appeared to be
symmetrical and balanced, whereas balance is not a requirement for syntactic
structures. Gee and Grosjean proposed that these performance structures reflect prosodic, rather than syntactic, structure.
Using a different measure, Ferreira (1991) also directly compared the
predictions from prosodic and syntactic structures. Final or pre-boundary
lengthening has traditionally been thought to occur at the right edge of
syntactic constituents (Klatt, 1976; Cooper and Paccia-Cooper, 1980). Ferreira presents evidence showing that a hierarchy of syntactic constituents
cannot predict the amount of pre-boundary lengthening in a set of controlled
stimuli, and that Selkirk's version of the prosodic hierarchy appears to do a
better job. Ferreira analyzed sets of phrases like the last three in (36) (a)
and (b), which have different numbers of syntactic boundaries but the same
number of prosodic boundaries after the word cop. (See Section 3 for definitions of PWd (Prosodic Word), PPhr (Prosodic Phrase) and IntPhr (Intonational Phrase).) When the target phrase is produced in longer sentence
contexts, these three target phrases show the same degree of pre-boundary
lengthening on 'cop' compared with the control phrase shown in the first
example of each set, as predicted by their identical prosodic status. In contrast, the substantial differences in number and type of syntactic boundaries
are not reflected in the measurements of preboundary lengthening. This observation supports the notion that aspects of an utterance that involve timing
are best accounted for in terms of the prosodic rather than the syntactic
hierarchy.
(36) Prosodic vs. syntactic boundaries
(a) Syntactic boundaries
[The cop [who's [[ [a friend] NP] VP] S] S-overbar] NP] (Control)
[The [friendliest] AdjP cop] NP
[The friend [of [the cop] NP] PP] NP]
[The man [who's [[ [a cop] NP] VP] S] S-overbar] NP]
(b) Prosodic boundaries (according to Selkirk 1986 algorithm)
(((The cop)PWd (who's a friend)PWd)PPhr)IntPhr (Control)
(((The friendliest)PWd (cop)PWd)PPhr)IntPhr
(((The friend)PWd (of the cop)PWd)PPhr)IntPhr
(((The man)PWd (who's a cop)PWd)PPhr)IntPhr
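The comparison in (36) can be checked mechanically by counting the labelled constituents that close immediately after cop in each bracketing. The sketch below is only an illustration of that count, not a reconstruction of Ferreira's procedure: the function name is hypothetical, and the bracketings are copied as they are printed in (36), including their unbalanced brackets.

```python
import re


def closing_labels_after(bracketing, word, close_char):
    """List the labels of constituents that close right after `word`.

    A closing is `close_char` followed by a label such as NP or PWd;
    scanning stops at the first thing that is not such a closing.
    """
    start = bracketing.index(word) + len(word)
    closer = re.compile(r"\s*" + re.escape(close_char) + r"\s*([A-Za-z-]+)")
    labels = []
    match = closer.match(bracketing, start)
    while match:
        labels.append(match.group(1))
        match = closer.match(bracketing, match.end())
    return labels


syntactic = [
    "[The cop [who's [[ [a friend] NP] VP] S] S-overbar] NP]",  # control
    "[The [friendliest] AdjP cop] NP",
    "[The friend [of [the cop] NP] PP] NP]",
    "[The man [who's [[ [a cop] NP] VP] S] S-overbar] NP]",
]
prosodic = [
    "(((The cop)PWd (who's a friend)PWd)PPhr)IntPhr",           # control
    "(((The friendliest)PWd (cop)PWd)PPhr)IntPhr",
    "(((The friend)PWd (of the cop)PWd)PPhr)IntPhr",
    "(((The man)PWd (who's a cop)PWd)PPhr)IntPhr",
]

syntactic_counts = [len(closing_labels_after(s, "cop", "]")) for s in syntactic]
prosodic_counts = [len(closing_labels_after(p, "cop", ")")) for p in prosodic]

print(syntactic_counts)  # [0, 1, 3, 5]
print(prosodic_counts)   # [1, 3, 3, 3]
```

On this count the three target phrases differ syntactically but are identical prosodically, which is the pattern the lengthening measurements are reported to follow.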
A final set of studies directly comparing syntactic and prosodic structure as determinants of phonetic realization is described by Jun (1993) for
Korean. She found that the domains of several postlexical rules in Korean
provide support for two intonationally-defined constituents: the Intonational
Phrase and the Accentual Phrase. Jun argues that these intonational constituents cannot be syntactically defined, and thus that intonational constituents
(rather than other prosodic constituents more closely tied to the syntax) are
the appropriate description of domains of phonological rule application (see
Ewen and Anderson (1987) for further discussion).
In Sections 2, 3, and 4 we have sampled some of the evidence for the
claim that prosodic constituents and prominences provide the organizational
structure of spoken utterances. What are the implications of this pattern of
results for studies of auditory sentence processing? In Section 5 we discuss
5. PRACTICAL MATTERS
The proposed hierarchies of prosodic prominences and constituents described in Section 3 and supported by a body of evidence provide candidate
descriptions for some of the aspects of spoken utterances that are not fully
determined by morphosyntactic structure. In order to make use of these
proposals for studies of spoken sentence processing, it is necessary to understand what prosodic shape the speaker has given to the particular stimulus
utterances to be used in those studies. To this end, we will describe a proposed system for the transcription of prosody called ToBI, along with some
other potential methods for determining prosodic structure (5.1.), summarize
some of the extrasyntactic factors that appear to influence the speaker's
choice of prosodic structure for a particular utterance (5.2.), and touch briefly
on how the prosodic component might function in the grammar (5.3.).
5.1. Tools for Determining the Prosody of a Specific Utterance
Earlier assumptions that prosody could be predicted from the text of a
sentence alone are not supported by the evidence. Unfortunately, our understanding of the additional factors that determine the speaker's choice of
constituent boundary and prominence locations for a particular utterance is
far from complete. These two facts combined might seem to indicate a bleak
picture for investigators who want to study the effects of different prosodic
structures on sentence processing, or at least to control for these effects by
understanding the prosody of stimulus utterances. However, there are tools
available for transcribing the prosodic shape of an utterance post hoc, and
these tools make it possible to select from among candidate utterances the
ones which best fit the requirements of a given study, as well as to indicate
to a reader the prosodic shape of the utterances that were used. We will
describe one such tool, the ToBI transcription system, in some detail, and
survey a number of other proposed systems for describing the prosody of a
particular utterance.
Motivation for the Development of the ToBI Transcription
Convention.
A pervasive and persistent problem in the study of prosody has been
the lack of an IPA-like system, widely accepted and practiced, for the tran-
The ToBI system, with versions under development for German, Japanese, and Korean, is just one of a number of systems that are being developed and tested in different laboratories for various languages around the
world. Several transcription systems based on intonational grammars described earlier are also in active use. For example, 't Hart et al. (1990)
review the system for specifying intonational contours developed for Dutch
at the IPO at Eindhoven in the 1970s. This system is based on a small
number of elements, i.e. rises, falls, and plateaus, that combine to specify
the well-formed intonational contours of the language, and it invokes three
target F0 levels. The IPO-system components have been determined for a
number of languages, including British English, Russian and others, and
were quantified for simple declarative sentences in American English by
Maeda (1974). Another system in active use for transcribing English intonation is the British approach (O'Connor and Arnold, 1961; Halliday, 1967)
which specifies a complex Nuclear Accent element for each phrase, flanked
by a preceding Head and a following Tail. Notable alternative approaches
to intonation in American English include Pike (1945), and Chafe (1980 and
its references) which treats additional components of spontaneous spoken
prosody as well as intonation.
The discussion above emphasizes the perceptual transcription of utterance prosody, but it is also possible to make use of the growing body of
phonological and acoustic phonetic diagnostics for prosodic constituent
boundaries of various types. For example, if the Intermediate Intonational
Phrase is the domain of Early Accent Placement within late-main-stress
words, as proposed by Beckman and Edwards (1994) and observed by Shattuck-Hufnagel et al. (1994), then the occurrence of main-stress accent vs.
early accent in the word Massachusetts might help to disambiguate an utterance of In Massachusetts hospitals don't thrive as:
(39) In MassaCHUsetts, ] HOSpitals don't THRIVE
vs.
aries. She suggests that focus and semantic weight also influence these decisions.
[Figure 5 diagram. Box labels: Syntax; Semantics, Pragmatics, Focus; Length; Rate; Other?; Segmental Phonology; Prosody; Phonetics.]
Fig. 5. One view of the role of the prosodic component of the grammar.
and how these distinctions in the grammar relate to the syntactic component,
are not yet resolved. However, it is clear that a number of factors that
contribute to determining the prosodic structure of an utterance must be
combined at some point. We believe that resolution of this question will
hinge, to some extent, on detailed phonological and acoustic-phonetic analyses of the prosodic and segmental realization of specific utterances.
CONCLUSION
Although prosodic inquiry based on current theory is still in its infancy,
we believe that it already has important implications for the design of auditory sentence processing experiments. We summarize these implications
as four questions that an investigator might ask while constructing studies
of auditory sentence processing using spoken utterances.
(1) What is the prosody of the stimulus utterances?
Prosodic structure cannot be read directly off the signal. For example,
if we assume that the highest F0 peak in a word or phrase occurs on the
most prominent syllable, we will be wrong in many cases: English has a
Low pitch accent, which means that phrase-level prominences will not always correspond to F0 peaks.
Similarly, F0 may continue to rise after a High pitch accented syllable,
so that the actual peak occurs on a following reduced syllable; an example
of this phenomenon is shown in Fig. 6 (utterance available via ftp). Another
example: long duration is not always a cue to prominence; the final syllable
of a phrase is lengthened, even if it is a reduced syllable with the lowest
possible degree of prosodic prominence. Apparently, listeners are able to
distinguish prominence-related from boundary-related duration lengthening,
perhaps because the two factors affect different parts of the syllable (Edwards et al., 1991). As these examples show, it would not be wise to assume
that the highest F0 peaks and/or the longest duration syllables are necessarily
associated with perceptual prominence, since there is a many-to-one and
one-to-many mapping between prosodic structure and these acoustic dimensions. Data about the acoustic signal is more interpretable once the locations
of prominences and boundaries are known for the particular utterance at
hand; prosodic transcription is a useful tool for determining these facts. For
example, in studies of the acoustic correlates of Nuclear Pitch Accent, it is
important to determine that the speaker actually placed the Nuclear Pitch
Accent on the syllable to be measured.
This problem can arise even for the elicitation of simple utterances for
the study of acoustic phonetic patterns. An example is the ubiquitous frame
sentence for simple target word or syllable elicitation in English, "Say targetword again." In our experience, speakers can produce even this simple
sentence with a variety of prosodic patterns, i.e. with different tunes, constituent boundary placements and prominence placements. Some examples
that we have encountered are shown in (41), where upper case indicates a
pitch-accented word and a comma indicates an intonational phrase boundary:
(41) (a) say TARGETWORD again
(b) SAY, TARGETWORD again
(c) SAY targetword AGAIN
Many experimental results can be interpreted in terms of several different types of constituents, e.g. morphosyntactic or prosodic. Most studies,
however, examine only one of these possibilities. Ideally, a given study will
compare the efficacy of both morphosyntactic and prosodic structures as
determinants of observed results. Such comparisons will provide evidence
to test the hypothesis that prosody is a separate component of the grammar.
(4) What is meant by various terms that refer to prosodic elements?
REFERENCES
Abercrombie, D. (1965). Syllable quantity and enclitics in English. In Studies in Phonetics and Linguistics, Oxford University Press.
Abercrombie, D. (1973). A phonetician's view of verse structure. In W.E. Jones and J.
Laver (eds), Phonetics in Linguistics: a book of readings, London: Longman.
Beckman, M. E. (1996). The parsing of prosody. Language and Cognitive Processes (to
appear).
Beckman, M. E., & Edwards, J. (1990). Lengthenings and shortenings and the nature of
prosodic constituency. In J. Kingston and M.E. Beckman (eds), Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech, Cambridge:
Cambridge University Press.
Beckman, M. E., & Edwards, J. (1994). Articulatory evidence for differentiating stress
categories. In P. Keating (ed), Phonological Structure and Phonetic Form: Papers
in Laboratory Phonology III, Cambridge: Cambridge University Press.
Beckman, M. E., & Pierrehumbert, J. (1986). Intonational structure in Japanese and
English. Phonology Yearbook 3, 255-309.
Berkovits, R. (1993). Utterance-final lengthening and the duration of final stop closures.
J. Phonetics 21, 479-489.
Berkovits, R. (1993a). Progressive utterance-final lengthening in syllables with final fricatives. Language and Speech 36, 89-98.
Bickmore, L. (1990). Branching nodes and prosodic categories: Evidence from Kinyambo. In S. Inkelas and D. Zec (eds), The Phonology-Syntax Connection, U. Chicago Press, 1-18.
Bolinger, D. (1958). A theory of pitch accents in English. Word 14, 109-149.
Bolinger, D. (1965). Pitch accent and sentence rhythm. In Bolinger, D.L., Forms of
English: Accent, Morpheme, Order, Cambridge, Mass: Harvard University Press,
163 ff.
Bolinger, D. (1981). Two kinds of vowels, two kinds of rhythm. Manuscript distributed
by the Indiana University Linguistics Club, Bloomington, Indiana.
Brown, E., & Miron, M.S. (1971). Lexical and syntactic predictors of the distribution of
pause time in reading. J. Verbal Learning and Verbal Behavior 10, 658-667.
Chafe, W. (1980). ed., The Pear Stories: Cognitive, Cultural and Linguistic Aspects of
Narrative Production. Norwood, N.J.: Ablex.
Cheng, C.-C. (1970). Domain of phonological rule application. In J.M. Sadock and A.L.
Vanek (eds), Studies Presented to Robert B. Lees by his Students, Edmonton: Linguistic Research 39-60.
Cheng, C.-C. (1973). A Synchronic Phonology of Mandarin Chinese. The Hague: Mouton.
Chomsky, N., & Halle, M. (1968). The Sound Pattern of English. New York: Harper
and Row.
Clements, G. N. (1978). Tone and syntax in Ewe. In D.J. Napoli (ed), Elements of Tone,
Stress and Intonation, Washington, D.C.: Georgetown University Press, 21-99.
Cooper, A. M. (1991). Laryngeal and oral gestures in English /p, t, k/. Proceedings of the
XIIth International Congress of Phonetic Sciences, Aix-en-Provence, Vol. 2, 50-53.
Cooper, W. E., & Paccia-Cooper, J. (1980). Syntax and Speech. Cambridge, Mass: Harvard University Press.
Cruttenden, A. (1986). Intonation. Cambridge: Cambridge University Press.
Cutler, A., Mehler, J., Norris, D. G., & Segui, J. (1986). The syllable's differing role in
the segmentation of French and English. Journal of Memory & Language 25, 385-400.
Dilley, L., & Shattuck-Hufnagel, S. (1995). Variability in glottalization of word-onset
vowels in American English. Proc XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 4, pp. 586-589.
Dilley, L., Shattuck-Hufnagel, S., & Ostendorf, M. (1994). Prosodic constraints in glottalization of vowel-initial syllables in American English. JASA 95 (5-pt. 2), 2978-2979.
Dilley, L., Shattuck-Hufnagel, S., & Ostendorf, M. (under revision), Glottalization of
vowel-initial words as a function of prosodic structure.
Edwards, J., Beckman, M. E., & Fletcher, J. (1991). The articulatory kinematics of final
lengthening. JASA 89 (1), 369-382.
Ewen, C., & Anderson, J. (eds.) (1987). Phonology Yearbook 4: Syntactic Conditions on
Phonological Rules. Cambridge: Cambridge University Press.
Fant, G., Kruckenberg, A., & Nord, L. (1991). Stress patterns and rhythm in the reading
of prose and poetry with analogies to music performance. Presented at the Music,
Language, Speech, and Brain International Wenner Gren Symposium, Stockholm.
Ferreira, F. (1991). Creation of prosody during sentence production. Psychological Review 100 (2), 233-253.
Fodor, J. A., Bever, T. G., & Garrett, M. F. (1974). The Psychology of Language: An
Introduction to Psycholinguistics and Generative Grammar. New York: McGraw-Hill.
Fougeron, C. (1996). Articulation of French nasal segments depending on their prosodic
position. Presented at the January meeting of the Linguistic Society of America,
San Diego.
Fougeron, C., & Keating, P. (1995). Demarcating prosodic groups with articulation. JASA
97 (5-pt 2), 3384 and UCLA ms.
Garrett, M. F., Bever, T. G., & Fodor, J. A. (1966). The active use of grammar in speech
perception. Perception and Psychophysics 1, 30-32.
Gerken, L. A., Jusczyk, P. W., & Mandel, D. R. (1994). When prosody fails to cue
syntactic structure: Nine-month-olds' sensitivity to phonological vs. syntactic
phrases. Cognition 51, 237-265.
Gee, J. P., & Grosjean, F. (1983). Performance structures: A psycholinguistic and linguistic appraisal. Cognitive Psychology 15, 411-458.
Goldhor, R. S. (1976). Sentential determinants of duration in speech. MIT ms.
Goldman-Eisler, F. (1972). Pauses, clauses, sentences. Language and Speech 15, 103-113.
Grosjean, F., Grosjean, L., & Lane, H. (1979). The patterns of silence: Performance
structures in sentence production. Cognitive Psychology 11, 58-81.
Gussenhoven, C. (1984). On the Grammar and Semantics of Sentence Accents. Dordrecht: Foris.
Gussenhoven, C. (1992). Intonational phrasing and the prosodic hierarchy. Phonologica
1988, 89-99.
Gussenhoven, C., & Rietveld, A. C. M. (1992). Intonation contours, prosodic structure
and preboundary lengthening. J. Phonetics 20, 283-303.
Haegeman, L. (1993). Introduction to Government and Binding Theory. Oxford: Blackwell.
Hale, K., & Selkirk, E. O. (1987). Government and tonal phrasing in Papago. Phonology
Yearbook 4: 151-183.
Halle, M., & Vergnaud, J.-R. (1987). An Essay on Stress. Cambridge, Mass: MIT Press.
Halliday, M. A. K. (1967). Intonation and Grammar in British English. The Hague:
Mouton.
't Hart, J., Collier, R., & Cohen, A. (1990). A Perceptual Study of Intonation. Cambridge:
Cambridge University Press.
Hayes, B. (1981). A metrical theory of stress rules. MIT PhD thesis, revised version
distributed by the Indiana University Linguistics Club, Bloomington, Indiana. Published by Garland Press, NY, 1985.
Hayes, B. (1983). A grid-based theory of English meter. Linguistic Inquiry 14, 357-393.
Hayes, B. (1984). The phonology of rhythm in English. Linguistic Inquiry 15, 33-74.
Hayes, B. (1989). The prosodic hierarchy in meter. In P. Kiparsky and G. Youmans
(eds.), Phonetics and Phonology, Vol 1: Rhythm and Meter. San Diego: Academic
Press. pp. 201-260.
Hayes, B. (1989a). Compensatory lengthening in moraic phonology. Linguistic Inquiry
20, 253-306.
Hayes, B. (1995). Metrical Stress Theory: Principles and Case Studies. Chicago: University of Chicago Press.
Hayes, B., & Lahiri, A. (1991). Bengali intonational phonology. Natural Language and
Linguistic Theory 9, 47-96.
Heeman, P., & Allen, J. (1995). The TRAINS 93 Dialogues. TRAINS Technical Note
94-2, University of Rochester.
Horne, M. (1990). Empirical evidence for a deletion formulation of the rhythm rule in
English. Linguistics 28, 959-981.
Hyman, L. (1985). A theory of phonological weight. Dordrecht: Foris.
Inkelas, S. (1989). Prosodic constituency in the lexicon. U. Mass Amherst PhD thesis.
Inkelas, S., & Zec, D. (1990). (eds), The Phonology-Syntax Connection. Chicago: University of Chicago Press.
Inkelas, S., & Zec, D. (1993). Auxiliary reduction without empty categories: A prosodic
account. Working Papers of the Cornell Phonetics Laboratory 8, 205-253.
Ito, J., & Mester, R.-A. (1992). Weak layering and word binarity. University of Santa
Cruz ms.
Jun, S.-A. (1993). The phonetics and phonology of Korean prosody. Ohio State University PhD thesis.
Jusczyk, P. W., & Aslin, R. N. (1995). Infants' detection of the sound patterns of words
in fluent speech. Cognitive Psychology 28, 1-23.
Kahn, D. (1976). Syllable-based generalizations in English phonology. Ms. distributed
by the Indiana University Linguistics Club, Bloomington, Indiana.
Kaisse, E. (1985). Connected Speech: The Interaction of Syntax and Phonology. Orlando:
Academic Press.
Kaisse, E. M., & Zwicky, A. M. (1987). Introduction: Syntactic influences on phonological rules. In C. Ewen and J. Anderson (eds.), Phonology Yearbook 4, Cambridge
University Press.
Katada, F. (1990). On the representation of moras: Evidence from a language game.
Linguistic Inquiry 21, 641-646.
Kelly, M. (1989). Rhythm and language change in English. J. Memory and Language
28, 690-710.
Kenstowicz, M. (1994). Phonology in Generative Grammar. Cambridge, Mass: Blackwell.
Kiparsky, P. (1979). Metrical structure assignment is cyclic. Linguistic Inquiry 10, 421-442.
Kisseberth, C., & Abasheikh, M. K. (1974). Vowel length in Chimwi:ni -- a case study
of the role of grammar in phonology. In M.M.L. Galy, R.A. Fox, and A. Bruck
(eds), Papers from the Parasession on Natural Phonology, Chicago: Chicago Linguistics Society.
Klatt, D. H. (1976). Linguistic uses of segmental duration in English: Acoustic and
perceptual evidence. JASA 59, 1208-1220.
Krakow, R. (1989). The articulatory organization of syllables: A kinematic analysis of
labial and velic gestures. Yale University PhD thesis.
Ladd, R. (1986). Intonational phrasing: The case for recursive prosodic structure. Phonology Yearbook 3, 311-340.
Ladd, R. (to appear 1996). Intonational Phonology. Cambridge: Cambridge University
Press.
Ladd, R., & Campbell, N. (1991). Theories of prosodic structure: Evidence from syllable
duration. Proceedings of the XIIth International Congress of Phonetic Sciences, Aix-en-Provence, II, 290-293.
Ladefoged, P. (1975). A Course in Phonetics. New York: Harcourt, Brace, Jovanovich.
Ladefoged, P., & Broadbent, D. E. (1960). Perception of sequence in auditory events.
Quarterly J. of Experimental Psychology 13, 162-170.
Lehiste, I. (1973). Phonetic disambiguation of syntactic ambiguity. Glossa 7, 107-121.
Lehiste, I. (1974). Interaction between test word duration and the length of utterance.
Ohio State University Working Papers in Linguistics 17, 160-169.
Lehiste, I., Olive, J. P., & Streeter, L. A. (1976). The role of duration in disambiguating
syntactically ambiguous sentences. JASA 60, 1199-1202.
Liberman, M. Y. (1975). The intonational system of English. MIT Linguistics PhD thesis.
Liberman, M. Y., & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry
8, 249-336.
Lieberman, P. (1960). Some acoustic correlates of word stress in American English. JASA
32, 451-454.
Maeda, S. (1974). A characterization of fundamental frequency contours of speech. Quarterly Progress Report, MIT Research Laboratory of Electronics 114, 193-211.
Martin, E. (1970). Toward an analysis of subjective phrase structure. Psychological Bulletin 74, 153-166.
McCarthy, J. J. (1993). A case of surface constraint violation. Canadian Journal of
Linguistics 38 (2), 169-195.
McCarthy, J. J., & Prince, A. (1995). Prosodic morphology. In J. Goldsmith (ed), A
Handbook of Phonological Theory. Oxford: Basil Blackwell.
Selkirk, E. O. (1984). Phonology and Syntax: The Relation Between Sound and Structure.
Cambridge, Mass.: MIT Press.
Selkirk, E. O. (1986). On derived domains in sentence phonology. Phonology Yearbook
3, 3 7 1 4 0 5 .
Selkirk, E. O. (1993). Modularity in constraints on prosodic structure. Ms., presented at
the ESCA Workshop on Prosody, Lund.
Selkirk, E. O. (1993a). Accent focus and given/new: The role for focus projection. U
Mass. Amherst ms.
Selkirk, E. O. (1994).
Selkirk, E. O. (to appear). The prosodic structure of function words. In J. Martin and K.
Demuth (eds), International Conference on Bootstrapping from Speech to Grammar
in Early Acquisition, Brown University, Providence RI, Hillsdale N.J.: Lawrence
Erlbaum.
Selkirk, E. O., & Shen, X. (1990). Prosodic domains in Shanghai Chinese. In S. Inkelas
and D. Zec (eds), The Phonology-Syntax Connection. Chicago: The University of
Chicago Press.
Selkirk, E. O., & Tateishi, K. (1988). Constraints on minor phrase formation in Japanese.
Papers from the Twenty-fourth Regional Meeting of the Chicago Linguistic Society,
Chicago: Chicago Linguistics Society.
Selkirk, E. O., & Tateishi, K. (1991). Syntax and downstep in Japanese. In C. Georgopoulos, and R. Ishihara (eds.), Interdisciplinary Approaches to Language. Dordrecht:
Kluwer Academic Publishing.
Sereno, J. A., & Jongman, A. (1995). Acoustic correlates of grammatical class. Language
and Speech 38, 57-76.
Shattuck-Hufnagel, S. (1988). Acoustic phonetic correlates of stress shift. JASA 84, S1,
1988.
Shattuck-Hufnagel, S. (1992). The role of word structure in segmental serial ordering.
Cognition 42, 213-259.
Shattuck-Hufnagel, S. (1992a). Stress shift as pitch accent placement: Within-word early
accent placement in American English. In Proceedings of the International Conference on Spoken Language Processing, Banff, v. 1 pp. 747-750.
Shattuck-Hufnagel, S. (1995). The importance of phonological transcription in empirical
approaches to 'stress shift' vs. 'early accent.' In B. Connell and A. Arvaniti (eds),
Phonology and Phonetic Evidence: Papers in Laboratory Phonology IV, Cambridge:
Cambridge University Press.
Shattuck-Hufnagel, S., Ostendorf, M., & Ross, K. (1994). Stress shift and early pitch
accent placement in lexical items in American English. J. Phonetics 22, 357-388.
Shih, C.-L. (1986). The prosodic domain of tone sandhi in Chinese. UCSD PhD thesis.
Silva, D. J. (1989). Determining the domain for intervocalic stop voicing in Korean. In
S. Kuno et al. (eds.) Harvard Studies in Korean Linguistics III. Harvard University,
Cambridge, MA.
Silverman, K. (1987). The structure and processing of fundamental frequency contours.
Cambridge University PhD thesis.
Silverman, K., Beckman, M. E., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P.,
Pierrehumbert, J., & Hirschberg, J. (1992). ToBI: A standard for labeling English
prosody. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Banff, II, 867-870.
Sluijter, A. M. C. (1995). Phonetic Correlates of Stress and Accent. Holland Institute of
Generative Linguistics, Den Haag: CIP-Gegevens Koninklijke Bibliotheek, University of Leyden PhD thesis.
Sluijter, A. M. C., Shattuck-Hufnagel, S., Stevens, K. N., & van Heuven, V. (1995).
Supralaryngeal resonance and glottal pulse shape as correlates of stress and accent
in English. In Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, II, 630-633.
Sluijter, A. M. C., & van Heuven, V. J. (to appear). Effects of focus distribution, pitch
accent and lexical stress on the temporal organization of syllables in Dutch. Phonetica.
Steedman, M. (1991). Structure and intonation. Language 67, 260-296.
Stevens, K. N. (1994). Prosodic influences on glottal waveform: Preliminary data. In
Proceedings of the International Symposium on Prosody, Yokohama, 53-64.
Streeter, L. (1978). Acoustic determinants of phrase boundary perception. JASA 64,
1582-1592.
Suci, G. (1967). The validity of pause as an index of units in language. J. Verbal Learning and Verbal Behavior 6, 26-32.
Turk, A. E., & Sawusch, J. R. (1995). The domain of the durational effects of accent.
Speech Group Working Papers, Research Laboratory of Electronics, Massachusetts
Institute of Technology, Cambridge, Mass, Vol X, 42-71.
Vanderslice, R., & Ladefoged, P. (1972). Binary suprasegmental features and transformational word-accentuation rules. Language 48, 819-836.
Wakefield, J. R., Doughtie, E. B., & Yom, L. (1974). Identification of structural components of an unknown language. J. Psycholinguistic Research 3, 262-269.
Wightman, C. W., Shattuck-Hufnagel, S., Ostendorf, M., & Price, P. J. (1992). Segmental
durations in the vicinity of prosodic phrase boundaries. JASA 91 (3), 1707-1717.