When Is Noise Speech? A Survey in Sonic Ambiguity

Freya Bailes and Roger T. Dean
MARCS Auditory Laboratories
University of Western Sydney, Australia
Locked Bag 1797
Penrith South DC NSW 1797, Australia
{f.bailes, roger.dean}@uws.edu.au

Computer Music Journal, 33:1, pp. 57–67, Spring 2009
© 2009 Massachusetts Institute of Technology

NoiseSpeech is a compositional device in which sound is digitally manipulated with the intention of evoking the sound qualities of unintelligible speech (Dean 2005). Speech is characterized by rapidly changing broadband sounds (Zatorre, Belin, and Penhune 2002), whereas music, particularly tonal music, changes more slowly and narrowly in frequency content. As Zatorre and colleagues argue, this distinction may be reflected in better temporal resolution in the left auditory cortex and better spectral resolution in the right, so that perception is adapted to both ranges and extremes of sonic stimuli. NoiseSpeech is constructed either by applying the formant structure (that is, spectral peak content) of speech to noise or other sounds, or by distorting speech sounds such that they no longer form identifiable phonemes or words. The resultant hybrid is an artistic device that, we argue, may owe its force to an encapsulation of the affective qualities of human speech, while intentionally stripping the sounds of any semantic content. In this article, we present an empirical investigation of listener perceptions of NoiseSpeech, demonstrating that non-specialist listeners hear such sounds as similar to each other and to unaltered speech.

NoiseSpeech is ambiguous in evoking the identification of an everyday source of sound (human speech) within the musical context of sound art. Arguably, it could be said to blur the distinction described by Gaver (1993a, 1993b) between two types of listening: the everyday and the musical. When NoiseSpeech occurs in the context of a composition or performance, what form of listening does a listener employ? The context is one of musical listening, yet the identification of the sounds with the human generation of speech is an everyday listening concern. The propensity to identify the source of a sound is a question of interest to both cognitive and ecological approaches to perception. Handel (1989) posits that cognizing sound in terms of sound-causing events may override a more bottom-up sensory perception. Concordantly, Ballas (1993) explores how associations are formed between environmental sound and sound source, listing exposure to particular sounds in everyday life, being able to visualize the sound-producing event, and the similarity of the sound to a mental stereotype.

Dean (2005) proposed that a hybrid of noise and speech may not only invent a new language but, more importantly, may present a new message. NoiseSpeech seems to escape commodification, and in this respect fits Attali's (1985) concept of composition, yet it is not devoid of connotation or expression. We argue that NoiseSpeech is likely to evoke affective responses from a listener through its association with the affective expression of human speech (Dean and Bailes 2006). Traditionally, affect has been conceptualized in terms of valence (positive and negative) and arousal (active and passive) dimensions (see, for example, Leman et al. 2005). Here, we distinguish affect from emotional connotations that are more concerned with top-down cognitive associations, such as "comfort" or "annoyance." According to this distinction, a certain familiarity with a sound is necessary for emotion to be evoked, but not for the perception of affect. Where a sound is identified as familiarly speech-like, higher-order cognitive processes may be involved, associated with the perception of an emotion. However, when sounds are either strongly distorted through processing or are of ambiguous origin, listeners may perceive this on an affective level as valence and arousal.

Vocal affect expression has been studied throughout history (Banse and Scherer 1996). Links have been made in speech between emotion and altered articulation, respiration, and phonation. In particular, these effects are believed to be quantifiable in terms of acoustic variables such as spectral energy distribution, fundamental frequency, and speech rate. For example, Banse and Scherer (1996) examined the portrayal of different emotions by professional actors, comparing human listener recognition rates with the
results of digital acoustic analyses. Relevant to our concern with extra-semantic affect, the expressions the actors were required to speak were nonsense sentences, albeit composed of recognizable phonemes. Listeners recognized emotion based on acoustic features without semantic clues. We used a related approach in previous artistic works using artificially constructed languages, mainly comprising unrecognizable phonemes, also believing that listeners will identify affective changes in spite of the total lack of comprehensible words (e.g., Smith and Dean 1996).

Auditory psychology assumes commonality in the experiences of a given population of listeners. With respect to the perception of speech, research has inevitably focused on parsing sounds and constructing linguistic meaning, the primary goal being unambiguous communication. However, speech can be powerful on a number of extra-linguistic levels, including musical levels. Ambiguity of message or meaning and freedom of interpretation are useful within the context of sound art. Windsor (1997) reminds us that context shapes a listener's interpretation of sound:

Speech sounds, for example, could be organized into a familiar sequence of fundamental frequencies, encouraging us to hear them as a melody, or as a sequence of linguistically meaningful speech events, encouraging us to hear them as talking. They could be organized in such a way as to combine them, through cross-synthesis or juxtaposition, with other sounds, encouraging us to think of them in relation to other sounds and their possible meanings, or their spectral structure could be accentuated or altered to produce harmonic progressions or to focus our attention upon their non-speech-like qualities. Their provenance might be concealed, or they might in fact be simulacra: we might hear a virtual source other than speech despite the source material, or hear speech where none exists.

To produce NoiseSpeech and reliably evoke the impression of speech-derived sound, it is first necessary to identify the perceptual markers of speech. Expressed in terms of ecological theories of perception, it is necessary to identify the invariance of speech (Windsor 1997; Clarke 2005). That is, what properties of speech are systematically interpreted as such by a listener? In fact, speech is highly complex, typically integrating rhythm, accentuation, intonation, and affective expression, not to mention the semantic interpretation arising from parsing the phonemic components. Speech is characterized by particular groups of spectral peaks, called formants. Whereas the precise frequency and spectral shape of formants tend to distinguish each individual, and are modulated according to whether a person is whispering, speaking softly, shouting, or singing, a discernible pattern of spectral peaks is nevertheless associated with male or female human speech. Operatic singers superimpose a characteristic "singer's formant" centered around 2–4 kHz. Boersma and Kovacic (2006) describe the long-term audio spectral characteristics of Croatian folk singing, illustrating its lack of a singer's formant, but they do note the selective exploitation in some styles of a distinct pattern in the 3.5-kHz range, which they call the "high shouter's formant."

Other characteristics of speech include the phenomenon of speech declination, which describes the common large drop in frequency (and perceived pitch) during a spoken utterance (Pierrehumbert 1979). Although intensity changes are modest during declination (generally less than 4 dB), there is a cognitive compensation for the frequency drop that is influenced by intensity. Perception of pitch in speech (like that in music) is complex (de Cheveigné 2005) but possibly uses pattern-matching of spectral components. This may operate with speech somewhat differently than with musical pitch (certainly with musical pitches largely comprising harmonic tones), because speech formants have central frequencies that often bear non-harmonic relations to each other.

Sound source identification is important in auditory perception, but to what extent is the tendency to attribute sound to a particular source malleable? NoiseSpeech artificially imposes formants on sounds to evoke a sense of vocal human agency. Alternatively, NoiseSpeech takes existing speech and filters the sound, in our case taking care to maintain at least those formants characteristic of speech that were present in the frequency range preserved. A few experimental psychologists have
used non-speech control stimuli that were derived from speech sounds. For example, some have referred to "pseudospeech." This term has been used inconsistently, but in one application consists of auditory patterns generated by amplitude and frequency modulation using a base frequency of 200 Hz and an amplitude envelope which resembles the real spoken sentences used in the delivery of a two-minute story (Hermann et al. 2002). This application suggested that more long-range couplings (between EEG signals from different brain areas) occur in the speech condition, as compared to the pseudospeech condition (p. 3). Although this might suggest a distinction between processing of speech and pseudospeech, other evidence shows that speechlike sounds may be comparably lateralized to speech: in this case, the sounds studied had speechlike rapid acoustic changes. "Their temporal structure was identical to that of consonant-vowel-consonant speech syllables, with the very rapid frequency changes of the spectral peaks, or rapid formant transitions, that characterize consonants" (Belin et al. 1998, p. 536). Overall, this evidence allows the possibility that speech-like sounds may well be identified as such, as also claimed in the early studies of Remez et al. (1981), who synthesized what to some seemed to be largely unintelligible speech, to others "science fiction sounds," music, or whistles, from a minimum set of three sinusoids. The responses of their participants were greatly influenced by the instructions they received, whether pointing to speech or not.

Features of speech that might be exploited in sound art composition include rhythm, accentuation, intonation, and other prosodic parameters. Of course, these sonic features are already germane to music in general, but their application to noise was more unusual until the 1990s generation of real-time laptop artists, and certainly less well understood. To our knowledge, there has been no empirical investigation of the extent to which listeners are sensitive to speech characteristics transmitted through noise (as distinct from the many published experiments in which speech or phonemic content is superimposed upon noise). If NoiseSpeech can be effective as a different and perhaps more emotive compositional device than other digital manipulations of sound, it would be necessary for listeners to recognize that such sounds are related to each other. It would also be interesting to study whether they associate the device with human speech. This article presents a survey of the extent to which listeners define NoiseSpeech samples as being derived from actual speech.

Survey

Our survey aimed to determine whether listeners would perceive NoiseSpeech samples of various sorts as being derived from speech. NoiseSpeech samples varied with respect to their composition (see Stimuli below), ranging from those clearly derived from speech to highly processed and consequently distant speech derivation. Listeners were asked to rate each sample on a yes-or-no basis as being derived from speech and/or piano, drums, or water. Water was the dummy variable, in that no samples were constructed from water sounds. Respondents were also asked to indicate whether they were sure (yes or no) for each of the four listed source derivations. The use of a dummy variable was important to establish the coherence and reliability of assessments. On the other hand, the choice of instrumental sounds with which to contrast the speech and NoiseSpeech is a more arbitrary one, because our interest here is in whether the latter cluster together in perception, rather than in how that clustering relates to perception of other musical sounds.

Participants

Fifteen women and six men (N = 21), ages 18–20, participated in this survey. They did not classify themselves as musicians, and none had received any phonetics training. Volunteers were undergraduate psychology students at the University of Canberra, receiving 45 minutes of course credit for their involvement in this survey.

Stimuli

All stimuli were six seconds in duration. Forty items were selected to represent a variety of NoiseSpeech
types. An additional item was used as a practice trial. Stimuli can be categorized according to the origin of the sound. One group of stimuli applied a modification of the processing patch by Atau Tanaka (provided with the Max/MSP programming platform) to drum sounds. Another applied the same patch to a female voice speaking the words "everything is at least double." The Tanaka patch segments the item and then reassembles a sound by randomized reordering of the segments. A third group is another female voice speaking "is that you," included with the Max/MSP software distribution, after subsequent digital filtering. A fourth group of stimuli is based on a group of noise sounds, while another is based on noise alone with stable formants superimposed. The formants used were centered on 784, 1175, and 2784 Hz, being taken from a female voice speaking the vowel /a/. Finally, a small group of sounds presents processed piano phrases.

These groups can be further linked according to whether they are modified speech or non-speech with speech-like characteristics. It should be noted that modified speech stimuli were not intended to represent the comprehensive phonetic content of English. Rather, modified speech stimuli were selected on a compositional basis as examples of the result of NoiseSpeech generation principles used in works by Roger Dean and others. Table 1 summarizes these stimuli.

Table 1. Description of Stimuli

Processed drum sounds (shuffling segments, time stretching, frequency/rhythm shifting)
AtDm01  Unaltered, normal drum lick, very slightly more than 2 bars
AtDm02  Shuffled, slightly more than two bars
AtDm03  Shuffled, slightly more than two bars
AtDm04  Shuffled, slightly more than two bars
AtDm05  Frequency-shifted down, shuffled, but tempo unchanged, so still slightly more than two bars
AtDm06  Frequency-shifted upward, shuffled, but tempo unchanged as AtDm05
AtDm07  Tempo slow, frequency-shifted down
AtDm08  Tempo even slower, frequency-shifted down
AtDm09  Tempo slower still, frequency-shifted down; really one event for the whole six seconds
AtDm10  Tempo slower again, frequency-shifted down, less than one complete event
AtDm11  Frequency-shifted up, slowed somewhat, like AtDm08
AtDm12  Frequency-shifted up, slowed like AtDm07

Processed sample of female speaking "everything is at least double"
AtHs01  Almost five renderings of the phrase, normal speed
AtHs02  Shuffled, same speed, almost five renderings of the shuffled version
AtHs03  Shuffled differently, same speed, almost five renderings
AtHs04  Shuffled differently again, same speed, almost five renderings
AtHs05  Another shuffle, like AtHs02–04
AtHs06  Slowed down, lower frequency, less than one cycle
AtHs07  AtHs06 slowed further
AtHs08  AtHs06 slowed even further
AtHs09  AtHs06 slowed even further still

NoiseSpeech series: "is that you," a female voice from the Max/MSP distribution
NsSp01  Normal speed, almost eight renderings
NsSp02  Normal speed, high-pass filtered
NsSp03  Normal speed, high-pass filtered with a higher cutoff frequency

Procedure

Participants rated the stimuli during group listening sessions, although one participant completed the listening session alone with the experimenter. During group sessions, respondents were spaced around the room such that they would not be influenced by the responses of others yet could benefit from stereo presentation of the sounds. Stimuli were presented using QuickTime and played through Yamaha MSP5A speakers connected to an iBook 900-MHz G3 laptop computer. The following instructions were given:

You are asked to decide whether the passages of sound you hear are derived from drums,
piano, speech, and/or water, indicating whether or not you are confident for each of the listed possibilities. When the passage has finished playing, you will be invited to indicate your answers by ticking the appropriate boxes on your answer sheet, either "yes" or "no." Please note that you may indicate several different sources for the same passage, or you may think that none of the possible sources are related to what you hear.

Respondents were asked to consider sound derivation rather than what the stimuli sound like, because identifying the source of a sound is central to our hypothesized increase in affective response outlined herein.

The answer form for a given trial was as follows:

Example 1
        Sound derived from this?   Are you sure?
Drums   Yes  No                    Yes  No
Piano   Yes  No                    Yes  No
Speech  Yes  No                    Yes  No
Water   Yes  No                    Yes  No

Before the survey began, a practice stimulus was presented with feedback from the researcher emphasizing the focus on possible sound derivation (but without providing cues as to how this might be assessed). Respondents were given the opportunity to ask questions before proceeding to the main survey.

The forty stimuli were presented with an interval of approximately five seconds between them. The stimuli were played in a random order, and a full presentation of all forty stimuli constituted a block. Three blocks were presented (with different randomization for each block) for the complete survey. Each group of participants heard a different randomization. As there were six different groups of participants, this entailed six different orders of stimuli presentation. Randomization of stimulus order ensures that any given sound is heard randomly spread across the duration of the experimental session. This can remove what might otherwise be effects of progressive familiarization with the range of stimuli during the data gathering, and in our design this was also avoided by the use of an initial complete exposure block (block 1), whose response data were not used. Note that this initial block is unlikely to impact on listeners' longer-term experience with familiar sounds such as speech, drums, and piano, although it does allow them to become aware of the range of unfamiliar sounds to which they are being exposed.

A break was introduced halfway through the session, during which time participants filled out a questionnaire asking for demographic information, details of musical background, and whether they had ever received any phonetics training.

Table 1. Continued

NsSp04  Normal speed, high-pass filtered with an even higher cutoff frequency
NsSp05  Normal speed, high-pass filtered with an even higher cutoff frequency than NsSp04
NsSp06  Normal speed, low-pass filtered
NsSp07  Normal speed, low-pass filtered but with a greater bandwidth to include more middle frequencies
NsSp14  High-pass filtered, and with feedback

NoiseSpeech based on a group of noise sounds which move in spectral center toward higher frequencies, with a superimposed formant
NsSp08  Original with formant superimposed
NsSp09  High-pass filtered
NsSp10  High-pass filtered with a higher cutoff frequency
NsSp11  High-pass filtered with an even higher cutoff frequency
NsSp12  Low-pass filtered

NoiseSpeech: noise alone with formant
NsSp15  Noise with high formant superimposed
NsSp16  Noise with broader lower formant superimposed
NsSp17  Noise with broader lower formant still
NsSp18  Like NsSp16 with less of the lower frequencies

PnPr series: processed piano phrases
PnPr01  High-pass filtered
PnPr02  Slowed phrase, high-pass filtered
PnPr03  Phrase slowed even more, low-pass filtered
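The noise-alone items in Table 1 (NsSp15–NsSp18) follow the construction principle described under Stimuli: stable formants, centered on 784, 1175, and 2784 Hz, superimposed on noise. A minimal sketch of that construction is given below. It is our illustration rather than the authors' actual Max/MSP patch, and the sample rate, filter order, and 120-Hz bandwidth are assumed values not given in the article.

```python
import numpy as np
from scipy.signal import butter, lfilter

SR = 44100                       # sample rate in Hz (assumed)
DUR = 6.0                        # six-second stimuli, as in the survey
FORMANTS = [784, 1175, 2784]     # formant centers from the article (female /a/), in Hz
BANDWIDTH = 120.0                # assumed bandwidth per formant, in Hz

def formant_noise(formants=FORMANTS, bw=BANDWIDTH, sr=SR, dur=DUR, seed=0):
    """White noise with stable formants imposed by summing band-pass-filtered copies."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(int(sr * dur))
    out = np.zeros_like(noise)
    for fc in formants:
        # second-order Butterworth band-pass centered on the formant frequency
        low = (fc - bw / 2) / (sr / 2)
        high = (fc + bw / 2) / (sr / 2)
        b, a = butter(2, [low, high], btype="band")
        out += lfilter(b, a, noise)
    return out / np.max(np.abs(out))   # normalize to the range -1..1

stim = formant_noise()
print(stim.shape)   # (264600,): six seconds at 44.1 kHz
```

Narrower bands (or resonant peaking filters) would give sharper formant peaks, and mixing some unfiltered noise back in would weaken them; the article leaves these details to the composer.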


Analysis

Raw "yes" and "no" responses for possible sound source and confidence were coded as ordinal data such that "no, sure" = 1, "no, unsure" = 2, "yes, unsure" = 3, and "yes, sure" = 4. Data from block 1 were disregarded, as it was intended simply to provide an initial exposure to the full range of stimuli. Using data from blocks 2 and 3, we first checked for order effects across blocks, and then tested for correlations between speech and the other response categories. The primary analysis was a cluster analysis of the stimuli based on the mean ratings for the four variables. Resulting clusters were characterized, and the stimuli were additionally ranked in order of speech-like quality.

Results

A repeated-measures analysis of variance (ANOVA) of mean ratings per item (within-subjects factors of block and source) revealed no significant difference in ratings across block. Consequently, data for blocks 2 and 3 were collapsed for subsequent analyses.

It was found that the four variables (drums, piano, speech, and water) were negatively correlated with each other except for piano and drum ratings, which were positively correlated, and piano and water, which were uncorrelated (see the correlation matrix in Table 2). The positive correlation between piano and drums probably reflects their shared attributes as percussion instruments.

Table 2. Correlations Between Mean Ratings of the Four Possible Sound Derivations of the Stimuli

        Drums   Piano   Speech  Water
Drums   /       0.25    -0.49   -0.49
Piano   /       /       -0.45   -0.12
Speech  /       /       /       -0.40
Water   /       /       /       /

p < 0.05. p < 0.001.

A cluster analysis was performed to investigate whether there were statistically reliable sub-groups in listener ratings of the items. The advantage of this technique is that it uses multivariate data (in this case, the separate ratings of drums, piano, speech, and water) to form clusters of items based on statistical distance. Although the analysis is exploratory, identifying clusters in this study is informative because the focus is on timbre as a rich and multidimensional phenomenon. Grouping sounds according to multiple variables provides a more realistic description of listener perceptions than examining speech ratings alone. Item data were averaged across respondent and block. The variables "drums," "piano," "speech," and "water" were entered into a hierarchical cluster analysis in SPSS 11.0. Three clusters emerged in the final solution (see Figure 1). As a point of interest, although we report a solution based on data from blocks 2 and 3, this is hardly changed when data from block 1 are included. There were 9 items in cluster 1, 14 items in cluster 2, and 17 items in cluster 3. Descriptive statistics for each cluster are shown in Table 3.

[Figure 1. Dendrogram of the cluster analysis of 40 items.]

Cluster 1 (N = 9) is characterized by high ratings for drums. Indeed, the items constituting this cluster are AtDm01–AtDm08 and AtDm11. AtDm09 and AtDm10 are similarly derived from drum sounds, but these belong to cluster 2; it is worth noting that they represent the most processed of the drum stimuli with respect to a slowing down of the sound.

Cluster 2 (N = 14) is characterized by higher-than-average ratings for water. None of the sounds were water-derived, but those that were rated as such include the entire noise-derived series, where either groups of noise (NsSp08–NsSp12) or noise alone (NsSp15–NsSp18) were filtered and superimposed with formants. The entire processed piano series falls within cluster 2, raising the average piano rating of the cluster. The two slowest drum stimuli (AtDm09 and AtDm10) were also within this cluster.

Cluster 3 (N = 17) is typified by high ratings for speech. All items within this cluster are indeed derived from recordings of speech that have been subsequently processed, or as formant superimpositions on noise or groups of noise. For instance, AtHs01–AtHs09 are a processed sample of a female voice speaking "everything is at least double," and NsSp01–NsSp14 constitute a female voice asking, "Is that you?" As noted previously, two of the items (NsSp01 and AtHs01) are unaltered speech; these data show that putative NoiseSpeech items indeed cluster with genuine speech. One remarkable feature that deserves future investigation is that NsSp15–NsSp18 cluster with the other speech and NoiseSpeech sounds (though with generally lower speech ratings), even though they involve stable (unchanging) formants superimposed on noise, whereas speech itself comprises rapidly changing formant spectra.

Taken together, the clusters seem to describe a continuum, with the most clearly identifiable sound sources, drums (cluster 1) and speech (cluster 3), at
the extremes, passing through cluster 2, which groups the most processed sounds together. These seem to have been the most ambiguous items for listeners and were rated highly for water, even though water sounds were never sampled. A two-cluster reading combines clusters 2 and 3 and contains all speech and NoiseSpeech stimuli, joined by only five other sounds.

Table 3. Mean (M) and Standard Deviation (SD) for Blocks 2 and 3 by Cluster

Clustering   Total items   Cluster 1 (N = 9)   Cluster 2 (N = 14)   Cluster 3 (N = 17)
Variable     M     SD      M     SD            M     SD             M     SD
Drums        1.82  1.05    3.68  0.30          1.29  0.29           1.28  0.25
Piano        1.29  0.32    1.38  0.24          1.43  0.43           1.13  0.12
Speech       2.26  1.26    1.12  0.08          1.29  0.15           3.66  0.47
Water        1.85  0.82    1.21  0.09          2.75  0.69           1.45  0.35

Table 4 shows the ranking of all items for their average rating as speech-like, disregarding other response categories. Again notable is the fact that only one of the unprocessed speech samples (AtHs01) attracted the highest possible score for speech quality, and the rating of two NoiseSpeech items equaled this, whereas the speech ratings of several other NoiseSpeech items exceeded that for the other unprocessed speech stimuli. Thus, listeners intermingled speech and NoiseSpeech in their categorical responses.

Table 4. Items Ranked by Mean Speech Rating across Participants and Block (for Blocks 2 and 3)

Item     Mean Speech  Cluster   Item     Mean Speech  Cluster
Name     Rating                 Name     Rating
AtHs01   4.00         3         AtDm10   1.40         2
NsSp02   4.00         3         NsSp08   1.33         2
NsSp03   4.00         3         NsSp12   1.33         2
AtHs04   3.98         3         NsSp18   1.33         2
AtHs02   3.95         3         AtDm08   1.28         1
AtHs03   3.95         3         NsSp09   1.23         2
NsSp14   3.95         3         NsSp17   1.20         2
NsSp01   3.90         3         AtDm02   1.17         1
NsSp04   3.85         3         AtDm01   1.15         1
NsSp07   3.80         3         AtDm04   1.15         1
AtHs09   3.78         3         NsSp10   1.15         2
AtHs05   3.72         3         NsSp11   1.15         2
NsSp06   3.63         3         NsSp15   1.15         2
AtHs06   3.43         3         PnPr01   1.15         2
AtHs08   3.03         3         PnPr02   1.15         2
NsSp05   2.73         3         AtDm05   1.10         1
AtHs07   2.50         3         AtDm11   1.10         1
AtDm09   1.65         2         AtDm03   1.08         1
NsSp16   1.45         2         AtDm12   1.05         1
PnPr03   1.45         2         AtDm06   1.03         1

Discussion

Results of the cluster analysis can be seen as describing differing degrees of speech-like quality, moderated by timbral complexities of the sound qualities evoked by water, drums, and piano. Cluster 1 can be considered the least speech-like, with a clear grouping based on high ratings for drums. Cluster 2 is more ambiguous, with a mismatch between sound source and listener perception insofar as these items were rated as being water-derived, when in fact water was never sampled or intentionally evoked. Cluster 3 is the most speech-like, both in terms of listener perceptions and in actual sound source: here, speech samples were processed to varying extents.

The relationship between sound-source ratings, which exhibits a pattern of correlation as predicted, merits discussion. The only positive correlation was found between ratings for piano and drums: these two sound sources are both percussion instruments. Other correlations were negative or non-existent (water and piano). The inclusion of water as a dummy response was intentional. Had we used the category of "other," we might have expected quite different results, with listeners tending to attribute sounds about which they were unsure in this bracket. At the opposite extreme, replacing the dummy response "water" with "noise" as a possible sound derivation would certainly have led to a tighter grouping of items in cluster 2, which constitutes the noisy items.

In the context of the current survey, it is difficult to determine to what extent hearing the full range
of NoiseSpeech stimuli in block 1 subsequently of this trading only function when the speech
strengthened the listeners perceptions of and as- perception mode is engaged, citing in particular
sociations with speech or with other unmodified the work of Best, Morrongiello, and Robson (1981)
sounds. Listeners may well have formed an associa- in using analogous sine waves imitating the
tion between the few undisguised items and stimuli formant trajectories of (voiced) speech signals as
generated from these. This is not unlike an artis- a comparison with genuine speech signals. Those
tic listening experience in which the inclusion of listeners who engaged only an auditory rather than
variously speech-like NoiseSpeech within one work speech mode failed to demonstrate such trading
provides a listener with an increasingly relative relations, and they heard some of the sounds solely
and shifting context of cues through time. It is also as whistles. However, once instructed to listen for
related to many published speech-perception experi- speech, this could be overcome.
ments that require phoneme identification among a More recent evidence has questioned most of
set of stimuli varying along a subtle transformation gradient. Would an isolated stimulus be rated as more speech-like in a different context of stimuli that are in no way reminiscent of speech? If this were the case, it would have interesting implications for theories that posit the immediate perception of environmentally significant information, as with the human voice.

It is worth considering further the psychological background to the open question of the affective impact of NoiseSpeech. One reason for proposing it as a viable form for expressive composition is the longstanding claim that speech engages different modes of perception and cognition than do other sounds. For example, early studies from the Haskins laboratory and elsewhere, critically reviewed by Repp (1982, p. 81), argued that speech "cannot be reduced to a combination of auditory processes involved also in perceiving and interpreting nonspeech sounds." A few of the component arguments can be summarized briefly. The left-hemisphere lateralization of speech processing is "unassailable" (p. 85) but shared with certain other sound stimuli. Duplex perception, in which "formant transitions are removed from a synthetic syllable and presented to one ear while the rest of the speech pattern is presented to the other ear" (p. 86), creates both a fused percept of the original syllable and a whistle or chirp in one ear, and is argued to suggest the simultaneous use of speech and nonspeech modes of perception (p. 86). Phonetic trading is the phenomenon in which "a change in the setting of one cue . . . can be offset by an opposed change in the setting of another cue so as to maintain the original phonetic percept." Repp argues that certain aspects [. . .] these arguments. For example, Diehl, Lotto, and Holt (2004) review speech perception from three major perspectives. Two of these, the motor theory and the direct-realist theory, implicate the articulatory events of the tongue, lips, and vocal folds as key gestural objects of speech perception (Diehl, Lotto, and Holt 2004). In contrast, the third perspective, that of general-auditory (GA) theories, makes no such assumption and, partly as a result, does not need to invoke a special speech mode of perception. In this GA perspective, speech may still be uniquely important during learning processes: "speech may be a particularly salient signal for infants, and learning processes may be biased to pick up just the kind of information that is important for speech categories" (Diehl, Lotto, and Holt 2004, p. 164).

The issue of a speech-specific mode of perception thus remains to be resolved, but the likelihood that the engagement of perceptual or cognitive processes focused on speech can affect listeners' responses remains unchallenged by this literature. In our study, NoiseSpeech sounds are clustered in sonic character with speech. Thus they are potentially usable semiotic tools within the compositional process; and it is remarkable that some of the NoiseSpeech samples we have analyzed were indeed derived solely from noise and formant characteristics (and not from preexistent speech sounds). It remains to be established whether they have distinct affective impacts by virtue of their unconscious or conscious identification as speech-like and their engagement of some modes of perception and cognition that are shared by speech. Moreover, there are many cultural associations with sounds

Bailes and Dean 65


that may override the impact of human agency. For example, the Australian listeners in this study were surveyed during a serious drought. Had they been asked to rate the emotional impact of sounds they deemed to be derived from water, this could well have been greater than for sounds perceived as speech-derived.

Our findings provide an opportunity to reflect on how well a computer would fare in detecting NoiseSpeech. A range of information-retrieval methods for machine discrimination of speech from music has been developed, often using cepstral coefficients and hidden Markov models. Other methods have been developed that use, for example, geometric properties of spectrograms as objects for visual identification (Casagrande, Eck, and Kégl 2005). Casagrande and colleagues wrote a Winamp plug-in that trained an AdaBoost model. On the basis of our present results, it can be predicted that a machine will probably identify NoiseSpeech roughly as human listeners do, and informal use of this Winamp plug-in supports this view. In addition, such automated detection of NoiseSpeech in line with human perceptions could be a useful compositional or performance tool in generative sound art. Of course, the extent to which ambiguity in listener perceptions is deliberately fostered is an artistic question. Results from our survey can only serve to inform sound artists of possible listener responses.

In the broadest terms, to consider sonic construction in terms of multidimensional spaces of timbres is worthwhile, because composers can assert or emphasize such relationships systematically if they choose. In sound-sculpted work that is entirely fixed before presentation, this implies that procedures advocated by spectromorphologic composers and by timbre-space empiricists can indeed have useful application. For example, transformations or cross-morphing of speech sound, as practiced by composers such as Charles Dodge, Paul Lansky, Trevor Wishart, Wende Bartley, Tristan Murail, and others, may pull a listener away from semantic listening (that is, attending to verbal content) to speech-oriented yet non-semantic listening, and eventually to musical listening without concern for the physical source of the sound. In real-time sonic manipulation, particularly with sounds involving substantial components of noisy spectra, our results indicate that superimposition of certain sonic features from speech could again transform the axes upon which listeners attend to the music; this may be an effective device for the laptop improviser, as we originally proposed (Dean 2005).

Future Work

The compositional characteristics underlying the clustering of NoiseSpeech sounds surveyed in this study raise a number of questions for future research. First, the clustering of items comprising stable formants with those in which the spectrum changes over time is of considerable theoretical interest, because spectral and temporal sound dimensions may be important features distinguishing music and speech perception in general, as indicated at the outset of this article. Second, clusters 2 and 3 are separable along compositional lines: cluster 2 includes sounds that are shaped white noise, whereas cluster 3 comprises sounds that are native or processed speech samples. We speculate that the items with the character of cluster 2, in which noise is the predominant sound (indicated by their attribution to a water source), will generally evoke a more negative affect than the items in cluster 3, which are heard as speech-like. This hypothesis can be tested using NoiseSpeech stimuli, taken in conjunction with the results from the current survey. Further empirical investigation would not only inform computer-music composition, but it would also provide a greater understanding of the role of familiarity in the perception of emotional or affective expression. Future research should also investigate further the different listener perceptions when NoiseSpeech is heard in either a musical setting or in isolation.

Acknowledgments

Our thanks go to Michael Bylstra for his assistance in encoding data.
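The machine-detection methods cited in the conclusion (cepstral coefficients, hidden Markov models, AdaBoost over spectrogram features) lie beyond this article, but the underlying intuition, that speech-like signals show rapid broadband spectral change whereas a steady tone does not, can be sketched with a toy spectral-flux feature. The frame length, test signals, and threshold below are illustrative assumptions, not the features used by the systems cited.

```python
# Toy illustration (not the cited systems' actual features): frame-to-frame
# spectral flux, a crude speech/music-style discriminator. Speech-like signals
# with rapidly changing broadband spectra yield high flux; a steady,
# frame-aligned tone yields near-zero flux.
import cmath
import math
import random

N = 256  # frame length in samples (illustrative)

def dft_mag(frame):
    """Magnitude spectrum of one length-N frame (naive DFT, first N//2 bins)."""
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2)]

def spectral_flux(sig):
    """Mean Euclidean distance between magnitude spectra of adjacent frames."""
    frames = [sig[i:i + N] for i in range(0, len(sig) - N + 1, N)]
    specs = [dft_mag(f) for f in frames]
    flux = [math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))
            for s1, s2 in zip(specs, specs[1:])]
    return sum(flux) / len(flux)

rng = random.Random(0)
# Steady sine placed exactly on a DFT bin, so every frame is spectrally identical.
steady_tone = [math.sin(2 * math.pi * 8 * n / N) for n in range(N * 8)]
# Amplitude-modulated white noise: a crude stand-in for a "speech-like" signal.
noisy_speechlike = [rng.uniform(-1, 1) * (0.2 + abs(math.sin(0.01 * n)))
                    for n in range(N * 8)]
```

As expected, the modulated-noise signal produces far higher flux than the steady tone; a real system would feed such features (typically MFCCs) to a trained classifier rather than thresholding one number.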

66 Computer Music Journal
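For readers who wish to experiment, the basic NoiseSpeech construction described in this article, imposing a speech-like formant structure on noise, can be approximated by filtering white noise through cascaded two-pole resonators. The formant frequencies, bandwidths, and filter design below are illustrative textbook values for an /a/-like vowel, not the authors' actual synthesis procedure.

```python
# Minimal sketch (assumed, not the authors' method): white noise shaped by
# vowel-like formant resonators, pure Python, two-pole (all-pole) filters.
import math
import random

SR = 16000  # sample rate in Hz

def resonator(freq_hz, bw_hz):
    """Coefficients (b0, a1, a2) of a two-pole resonator centred on freq_hz."""
    r = math.exp(-math.pi * bw_hz / SR)      # pole radius set by bandwidth
    theta = 2.0 * math.pi * freq_hz / SR     # pole angle set by centre frequency
    return (1.0 - r), -2.0 * r * math.cos(theta), r * r

def filter_signal(x, coeffs):
    """Apply y[n] = b0*x[n] - a1*y[n-1] - a2*y[n-2] to the whole signal."""
    b0, a1, a2 = coeffs
    y1 = y2 = 0.0
    out = []
    for v in x:
        y = b0 * v - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def noise_vowel(formants, dur=0.5, seed=42):
    """White noise passed through cascaded formant resonators."""
    rng = random.Random(seed)
    sig = [rng.uniform(-1.0, 1.0) for _ in range(int(SR * dur))]
    for f, bw in formants:
        sig = filter_signal(sig, resonator(f, bw))
    return sig

def energy_near(sig, freq_hz):
    """Spectral energy of sig at freq_hz via single-bin correlation."""
    w = 2.0 * math.pi * freq_hz / SR
    re = sum(v * math.cos(w * n) for n, v in enumerate(sig))
    im = sum(v * math.sin(w * n) for n, v in enumerate(sig))
    return re * re + im * im

# Illustrative /a/-like formants as (centre frequency Hz, bandwidth Hz).
FORMANTS = [(700, 110), (1220, 120), (2600, 160)]
sig = noise_vowel(FORMANTS)
```

The output retains the broadband, noisy character of its source while concentrating energy near the imposed formants, which is the sonic ambiguity the survey probes; distorting the formant tracks over time would give the time-varying variant discussed under Future Work.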


References

Attali, J. 1985. Noise: The Political Economy of Music, trans. B. Massumi. Manchester: Manchester University Press.

Ballas, J. A. 1993. "Common Factors in the Identification of an Assortment of Brief Everyday Sounds." Journal of Experimental Psychology: Human Perception and Performance 19:250–267.

Banse, R., and K. R. Scherer. 1996. "Acoustic Profiles in Vocal Emotion and Expression." Journal of Personality and Social Psychology 70:614–636.

Belin, P., et al. 1998. "Lateralization of Speech and Auditory Temporal Processing." Journal of Cognitive Neuroscience 10(4):536–540.

Best, C. T., B. Morrongiello, and R. Robson. 1981. "Perceptual Equivalence of Acoustic Cues in Speech and Nonspeech Perception." Perception and Psychophysics 29:191–211.

Boersma, P., and G. Kovačić. 2006. "Spectral Characteristics of Three Styles of Croatian Folk Singing." Journal of the Acoustical Society of America 119(3):1805–1816.

Casagrande, N., D. Eck, and B. Kégl. 2005. "Frame-Level Speech/Music Discrimination Using AdaBoost." Proceedings of the 2005 International Society for Music Information Retrieval. London: Queen Mary, University of London, pp. 345–350.

Clarke, E. F. 2005. Ways of Listening: An Ecological Approach to the Perception of Musical Meaning. New York: Oxford University Press.

Dean, R. T. 2005. "NoiseSpeech, a Noise of Living Bodies: Towards Attali's 'Composition'." Journal of New Media and Culture 3(1). Available online at www.ibiblio.org/nmediac/winter2004/NoiseSpc.htm.

Dean, R. T., and F. Bailes. 2006. "NoiseSpeech." Performance Research 11(3):85–86.

de Cheveigné, A. 2005. "Pitch Perception Models." In C. J. Plack, et al., eds. Pitch: Neural Coding and Perception. New York: Springer.

Diehl, R. L., A. J. Lotto, and L. L. Holt. 2004. "Speech Perception." Annual Review of Psychology 55:149–179.

Gaver, W. W. 1993a. "How Do We Hear in the World? Explorations in Ecological Acoustics." Ecological Psychology 5(4):285–313.

Gaver, W. W. 1993b. "What in the World Do We Hear? An Ecological Approach to Auditory Event Perception." Ecological Psychology 5(1):1–29.

Handel, S. 1989. Listening: An Introduction to the Perception of Auditory Events. Cambridge, Massachusetts: MIT Press.

Hermann, T., et al. 2002. "Sonifications for EEG Data Analysis." Paper presented at the 2002 International Conference on Auditory Display, Kyoto, Japan, July. Available online at www.icad.org/websiteV2.0/Conferences/ICAD2002/proceedings/22 Thomas Hermann EEG.pdf.

Leman, M., et al. 2005. "Prediction of Musical Affect Using a Combination of Acoustic Structural Cues." Journal of New Music Research 34(1):39–67.

Pierrehumbert, J. 1979. "The Perception of Fundamental Frequency Declination." Journal of the Acoustical Society of America 66(2):363–369.

Remez, R. E., et al. 1981. "Speech Perception without Traditional Speech Cues." Science 212:947–950.

Repp, B. H. 1982. "Phonetic Trading Relations and Context Effects: New Experimental Evidence for a Speech Mode of Perception." Psychological Bulletin 92(1):81–110.

Smith, H., and R. T. Dean. 1996. Nuraghic Echoes. Audio compact disc. Rufus Records RF 025.

Windsor, L. W. 1997. "Frequency Structure in Electroacoustic Music: Ideology, Function and Perception." Organised Sound 2(2):77–82.

Zatorre, R. J., P. Belin, and V. B. Penhune. 2002. "Structure and Function of Auditory Cortex: Music and Speech." Trends in Cognitive Sciences 6(1):37–46.
