Spectral Anticipations
Music Department
University of California at San Diego
9500 Gilman Drive
La Jolla, California 92037-0326 USA
sdubnov@ucsd.edu
Computer Music Journal, 30:2, pp. 63–83, Summer 2006. © 2006 Massachusetts Institute of Technology.

This article deals with relations between randomness and structure in audio and musical sounds. Randomness, in the casual sense, refers to something that has an element of variation or surprise in it, whereas structure refers to something more predictable, rule-based, or even deterministic. When dealing with noise, which is the "purest" type of randomness, one usually adopts the canonical physical or engineering definition of noise as a signal with a white spectrum (i.e., composed of equal or almost-equal energies in all frequencies). This seems to imply that noise is a complex phenomenon simply because it contains many frequency components. (Mathematically, to qualify as random or stochastic process, the density of the frequency components must be such that the signal would have a continuous spectrum, whereas periodic components would be spectral lines or delta functions.) In contradiction to this reasoning stands the fact that, to our perception, noise is a rather simple signal, and in terms of its musical use, it does not allow much structural manipulation or organization. Musical notes or other repeating or periodic acoustic components in music are closer to being deterministic and could be considered as "structure." However, complex musical signals, such as polyphonic or orchestral music that contain simultaneous contributions from multiple instrumental sources, often have a spectrum so dense that it seems to approach a noise-like spectrum. In such situations, the ability to determine the structure of the signal cannot be revealed by looking at signal spectrum alone. Therefore, the physical definition of noise as a signal with a smooth or approximately continuous spectrum seems to obscure other significant properties of signals versus noise, such as whether a given signal has temporal structure, in other words, whether the signal can be predicted.

The article presents a novel approach to (automatic) analysis of music based on an anticipation profile. This approach considers dynamic properties of signals in terms of an anticipation property, which is shown to be significant for discrimination of noise versus structure and the characterization and analysis of complex signals. Mathematically, this is formalized in terms of measuring the reduction in the uncertainty about the signal that is achieved when anticipations are formed by the listener.

Considering anticipation as a characteristic of a signal involves a different approach to the analysis of audio and musical signals. In our approach, the signal is no longer considered to be characterized according to its features or descriptors alone, but its characterization takes into account also an observer that operates intelligently on the signal, so that both the information source (the signal) and a listener (information sink) are included as parts of one model. This approach fits very well into an information theoretic framework, where it becomes a characterization of a communication process over a time-channel between the music (present time of the acoustic signal) and a listener (a system that has memory and prediction capabilities based on a signal's past). The amount of structure is equated to the amount of information that is "transmitted" or "transferred" from a signal's past into the present, which depends both on the nature of the signal and the nature of the listening system.

This formulation introduces several important advantages. First, it resolves certain paradoxes related to the use of information theoretic concepts in music. Specifically, it corrects the naive equation of entropy (or uncertainty) to the amount of interest present in the signal. Instead, we are considering the relative reduction in uncertainty caused by prediction/anticipation. Additionally, the measure has the desired "inverted-U function" behavior that characterizes both signals that are either nearly constant and signals that are highly random as signals that have little structure. In both cases, the reduction in uncertainty owing to prediction is small. In the first case, this is caused by the small amount of variation in the signal to start with, whereas in the second
case, there is little reduction in uncertainty, because prediction has little or no effect. Finally, this formulation also clarifies the difference between determinism and predictability. Signals that have deterministic dynamics (such as certain chaotic signals) might have varying degrees of predictability, depending on the precision of the measurement or exact knowledge of their past. A listener, or any practical prediction system, must have some uncertainty about its past, and this uncertainty grows when trying to predict the future, even in case of a deterministic system.

The ideas in the article are presented starting from a background of information theory and basis decomposition, followed by the idea of a vector approach to the anticipation measurement, and concluding with some higher-level reflections on its meaning for complex musical signals. The presentation takes several approaches: an engineering or signal-processing approach, in which notions of structure are related to basic questions about signal representation; an audio information-retrieval approach, in which the ideas are applied for characterization of natural sounds; and a musical analysis approach, in which time-varying analysis of musical recordings is used to create an anticipation profile that describes their structures. This last aspect has interesting possibilities for exploring relationships among signal measurements and human cognitive judgments of music, such as emotional force.

The main contribution of our model is in the vector approach, which generalizes notions of anticipation for the case of complex, multi-component musical signals. This measure uses concepts from principal component analysis (PCA) and independent component analysis (ICA) to find a suitable geometric representation of a monaural audio recording in a higher-dimensional feature space. This pre-processing creates a representation that allows estimating the anticipation of a complex signal from the sum of anticipations of its individual components. Accordingly, the term "complex signal" used throughout this article describes the multi-component nature of a single-channel recording and should not be confused with multiple-channel recordings.

Our Model

Our model of signal structure considers musical or audio material as an information source, which is communicated over time to a listener (either human or machine). The amount of information transmitted through this communication process depends on the nature of the signal and the listener. The listener makes predictions and forms expectations; the music source generates new samples. Accordingly, we define structure to be the aspect of musical material that the listener can predict, and noise or randomness as what the listener would consider as an unpredictable flow of data.

Information, Entropy, Mutual Information, and Information Rate

When a sequence of symbols, such as text, music, images, or even genetic codes, is considered from an information theoretic point of view, it is assumed that the specific data is a result of production by an information source. The concept of the information source allows description of many different sequences by a single statistical model with a given probability distribution. The idea of looking at a source, rather than a particular sequence, characterizes which types of sequences are probable and which are not; in other words, what data is more likely to appear. Information theory also teaches that, in the long run, some sequences become "typical" of the source, while others might turn very improbable or so rare that they would, in practice, never occur.

The logarithm of the relative size of the typical set (log of the number of typical sequences divided by log of number of all possible sequences of the same length) is called entropy, and it is considered as the characteristic amount of uncertainty inherent to the source. The larger the typical set, the higher the uncertainty, and vice versa. For instance, a biased coin that falls mostly on heads will produce sequences whose empirical average approaches the statistical mean, which comprises only a small fraction of all possible sequences of heads and
Figure 2. Information source, channel, and receiver.
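The biased-coin remark in the entropy discussion can be made concrete with a small numerical sketch. The function names below are illustrative assumptions, not part of the article, and the count of typical sequences uses the standard textbook approximation of about 2^(nH) typical sequences of length n.

```python
import math


def binary_entropy(p):
    """Entropy, in bits per toss, of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def log2_typical_count(p, n):
    """Approximate log2 of the number of typical length-n sequences (about n * H)."""
    return n * binary_entropy(p)


n = 100  # sequence length, chosen only for illustration
for p in (0.5, 0.9, 0.99):
    h = binary_entropy(p)
    # log2 of the number of all possible sequences of length n is simply n.
    ratio = log2_typical_count(p, n) / n
    print(f"p={p}: H={h:.3f} bits, log2(#typical)/log2(#all)={ratio:.3f}")
```

For a fair coin the ratio is one (every sequence is typical), while a heavily biased coin yields a typical set that is a vanishing fraction of all sequences, which is the sense in which its uncertainty is low.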
Example Case 1: Inverted-U Function for the Amount of Structure

A purely random signal cannot be predicted and thus has the same uncertainty before and after prediction, resulting in zero IR. An almost constant signal, on the other hand, has a small uncertainty

Example Case 4: Determinism Versus Predictability

An interesting application of IR is characterization of chaotic processes. Considering for instance a logistic map $x(n+1) = a\,x(n)(1 - x(n))$, it is evident that knowledge of $x(n)$ provides complete information
Figure 3. Signal-IR plotted on top of the signal spectrogram. See text for more detail.
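The distinction between determinism and predictability for the logistic map of Example Case 4 can be illustrated with a short sketch. The parameter a = 4, the starting point 0.3, and the measurement error of 1e-9 are assumptions chosen for illustration; the article does not specify them.

```python
def logistic_step(x, a=4.0):
    """One iteration of the logistic map x(n+1) = a * x(n) * (1 - x(n))."""
    return a * x * (1.0 - x)


def separation(x0, eps, steps, a=4.0):
    """Distance between two trajectories that start eps apart, after each step."""
    x, y = x0, x0 + eps
    gaps = []
    for _ in range(steps):
        x, y = logistic_step(x, a), logistic_step(y, a)
        gaps.append(abs(x - y))
    return gaps


# A listener who knows x(n) only to within 1e-9 predicts the next sample well,
# but the same tiny uncertainty makes distant samples effectively unpredictable.
gaps = separation(x0=0.3, eps=1e-9, steps=100)
print("error after 1 step:", gaps[0])
print("largest error within 100 steps:", max(gaps))
```

The dynamics are fully deterministic, yet the usable prediction horizon depends on how precisely the past is known, which is the point made in the text.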
(Dubnov 2003) that IR may be calculated as a sum of IRs of the individual components, $s_i(n)$, $i = 1 \ldots n$,

$$\rho_L(X_1, X_2, \ldots, X_L) = \sum_{i=1}^{n} \rho(s_i(1), \ldots, s_i(L)) \qquad (7)$$

The vector-IR generalization allows identification of the structural elements of the signal as the components with high scalar-IR.

Example: Vector-IR Analysis of Mixed Sinusoidal and Noise Components

Given a sinusoidal signal $s(t) = \sin(\omega t + \phi)$ and a white-noise signal $n(t)$, we consider the case of a mixed sinusoidal and noise signal $x(t) = s(t) + n(t)$. For this signal, we may find separate signal and noise spaces using singular value decomposition (SVD) using signal representation as a sequence of frames containing consecutive signal samples

$$X_1 = [x_1, x_2, \ldots, x_n]^T,\; X_2 = [x_{n+1}, x_{n+2}, \ldots, x_{2n}]^T,\; \ldots,\; X_i = [x_{(i-1)n+1}, x_{(i-1)n+2}, \ldots, x_{in}]^T \qquad (8)$$

Concatenating $L$ signal segments into a matrix $X = [X_1 X_2 \ldots X_L]$, SVD provides a decomposition $X = U \Sigma V^T$, which gives the $n$ basis vectors in columns of the $U$ matrix (orthogonal $n$-dimensional vectors), a diagonal matrix $\Sigma$ that gives the $n$ variances of the coefficients, and the first $n$ rows of the matrix $V^T$ that give the normalized expansion coefficients in this space.

Figure 4 shows the results of applying SVD to the sinusoidal and noise signal. The first two basis vectors are the two quadrature sinusoidal components. The remaining basis vectors are the noise components. Figure 5 shows the corresponding expansion coefficients.
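The framing and decomposition steps described above can be sketched with NumPy. This is a minimal illustration, not the article's MATLAB code: the frame length, number of frames, sinusoid frequency, phase, and noise level are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n, L = 32, 200                   # frame length and number of frames (assumed)
t = np.arange(n * L)
omega = 2 * np.pi * 0.1          # assumed frequency in radians per sample
x = np.sin(omega * t + 0.5) + 0.3 * rng.standard_normal(n * L)

# Columns of X are the consecutive frames X_1, ..., X_L of Eq. 8.
X = x.reshape(L, n).T            # shape (n, L)

# U holds the n basis vectors as columns, s the singular values,
# and the rows of Vt the normalized expansion coefficients.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print("largest singular values:", np.round(s[:4], 1))
```

With these settings the first two singular values stand well above the rest, matching the description of two quadrature sinusoidal basis vectors followed by noise components.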
frames containing consecutive signal samples We apply scalar-IR analysis to the first n rows of
the matrix V T using the SFM method of scalar-IR Table 1. Scalar-IR and Vector-IR Values for the Ex-
estimation. The sum of the scalar-IRs yields the ample Sinusoidal Component, Noise Component,
vector-IR. These estimates were compared to the and Combined Sines+Noise Signal
scalar-IR of the individual sinusoidal and noise sig-
Sinusoid Noise Sines+Noise
nals and their sum signal. In Table 1 in the paren-
theses, we show the value of SFM for these signals, Scalar-IR (SFM) 6.31 (3 10 ) 8.7e-5 (0.99) 0.16 (0.72)
6
which is bounded between zero and one, making it Vector-IR (SFM) 8.18 (7 108) 0.21 (0.657) 2.44 (0.007)
somewhat easier to consider these values as a mea-
sure of the amount of structure. To read the SFM val-
ues, one should recall that SFM values close to one product of the SFMs of the individual sinusoid or
indicate noise and values near zero indicate structure. noise signals).
In the case of vector-IR, the SFM value is actually the These results show that the amount of structure
generalized-SFM, which is a result of summing the revealed by vector-IR (or generalized-SFM) for the
individual scalar-IRs and transforming the resulting sines+noise signal is significantly higher (and SFM
vector-IR back to SFM. We use the relationship lower) than the structure estimated by scalar-IR on
the sum signal. It should also be noted that vector-IR
generalizedSFM = e2 vectorIR (9) estimation of noise components is imperfect, result-
ing in nonzero values. In general, there seems to be a
which also equals the product of the individual col- tradeoff between the precision of estimation of struc-
umn SFMs. So, in principle, presence of a strong tured and noise components using the SFM proce-
structural element (i.e., SFM close to zero) makes dure. Using low-resolution spectral analysis in SFM
the generalized SFM small, indicating structure. estimation allows better averaging and improved es-
This depends of course on the ability of SVD to sep- timation of the noise spectrum. This comes at the
arate the structural components (in practice, the expense of poorer estimation of peaked or spectrally
SVD of the sines+noise signal contained some errors structured components. Using high-resolution spec-
in the structural components that it found, result- tral estimate causes better resolution of the spectral
ing in a higher generalized-SFM than the theoretical peaks, but creates a bias towards structured results.
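The relationship between IR and SFM used in this example (Eq. 9) is easy to check numerically against the values reported in Table 1. The helper names below are illustrative; only the exponential relationship and the table values come from the text.

```python
import math


def ir_to_sfm(ir):
    """Convert an IR value (in nats) to the corresponding SFM: exp(-2 * IR)."""
    return math.exp(-2.0 * ir)


def sfm_to_ir(sfm):
    """Inverse mapping: IR = -0.5 * ln(SFM)."""
    return -0.5 * math.log(sfm)


# IR values as reported in Table 1 (scalar-IR row and vector-IR row).
table1_ir = {
    "scalar sinusoid": 6.31,
    "scalar noise": 8.7e-5,
    "scalar sines+noise": 0.16,
    "vector sinusoid": 8.18,
    "vector noise": 0.21,
    "vector sines+noise": 2.44,
}
for name, ir in table1_ir.items():
    print(f"{name}: IR={ir}, SFM={ir_to_sfm(ir):.3g}")

# Summing IRs corresponds to multiplying SFMs, as stated for the generalized-SFM.
irs = [0.1, 0.2, 0.7]  # arbitrary illustrative component IRs
product_of_sfms = math.prod(ir_to_sfm(v) for v in irs)
print(ir_to_sfm(sum(irs)), product_of_sfms)
```

The recomputed SFMs agree with the parenthesized table entries to the precision printed there, which also confirms the negative sign in the exponent of Eq. 9.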
Figure 6. Geometrical signal representation consisting of a transform operation followed by data reduction.
Figure 8. Spectrogram of a cheering crowd.
final step consists of separate estimation of the IR of the individual components, according to the principles of IR estimation for multivariate/vector processes described earlier. Having described the different stages in the anticipation algorithm, we now describe the application of this method to audio characterization and musical analysis.

Comparison of Vector and Scalar-IR Analysis for Natural Sounds

In this and the following sections, we examine natural sounds that contain significant energy throughout all frequencies of the spectrum. Figure 8 shows the magnitude spectrum on a decibel scale (log-magnitude) of a cheering crowd sound. This sound contains a dense mixture of different types of sounds. Hand clapping can be seen as vertical lines, while the vocal exclamations appear as high-amplitude spectral lines varying in time.

Performing vector-IR analysis shows that different coefficients contain different IR structure. Figure 9 shows the result of IR analysis of a cheering-crowd signal sampled at 8 kHz using an FFT of size 256 with 50 percent overlap. A total of 129 spectral bins are retained (half the total bins due to symmetry of the STFT, plus the first bin). The IR analysis consists of decorrelation of log-magnitude spectral matrix using SVD and evaluation of scalar-IR separately for each of the expansion coefficients. Scalar estimation of IR of each of the coefficients consists of estimating SFM from the power spectral density of the coefficient time series using the Welch method (Hayes 1996) with 64 spectral bins.

The results of vector-IR analysis were compared to scalar-IR analysis of the signal. Additionally, a synthetic signal with a power spectral density similar to that of the original cheering-crowd signal was constructed by means of passing white noise through an appropriate filter, whose parameters were estimated using linear prediction (LP) with eight filter coefficients. This synthetic signal, even though non-white, has little overall structure, because its different spectral bands (considered either as coefficients of STFT or outputs of a filterbank) lack a time structure. The log-magnitude spectral matrix of such a signal can be described as a rank-one constant matrix corresponding to single basis vector that captures the overall spectral shape, and a noise matrix that represents the variations between the different bands at different times. The SVD of such matrix contains a single structured basis vector that captures the overall spectral shape, and remaining basis vectors having white (and thus non-structured) coefficients. The purpose of testing IR for this synthetic signal was to examine the robustness of vector-IR analysis to the influence of an overall spectral shape, that is, to compare the real signal to a colored noise signal that does not have the detailed temporal structure within its individual bands. The results of vector-IR and scalar-IR analysis of both signals are given in Table 2.

Table 2. Results of Vector-IR and Scalar-IR Analysis of the Original Cheering-Crowd Signal and a Corresponding Synthetic Signal

             Original Signal    Equivalent Filtered Noise
Vector-IR    10.3               1.9
Scalar-IR    1.9                1.6

It can be seen that vector-IR efficiently detects the structure in the original cheering-crowd signal, distinguishing it from the corresponding filtered
Figure 10. First four spectral magnitude basis functions.
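The scalar-IR estimation step used in this section (SFM computed from a power spectral density estimate of a coefficient time series) can be sketched as follows. This is a simplified illustration under stated assumptions: the PSD is a plain average of periodograms over non-overlapping, unwindowed segments, used here as a stand-in for the Welch method named in the text, and the segment length, test signals, and function names are hypothetical.

```python
import numpy as np


def averaged_periodogram(x, seg_len=128):
    """Average |FFT|^2 over non-overlapping segments (simplified Welch-style PSD)."""
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // seg_len
    segs = x[: n_seg * seg_len].reshape(n_seg, seg_len)
    return np.mean(np.abs(np.fft.rfft(segs, axis=1)) ** 2, axis=0)


def sfm(psd, eps=1e-12):
    """Spectral flatness measure: geometric mean of the PSD over its arithmetic mean."""
    psd = np.asarray(psd, dtype=float) + eps  # eps guards against log(0)
    return float(np.exp(np.mean(np.log(psd))) / np.mean(psd))


def scalar_ir(x, seg_len=128):
    """Scalar-IR estimate in nats, using IR = -0.5 * ln(SFM)."""
    return -0.5 * np.log(sfm(averaged_periodogram(x, seg_len)))


rng = np.random.default_rng(0)
t = np.arange(8192)
noise = rng.standard_normal(t.size)
# A tone with exactly 8 cycles per 128-sample segment, plus weak noise.
tone = np.sin(2 * np.pi * 8 * t / 128) + 0.1 * rng.standard_normal(t.size)

print("IR of white noise:", scalar_ir(noise))
print("IR of tone + noise:", scalar_ir(tone))
```

White noise gives an IR close to zero (a small positive bias remains because the PSD estimate is itself noisy), whereas the tone gives a clearly larger value. In a vector-IR analysis the same estimator would be applied to each expansion coefficient and the results summed.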
we extend the method of IR analysis to the non-stationary case by applying IR in a time-varying fashion.

In the following example, we represented a sound in terms of a sequence of spectral envelopes, represented by spectral or cepstral coefficients. These vectors are grouped into macro-frames and submitted to separate IR analyses, resulting in a single IR value for each macro-frame. When the IR graph is plotted against time, one obtains a graph of IR evolution over the course of the musical signal; we call this the anticipation profile. Using different analysis parameters, anticipation profiles could be used to analyze affective vocalizations (Slaney and McRoberts 1998) or musical sounds (Scheirer 2000). For the purpose of vocalization analysis, the IR macro-frames are 1 sec long with 75 percent overlap. For analysis of musical recordings, longer frames are required, varying from 3 sec for solo or chamber instrumental music to 30 sec for complex orchestral textures. (The reason for this will be explained in the next section.)

Anticipation Profile of Musical Signals

A comparative vector-IR analysis of musical signal using 30 cepstral and 30 magnitude spectral basis components is presented in Figure 12 for an excerpt from the first movement of Schumann's Piano Quartet in E-flat Major. The features were estimated over signal frames of 20 msec duration. The macro-frame for IR analysis was 3 sec long with 75 percent overlap between successive analysis frames. Figure 12 shows vector-IR results for the two representations, overlaid on top of a signal spectrogram.

Figure 12. Graph of the anticipation profile (estimated by vector-IR) using cepstral (solid) and spectral magnitude (dotted) features, displayed over a spectrogram of a musical excerpt (the first movement of Schumann's Piano Quartet in E-flat Major).

As can be seen from the figure, both methods result in similar anticipation profiles. Varying the order of the model (changing the data reduction or so-called cepstral "liftering" number) between 20 and
60 components has little effect on the results, indicating that the system is quite robust to changes in the representation detail.

When considering the results of this analysis, one might argue that the nature of music of the Schumann piano quartet is such that its structure might be detected using simpler methods, such as energy or other existing voice-activity detectors. To show a situation where vector-IR analysis gives a unique glimpse into musical structure that other features do not allow, we present in Figure 13 the results of vector-IR analysis of an acoustic recording of a MIDI rendering of Bach's Prelude in G Major from Book I of the Well-Tempered Clavier.

The synthetic nature of the computer playback of a MIDI file is such that the resulting acoustic signal lacks almost any expressive inflections, including dynamics. Moreover, the use of a synthetic piano creates a signal with little spectral variation. As can be observed in Figure 13, vector-IR still detects significant changes in the music. Analysis of similarities between the IR curves and the major sections of the score from a music-theoretical point of view seems to indicate that the anticipation profile captures relevant aspects of the musical structure. For instance, the first drop of the IR graph corresponds to the opening of the prelude, ending at the first cadence and modulation to the dominant. The following lower part of the IR graph corresponds to two repetitions of an ostinato pattern. Then the two peaks in IR seem to be related to reappearance of the opening theme with harmonic modulations, ending in the next IR valley at the repeating melodic pattern on the parallel minor. The next increase in IR corresponds to the development section on the dominant, followed by a final transition to cadence, with climax and final descent along the cadential passage.

To have a better impression of the correspondence between the variation in music structure and texture to our analysis, we present in Figure 14
again the same graph of anticipation profile, this time overlaid on top of MIDI notes. (The crosses indicate note onsets, with actual note numbers not represented in the graph.)

Figure 14. Graph of anticipation profile (estimated by vector-IR from an audio recording) displayed on top of MIDI note onsets of the Bach Prelude.

It appears that changes and repetitions of music materials are detected by IR analysis of the acoustic signal. (It should be noted that our anticipation analysis does not involve note or pitch detection or any use of the score of MIDI information.)

Anticipation Profile and Emotional Force

To evaluate the significance of the IR method for music analysis, a comparison between anticipation profiles derived from automatic signal analysis and human perception of musical contents is required. In a recent experiment, large amounts of data concerning human emotional responses when listening to a performance of a contemporary orchestral musical work (Angel of Death by Roger Reynolds) were collected during live concerts (McAdams et al. 2002). During these concerts, listeners were assigned a response box with a sliding fader that allowed continuous analog ratings to be made on a scale of emotional force (Smith 2001). Listeners were instructed that positive or negative emotional reactions of similar magnitude were to be judged at the same level of the emotional force scale. The ends of the emotional force scale were labeled "weak" and "strong." In addition, a small "I don't know" region was provided at the far left end of this scale that could be sensed tactilely, as the cursor provided a slight resistance to moving into or out of this zone.

Continuous data from response boxes were converted to MIDI format (on an integer scale from 0 to 127) and recorded simultaneously with the musical performance. Figure 15 presents a comparative graph of the anticipation profile resulting from IR
analysis of the audio signal (the concert recording) and a graph of the average human responses termed Emotional Force (EF) for two versions of the orchestral piece.

Analyses were performed using analysis frame sizes of 200 msec with macro-frames 3 sec long, with no overlap between the macro-frame segments. These IR values were additionally smoothed using a 10-segment-long moving-average filter, resulting in an effective analysis frame of 30 sec with a 3-sec interval between analysis values. The extra averaging removed fast variations in vector-IR analysis, assuming that a 30-sec smoothing better matches the rate of change in human judgments for such a complex orchestral piece.

As can be seen from Figure 15, certain portions of the IR curve fit closely to the EF data, whereas other portions differ significantly. It was found that the correlations between the IR data and the EF were 63 percent and 47 percent for the top and bottom graphs, respectively. Combined with additional signal features such as signal energy, higher correspondence between signal information analysis and Emotional Force judgments was achieved. The analysis shows strong evidence that signal properties and human reactions are related, suggesting applications of these techniques to music understanding and music information-processing systems. Full details of the experiments and the additional information analysis methods will appear elsewhere (Dubnov, McAdams, and Reynolds in press).

Thoughts About Self-Supervised Brain Processing Architecture

We would like to consider briefly in this section the possible relationships between anticipation analysis and self-supervised architectures of brain processing, particularly in relation to higher cognitive
aspects of musical information processing. One such architecture that was suggested for emotional processing assumes an existence of two separate brain mechanisms that interact in analysis of complex information (Huron 2005): a "fast brain" that deals primarily with pre-processing of sensory information, forming perceptions from the stream of constantly impinging sensory excitations, and a second component, the "slow brain," that interacts with the fast brain by performing appraisal of the fast-brain performance. This architecture is described in Figure 16.

Figure 16. Information-processing architecture that considers emotion as a self-supervision communication within the brain.

In the context of our computational model, we consider the functions of the fast brain to be related to basic pattern-recognition actions, such as feature extraction and data reduction. Anticipation could be considered as an appraisal that is done by the slow brain. That is, anticipation involves evaluating the utility of the past perception for explaining the present in terms of assigning a score to the fast-brain functions according to the "relative reduction of uncertainty about the present when considering its past" (a quote from the definition of IR).

It should be noted that selection of informative features is commonly done in other signal-processing applications, such as speech understanding or computer vision, mostly in terms of classification performance. That is, features are considered to be informative if they have significant mutual information with the class labels for a particular recognition task. In music, the task of recognition is secondary, and it is only natural to assume that anticipation, rather than recognition, is a more appropriate task for describing music information-processing operations.

Conclusion

This article presents a novel approach to (automatic) analysis of audio and music based on an anticipation profile. The ideas are developed from a background of information theory and basis decomposition, through the idea of a vector approach to the anticipation measurement, and ending with some higher-level reflections on its meaning for complex musical signals. The anticipation profile is estimated by evaluating the reduction in the uncertainty (entropy) of a variable resulting from prediction based on the past. Algorithms for estimation of the anticipation profile were presented, with applications for audio signal characterization and music analysis. Discussions show that this measure also resolves several ill-defined aspects of the structure-versus-noise problem in music and includes the listener as an integral part of structural analysis of music. Moreover, the proposed anticipation measure might be related to more general aspects of appraisal or self-supervision of information-processing systems, including aspects of emotional brain architecture.

References

Allamanche, E., et al. 2001. "Content-Based Identification of Audio Material Using MPEG-7 Low Level Description." Proceedings of the 2001 International Symposium on Music Information Retrieval. Bloomington, Indiana: Indiana University, pp. 197–204.
Casey, M. A. 2001. "MPEG-7 Sound Recognition Tools." IEEE Transactions on Circuits and Systems for Video Technology 11(6):737–747.
Casey, M. A., and A. Westner. 2000. "Separation of Mixed Audio Sources by Independent Subspace Analysis." Proceedings of the 2000 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 154–161.
spectrum values are equal, thus meaning a flat spectrum or a white-noise signal.

Information Redundancy

This shows that the vector-IR can be evaluated from the difference of the entropy of the last block and the conditional entropy of that block given its past. Using the transform relationship, one can equivalently express vector-IR as a difference in entropy

past. In information theoretic terms, and assuming a stationary process, this measure equals the difference between the entropy of the marginal distribution of the process $x_n$ and the entropy rate of the process, equally for all $n$.

MATLAB code of the IR analysis, along with specific examples of sound analysis used in this article, can be found online at http://music.ucsd.edu/~sdubnov/InformationRate.
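The description of IR as the marginal entropy minus the entropy rate of a stationary process can be illustrated with a case that has a closed form. The sketch below assumes a Gaussian first-order autoregressive process x(n) = a x(n-1) + e(n); this process, the coefficient `a`, and the function names are illustrative assumptions and are not taken from the article or its MATLAB code.

```python
import math


def gaussian_entropy(var):
    """Differential entropy (nats) of a Gaussian variable with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def ar1_information_rate(a, innovation_var=1.0):
    """IR of a stationary Gaussian AR(1) process with |a| < 1.

    The marginal variance is innovation_var / (1 - a^2), and the entropy rate
    equals the entropy of the innovation, so IR = -0.5 * ln(1 - a^2).
    """
    marginal_var = innovation_var / (1.0 - a * a)
    return gaussian_entropy(marginal_var) - gaussian_entropy(innovation_var)


for a in (0.0, 0.5, 0.9, 0.99):
    ir = ar1_information_rate(a)
    # exp(-2 * IR) recovers 1 - a^2, the spectral flatness of this process.
    print(f"a={a}: IR={ir:.4f} nats, exp(-2*IR)={math.exp(-2 * ir):.4f}")
```

With a = 0 the process is white noise and the IR is zero; as a approaches one, more of the present is explained by the past and the IR grows, consistent with the exponential relationship between IR and SFM used earlier in the article.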