Adaptive Threshold Determination for Spectral Peak Classification

Xavier Rodet

*Universidad Pública de Navarra
Campus Arrosada, 31006 Pamplona, Spain
miro@unavarra.es
Computer Music Journal, 32:2, pp. 57–67, Summer 2008. © 2008 Massachusetts Institute of Technology.

The decomposition of audio spectra into sinusoids, transients, and noise can serve as a useful tool for improving the results of parameter estimation or signal manipulation applications. As has been shown for the case of transient detection (Röbel 2003) and sinusoidal and noise discrimination (Zivanovic, Röbel, and Rodet 2004), the classification of spectral peaks is a beneficial step toward the identification of these signal components. Such a classification scheme that makes optimal use of the information provided by spectral peaks can then be used to achieve a robust segmentation into higher-level signal components, for example, partials or unvoiced regions. Unlike the perceptual audio segmentation (Painter and Spanias 2005), which attempts to maximize the matching between the auditory excitation pattern associated with the original signal and the corresponding auditory excitation pattern associated with the modeled signal, we base our classification purely on signal characteristics.

The basis for spectral peak classification is an adequate choice of criteria that would best describe sinusoidal and noise spectral peaks of audio signals. Ideally, those criteria (from now on, descriptors) would be able to precisely detect the nature of each peak in the spectrum and thus provide a complete separation between the corresponding peak classes in the descriptor domains. Consequently, the decision boundary for the classification process would be unambiguous, and no misclassification of spectral peaks would occur. This scenario, however, is purely hypothetical as the peaks corresponding to sinusoids (partials) in the spectra of real-world signals are usually subject to additive noise and some type of modulation. In these cases, the descriptor distributions of the different peak classes overlap, and the optimal determination of the decision boundaries depends on the specific application.

The peak classification method proposed in Zivanovic, Röbel, and Rodet (2004) uses descriptors that were designed to adequately characterize non-stationary sinusoidal signals. These descriptors have proven to lead to better classification performance than other approaches devoted to sinusoidal detection / estimation (Thompson 1982; Rodet 1997). It was also shown in Zivanovic, Röbel, and Rodet (2004) that the peak classes can be characterized by distributions in the descriptor domains, similar to probability density functions. Once the distributions have been generated, a simple decision tree can be derived that allows the classification of spectral peaks into sinusoids, noise, and sidelobes.

The peak classification method has been used successfully in a number of applications. As examples we mention polyphonic fundamental-frequency detection (Yeh, Röbel, and Rodet 2005), adaptive noise-floor determination (Yeh and Röbel 2006), and voiced / unvoiced frequency boundary determination. Another interesting application lies in the pre-selection of the sinusoidal peaks to reduce the number of candidate peaks considered for partial tracking in additive analysis. A reliable classification of noise peaks could reduce the number of incorrect connections, and, for probabilistic approaches like that described by Depalle, Garcia, and Rodet (1993), it would considerably reduce the computational cost.

The major problem with the classification scheme of Zivanovic, Röbel, and Rodet (2004) is the control of the classification boundaries (classification thresholds) that generally need adaptation for the specific problem at hand. Another problem is that the descriptor boundaries of the different classes will depend on the analysis window that is used. Up
to now, no high-level control parameter existed that would allow a user to adjust the sensitivity of the algorithm in an intuitive manner. There are two signal parameters that directly affect the classification boundaries. The first is the maximum modulation depth and period of the sinusoids. The second is the minimum amplitude of the sinusoids above the noise floor. Both parameters influence the boundaries of the sinusoidal class, and accordingly, both can be used to control the decision boundaries. The problem using the modulation limits as a control parameter is the fact that the modulation is not a single parameter, but rather a parameter vector of at least four dimensions (i.e., period and depth for both amplitude and frequency modulation). Therefore, it cannot be used to provide an intuitive control of the classification boundaries. On the other hand, the sinusoidal peak amplitude above the noise floor is a single parameter that for a given modulation limit would allow us to control the complex decision thresholds rather intuitively.

Accordingly, in this article we investigate the relation between the peak amplitude above the noise floor and the descriptor boundaries for the class of [...]

[...] density between two neighboring minima in the discrete Fourier transform (DFT) modulus |X(k)| of the signal x(n), multiplied by the analysis window.

The spectral peak descriptors proposed in Zivanovic, Röbel, and Rodet (2004) are the normalized bandwidth descriptor, the normalized duration descriptor, and the frequency coherence descriptor. The first two are well suited to distinguish between sinusoidal and noise peaks, and the third can be used to detect the side-lobe structure that is an artifact of the windowing process.

Normalized Bandwidth Descriptor (NBD)

Energy distribution along the frequency grid provides useful information for identifying the nature of the signal related to a given spectral peak. Taking X(k) as the DFT of the windowed signal and L as the number of samples in the spectral peak, we define the NBD as a function of mean frequency $\bar{k}$ (in bins) and a root-mean-square bandwidth $BW_{rms}$:

$$\bar{k} = \frac{\sum_k k\,|X(k)|^2}{\sum_k |X(k)|^2}, \qquad BW_{rms} = \sqrt{\frac{\sum_k (k-\bar{k})^2\,|X(k)|^2}{\sum_k |X(k)|^2}}, \qquad \mathrm{NBD} = \frac{BW_{rms}}{L}$$

[...] radians is given by [...]
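The NBD described in this section can be illustrated with a short sketch. It assumes that a peak is the run of DFT bins between two neighboring local minima of the magnitude spectrum, that the mean frequency and RMS bandwidth are weighted by squared magnitude, and that the peak width L counts both bounding minima. The function names `split_peaks` and `nbd` are hypothetical and are not taken from the article.

```python
import math


def split_peaks(mag):
    """Return (lo, hi) bin ranges, each bounded by neighboring local minima of mag."""
    minima = [0]
    for k in range(1, len(mag) - 1):
        if mag[k] <= mag[k - 1] and mag[k] < mag[k + 1]:
            minima.append(k)
    minima.append(len(mag) - 1)
    return [(lo, hi) for lo, hi in zip(minima, minima[1:]) if hi > lo]


def nbd(mag, lo, hi):
    """Normalized bandwidth descriptor of the peak spanning bins lo..hi (inclusive).

    Assumption: energy weighting uses |X(k)|^2 and L = hi - lo + 1.
    """
    bins = range(lo, hi + 1)
    power = [mag[k] ** 2 for k in bins]
    total = sum(power)
    if total == 0.0:
        raise ValueError("peak carries no energy")
    mean_k = sum(k * p for k, p in zip(bins, power)) / total
    var_k = sum((k - mean_k) ** 2 * p for k, p in zip(bins, power)) / total
    return math.sqrt(var_k) / (hi - lo + 1)


# Toy magnitude spectrum with a narrow peak followed by a small second peak.
spectrum = [0.0, 1.0, 4.0, 1.0, 0.0, 2.0, 0.0]
peaks = split_peaks(spectrum)
narrow = nbd(spectrum, *peaks[0])
```

A peak whose energy is concentrated near its maximum gives a small NBD, whereas a flat, noise-like peak of the same width gives a larger value, which is the property the sine / noise decision relies on.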
[...] the fact that the spectrum of the sinusoidal component must contain a dominant main lobe. Otherwise, the investigation of an individual spectral peak cannot provide us with sufficient information about the underlying sinusoid. Accordingly, the modulation rate and depth should be limited such that a dominant main lobe is present in the Fourier spectrum of each sinusoidal component. Because frequency and time resolution are related to the window size and shape, the modulation limits depend on these two variables. A simple solution to ensure the modulation constraint for all window sizes is to determine the maximum modulation that respects the constraint for a given window size and to change the worst-case modulation rate proportionally with the window size.

As the next step, we must define the worst-case signal, which will be used to derive the descriptor limits of the sinusoidal class. From the wide range of possible modulation laws, we have chosen the sinusoidal amplitude and frequency modulation in white Gaussian background noise as our worst-case reference signal. The choice is motivated by the fact that a range of FM and AM conditions can be covered. If the window size is small compared to the vibrato period, for example, it is easy to see that the vibrato signal approximately creates linear FM and AM. Recent investigations (Arroabarren, Rodet, and Carlosena 2006; Verfaille, Guastavino, and Depalle 2005) have shown that, for real-world vibrato signals, the AM and FM will generally not be phase-synchronous. Accordingly, the worst-case signal model exhibits arbitrary phase relations between the amplitude and frequency modulation. Because part of the AM is induced by the FM and the resonator filter of the sound source, the dominant AM rate may either be the same as the FM rate or twice as high. As the latter case is more critical, we chose it for our worst-case signal scenario.

In view of the aforementioned discussion, the following mathematical expression for the sinusoidal model is proposed:

$$x(n) = \cos\left[2\pi F_0 n + A_{FM}\sin\left(2\pi F_{FM} n + \phi\right)\right]\left[1 + A_{AM}\cos\left(2\pi F_{AM} n + \theta\right)\right] + r(n) \qquad (11)$$

where r(n) is additive Gaussian noise. According to the previous discussion, we set $F_{AM} = 2F_{FM}$. The frequency vibrato rate $F_{FM}$ must be selected such that the spectrum always contains a significant main lobe, which is ensured by $F_{FM} = 1 / (4.2M)$. Accordingly, the window covers less than one-fourth of the FM vibrato period. The values for the amplitude and frequency modulation depth have been chosen as $A_{AM} = 0.5$ and $A_{FM} = 10$. These values ensure a dominant-peak main lobe for arbitrary phase angles ($\phi$ and $\theta$). The window length M, the sinusoidal frequency $F_0$, and the sample rate R do not impact the results. The size of the DFT N is chosen to assure that the picket-fence effect has minimal impact on a peak representation in the discrete spectrum. For completeness, we note the values that we used for the following investigation into the descriptor distributions: M = 40 msec, N = 4,096, $F_0$ = 880 Hz, and R = 44,000 samples / sec.

It is clear that the present worst-case signal does not cover all modulations that may be encountered in a real-world setting, even if we respect the fact that a dominant main lobe is required to detect a modulated sinusoid. The explicit inclusion of time-varying sinusoids into the model will nevertheless lead to a classifier that has significant advantages in real-world situations involving time-varying sinusoids.

Because the part of the sinusoidal peak that can be observed changes with the variance $\sigma_r^2$ of the background noise r(n), the peak descriptors will not only change with the modulation, but also with the signal-to-noise ratio (SNR). For multi-component signals, the global SNR does not provide meaningful insight, and therefore, we use the peak signal-to-noise ratio (SNRp) as our noise-level parameter. The SNRp indicates the sinusoidal peak power level in dB over the noise floor (see Figure 1), and it presents a convenient parameter to control the limits of the sinusoidal class.

To experimentally create the descriptor distributions, we proceed as follows. For the noise class distributions, we calculate the descriptors for all spectral peaks in the DFT of white Gaussian noise processes using an analysis window of size M. For the sinusoidal class, we create a grid of phase values [...]
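As an illustration of the worst-case model in Equation 11, the following sketch synthesizes one analysis frame. It assumes that M is given in samples and that all frequencies are normalized to cycles per sample (so M = 40 msec at R = 44,000 samples / sec becomes 1,760 samples and F0 = 880 Hz becomes 0.02). The phase-angle names `phi` and `theta`, the function name, and the `noise_std` and `seed` parameters are choices made for this example only.

```python
import math
import random


def worst_case_signal(M, F0, A_FM=10.0, A_AM=0.5, phi=0.0, theta=0.0,
                      noise_std=0.0, seed=0):
    """Synthesize M samples of the AM/FM sinusoid plus white Gaussian noise.

    Assumptions: M in samples, F0 in cycles per sample,
    F_FM = 1 / (4.2 * M) and F_AM = 2 * F_FM as stated in the text.
    """
    rng = random.Random(seed)
    F_FM = 1.0 / (4.2 * M)
    F_AM = 2.0 * F_FM
    x = []
    for n in range(M):
        phase = 2.0 * math.pi * F0 * n + A_FM * math.sin(2.0 * math.pi * F_FM * n + phi)
        envelope = 1.0 + A_AM * math.cos(2.0 * math.pi * F_AM * n + theta)
        noise = rng.gauss(0.0, noise_std) if noise_std > 0.0 else 0.0
        x.append(math.cos(phase) * envelope + noise)
    return x


# 40 msec at 44,000 samples / sec gives 1,760 samples; 880 Hz gives 0.02 cycles per sample.
frame = worst_case_signal(M=1760, F0=880.0 / 44000.0)
```

Sweeping `phi` and `theta` over a grid and recomputing the descriptors for each frame is one way to reproduce the kind of distribution experiment the text describes; the grid resolution is left open here because the source does not state it in this excerpt.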
Classification Strategy
[...] two-level decision tree as follows. In the first level, the side-lobe and non-side-lobe classification is performed. In the second level, the peaks previously declared as non-side-lobes are classified as sinusoids and noise. The thresholds for both levels of classification are obtained by analyzing the distributions shown in Figure 2. For infinite SNRp, the classification could be obtained by simply using FCD and NBD thresholds to perfectly separate all three peak classes. Note that only in this particular case, the NBD attains almost perfect sine / non-sine classification; therefore the contribution of the NDD is negligible. The classification scheme that is used for infinite SNRp is shown in Table 1.

Table 1. Peak Classification Thresholds for Infinite SNRp Computed Using a Hanning Window

Peak Classes                 Descriptor   Threshold Values
side-lobe / non-side-lobe    FCD          N / M
sine / noise                 NBD          0.13
                             NDD          0.13 and 0.16

For a finite SNRp, the sinusoidal distributions experience a spread proportional to the noise level in the worst-case signal. In particular, the NBD sinusoidal distribution extends towards the right, whereas the NDD sinusoidal distribution spreads in both directions. The sinusoidal NBD distribution overlaps partially with the noise NBD distribution, which means that the NBD can no longer perfectly separate the peak classes. To reduce this ambiguity, we use the NDD. As mentioned previously, the sinusoidal NDD distribution covers only a small range of descriptor values. Hence, by considering only the peaks within the limits of the sinusoidal NDD distribution as sinusoids, we can eliminate some of the noise peaks previously classified as sinusoids and thus refine the initial sine / noise classification.

It is important to understand that a decreasing SNRp will modify the limits of the sinusoidal distribution in a manner similar to that of an increase in the modulation parameters. Therefore, the minimum SNRp can be used to control the decision thresholds in a rather intuitive manner. To keep track of the limit values of the sinusoidal distributions, we would need to regenerate all the sinusoidal distributions every time the minimum SNRp that is selected by the user is changed. As shown subsequently, however, the experimental evaluation of the distribution limits can be avoided thanks to a simple approximate formula that expresses the relationship between the parameter SNRp and the margins of the sinusoidal peak distributions in the descriptor domain. These can be used to adapt the classifier to the selected SNRp. The thresholds to be adapted are the right margin of the NBD sinusoidal distribution and both margins of the NDD sinusoidal distribution. As for the FCD, the threshold can be held fixed thanks to the good side-lobe separation from the rest of the peak classes.

Modeling SNRp Dependency

The relation between the classification threshold and the SNRp is rather complex, and to be able to achieve a model of these relations, the problem requires a number of simplifications. We first experimentally determine the signal pattern that is related to the descriptor limits for infinite SNRp. Then, we develop a simplified model of the effect of the additive noise to be able to achieve a mathematical formulation of the threshold dependency on the SNRp. The relation does not take into account the fact that the signal pattern at the descriptor limits may depend on the SNRp.

NBD Threshold

Recall that the NBD is the ratio of the peak bandwidth to the peak width. As described above, we must first determine the sinusoidal signal that will give rise to the maximum value of the descriptor $\mathrm{NBD}_{\max} = BW / L$. This can be done by means of a straightforward search over the two-dimensional grid of phase values $\phi$ and $\theta$ for a given analysis window. (See Table 2 for a listing of some prominent analysis windows.)

The presence of noise will affect both BW and L. It is clear that L will decrease, because the peak local minima approach the peak maximum in terms of magnitude. In a simple approximation, we can [...]
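The two-level decision tree of the Classification Strategy section can be sketched as follows. The numeric defaults are the infinite-SNRp Hanning values from Table 1. The comparison directions are assumptions made for this example: a peak is treated as a side lobe when its FCD exceeds the threshold, and as a sinusoid when its NBD lies below the right NBD margin and its NDD lies between the two NDD margins. The function and parameter names are hypothetical.

```python
def classify_peak(fcd, nbd, ndd, fcd_threshold,
                  nbd_right=0.13, ndd_left=0.13, ndd_right=0.16):
    """Classify one spectral peak as 'sidelobe', 'sinusoid', or 'noise'.

    Assumed comparison directions (not confirmed by the excerpt):
    level 1: FCD above fcd_threshold means side lobe;
    level 2: NBD below nbd_right and NDD inside (ndd_left, ndd_right) means sinusoid.
    """
    # Level 1: side-lobe / non-side-lobe decision on the frequency coherence descriptor.
    if fcd > fcd_threshold:
        return "sidelobe"
    # Level 2: sine / noise decision on the bandwidth and duration descriptors.
    if nbd < nbd_right and ndd_left < ndd < ndd_right:
        return "sinusoid"
    return "noise"


# FCD threshold N / M with N = 4,096 bins and M = 1,760 samples (40 msec at 44,000 samples / sec).
threshold = 4096 / 1760
label = classify_peak(fcd=1.0, nbd=0.05, ndd=0.14, fcd_threshold=threshold)
```

Adapting the classifier to a user-selected minimum SNRp then amounts to passing different values for `nbd_right`, `ndd_left`, and `ndd_right` while leaving `fcd_threshold` fixed, which mirrors the statement that only those three margins need to be adapted.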
NDD Threshold
[...] narrow-band Gaussian noise. Owing to the small bandwidth of the signal peak, the effective noise bandwidth is rather small. For each SNRp, there exist two noise signal patterns, $r_{d\min}$ and $r_{d\max}$, that will maximally increase and decrease, respectively, the $\mathrm{NDD}_{\max}$ and $\mathrm{NDD}_{\min}$ values. We use a simple signal model consisting of an amplitude-modulated carrier as basis for our noise model. The noise model is band-limited (reflecting the bandwidth of the spectral peak) but not necessarily time-limited. Owing to the small bandwidth, the noise pattern may extend out of the signal window. Because we do not want to take into account the length of the DFT for the simple model here, we limit the noise signal to the time segment of the analysis window.

To reduce $\mathrm{NDD}_{\min}$, $r_{d\min}$ should narrow the width of the central maximum of $x_{d\min}$. To achieve this, $r_{d\min}$ must be in phase with $x_{d\min}$ around the window's center and in counter-phase otherwise. Because a strong amplitude at the window boundaries would always enlarge the NDD, we additionally assume that the noise pattern $r_{d\min}$ has the analysis window applied.

On the contrary, $r_{d\max}$ must be in counter-phase with $x_{d\max}$ around the window's center and in phase close to the window edges. The resulting waveform would have the energy more uniformly distributed along the analysis window and thus larger $\mathrm{NDD}_{\max}$. Here, $r_{d\max}$ must not be tapered in order to contribute significantly to the energy concentration in $x_{d\max}$ around the window edges.

According to this discussion, we used the following model for the narrow-band Gaussian noise patterns:

$$r_{d\min}(n) = A\cos(2\pi F_0 n)\left[1 + m_{\min}\cos(4\pi n / M)\right]w(n)$$
$$r_{d\max}(n) = A\cos(2\pi F_0 n)\left[1 - m_{\max}\cos(2\pi n / M)\right] \qquad (14)$$

The noise patterns are therefore sine-modulated waveforms. The modulation frequencies are different because the bandwidth of the peaks related to $\mathrm{NDD}_{\min}$ and $\mathrm{NDD}_{\max}$ are different. They have been selected such that they obey a simple relation to the window size. Note that the exact frequency values are not critical for the model and that the frequencies do not depend on the SNRp.

The modulation indices $m_{\min}$ and $m_{\max}$ must be greater than unity to ensure the phase change of $\pi$ in the crossover between contiguous modulation lobes. Both amplitude A and modulation indices are a function of the SNRp. A determines the total energy of each pattern, and $m_{\min}$ and $m_{\max}$ control the distribution of that energy along the analysis window. The amplitude is simply a scaling factor that ensures that most of the spectral energy of the noise patterns lies SNRp dB under the main lobe of the worst-case signal. The values for the modulation indices are more difficult to estimate, as they change in a non-linear fashion with the SNRp. To obtain a mathematical model, we used the signal in Equation 13 and a wide range of SNRp settings and have experimentally determined the maximum and minimum NDD as well as the values for $m_{\min}$ and $m_{\max}$ that would best match the experimental data. Finally, we derived a second-order polynomial representation of the modulation indices by means of adapting a second-order polynomial to the set of modulation indices. For various types of analysis windows, the resulting functions are given by

$$m_{\min} = \sum_{i=0}^{2} a_i\,\mathrm{SNR}_p^{\,i}, \qquad m_{\max} = \sum_{i=0}^{2} b_i\,\mathrm{SNR}_p^{\,i}$$

with the corresponding coefficients given in Table 3. For the Hanning window, the envelopes of the corresponding noise patterns for SNRp = 10 dB are shown in Figure 3, and the envelopes of the resulting waveforms after the superposition are shown in Figure 4.

Table 3. Coefficient Values for Modeling the $m_{\min}$ and $m_{\max}$ Dependency on SNRp

Window     a_0 (r_dmin)  a_1 (r_dmin)  a_2 (r_dmin)  b_0 (r_dmax)  b_1 (r_dmax)  b_2 (r_dmax)
Hanning    0.0174        0.5770        10.6280       0.0006        0.1211        0.8279
Blackman   0.0081        0.3903        9.0630        0.0022        0.1472        0.7083
Hamming    0.0003        0.2816        2.4716        0.0037        0.1615        0.7230

Note that the energy distributions of the signal [...]
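The noise patterns of Equation 14 and the polynomial model for the modulation indices can be sketched as below. Several points are assumptions of this example rather than statements from the article: the time index is centered on the window so that n = 0 is the window center, w(n) is a Hanning window, the second pattern uses a minus sign in front of its modulation term (which yields the counter-phase behavior at the window center that the text requires), and `coeffs[i]` multiplies SNRp to the power i. The coefficient values used in the usage line are made up; the Table 3 values are deliberately not used because their signs and ordering cannot be verified from this excerpt.

```python
import math


def hanning(n, M):
    """Hanning window centered at n = 0 and reaching zero at n = +/- M / 2."""
    return 0.5 + 0.5 * math.cos(2.0 * math.pi * n / M)


def envelope_min(n, M, m_min):
    """Windowed envelope of r_dmin: positive at the center, negative near n = +/- M / 4."""
    return (1.0 + m_min * math.cos(4.0 * math.pi * n / M)) * hanning(n, M)


def envelope_max(n, M, m_max):
    """Untapered envelope of r_dmax: negative at the center, positive at the edges."""
    return 1.0 - m_max * math.cos(2.0 * math.pi * n / M)


def noise_patterns(M, F0, A, m_min, m_max):
    """Return (r_dmin, r_dmax) as lists of M samples; F0 is in cycles per sample."""
    r_min, r_max = [], []
    for i in range(M):
        n = i - M / 2.0  # assumed centering of the time axis on the window
        carrier = A * math.cos(2.0 * math.pi * F0 * n)
        r_min.append(carrier * envelope_min(n, M, m_min))
        r_max.append(carrier * envelope_max(n, M, m_max))
    return r_min, r_max


def modulation_index(coeffs, snrp_db):
    """Evaluate sum_i coeffs[i] * snrp_db**i (assumed coefficient ordering)."""
    return sum(c * snrp_db ** i for i, c in enumerate(coeffs))


# Made-up coefficients, for illustration only.
m_example = modulation_index([1.0, 2.0, 3.0], 2.0)
r_min, r_max = noise_patterns(M=1760, F0=0.02, A=0.1, m_min=2.0, m_max=2.0)
```

With a modulation index above unity, each envelope changes sign between neighboring modulation lobes, which corresponds to the phase change of pi mentioned in the text.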
Figure 5. Approximation errors calculated as a difference between the measured and modeled values for each SNRp using various analysis windows. Values in the legend correspond to infinite SNRp.
Acknowledgments
The first author would like to gratefully acknowledge the financial support of the Universidad Pública de Navarra, Spain.
References