Sinusoidal models are often used for the representation, analysis, or transformation of music or speech signals (Quatieri and McAulay 1986; Amatriain et al. 2002). An important step that is necessary for obtaining the sinusoidal model lies in estimating the amplitudes, frequencies, and phases of the sinusoids from the peaks of the Discrete Fourier Transform (DFT). The estimation is rather simple provided the signal is stationary. A standard method for this estimation is the quadratically interpolated Fast Fourier Transform (QIFFT) estimator (Abe and Smith 2005). The QIFFT estimator uses the bin at the maximum of each spectral peak together with its two neighbors to establish a second-order polynomial model of the log amplitude and unwrapped phase of the peak. The amplitude and frequency estimates of the sinusoid that is related to the spectral peak are then derived from the height and frequency position of the maximum of the polynomial. The evaluation of the phase polynomial at the frequency position provides the estimate of the phase of the sinusoid.

For non-stationary sinusoids, the parameter estimation becomes more difficult, because the QIFFT algorithm is severely biased whenever the frequency is not constant. The term bias refers to the systematic estimation error, that is, the error of the estimator that exists even if no measurement noise is present. For the partials in natural vibrato signals, the estimation bias of the QIFFT estimator accounts for a significant amount of residual energy (i.e., the energy remaining after subtracting the sinusoidal model from the original signal). This is the major reason for the perceived voiced energy in the residual of vibrato signals.

A number of algorithms with low estimation bias for non-stationary sinusoids have been proposed. Algorithms that try to implement a maximum likelihood estimate (MLE) generally assume that the amplitude of the sinusoids is constant. As an example, we refer to an algorithm that is based on signal demodulation employing an initial search over a grid of frequencies and frequency slopes and a final fine-tuning of the parameters using an iterative maximization of the amplitude of the demodulated signal (Abatzoglou 1986). Similar to multi-component signals with stationary sinusoids, the MLE of sinusoidal parameters for multi-component signals with frequency-modulated (FM) sinusoids is rather costly, because a highly nonlinear and high-dimensional cost function must be maximized (Saha and Kay 2002). Owing to the computational savings and despite the fact that windowing reduces the estimator efficiency (Offelli and Petri 1992), the windowing technique is generally preferred if the signal contains more than a single sinusoid.

Most of the algorithms that employ analysis windows for the parameter analysis of amplitude-modulated (AM) and/or FM sinusoids rely on the fact that the analysis window is approximately Gaussian, such that a mathematical investigation becomes tractable. Marques and Almeida (1986) developed this approach for sinusoids with linear FM and constant amplitude, and Peeters and Rodet (1999) extended it to sinusoids with linear FM and AM. Abe and Smith (2005) presented a version for sinusoids with linear FM and exponential AM. The method presented in Abe and Smith (2005) is special in that it tries to extend its range to other analysis windows by means of a set of linear bias-correction functions. The resulting estimator is computationally efficient and achieves small bias for standard windows as long as the zero-padding factor is sufficiently large (i.e., greater than three) and the modulation rates are relatively small.

In this article, we present a bias-correction scheme for sinusoidal parameter estimation of sinusoids with linear AM/FM modulation. As a first

Computer Music Journal, 32:2, pp. 68–79, Summer 2008
© 2008 Massachusetts Institute of Technology.
Röbel 69
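As an illustration, the QIFFT procedure described above (three bins around the peak maximum, a parabola through the log amplitudes, and a second parabola through the unwrapped phases) might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the article: the function names `dft_bin` and `qifft_peak`, the Hann window definition, the test parameters, and the convention that frequency is measured in cycles per sample are all chosen here for illustration. The window is normalized to sum to 1 and the DFT origin is placed at the window center, so that the peak height and phase of a stationary sinusoid directly give its amplitude and phase.

```python
import cmath
import math


def dft_bin(x, k, n_fft):
    """DFT of x at bin k, with the time origin at the window centre."""
    centre = (len(x) - 1) / 2.0
    return sum(v * cmath.exp(-2j * math.pi * k * (n - centre) / n_fft)
               for n, v in enumerate(x))


def qifft_peak(x, n_fft):
    """Return (amplitude, frequency in cycles/sample, phase) of the largest peak."""
    spec = [dft_bin(x, k, n_fft) for k in range(n_fft // 2)]
    logmag = [math.log(abs(s) + 1e-300) for s in spec]
    # Largest interior bin, so that both neighbours exist.
    k0 = max(range(1, len(spec) - 1), key=lambda k: logmag[k])
    lm, l0, lp = logmag[k0 - 1], logmag[k0], logmag[k0 + 1]
    # Vertex of the parabola through the three log-amplitude samples.
    delta = 0.5 * (lm - lp) / (lm - 2.0 * l0 + lp)
    log_amp = l0 - 0.25 * (lm - lp) * delta
    # Parabola through the locally unwrapped phases, evaluated at the vertex.
    p0 = cmath.phase(spec[k0])
    pm = p0 + cmath.phase(spec[k0 - 1] / spec[k0])
    pp = p0 + cmath.phase(spec[k0 + 1] / spec[k0])
    phase = p0 + 0.5 * (pp - pm) * delta + 0.5 * (pp - 2.0 * p0 + pm) * delta ** 2
    return math.exp(log_amp), (k0 + delta) / n_fft, phase


# Hypothetical test case: one stationary complex sinusoid under a Hann window.
M, N = 101, 1024
centre = (M - 1) / 2.0
hann = [0.5 - 0.5 * math.cos(2 * math.pi * (n + 1) / (M + 1)) for n in range(M)]
total = sum(hann)
window = [v / total for v in hann]          # normalized: sums to 1
A_TRUE, F_TRUE, PHI_TRUE = 0.8, 0.2137, 0.5
x = [A_TRUE * window[n]
     * cmath.exp(1j * (PHI_TRUE + 2 * math.pi * F_TRUE * (n - centre)))
     for n in range(M)]

amp, freq, phase = qifft_peak(x, N)
print(amp, freq, phase)
```

For this stationary input the three estimates land very close to the true values; the residual error comes from the fact that the Hann main lobe is only approximately parabolic in log amplitude.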
amplitudes of the spectral bins do not follow a second-order polynomial. Accordingly, the interpolation is already systematically incorrect for stationary sinusoids, and therefore we will not discuss this source of bias here. Nevertheless, as will become clear later, it is important to reduce this type of bias as much as possible. This can be achieved by

amplitude slope $a$. Then, we investigate the properties of the spectra of the individual parts and use the linearity of the Fourier transform to draw conclusions for the complete spectrum. We first write the DFT of the signal in Equation 1 using a normalized analysis window $W(n)$ with $\sum_n W(n) = 1$:
wrapped phase spectra, however, are no longer piecewise constant. Both phase spectra have an additional even phase function superimposed. The phase offset of $S_c(\omega)$ does not vanish at the origin, and as a consequence, the phase is already biased for $a = 0$. For $a \neq 0$, the even-symmetric phase offset that is applied to $S_l(\omega)$ will destroy the even symmetry of the magnitude of $S(\omega)$ such that the peak maximum moves away from the origin, and therefore the amplitude and frequency estimates of the QIFFT estimator are no longer correct. Accordingly, the QIFFT estimator suffers from additional bias, quite similar to what has been shown for the Gaussian window in Peeters and Rodet (1999).

Reducing the Bias

In the previous section, we saw that the source of the bias of the QIFFT estimator is the frequency slope of the sinusoid. A conceptually simple approach to estimate the parameters (amplitude $A$, frequency, and phase) of a sinusoid related to a spectral peak requires two steps: first estimate the frequency slope, then demodulate the sinusoid and use the QIFFT estimator to find the sinusoidal parameters.

Note that this approach is in principle equivalent to the MLE for constant-amplitude linear FM signals described in Abatzoglou (1986). Because the demodulation technique is used for the frequency-slope estimation, we first discuss the frequency-domain demodulation algorithm. In the subsequent section, the frequency-slope estimation is described.

demodulator signal with the input signal in Equation 1 will remove the frequency slope and keep all other parameters unchanged such that the QIFFT algorithm can be applied without additional bias. However, because other sinusoids may be present in the signal, we cannot apply time-domain demodulation directly.

The demodulation algorithm that uses only the observed part of the spectral peak to approximately demodulate the sinusoidal component is described here in the frequency domain. Assume $S(k)$ is the $N$-point DFT of the sinusoid to be analyzed and $Y(k)$ is the DFT of the demodulator signal. All DFT spectra are calculated such that the origin of the DFT basis functions is in the center of the analysis window. The signal analysis window is $w_s(n)$, and the demodulator signal is windowed using $w_y(n)$. To obtain the demodulated sinusoid spectrum $X(k)$, we would need to compute the circular convolution

$$X(k) = C \, \frac{S(k) \circledast Y(k)}{N} \qquad (5)$$

where $\circledast$ denotes circular convolution and $C$ is a normalization factor taking into account windowing effects. As a result of this operation, we obtain the spectrum of the product of the demodulator and sinusoidal component windowed by the product window $w_y(n) w_s(n)$. Therefore, proper normalization would be achieved by means of setting $C = 1 / \sum_n w_y(n) w_s(n)$.

Because only part of the sinusoid spectrum is available, the normalization factor should be adapted. Assume the peak under investigation is denoted by $P(k)$. $P(k)$ is part of the spectrum $S(k)$, and it covers $B$ bins. To be able to take into account the impact of the missing part of the spectrum, we create a spectral model of the observed sinusoid assuming the initial slope estimate $D$ is correct:

$$P_m(k) = \sum_n w_s(n) \, e^{i \pi D n^2} \, e^{-\frac{2 \pi i}{N} k n} \qquad (6)$$

We then select a subset $P_m(k)$ of $B$ bins around the center frequency $k = 0$. (Note that in the case that $B$ is even, the resulting model is not symmetric.) The required normalization factor can now be approximately estimated as

$$C' = \frac{1}{\max_k \left( \left| P_m(k) \circledast Y(k) \right| \right)} \qquad (7)$$

Accordingly, if we replace $S(k)$ in Equation 5 by $P(k)$, we should demodulate using the corrected normalization factor $C'$.

The correction factor will be more precise (i.e., lower bias) for demodulator windows that concentrate more energy in the $B$-bin-wide band around frequency 0 of the spectrum. This calls for higher-order windows with low sidelobes. The demodulator window, however, will be applied to the signal such that, according to Offelli and Petri (1992), the noise sensitivity of the analysis is increased. This calls for low-order windows with larger sidelobes. Accordingly, the demodulator window allows a trade-off between noise sensitivity and bias. The experimental investigation suggests that the use of the Hanning window as demodulator window $w_y$ is a favorable choice for all analysis windows $w_s$.

The compensation of the normalization factor assumes that the amplitude slope $a = 0$ and that the peak model is cut symmetrically with respect to the peak center. To achieve a good match between the normalization factor and the missing part of the spectrum of the sinusoidal component that creates the peak $P(k)$, the peak that is extracted from the spectrum should be as close as possible to the peak model that is used to derive the compensation factor. A number of strategies to extract the observed peak from the spectrum have been compared. Experimentally, we found that cutting the peak such that its left and right magnitudes have approximately the same value creates the smallest bias. Besides the fact that this method achieves perfect compensation for $a = 0$, there is a second advantage of this method that is related to the impact of the background noise. Assuming the background noise energy is locally constant and understanding the maximum border amplitude of the peak as a very rough indicator of the background noise level, we can conclude that cutting the peak at its maximum border level could be beneficial, because it avoids the parts of the signal where the background noise is dominant.

A final point to note here is that, for parameter estimation from demodulated peaks with the QIFFT estimator, it is essential to use the bias-correction functions proposed in Abe and Smith (2004) with correction factors adapted to the effective window $w_y(n) w_s(n)$.

Our experimental investigation shows that the spectra of the demodulation kernels $Y(k)$ and the related observed peak models $P_m(k)$ can be pre-calculated for a fixed grid of frequency slopes and then linearly interpolated to obtain an approximate spectral peak for any given slope. If the length of the analysis windows is $M$, a frequency slope grid with step size $0.025/M^2$ is sufficient to produce estimates that are nearly indistinguishable from the results produced with the non-interpolated kernels. To use the complete information that is available in the observed peak, we use deconvolution kernels of length $2B + 1$ centered around the maximum of the deconvolution spectrum.

The deconvolution can be implemented in the frequency domain as described or in the time domain. Time-domain implementation is probably more efficient if at least the demodulation kernel could be directly stored in the time domain. The possibilities of time-domain interpolation of the demodulation kernels have not yet been studied; we believe, however, that time-domain interpolation would require on-the-fly generation of the complex kernels from interpolated phase functions. Owing to the linearly modulated frequencies of the demodulation kernels, this would most likely be less efficient than the frequency-domain implementation described herein.
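The frequency-domain demodulation of Equation 5 can be checked numerically on a small example. The following is a minimal sketch under stated assumptions, not the article's implementation: the chirp phase is assumed to be of the form $\phi + 2\pi f n + \pi D n^2$ (Equation 1 is not reproduced in this text), the full spectrum $S(k)$ is used rather than a truncated peak $P(k)$, the DFT origin is placed at $n = 0$ instead of the window center for brevity, and all variable names are illustrative. The sinusoid is placed exactly on a DFT bin so that the peak height can be read off without interpolation.

```python
import cmath
import math


def dft(x):
    """Plain N-point DFT (time origin at n = 0, for brevity)."""
    n_fft = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n, v in enumerate(x)) for k in range(n_fft)]


def circular_convolve(a, b):
    """Circular convolution of two equal-length spectra."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


N = 64                                    # window length equals DFT size here
c = (N - 1) / 2.0                         # window centre
hann = [0.5 - 0.5 * math.cos(2 * math.pi * (n + 1) / (N + 1)) for n in range(N)]
hann_sum = sum(hann)
w_s = [v / hann_sum for v in hann]        # analysis window, sums to 1
w_y = hann                                # demodulator window

A, phi, k_true = 0.8, 0.5, 8              # sinusoid sits exactly on bin 8
D = 4.0 / N ** 2                          # assumed "strong" frequency slope

# Windowed linear-FM sinusoid (assumed signal model) and windowed demodulator.
s = [A * w_s[n] * cmath.exp(1j * (phi + 2 * math.pi * k_true * (n - c) / N
                                  + math.pi * D * (n - c) ** 2))
     for n in range(N)]
y = [w_y[n] * cmath.exp(-1j * math.pi * D * (n - c) ** 2) for n in range(N)]

S, Y = dft(s), dft(y)
C = 1.0 / sum(a * b for a, b in zip(w_y, w_s))       # C = 1 / sum_n w_y(n) w_s(n)
X = [C * v / N for v in circular_convolve(S, Y)]     # Equation 5

biased_amp = abs(S[k_true])   # peak height without demodulation
demod_amp = abs(X[k_true])    # peak height after demodulation
print(round(biased_amp, 3), round(demod_amp, 3))
```

Without demodulation the peak height clearly underestimates the true amplitude, whereas the demodulated spectrum recovers it. The sketch does not cover the corrected factor $C'$ of Equation 7, which is needed once only $B$ bins of the peak are available.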
proposed to work on non-stationary sinusoids. Notably, we use the bias-correction algorithm proposed in Abe and Smith (2005) and the algorithm of Peeters and Rodet (1999). The results of these algorithms are denoted as AS and PR, respectively. Furthermore, we use the original version of the demodulation estimator described in Röbel (2006; denoted as DE) and the new version that includes slope enhancement and uses the Hanning window for all demodulation kernels (denoted as DS).

All experiments are performed with Gaussian and Hanning analysis windows if the algorithms support them. The window type that is used is indicated by adding the letter G for Gaussian, H for Hanning, or X for both, to the estimator abbreviation. In performance comparisons of the estimators, we will use the expression "DSX is better than ASX" to denote the fact that DSH and DSG are better than ASH and ASG, respectively. The window applied to the demodulation kernels will be equal to the analysis window for DEX and Hanning for DSX. The Gaussian analysis window is cut such that it has a length of $8\sigma$, with $\sigma$ being the standard deviation of the Gaussian. To facilitate orientation, we display the results of the QIFFT estimator as well as the Cramér-Rao bounds (CRB) for second-order polynomial phase estimation described in Ristic and Boashash (1998). Note, however, that these bounds have been derived for constant-amplitude polynomial phase signals, such that they can only be used to provide an approximate idea of the estimator efficiency.

In these experiments, we use synthetic test signals with a single sinusoid according to Equation 1 with $A = 1$, the frequency randomly sampled from a uniform distribution over the normalized frequency range [0.2, 0.3], the phase randomly chosen from a uniform distribution over $[-\pi, \pi]$, and varying slopes $a$ and $D$. The analysis window covers M = 1,001 samples in all cases. The frequency slope $D$ is selected from a uniform distribution over the interval $[-D_{max}, D_{max}]$. Similarly, the amplitude slope $a$ is sampled from a uniform distribution over the range $[-a_{max}, a_{max}]$. The slope ranges are considered realistic for real-world signals. Note that in harmonic signals, the frequency slope scales with the partial number such that for high partials, extreme slopes may arise. The implementation of the algorithm used for the experimental investigation uses linearly interpolated demodulation kernels as proposed above.

Frequency Slope Estimation

The first experiment investigates the frequency-slope estimation. Figure 1 shows the results obtained with the enhanced demodulator DS and with the AS method according to Equation 8. Because the DE and PRG estimators use exactly the same frequency-slope estimate as the AS estimator, we do not consider those estimators here. We use two different zero-padding factors (Fast Fourier Transform [FFT] sizes N = 1,024 and N = 4,096) and two different sets of modulation ranges. The strong modulation uses $D_{max} = 4/M^2$ and $a_{max} = 1/M$, and for weak modulation we select $D_{max} = 0.5/M^2$ and $a_{max} = 0.15/M$. Note that the weak modulation range approximately covers the interval for which the ASH bias correction has been derived in Abe and Smith (2005). The DSX estimator has been tested with a set of demodulation offsets $D_o \in [0.2, 0.4, 0.5, 0.6, 0.8]/M^2$.

The results demonstrate that the selection of this parameter is rather uncritical. It has a notable effect only for the DSH estimator, a very small zero-padding factor, and strong modulation. This is related to the fact that the initial frequency-slope estimate of the ASH that is the basis of the slope refinement in DSH is rather poor. If $D_o$ is smaller than the error, then the correction with the polynomial model becomes less precise. Even for the smallest offset, the DSH estimator was never worse than the ASH estimator. The smallest offset that works close to the optimum for all of the experiments was $D_o = 0.5$. Accordingly, we selected this value for the following experiments.

A number of conclusions can be drawn from the experimental results in Figure 1. First, for strong modulation the DSX methods have significantly lower bias than the ASX methods. Second, for the Hanning window, the DSH estimator compared to
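The synthetic test signals described in this section could be generated along the following lines. This is a hedged sketch only: Equation 1 is not reproduced in this text, so the signal model used here, $(A + a n) \exp(i(\phi + 2\pi \omega_0 n + \pi D n^2))$ with $n$ counted from the window center, is an assumption, as are the symmetric sampling intervals, the function names `draw_parameters` and `synthesize`, and the dictionary keys.

```python
import cmath
import math
import random

M = 1001  # analysis window length used in the experiments

# (D_max, a_max) for the two modulation ranges quoted in the text.
RANGES = {
    "strong": (4.0 / M ** 2, 1.0 / M),
    "weak": (0.5 / M ** 2, 0.15 / M),
}


def draw_parameters(modulation, rng):
    """Draw one random parameter set for the given modulation range."""
    d_max, a_max = RANGES[modulation]
    return {
        "A": 1.0,
        "omega0": rng.uniform(0.2, 0.3),          # normalized frequency
        "phi": rng.uniform(-math.pi, math.pi),    # phase at the window centre
        "D": rng.uniform(-d_max, d_max),          # frequency slope
        "a": rng.uniform(-a_max, a_max),          # amplitude slope
    }


def synthesize(p, length=M):
    """Complex linear AM/FM sinusoid under the assumed form of Equation 1."""
    half = (length - 1) // 2
    return [(p["A"] + p["a"] * n)
            * cmath.exp(1j * (p["phi"]
                              + 2 * math.pi * p["omega0"] * n
                              + math.pi * p["D"] * n * n))
            for n in range(-half, half + 1)]


rng = random.Random(0)                 # fixed seed for repeatability
params = draw_parameters("strong", rng)
signal = synthesize(params)
print(len(signal), params)
```

With the strong range, the amplitude envelope changes by at most about a factor of one half toward either end of the window, so the envelope stays positive over the whole analysis frame.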
Figure 2. Comparison of the estimation errors for the different parameter estimators using (a–c) window size M = 1,001 and FFT size N = 4,096 and (strong) linear AM/FM with $D_{max} = 4/M^2$ and $a_{max} = 1/M$; (d–f) phase-estimation errors for different modulation limits and FFT sizes. The CRB for constant-amplitude polynomial phase signals is displayed as lower limit. Algorithms using a Gaussian/Hanning window are distinguished by means of solid/dashed lines.
and amplitude are: (1) verification that the extremum of the polynomial model is a local maximum; (2) verification that the amplitude that is obtained with the optimal demodulation slope is larger than the amplitude obtained with the initial slope estimate; and (3) verification that the slope offset to reach the optimal slope is within $2D_o$. If one of these tests fails, the polynomial representation of the slope and amplitude relation is considered unreliable, and the DEX estimator is used as a fallback.

The test used to verify the validity of the linear AM/FM sinusoidal representation is based on the center of gravity of the energy (the mean time) of the signal related to the spectral peak under investigation. If the mean time is larger than the maximum mean time that can be expected for the signal model of Equation 1, then we can assume that the peak is related to a sinusoid with transient amplitude evolution (Röbel 2003). In this situation, the exponential amplitude evolution used by the ASX estimator is more appropriate than the linear AM, and therefore the ASX estimator is used. Note that the ASX and DEX estimators are sub-modules required for the DSX estimator anyway, so the fallback solutions do not require additional costs in terms of implementation or calculations.

For the last experiment, we compare the estimators by examining the energy of the residual signal of a harmonic model of a tenor singer. The signal contains strong vibrato, and therefore the bias due to the non-stationary parameters is expected to be significant. The harmonic models contain a maximum of 30 sinusoids at each time instant. We calculate the variance of the residual signal for the QIFFT, DEH, DSH, and ASH methods for a signal window of 800 samples and an FFT size of 4,096 samples. The variance of the residual signal is compared to the QIFFT estimator, and the reduction of the residual energy in different frequency bands that can be achieved with each estimator is listed in Table 1.

Table 1. Energy Reduction in the Residual Signal Obtained with Different Bias-Reduction Algorithms

Frequency Band    ASH        DEH        DSH
Full Range        4.19 dB    4.72 dB    5.04 dB
0–2 kHz           3.13 dB    3.75 dB    4.05 dB
2–4 kHz           7.32 dB    8.40 dB    9.33 dB
4–6 kHz           5.78 dB    6.90 dB    7.32 dB

The performance of the algorithms varies with the frequency band.

From Table 1, we can conclude that all bias-reduction methods achieve significant improvements of the residual energy. It is interesting to compare performance in the different frequency bands. In the low band, the improvement is 3–4 dB. The improvement is less pronounced because the FM extent is low. In the mid-band range, the FM becomes stronger, and the reduction methods achieve a residual energy reduction of 7.3–9.3 dB. For the highest band, the FM is still stronger, but the noise level is higher as well, such that the reduction of the residual energy is not as strong.

The advantage of the demodulation methods over ASH is clearly visible. The DEX estimator improves the reduction of the ASH estimator by 0.5–1.1 dB. The DSX estimator is clearly the best, with an improvement compared to the ASH estimator of 0.8–2.0 dB. The residual signals for the QIFFT and DSH estimator are shown in Figure 3; the reduction of the residual energy is clearly visible.

Conclusions

We have shown that an efficient bias-reduction strategy for the estimation of sinusoidal parameters consists of a frequency-slope estimation and demodulation prior to application of the standard QIFFT estimator. The procedure significantly reduces the bias of the standard estimator. It does not require the use of a Gaussian analysis window, and it works for a much larger range of modulation depths than a recently proposed algorithm. By investigating the reduction in the residual energy that can be obtained for a vibrato signal, we have shown that the proposed enhanced demodulation estimator effectively works in real-world situations. It has been shown that, compared to the standard QIFFT estimator, the reduction of the residual error depends on the frequency range and can be as large as 6–9 dB.
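The improvement ranges quoted in the discussion of Table 1 can be rechecked from the table entries with a few lines of arithmetic. The snippet below is only a worked check; the dictionary layout and variable names are illustrative, and the band labels assume the frequency ranges 0 to 2, 2 to 4, and 4 to 6 kHz.

```python
# Residual-energy reduction relative to QIFFT, in dB: (ASH, DEH, DSH).
table_1 = {
    "Full range": (4.19, 4.72, 5.04),
    "0-2 kHz": (3.13, 3.75, 4.05),
    "2-4 kHz": (7.32, 8.40, 9.33),
    "4-6 kHz": (5.78, 6.90, 7.32),
}

# Gain of each demodulation estimator over ASH, per band.
de_gain = {band: round(deh - ash, 2) for band, (ash, deh, dsh) in table_1.items()}
ds_gain = {band: round(dsh - ash, 2) for band, (ash, deh, dsh) in table_1.items()}

print("DEH over ASH:", min(de_gain.values()), "to", max(de_gain.values()), "dB")
print("DSH over ASH:", min(ds_gain.values()), "to", max(ds_gain.values()), "dB")
```

The per-band differences fall between roughly 0.5 and 1.1 dB for DEH and between roughly 0.8 and 2.0 dB for DSH, in agreement with the ranges stated in the text.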