Professional Documents
Culture Documents
DAFX-1
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
also assume that the fundamental is considerably stronger than the (13)
partials, and 1 reduces to • x̂k|k−1 , x̂k|k , x̂k|k+1 are the a priori, current and a posteri-
ori state vector estimates respectively.
yk = a1 cos (ω1 kTs + φ1 ) + vk (2)
• P̂k|k−1 , P̂k|k , P̂k|k+1 are the a priori, current and a poste-
where a1 , w1 and φ1 are the fundamental amplitude, frequency riori error covariance matrices respectively.
and phase respectively and Ts is the sampling interval. The state • Kk is the Kalman gain that acts as a weighting factor be-
vector is constructed as tween the observation yk and a priori prediction x̂k|k−1 in
determining the current estimate.
α
2
xk = uk (3) • σw is modeled as the process noise variance and I is an
u∗k identity matrix of dimensions 3 × 3.
• Initial state vector and error covariance matrix are denoted
where as x̂1|0 and P̂1|0 respectively
α = exp(jω1 Ts ) The fundamental frequency, amplitude and phase estimates at
uk = a1 exp(jω1 kTs + jφ1 ) (4) instant k are given as
u∗k = a1 exp(−jω1 kTs − jφ1 ) ln (x̂k|k (1))
f1,k =
This particular selection of state vector ensures that we can track 2πjTs
q
all three parameters that defines the fundamental – frequency, am- a1,k = x̂k|k (2) × x̂k|k (3) (14)
plitude and phase. The relative advantage of choosing this com-
plex state vector has been described in [13]. The state vector 1 x̂k|k (2)
φ1,k = ln
estimate update rule xk+1 relates to xk as 2j x̂k|k (3)x̂k|k (1)2k
where H is the observation matrix. We can see that 3.1. Detection of Silent Zones
a1
Hxk = [exp (jω1 kTs + jφ1 ) + exp (−jω1 kTs − jφ1 )] It is important to keep track of note-off events (unpitched mo-
2
ments) in the signal because the estimated frequency for such silent
= a1 cos (ω1 kTs + φ1 ) regions should be zero. Moreover, whenever there is a transition
(7) from note-off to note-on, the Kalman filter error covariance matrix
needs to be reset. This is because the filter quickly converges to
2.1. Kalman Filter equations a steady state value, and as a result the Kalman gain Kk and er-
ror covariance matrix P̂k|k settle to very low values. If there is a
The recursive Kalman filter equations aim to minimize the trace
sudden change in frequency of the signal (which happens at note
of the error covariance matrix. The EKF equations are given as
onset), the filter will not be able to track it unless the covariance
follows
matrix is reset.
Kk = P̂k|k−1 H ∗T [H P̂k|k−1 H ∗T + 1]−1 (8) To keep track of silent regions, we divide the signal into frames
of 20 ms. One way to determine if a frame is silent or not is to cal-
x̂k|k = x̂k|k−1 + Kk (yk − H x̂k|k−1 ) (9) culate its energy. The energy of the ith frame, Ei is given as the
DAFX-2
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
1
following properties
0.9
φyy (n) = σv2 δ(n)
0.8
0.7
Φyy (e ) = σv2 ∀ ω ∈ [−π, π]
jω
p (19)
K
σv2K
Spectral flatness
0.6
spf = 1 =1
0.5 K
(Kσv2 )
0.4
0.3
Detected peak
Magnitude spectrum
0.2
0.1
0
0 50 100 150
Frame number
0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000
Phase spectrum
sum of the square of all the signal samples in that frame. If the
energy is below -60dB, then the frame is classified to be silent. If
the frame has N samples, then
0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000
N
X −1 Frequency in Hz
Ei = 20 log10 y(n)2 . (15)
n=0 Figure 2: Frequency spectrum used for calculating initial esti-
mates of the state vector.
However, for noisy input signals, the energy in silent frames is
significant. To find silent frames in noisy signals, we make use of
the fact that noise has a fairly flat power spectrum. Therefore, for 3.2. Calculating Initial State and Resetting the Error Covari-
each frame, we calculate the spectral flatness [16]. If spectral flat- ance Matrix
ness ≈ 1, then the frame is classified to be silent. Figure 1 shows
It has already been established that resetting the error covariance
spectral flatness v/s frame number for an audio file containing a
matrix is necessary whenever there is a change in the frequency
single note preceded and followed by some silence.
content of the signal (i.e., whenever a new note is played). Along
The power spectral density (PSD) of the observed signal, Φyy , with resetting the covariance matrix, we need to recalculate our
is given as initial estimates for the state vector. This is because the rate of
∞
X convergence of the Kalman filter depends on the accuracy of its
Φyy (ejω ) = φyy (n)e−jωn (16) initial estimates. Basically, we need to calculate x̂1|0 and P̂1|0
n=−∞
whenever there is a note onset, i.e., whenever there is a transition
where φyy (n) is the autocorrelation function of the input signal y, from silent frame to non-silent frame.
given as Depending on how strong the attack of the instrument is, we
∞ may need to skip a few audio frames until we reach steady state
to accurately estimate initial states. This is because the transient is
X
φyy (n) = y(m)y(n + m). (17)
m=−∞
noisy which makes frequency estimates go haywire, and we must
wait for the signal to settle. The number of frames to skip after
The power spectrum is the DTFT of the autocorrelation function detecting a transition from silent to non-silent frame can be a user-
and one way of estimating it is Welch’s method [17] which makes defined parameter.
use of the periodogram. The power spectral density can be calcu- To calculate the initial estimates of the state vector x̂1|0 we
lated in Matlab with the function pwelch. The spectral flatness take an FFT of the first non-silent frame following a silent frame,
is defined as the ratio of the geometric mean to the arithmetic mean after multiplying it with a Blackman window and zero-padding it
of the PSD. by a factor of 4. Zero padding increases the FFT size which in-
qQ
K K−1 jωk ) creases the sampling density along the frequency axis. We calcu-
k=0 Φ̂yy (e
spf = 1 PK−1 (18) late the magnitude and frequency of the peaks in the magnitude
jωk )
K k=0 Φ̂yy (e spectrum, and take f1,0 as the minimum of the frequencies corre-
sponding to the largest peaks. a1,0 is the magnitude corresponding
where Φ̂yy (ejωk ) is the estimated power spectrum for K frequency to f1,0 normalized by the mean of the window and number of points
bins covering the range [−π, π]. in the FFT. φ1,0 is the corresponding phase. Next, we perform
For white noise v ∼ N(0, σv2 ) corrupting the measurement sig- parabolic interpolation on f1,0 , a1,0 , φ1,0 to get more accurate es-
nal, y, the silent frames of y will have pure white noise with the timates. A typical plot of the spectrum used to calculate initial
DAFX-3
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
350
ground truth
300 ekf estimate
Estimated frequency in Hz yin estimate
250
200
150
100
50
0
0 2 4 6 8 10 12 14 16 18 20
Time in seconds
Figure 3: Plot of a) Ground Truth (blue) b) ECKF pitch detector (red) c) YIN estimator (green)
× 10 4
0.3 2.5
0.25
2
1.5
0.15
0.1
0.5
0.05
0 0
0 5 10 15 20 0 5 10 15 20
Time Time
Figure 4: ECKF: Estimated amplitude and unwrapped phase for same input
estimates is given in Figure 2. The initial states are as follows probably caused by a change in the input that the ECKF needs to
2
track. In that case, σw increases and there’s a term added to the
x̂1|0 = E[x1 ] a posteriori error covariance matrix P̂k|k+1 . This increase in the
e(2πjf1,0 Ts )
error covariance matrix causes the Kalman gain Kk to increase in
= a1,0 e(2πjf1,0 Ts +jφ1,0 ) the next iteration according to equation 8. As a result, the next
(20) state estimate x̂k|k+1 depends more on the input, and less on the
a1,0 e(−2πjf1,0 Ts −jφ1,0 )
current predicted state x̂k|k . Thus, the innovation reduces in the
P̂1|0 = E[(x1 − x̂1|0 )(x1 − x̂1|0 )∗T ] next iteration, and so does σw 2
. In this way, the process noise acts
= 0(3,3) as an error correction term that is adaptive to the variance in input.
3.3. Adaptive Process Noise The ECKF pitch detector was tested on cello notes downloaded
from the MIS database, which contains ground truth labels in the
We only reset the covariance matrix when there is a transition from
form of annotated note names, and on cello and double bass notes
silent to non-silent frame. However, new notes may be played
played with various ornaments that we recorded ourselves.2 The
without any rest in between, and dynamics such as vibrato also
data was summed with white noise normally distributed with 0
need to be captured. To track these changes, an additional term is
2 mean and 0.01 variance, to test the performance of our algorithm
added to equation 12. σw is modeled as the process noise variance
with noisy input.
given as
2 Figure 3 shows the output when the notes A3-Bb3-E3-G3 were
log10 (σw ) = −c + |yk − H x̂k|k | (21)
played on the cello one after another with pauses in between. The
where c ∈ Z+ is a constant. 1 The term yk −H x̂k|k is known as the pitch detected by the ECKF is compared with the ground truth
innovation. It gives the error between our predicted value and the and YIN estimator output. Figure 4 shows the corresponding esti-
actual data. Whenever the innovation is high, there is a significant mated amplitude and phase plots for the same input given by the
discrepancy between the predicted output and the input, which is ECKF estimator. The mean and standard deviation of absolute er-
1 for this paper, c lies in the range 7–9 but its value can be tuned accord- 2 The Matlab implementation can be cloned from https://
ing to the input github.com/orchidas/Pitch-Tracking
DAFX-4
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
EKF estimate
with the ECKF. Unlike the YIN estimator which yields a single YIN estimate
200
pitch estimate for an entire block of data, the ECKF yields a unique
pitch value for every sample of data, and it fluctuates about a mean
Estimated f0 in Hz
150
tually insignificant, hence we neglect it. It would be interesting to 0 0.5 1 1.5 2 2.5
Time in seconds
3 3.5 4
140
Estimated f0 in Hz
Table 1: Error statistics for Figure 3 120
100
80
yin 60
205
ekf
40
204 20
0
Estimated f0 (Hz)
(b) D3 vibrato
202
201 250
EKF estimate
200 YIN estimate
200
Estimated f0 in Hz
199
150
17.565 17.57 17.575 17.58 17.585 17.59 17.595 17.6
Time in seconds
100
DAFX-5
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
detected. However, we leave this problem open for future work. 6. REFERENCES
200
ception,” The Journal of the Acoustical Society of America,
vol. 23, no. 1, pp. 147–147, 1951.
150 [3] Alain De Cheveigné and Hideki Kawahara, “Yin, a funda-
mental frequency estimator for speech and music,” The Jour-
100 nal of the Acoustical Society of America, vol. 111, no. 4, pp.
1917–1930, 2002.
50
[4] A. Michael Noll, “Cepstrum pitch detection,” The Journal
0
of the Acoustical Society of America, vol. 41, pp. 293–309,
0 0.5 1 1.5 2 2.5 3 3.5
1967.
Time in seconds
[5] A. Michael Noll, “Pitch determination of human speech by
Figure 7: ECKF is slow in tracking fast note changes. Notes the harmonic product spectrum, the harmonic sum spectrum,
played are descending fifths from the A string on the double bass. and a maximum likelihood estimate,” in Proceedings of the
symposium on computer processing communications, 1969,
vol. 779.
[6] James A Moorer, “On the transcription of musical sound by
computer,” Computer Music Journal, pp. 32–38, 1977.
5. CONCLUSION [7] J Wise, J Caprio, and T Parks, “Maximum likelihood pitch
estimation,” IEEE Transactions on Acoustics, Speech, and
In this paper, we have proposed a novel real-time pitch detector Signal Processing, vol. 24, no. 5, pp. 418–423, 1976.
based on the Extended Complex Kalman Filter (ECKF). Several [8] Etienne Barnard, Ronald A Cole, Mathew P Vea, and
adjustments have been made for optimum tracking. An algorithm Fileno A Alleva, “Pitch detection with a neural-net classi-
based on spectral flatness has been proposed to detect silence in fier,” IEEE Transactions on Signal Processing, vol. 39, no.
incoming noisy audio signal. The importance of accurate initial 2, pp. 298–307, 1991.
state estimates and resetting the error covariance matrix has been
explained. A correction factor, σw2
, has been included to track fast, [9] F Bach and M Jordan, “Discriminative training of hidden
small changes in input signal. markov models for multiple pitch tracking,” in Proceedings
of International Conference on Acoustics, Speech and Signal
After all these changes have been incorporated into the ECKF
Processing. IEEE, 2005, vol. 5.
pitch detector, the results match those of robust and successful
pitch detectors like the YIN estimator. Perhaps its greatest ad- [10] Patricio De La Cuadra, Aaron S Master, and Craig Sapp, “Ef-
vantage is the fact that it estimates pitch on a sample-by-sample ficient pitch detection techniques for interactive music.,” in
basis, against most methods which are block-based. Moreover, its ICMC, 2001.
performance is robust to the presence of noise in the measurement [11] Rudolph E. Kalman et al., “A new approach to linear filtering
signal. The ECKF has been observed to be well suited for track- and prediction problems,” Journal of Basic Engineering, vol.
ing fine changes in playing dynamics and ornamentation, which 82, no. 1, pp. 35–45, 1960.
makes it an excellent candidate for real-time transcription of solo [12] Adly A Girgis, W Bin Chang, and Elham B Makram, “A
instrument music. Additionally, the ECKF also yields the ampli- digital recursive measurement scheme for online tracking of
tude envelope and phase of the signal, along with its fundamental power system harmonics,” IEEE transactions on Power De-
frequency. livery, vol. 6, no. 3, pp. 1153–1160, 1991.
[13] Pradipta Kishore Dash, G Panda, AK Pradhan, Aurobinda
Routray, and B Duttagupta, “An extended complex kalman
5.1. Future Work filter for frequency measurement of distorted signals,” in
Power Engineering Society Winter Meeting, 2000. IEEE.
We hope this work will encourage new uses of the Kalman filter in IEEE, 2000, vol. 3, pp. 1569–1574.
audio and music. In recent years, the Kalman filter has been used
[14] Xavier Serra and Julius Smith, “Spectral modeling synthesis:
in music for online beat tracking [18] and partial tracking [19, 20].
A sound analysis/synthesis system based on a deterministic
It has also been used for frequency tracking in speech [21]. Since
plus stochastic decomposition,” Computer Music Journal,
the Kalman filter is such a powerful tool that can work for any
vol. 14, no. 4, pp. 12–24, 1990.
valid model with the right state-space equations, we believe it can
have many more potential real-time applications in music. Some [15] Gabriel A Terejanu, “Extended kalman filter tutorial,” Tech.
of the other topics we wish to explore include partial tracking and Rep., University of Buffalo, 2008.
real-time onset detection with the Kalman filter. A real-time pitch [16] A Gray and J Markel, “A spectral-flatness measure for study-
detector along with an onset detector lays the groundwork for real- ing the autocorrelation method of linear prediction of speech
time transcription, which remains an exciting and advanced prob- analysis,” IEEE Transactions on Acoustics, Speech, and Sig-
lem. nal Processing, vol. 22, no. 3, pp. 207–217, 1974.
DAFX-6
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
[17] Peter Welch, “The use of fast fourier transform for the esti-
mation of power spectra: a method based on time averaging
over short, modified periodograms,” IEEE Transactions on
Audio and Electroacoustics, vol. 15, no. 2, pp. 70–73, 1967.
[18] Yu Shiu, Namgook Cho, Pei-Chen Chang, and C-C Jay Kuo,
“Robust on-line beat tracking with kalman filtering and prob-
abilistic data association (kf-pda),” IEEE transactions on
consumer electronics, vol. 54, no. 3, 2008.
[19] Andrew Sterian and Gregory H Wakefield, “Model-based
approach to partial tracking for musical transcription,” in
SPIE’s International Symposium on Optical Science, Engi-
neering, and Instrumentation. International Society for Op-
tics and Photonics, 1998, pp. 171–182.
[20] Hamid Satar-Boroujeni and Bahram Shafai, “Peak extraction
and partial tracking of music signals using kalman filtering.,”
in ICMC, 2005.
[21] Özgül Salor, Mübeccel Demirekler, and Umut Orguner,
“Kalman filter approach for pitch determination of speech
signals,” SPECOM, 2006.
DAFX-7