ABSTRACT

This paper describes a music remixing interface, called Instrument Equalizer, that allows users to control the volume of each instrument part within existing audio recordings in real time. Although query-by-example (QBE) retrieval systems generally require a user to prepare favorite examples (songs), our interface allows a user to generate examples from existing ones by cutting or boosting some instrument/vocal parts, resulting in a variety of retrieved results. To change the volume, all instrument parts are separated from the input sound mixture using the corresponding standard MIDI file. For the separation, we used an integrated tone (timbre) model consisting of harmonic and inharmonic models that are initialized with template sounds recorded from a MIDI sound generator. The remaining but critical problem is dealing with the various performance styles and instrument bodies that are not covered by the template sounds. To solve this problem, we train probabilistic distributions of timbre features using various sounds. By adding a new constraint that maximizes the likelihood of the timbre features extracted from each tone model, we succeeded in estimating model parameters that better express actual timbre.

1 INTRODUCTION

One of the promising approaches to music information retrieval is query-by-example (QBE) retrieval [1, 2, 3, 4, 5, 6, 7], where a user receives a list of musical pieces ranked by their similarity to a musical piece (example) that the user gives as a query. Although this approach is powerful and useful, a user has to prepare or find favorite examples and sometimes finds it difficult to control or change the retrieved pieces afterwards, because the user has to find another appropriate example to get better results. For example, if a user feels that vocal or drum sounds are too strong in the retrieved pieces, the user has to find another piece that has weaker vocal or drum sounds while keeping the basic mood and timbre of the piece. It is sometimes very difficult to find such a piece within a music collection.

We therefore propose yet another way of preparing an example for QBE retrieval: a music remixing interface. The interface enables a user to boost or cut the volume of each instrument part of an existing musical piece. With this interface, a user can easily give an alternative query with a different mixing balance to obtain refined retrieval results. The issue in the above example, finding another piece with weaker vocal or drum sounds, can thus be resolved. Note that existing graphic equalizers or tone controls on the market cannot control each individual instrument part in this way: they can adjust only frequency characteristics (e.g., boost or cut of bass and treble). Although remixing of stereo audio signals has been reported previously [8], that work tackled the control of harmonic instrument sounds only. Our goal is to control all instrument sounds, including both harmonic and inharmonic ones.

This paper describes our music remixing interface, called Instrument Equalizer, with which a user can listen to and remix a musical piece in real time. It has sliders corresponding to different musical instruments and enables a user to manipulate the volume of each instrument part in polyphonic audio signals. Since this interface is independent of the succeeding QBE system, any QBE system can be used. In our current implementation, it leverages the standard MIDI file (SMF) corresponding to the audio signal of a musical piece to separate the sound sources. We can assume that it is relatively easy to obtain such SMFs from the web, etc. (especially for classical music). Of course, given an SMF, it is quite easy to control the volume of instrument parts during SMF playback, and readers might think that we could use it as a query. Its sound quality, however, is generally not good, and users would lose their motivation to use QBE retrieval. Moreover, we believe it is important to start from an existing favorite musical piece of high quality and then refine the retrieved results.

2 INSTRUMENT EQUALIZER

The Instrument Equalizer enables a user to remix existing polyphonic musical signals. A screenshot of its interface is shown in Figure 1, and the overall system is shown in Figure 2. It has two features for remixing audio mixtures, as follows:
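The remixing operation at the heart of such an interface can be sketched in a few lines. Assuming the instrument parts have already been separated into individual signals (the function and variable names below are hypothetical illustrations, not the authors' implementation), boosting or cutting a part amounts to scaling each separated track by a per-instrument gain and summing:

```python
import numpy as np

def remix(parts, gains_db):
    """Mix separated instrument tracks with per-part gains.

    parts: dict mapping part name -> 1-D sample array (all the same length).
    gains_db: dict mapping part name -> gain in decibels
              (0 dB = unchanged, positive = boost, negative = cut).
    """
    out = np.zeros(len(next(iter(parts.values()))))
    for name, signal in parts.items():
        # Convert the slider value in dB to a linear amplitude factor.
        linear_gain = 10.0 ** (gains_db.get(name, 0.0) / 20.0)
        out += linear_gain * signal
    return out

# Example: cut the vocal part by 6 dB and boost the drums by 3 dB.
parts = {"vocal": np.ones(4), "drums": np.ones(4)}
mix = remix(parts, {"vocal": -6.0, "drums": 3.0})
```

A part left out of `gains_db` passes through at 0 dB, so the neutral slider position reproduces the original mixture.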
ISMIR 2008 – Session 1c – Timbre
To evaluate the ‘effectiveness’ of this separation, we can use a cost function defined as the Kullback-Leibler (KL) divergence from X_kl^(J)(c,t,f) to J_kl(c,t,f):

\[
Q_{kl}^{(J)} = \sum_c \iint X_{kl}^{(J)}(c,t,f) \log \frac{X_{kl}^{(J)}(c,t,f)}{J_{kl}(c,t,f)}\, dt\, df.
\]

By minimizing the sum of Q_kl^(J) over (k,l) pertaining to Δ^(J)(k,l;c,t,f), we obtain the spectrogram distribution function and the model parameters (i.e., the most ‘effective’ decomposition).

By minimizing Q_kl^(J) with respect to each parameter of the integrated model, we obtain the model parameters estimated from the distributed spectrogram. This parameter estimation is equivalent to maximum likelihood estimation. The parameter update equations are described in Appendix A.

3.2 Timbre varieties within each instrument

Even within the same instrument, different instrument bodies have different timbres, although this timbral difference is smaller than the difference among different musical instruments. Moreover, in live performances, each musical note can have a slightly different timbre depending on the performance style. Instead of preparing a set of many template sounds to represent such timbre varieties within each instrument, we represent them by using a probabilistic distribution.

We use parameters of the integrated model, u_kl(m), v_kl(n), and F_kl^(I)(t,f), to represent the timbre variety of instrument k by training a diagonal Gaussian distribution with means μ_k^(u)(m), μ_k^(v)(n), μ_k^(F)(f) and variances Σ_k^(u)(m), Σ_k^(v)(n), Σ_k^(F)(f), respectively. Note that other probability distributions, such as a Dirichlet distribution, are also applicable here. The model parameters for training the prior distribution are extracted from an instrument sound database [14] (i.e., those parameters are estimated without any prior distributions).

By minimizing the cost function

\[
Q_{kl}^{(p)} = \sum_c \iint X_{kl}^{(J)}(c,t,f) \log \frac{X_{kl}^{(J)}(c,t,f)}{J_{kl}(c,t,f)}\, dt\, df
+ \frac{1}{2} \sum_m \frac{\bigl(u_{kl}(m) - \mu_k^{(u)}(m)\bigr)^2}{\Sigma_k^{(u)}(m)}
+ \frac{1}{2} \sum_n \frac{\bigl(v_{kl}(n) - \mu_k^{(v)}(n)\bigr)^2}{\Sigma_k^{(v)}(n)}
+ \frac{1}{2} \iint \frac{\bigl(F_{kl}^{(I)}(t,f) - \mu_k^{(F)}(f)\bigr)^2}{\Sigma_k^{(F)}(f)}\, dt\, df,
\]

where the last three terms are additional costs arising from the prior distribution, we obtain parameters that take the timbre varieties into account. This parameter estimation is equivalent to maximum a posteriori (MAP) estimation.

3.3 Cost function without considering timbre feature distributions

In the previous study by Itoyama et al. [9], template sounds were used instead of timbre feature distributions to evaluate the ‘goodness’ of the feature vector. The cost function Q_kl^(Y) used in [9] can be obtained from Q_kl^(p) by replacing the negative log-likelihood term with the KL divergence from Y_kl(t,f) (the power spectrogram of a template sound) to J_kl(t,f):

\[
Q_{kl}^{(Y)} = \sum_c \iint X_{kl}^{(J)}(c,t,f) \log \frac{X_{kl}^{(J)}(c,t,f)}{J_{kl}(c,t,f)}\, dt\, df
+ \sum_c \iint r_{kl}(c)\, Y_{kl}(t,f) \log \frac{r_{kl}(c)\, Y_{kl}(t,f)}{J_{kl}(c,t,f)}\, dt\, df.
\]

4 EXPERIMENTAL EVALUATION

We conducted experiments to confirm whether the performance of source separation using the prior distribution is equivalent to that using the template sounds. In the first experiment, we separated sound mixtures generated from a MIDI sound generator. In the second, the sound mixtures were created from multi-track signals [15] recorded before mixdown. In these experiments, we compared the following two conditions:

1. using the log-likelihood of the timbre feature distributions (proposed method, section 3.2),
2. using the template sounds (previous method [9], section 3.3).

4.1 Experimental conditions

We used 5 SMFs from the RWC Music Database (RWC-MDB-P-2001 No. 1, 2, 3, 8, and 10) [16]. We recorded all musical notes of these SMFs by using two different MIDI sound generators made by different manufacturers. We used one of them for the test (evaluation) data and the other for obtaining the template sounds or training the timbre feature distributions in advance.

The experimental procedure is as follows:

1. initialize the integrated model of each musical note by using the corresponding template sound,
2. estimate all the model parameters from the input sound mixture, and
3. calculate the SNR in the frequency domain for the evaluation.

The SNR is defined as follows:

\[
\mathrm{SNR} = \frac{1}{C\,(T_1 - T_0)} \sum_c \int \mathrm{SNR}_{kl}(c,t)\, dt,
\]
\[
\mathrm{SNR}_{kl}(c,t) = \int \log_{10} \frac{X_{kl}^{(J)}(c,t,f)^2}{\bigl(X_{kl}^{(J)}(c,t,f) - X_{kl}^{(R)}(c,t,f)\bigr)^2}\, df,
\]

where C is the number of channels, [T_0, T_1] is the evaluated time range, and X_kl^(R)(c,t,f) denotes the power spectrogram of the reference signal.
[8] Woodruff, J., Pardo, B., and Dannenberg, R., “Remixing Stereo Music with Score-informed Source Separation”, Proc. ISMIR, pp. 314–319, 2006.

[9] Itoyama, K., Goto, M., Komatani, K., Ogata, T., and Okuno, H.G., “Integration and Adaptation of Harmonic and Inharmonic Models for Separating Polyphonic Musical Signals”, Proc. ICASSP, pp. 57–60, 2006.

[10] Cano, P., Loscos, A., and Bonada, J., “Score-performance Matching using HMMs”, Proc. ICMC, pp. 441–444, 1999.

[11] Adams, N., Marquez, D., and Wakefield, G., “Iterative Deepening for Melody Alignment and Retrieval”, Proc. ISMIR, pp. 199–206, 2005.

[12] Cont, A., “Realtime Audio to Score Alignment for Polyphonic Music Instruments using Sparse Non-negative Constraints and Hierarchical HMMs”, Proc. ICASSP, pp. 641–644, 2006.

[13] Kameoka, H., Nishimoto, T., and Sagayama, S., “Harmonic-temporal Structured Clustering via Deterministic Annealing EM Algorithm for Audio Feature Extraction”, Proc. ISMIR, pp. 115–122, 2005.

[14] Goto, M., Hashiguchi, H., Nishimura, T., and Oka, R., “RWC Music Database: Music Genre Database and Musical Instrument Sound Database”, Proc. ISMIR, pp. 229–230, 2003.

[15] Goto, M., “AIST Annotation for the RWC Music Database”, Proc. ISMIR, pp. 359–360, 2006.

[16] Goto, M., Hashiguchi, H., Nishimura, T., and Oka, R., “RWC Music Database: Popular, Classical, and Jazz Music Databases”, Proc. ISMIR, pp. 287–288, 2002.

[17] Vincent, E., Gribonval, R., and Févotte, C., “Performance Measurement in Blind Audio Source Separation”, IEEE Trans. on ASLP, vol. 14, no. 4, pp. 1462–1469, 2006.

A DERIVATION OF THE PARAMETER UPDATE EQUATIONS

In this appendix, we describe the update equation for each parameter, derived from the M-step of the EM algorithm. The update equations are obtained by differentiating the cost function with respect to each parameter. Let X_klmn^(H)(c,t,f) and X_kl^(I)(c,t,f) be the decomposed power:

\[
X_{klmn}^{(H)}(c,t,f) = \Delta^{(H)}(m,n;k,l,t,f)\, X_{kl}^{(J)}(c,t,f)
\quad\text{and}\quad
X_{kl}^{(I)}(c,t,f) = \Delta^{(I)}(k,l,t,f)\, X_{kl}^{(J)}(c,t,f).
\]

A.1 w_kl^(J): overall amplitude

\[
w_{kl}^{(J)} = \sum_c \iint \Bigl( \sum_{m,n} X_{klmn}^{(H)}(c,t,f) + X_{kl}^{(I)}(c,t,f) \Bigr)\, dt\, df.
\]

A.2 r_kl(c): relative amplitude of each channel

\[
r_{kl}(c) = \frac{C \iint \bigl( \sum_{m,n} X_{klmn}^{(H)}(c,t,f) + X_{kl}^{(I)}(c,t,f) \bigr)\, dt\, df}
{\sum_c \iint \bigl( \sum_{m,n} X_{klmn}^{(H)}(c,t,f) + X_{kl}^{(I)}(c,t,f) \bigr)\, dt\, df}.
\]

A.3 w_kl^(H), w_kl^(I): amplitudes of the harmonic and inharmonic tone models

\[
w_{kl}^{(H)} = \frac{\sum_{c,m,n} \iint X_{klmn}^{(H)}(c,t,f)\, dt\, df}
{\sum_c \iint \bigl( \sum_{m,n} X_{klmn}^{(H)}(c,t,f) + X_{kl}^{(I)}(c,t,f) \bigr)\, dt\, df}
\]
and
\[
w_{kl}^{(I)} = \frac{\sum_c \iint X_{kl}^{(I)}(c,t,f)\, dt\, df}
{\sum_c \iint \bigl( \sum_{m,n} X_{klmn}^{(H)}(c,t,f) + X_{kl}^{(I)}(c,t,f) \bigr)\, dt\, df}.
\]

A.4 u_kl(m): coefficient of the temporal power envelope

\[
u_{kl}(m) = \frac{\sum_{c,n} \iint X_{klmn}^{(H)}(c,t,f)\, dt\, df + \mu_k^{(u)}(m)}
{\sum_{c,m,n} \iint X_{klmn}^{(H)}(c,t,f)\, dt\, df + 1}.
\]

A.5 v_kl(n): relative amplitude of the n-th harmonic component

\[
v_{kl}(n) = \frac{\sum_{c,m} \iint X_{klmn}^{(H)}(c,t,f)\, dt\, df + \mu_k^{(v)}(n)}
{\sum_{c,m,n} \iint X_{klmn}^{(H)}(c,t,f)\, dt\, df + 1}.
\]

A.6 τ_kl: onset time

\[
\tau_{kl} = \frac{\sum_{c,m,n} \iint (t - m\phi_{kl})\, X_{klmn}^{(H)}(c,t,f)\, dt\, df}
{\sum_{c,m,n} \iint X_{klmn}^{(H)}(c,t,f)\, dt\, df}.
\]

A.7 ω_kl(t): F0 trajectory

\[
\omega_{kl}(t) = \frac{\sum_{c,m,n} \int n f\, X_{klmn}^{(H)}(c,t,f)\, df}
{\sum_{c,m,n} \int n^2\, X_{klmn}^{(H)}(c,t,f)\, df}.
\]

A.8 φ_kl: diffusion of the Gaussians of the power envelope

\[
\phi_{kl} = \frac{-A_1^{(\phi)} + \sqrt{A_1^{(\phi)\,2} + 4 A_2^{(\phi)} A_0^{(\phi)}}}{2 A_2^{(\phi)}}, \quad\text{where}
\]
\[
A_2^{(\phi)} = \sum_{c,m,n} \iint X_{klmn}^{(H)}(c,t,f)\, dt\, df,
\]
\[
A_1^{(\phi)} = \sum_{c,m,n} \iint m\,(t - \tau_{kl})\, X_{klmn}^{(H)}(c,t,f)\, dt\, df, \quad\text{and}
\]
\[
A_0^{(\phi)} = \sum_{c,m,n} \iint (t - \tau_{kl})^2\, X_{klmn}^{(H)}(c,t,f)\, dt\, df.
\]

A.9 σ_kl: diffusion of the harmonic components along the frequency axis

\[
\sigma_{kl} = \frac{\sum_{c,m,n} \iint (f - n\,\omega_{kl}(t))^2\, X_{klmn}^{(H)}(c,t,f)\, dt\, df}
{\sum_{c,m,n} \iint X_{klmn}^{(H)}(c,t,f)\, dt\, df}.
\]

A.10 E_kl^(I)(t), F_kl^(I)(t,f): inharmonic tone model

\[
E_{kl}^{(I)}(t) = \frac{\sum_c \int X_{kl}^{(I)}(c,t,f)\, df}{\sum_c \iint X_{kl}^{(I)}(c,t,f)\, dt\, df}
\quad\text{and}\quad
F_{kl}^{(I)}(t,f) = \frac{\sum_c X_{kl}^{(I)}(c,t,f) + \mu_k^{(F)}(f)}{\sum_c \int X_{kl}^{(I)}(c,t,f)\, df + 1}.
\]
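As a concrete illustration, the harmonic/inharmonic amplitude updates of A.3 are simple ratios of decomposed power. The sketch below (hypothetical array names; discrete sums stand in for the sums and integrals) shows the shape of such an M-step update:

```python
import numpy as np

def update_amplitudes(x_h, x_i):
    """Update w^(H) and w^(I) as power ratios, in the style of appendix A.3.

    x_h: decomposed harmonic power, shape (C, M, N, T, F)
         -- channel, envelope index m, harmonic index n, time, frequency.
    x_i: decomposed inharmonic power, shape (C, T, F).
    Summing over all axes approximates the sums over c, m, n and the
    integrals over t and f.
    """
    harmonic = x_h.sum()
    inharmonic = x_i.sum()
    total = harmonic + inharmonic
    return harmonic / total, inharmonic / total

# Toy input: uniform harmonic power 0.5 and inharmonic power 1.0.
w_h, w_i = update_amplitudes(np.full((1, 2, 3, 4, 5), 0.5),
                             np.full((1, 4, 5), 1.0))
```

Because both updates share the same denominator, the two amplitudes sum to one, matching their role as relative weights of the two tone models.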