
A Classifier-Based Approach to Score-Guided Source Separation of Musical Audio

Christopher Raphael
School of Informatics, University of Indiana
901 East 10th Street
Bloomington, Indiana 47408-3912 USA
craphael@indiana.edu

Computer Music Journal, 32:1, pp. 51–59, Spring 2008. © 2008 Massachusetts Institute of Technology.

Audio source separation seeks to decompose an audio recording into several different layers corresponding to independent sources, such as different speakers, or, in our case, musical parts. Source separation is a formidable task; although the problem has received considerable attention in recent years, it is safe to say that it remains open.

Many approaches to this audio decomposition problem are deemed blind source separation, meaning that the audio is decomposed without explicit knowledge of its contents (Cardoso 1998; Bregman 1990; Ellis 1996). In particular, much recent work has focused on Independent Component Analysis (ICA) as the methodological backbone of various approaches (Bell and Sejnowski 1995; Lee et al. 1999). The blind-separation literature also contains work specifically devoted to music audio (e.g., Maher 1990; Vincent 2006). Although blind separation is no doubt broadly useful and deeply interesting, many of the techniques rely on restrictive assumptions about the recording process or audio, often not satisfied in practice. Moreover, blind approaches seem simply wrong-headed for our purposes, because they fail to capitalize on our explicit and detailed knowledge of the audio. The focus of our effort here is on fully incorporating this knowledge in a principled approach to musical source separation.

Our motivation stems from our ongoing work in musical accompaniment systems, in which a computer program generates a flexible and responsive accompaniment to a live soloist in a non-improvisatory piece of music. Our favorite musical domain is the concerto, or other work involving an entire orchestra for the accompaniment. Because our preferred approach resynthesizes a preexisting audio recording to synchronize with the live player (Raphael 2003), we rely on orchestra-only recordings. Some orchestral accompaniments can be purchased from commercial sources; however, the small collection of available accompaniments tends to be poorly played and recorded. The ability to desolo a complete recording would open up a vast library of beautifully played and expertly recorded accompaniments for our system. Thus, our particular vantage point produces an asymmetrical view of the source-separation problem, in which we seek to separate a single instrument from a large ensemble. This has important implications for the types of models and algorithms that we employ.

The unusual aspect of our problem statement is that we assume detailed knowledge of the audio content of our recordings: we begin with a symbolic musical score, giving the complete collection of pitches and rhythms in the solo and all accompanying parts. Our long-standing interest in score alignment has led to algorithms that automatically create a correspondence between the audio recordings and the symbolic scores (Raphael 1999, 2004). Thus, at any moment in the audio, we know what notes are sounding and to which parts they belong. A partial depiction of our score knowledge is given in Figure 1, in which vertical lines mark the onsets of each solo note. Score knowledge for musical source separation has also been used by Ben-Shalom et al. (2004) and Every (2006). Both of these efforts apply a time-varying filter to distinguish the desired audio from its complement. In these efforts, as in ours, the difficulty of identifying the precise time-frequency components one wishes to isolate is the Achilles' heel one must inevitably address. Our approach differs from these by casting this isolation problem as one of classification and employing appropriate methodology.

Figure 1. Spectrogram of the opening of Samuel Barber's Violin Concerto with note onsets for the solo violin marked with vertical lines. A high-resolution version of this same image with the solo part highlighted in blue can be seen at xavier.informatics.indiana.edu/~craphael/cmj07/.

Whereas our interest is motivated by a particular application, this work potentially has a broader impact. The most obvious application is karaoke, which also requires an accompaniment-only recording. Desoloing a recording is easy when the solo part is recorded separately and asymmetrically
mixed into stereo channels, as is often the case in popular music. One need only estimate the mixing weights for each channel and then invert the mixing operation. This popular technique, formalized by ICA, forms the basis of several commercial desoloing software products. When the recording and mixing techniques do not support this trick, methods such as our current proposal constitute a viable alternative. Other applications of the general problem of musical source separation include remixing existing recordings, incorporating existing musical material into new compositions, construction of audio databases, audio editing, and, no doubt, many ideas not yet conceived.

Our essential approach examines a small yet reasonable subset of possible decompositions of the audio: using our road map, we attribute each short-time Fourier transform (STFT) time-frequency point to either the soloist or to the accompaniment. Then, we invert our STFT using the appropriate subset of points to produce either the desoloed audio or the soloist alone. This is the well-known idea of masking (Roweis 2000; Bach and Jordan 2005). Using easy-to-create training data synthesized from separate solo and orchestra files, we provide subjective justification for our restricted problem statement. This training data then leads to a principled machine-learning formulation of the problem whose performance we evaluate objectively. We conclude with experiments on data taken from a commercial compact disc in an especially difficult domain: separating the soloist from the orchestra in a concerto setting.

STFT Representation

Our approach is based on the STFT representation of our audio signal. The advantages of this representation are rather obvious for musical signals: much music is composed of notes that are, almost by definition, of limited extent in both time and frequency. Thus, most pairs of notes are supported by entirely disjoint regions of time-frequency space. Even with the STFT, collisions will still occur between harmonics of some notes. However, we believe that other possible signal representations, such as wavelets, share this problem, and the STFT goes as far as any representation can in minimizing the difficulty.

Suppose our audio signal is denoted by

$$x = \ldots, x(-1), x(0), x(1), \ldots$$

We write the STFT of x as X = {X(t,k)}, where

$$X(t,k) = \sum_n x(n)\, e^{-2\pi i k n / K}\, w(tH - n) \qquad (1)$$

where k = 0, ..., K − 1, t ∈ Z, H is our hop size, and w is the window function, which is 0 outside the range {−K/2, ..., K/2 − 1}. We assume K = HL, so that L is the integral number of hops needed to traverse the FFT length K.

Perfect recovery of x from X is exceedingly simple when

$$\sum_t w^2(tH - n) = c \qquad (2)$$

for all n and some constant c (see Zölzer 2002 and the references therein for a more detailed discussion). In this case, we have

$$x(n) = \frac{1}{c} \sum_t x(n)\, w^2(tH - n) \qquad (3)$$

$$= \frac{1}{cK} \sum_t \sum_{k=0}^{K-1} X(t,k)\, e^{2\pi i k n / K}\, w(tH - n)$$

$$= \frac{1}{cK} \sum_t \sum_{k=0}^{K/2} a(t,k) \cos\bigl(\varphi(t,k) + 2\pi k n / K\bigr)\, w(tH - n) \qquad (4)$$

where the amplitudes {a(t,k)} and phases {φ(t,k)} are taken from X(t,k).

There are several window functions other than the constant window that have the necessary property of Equation 2. Among them is the Hann, or raised-cosine, window with L = K/H = 4 hops per FFT length, which we use in our experiments.
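For concreteness, the following Python sketch (illustrative only, not the implementation used in our experiments) verifies Equation 2 numerically for the squared periodic Hann window with L = 4, where the constant works out to c = 3/2:

```python
# Verify Equation 2 for a periodic Hann window with H = K/4 (L = 4):
# sum_t w^2(tH - n) should equal a constant c (here 3/2) for every n.
import numpy as np

K = 1024                                         # FFT length
H = K // 4                                       # hop size, L = K/H = 4
n = np.arange(K)
w = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / K))    # periodic Hann window

cover = np.zeros(4 * K)
for t in range(0, len(cover) - K + 1, H):        # overlap-add the squared windows
    cover[t:t + K] += w ** 2

interior = cover[K:-K]                           # ignore the unpadded edges
print(interior.min(), interior.max())            # both print 1.5 (up to rounding)
```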
Equation 4 is a rather intuitive description of the original signal as a sum of windowed and translated cosines, whose frequencies are indexed by k and whose translations are indexed by t.

Approximate Source Separation

Ideally, we wish to decompose our signal x into x = x_s + x_a, where x_s corresponds to the solo part and x_a corresponds to the accompaniment. Equivalently, we could seek a decomposition in STFT space (i.e., X = X_s + X_a, where X_s and X_a are the STFTs of x_s and x_a, respectively), though this problem still involves the precise estimation of phase and amplitude for each time-frequency bin of X_s and X_a, subject to the constraint that they sum to X.

Instead, we consider the approximations

$$X_s \approx 1_S X \qquad X_a \approx 1_A X$$

where

$$S = \{(t,k) : |X_s(t,k)| \geq |X_a(t,k)|\}$$

$$A = \{(t,k) : |X_s(t,k)| < |X_a(t,k)|\}$$

and

$$1_C(t,k) = \begin{cases} 1 & \text{if } (t,k) \in C \\ 0 & \text{otherwise} \end{cases}$$

Clearly, these approximations are much easier to estimate than the true X_s and X_a, because we need only estimate a Boolean value for each STFT point rather than a complex number.

One can appreciate the quality of this approximation by synthetically manufacturing X from known X_s and X_a and listening to the resulting decomposition. To this end, we began with a performance x_s of a soloist playing an excerpt from a Mozart violin concerto. We built the orchestra audio around this performance by first matching both the violinist's performance and a prerecorded orchestral performance to a score. We then warped the orchestra recording to synchronize with the solo part using phase vocoding (Flanagan and Golden 1966; Puckette 1995; Laroche and Dolson 1997) and adjusted the levels to achieve good balance, producing x_a. There are certainly easier ways to produce two synchronized parts, but we already had the machinery set up for this procedure, and we wanted to approximate realistic conditions as closely as possible.

Figure 2. The binary mask indicating, for each point, which part makes the greater contribution. The solo violin is represented as white. See xavier.informatics.indiana.edu/~craphael/cmj07/ for a spectrogram using color to indicate the binary mask.

Then, from the audio files x_s and x_a, we produced the composite STFT, X = X_s + X_a, as well as the two estimates of the separate solo and accompaniment parts, x̂_a and x̂_s, as
$$\hat{x}_a = \mathrm{STFT}^{-1}(1_A X)$$

$$\hat{x}_s = \mathrm{STFT}^{-1}(1_S X)$$

using Equation 3. The three files x = STFT⁻¹(X), x̂_s, and x̂_a can be heard online at xavier.informatics.indiana.edu/~craphael/cmj07/.
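For concreteness, this oracle decomposition can be sketched end to end in Python (illustrative only, not our implementation; SciPy's stft/istft with a Hann window at L = 4 hops stands in for Equations 1 and 3, and x_solo and x_acc are placeholder names for the synchronized parts):

```python
# Oracle binary-mask separation: build the composite STFT from known solo
# and accompaniment parts, mask it, and invert. A sketch under the stated
# assumptions, not the paper's actual implementation.
import numpy as np
from scipy.signal import stft, istft

K = 2048            # FFT length
H = K // 4          # hop size, so L = K/H = 4 (Hann satisfies Equation 2)

def separate_oracle(x_solo, x_acc, fs):
    """x_solo, x_acc: equal-length, synchronized signals. Returns the
    masked estimates (solo_estimate, accompaniment_estimate)."""
    _, _, Xs = stft(x_solo, fs, window='hann', nperseg=K, noverlap=K - H)
    _, _, Xa = stft(x_acc, fs, window='hann', nperseg=K, noverlap=K - H)
    X = Xs + Xa                           # composite STFT of the mixture
    S = np.abs(Xs) >= np.abs(Xa)          # ideal mask: where the solo dominates
    _, xs_hat = istft(np.where(S, X, 0), fs, window='hann',
                      nperseg=K, noverlap=K - H)
    _, xa_hat = istft(np.where(S, 0, X), fs, window='hann',
                      nperseg=K, noverlap=K - H)
    return xs_hat, xa_hat
```

Listening to xs_hat and xa_hat produced this way corresponds to the x̂_s and x̂_a discussed next.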
The quality of x̂_s and x̂_a was rather surprising to us, as they sounded for the most part quite similar to the original files. This suggests that the effect of our masking operations might not be as significant as one might expect, as has been observed by others in the music-processing domain (Li and Wang 2006).

It is perhaps worth noting that our masked versions 1_S X and 1_A X are not necessarily the STFTs of any time signal. This is because the overlapping of windows produces linear constraints that the STFT must satisfy, and we have no reason to suppose that our masked versions of X would satisfy these constraints. However, the difference between 1_S X and STFT(x̂_s) is exceedingly small, both in terms of measurable and perceived distances (similarly for 1_A X and STFT(x̂_a)).

Figure 2 shows the region S as white and the complementary region A colored black. The accompanying Web page shows the spectrogram with the two regions colored differently to clearly distinguish them. Perhaps surprising is how much of the STFT is labeled as solo, including regions seemingly far from any solo harmonics. Part of this chaotic nature of the mask is explained by the spectrogram image. From this image, it is clear that the class labels of many of the points are somewhat irrelevant, owing to their small contribution to the audio signal.

This experiment demonstrates the perceptual accuracy of x̂_a and x̂_s, thus justifying the use of our approximation. Although certainly much easier than trying to estimate X_s and X_a from X, the estimation of our ideal mask is still a difficult problem and will introduce further audio degradation. Thus, the audio results should be considered an upper bound on what our masking approach can achieve. The next section develops an approach for estimating the ideal mask.

Estimating the Mask

Classification Trees

Constructing our composite data from unmixed solo and accompaniment parts as described leads to principled methods for estimating the ideal mask, as follows. For each point in the composite STFT, X = X_s + X_a, we know whether X_s or X_a made the bigger contribution.

Thus, our synthetic spectrum can be viewed as training data for a classifier that attempts to label each point as belonging to S or A. Needless to say, this approach produces voluminous quantities of training data: hundreds of thousands of correctly labeled points for minute-long audio excerpts. With such a large and easily obtainable collection of ground truth, it seems natural to train a classifier to label each STFT point. In addition to the fully automatic construction of the classifier, such an approach allows one to numerically evaluate its success, rather than making subjective judgments of audio quality.

In building our classifier, we depart slightly from the formulation presented in the previous section. Many, perhaps most, of the STFT points do not significantly affect our end result, owing to their small magnitudes. When building the classifier, we eliminate points in {|X(t,f)| < T} for some threshold T, because we do not view their labels as meaningful, and they would distract the classifier from its essential task. With the remaining points, we build a tree-structured classifier following the ideas of Breiman et al. (1984). Our features are derived both from our score match and from aspects of the STFT itself; they are outlined subsequently. We experimented with several other features, but none of these achieved any measurable increase in performance on a validation set.

Vertical Distance to the Closest Solo Harmonic

This feature computes the distance in frequency from the given STFT point to the closest solo harmonic. The feature depends only on the score match. For each STFT point, we compute which solo note, if any, is coincident with that point and how far the point is, in frequency units, from the closest solo harmonic. This feature by itself can be used to give somewhat credible results.

Vertical Distance to the Closest Orchestra Harmonic

This feature is perfectly analogous to the previous feature, except we consider distances to orchestra harmonics rather than solo harmonics.

Distance to the Closest Solo Harmonic

This feature is also purely a function of the score match. Conceptually, we create a binary representation of the solo performance as an idealized spectrogram, containing 1s only where solo harmonics occur. For each STFT point, we compute the minimum Euclidean distance to a 1-point over all STFT points of earlier times. This feature is useful for detecting points whose energies are mostly caused by reverberation of a solo harmonic.

Modulus |X(t,f)|

High-energy points are more likely to be associated with the solo part. We also computed the average modulus over local neighborhoods.

Rank

The Rank feature refers to the percentile ranking of the modulus over a neighborhood of STFT points. The STFT points associated with the solo tend to be larger in magnitude than their neighbors. This feature was computed over 4×4, 3×3, and 2×2 neighborhoods.

Phase Coherence

One expects that the STFT points composing a harmonic will tend to evolve in time with similar phase advance. This is, in fact, the idea behind the phase-locking improvements to the phase vocoder (Laroche and Dolson 1999). We computed a measure of the degree to which this is true for an STFT point as the empirical variance of the phase advances. We expect this feature to be small on a peak, especially for the more closely recorded solo instrument.

Horizontal Derivatives

Horizontal differences of the STFT moduli were computed in hopes of detecting higher activity for solo harmonics.
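To make two of these feature definitions concrete, the following sketch (illustrative only; it assumes the composite STFT is a complex NumPy array X indexed as X[t, k], and reads the variance of the phase advances as taken over a short time neighborhood) computes the Rank and Phase Coherence features:

```python
import numpy as np

def rank_feature(mod, r=1):
    """Percentile rank of |X(t,k)| within its (2r+1) x (2r+1) neighborhood;
    r=1 gives the 3x3 case, with edges handled by clipping."""
    T, K = mod.shape
    out = np.zeros_like(mod)
    for t in range(T):
        for k in range(K):
            nb = mod[max(t - r, 0):t + r + 1, max(k - r, 0):k + r + 1]
            out[t, k] = np.mean(nb <= mod[t, k])
    return out

def phase_coherence(X, r=1):
    """Empirical variance of the frame-to-frame phase advance near each
    point (returns T-1 frames); stable harmonic peaks yield small values."""
    dphi = np.angle(X[1:] * np.conj(X[:-1]))   # principal-value phase advance
    out = np.zeros_like(dphi)
    for t in range(dphi.shape[0]):
        out[t] = np.var(dphi[max(t - r, 0):t + r + 1], axis=0)
    return out
```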

Construction of the Classifier

The classifier is then built according to the usual Classification and Regression Tree (CART) prescription of recursive partitioning, choosing, at each stage, the feature and split point that minimize the average class-label entropy of the two child nodes. We built deep trees, using 680,000 correctly labeled STFT points, splitting tree nodes until a node contained only solo or orchestra points, or until the node had fewer than 50 points, thus producing thousands of branches. We then pruned the tree using traditional CART techniques with an independent validation set of approximately the same size as the training set (Breiman et al. 1984).
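The general recipe can be approximated with scikit-learn's CART implementation; the following sketch is illustrative only (our experiments did not use this code, and the feature matrices are placeholders):

```python
# Grow a deep entropy-split tree, then prune by minimal cost-complexity,
# selecting the pruning strength on a held-out validation set.
from sklearn.tree import DecisionTreeClassifier

def fit_mask_tree(F_train, y_train, F_val, y_val):
    """F_*: (n_points, n_features) features for above-threshold STFT points;
    y_*: 0/1 labels (accompaniment/solo) from the synthetic composite."""
    path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(
        F_train, y_train)
    best, best_acc = None, -1.0
    for alpha in path.ccp_alphas:              # in practice, subsample these
        tree = DecisionTreeClassifier(criterion="entropy",
                                      min_samples_split=50,  # no splits below 50 points
                                      ccp_alpha=alpha).fit(F_train, y_train)
        acc = tree.score(F_val, y_val)
        if acc > best_acc:
            best, best_acc = tree, acc
    return best
```

The leaf-node class proportions (exposed by predict_proba) are natural raw estimates for the probabilities P(C(t,k) = s | X) used in the next section.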
A portion of the classification results on a separate test set is presented on the Web page accompanying this article, again on the soloist's entrance to the Mozart violin concerto, in which the mistakenly labeled points are indicated in color. The falsely classified points accounted for 0.025 of the total collection of S and 0.029 of the total collection of A, out of 680,000 test points. The associated audio reconstructions are not without their merits, but they suffer from the discontinuous nature of the purely local processing technique.

Spatial-Constraint-Based Classification

The decisions of the learned classification tree are based mostly on the distances to solo and orchestral harmonics, as well as on local energy in the signal. Clearly, these features do not contain enough information to consistently distinguish between solo and orchestra; we doubt any local features can do this. Rather, the separation must be made on less-local properties of the signal, which is the approach we move toward in the current section. To this end, we constrained our classifier to estimate masks having a connected structure, typical of the masks we seek. This modification identifies two distinct kinds of events occurring within the solo part: note harmonic events and transient events.

For each harmonic of each solo note, we consider a rectangular box B of sufficient extent to contain the energy generated by that note. The box must extend beyond the right edge of the note to include the note's reverberation, and to account for our uncertainty in pitch as well. Let t0, te, t1 denote the STFT time indices giving the onset of the note, the onset of the next note, and the latest possible time the note might continue to reverberate, and let the frequency extent be bounded by k0 and k1, so B = {t0, ..., t1} × {k0, ..., k1}.

We seek to label all of the points in B as s or a, for solo or accompaniment, and write C(t,k) for the label of point (t,k). In the previous section, our tree-structured classifier was used to make binary decisions about each point; note, however, that the classifier can also be used to estimate the probabilities of these assignments, for example, P(C(t,k) = s | X), as the proportion of training examples labeled as s at the terminal node encountered by (t,k). In practice, we smooth these estimates. In this way, we use the learned tree as the basis for our data model.

If I ⊆ B is the collection of points labeled as solo, then, assuming independence, the joint labeling C_B of all points in B has probability

$$P(C_B \mid X) = \prod_{(t,k) \in I} P(C(t,k) = s \mid X) \prod_{(t,k) \in I^c} P(C(t,k) = a \mid X) \qquad (5)$$

where I^c is the complement of I in B.

To force connectedness of our labeling, we constrain the region I as follows. For each t = t0, ..., te, ..., t1 we choose a single (possibly empty) interval I_t ⊆ {k0, ..., k1}, constrained by the requirements

$$I_t \cap I_{t+1} \neq \emptyset \quad \text{when } I_t \neq \emptyset \text{ and } I_{t+1} \neq \emptyset \qquad (6)$$

$$I_{t+1} \subseteq I_t \quad \text{when } t \geq t_e \qquad (7)$$

Thus, the sequence of intervals traces out a connected region I = ∪_{t=t0}^{t1} I_t whose vertical extent is nonincreasing in the region attributed to reverberation. Subject to these constraints, we seek the set I that maximizes Equation 5.

Such a region can easily be identified using dynamic programming. To this end, we enumerate the possible intervals I_t for each t ∈ {t0, ..., t1}. For each interval I_t, we define the data probability

$$D_t(I_t) = \prod_{k \in I_t} P(C(t,k) = s \mid X) \prod_{k \notin I_t} P(C(t,k) = a \mid X)$$

where k ranges over {k0, ..., k1},
and set H_{t0}(I_{t0}) = D_{t0}(I_{t0}). We then recursively compute the score H_t(I_t) for t ∈ {t0 + 1, ..., t1} as

$$H_t(I_t) = \max_{I_{t-1}} H_{t-1}(I_{t-1})\, D_t(I_t)$$

where the maximum is over all intervals I_{t−1} that satisfy Equations 6 and 7. If I*_{t1} is the maximizing interval for H_{t1}, then we can recursively construct the optimal sequence of intervals as

$$I^*_{t-1} = \arg\max_{I_{t-1}} H_{t-1}(I_{t-1})\, D_t(I^*_t)$$

thus producing our optimal sequence of intervals I*_{t0}, ..., I*_{t1}.
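The following log-domain Python sketch illustrates this dynamic program (illustrative only; with K frequency bins there are O(K²) candidate intervals per frame, so a practical implementation would restrict candidates to a band around the predicted harmonic):

```python
# Viterbi-style search for the interval sequence maximizing Equation 5,
# subject to the overlap (Eq. 6) and shrinkage (Eq. 7) constraints.

def best_interval_sequence(logp_s, logp_a, t_e):
    """logp_s, logp_a: (T, K) arrays of log P(C=s|X) and log P(C=a|X) over
    the box B, indexed from its corner; t_e: next note's onset, re-indexed.
    Returns one interval (lo, hi), or None for empty, per frame."""
    T, K = logp_s.shape
    intervals = [None] + [(lo, hi) for lo in range(K) for hi in range(lo, K)]

    def log_D(t, iv):                # log D_t(I_t)
        total = logp_a[t].sum()      # accompaniment everywhere ...
        if iv is None:
            return total
        lo, hi = iv                  # ... except solo inside the interval
        return total - logp_a[t, lo:hi + 1].sum() + logp_s[t, lo:hi + 1].sum()

    def compatible(prev, cur, t):
        if cur is None:
            return True
        if prev is None:
            return t <= t_e          # nothing may reappear while reverberating
        if t > t_e and not (prev[0] <= cur[0] and cur[1] <= prev[1]):
            return False             # Eq. 7: intervals only shrink after t_e
        return cur[0] <= prev[1] and prev[0] <= cur[1]   # Eq. 6: overlap

    H = [{iv: log_D(0, iv) for iv in intervals}]
    back = []
    for t in range(1, T):
        col, bp = {}, {}
        for iv in intervals:
            s, pv = max(((H[t - 1][p], p) for p in intervals
                         if compatible(p, iv, t)), key=lambda z: z[0])
            col[iv], bp[iv] = s + log_D(t, iv), pv
        H.append(col)
        back.append(bp)
    seq = [max(H[-1], key=H[-1].get)]           # trace back the best path
    for bp in reversed(back):
        seq.append(bp[seq[-1]])
    return seq[::-1]
```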
The second type of solo event we identify is the transient event associated with note onsets. Many instruments produce vertical lines in the spectrogram image at note-onset positions, corresponding to widely dispersed spectral energy, before the note settles into its steady-state behavior. Although such events are most obvious in percussive and plucked-string instruments, we have observed them in most of the instruments we have studied. These transient events are typically contained within a thin and tall rectangle in STFT space, {t0, ..., t1} × {k0, ..., k1}, centered in time around the note onset. Specifically, t0, ..., t1 corresponds to approximately 100 msec, whereas k0, ..., k1 contains the entire frequency range. We model the transient region as a sequence of horizontal intervals I_k ⊆ {t0, ..., t1}, where k ∈ {k0, ..., k1}. The (possibly empty) intervals are constrained by

$$I_k = I_{k+1} \quad \text{when } I_k \neq \emptyset \text{ and } I_{k+1} \neq \emptyset$$

thus producing a sequence of rectangles separated by gaps to allow the free passage of orchestral harmonics. We seek the collection of rectangles that maximizes

$$H = \sum_{k=k_0}^{k_1} \sum_{t \in I_k} F_v(t,k) - F_h(t,k)$$

where F_v and F_h are two-dimensional filters designed to highlight vertical features (the solo transients) and horizontal features (the orchestral harmonics). Again, this criterion is easily optimized using dynamic programming.

Using an excerpt from a commercial compact disc of Samuel Barber's Violin Concerto, the accompanying Web page shows the solo points identified by our spatial-constraint-based classifier colored in purple, with the remaining points colored in blue. The accompanying Web page also presents the solo and orchestra audio achieved by inverting the STFT after applying the estimated masks. Although traces of the unwanted part are occasionally present, we believe these results to be highly promising, especially when considering the challenge of source separation in this orchestral context. It is, of course, not possible to provide any quantitative evaluation of this experiment, because we are not given the unmixed solo and orchestra channels.

Discussion and Future Directions

Even with our precise score match, our desoloing process degrades the resulting audio. Although we hope to improve our results, we expect this will always be true. In an unusual turn of events, however, forces seem to conspire in our favor to ameliorate this situation in the context of our accompaniment system. The damage done to the audio will be at the precise points in time-frequency space where the live soloist will be playing, thus masking much of the harm done in removing the recorded soloist. The accompanying Web page shows an example of our accompaniment system using desoloed audio on the second movement of the Strauss Oboe Concerto, with the author playing the oboe. The desoloing procedure was more simple-minded than that presented here, but it still produces acceptable results.

The most significant contribution of this work is the combination of machine-learning techniques with the road map provided by the score match, resulting in a principled way to address the desoloing problem. Our technique is generally applicable, in the sense that it does not rely on unrealistic assumptions about the recording process. Beyond that, we have demonstrated a method for training our separating mechanism from real data, as well as for numerically evaluating the quality of this separation. Finally, we have offered credible audio results

that show the promise of score-guided musical source separation.

Although we believe in posing the separation problem as one of estimating binary masks, there are many other, perhaps better, ways this estimation might be accomplished. The matched score can serve as the basis for estimating more detailed models of the signal, including the functions |X_s| and |X_a|, or even the complete complex X_s and X_a. The first of these, however, is complicated by the fact that |X_s| + |X_a| ≠ |X|, as well as by the difficulty imposed by the positivity restriction on our estimates, though this latter issue is an active research area (Lee and Seung 2000). When dealing with the full complex STFTs, we do have X_s + X_a = X; however, it is unclear to us how to model the complex evolution of the signal. Both of these approaches are reasonable endeavors, even if the eventual goal is only the binary masks, because the extra nuisance parameters may lead to more precise estimation of the masks. Members of the Bayesian signal analysis community, as well as others, may recognize these as problems right up their alley. We welcome the contributions of such areas and will endeavor to make score-matched audio data available to those who request it.

Acknowledgment

This work was supported by NSF grant IIS-0534694.

References

Bach, F., and M. Jordan. 2005. "Blind One-Microphone Speech Separation: A Spectral Learning Approach." Proceedings of Neural Information Processing Systems 17:65–72.

Bell, A. J., and T. J. Sejnowski. 1995. "An Information-Maximization Approach to Blind Separation and Blind Deconvolution." Neural Computation 7(6):1129–1159.

Ben-Shalom, A., et al. 2004. "Optimal Filtering of an Instrument Sound in a Mixed Recording Using Harmonic Model and Score Alignment." Proceedings of the 2004 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 715–718.

Bregman, A. 1990. Auditory Scene Analysis. Cambridge, Massachusetts: MIT Press.

Breiman, L., et al. 1984. Classification and Regression Trees. Monterey, California: Wadsworth and Brooks/Cole.

Cardoso, J. 1998. "Blind Signal Separation: Statistical Principles." Proceedings of the IEEE 86(10):2009–2025.

Ellis, D. 1996. "Prediction-Driven Computational Auditory Scene Analysis." PhD dissertation, Department of Electrical Engineering and Computer Science, MIT.

Every, M. R. 2006. "Separation of Musical Sources and Structure from Single-Channel Polyphonic Recordings." PhD thesis, Department of Electronics, University of York.

Flanagan, J. L., and R. M. Golden. 1966. "Phase Vocoder." Bell System Technical Journal 45:1493–1509.

Laroche, J., and M. Dolson. 1997. "Phase-Vocoder: About This Phasiness Business." Proceedings of the IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics. New York: Institute of Electrical and Electronics Engineers, pp. 1–4.

Laroche, J., and M. Dolson. 1999. "Improved Phase Vocoder Time-Scale Modification of Audio." IEEE Transactions on Speech and Audio Processing 7(3):323–332.

Lee, D. D., and S. Seung. 2000. "Algorithms for Non-Negative Matrix Factorization." Proceedings of the 2000 Conference on Neural Information Processing Systems. La Jolla, California: Neural Information Processing Systems Foundation, pp. 556–562.

Lee, T. W., et al. 1999. "A Unifying Information-Theoretic Framework for Independent Component Analysis." International Journal of Computers and Mathematics with Applications 39:1–21.

Li, Y., and D. Wang. 2006. "Singing Voice Separation from Monaural Recordings." Proceedings of the 7th International Conference on Music Information Retrieval. Victoria, Canada: University of Victoria, pp. 176–179.

Maher, R. C. 1990. "Evaluation of a Method for Separating Digitized Duet Signals." Journal of the Audio Engineering Society 38(12):956–979.

Puckette, M. 1995. "Phase-Locked Vocoder." Proceedings of the IEEE ASSP Conference on Applications of Signal Processing to Audio and Acoustics. New York: Institute of Electrical and Electronics Engineers, pp. 222–225.

Raphael, C. 1999. "Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models." IEEE Transactions on Pattern Analysis and Machine Intelligence 21(4):360–370.

Raphael, C. 2003. "Orchestral Musical Accompaniment from Synthesized Audio." Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 127–134.

Raphael, C. 2004. "A Hybrid Graphical Model for Aligning Polyphonic Audio with Musical Scores." Proceedings of the 5th International Conference on Music Information Retrieval. Barcelona, Spain: Audiovisual Institute, Universitat Pompeu Fabra, pp. 387–394.

Roweis, S. 2000. "One Microphone Source Separation." Proceedings of the 2000 Conference on Neural Information Processing Systems. La Jolla, California: Neural Information Processing Systems Foundation, pp. 793–799.

Vincent, E. 2006. "Musical Source Separation Using Time–Frequency Source Priors." IEEE Transactions on Speech and Audio Processing 14(1):91–98.

Zölzer, U., ed. 2002. DAFX: Digital Audio Effects. New York: Wiley.
