Professional Documents
Culture Documents
With the growing popularity of the MPEG-1 Audio to handheld devices, where soundtrack tempo plays
Layer 3 (MP3) format, handheld devices such as a key role in controlling relevant game parameters,
personal digital assistants (PDAs) and mobile such as the speed of the game (Holm et al. 2005). For
phones have become important entertainment content-based audio / video synchronization (Den-
platforms. Unlike conventional audio equipment, man et al. 2005), the musical beat is the primary
mobile devices are characterized by limited pro- information source used as the anchor for timing.
cessing power, battery life, and memory, as well as The beat of a piece of pop music is defined as the
other constraints. Therefore, music-processing sequence of almost equally spaced phenomenal
tasks like beat detection must be implemented impulses. The beat is the simplest yet fundamental
using low-complexity algorithms to cope with the semantic information we perceive when listening
constraints of these mobile devices. to pop music. Groupings and strong / weak relation-
This article presents a scheme of complexity- ships form the rhythm and meter of the music
scalable beat detection of popular (pop) music (Scheirer 1998).
recordings that can run on different platforms, The beat-tracking process typically organizes mu-
particularly battery-powered handheld devices. We sical audio signals into a hierarchical beat structure
show a user-friendly and platform-adaptive scheme of three levels: quarter note, half note, and measure
such that the detector complexity can be adjusted to (Goto 2001), as shown in Figure 1. Beats at the
match the constraints of the device and user re- quarter-note level correspond to periodic beats or
quirements. The proposed algorithm provides both pulses at a simple level, and those at the half-note
theoretical and practical contributions, because we level and the measure level correspond to the over-
use the number of Huffman bits from the com- all rhythm, which is associated with grouping,
pressed bitstream without requiring any decoding as hierarchy, and a strong / weak dichotomy. Pop-music
the sole feature for onset detection. Furthermore, beat detection is a subset of the beat-detection prob-
we provide an efficient and robust graph-based beat- lem, which has been solved with detection accuracy
induction algorithm. By applying the beat detector as the primary if not the sole objective. In this ar-
to compressed rather than uncompressed audio, the ticle, we focus on beat detection in recorded audio
system execution time can be reduced by almost rather than real-time beat tracking.
three orders of magnitude. Currently, most beat-detection methods are im-
We have implemented and tested the algorithm plemented on a personal computer or server. Based
on a PDA platform. Experimental results show that on our experiments, we find that it is difficult to
our beat detector offers significant advantages over scale down the complexity of existing methods to
other existing methods in execution time while run on portable platforms such as PDAs and mobile
maintaining satisfactory detection accuracy. phones, where processing power, memory, and
battery life become critical constraints. Although
some recent results show that beat tracking can be
Motivation implemented in a mobile phone after major optimi-
zations (Seppanen et al. 2006), running such a
After a decade of explosive growth, mobile devices complex algorithm taxes battery life, which is not
today have become important entertainment plat- desirable. Because software applications running on
forms alongside desktop computers and servers. battery-powered portable platforms are gaining
Many applications such as games have been moved popularity, algorithms for content processing such
as beat detection must be designed to match both
Computer Music Journal, 32:1, pp. 7187, Spring 2008 the constraints of the device resources and the users
2008 Massachusetts Institute of Technology. expectations.
(Scheirer 1998; Dixon 2001; Goto 2001). A system Compressed-Domain Beat Detection
presented in Wang and Vilermo (2001) tracks beats
at the quarter-note level in the transform domain. In an MP3 bitstream, some parameters are readily
However, it has remained unknown whether it is available without decoding, including window type,
possible to directly detect beats from a compressed part2_3_length (Huffman code length), global gain,
bitstream without partial decoding. In this article, etc. (Wang et al. 2003). Figure 3 shows different
we investigate the possibility of detecting the whole features extracted from a compressed bitstream and
hierarchical beat structure. the corresponding waveform.
As with most beat detectors dealing with pop Because our objective was to design beat detection
music, we assume that the time signature is 4 / 4 and for pop music, we selected certain of these param-
the tempo is almost constant across the entire piece eters on the basis of the following criteria: (1) the
of music and roughly between 70 and 160 beats per feature is well correlated to signal energy; (2) the
minute (BPM). Our test data is music from commer- feature exhibits good self-similarities; (3) the feature
cial compact discs with a sampling rate of 44.1 kHz. depends mainly on the music or the acoustic signals
waveform
1
0.5
(a) 0
0.5
1
7.85 7.9 7.95 8 8.05
4
time (ms) x 10
window type
4
(b) 2
0
6020 6040 6060 6080 6100 6120 6140 6160 6180
MP3 granule index
part2_3_length
1600
1400
(c) 1200
1000
800
60
50
(d) 40
30
20
10
0
6020 6040 6060 6080 6100 6120 6140 6160 6180
MP3 granule index
global gain
190
185
180
(e) 175
170
165
160
6020 6040 6060 6080 6100 6120 6140 6160 6180
MP3 granule index
annotated beats
0.8
(f) 0.6
0.4
0.2
0
7.85 7.9 7.95 8 8.05
4
time (ms) x 10
12 bits
s ynch. pattern 38 bits 12 bits 47 bits 12 bits
part2_3_length part2_3_length
"111111111111"
(granule 1) (granule 2)
(a)
12 bits
s ynch. pattern 40 bits 12 bits 106 bits 12 bits
part2_3_length part2_3_length
"111111111111"
(granule 1) (granule 2)
(b)
that are compressed, and not on the encoder that Table 1. Results of the Cross-Correlation Method
has produced the data, which renders window type
Global Part2_3_ Full-Band
data unsuitable for beat detection, for example; Song No. Gain Length Energy
and (4) the features MP3 data field has separate
values for each granule. (In an MP3 bitstream, the 1 0.002 0.228 0.326
primary temporal unit is a frame, which is further 2 0.036 0.194 0.253
divided into two granules. Some data fields are 3 0.043 0.184 0.184
shared by both granules in an MP3 frame, whereas 4 0.004 0.217 0.188
5 0.009 0.218 0.264
others have separate values for each granule. We
Average 0.002 0.208 0.243
prefer the latter type because it gives a better time
resolution.)
In practice, we have used the following quantita- consist of multi-band data, whereas compressed-
tive measures for feature selection. For each data domain data seem to reveal only full-band charac-
type in the compressed domain, we created a se- teristics. In other words, we can achieve better
quence s by extracting the value from each granule. detection accuracy by using multi-band processing
Then another sequence b was generated as follows: with increased complexity. However, if instant
results are needed, a single-band approach can offer
bi k = 1 if there is an annotated beat at granule i, k = {0,1, 2}
significantly reduced complexity with reduced de-
bi = 0 if there is no annotated beat at granule i k, k = {0,1, 2} tection accuracy.
300
150
250
200
100
150
100
50
50
0 0
30 35 40 45 50 55 60 65 70 53 54 55 56 57
granules granules
400 80
300 60
200 40
100 20
0 0
400 550 700 850 660 680 700 720 740
msec msec
Figure 6
of all patterns with lapse qnl from the onsets, where Length of a quarter note:
qnl is the quarter-note length. The procedure shown
below extracts all patterns with a prescribed lapse
Pattern I
by a single iteration through the ordered set of all
onsets. In that procedure, we use another ordered Pattern II
event set (L, R'), which has the same properties and
Pattern III
operations as (S, R) as the data structure to store all
the patterns. The relation R' is defined by Li R' Lj if Figure 7
and only if head(Li) R head(Lj). The algorithm is
shown in Figure 8.
if F[rank(O, e)] = 0
then initialize a new empty pattern P
P P {e}
F[rank(O, e)] 1
es succ(O, e)
F[rank(O, es)] 1
e es
if diff(es, e) > qnl +
then break
es succ(O, es)
L L {P}
0 otherwise, 1000
diff (x, y)
[qnl , qnl + ]
ROUND(diff (x, y) / qnl) Transform / PCM-Domain Beat Detection
Therefore, if we insert k = (ROUND(diff(x, y) / Both TBD and PBD have three general steps: onset
qnl) 1) number of beats b1, b2, . . . , bk between x detection, beat induction, and bar detection. The
N2
( )
2
where Xf [i] is the fth MDCT coefficient decoded at
j Xj[n]
=N
Eb[n] = 3 1 N2 granule i, q(n) is the granule index mapped from the
2
(
Xa,j[n]
a = 1 j = N1
)
nth beat time, q(n + 1) is the granule index mapped
from beat time (n + 1), and gap(n) = (q(n + 1) q(n))/5.
We consider only the frequency range of 11,000
where the first relation applies to granules that con- Hz, which is supposed to contain the frequencies of
tain a long window, and the second relation applies dominant tones (Goto 2001). Thus, only the first 27
to granules that contain short windows. Also, Xj[n] MDCT frequency lines for long windows and the
is the jth MDCT coefficient decoded at granule n first nine MDCT frequency lines for short windows
(when granule n contains a long window), Xa,j[n] is are used to create the histogram.
q q h w
fails in the evaluation, then the detected half notes to detection accuracy and the corresponding execu-
and bars are all rejected; otherwise, we find in se- tion time.
quence Ta a beat b1 that marks the start of a bar and
in sequence Ts, as well as a beat b2 that also marks
the start of a bar. Suppose the index of b1 in Ta is i1, Execution Time
and the index of b2 in Ts is i2. If (i1 i2) modulo 4 is
0, we accept the detected half notes and bars; other- The three beat detectors were implemented on an
wise, if (i1 i2) modulo 4 is 2, we accept the detected HP iPAQ hx4700 PDA running Microsoft Windows
half notes and reject the detected bars; if not, both Mobile 2003 SE. (The iPAQ hx4700 uses the Intel
the detected half notes and bars are rejected. PXA270 processor with a clock speed of 624 MHz
and has 64 MB of SDRAM and 128 MB of ROM.)
Owing to the low quality of compressed-domain
Detection Accuracy features, the proposed beat detector must be per-
formed offline in the compressed domain. The
The evaluation results are listed in Table 3. Fig- average execution times in the three domains are
ure 11 shows the average performance with respect presented in Figure 12. We normalize the execution
5 5
Normalized execution
Normalized execution
4 4
time (sec)
time (sec)
3 3
2 2
1 1
0 0
(a) (b)
1
0.9
3 0.8
Normalized execution
2.5 0.7
2 0.6
time (min)
0.5
1.5 0.4
1 0.3
0.5 0.2
0.1
0
0
(c) P BD TB D C BD
(d)
time by dividing the actual execution time by the onds. The average beat detection time is about
duration of the input song (in minutes). 1 second for CBD, 12 seconds for TBD, and 13
The experimental results show that beat induc- minutes for PBD. These results show that the
tion takes roughly the same amount of time in the compressed- or transform-domain processing
three operation domains. The main difference lies provides a significant advantage for mobile plat-
in the onset detection, which is the dominant factor forms, whereas PBD is more suitable for desktop
that causes the vast difference between CBD and or server platforms.
PDB in terms of execution time. The execution
time of CBD is negligible in comparison to MP3
decoding. The execution time of TBD is comparable Applicability to Other Formats
to MP3 decoding. PBD requires a significantly
longer execution time compared to MP3 decoding, To evaluate dependency on the input audio format,
mainly due to an extra time-frequency transform. we also implemented the proposed algorithm with
In summary, the average duration of the 25 test the Advanced Audio Coding (AAC) decoder at a
songs is about 4 minutes. The average decoding constant bit rate of 128 kbps. The detection perfor-
time per song from MP3 to PCM is about 21 sec- mance is significantly lower than that with MP3.
4
x 10
18
16
Compresseddomain beat detector
Transformdomain beat detector
Normalized execution time (unit: msec/min)
12
10
0
0 5 10 15 20 25
Song index
Most of the errors with AAC bitstreams are -errors Concluding Remarks
(Goto and Muraoka 1997). We believe that the main
reason for the difference is that the time resolution We have presented a complexity-scalable beat-
of AAC is much lower, which results in a lower detection method that considers user expectations
feature quality. The difference is illustrated in and the resource constraints of mobile devices. The
Figure 13. This implies that the proposed method algorithm was implemented and tested on a tar-
may not be directly applicable to other audio for- geted PDA platform. Experimental results show
mats. Given the popularity of MP3, this is not that the compressed- and transform-domain pro-
overly restrictive. It will be interesting to investi- cessing are particularly suitable for mobile applica-
gate how sensitive the algorithm is to the bitrate of tions, providing a satisfactory tradeoff between
MP3 files. detection accuracy and execution speed.
2500
2000
#bits per frame
1500
1000
500
0
0 50 100 150 200 250 300 350 400 450
AAC frame index
2000
1500
#bits per granule
1000
500
0
0 100 200 300 400 500 600 700 800
MP3 granule index
Because the TBD can provide a good tradeoff be- Denman, H., et al. 2005. Exploiting Temporal Discon-
tween detection accuracy (comparable to the PBD) tinuities for Event Detection and Manipulation in
and execution speed (comparable to the CBD), we Video Streams. Proceedings of the 2005 International
are working on optimizing the TBD to make it more Workshop on Multimedia Information Retrieval.
suitable for mobile devices. In the future, we plan to New York: Association for Computing Machinery,
pp. 183192.
transport our beat detectors to different hardware
Dixon, S. 2001. Automatic Extraction of Tempo and Beat
(e.g., mobile phones) and software platforms (e.g., from Expressive Performances. Journal of New Music
Symbian). Another avenue of future work is to de- Research 30(1):3958.
sign algorithms by taking into account the con- Dixon, S. 2003. On the Analysis of Musical Expressions
straints of power consumption of mobile platforms. in Audio Signals. International Society for Optical
Engineering 5021(2):122132.
Goto, M. 2001. An Audio-Based Real-Time Beat Track-
References ing System for Music With or Without Drum-Sounds.
Journal of New Music Research 30(2):159171.
Bello, J. P., et al. 2005. A Tutorial on Onset Detection Goto, M., and Y. Muraoka. 1997. Issues in Evaluating
in Music Signals. IEEE Transactions on Speech and Beat Tracking Systems. Working Notes of the 1997 In-
Audio Processing 13(5):10351047. ternational Joint Conference on Artificial Intelligence