
IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 6, JUNE 2009 525

Audio Fingerprinting Based on Multiple Hashing in DCT Domain
Yu Liu, Hwan Sik Yun, and Nam Soo Kim, Member, IEEE

Abstract—Audio fingerprinting techniques aim at successfully performing content-based audio identification even when the audio signals are slightly or seriously distorted. In this letter, we propose a novel audio fingerprinting technique based on multiple hashing. In order to improve the robustness of hashing, multiple hash strings are generated through the discrete cosine transform (DCT), which is applied to the temporal energy sequence in each subband. Experimental results show that the proposed algorithm outperforms the Philips Robust Hash (PRH) algorithm [1] under various distortions.

Index Terms—Audio fingerprinting, content-based audio identification, discrete cosine transform (DCT), robust hashing.

Manuscript received December 01, 2008; revised February 05, 2009. Current version published April 24, 2009. This work was supported in part by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2008-313-D00783) and in part by the Korea Science and Engineering Foundation (KOSEF) Grant funded by the Korean Government (MOST) (R0A-2007-000-10022-0). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Vesa Valimaki.
The authors are with the School of Electrical Engineering and the Institute of New Media and Communications, Seoul National University, Seoul 151-742, Korea (e-mail: hsyun@hi.snu.ac.kr).
Digital Object Identifier 10.1109/LSP.2009.2016837
1070-9908/$25.00 © 2009 IEEE

I. INTRODUCTION

AN AUDIO fingerprint is a compact low-level content-based digest of an audio signal. It provides the ability to identify short, unlabeled audio clips in a fast and reliable way. The applications of audio fingerprinting include broadcast monitoring, audio connecting, file sharing and automatic music library organization [1]. There are several practical requirements which a successful audio fingerprinting system should satisfy [2]. First, it should be able to identify corrupted audio clips in spite of degradations. Second, it should be able to identify clips that are only a few seconds long. Finally, it should be computationally efficient, both in calculating the fingerprints and in searching for the best match in the database. The application demands and the difficulties of designing such systems have boosted the interest in audio fingerprinting techniques, and many practical issues have been studied.

Among various algorithms, the system developed by A. Wang from Shazam [3] has been considered a successful and widespread work. In addition, the Philips Robust Hash (PRH) algorithm [1] is a well studied content-based audio identification technique. The robustness of the PRH algorithm has been verified mathematically by analyzing the overall bit error probability [4] or bit error rate (BER) [5]. It is important for any fingerprinting algorithm that it not only results in few bit errors, but also allows for efficient searching [6]. Specifically, the PRH algorithm relies on the assumption that at least one of the so-called subfingerprints is invariant to noise. Although this assumption holds in mild conditions (e.g., MP3 compression, downsampling, and equalization), it fails for some seriously corrupted audio clips such as those recorded in a noisy environment, which results in a serious performance degradation.

For such reasons, several efforts have been made to improve the robustness of subfingerprints. One possible method mentioned in [1] suggests generating a list of the most probable candidates for each subfingerprint based on the reliability information obtained from soft coding, although the reliability information is in fact not reliable and the benefit is less pronounced in practical implementations. Another method is to view the extraction of subfingerprints from the spectrogram as 2-D filtering in the spectro-temporal domain and to try substituting the filters [7], which is empirical and not founded on a theoretical basis [4]. These approaches attempt to enhance the robustness of individual subfingerprints separately, while joint improvement is considered more desirable.

In this letter, we propose a novel audio fingerprinting technique that generates multiple subfingerprints for each frame, and the corresponding database searching scheme is also extended. Specifically, the discrete cosine transform (DCT) is applied to the temporal sequence of energies in each subband, and one subfingerprint is generated for each DCT coefficient. Experimental results show that the proposed approach outperforms the PRH algorithm under various environments.

II. PRH ALGORITHM

There are two steps in the PRH algorithm: fingerprint extraction and database searching. The overall block diagram for the fingerprint extraction stage is illustrated in Fig. 1. First, the audio signal is divided into overlapping frames with a length of about 370 ms, and the frame shift is 1/32 of the frame length. Second, the power spectrum is obtained by performing an FFT, and then the energies for 33 non-overlapping logarithmically spaced subbands (e.g., Bark scale) covering the frequency range from 300 Hz to 2000 Hz are calculated. Finally, hash strings (referred to as subfingerprints) are computed from the subband energies in each frame as follows:

$$ED(n, m) = E(n, m) - E(n, m+1) - \bigl(E(n-1, m) - E(n-1, m+1)\bigr) \tag{1}$$

and

$$F(n) = \bigl[F(n, 0), F(n, 1), \ldots, F(n, 31)\bigr] \tag{2}$$

in which

$$F(n, m) = \begin{cases} 1, & ED(n, m) > 0 \\ 0, & ED(n, m) \le 0 \end{cases} \tag{3}$$

In (1), $E(n, m)$ denotes the $m$th subband energy in the $n$th frame, and $ED(n, m)$ is the output difference. $F(n)$ is the 32-bit subfingerprint of frame $n$ and $F(n, m)$ is the $m$th bit of it.

For the audio files stored in the database, all the computed subfingerprints are registered in a hash table with the subfingerprints being treated as the keys. Each entry of the hash table stores a list of pointers to the positions in the audio files where the subfingerprint occurs. In the database searching stage, 256 subfingerprints, which amount to approximately 3 seconds, are extracted from the query audio, and each subfingerprint is matched with the hash table contents to find the candidate positions where it may come from. The process of generating the candidates is illustrated in Fig. 2. A fingerprint block with the same size as the query block ($256 \times 32$ bits) is obtained from the candidate position, and the BER between the two blocks is computed and compared with a threshold, which is set to 0.35 in [1]. If the BER is less than the threshold, the two signals are considered similar and the candidate audio is declared as the result.

Fig. 1. Overall block diagram of the fingerprint extraction stage in the PRH algorithm.

Fig. 2. Illustration of generating candidates from the hash table in the PRH algorithm.

III. MULTIPLE HASHING ALGORITHM

The subfingerprints generated by the PRH algorithm are the spectro-temporal differences between adjacent frames and subbands. This method provides an efficient way to summarize the discriminative information of the audio spectra. However, since it only makes use of the information in two neighboring frames, it may be vulnerable to interference. The bits in subfingerprints can be flipped by local noise and mislead the audio fingerprint search. The problem can be alleviated by including more frames and performing low-pass filtering. By applying a set of low-pass filters, the information that is invariant to noise can be extracted for robust discrimination. The filters are designed to be orthogonal so that the output features are more distinct from each other.

In this section, we propose a new audio fingerprinting technique called the multiple hashing (MLH) method. In the proposed algorithm, DCT is applied to the temporal sequence of energies in each subband, and a subfingerprint is constructed for each DCT coefficient stream. The reasons for employing DCT in the MLH method are twofold. First, among all the orthogonal transforms, the decorrelation performance of DCT is closest to that of the Karhunen–Loève transform [9]. Second, DCT has a strong energy compaction property [8], implying that most of the signal energy tends to be concentrated in a few low-frequency components. The decorrelation property ensures that each subfingerprint can be treated separately and that performance improvement is possible by generating more subfingerprints. If the subband energies evolve slowly, only a few DCT coefficients are sufficient to describe the subfingerprints.

Fig. 3. Overall block diagram of the fingerprint extraction stage in the MLH method.

The framework of fingerprint extraction in the MLH system is depicted in Fig. 3. The first three parts, i.e., framing, FFT and band energy calculation, are the same as those in the PRH algorithm. However, in contrast to the PRH algorithm, before computing the hash strings, an $N$-point DCT is performed on $N$ consecutive subband energies of each subband. Among the $N$ DCT coefficients, only the $K$ lower-ordered values are retained for the computation of subfingerprints. As a result, we obtain $K$ coefficients for each frame $n$ and subband $m$, denoted by $C_k(n, m)$, $k = 1, \ldots, K$. Then, each DCT coefficient is processed in a way similar to the PRH algorithm as follows:

$$ED_k(n, m) = C_k(n, m) - C_k(n, m+1) - \bigl(C_k(n-N, m) - C_k(n-N, m+1)\bigr) \tag{4}$$

where $ED_k(n, m)$ represents the $k$th output difference of subband $m$ in frame $n$. Let $F_k(n)$ denote the $k$th subfingerprint in frame $n$. Then,

$$F_k(n) = \bigl[F_k(n, 0), F_k(n, 1), \ldots, F_k(n, 31)\bigr] \tag{5}$$

in which

$$F_k(n, m) = \begin{cases} 1, & ED_k(n, m) > 0 \\ 0, & ED_k(n, m) \le 0 \end{cases} \tag{6}$$

Note that in (4) we use $C_k(n-N, m)$ and $C_k(n-N, m+1)$ instead of $C_k(n-1, m)$ and $C_k(n-1, m+1)$ to ensure that they are obtained from band energies which do not overlap with those used to compute $C_k(n, m)$ and $C_k(n, m+1)$.

As in the PRH system, the subfingerprints extracted from the database are registered in hash tables to enable an efficient searching algorithm. Since $K$ subfingerprints are computed for each frame, $K$ hash tables are constructed. For example, the subfingerprints obtained from the second DCT coefficients in each frame are registered in the second hash table. The database searching scheme consists of three steps. First, the query audio is divided into 256 frames, and $K$ subfingerprints are obtained in each frame as in the fingerprint extraction phase. Consequently, the query fingerprint block consists of $K \times 256 \times 32$ bits, and the $k$th subfingerprints $F_k(n)$ of the 256 frames form the fingerprint block $B_k$. Second, the candidate positions are generated in each hash table separately, i.e., the subfingerprints in fingerprint block $B_k$ are matched with the contents in the $k$th hash table as in the PRH algorithm, and a candidate list is created by accumulating the search results from all included hash tables. Finally, BERs are computed by comparing the query fingerprint block with those stored at the candidate positions in the database, and the candidate with the most hits and a BER less than the specified threshold is returned as the result. The process of generating the candidates is illustrated in Fig. 4.

Fig. 4. Illustration of generating candidates from the hash tables in the MLH method.

IV. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed MLH approach, we conducted several experiments under various environments. The database used in the experiments included 1500 music files collected from commercial compact discs. The database consisted of three groups of 500 files from the classical, pop and rock genres. As for the positive queries, 1200 music clips with a length of about 3 s were randomly chosen from the database, among which there were 393 rock, 426 classical, and 381 pop pieces. On the other hand, in order to compute the false positive rates, 200 music files that were not included in the database were collected, from which another 1200 music clips were randomly extracted to form the negative queries. To assess the robustness of the algorithm, the following distortions were applied to both kinds of queries.
Set 1: Additive white noise with the SNR at 5 dB.
Set 2: Additive white noise with the SNR at 0 dB.
Set 3: Playing and recording in a very quiet environment.
Set 4: Playing and recording in an office noise environment.

There are several parameters which should be determined for the implementation of the MLH method. In our experiments, the DCT length $N$ was set to 16, which was considered to provide a good compromise between the frequency and time resolutions. As for $K$, we set $K = 4$ since more than 90% of the total energy was found to be concentrated in the first four coefficients in the tested materials. Finally, to speed up computation, we applied the running DCT algorithm [10], since the DCT window shifts by one sample at each step. To compare the performances, we also implemented the PRH algorithm [1].

As can be seen from the description of the algorithms, there are two types of false negatives. First, a distorted positive query clip is rejected when the BER is higher than the threshold. Second, a distorted positive query clip is dismissed when there are no error-free subfingerprints in the query block, even when the BER between the query fingerprint block and the candidate it comes from in the database is less than the threshold. The first type of false negative is due to the tradeoff in setting a proper threshold, while the second type comes with the hashing structure in the searching algorithm. On the other hand, there is only one category of false positives, in which a negative query clip is accepted as being in the database since the BER between the obtained fingerprint block and the candidate pointed to by the hash table is less than the threshold. In order to evaluate the robustness of the system, we plotted $1 - \mathrm{FPR}$ (false positive rate) versus $1 - \mathrm{FNR}$ (false negative rate) by varying the threshold. The obtained receiver operating characteristic (ROC) curves are shown in Fig. 5, from which we can observe the superiority of the proposed method. Also note that we use the term recognition rate in the following experiments to represent $1 - \mathrm{FNR}$; therefore it differs from the

conventional recall rate in that the latter excludes only the first type of false negatives.

Fig. 5. ROC curves for four query sets; (a) Set 1. (b) Set 2. (c) Set 3. (d) Set 4.

In addition, to show the performance variation with different combinations of hash tables, we set the threshold to 0.35 as in [1] and measured the results. The four hash tables employed are denoted as HT1, HT2, HT3 and HT4. Here HT$k$ was built from the $k$th DCT coefficients; for example, HT1 was constructed from the DC components. For comparison, the hash table constructed in the PRH algorithm is represented as HT0. For each query set, the MLH algorithms were applied using different combinations of hash tables. The recognition rate achieved by each algorithm is given in Table I. Note that "HT" in HT$k$ is omitted in the table for simplicity; for example, "1, 2" means the MLH method using hash tables HT1 and HT2.

TABLE I
RECOGNITION RATES (%) OF THE PRH AND MLH ALGORITHMS WITH DIFFERENT COMBINATIONS OF HASH TABLES

It can be seen that the performance using HT1 was better than that of the PRH algorithm. Moreover, as more hash tables were added, the recognition rate improved dramatically. However, it is worth noting that if more hash tables are used, the memory usage and the computational burden also increase. Specifically, when $K$ hash tables are used, the memory usage is $K$ times that of the PRH algorithm, and the computational complexity, in the worst case, also becomes $K$ times as large. Thus the number of hash tables needs to be chosen carefully depending on the environment in which the audio fingerprinting system would be deployed. As can be seen from the experimental results, when the clips are corrupted by mild noise, the MLH method using only one hash table already achieves satisfactory results. For seriously distorted clips, however, more hash tables should be added to achieve better performance at the expense of higher memory usage and computational burden.

V. CONCLUSION

In this letter, we have presented a novel audio fingerprinting technique based on multiple hashing in the DCT domain. Specifically, DCT is applied to the temporal sequence of energies in each subband and only the lower-ordered coefficients are retained to compute the subfingerprints. Multiple hash tables are built corresponding to the multiple subfingerprints in each frame, and different combinations of hash tables are evaluated. Experimental results have shown that the proposed MLH scheme outperformed the conventional PRH algorithm under various conditions. Future work may include the investigation of an efficient method that uses a reduced number of hash tables while maintaining the high accuracy of MLH.

REFERENCES
[1] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in Proc. 3rd Int. Conf. Music Information Retrieval, Oct. 2002, pp. 107–115.
[2] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of audio fingerprinting,” J. VLSI Signal Process., vol. 41, no. 3, pp. 271–284, Nov. 2005.
[3] A. Wang, “An industrial strength audio search algorithm,” in Proc. 4th Int. Conf. Music Information Retrieval, Oct. 2003, pp. 7–13.
[4] F. Balado, N. Hurley, E. McCarthy, and G. Silvestre, “Performance analysis of robust audio hashing,” IEEE Trans. Inform. Forensics Security, vol. 2, no. 2, pp. 254–266, June 2007.
[5] P. Doets and R. Lagendijk, “Distortion estimation in compressed music using only audio fingerprints,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 2, pp. 302–317, Feb. 2008.
[6] J. Haitsma and T. Kalker, “Speed-change resistant audio fingerprinting using auto-correlation,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Apr. 2003, vol. 4, pp. 728–731.
[7] M. Park, H. Kim, Y. Ro, and M. Kim, “Frequency filtering for a highly robust audio fingerprinting scheme in a real-noise environment,” IEICE Trans. Inform. Syst., vol. E89-D, no. 7, pp. 2324–2327, July 2006.
[8] K. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. New York: Academic, 1990.
[9] N. Ahmed, T. Natarajan, and K. Rao, “Discrete cosine transform,” IEEE Trans. Comput., pp. 90–93, Jan. 1974.
[10] J. Xi and J. Chicharo, “Computing running DCTs and DSTs based on their second-order shift properties,” IEEE Trans. Circuits Syst. I: Fund. Theory Applicat., vol. 47, no. 5, pp. 779–783, May 2000.
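
As a rough illustration of the MLH fingerprint extraction stage described in Section III, the following Python sketch turns a matrix of subband energies into one stream of 32-bit subfingerprints per retained DCT coefficient. It is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the unnormalized DCT-II, the forward window of 16 consecutive frames starting at the current frame, the zero-based coefficient index (index 0 is the DC term), and the bit order inside each word are all choices made here for illustration. A direct DCT is used in place of the running DCT of [10] to keep the code short.

```python
import math


def dct_lowpass(window, n_keep):
    """First n_keep unnormalized DCT-II coefficients of a sequence."""
    n = len(window)
    return [
        sum(x * math.cos(math.pi * (t + 0.5) * k / n) for t, x in enumerate(window))
        for k in range(n_keep)
    ]


def mlh_subfingerprints(energies, n_points=16, n_keep=4):
    """Compute MLH-style subfingerprints (illustrative sketch).

    energies[n][m] is the energy of subband m in frame n; 33 subbands
    yield 32 bits per subfingerprint. Returns hashes[k], the list of
    integer subfingerprints built from the k-th DCT coefficient.
    """
    n_frames = len(energies)
    n_bands = len(energies[0])
    n_windows = n_frames - n_points + 1

    # coeffs[n][m][k]: k-th DCT coefficient of subband m over frames
    # n .. n + n_points - 1 (assumed window alignment).
    coeffs = [
        [
            dct_lowpass([energies[n + t][m] for t in range(n_points)], n_keep)
            for m in range(n_bands)
        ]
        for n in range(n_windows)
    ]

    hashes = [[] for _ in range(n_keep)]
    # A lag of n_points frames keeps the two DCT windows non-overlapping.
    for n in range(n_points, n_windows):
        for k in range(n_keep):
            value = 0
            for m in range(n_bands - 1):
                current = coeffs[n][m][k] - coeffs[n][m + 1][k]
                previous = coeffs[n - n_points][m][k] - coeffs[n - n_points][m + 1][k]
                bit = 1 if current - previous > 0 else 0
                value = (value << 1) | bit  # band 0 ends up in the top bit
            hashes[k].append(value)
    return hashes
```

With 33 subbands per frame, each returned stream `hashes[k]` could serve as the keys of one hash table. Because only the sign of each difference is kept, scaling all energies by a positive gain leaves the subfingerprints unchanged in this sketch.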

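
The database search with several hash tables can be sketched in a similar spirit. The code below registers every subfingerprint stream in its own table, accumulates hits for each candidate alignment across all tables, and returns the candidate with the most hits whose bit error rate (BER) is below the threshold of 0.35 taken from [1]. This is a hedged sketch with hypothetical names and data layout (`build_tables`, `search`, a dictionary of per-song streams). Computing the BER over all streams jointly is an assumption made here, since the letter does not spell out that detail.

```python
from collections import defaultdict


def build_tables(db_hashes):
    """db_hashes[song_id][k] is the k-th subfingerprint stream of a song."""
    n_tables = len(next(iter(db_hashes.values())))
    tables = [defaultdict(list) for _ in range(n_tables)]
    for song_id, streams in db_hashes.items():
        for k, stream in enumerate(streams):
            for pos, value in enumerate(stream):
                tables[k][value].append((song_id, pos))
    return tables


def bit_error_rate(a, b, n_bits=32):
    """Fraction of differing bits between two equal-length word lists."""
    errors = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return errors / (n_bits * len(a))


def search(query, db_hashes, tables, threshold=0.35):
    """query[k] is the k-th subfingerprint stream of the query block."""
    block_len = len(query[0])
    hits = defaultdict(int)
    for k, stream in enumerate(query):
        for offset, value in enumerate(stream):
            # An exact match in table k proposes a start position.
            for song_id, pos in tables[k].get(value, ()):
                start = pos - offset
                if 0 <= start <= len(db_hashes[song_id][k]) - block_len:
                    hits[(song_id, start)] += 1

    # Try candidates from most to fewest hits; accept the first with low BER.
    for (song_id, start), _ in sorted(hits.items(), key=lambda item: -item[1]):
        query_words = [w for stream in query for w in stream]
        db_words = [
            w
            for stream in db_hashes[song_id]
            for w in stream[start:start + block_len]
        ]
        ber = bit_error_rate(query_words, db_words)
        if ber < threshold:
            return song_id, start, ber
    return None
```

A query whose streams share no exact subfingerprint with the database produces no candidates and returns `None`, which corresponds to the second type of false negative discussed in Section IV.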