Exposing MP3 Audio Forgeries Using Frame Offsets

RUI YANG, ZHENHUA QU, and JIWU HUANG, Sun Yat-sen University
Audio recordings should be authenticated before they are used as evidence. Although audio watermarking and signature are
widely applied for authentication, these two techniques require accessing the original audio before it is published. Passive
authentication is necessary for digital audio, especially for the most popular audio format: MP3. In this article, we propose
a passive approach to detect forgeries of MP3 audio. During the process of MP3 encoding the audio samples are divided into
frames, and thus each frame has its own frame offset after encoding. Forgeries lead to the breaking of framing grids. So the frame
offset is a good indication for locating forgeries, and it can be retrieved by the identification of the quantization characteristic.
In this way, the doctored positions can be automatically located. Experimental results demonstrate that the proposed approach
is effective in detecting some common forgeries, such as deletion, insertion, substitution, and splicing. Even when the bit rate is
as low as 32 kbps, the detection rate is above 99%.
Categories and Subject Descriptors: H.4.0 [Information Systems Applications]: General; K.6.5 [Management of Comput-
ing and Information Systems]: Security and Protection
General Terms: Security, Algorithms, Verification
Additional Key Words and Phrases: MP3 audio forgery, forgery detection, audio authentication
ACM Reference Format:
Yang, R., Qu, Z., and Huang, J. 2012. Exposing MP3 audio forgeries using frame offsets. ACM Trans. Multimedia Comput.
Commun. Appl. 8, S2, Article 35 (September 2012), 20 pages.
DOI = 10.1145/2344436.2344441 http://doi.acm.org/10.1145/2344436.2344441
1. INTRODUCTION
With the development of digital voice recorders and cell phones, nowadays speech and conversation
can be easily recorded as evidence. However, hearing cannot be believing since these audio recordings
can be tampered with very easily by pervasive audio editing software. An audio recording may contain
some important words or sentences synthesized from other audio, so authentication technologies
need to be developed for digital audio. The existing audio authentication technologies can be divided
into two groups: active authentication (including digital watermarking and digital signature) and pas-
sive authentication. Active authentication requires accessing original audio before it is distributed,
for example, embedding a watermark or generating a signature, while passive audio authentication
A portion of this article was presented at the 10th ACM Multimedia and Security Workshop.
The work was supported in part by 973 Program (2011CB302204) in China and NSFC (U1135001, 61202497).
J. Huang is also a visiting researcher of State Key Laboratory of Information Security, Beijing 100190, China.
Authors' addresses: R. Yang, Z. Qu, and J. Huang (corresponding author), Sun Yat-sen University, Guangzhou 510006, China;
email: isshjw@mail.sysu.edu.cn.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided
that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page
or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to
lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be
requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481,
or permissions@acm.org.
© 2012 ACM 1551-6857/2012/09-ART35 $15.00
ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 8, No. S2, Article 35, Publication date: September 2012.
means checking the integrity of audio recording by analyzing its inherent properties. In most authen-
tication cases, audio does not actually contain any digital watermark or signature. Thus it is necessary
to passively examine the integrity of the digital audio.
Until now, there have been few works on passive authentication for digital audio. Based on the assumption
that a natural signal has weak higher-order statistical correlations in the frequency domain and that
forgery in speech would introduce unnatural correlations, Farid [1999] used bispectral analysis to
detect digital forgery for speech signals. It was shown that the zero phase of bispectral decreased a
lot for forged speech. However, the method is only suitable for uncompressed audio. Grigoras [2005]
pointed out that digital equipment captures not only the intended speech but also the 50/60 Hz Electric
Network Frequency (ENF) when recording. The ENF criterion could be used to check the integrity
of digital audio recordings and to verify the exact time when a digital recording was created. This
could be done by comparing the ENF of audio recordings with a reference frequency database from the
electric company or the laboratory. The method is highly dependent on the accuracy of the extracted
ENF, while ENF is a quite weak signal compared to the audio recording. Dittmann et al. [Kraetzer
et al. 2007] proposed a method to determine the authenticity of the speaker's environment. In their
paper it was said that the extraction of the background features in an audio stream could provide an
informative basis for determining the location of its origin and the used microphone. But a lot of audio
recordings are required for training.
MP3 audio format is popularly used in most applications, and is now the most popular format among
all formats in digital voice recorders. The top 20 best-selling digital voice recorders of amazon.com all
support the MP3 format, and some of them only support the MP3 format. For most cell phones, the
default recording format is the MP3 format. Digital voice recorders and cell phones are the recording
devices people use most frequently in daily life. It is fairly easy to remove complete sections
of a recording or splice two sentences from different recordings. Small changes in the audio stream
can change the meaning of a whole sentence. Exposing forgeries in MP3 files can authenticate
everyday recordings presented as evidence in criminal and civil court cases, such as undercover
surveillance recordings made by the police, recordings presented by feuding parties in a divorce,
recorded telephone conversations in domestic violence cases, and recordings from corporations seeking
to prove employee wrongdoing or industrial espionage. At the same time, forgery detection solutions
are needed by manufacturers of audio recording equipment.
As yet, there are no reported passive authentication methods focusing on MP3 format audio. An
existing related work is the classification of MP3 encoders, which was proposed by Boehm and Westfeld
[2004]. The work outlines a method to discriminate 20 different MP3 encoders with 10 features. Experimental
results show that these features classify MP3 encoders accurately and can improve
the performance of MP3 steganalysis. The application of the method to passive authentication is not
discussed in the paper. Theoretically the method could handle tampered audio by splicing audio from
different recorders, but tampering within an audio recording is out of its range. As MP3 audio becomes
popular, it is necessary to develop passive approaches to check the integrity of MP3 audio.
Passive authentication on JPEG image and MPEG video has attracted many researchers. Some ap-
proaches have been proposed, such as the quantization-table-based method [Lukas and Fridrich 2003],
the periodical-artifacts-based method [Popescu and Farid 2004], the Benford's-law-based method [Fu et al.
2007], and the shift double JPEG detection-based method [Qu et al. 2008]. One direct question arises:
can these methods be applied to passive authentication on MP3? Unfortunately, direct extension of
the existing JPEG methods to MP3 audio does not work, because there are many differences between
MP3 compression and JPEG compression. For example, an MP3 encoder divides the samples of the
time domain into frames with 50% overlap, while JPEG compression is without overlap. This leads to
the impossibility of detection of block artifacts in MP3 compression. The calculation and quantization

Fig. 1. Block diagram of MP3: (a) encoder; (b) decoder.
in MP3 compression are performed with floating-point representation. So the quantization-table-based
method in JPEG, which performs well with integer numbers, is useless for MP3 compression.
In this article we propose a forgery detection method for digital audio in MP3 format. Note that
forgeries of MP3 files are always performed in this way: first decoding, then tampering, and finally
re-encoding. Based on the discovery that forgeries break the original frame segmentation, we utilize
frame offsets to locate forgeries automatically. The original frame offsets are retrieved by a quantization
characteristic. Via extensive experiments, it is shown that the proposed method can detect most
common forgeries, such as deletion, insertion, substitution, and splicing. At the same time, the proposed
method is robust to some common postprocessing operations such as filtering and adding noise.
The article is organized as follows. In Section 2, we give a brief analysis of MP3 coding and claim that
only identical frame offsetting can introduce the quantized spectral characteristic. Then we develop a
method to detect frame offsets in Section 3. Based on the detection method, we propose that the change
of frame offsets could locate forgeries effectively in Section 4. The experimental results are shown in
Section 5. Finally, we conclude our article with a discussion and future work in Section 6.
2. ANALYSIS OF MP3 COMPRESSION CHARACTERISTICS
In this section, we first give a brief overview of MP3 coding, then explain two important concepts of
this article: frame offset and quantization characteristics. In Section 2.1 we only explain those principles
that are relevant to our detection method, especially the spectral decomposition and quantization.
For the detailed architecture and specification of MP3 coding, refer to ISO [1992]. In Section 2.2,
the definition of frame offset is demonstrated via an example. In Section 2.3, the quantization characteristics
are analyzed.
2.1 MP3 Coding
Figure 1(a) shows the block diagram of a typical MP3 encoder [Painter and Spanias 2000]. The input
PCM signal is first separated into 32 sub-bands by the analysis filterbank, and the Modified Discrete
Cosine Transform (MDCT) window further divides each of these 32 sub-bands into 18 sub-bands (long
windows) or 6 sub-bands (short windows). A total of 576 (32 × 18) or 192 (32 × 6) spectral lines are
then generated, respectively.
Fig. 2. Framing grids and frame offsets. The top panel shows three continuous framing grids for the first encoding, and the
bottom panel shows the corresponding frame grids for the second encoding. The frame offsets of the three framing grids are
identical.
The psychoacoustic model analyses the audio content and estimates the masking thresholds. The
output of this model consists of the just noticeable noise level for each sub-band and the information
about the window type for MDCT.
According to the masking thresholds estimated by the psychoacoustic model, the spectral values
are quantized via a power-law quantizer. The quantization step introduces an iterative algorithm to
control both the bit rate and the distortion level, so that the perceived distortion is as small as possible,
under the limitations of the desired bit rate. Finally, the quantized spectral values are encoded using
Huffman code tables to form a bitstream.
The block diagram of an MP3 decoder is shown in Figure 1(b). First, Huffman decoding is performed on
the MP3 bitstream, and then the decoder restores the quantized MDCT coefficient values and the side
information related to them, such as the window type that is assigned to each frame. After inverse
quantization, the coefficients are inverse-MDCT transformed to the sub-band domain. Finally, the
PCM waveforms are reconstructed by the synthesis filterbank.
2.2 Frame Offset
The frame offset [Yang et al. 2008] is defined in this article as the number of samples by which the frame
grid is shifted between the first and second encoding. Note that forgeries of MP3 files are always performed
in this way: first decoding, then tampering, and finally re-encoding. So the frame offset becomes
nonzero when forgeries are conducted on MP3 files, and is always zero when there is no forgery. Figure 2
illustrates how a frame offset is generated. When performing the first encoding, the framing grids
of the original signal are shown in the top of Figure 2. Each framing grid contains 1152 samples with
50% overlap. After decoding, some extra zero samples are added at the beginning of the signal by the

Fig. 3. Unquantized and quantized spectral coefficients, plotted against frequency index: (a) unquantized and (b) quantized
spectral in a real value form; (c) unquantized and (d) quantized spectral in a logarithmic representation (magnitude in dB).
The major difference between the unquantized and quantized spectral is the number of zero coefficients, which
are shown as troughs: (c) has no troughs, while (d) has many.
decoder. During the second encoding, new framing grids are generated. Obviously, if forgeries occur,
frame offsets of some frames may change.
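As a small illustration of how an edit shifts the framing grid, the following Python sketch computes the offset that would be expected after a single deletion or insertion. It assumes the convention used later in Section 4 (the offset j, with 0 ≤ j < 576, is the number of zero samples that must be prepended to realign the signal with the original grid) and ignores encoder and decoder delay. The function names are hypothetical, and the sign convention is an assumption of this sketch.

```python
HOP = 576  # frame advance: 1152 samples per frame with 50% overlap


def offset_after_deletion(num_deleted):
    """Expected offset of the frames that follow a deletion of num_deleted samples."""
    # Samples after the cut move num_deleted positions earlier, so prepending
    # num_deleted (mod 576) zeros puts them back on the original grid.
    return num_deleted % HOP


def offset_after_insertion(num_inserted):
    """Expected offset of the frames that follow an insertion of num_inserted samples."""
    # Samples after the insertion move later; the grid realigns once enough
    # zeros are prepended to reach the next multiple of 576.
    return (-num_inserted) % HOP


print(offset_after_deletion(200), offset_after_insertion(200), offset_after_deletion(576))
# prints: 200 376 0
```

Under these assumptions, an edit whose length is an exact multiple of 576 samples leaves the offset at 0.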
2.3 Quantization Characteristics
Many spectral coefficients are usually quantized to zero during the encoding. This is because some spectral
components are completely masked by other components, and because the inherent probability distribution
of the spectral coefficients places many coefficients around zero. The increase in
zero spectral coefficients is a quantization characteristic of MP3 coding. This characteristic was first described
by Herre and Schug [2000] and Herre et al. [2002], who utilized it to optimize cascaded audio
coding. In the following, we analyze this characteristic.
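The effect can be sketched with a toy quantizer. The snippet below assumes the commonly quoted Layer III power-law form, ix = nint((|xr| / step)^(3/4) − 0.0946), with a single fixed step size. A real encoder adapts the step per band through its rate and distortion loops, so this is only an illustration, and the function name is hypothetical.

```python
import numpy as np


def power_law_quantize(xr, step):
    """Quantize spectral values with a 3/4 power law (assumed form, fixed step)."""
    xr = np.asarray(xr, dtype=float)
    scaled = np.abs(xr) / step
    # nint(.) implemented as floor(. + 0.5); small magnitudes land on 0.
    ix = np.floor(scaled ** 0.75 - 0.0946 + 0.5).astype(int)
    return np.sign(xr).astype(int) * np.maximum(ix, 0)


coefficients = np.array([0.01, -0.3, 0.4, -2.0])
quantized = power_law_quantize(coefficients, step=1.0)
print(quantized)                    # the three small values become 0
print(np.count_nonzero(quantized))  # number of active coefficients
```

With a step of 1.0, every coefficient whose magnitude is below roughly half a step is mapped to zero, which is the source of the troughs discussed next.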
The difference between an unquantized spectral coefficient and its quantized one is not easily visible
in their real value form, as illustrated in Figures 3(a) and (b). But they can be discriminated by looking
at the spectral coefficients in a logarithmic representation. As shown in Figures 3(c) and (d), there are
many zero values which appear as troughs in the quantized spectral, while this phenomenon cannot
be found in the unquantized spectral.
These troughs in the spectral representation will be visible only if the framing grids are the same as
those in the first encoding. This means that only if the identical frame offset with the first encoding is
Fig. 4. Spectral coefficients (magnitude in dB versus frequency index) with frame offsets of −1, 0, and +1 samples:
(a) offset = −1; (b) offset = 0; (c) offset = +1. The quantization characteristics appear only if the
correct frame offset (0) is applied.
applied will these troughs appear. This fact is illustrated by Figure 4, which shows MDCT coefficients
of a decoded signal with a one-sample-left shift (offset = −1), no shift (offset = 0), and a one-sample-right
shift (offset = +1) from the encoder framing grid, respectively. As we see, the troughs disappear
even when the frame offset is a single-sample shift in the decoded signal.
3. METHOD OF RETRIEVING FRAME OFFSETS
The key to detecting frame offsets is the identification of quantization characteristics. In this section,
we develop a method of retrieving frame offsets based on the observations in the previous section.
3.1 Number of Active Coefficients
From Figure 4, it is noted that a significant difference between spectral coefficients without offset
(Figure 4(b)) and with offset (Figures 4(a) and (c)) is the number of active (nonzero) spectral coefficients.
For convenience, we denote the number of active coefficients as NAC in this article. In Figure 4, the
NACs for offsets −1 and +1 (shifted offsets) are 306 and 300, respectively, while the NAC for offset 0
(matching offset) is only 197. For a robust and automatic identification of the characteristic spectral,
the NACs as a function of frame offset can be used as a feature. Such a criterion yields reliable results,
as shown in Figure 5. We observe that the beginning of each frame is clearly detectable by an obvious
decrease in the NACs. A period of 576 can be observed. Why is there a period of 576? Note that
576 = 1152 × 50%, where 1152 is the length of a frame and 50% is the amount of overlap specified by
the MP3 standard. A frame with offset 576 exactly corresponds to the next frame.

Fig. 5. NACs via different frame offsets (number of active coefficients versus frame offset). NAC achieves minimums when the
frame offsets are multiples of 576.
3.2 Theoretical Analysis
Now let us examine why the quantization characteristics appear only if the matching offset is applied.
It arises from the inherent property of MDCT. The MDCT transform performed in MP3 coding is as
follows [Wang and Velermo 2003].
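\[
X^{(p)}[k] = \sqrt{\frac{2}{N}} \sum_{n=0}^{2N-1} x^{(p)}[n]\, h[n] \cos\left[\frac{\pi}{N}\left(n + \frac{N+1}{2}\right)\left(k + \frac{1}{2}\right)\right], \quad 0 \le k \le N-1 \tag{1}
\]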
By applying an inverse-MDCT transform to the frame, we get 2N time-aliased samples.
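\[
\hat{x}^{(p)}[n] = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} X^{(p)}[k] \cos\left[\frac{\pi}{N}\left(n + \frac{N+1}{2}\right)\left(k + \frac{1}{2}\right)\right], \quad 0 \le n \le 2N-1 \tag{2}
\]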
In order to cancel the aliasing and get the original samples, we have to use the OLA (Overlapping
Addition) procedure. An inverse-MDCT is applied to the previous and the next frame. Then, each of
the resulting aliased segments is multiplied by its corresponding window function and the overlapping
time segments are added together. We thus recover the original samples.
\[
x^{(p)}[n] =
\begin{cases}
\hat{x}^{(p-1)}[n+N]\, h[N-n-1] + \hat{x}^{(p)}[n]\, h[n], & 0 \le n \le N-1 \\
\hat{x}^{(p)}[n]\, h[2N-n-1] + \hat{x}^{(p+1)}[n-N]\, h[n-N], & N \le n \le 2N-1
\end{cases} \tag{3}
\]
Denote that
\[
\tilde{x}^{(p)}[n] = x^{(p)}[n]\, h[n], \quad 0 \le n \le 2N-1. \tag{4}
\]
If a signal exhibits local symmetry such that
\[
\begin{cases}
\tilde{x}^{(p)}[n] = \tilde{x}^{(p)}[N-n-1], & 0 \le n \le N-1 \\
\tilde{x}^{(p)}[n] = -\tilde{x}^{(p)}[3N-n-1], & N \le n \le 2N-1
\end{cases} \tag{5}
\]
its MDCT coefficients become zero. That is, $X^{(p)}[k] = 0$ for $k = 0, \ldots, N-1$.
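The reason can be sketched in a few lines. The derivation below is an illustration, not taken from the cited works; it writes $\theta_k(n)$ (a symbol introduced here) for the cosine argument of the MDCT and $\tilde{x}[n]$ for the windowed samples, and it assumes that the windowed signal is even about the centre of the first half-frame and odd about the centre of the second half-frame.

```latex
\begin{align*}
\theta_k(n) &= \frac{\pi}{N}\left(n + \frac{N+1}{2}\right)\left(k + \frac{1}{2}\right) \\
% Mirror image within the first half-frame (n -> N-1-n):
\theta_k(N-1-n) &= (2k+1)\pi - \theta_k(n)
  \;\Rightarrow\; \cos\theta_k(N-1-n) = -\cos\theta_k(n) \\
% Mirror image within the second half-frame (n -> 3N-1-n):
\theta_k(3N-1-n) &= 2(2k+1)\pi - \theta_k(n)
  \;\Rightarrow\; \cos\theta_k(3N-1-n) = \cos\theta_k(n) \\
% Pairing each sample with its mirror image in the MDCT sum:
\tilde{x}[n]\cos\theta_k(n) + \tilde{x}[N-1-n]\cos\theta_k(N-1-n)
  &= \bigl(\tilde{x}[n] - \tilde{x}[N-1-n]\bigr)\cos\theta_k(n) = 0 \\
\tilde{x}[n]\cos\theta_k(n) + \tilde{x}[3N-1-n]\cos\theta_k(3N-1-n)
  &= \bigl(\tilde{x}[n] + \tilde{x}[3N-1-n]\bigr)\cos\theta_k(n) = 0
\end{align*}
```

Every pair of terms in the MDCT sum therefore cancels, whatever the value of $k$, so all $N$ coefficients vanish.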
In Wang et al. [2000], it has been proven that $\tilde{x}^{(p)}[n]$ fulfills Eq. (5) if $X^{(p)}[k] = 0$. This inherent
property of the MDCT gives the answer to why NAC has a significant decrease only if the identical
frame offset is applied. After MP3 encoding, many spectral coefficients are masked or quantized to
Table I. Mean Value and Standard Deviation of NACs at Different Bit Rates
            Shifted NACs       Matching NAC
Bit rate    Mean     Std       Mean     Std
32 kbps     175.61   13.45     67.80    12.34
64 kbps     313.46   19.99     178.38   11.06
96 kbps     331.72   18.30     249.15   25.07
128 kbps    345.45   19.14     310.23   25.60
zero. When decoding, these zero spectral coefficients are restored to the time domain, and $\tilde{x}^{(p)}[n]$ fulfills
Eq. (5). While performing MDCT on the decoded data with the identical frame offset to the first
encoding process, we will get a lot of $X^{(p)}[k]$ equal to zero. If there is a different frame offset, the local
symmetry in Eq. (5) is broken, and then the corresponding spectral $X^{(p)}[k]$ will not be zero.
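The argument can be checked numerically with a toy codec. The following Python sketch is a simplification under stated assumptions, not the MP3 pipeline: it omits the polyphase filterbank and the psychoacoustic model, uses a sine window with an orthonormal MDCT scaling, replaces the power-law quantizer with a uniform one, and feeds in random noise. All function and variable names are hypothetical.

```python
import numpy as np

N = 576  # frame advance; each frame spans 2N = 1152 samples

n = np.arange(2 * N)
k = np.arange(N)
window = np.sin(np.pi * (n + 0.5) / (2 * N))  # sine window (assumption)
basis = np.sqrt(2.0 / N) * np.cos(
    np.pi / N * np.outer(k + 0.5, n + (N + 1) / 2.0))  # shape (N, 2N)


def mdct(frame):
    """Windowed forward MDCT of one 2N-sample frame."""
    return basis @ (window * frame)


def imdct(coeffs):
    """Windowed inverse MDCT, giving 2N time-aliased samples."""
    return window * (basis.T @ coeffs)


def encode_decode(x, step):
    """Toy codec: MDCT, uniform quantization, inverse MDCT with overlap-add."""
    y = np.zeros(len(x))
    for p in range(len(x) // N - 1):
        coeffs = mdct(x[p * N:p * N + 2 * N])
        quantized = step * np.round(coeffs / step)  # small values become 0
        y[p * N:p * N + 2 * N] += imdct(quantized)
    return y


def nac(y, start, eps=1e-9):
    """Number of active (nonzero) MDCT coefficients of the frame at `start`."""
    return int(np.sum(np.abs(mdct(y[start:start + 2 * N])) > eps))


rng = np.random.default_rng(0)
x = rng.normal(scale=0.05, size=8 * N)
y = encode_decode(x, step=0.05)

grid_start = 3 * N  # an interior frame of the first encoding
nac_by_offset = {j: nac(y, grid_start + j) for j in (-2, -1, 0, 1, 2)}
print(nac_by_offset)
```

In this sketch the count at offset 0 is the smallest of the five, because only the matching grid reproduces the zeros introduced by the quantizer; a shift of a single sample is enough to fill them in.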
3.3 Experiments on Retrieving Frame Offsets
To illustrate the preceding analysis, we randomly select 30 different audio frames, and encode these
frames with LAME v3.97 [LAM 2012] at the bit rates of 32 kbps, 64 kbps, 96 kbps, and 128 kbps,
respectively. For each bit rate, we apply offsets from −575 to 575 on these frames, and calculate the NAC
corresponding to each offset. We thus get 1151 NACs in total for each frame. The 1150 NACs corresponding
to wrong offsets are called shifted NACs, and the NAC corresponding to the correct offset is
called the matching NAC. The shifted NACs and the matching NAC are plotted separately. As
shown in Figure 6, for each bit rate, there are 30 boxes representing the distribution of shifted NACs.
As shown in Figure 6(a), the minimum value of shifted NACs is larger than 150 for each frame, while
the matching NAC is below 80. For all frames, we observe that the matching NAC is clearly separated
from the shifted NACs. The cases of 64 kbps, 96 kbps, and 128 kbps are illustrated in Figures 6(b), (c), and
(d), respectively.
Although frames may be encoded with different bit rates, the matching NAC is always smaller than
the shifted NACs. This means that we can regard the minimum NAC as the matching NAC. From Figure 6,
we also notice that the distance between the shifted NACs and the matching NAC becomes smaller as
the bit rate increases. This is because less signal distortion and information loss occur when the bit rate is
higher, so the MDCT coefficients contain fewer zeros.
As the preceding investigation is based on only 30 frames, the conclusion may not be general enough.
In the following, we take statistics on 12800 frames, including 6400 frames of speech and 6400
frames of music. We compute 1150 shifted NACs and the matching NAC for each frame. Table I
displays the mean values and standard deviations of the NACs based on the 12800 frames. The
mean values of the shifted NACs and the matching NAC are clearly separated. The standard
deviations are all small compared to the mean values. However, as we noted before, the difference
between the shifted NACs and the matching NAC becomes small at a high bit rate, such as
128 kbps.
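One way to read this trend off Table I is to compare the two means as a ratio. The short Python sketch below does so with the values copied from the table; the ratio itself is a choice made here for illustration, not a measure defined in this article.

```python
# (mean shifted NAC, mean matching NAC) per bit rate in kbps, from Table I
table_i = {
    32: (175.61, 67.80),
    64: (313.46, 178.38),
    96: (331.72, 249.15),
    128: (345.45, 310.23),
}

ratios = {rate: matching / shifted
          for rate, (shifted, matching) in table_i.items()}

for rate, ratio in ratios.items():
    print(f"{rate:>3} kbps: matching/shifted = {ratio:.2f}")
```

The ratio grows steadily with the bit rate and approaches 1 at 128 kbps, which is where the matching offset becomes hardest to tell apart from the shifted ones.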
4. LOCATING FORGERIES VIA CHECKING FRAME OFFSETS
As audio samples are divided into frames for encoding, the frame offset can be useful evidence of
tampering. When forgeries occur, all frames after the forged points are affected, and the detected offsets
of the corresponding frames change. Figure 7 is an example of cropping. The original sentence "I am
not guilty" is recorded with a sampling rate of 44.1 kHz and saved in MP3 format by a digital recorder,
as shown in Figure 7(a). We manipulate this audio recording with CoolEdit v2.1, and remove the key
word "not". The meaning of the sentence becomes the opposite: "I am guilty", shown in Figure 7(b). The
detected offsets of all frames in the original audio and the doctored one are demonstrated in Figure 7(c)

Fig. 6. The distribution of NACs corresponding to frame offsets from −575 to 575 on 30 different audio frames, which are
encoded using LAME v3.97, mono. The box stands for the distribution of 1150 NACs with wrong offsets, while the isolated point
is the NAC with the correct offset. Panels (a), (b), (c), and (d) show the NAC results of frames encoded with 32 kbps, 64 kbps,
96 kbps, and 128 kbps, respectively.
Fig. 7. Example of locating one cropping. The sentence "I am not guilty" is cropped to "I am guilty", shown as (a) the original
waveform and (b) the doctored waveform. (c) is the detection result of the original audio. The detected offsets of all frames are 0,
which means there are no forgeries. (d) is the detection result of the doctored audio. The detected offsets change at frame 119,
which means there is a forgery. Note that the horizontal axis represents samples in (a) and (b), but frames in (c) and (d).
160000 samples correspond to 277 frames exactly.
and Figure 7(d), respectively. We observe that all frames in the original audio have the same offset 0.
But for the doctored one, the detected offsets have two different values: 0 for frames 1 to 118,
and 384 for the remainder. We can conclude that there is a forgery at frame 119.
From the previous example, we have the general procedures of locating forgeries: (i) detecting offsets
of all frames; (ii) checking the differences between frame offsets.
Now how can the offsets of all frames be retrieved effectively?
Given an audio signal of $L$ samples, we denote it with vector notation $x$, and mark the $j$-sample-shifted
version (which means prepending $j$ zero samples at the beginning of $x$) as $x^{(j)}$ ($0 \le j < 576$).
\[
x^{(0)} = x, \qquad x^{(j+1)} = \left[0, x^{(j)}\right], \quad j = 0, \ldots, 574
\]
For each offset $j$, we split $x^{(j)}$ into frames of 1152 samples with 50% overlap, so we get
$N = L/576 - 1$ frames in total as follows. We have
\[
\left[x^{(j)}_0, \ldots, x^{(j)}_{N-1}\right] = F x^{(j)},
\]
where $F$ represents frame segmentation as well as applying the window function, and $x^{(j)}_k$ is the $k$-th
frame of $x^{(j)}$.
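The shifting and framing step can be sketched in Python as follows. This is a minimal illustration that covers only the segmentation (the window function is left out), rounds the frame count down to an integer, and uses a hypothetical function name.

```python
import numpy as np

FRAME = 1152  # samples per frame
HOP = 576     # frame advance (50% overlap)


def shifted_frames(x, j):
    """Prepend j zeros to x and cut the result into overlapping frames."""
    x = np.asarray(x, dtype=float)
    shifted = np.concatenate([np.zeros(j), x])
    # Frame count based on the original length, rounded down (assumption).
    num_frames = len(x) // HOP - 1
    return np.stack([shifted[k * HOP:k * HOP + FRAME]
                     for k in range(num_frames)])


signal = np.arange(1, 4 * HOP + 1, dtype=float)
print(shifted_frames(signal, 0).shape)   # (3, 1152)
print(shifted_frames(signal, 1)[0, :3])  # [0. 1. 2.]
```

Each row is one frame, and consecutive rows share 576 samples.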
We apply the filterbank and MDCT to each frame and obtain its spectral (576 MDCT coefficients).
We have
\[
s^{(j)}_k = T x^{(j)}_k,
\]
where $T$ represents both filtering by the filterbank and MDCT, and $s^{(j)}_k$ represents the spectral of the $k$-th
frame of $x^{(j)}$.
We change $s^{(j)}_k$ into the logarithmic representation $M^{(j)}_k$:
\[
M^{(j)}_k = 10 \log\left(\max\left(s^{(j)}_k \cdot s^{(j)}_k \cdot 10^{10},\ 1\right)\right)
\]
We express $M^{(j)}_k$ in a logarithmic representation by projecting all values into the range [0,10].
We then count the number of active values in $M^{(j)}_k$. We have
\[
c^{(j)}_k = C M^{(j)}_k,
\]
where $C$ represents the counting operation.
For frame $k$, the detected offset is
\[
\mathrm{offset}_k =
\begin{cases}
\arg\min_j c^{(j)}_k, & \text{if } \mathrm{mean}\left(c^{(j)}_k\right) - \min_j c^{(j)}_k \ge \tau, \\
-100, & \text{if } \mathrm{mean}\left(c^{(j)}_k\right) - \min_j c^{(j)}_k < \tau,
\end{cases}
\]
where $\mathrm{mean}(c^{(j)}_k) = \frac{1}{576} \sum_{j=0}^{575} c^{(j)}_k$, and $\tau$ is a threshold to discriminate whether the frame offset is detectable.
In some cases the frame offset does not exist or is not covered: all $c^{(j)}_k$ are close, but there is always a
$\min_j c^{(j)}_k$. So we need a threshold to indicate these cases, and we accept that the frame offset is detectable
only when $\mathrm{mean}(c^{(j)}_k) - \min_j c^{(j)}_k$ is large enough. Otherwise the frame offset is undetectable. Note that
each frame is expected to have a 0 offset when there is no forgery, since there is no sample shift on any frame. However,
the detection results of some frames come up with nonzero offsets when there is a forgery.
To locate the forgeries, we just differentiate the offsets. If $\mathrm{offset}_k \ne \mathrm{offset}_{k-1}$, a forgery occurs at frame $k$.
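The decision rule and the differencing step can be sketched in Python as below. The sketch takes a precomputed table of NACs as input, so it does not include the filterbank or the MDCT. The function names, the sentinel used for undetectable frames, the threshold value, and the synthetic NAC table are all assumptions made for illustration.

```python
import numpy as np

UNDETECTABLE = -100  # sentinel outside the valid offset range 0..575 (assumption)


def detect_offsets(nac, threshold):
    """nac[k, j] is the NAC of frame k at candidate offset j (0..575)."""
    nac = np.asarray(nac, dtype=float)
    best = nac.argmin(axis=1)
    # Accept the minimum only when it stands out from the average count.
    margin = nac.mean(axis=1) - nac.min(axis=1)
    return np.where(margin >= threshold, best, UNDETECTABLE)


def locate_forgeries(offsets):
    """Return the frame indices k where offsets[k] differs from offsets[k - 1]."""
    return [k for k in range(1, len(offsets)) if offsets[k] != offsets[k - 1]]


# Synthetic NAC table: 6 frames, a clear minimum at offset 0 for the first
# three frames and at offset 384 for the last three.
nac_table = np.full((6, 576), 300.0)
nac_table[:3, 0] = 180.0
nac_table[3:, 384] = 180.0

offsets = detect_offsets(nac_table, threshold=50.0)
print(offsets)                    # [  0   0   0 384 384 384]
print(locate_forgeries(offsets))  # [3]
```

A frame whose NACs are all alike yields a margin of zero and is therefore reported with the sentinel rather than with an arbitrary minimum.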
5. EXPERIMENTAL RESULTS
5.1 Illustration of Locating Forgeries
In Section 4, we show that the proposed method can locate one deletion correctly. However, the frame
offset method is effective not only for one deletion, but also for multiple deletions. Here we demonstrate
an example where a sentence only consists of numbers, as often appears in witness statements. As
shown in Figure 8, three numbers are cropped away from the original sentence. The detected offsets of
all frames in the doctored audio are shown in Figure 8(c). We observe that the frame offsets change at
the 70th, 180th, and 470th frame. This means that some forgeries occur at these locations.
From Figure 8, if the manipulations on the MP3 audio destroy the frame segmentation of the previous
encoding, the frame offset method is able to locate those forgeries. After insertion, the doctored
audio is separated into three segments, and the three segments have different frame offsets.
Figure 9 shows an example of insertion detection. The method locates those forgeries
precisely. As two spliced parts often come from different sources, they often have different frame
offsets, so our method is also effective for detecting splicing. The case of substitution is illustrated in
Figure 10.
Fig. 8. Example of locating multiple deletions. Three numbers are cropped away from a series of numbers ("one two three four
five six seven eight nine" becomes "one three five six seven nine"), shown as the waveforms of (a) the original audio and
(b) the doctored audio. (c) is the detection result of the doctored audio (detected offset versus frame index). Frame offsets
change at the 70th, 180th, and 470th frames, which means there are forgeries at these frames.
5.2 Extensive Experiments
Our experiments also include extensive tests of different types of audio clips. Our tested audio includes
64 speech clips (each 30 s long) and 64 music clips (each 30 s long). These original audio clips are in
WAV format, 22.05 kHz, 16 bit, mono. We use LAME 3.97 to encode the audio clips into MP3 with
bit rates of 32 kbps, 64 kbps, and 96 kbps, respectively. Each clip then consists of 1142 frames. For
each clip, we randomly select 100 frames and apply a 200-sample deletion and a 200-sample
insertion to each frame, respectively. So for each bit rate, we test our approach on 12800 doctored frames with
deletion and another 12800 frames with insertion. We apply our method to these audio clips. We use
the false positive error to measure the undoctored frames incorrectly identified as doctored, while the
false negative error represents the doctored ones that are not detected. We denote the false positive
error rate and false negative error rate as $f_p$ and $f_n$, respectively. The accurate detection rate AR is
calculated as follows.
\[
AR = \left(1 - \frac{f_p + f_n}{2}\right) \times 100\% \tag{6}
\]
The test results for speech and music are shown in Table II and Table III, respectively.
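As a quick check of Eq. (6), the Python sketch below recomputes AR from the error rates reported for speech. The function name is hypothetical, and both error rates are passed in percent.

```python
def accuracy_rate(fp_percent, fn_percent):
    """Accurate detection rate in percent: one minus the mean of the two error rates."""
    return 100.0 - (fp_percent + fn_percent) / 2.0


# (f_p, f_n, reported AR) for speech deletion at 32, 64, and 96 kbps
speech_deletion = [(0.50, 0.03, 99.73), (0.90, 0.14, 99.48), (1.12, 0.34, 99.27)]

for fp, fn, reported in speech_deletion:
    print(f"computed {accuracy_rate(fp, fn):.3f}%, reported {reported:.2f}%")
```

The computed values agree with the reported ones to within the two decimal places shown in the tables.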
As we see, whether we are locating deletion or insertion in these audio frames, all accuracy rates are
above 99%. We notice that the detection results of low bit rates are a little better than those of high bit

Fig. 9. Example of locating insertion. A key word "don't" is inserted into a sentence: (a) original waveform 1 ("I don't think
so"); (b) original waveform 2 ("I agree with it"); (c) forgery waveform ("I don't agree with it"). (d) is the detection
result of the doctored audio (detected offset versus frame index). Frame offsets change at the 48th and 100th frames, which
means there are forgeries at these frames.
rates. This is because MP3 files with lower bit rates have stronger compression traces, which means that
the frame offset can be detected more accurately. The $f_p$ values of speech are higher than those of music,
while the opposite is the case for the $f_n$ values. This may be due to the presence of fewer silent samples in the
music clips, since frame offset detection in silent portions introduces errors more easily.
Note that the detection rate does not reach 100%: in some special cases our method fails to locate
forgeries. When a frame contains many zero samples, for example, when half of its samples are zero, the
correct offset cannot be detected via NAC, as shown in Figure 11. The actual offset of the frame is 200,
but the detected offset is 575. As different offsets are applied, the number of zero samples varies
rapidly, which makes the NAC unstable.
5.3 Sensitivity and Robustness
In this subsection, we discuss the sensitivity and robustness of the proposed method against a variety
of attack schemes.
Fig. 10. Example of locating substitution. The key word "hate" is replaced by "like". (a) Original
waveform 1. (b) Original waveform 2. (c) Forgery waveform. (d) Detection result of the doctored audio
(detected offset versus frame index). Waveform annotations: "I like it", "I hate doing that",
"I like doing that" (substitution). Frame offsets change at the 48th and 90th frames, which means
there are forgeries at these frames.

Table II. Detection Results for Speech
Forgery type   Bit rate   f_p     f_n     AR
deletion       32 kbps    0.50%   0.03%   99.73%
deletion       64 kbps    0.90%   0.14%   99.48%
deletion       96 kbps    1.12%   0.34%   99.27%
insertion      32 kbps    0.51%   0.03%   99.73%
insertion      64 kbps    0.85%   0.20%   99.47%
insertion      96 kbps    1.01%   0.37%   99.31%

Table III. Detection Results for Music
Forgery type   Bit rate   f_p     f_n     AR
deletion       32 kbps    0.20%   0.27%   99.76%
deletion       64 kbps    0.27%   0.47%   99.63%
deletion       96 kbps    0.32%   0.61%   99.53%
insertion      32 kbps    0.16%   0.20%   99.82%
insertion      64 kbps    0.23%   0.42%   99.67%
insertion      96 kbps    0.28%   0.45%   99.63%

Fig. 11. An example of a failure case. (a) Waveform of one frame with an undetectable frame offset
(amplitude versus sample index). (b) NACs obtained with different frame offsets (NAC versus frame
offset).

5.3.1 Splicing at the Boundary. If the adversary is smart enough to splice or crop an exact multiple
of 576 samples so that the cut falls exactly on a frame boundary, will the detection method still work?
After generating the desired audio, the adversary only needs to adjust a few (1 to 575) samples to match
the frame boundary. Because 1 to 575 samples last less than 575/44100 = 0.013 s at a 44.1 kHz
sampling rate, this adjustment would not affect the meaning of the desired audio. Thanks to the 50%
overlap framing used during MP3 encoding, we can still find the trace of this forgery. We give a
demonstration in Figure 12. Suppose that a forgery occurs at the boundary of frame k, where exactly
576 samples are cropped. The spectrum of the new frame k+1 will not show the quantization characteristic
at any offset, but frame k and frame k + 2 still have many troughs at the original offset.
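The boundary argument above can be illustrated with a little modular arithmetic. The sketch below is not taken from the paper: the helper names `grid_shift` and `max_alignment_duration` are hypothetical, and the sign convention (the offset is the position of the original framing grid modulo the frame length, with 0 for undoctored audio) is an assumption made for illustration.

```python
MP3_FRAME_LEN = 576  # samples per frame, as used in this section


def grid_shift(samples_removed: int, frame_len: int = MP3_FRAME_LEN) -> int:
    """Position of the original framing grid after removing samples (assumed convention).

    A frame boundary that sat at position b now sits at b - samples_removed, so the
    grid position modulo the frame length becomes (-samples_removed) % frame_len.
    A negative argument models an insertion.
    """
    return (-samples_removed) % frame_len


def max_alignment_duration(sample_rate_hz: float, frame_len: int = MP3_FRAME_LEN) -> float:
    """Longest adjustment (frame_len - 1 samples) needed to reach a boundary, in seconds."""
    return (frame_len - 1) / sample_rate_hz


print(grid_shift(200))    # deletion of 200 samples
print(grid_shift(-200))   # insertion of 200 samples
print(grid_shift(576))    # boundary-aligned crop: the grid position is unchanged
print(round(max_alignment_duration(44100), 3))
```

Under this convention a 200-sample deletion moves the grid to 376, a 200-sample insertion moves it to 200, and a 576-sample crop leaves it at 0. The last case is why the detection of boundary-aligned splicing relies on frame k+1 losing its quantization characteristic rather than on a change of offset.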
5.3.2 Additive Noise. Additive noise may be added to the tampered speech to cover forgeries, which
presents a challenge for forgery detection. To investigate the robustness of the proposed scheme
against additive noise, a short speech clip consisting of 45 frames is tested. White Gaussian noise of
30 dB is added to the audio samples of the 20th frame, as shown in Figure 13(a). Since both the 19th
and 21st frames overlap the 20th frame by 50%, the 19th and 21st frames are half doctored at the same
time. We then investigate the effect of additive noise on NAC. Offsets from 0 to 575 are applied to
every frame, and the corresponding NACs are recorded and plotted vertically, as shown in Figure 13(b).
All the plots contain a significantly small value except those of the 18th, 19th, 20th, 21st, and 22nd
frames. This means that the frame offsets of all frames except these five can be detected via NAC.
Since there is no such remarkable decrease among the NACs of the 18th, 19th, 20th, 21st, and 22nd
frames, the frame offsets of these five frames are undetectable and are marked with the special value
−100, as mentioned in Section 3. The detection result for the tampered speech is shown in
Figure 13(c). The result shows that the proposed method can resist locally added noise, which means
that forgeries covered by noise can be located.
Fig. 12. The case of splicing at the boundary. (a) Waveform of audio from which 576 samples are cropped,
starting at the 1153rd sample (amplitude versus sample index; original and doctored audio, with frames
k, k+1, and k+2 marked). (b) Spectra of the three frames of the doctored audio (magnitude in dB). All
the frames show the quantization characteristic except the middle frame.

Fig. 13. The effect of additive noise on NAC. (a) Waveform of the audio with partially added noise.
(b) NAC results of all frames (NAC versus frame index). (c) Detection result of the frame offsets
(detected offset versus frame index).

However, if the noise is added globally after the forgeries, all the frame offsets become undetectable
and are marked as −100. In this case, the proposed method is not able to locate the forgeries, but it
still indicates that the audio is abnormal and must have been postprocessed. Such audio is suspect and
should be rejected as evidence.
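The two outcomes described in this subsection (forgeries located when only some frames are disturbed, the whole clip flagged as abnormal when every offset is undetectable) can be sketched as a simple decision rule over the per-frame offsets. This is an illustrative sketch only: the function `triage` and its verdict labels are hypothetical, and the sentinel defaults to −100, which is an assumption about the sign of the special value.

```python
UNDETECTABLE = -100  # assumed sentinel for frames whose offset cannot be detected


def triage(offsets, undetectable=UNDETECTABLE):
    """Return (verdict, frames) for a sequence of detected per-frame offsets.

    frames holds the 1-based indices of frames whose offset is undetectable or
    differs from the previous detectable offset.
    """
    if offsets and all(o == undetectable for o in offsets):
        return "abnormal", []  # globally postprocessed: nothing can be located
    flagged, previous = [], None
    for index, offset in enumerate(offsets, start=1):
        if offset == undetectable:
            flagged.append(index)
        else:
            if previous is not None and offset != previous:
                flagged.append(index)
            previous = offset
    return ("forged", flagged) if flagged else ("consistent", [])


# 45 frames with offset 0 and frames 18 to 22 undetectable, as in the local-noise test
local = [0] * 45
for frame in range(18, 23):
    local[frame - 1] = UNDETECTABLE

print(triage(local))                # ('forged', [18, 19, 20, 21, 22])
print(triage([UNDETECTABLE] * 45))  # ('abnormal', [])
```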
5.3.3 Filtering. Another common way to cover forgeries is to filter the tampered signal. Here we
test a median filter, a mean filter, and a low-pass filter. The same speech clip as in the preceding
section is used for testing. Since the effects of the different filters on NAC are similar, and because
of the page limit, only the result of the median filter is illustrated.

Fig. 14. The effect of median filtering on NAC. (a) Waveform of the partially filtered audio. (b) NAC
results of all frames (NAC versus frame index). (c) Detection result of the frame offsets (detected
offset versus frame index).
First, the 20th frame of the audio signal is filtered by a median filter of length 7, as shown in
Figure 14(a). Since both the 19th and 21st frames overlap the 20th frame by 50%, the 19th and 21st
frames are half filtered at the same time. Then the NACs of all frames are investigated and the
proposed detection method is applied to the whole speech clip.
As shown in Figure 14(b), similar to the case of adding noise, the NAC plots of the 18th, 19th,
20th, 21st, and 22nd frames show no significant decrease, while the plots of the other frames contain
an obviously small value. The detection result in Figure 14(c) shows that the frame offsets of the
18th, 19th, 20th, 21st, and 22nd frames are undetectable, while the other frames have an obvious offset
of 0. This means that the proposed method can indicate the filtered portion of an audio signal if the
signal is partially filtered. However, as in the case of adding noise, if the audio signal is globally
filtered, the proposed method cannot locate forgeries automatically, but it still indicates that the
filtered signal has been manipulated.
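For readers who want to reproduce a partially filtered test signal of this kind, the sketch below applies a length-7 median filter to the samples of a single frame and leaves the rest untouched. It is a minimal illustration, not the authors' test code: the function name is hypothetical, and it assumes that frame n covers samples (n − 1)·576 to n·576 − 1, that is, a framing grid with offset 0.

```python
from statistics import median

FRAME_LEN = 576  # MP3 frame length in samples


def median_filter_frame(samples, frame_number, kernel=7, frame_len=FRAME_LEN):
    """Median-filter only the samples of one 1-based frame.

    Windows are truncated at the ends of the signal; all other samples are copied.
    """
    half = kernel // 2
    start = (frame_number - 1) * frame_len
    stop = min(start + frame_len, len(samples))
    out = list(samples)
    for i in range(start, stop):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        out[i] = median(samples[lo:hi])
    return out


# Synthetic 45-frame signal with one spike inside frame 20 and one inside frame 6
signal = [0.0] * (45 * FRAME_LEN)
signal[19 * FRAME_LEN + 100] = 1.0
signal[5 * FRAME_LEN + 100] = 1.0

filtered = median_filter_frame(signal, frame_number=20)
changed = [i for i, (a, b) in enumerate(zip(signal, filtered)) if a != b]
print(changed)  # only the spike inside frame 20 is removed
```

Only the spike inside the 20th frame is altered; the spike in the 6th frame survives, so the modification stays local to the chosen frame as in the experiment described above.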
6. DISCUSSIONS AND CONCLUSIONS
6.1 Extension to Other Formats
Although we only investigate audio in MP3 format, the idea of locating forgeries via the frame offset
is also suitable for other compressed audio formats, such as AAC, WMA, and Ogg Vorbis. Since audio in
these formats is also encoded frame by frame, the frame offset of each frame can be retrieved.
To confirm this, we test an audio signal encoded with AAC. Note that the frame length in AAC is 1024
samples, and the frequency spectrum also consists of MDCT coefficients. The tool we use to encode and
decode audio signals is FAAC [FAA]. The test clip consists of 40 frames of audio, and its sampling rate
is 44.1 kHz. The encoding parameters of FAAC are 96 kbps, mono. First, we investigate whether the AAC
audio has the quantization characteristic. Offsets −1, 0, and +1 are applied to the 9th frame,
respectively. For each offset, 1024 MDCT coefficients are obtained. We then plot these coefficients in
a logarithmic representation, as shown in Figures 15(a), (b), and (c). It is obvious that only
Figure 15(b) shows the quantization characteristic. Furthermore, we apply offsets 1 to 2500 to the
frame and obtain the corresponding NAC results, as shown in Figure 15(d). A period of 1024 can be
observed. Within the length of one frame, there is only one matching offset, and its NAC is clearly
distinguishable from the other 1023 NACs.

Fig. 15. Quantization characteristic of AAC. Subfigures (a), (b), and (c) show the spectrum of the 9th
frame (magnitude in dB versus frequency index) with offsets −1, 0, and +1, respectively. As in the case
of MP3, the quantization characteristic shows up only with the matching offset (0). Subfigure (d) shows
the NAC result of the 9th frame with offsets 1 to 2500 (NAC versus frame offset).
We now check for forgeries in AAC audio. The audio with 40 frames has 40960 samples in total. We
delete the samples from index 10000 to 15000 and then apply the proposed method to the doctored AAC
audio. Each frame generates 1024 NACs, and the matching offset is taken as the one corresponding to
the minimum NAC. The detection result is shown in Figure 16.
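The selection step described here (one NAC per candidate offset, matching offset at the minimum) can be sketched as follows. The NAC values are taken as given inputs, since their computation from MDCT coefficients is defined earlier in the article. The function name, the `min_drop` test for a "clear" minimum, and the −100 sentinel default are assumptions for illustration; the synthetic NAC values are made up and are not measurements.

```python
def matching_offset(nac_by_offset, min_drop=0.75, undetectable=-100):
    """Pick the offset with the smallest NAC, or `undetectable` if no clear minimum.

    nac_by_offset[i] is the NAC obtained with offset i (one value per candidate
    offset, e.g. 1024 values for an AAC frame). The minimum counts as clear when it
    is below min_drop times the median of the remaining values; this criterion is
    hypothetical and only stands in for the "remarkable decrease" in the text.
    """
    best = min(range(len(nac_by_offset)), key=nac_by_offset.__getitem__)
    others = sorted(v for i, v in enumerate(nac_by_offset) if i != best)
    median_other = others[len(others) // 2]
    if nac_by_offset[best] < min_drop * median_other:
        return best
    return undetectable


# Synthetic example: 1024 candidate offsets, one pronounced trough at offset 120
nac = [800] * 1024
nac[120] = 400
print(matching_offset(nac))           # 120
print(matching_offset([800] * 1024))  # -100 (no offset stands out)
```

Applying such a rule frame by frame yields the sequence of detected offsets; a change in that sequence, or a run of undetectable frames, marks the doctored position.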
Therefore, we have shown that the proposed method can detect forgeries in AAC audio. Our method can
also be extended to other frame-based encoders, since the spectrum obtained with the matching offset
approximates the first-encoding spectrum more closely than the spectra obtained with shifted offsets.
What must be kept in mind is that the procedure for extracting the spectrum varies between encoders,
since they use different frame lengths and windows.

Fig. 16. Forgery detection result for AAC audio. (a) Original audio. (b) Doctored audio. (c) Detection
result (detected offset versus frame index).

6.2 Conclusions
In this article, we propose a method to expose MPEG audio forgeries using frame offsets. The main
contributions of this work are as follows. First, to the best of our knowledge, this is the first piece
of work on detecting forgeries in MP3 audio. It extends the research topics of forgery detection. Second,
this work illustrates that MDCT coefficients can reflect forgery traces very well for MPEG audio. Via
theoretical analysis and extensive experiments, we show that NAC is a reliable feature for retrieving
frame offsets. Based on the fact that most common forgeries change the frame offsets of audio, the
proposed method can locate these forgeries effectively. Extensive experimental results show that the
proposed method performs very well on both speech and music. All the accuracy rates are above 99%,
which shows the effectiveness of the proposed method. Another advantage of the proposed method is
its computational simplicity: we only need to investigate the MDCT coefficients of the audio.
However, if audio is transcoded between different compressed formats, the frame offset is difficult
to obtain and the proposed method will fail. Note also that at a high bit rate such as 128 kbps the
NAC method is not well suited to retrieving frame offsets, since zero coefficients are few at high bit
rates. In future work, we will therefore focus on obtaining the frame offset after transcoding and at
high bit rates.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their constructive comments. Their
suggestions will be very helpful for our future work.
REFERENCES
BOEHM, R. AND WESTFELD, A. 2004. Statistical characterisation of MP3 encoders for steganalysis. In Proceedings of the 6th ACM
Multimedia and Security Workshop. ACM.
FAAC. 2012. Freeware advanced audio coder. http://www.audiocoding.com/faac.html.
FARID, H. 1999. Detecting digital forgeries using bispectral analysis. MIT AI Memo AIM-1657, MIT.
FU, D., SHI, Y., AND SU, W. 2007. A generalized Benford's law for JPEG coefficients and its applications in image forensics. In
Proceedings of SPIE Conference on Security, Steganography, and Watermarking of Multimedia Contents.
GRIGORAS, C. 2005. Digital audio recording analysis: The electric network frequency (ENF) criterion. Int. J. Speech Lang. Law 2, 1,
63-76.
HERRE, J. AND SCHUG, M. 2000. Analysis of decompressed audio - The inverse decoder. In Proceedings of the 109th AES
Convention.
HERRE, J., SCHUG, M., AND GEIGER, R. 2002. Analysing decompressed audio with the inverse decoder - Towards an operative
algorithm. In Proceedings of the 112th AES Convention.
ISO. 1992. ISO/IEC international standard IS 11172-3. Information technology - Coding of moving pictures and associated audio
for digital storage media up to about 1.5 Mbit/s. http://www.iso.org/iso/catalogue_detail.htm?csnumber=22412.
KRAETZER, C., OERMANN, A., DITTMANN, J., AND LANG, A. 2007. Digital audio forensics: A first practical evaluation on microphone
and environment classification. In Proceedings of the 9th ACM Multimedia and Security Workshop.
LAME 3.97. 2012. MP3 encoder. http://lame.sourceforge.net.
LUKAS, J. AND FRIDRICH, J. 2003. Estimation of primary quantization matrix in double compressed JPEG images. In Proceedings
of the Digital Forensic Research Workshop.
PAINTER, T. AND SPANIAS, A. 2000. Perceptual coding of digital audio. Proc. IEEE 88, 4, 451-513.
POPESCU, A. AND FARID, H. 2004. Statistical tools for digital forensics. In Proceedings of the 6th International Workshop on
Information Hiding.
QU, Z., LUO, W., AND HUANG, J. 2008. A convolutive mixing model for shift double JPEG compression with application to passive
image authentication. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.
WANG, Y. AND VILERMO, M. 2003. Modified discrete cosine transform - Its implications for audio coding and error concealment.
AES J. 51, 1, 51-62.
WANG, Y., YAROSLAVSKY, L., VILERMO, M., AND VAANANEN, M. 2000. Some peculiar properties of the MDCT. In Proceedings of the
16th IFIP World Computer Congress.
YANG, R., QU, Z., AND HUANG, J. 2008. Detecting digital audio forgeries by checking frame offsets. In Proceedings of the 10th ACM
Multimedia and Security Workshop. ACM.
Received November 2010; revised July 2011; accepted August 2011
