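$$X^{(p)}[k] = \sum_{n=0}^{2N-1} x^{(p)}[n]\, h[n] \cos\left[\frac{\pi}{N}\left(n + \frac{N+1}{2}\right)\left(k + \frac{1}{2}\right)\right], \quad 0 \le k \le N-1 \qquad (1)$$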
By applying an inverse-MDCT transform to the frame, we get 2N time-aliased samples.
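$$\hat{x}^{(p)}[n] = \frac{2}{N} \sum_{k=0}^{N-1} X^{(p)}[k] \cos\left[\frac{\pi}{N}\left(n + \frac{N+1}{2}\right)\left(k + \frac{1}{2}\right)\right], \quad 0 \le n \le 2N-1 \qquad (2)$$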
To cancel the aliasing and recover the original samples, we use the overlap-add (OLA) procedure. An inverse MDCT is also applied to the previous and the next frame. Then each of the resulting aliased segments is multiplied by its corresponding window function, and the overlapping time segments are added together. We thus recover the original samples.
$$x^{(p)}[n] = \begin{cases} \hat{x}^{(p-1)}[n+N]\, h[N-n-1] + \hat{x}^{(p)}[n]\, h[n], & 0 \le n \le N-1 \\ \hat{x}^{(p)}[n]\, h[2N-n-1] + \hat{x}^{(p+1)}[n-N]\, h[n-N], & N \le n \le 2N-1 \end{cases} \qquad (3)$$
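The aliasing cancellation in Eqs. (1) to (3) can be checked numerically. The following is a minimal sketch, not the MP3 codec itself: it assumes a sine window (one common window satisfying h[n]^2 + h[n+N]^2 = 1), a toy frame size N = 8, and random test samples. The function names are hypothetical.

```python
import math
import random

def sine_window(N):
    """Sine window of length 2N (assumed here); h[n]^2 + h[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def _phase(n, k, N):
    # Argument of the cosine shared by Eq. (1) and Eq. (2).
    return math.pi / N * (n + (N + 1) / 2) * (k + 0.5)

def mdct(frame, h):
    """Eq. (1): 2N samples, windowed by h, mapped to N coefficients."""
    N = len(frame) // 2
    return [sum(frame[n] * h[n] * math.cos(_phase(n, k, N)) for n in range(2 * N))
            for k in range(N)]

def imdct(X):
    """Eq. (2): N coefficients mapped to 2N time-aliased samples."""
    N = len(X)
    return [2.0 / N * sum(X[k] * math.cos(_phase(n, k, N)) for k in range(N))
            for n in range(2 * N)]

def overlap_add(prev_hat, cur_hat, next_hat, h):
    """Eq. (3): rebuild the 2N samples of frame p from three aliased frames."""
    N = len(h) // 2
    out = []
    for n in range(2 * N):
        if n < N:
            out.append(prev_hat[n + N] * h[N - n - 1] + cur_hat[n] * h[n])
        else:
            out.append(cur_hat[n] * h[2 * N - n - 1] + next_hat[n - N] * h[n - N])
    return out

N = 8
rng = random.Random(0)
signal = [rng.uniform(-1.0, 1.0) for _ in range(4 * N)]
h = sine_window(N)
# Frames p-1, p, p+1 with 50% overlap (hop of N samples).
frames = [signal[p * N:p * N + 2 * N] for p in range(3)]
aliased = [imdct(mdct(f, h)) for f in frames]
rebuilt = overlap_add(aliased[0], aliased[1], aliased[2], h)
max_error = max(abs(a - b) for a, b in zip(rebuilt, frames[1]))
print(max_error)
```

The printed error should be at the level of floating-point rounding, which illustrates that each aliased frame alone is not the original signal but the windowed sum of neighbouring frames is.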
Denote
$$\tilde{x}^{(p)}[n] = x^{(p)}[n]\, h[n], \quad 0 \le n \le 2N-1. \qquad (4)$$
If a signal exhibits local symmetry such that
$$\begin{cases} \tilde{x}^{(p)}[n] = \tilde{x}^{(p)}[N-n-1], & 0 \le n \le N-1 \\ \tilde{x}^{(p)}[n] = -\tilde{x}^{(p)}[3N-n-1], & N \le n \le 2N-1 \end{cases} \qquad (5)$$
its MDCT coefficients become zero. That is, $X^{(p)}[k] = 0$ for $k = 0, \ldots, N-1$.
In Wang et al. [2000], it has been proven that $\tilde{x}^{(p)}[n]$ fulfills Eq. (5) if $X^{(p)}[k] = 0$. This inherent property of the MDCT explains why the NAC decreases significantly only if the identical frame offset is applied. After MP3 encoding, many spectral coefficients are masked or quantized to
ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 8, No. S2, Article 35, Publication date: September 2012.
R. Yang et al.
Table I. Mean Value and Standard Deviation of NACs at Different Bit Rates

            Shifted NACs         Matching NAC
Bit rate    Mean      Std        Mean      Std
32 kbps     175.61    13.45       67.80    12.34
64 kbps     313.46    19.99      178.38    11.06
96 kbps     331.72    18.30      249.15    25.07
128 kbps    345.45    19.14      310.23    25.60
zero. When decoding, these zero spectral coefficients are restored to the time domain, and $\tilde{x}^{(p)}[n]$ fulfills Eq. (5). When we perform the MDCT on the decoded data with the same frame offset as in the first encoding process, many of the $X^{(p)}[k]$ are equal to zero. If a different frame offset is used, the local symmetry in Eq. (5) is broken, and the corresponding spectral coefficients $X^{(p)}[k]$ will not be zero.
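The link between local symmetry and zero coefficients can be illustrated with a small numerical sketch. It assumes a toy frame size N = 8 and works directly on already-windowed samples; the first half of the frame is made even-symmetric and the second half odd-symmetric, which is the symmetry under which the cosine kernel of the MDCT cancels term by term. The helper names are hypothetical.

```python
import math
import random

def mdct_windowed(xt):
    """MDCT of 2N samples that are assumed to be windowed already."""
    N = len(xt) // 2
    return [sum(xt[n] * math.cos(math.pi / N * (n + (N + 1) / 2) * (k + 0.5))
                for n in range(2 * N)) for k in range(N)]

def symmetric_frame(N, rng):
    """Build 2N samples: even symmetry in the first half, odd in the second."""
    a = [rng.uniform(-1.0, 1.0) for _ in range(N // 2)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(N // 2)]
    first = a + a[::-1]                      # first[n] == first[N-1-n]
    second = b + [-v for v in reversed(b)]   # second[m] == -second[N-1-m]
    return first + second

N = 8
rng = random.Random(1)
frame = symmetric_frame(N, rng)

# Matching alignment: every coefficient vanishes (up to rounding).
peak_matching = max(abs(c) for c in mdct_windowed(frame))

# One-sample misalignment: the symmetry is broken and coefficients reappear.
shifted = frame[1:] + [0.0]
peak_shifted = max(abs(c) for c in mdct_windowed(shifted))

print(peak_matching, peak_shifted)
```

A one-sample shift is enough to break the symmetry, which mirrors the observation that only the matching frame offset yields many zero coefficients.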
3.3 Experiments on Retrieving Frame Offsets
To illustrate the preceding analysis, we randomly select 30 different audio frames and encode them with LAME v3.97 [LAM 2012] at bit rates of 32 kbps, 64 kbps, 96 kbps, and 128 kbps, respectively. For each bit rate, we apply offsets from -575 to 575 to these frames and calculate the NACs corresponding to all offsets, which gives 1151 NACs in total for each frame. The 1150 NACs corresponding to wrong offsets are called "shifted NACs", and the NAC corresponding to the correct offset is called the "matching NAC". The shifted NACs and the matching NAC are plotted separately. As shown in Figure 6, for each bit rate, there are 30 boxes representing the distribution of the shifted NACs. As shown in Figure 6(a), the minimum value of the shifted NACs is larger than 150 for each frame, while the matching NAC is below 80. For all frames, we observe that the matching NAC is clearly separated from the shifted NACs. The cases of 64 kbps, 96 kbps, and 128 kbps are illustrated in Figures 6(b), (c), and (d), respectively.
Although frames may be encoded at different bit rates, the matching NAC is always smaller than the shifted NACs. This means that we can regard the minimum NAC as the matching NAC. From Figure 6, we also notice that the distance between the shifted NACs and the matching NAC becomes smaller as the bit rate increases. This is because signal distortion and information loss are smaller when the bit rate is higher, so the MDCT coefficients contain fewer zeros.
As the preceding investigation is based on only 30 frames, the conclusion may not be general enough. In the following, we collect statistics on 12800 frames, including 6400 frames of speech and 6400 frames of music. We compute 1150 shifted NACs and the matching NAC for each frame. Table I displays the mean values and standard deviations of the NACs based on these 12800 frames. The mean values of the shifted NACs and the matching NAC are clearly separated, and the standard deviations are all small compared to the mean values. However, as noted before, the difference between the shifted NACs and the matching NAC becomes small at a high bit rate, such as 128 kbps.
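The separation reported in Table I can be made concrete with a short calculation. The sketch below only re-uses the numbers printed in Table I; expressing the gap in units of the larger standard deviation is an illustrative choice made here, not a measure used in the text.

```python
# (shifted mean, shifted std, matching mean, matching std) per bit rate in kbps,
# copied from Table I.
table_i = {
    32: (175.61, 13.45, 67.80, 12.34),
    64: (313.46, 19.99, 178.38, 11.06),
    96: (331.72, 18.30, 249.15, 25.07),
    128: (345.45, 19.14, 310.23, 25.60),
}

gaps = {}
gaps_in_std = {}
for rate, (s_mean, s_std, m_mean, m_std) in table_i.items():
    gap = s_mean - m_mean                        # distance between the two means
    gaps[rate] = round(gap, 2)
    gaps_in_std[rate] = gap / max(s_std, m_std)  # gap relative to the larger spread

for rate in sorted(gaps):
    print(rate, gaps[rate], round(gaps_in_std[rate], 2))
```

With these values the gap between the means is 107.81, 135.08, 82.57, and 35.22 for 32, 64, 96, and 128 kbps, so at 128 kbps it is only about 1.4 times the larger standard deviation.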
4. LOCATING FORGERIES VIA CHECKING FRAME OFFSETS
As audio samples are divided into frames for encoding, the frame offset can be useful evidence of tampering. When a forgery occurs, all frames after the forged point are affected, and the detected offsets of the corresponding frames change. Figure 7 is an example of cropping. The original sentence "I am not guilty" is recorded at a sampling rate of 44.1 kHz and saved in MP3 format by a digital recorder, as shown in Figure 7(a). We manipulate this audio recording with CoolEdit v2.1 and remove the key word "not". The meaning of the sentence becomes the opposite: "I am guilty", shown in Figure 7(b). The detected offsets of all frames in the original audio and the doctored one are shown in Figure 7(c)
Exposing MP3 Audio Forgeries Using Frame Offsets
[Figure 6: four box-plot panels, each showing NAC (vertical axis) against 30 different audio frames (horizontal axis, 1 to 30): (a) NAC result of frames encoded with 32 kbps; (b) NAC result of frames encoded with 64 kbps; (c) NAC result of frames encoded with 96 kbps; (d) NAC result of frames encoded with 128 kbps. Annotations mark the distribution of the 1150 NACs with wrong offsets and the NAC with the correct offset for the 14th audio frame.]
Fig. 6. The distribution of NACs corresponding to frame offsets from -575 to 575 on 30 different audio frames, which are encoded using LAME v3.97, mono. Each box shows the distribution of the 1150 NACs with wrong offsets, while the isolated point is the NAC with the correct offset. Panels (a), (b), (c), and (d) show the cases for 32 kbps, 64 kbps, 96 kbps, and 128 kbps, respectively.
[Figure 7: (a) Original waveform, "I am not guilty."; (b) Doctored waveform, "I am guilty.", with the cropped region marked; (c) detection result of original audio; (d) detection result of doctored audio. Panels (a) and (b) plot amplitude against sample index (0 to 16 x 10^4); panels (c) and (d) plot detected offset (0 to 600) against frame index (0 to 250).]
Fig. 7. Example of locating one cropping. The sentence "I am not guilty" is cropped to "I am guilty", shown as (a) and (b). (c) is the detection result of the original audio. The detected offsets of all frames are 0, which means there are no forgeries. (d) is the detection result of the doctored audio. The detected offsets change at frame 119, which means there is a forgery. Note that the horizontal axis represents samples in (a) and (b), but frames in (c) and (d). 160000 samples correspond to 277 frames exactly.
and Figure 7(d), respectively. We observe that all frames in the original audio have the same offset, 0. For the doctored audio, however, the detected offsets take two different values: 0 for frames 1 to 118, and 384 for the remainder. We can conclude that there is a forgery at frame 119.
From the previous example, we obtain the general procedure for locating forgeries: (i) detect the offsets of all frames; (ii) check the differences between frame offsets.
Now how can the offsets of all frames be retrieved effectively?
Given an audio signal of L samples, we denote it in vector notation by $x$, and denote the j-sample-shifted version (obtained by prepending j zero samples to $x$) by $x^{(j)}$ ($0 \le j < 576$):
$$x^{(0)} = x, \quad x^{(j+1)} = \left[0,\, x^{(j)}\right], \quad j = 0, \ldots, 574$$
For each offset j, we split $x^{(j)}$ into frames of 1152 samples with 50% overlap, so we get $N = \lfloor L/576 \rfloor - 1$ frames in total. We have
$$\begin{bmatrix} x_0^{(j)} \\ \vdots \\ x_{N-1}^{(j)} \end{bmatrix} = F x^{(j)},$$
where $F$ represents frame segmentation as well as applying the window function, and $x_k^{(j)}$ is the k-th frame of $x^{(j)}$.
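The shifting and segmentation steps can be sketched as follows. This is a simplified illustration that covers only the framing part of F: the window function is omitted, the frame count is computed from the length of the signal that is actually passed in, and the function names are hypothetical.

```python
HOP = 576          # hop size for 50% overlap
FRAME_LEN = 1152   # samples per frame

def shift(x, j):
    """Return x^(j): the signal with j zero samples prepended."""
    return [0.0] * j + list(x)

def split_frames(x, frame_len=FRAME_LEN, hop=HOP):
    """Split x into overlapping frames; floor(len(x) / hop) - 1 frames result."""
    n_frames = len(x) // hop - 1
    return [x[k * hop:k * hop + frame_len] for k in range(n_frames)]

# Toy signal of 10 * 576 samples with values 1, 2, 3, ...
x = [float(i + 1) for i in range(10 * HOP)]
frames0 = split_frames(shift(x, 0))
frames3 = split_frames(shift(x, 3))

print(len(frames0), len(frames0[0]))
print(frames3[0][:4])
```

Each frame shares its second half with the first half of the next frame, and a shift by j moves every frame boundary j samples earlier relative to the audio content.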
We apply the filterbank and the MDCT to each frame and obtain its spectrum (576 MDCT coefficients). We have
$$s_k^{(j)} = T x_k^{(j)},$$
where $T$ represents both filtering by the filterbank and the MDCT, and $s_k^{(j)}$ represents the spectrum of the k-th frame of $x^{(j)}$.
We convert $s_k^{(j)}$ into the logarithmic representation $M_k^{(j)}$:
$$M_k^{(j)} = 10 \log\left[\max\left(s_k^{(j)} \circ s_k^{(j)} \cdot 10^{10},\; 1\right)\right]$$
This logarithmic representation of $M_k^{(j)}$ projects all values into the range [0,10].
We then count the number of active values in $M_k^{(j)}$. We have
$$c_k^{(j)} = C M_k^{(j)},$$
where $C$ represents the counting operation.
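The logarithmic mapping and the counting step can be sketched as below. Several points are assumptions made for illustration: the mapping is taken to be 10 log10(max(s^2 * 10^10, 1)) applied element by element, the logarithm is base 10 (decibel convention), and a value is treated as "active" when it lies strictly above the 0 dB floor. The function names and the toy spectrum are hypothetical.

```python
import math

def log_spectrum(s):
    """Element-wise 10*log10(max(v^2 * 1e10, 1)); tiny coefficients map to 0."""
    return [10.0 * math.log10(max(v * v * 1e10, 1.0)) for v in s]

def count_active(M):
    """Counting operation C (assumed): number of values above the 0 dB floor."""
    return sum(1 for m in M if m > 0.0)

# Toy spectrum: two coefficients are (near) zero, three are clearly nonzero.
spectrum = [0.0, 1e-6, 0.5, -0.25, 1e-4]
M = log_spectrum(spectrum)
nac = count_active(M)
print(M, nac)
```

Under these assumptions, coefficients that were quantized to zero contribute nothing to the count, so a frame analysed with the matching offset yields a small count.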
For frame k, the detected offset is
$$\mathrm{offset}_k = \begin{cases} \arg\min_j c_k^{(j)}, & \text{if } \mathrm{mean}\left(c_k^{(j)}\right) - \min_j c_k^{(j)} \ge \tau, \\ -100, & \text{if } \mathrm{mean}\left(c_k^{(j)}\right) - \min_j c_k^{(j)} < \tau, \end{cases}$$
where $\mathrm{mean}(c_k^{(j)}) = \frac{1}{576} \sum_{j=0}^{575} c_k^{(j)}$, and $\tau$ is a threshold that determines whether the frame offset is detectable. In some cases the frame offset does not exist or cannot be retrieved: all $c_k^{(j)}$ are close to each other, yet there is always a minimum $\min_j c_k^{(j)}$. We therefore need a threshold to flag these cases, and we accept the frame offset as detectable only when $\mathrm{mean}(c_k^{(j)}) - \min_j c_k^{(j)}$ is large enough. Otherwise the frame offset is undetectable. Note that every frame of an unforged recording is expected to have offset 0, since no frame has been shifted. In a forged recording, however, some frames will be detected with a nonzero offset.
To locate the forgeries, we simply take the difference of the offset sequence. If $\mathrm{offset}_k \ne \mathrm{offset}_{k-1}$, a forgery occurs at frame k.
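The decision rule and the differencing step can be sketched as follows. The threshold value of 50 and the toy counts are hypothetical, chosen only to exercise both branches; frame indices in the code are zero-based, so index 118 corresponds to frame 119 in the cropping example.

```python
UNDETECTABLE = -100  # special value for frames without a clear minimum

def detect_offset(counts, threshold):
    """counts[j] is the active-value count of one frame under offset j."""
    best = min(range(len(counts)), key=counts.__getitem__)
    mean = sum(counts) / len(counts)
    if mean - counts[best] >= threshold:
        return best
    return UNDETECTABLE

def forgery_frames(offsets):
    """Indices k (zero-based) at which the detected offset differs from k-1."""
    return [k for k in range(1, len(offsets)) if offsets[k] != offsets[k - 1]]

THRESHOLD = 50  # hypothetical value; it would have to be tuned on real data

clear = [300] * 576
clear[384] = 100           # one pronounced trough at offset 384
flat = [300] * 576         # no trough at all

offset_clear = detect_offset(clear, THRESHOLD)
offset_flat = detect_offset(flat, THRESHOLD)

# Offsets as in the cropping example: 0 for frames 1 to 118, then 384.
offsets = [0] * 118 + [384] * 10
changes = forgery_frames(offsets)
print(offset_clear, offset_flat, changes)
```

Note that under this rule a transition between a valid offset and the undetectable marker also counts as a change, so runs of undetectable frames are flagged as well.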
5. EXPERIMENTAL RESULTS
5.1 Illustration of Locating Forgeries
In Section 4, we showed that the proposed method can locate one deletion correctly. However, the frame offset method is effective not only for a single deletion but also for multiple deletions. Here we demonstrate an example in which a sentence consists only of numbers, as often appears in witness statements. As shown in Figure 8, three numbers are cropped from the original sentence. The detected offsets of all frames in the doctored audio are shown in Figure 8(c). We observe that the frame offsets change at the 70th, 180th, and 470th frames. This means that forgeries occur at these locations.
Figure 8 suggests that whenever manipulations of the MP3 audio destroy the frame segmentation of the previous encoding, the frame offset method is able to locate the forgeries. After an insertion, the doctored audio is separated into three segments, and these three segments in general have different frame offsets. Figure 9 shows an example of insertion detection, in which the method locates the forgeries precisely. As two spliced parts often come from different sources, they often have different frame offsets, so our method is also effective for detecting splicing. The case of substitution is illustrated in Figure 10.
[Figure 8: (a) waveform of original audio, "one two three four five six seven eight nine"; (b) waveform of doctored audio, "one three five six seven nine"; (c) detection result of doctored audio. Panels (a) and (b) plot amplitude against sample index (0 to 5 x 10^5); panel (c) plots detected offset (0 to 600) against frame index (0 to 800).]
Fig. 8. Example of locating multiple deletions. Three numbers are cropped from a series of numbers, shown as (a) and (b). (c) is the detection result of the doctored audio. Frame offsets change at the 70th, 180th, and 470th frames, which means there are forgeries at these frames.
5.2 Extensive Experiments
Our experiments also include extensive tests on different types of audio clips. The tested audio includes 64 speech clips (each 30 s long) and 64 music clips (each 30 s long). The original audio clips are in WAV format, 22.05 kHz, 16 bit, mono. We use LAME 3.97 to encode the audio clips into MP3 at bit rates of 32 kbps, 64 kbps, and 96 kbps, respectively. Each clip then consists of 1142 frames. For each clip, we randomly select 100 frames and apply a 200-sample deletion and a 200-sample insertion, respectively, to each selected frame. So for each bit rate, we test our approach on 12800 doctored frames with deletion and another 12800 frames with insertion. We apply our method to these audio clips. The false positive error measures the undoctored frames incorrectly identified as doctored, while the false negative error represents the doctored frames that are not detected. We denote the false positive error rate and the false negative error rate by $f_p$ and $f_n$, respectively. The accurate detection rate AR is calculated as follows.
$$AR = \left(1 - \frac{f_p + f_n}{2}\right) \times 100\% \qquad (6)$$
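Equation (6) can be checked against the reported error rates with a few lines of code. The sketch below takes both error rates in percent; the function name is hypothetical, and the sample values are the 64 kbps and 96 kbps deletion rows for speech from Table II.

```python
def accuracy_rate(fp_percent, fn_percent):
    """Eq. (6) with both error rates given in percent."""
    return 100.0 - (fp_percent + fn_percent) / 2.0

# Speech, deletion, 64 kbps: f_p = 0.90%, f_n = 0.14%
ar_64 = accuracy_rate(0.90, 0.14)
# Speech, deletion, 96 kbps: f_p = 1.12%, f_n = 0.34%
ar_96 = accuracy_rate(1.12, 0.34)
print(round(ar_64, 2), round(ar_96, 2))
```

Both results agree with the AR column of Table II (99.48% and 99.27%), which shows that AR weights false positives and false negatives equally.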
The test results for speech and music are shown in Table II and Table III, respectively.
As we see, whether we are locating deletion or insertion in these audio frames, all accuracy rates are
above 99%. We notice that the detection results of low bit rates are a little better than those of high bit
[Figure 9: (a) Original waveform 1, "I don't think so"; (b) Original waveform 2, "I agree with it"; (c) Forgery waveform, "I don't agree with it", with the insertion marked; (d) Detection result of doctored audio. Panels (a) to (c) plot amplitude against sample index (0 to 14 x 10^4); panel (d) plots detected offset (0 to 600) against frame index (0 to 250).]
Fig. 9. Example of locating insertion. A key word "don't" is inserted into a sentence, shown as (a), (b), and (c). (d) is the detection result of the doctored audio. Frame offsets change at the 48th and 100th frames, which means there are forgeries at these frames.
rates. This is because MP3s with lower bit rates carry stronger compression traces, so the frame offset can be detected more accurately. The $f_p$ values for speech are higher than those for music, while the opposite is the case for the $f_n$ values. This may be due to the presence of fewer silent samples in the music clips, since frame offset detection in silent portions introduces errors more easily.
It is noted that the detection rate cannot reach 100%. In some special cases our method fails to locate forgeries. When the frame contains many zero samples, for example, one half of the frame, the correct offset cannot be detected via the NAC, as shown in Figure 11. The actual offset of the frame is 200. However, the detected offset is 575. As different offsets are applied, the number of zero samples varies rapidly, which leads to an unstable NAC.
5.3 Sensitivity and Robustness
In this subsection, we discuss the sensitivity and robustness of the proposed method against a variety of attack schemes.
5.3.1 Splicing at the Boundary. If the adversary is smart enough to splice or crop an exact multiple of 576 samples, so that the edit falls exactly on a frame boundary, will the detection method still work? After generating the desired audio, the adversary only needs to adjust a few (1 to 575) samples to match
[Figure 10: (a) Original waveform 1, "I like it"; (b) Original waveform 2, "I hate doing that"; (c) Forgery waveform, "I like doing that", with the substitution marked; (d) Detection result of doctored audio. Panels (a) to (c) plot amplitude against sample index (0 to 12 x 10^4); panel (d) plots detected offset (0 to 600) against frame index (0 to 200).]
Fig. 10. Example of locating substitution. A key word "hate" is replaced by "like", shown as (a), (b), and (c). (d) is the detection result of the doctored audio. Frame offsets change at the 48th and 90th frames, which means there are forgeries at these frames.
Table II. Detection Results for Speech

Forgery type   Bit rate   f_p      f_n      AR
deletion       32 kbps    0.50%    0.03%    99.73%
deletion       64 kbps    0.90%    0.14%    99.48%
deletion       96 kbps    1.12%    0.34%    99.27%
insertion      32 kbps    0.51%    0.03%    99.73%
insertion      64 kbps    0.85%    0.20%    99.47%
insertion      96 kbps    1.01%    0.37%    99.31%

Table III. Detection Results for Music

Forgery type   Bit rate   f_p      f_n      AR
deletion       32 kbps    0.20%    0.27%    99.76%
deletion       64 kbps    0.27%    0.47%    99.63%
deletion       96 kbps    0.32%    0.61%    99.53%
insertion      32 kbps    0.16%    0.20%    99.82%
insertion      64 kbps    0.23%    0.42%    99.67%
insertion      96 kbps    0.28%    0.45%    99.63%
the frame boundary. Because 1 to 575 samples last at most 575/44100 = 0.013 s at a 44.1 kHz sampling rate, this adjustment would not affect the meaning of the desired audio. Thanks to the 50% overlap framing used during MP3 encoding, we can still find the trace of this forgery. We give a demonstration in Figure 12. Suppose that one forgery occurs at the boundary of frame k. There exactly
[Figure 11: (a) waveform of an undetectable frame, amplitude against sample index (0 to 600); (b) NAC result, NAC (100 to 400) against frame offset (0 to 600).]
Fig. 11. An example of a failure case. Shown in (a) is the waveform of one frame with an undetectable frame offset. Shown in (b) are the NACs for different frame offsets.
576 samples are cropped. The spectrum of the new frame k+1 will not show the quantization characteristic with any offset, but frame k and frame k+2 still have many troughs with the original offset.
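Why a cut on a frame boundary still leaves a trace can be seen by listing the frames that straddle the splice point. The sketch below assumes frames of 1152 samples with a hop of 576 and zero-based indices, with frame k taken as index 0; the function name is hypothetical.

```python
HOP = 576
FRAME_LEN = 1152

def straddling_frames(splice_index, n_frames, frame_len=FRAME_LEN, hop=HOP):
    """Frames containing samples from both sides of the splice.

    splice_index is the index of the first sample after the cut.
    """
    return [f for f in range(n_frames)
            if f * hop < splice_index < f * hop + frame_len]

# Cut located exactly on a frame boundary (the 1153rd sample, index 1152).
on_boundary = straddling_frames(1152, n_frames=3)
# Cut located inside a frame.
inside = straddling_frames(1000, n_frames=3)
print(on_boundary, inside)
```

For the boundary case only the middle frame mixes samples from both sides of the cut, which matches the behaviour described for frame k+1, while its two neighbours lie entirely on one side.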
5.3.2 Additive Noise. Additive noise may be added to the tampered speech to cover forgeries, which presents a challenge for forgery detection. To investigate the robustness of the proposed scheme against additive noise, a short speech clip consisting of 45 frames is tested. White Gaussian noise of 30 dB is added to the audio samples of the 20th frame, as shown in Figure 13(a). Since both the 19th and 21st frames overlap the 20th frame by 50%, the 19th and 21st frames are half doctored at the same time. We then investigate the effect of the additive noise on the NAC. Offsets from 0 to 575 are applied to all frames, and the corresponding NACs are recorded and plotted vertically, as shown in Figure 13(b). Every plot contains one significantly small value, except the plots of the 18th, 19th, 20th, 21st, and 22nd frames. This means that the frame offsets of all frames except these five can be detected via the NAC. Since there is no such remarkable decrease among the NACs of the 18th, 19th, 20th, 21st, and 22nd frames, the frame offsets of these five frames are undetectable and are marked with the special value -100, as mentioned in Section 4. The detection result of the tampered speech is shown in Figure 13(c). The result shows that the proposed method can resist locally added noise, which means that forgeries covered by noise can be located.
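A local noise attack of this kind can be simulated with a short sketch. It rests on assumptions that the text does not settle: "30 dB" is read as a signal-to-noise ratio measured over the affected frame, the noise is rescaled so that this ratio is met exactly, and a synthetic 440 Hz tone stands in for the speech clip. All names are hypothetical.

```python
import math
import random

HOP = 576
FRAME_LEN = 1152

def add_noise_at_snr(samples, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR (in dB)."""
    signal_power = sum(v * v for v in samples) / len(samples)
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    noise_power = sum(v * v for v in noise) / len(noise)
    target_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_power / noise_power)
    return [s + scale * n for s, n in zip(samples, noise)]

def noisy_copy(audio, frame_index, snr_db, rng):
    """Copy of audio with noise added to one frame (zero-based index)."""
    start = frame_index * HOP
    out = list(audio)
    out[start:start + FRAME_LEN] = add_noise_at_snr(
        audio[start:start + FRAME_LEN], snr_db, rng)
    return out

# 45 overlapping frames need (45 + 1) * 576 samples.
audio = [0.5 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(46 * HOP)]
rng = random.Random(0)
noisy = noisy_copy(audio, frame_index=19, snr_db=30.0, rng=rng)  # the 20th frame

start = 19 * HOP
diff = [b - a for a, b in zip(audio, noisy)]
sig_p = sum(v * v for v in audio[start:start + FRAME_LEN]) / FRAME_LEN
noise_p = sum(v * v for v in diff[start:start + FRAME_LEN]) / FRAME_LEN
measured_snr = 10.0 * math.log10(sig_p / noise_p)
print(round(measured_snr, 3))
```

The modified samples cover the second half of the 19th frame and the first half of the 21st frame, which is why those neighbours are described as half doctored.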
However, if the noise is added globally after the forgeries, all the frame offsets become undetectable and are marked as -100. In this case, the proposed method is not able to locate the forgeries, but it still
[Figure 12: (a) waveform, amplitude against sample index (0 to 2303), for the original audio and the doctored audio; (b) spectrum, magnitude in dB (0 to 10), for frame k, frame k+1, and frame k+2.]
Fig. 12. The case of splicing at the boundary. Shown in (a) is a waveform of audio from which 576 samples are cropped, starting at the 1153rd sample. Shown in (b) is the spectrum of the three frames of the doctored audio. All the frames have the quantization characteristic except the middle frame.
[Figure 13: (a) audio with additive noise, amplitude against sample index (0 to 2.5 x 10^4), with the noisy region marked; (b) NAC result of each frame, NAC (200 to 400) against frame index (0 to 45); (c) detection result of each frame, detected offset (-100 to 0) against frame index (0 to 45).]
Fig. 13. The effect of additive noise on NAC. Shown in (a) is the waveform of audio with partially added noise. Shown in (b) are the NAC results of all frames. Shown in (c) is the detection result of frame offsets.
indicates that the audio is abnormal and must have been postprocessed. In this case, the audio is suspect and is rejected as evidence.
5.3.3 Filtering. Another common way to cover forgeries is to filter the tampered signal. Here we test a median filter, a mean filter, and a low-pass filter. The same speech clip as in the preceding section is used for testing. Since the effect of the different filters on the NAC is similar, and because of the page limit, only the result of the median filter is illustrated.
[Figure 14: (a) audio with filtering, amplitude against sample index (0 to 2.5 x 10^4), with the filtered region marked; (b) NAC result of each frame, NAC (200 to 400) against frame index (0 to 45); (c) detection result of each frame, detected offset (-100 to 0) against frame index (0 to 45).]
Fig. 14. The effect of median filtering on NAC. Shown in (a) is the waveform of partially filtered audio. Shown in (b) are the NAC results of all frames. Shown in (c) is the detection result of frame offsets.
First, the 20th frame of the audio signal is filtered by a median filter of length 7, as shown in Figure 14(a). Since both the 19th and 21st frames overlap the 20th frame by 50%, the 19th and 21st frames are half filtered at the same time. Then the NACs of all frames are investigated, and the proposed detection method is applied to the whole speech clip.
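A length-7 median filter of the kind used in this test can be sketched in a few lines. The edge handling (repeating the first and last sample) is an assumption made here, since the text does not say how the borders of the filtered region were treated; the function name is hypothetical.

```python
from statistics import median

def median_filter(samples, length=7):
    """Sliding median with an odd window; edges are padded by repetition."""
    half = length // 2
    padded = [samples[0]] * half + list(samples) + [samples[-1]] * half
    return [median(padded[i:i + length]) for i in range(len(samples))]

# A single outlier (100) in an otherwise smooth ramp is removed.
ramp = [1, 2, 3, 100, 5, 6, 7, 8, 9]
filtered = median_filter(ramp, length=7)
print(filtered)
```

Because the filter replaces each sample by a value taken from its neighbourhood, the decoded samples no longer match the quantized spectrum of the first encoding, which is consistent with the loss of a clear NAC minimum in the filtered frames.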
As shown in Figure 14(b), similar to the case of adding noise, the NAC plots of the 18th, 19th, 20th, 21st, and 22nd frames show no significant decrease, while the plots of the other frames each contain a clearly small value. The detection result in Figure 14(c) shows that the frame offsets of the 18th, 19th, 20th, 21st, and 22nd frames are undetectable, while the other frames have a clear offset of 0. This means that the proposed method can indicate the filtered portion of an audio signal if the signal is partially filtered. However, as in the case of adding noise, if the audio signal is globally filtered, the proposed method cannot locate forgeries automatically, but it still indicates that the filtered signal has been manipulated.
6. DISCUSSIONS AND CONCLUSIONS
6.1 Extension to Other Formats
Although we only investigate audio in MP3 format, the idea of locating forgeries via the frame offset is suitable for audio in other compressed formats, such as AAC, WMA, and OGG Vorbis. Since audio in these formats is generated frame by frame, the frame offset of each frame can be obtained.
To confirm this, we use an audio signal encoded with AAC for testing. Note that the length of each frame in AAC is 1024, and the frequency spectrum also consists of MDCT coefficients. The tool we use to encode and decode audio signals is FAAC [FAA]. The test clip consists of 40 frames of audio, and its sampling rate is 44.1 kHz. The encoding parameters of FAAC are 96 kbps, mono. First, we investigate whether the AAC audio has the quantization characteristic. Offsets -1, 0, and +1 are applied to the 9th frame, respectively. For each offset, 1024 MDCT coefficients are obtained. Then we plot these coefficients in a logarithmic representation, as shown in Figures 15(a), (b), and (c). It is obvious that
[Figure 15: (a) offset = -1; (b) offset = 0; (c) offset = +1; (d) NAC result of audio encoded with AAC. Panels (a) to (c) plot magnitude in dB (0 to 40) against frequency index (0 to 1200); panel (d) plots NAC (400 to 800) against frame offset (0 to 2500).]
Fig. 15. Quantization characteristic of AAC. Subfigures (a), (b), and (c) correspond to the spectrum of the 9th frame with offsets -1, 0, and +1, respectively. As in the case of MP3, the quantization characteristic shows up only with the matching offset (0). Subfigure (d) shows the NAC result of the 9th frame with offsets 1 to 2500.
only Figure 15(b) shows the quantization characteristic. Furthermore, we apply offsets 1 to 2500 to the frame and obtain the corresponding NAC results, as shown in Figure 15(d). A period of 1024 can be observed. Within the length of one frame, there is only one matching offset, and its NAC is clearly separated from the other 1023 NACs.
We now check for AAC audio forgeries. The audio with 40 frames has 40960 samples in total. We delete the samples from index 10000 to 15000. Then we apply the proposed method to the doctored AAC audio. Each frame generates 1024 NACs, and the matching offset is recognized as the one corresponding to the minimum NAC. The detection result is shown in Figure 16.
We have therefore shown that the proposed method can detect forgeries in AAC audio. Our method can also be extended to other frame-based encoders, since the spectrum obtained with the matching offset approximates the first-encoding spectrum more closely than the spectra obtained with other, shifted offsets. What must be kept in mind is that the procedure for extracting the spectrum varies between encoders, since they use different frame lengths and windows.
6.2 Conclusions
In this article, we propose a method to expose MPEG audio forgeries using frame offsets. The main contributions of this work are as follows. First, to the best of our knowledge, this is the first piece of work on detecting forgeries in MP3 audio. It extends the research topics of forgery detection. Second, this work illustrates that MDCT coefficients can reflect forgery traces very well for MPEG audio. Via theoretical analysis and extensive experiments, we show that NAC is a reliable feature to retrieve
[Figure 16: (a) original audio; (b) doctored audio; (c) detection result. Panels (a) and (b) plot amplitude against sample index (0.5 to 4 x 10^4); panel (c) plots detected offset (0 to 400) against frame index (5 to 40).]
Fig. 16. Forgery detection result of AAC audio.
frame offsets. Based on the fact that most common forgeries change the frame offsets of audio, the proposed method can locate these forgeries effectively. Extensive experimental results show that the proposed method performs very well on both speech and music. All the accuracy rates are above 99%, which shows the effectiveness of our proposed method. Another advantage of the proposed method is its computational simplicity. We only need to investigate the MDCT coefficients of the audio.
However, if audio is transcoded between different compressed formats, the frame offset is difficult to obtain and the proposed method will fail in this case. It is also noted that at a high bit rate such as 128 kbps the NAC method is not very suitable for retrieving frame offsets, since zero coefficients are few at high bit rates. So in the future, we will focus on obtaining the frame offset after transcoding and at high bit rates.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their constructive comments. Their
suggestions will be very helpful for our future work.
REFERENCES
BOEHM, R. AND WESTFELD, A. 2004. Statistical characterisation of MP3 encoders for steganalysis. In Proceedings of the 6th ACM Multimedia and Security Workshop. ACM.
FAAC. 2012. Freeware advanced audio coder. http://www.audiocoding.com/faac.html.
FARID, H. 1999. Detecting digital forgeries using bispectral analysis. MIT AI Memo AIM-1657, MIT.
FU, D., SHI, Y., AND SU, W. 2007. A generalized Benford's law for JPEG coefficients and its applications in image forensics. In Proceedings of SPIE Conference on Security, Steganography, and Watermarking of Multimedia Contents.
GRIGORAS, C. 2005. Digital audio recording analysis: The electric network frequency (ENF) criterion. Int. J. Speech Lang. Law 2, 1, 63-76.
HERRE, J. AND SCHUG, M. 2000. Analysis of decompressed audio - The inverse decoder. In Proceedings of the 109th AES Convention.
HERRE, J., SCHUG, M., AND GEIGER, R. 2002. Analysing decompressed audio with the inverse decoder - Towards an operative algorithm. In Proceedings of the 112th AES Convention.
ISO. 1992. ISO/IEC international standard IS 11172-3. Information technology - Coding of moving pictures and associated audio for digital storage media up to about 1.5 Mbit/s. http://www.iso.org/iso/catalouge_detail.htm?csnumber=22412.
KRAETZER, C., OERMANN, A., DITTMANN, J., AND LANG, A. 2007. Digital audio forensics: A first practical evaluation on microphone and environment classification. In Proceedings of the 9th ACM Multimedia and Security Workshop.
LAME 3.97. 2012. MP3 encoder. http://lame.sourceforge.net.
LUKAS, J. AND FRIDRICH, J. 2003. Estimation of primary quantization matrix in double compressed JPEG images. In Proceedings of the Digital Forensic Research Workshop.
PAINTER, T. AND SPANIAS, A. 2000. Perceptual coding of digital audio. Proc. IEEE 88, 4, 451-513.
POPESCU, A. AND FARID, H. 2004. Statistical tools for digital forensics. In Proceedings of the 6th International Workshop on Information Hiding.
QU, Z., LUO, W., AND HUANG, J. 2008. A convolutive mixing model for shift double JPEG compression with application to passive image authentication. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.
WANG, Y. AND VILERMO, M. 2003. Modified discrete cosine transform - Its implications for audio coding and error concealment. AES J. 51, 1, 51-62.
WANG, Y., YAROSLAVSKY, L., VILERMO, M., AND VAANANEN, M. 2000. Some peculiar properties of the MDCT. In Proceedings of the 16th IFIP World Computer Congress.
YANG, R., QU, Z., AND HUANG, J. 2008. Detecting digital audio forgeries by checking frame offsets. In Proceedings of the 10th ACM Multimedia and Security Workshop. ACM.
Received November 2010; revised July 2011; accepted August 2011