Band Energy Difference for Source Attribution in Audio Forensics

Da Luo, Member, IEEE, Paweł Korus, Member, IEEE, and Jiwu Huang, Fellow, IEEE

Abstract—Digital audio recordings are one of the key types of evidence used in law enforcement proceedings. As a result, the development of reliable techniques for forensic analysis of such recordings is of principal importance. One of the main problems in forensic analysis is source attribution, i.e., verifying whether a certain recording was acquired with a given device. While this problem has been widely studied for other types of multimedia signals, there are very few techniques for audio recordings. Moreover, reported evaluation results were obtained from extremely small datasets on the order of a dozen devices. The goal of this paper is to propose a new feature set, the band energy difference (BED) descriptor, for source attribution of digital speech recordings. We demonstrate that a frequency response curve extracted from sample recordings can serve as a robust fingerprint that carries significant discriminative power and can characterize the recording device. We study two sub-problems of source attribution: (a) identification of a recording device among a list of possible candidates (device identification); and (b) confirmation that a suspected device has indeed been used to acquire the recording in question (device verification). For our evaluation, we prepared two novel datasets: a controlled-conditions dataset with 31 devices; and an uncontrolled-conditions dataset with 141 devices. Our experimental evaluation demonstrates that the proposed BED descriptor is effective for both device identification and verification. In the former task, we reached an accuracy of over 96%. In the latter, we obtained a high true positive rate of 89% while maintaining a fixed low false positive rate of 1%.

Index Terms—Audio forensics, source attribution, recording device identification, band energy difference response, microphone fingerprint.

I. INTRODUCTION

COURT proceedings and law enforcement procedures increasingly rely on a variety of multimedia data that can be easily acquired by ubiquitous smart mobile devices. Digital photographs, and audio and video recordings are commonly used as evidence. To ensure their credibility, the field of multimedia forensics has developed numerous techniques to assess the origin, processing history and authenticity of the acquired data. Source attribution is one of the most important problems in this field, and encompasses two main problem classes: (a) identification, which involves linking an investigated recording to one of several suspected acquisition devices; and (b) verification, which involves confirming that an investigated recording was acquired by a specific acquisition device. Both problems have two principal variations, where the goal is to confirm either the model or a specific instance of the acquisition device.

The source attribution problem for digital photographs has been studied in detail, leading to reliable verification protocols [1-7]. The fundamental principle of camera instance attribution involves matching sensor pattern noise signatures, i.e., a consistent bias of individual pixels of the imaging sensor stemming from manufacturing imperfections. Similar techniques also exist for video cameras [8] and scanners [9, 10]. In camera model attribution, the typical approach involves exploiting machine learning techniques trained on low-level image features [11, 12].

In contrast to digital images, source attribution techniques for audio recordings are still in an early stage of development. Contemporary forensic investigations of audio recordings rely primarily on manual comparisons of waveforms and spectrograms against reference material acquired in similar conditions [11]. Due to the complexity of possible sounds, recording environments and distorting factors, the goal is to authenticate the whole acquisition context, which includes, but is not limited to, the used microphone. One of the best-established analysis methods involves authentication of the electric network frequency signature, which allows one to roughly confirm the location of the recording as well as expose its prospective discontinuities [13].

Adoption of machine learning methods is still in its infancy and remains limited to academic research. Most existing studies rely on generic acoustic features, such as Mel-frequency cepstral coefficients (MFCCs), and use very small datasets including approximately a dozen devices (see Section II for a detailed summary of existing research). In contrast, mature camera attribution protocols have been tested on large-scale datasets with thousands of individual devices representing up to 150 distinct models [14]. Hence, significantly larger datasets will be needed to reliably assess the performance of existing and future methods.

This work was supported in part by the NSFC (U1636202, 61602318, 61631016), in part by the Shenzhen R&D Program (JCYJ20160328144421330), and in part by the Beijing Key Laboratory of Advanced Information Science and Network Technology under Grant XDXX1602.
D. Luo is with the Shenzhen Key Laboratory of Media Security, College of Information Engineering, Shenzhen University, Shenzhen, China, and also with the Computer and Network Security Institute, Dongguan University of Technology, Dongguan, China (e-mail: luoda@szu.edu.cn).
P. Korus is with the Tandon School of Engineering, New York University, USA, and also with the Department of Telecommunications, AGH University of Science and Technology, Kraków, Poland. The presented research was carried out while he was with the Shenzhen Key Laboratory of Media Security, College of Information Engineering, Shenzhen University, Shenzhen, China (e-mail: pkorus@agh.edu.pl).
J. Huang is with the Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, College of Information Engineering, Shenzhen University, Shenzhen, China (e-mail: jwhuang@szu.edu.cn).

In this paper we address the problem of microphone attribution and focus on speech recordings obtained using smart-phones. We propose a new feature set, the band energy difference descriptor, and demonstrate its efficiency for two typical attribution problems: device identification and verification. The proposed feature set is compact and easy to compute. It allowed us to obtain over 96% classification accuracy with a standard support vector machine classifier. Our evaluation is performed on two novel datasets: (a) controlled-conditions recordings from 31 smart-phones; and (b) uncontrolled-conditions recordings from 141 smart-phones. To the best of our knowledge, this is the largest and most comprehensive evaluation of this kind to date.

The main contributions of our work include the following:
• proposal of a compact feature set, called the band energy difference descriptor, which characterizes the frequency response of the device; we demonstrate that the proposed feature set provides meaningful and discriminative features that work effectively in both device identification and verification;
• creation of two novel datasets with recordings obtained in controlled and uncontrolled conditions; both datasets are significantly larger than their previously used counterparts;
• extensive evaluation of the proposed and state-of-the-art techniques on the created datasets.

The paper is organized as follows. Section II reviews related work on source attribution of audio recordings. The proposed descriptor is described in detail in Section III, along with the considered device identification and verification procedures. Section IV presents the results of our experimental evaluation. We conclude in Section V. For the sake of research reproducibility, the code of our method and the datasets used in our evaluation can be obtained online¹.

¹The data will be published online later.

II. RELATED WORK

Research on machine-learning-based source attribution of audio recordings began with the pioneering work of Kraetzer et al., who proposed a set of intra-window and inter-window features based on time-domain, frequency-domain and cepstral-domain characteristics [15]. A follow-up study improved the detection accuracy by using decision fusion with two multi-class classifiers: a decision tree and a linear logistic regression model [16]. Example features included in the newest version (v.3.0.1) of Kraetzer's descriptor include parameters like zero crossing rate, entropy, RMS amplitude, pitch, spectral smoothness, Mel-frequency cepstral coefficients, etc. In total, the descriptor includes 592 intra-window and 17 inter-window features. For details, the reader is referred to [17].

Despite the variety of the included features, feature ranking performed by Kraetzer et al. in [17, 18] revealed that the discriminative power of this descriptor relies mostly on MFCCs and filtered MFCCs. Based on this descriptor, a learning algorithm can be trained to distinguish between different microphones. The method has been tested on up to 8 microphones, and to the best of our knowledge, it still awaits large-scale evaluation.

A similar approach was later discussed by Hanilçi et al., who also used MFCCs to identify the brand and model of cellphones that acquired a given recording [19]. The authors used a dataset with 14 cellphone models and considered both vector quantization (VQ) and support vector machines (SVMs) for classification. In a follow-up study, Hanilçi et al. further optimized acoustic features for cellphone recognition [20]. They considered Mel-frequency, linear frequency and Bark frequency cepstral coefficients (MFCCs, LFCCs and BFCCs, respectively) as well as linear prediction cepstral coefficients (LPCCs). Their evaluation covered various strategies of feature optimization including feature normalization, cepstral mean normalization, cepstral mean and variance normalization, and delta and double-delta coefficients. The study concluded that while baseline MFCCs outperformed other types of features, adoption of cepstral mean and variance normalization yields superior performance for LPCCs (with only slightly better results than MFCCs).

An alternative approach proposed by Garcia-Romero and Espy-Wilson [21] uses Gaussian mixture models (GMMs) to distinguish between microphones and improve speaker recognition accuracy. In this case, a set of GMM parameters is essentially used as a joint speaker-microphone pair descriptor. Similar to other studies, the method has been tested on a small dataset (8 landline telephone handsets and 8 microphones) and awaits large-scale evaluation.

A Gaussian mixture model was also used by Kotropoulos and Samaras [22], who extracted MFCCs from individual frames of a recording and trained a GMM with diagonal covariance matrices. The template for each device was constructed from Gaussian super-vectors (GSVs). The described method was evaluated on a dataset of 21 cell-phones representing 7 different brands. The authors reported identification accuracy of 97.6% with the use of neural networks with radial basis function (RBF) kernels. Zou [23] also employed GSVs with MFCCs to characterize sparse representations (fingerprints) of 14 cell phones.

Eskidere used a GMM classifier and compared its performance on various acoustic features including MFCCs, LPCCs, and perception-based linear predictive coefficients (PLPCs) [24]. He tested the algorithm on 16 different microphones and concluded that the best identification accuracy can be obtained with LPCCs. The obtained identification accuracy reached 99.6% for a speaker-independent model (100% in the speaker-dependent case).

Panagakis and Kotropoulos proposed using random spectral features (RSFs) extracted from averaged spectrograms by projection onto Gaussian random matrices [25]. They considered a sparse-representation classifier and obtained identification accuracy of 95.55% on a dataset with 8 landline telephone handsets. In a follow-up study [26], the same authors included additional labeled spectral features and improved the identification accuracy, reaching 97.58%.

In their most recent paper, Panagakis and Kotropoulos [27] proposed a new classification approach based on over-complete dictionaries, where each feature vector is represented as a linear combination of atoms from a dictionary.

The features are extracted by averaging spectrograms and MFCCs and taking their random projection in a similar manner as for RSFs. The classifier selects the device whose atoms yield the lowest reconstruction error in a regularized sparse representation. The described approach was tested on a dataset with 8 landline telephone handsets, and the best variant under consideration achieved a classification accuracy of 97.67%.

Jahanirad et al. considered a related problem of source device identification for VoIP calls made from smartphones [28] and computers [29]. Their approach uses the entropy of MFCCs extracted from silent frames of the recordings. The authors tested many variants of the feature set and several popular classifiers. They reported a classification accuracy of 99.9% on two datasets with: (a) 21 smartphones [28]; and (b) up to 10 Mac computers or PCs [29].

Cuccovillo et al. proposed a feature vector for microphone discrimination and used it for audio tampering detection [30] and source attribution [31]. The features were based on blind channel estimation. In a recent follow-up study, they adapted their descriptor to an open-set classification approach [32]. While the authors reported good classification results across various compression settings (a Rand index measure of at least 94%), the approach was tested on only 8 acquisition devices (both smartphones and computers).

Finally, the problem has also been approached from the perspective of statistics. Statistical methods based on frequency coefficients were considered by Buchholz et al. [33]. Malik and Miller proposed an interesting novel approach that extracts higher-order statistical features from the estimated scale-invariant Hu moments of the bicoherence magnitude spectrum [34]. For microphone identification, they used cross-correlation between bicoherence phase spectra. Similar to previous studies, the method was tested on 8 microphones, and also awaits large-scale evaluation. In a recent paper, Eskidere considered wavelet-based features [35], which compared favorably to MFCCs on a dataset with 14 acquisition devices of various types (notebooks, MP3 players, smart-phones, digital voice recorders, etc.).

III. BAND ENERGY DIFFERENCE DESCRIPTOR

This section introduces the proposed feature set, referred to as the band energy difference (BED) descriptor. The audio acquisition pipeline begins with conversion of mechanical motion of a diaphragm into an electrical signal. The resulting signal is then quantized and post-processed. Finally, the recording is compressed for storage or transmission (popular formats include AMR, MP3, WMA and M4A). The details of this pipeline differ among manufacturers, leading to minor variations and artifacts that can be used for forensic analysis.

We observed that some frequency bands tend to exhibit significant deviations of their energy compared to adjacent bands. This behavior is consistent across multiple devices and various recording conditions, including different speakers and recording environments. Example deviations are shown in Fig. 1 for 6 smart-phones from 2 manufacturers. The reported statistics were collected from 300 audio clips recorded in different locations and with different speakers. To capture these variations, we consider a feature descriptor computed from the differences in energies of adjacent frequency bands. The resulting descriptor can discriminate between different recording devices for various acquisition contexts.

A. Descriptor Computation

The proposed descriptor is computed as follows. First, the audio recording s is divided into N-sample frames s^(t), with t = 1, ..., T being the index of the frame. Then, an N-element window H is applied to every frame, yielding:

    x^(t) = H s^(t).     (1)

In our experiments, we used the popular Hamming window. We found that the choice of the window function has negligible impact on the achievable classification performance. In Appendix A, we report the obtained results for other window functions.

In the next step, a discrete Fourier transform is computed for every frame, yielding X^(t). Let X_k denote the k-th element of X. The feature vector for each frame is defined as:

    F_k^(t) = 1 if |X_{k+1}^(t)| − |X_k^(t)| ≥ 0, and 0 otherwise,     (2)

where k = 1, 2, ..., N/2 − 1 due to the spectrum's symmetry. The final descriptor D of the recording is obtained by averaging the features over all frames:

    D = (1/T) Σ_{t=1}^{T} F^(t).     (3)

The above procedure leads to a compact, efficiently computable descriptor with N/2 − 1 dimensions. Fig. 2 shows example descriptors (N = 256) for 11 smart-phones obtained in 2 different locations with four different speakers. It can be observed that certain characteristics revealed by the descriptor, e.g., peaks in certain locations, remain stable despite variations in the recorded content and the recording conditions.
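For illustration, the descriptor can be computed with a few lines of NumPy. The sketch below is our own minimal implementation of Eqs. (1)-(3); it assumes a mono recording already loaded as a float array and maps X_1 to the first non-DC FFT bin (function and parameter names are ours, not part of the paper).

import numpy as np

def bed_descriptor(s, frame_len=256):
    # Band energy difference (BED) descriptor, Eqs. (1)-(3): split the
    # signal into non-overlapping N-sample frames, apply a Hamming window,
    # mark adjacent FFT bins with non-decreasing magnitude, and average.
    window = np.hamming(frame_len)                 # N-element window H, Eq. (1)
    n_frames = len(s) // frame_len
    descriptor = np.zeros(frame_len // 2 - 1)
    for t in range(n_frames):
        frame = s[t * frame_len:(t + 1) * frame_len] * window
        mag = np.abs(np.fft.rfft(frame))           # |X_k| for bins 0..N/2
        descriptor += np.diff(mag[1:]) >= 0        # Eq. (2), k = 1..N/2-1
    return descriptor / max(n_frames, 1)           # Eq. (3), average over frames

For a 2-second segment sampled at 44,100 Hz and N = 256, this yields a 127-dimensional descriptor averaged over roughly 344 frames.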
The discriminative capability of the proposed descriptor can also be observed in Fig. 3, which shows a t-SNE [36] mapping of BED descriptors from 31 smart-phones (one speaker and one recording environment). Most of the devices form clearly separable clusters, even in the 2-dimensional projection of the descriptors. This clearly demonstrates the potential utility of this descriptor in forensic applications. A detailed analysis of its performance for both device identification and verification will be presented in Section IV.
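A projection in the spirit of Fig. 3 can be reproduced with an off-the-shelf t-SNE implementation. The sketch below uses scikit-learn and matplotlib, which are our tooling choices (the paper only cites t-SNE [36]); it assumes a matrix of BED descriptors and integer device labels.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_descriptor_map(descriptors, device_ids):
    # Project the (n_segments x 127) descriptor matrix to 2-D and
    # color each point by its device label, as in Fig. 3.
    emb = TSNE(n_components=2, random_state=0).fit_transform(descriptors)
    plt.scatter(emb[:, 0], emb[:, 1], c=device_ids, cmap='tab20', s=8)
    plt.title('2-D t-SNE projection of BED descriptors')
    plt.show()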
B. Source Attribution Framework

Source attribution involves two main sub-problems, illustrated in Fig. 4 [11]:
• Device identification is a multi-class classification problem and involves a candidate list of devices, one of which is known to be the correct acquisition device. A classifier is then trained to distinguish between the devices.
• Device verification is a two-class classification problem which aims to confirm whether a questioned recording was acquired by a specific target device.

[Fig. 1: six panels of band energy difference statistics, labeled Huawei (a)-(c) and iPhone (a)-(c).]
Fig. 1. Examples of statistical artifacts captured by the proposed BED descriptor: red circles mark the most significant deviations of band energy difference for 6 smart-phones from 2 manufacturers; the statistics were obtained from 300 audio clips recorded in different locations with different speakers.

This section will describe in detail the procedures that a forensic analyst should follow. For both sub-problems, we use a support vector machine² (SVM) with a radial basis function kernel as a classifier. The parameters of the SVM (C, γ) were found using grid search in the exponential ranges [−5, 7] and [−3, 7] for C and γ, respectively. The classifiers were trained on audio features extracted from 2-second-long, non-overlapping segments. Such a length has been determined by previous research to be sufficient for the application at hand [11].

²We have also tried other classifiers and obtained similar, albeit slightly worse, results (for all considered forensic features). Specifically, we observed an approx. 2% deterioration for the Fisher linear discriminant (FLD) ensemble [37] and a further 1.5% drop for a neural network with dropout. The deterioration was consistent for both the device identification and verification problems.
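This training setup can be reproduced with standard tooling. The sketch below uses scikit-learn (our choice of library) and assumes base-2 exponents for the grid, which the paper does not state explicitly; enabling probability estimates anticipates the threshold selection used later for verification.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(X, y):
    # RBF-kernel SVM with grid search over (C, gamma) in the
    # exponential ranges [-5, 7] and [-3, 7] described above.
    param_grid = {
        'C':     2.0 ** np.arange(-5, 8),
        'gamma': 2.0 ** np.arange(-3, 8),
    }
    search = GridSearchCV(SVC(kernel='rbf', probability=True),
                          param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)          # X: per-segment descriptors, y: device labels
    return search.best_estimator_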
1) Device identification: In the device identification scenario, our framework requires the following inputs: (1) a speech recording whose source needs to be identified; and (2) the suspected smart-phones, one of which is known to be the original recording device.

To train an identification model, it is necessary to record training samples with all of the suspected devices described above. The recording conditions should match the questioned recording as closely as possible, in particular with respect to the recording environment (see Section IV-C for more information about the robustness of the descriptor).

Based on the example recordings, the next step is to extract BED descriptors according to the procedure from Section III-A. A multi-class classification model is then trained based on the above descriptors. Finally, the questioned recording is processed by the trained identification model, which determines the most likely acquisition device.

2) Device verification: In the device verification scenario, our framework requires the following inputs: (1) a speech recording whose source needs to be verified; (2) the target smart-phone that is claimed to have recorded the clip; and (3) other, non-target smart-phones to provide representative counter-examples. To train a verification model, it is necessary to record training samples with all of the devices described above. Similar to the identification scenario, the recording conditions should match the questioned clip as closely as possible.

Based on the example recordings, the next step is to extract BED descriptors and train a two-class classifier to distinguish between the target device and the non-target devices. Hence, one class is trained with examples from the target device and the second class is trained with examples from randomly chosen non-target devices. The impact of various aspects of choosing non-target devices, including their number and models, will be addressed in detail in Section IV-F. In order to guarantee a low false positive rate, the classifier should be trained to yield posterior probabilities of its decisions. Then, a proper decision threshold can be chosen based on n-fold cross-validation on the above training set. This issue will be addressed in Section IV-F.
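One way to realize this threshold selection is sketched below: out-of-fold posterior probabilities are collected with n-fold cross-validation, and the threshold is set on the non-target scores. This is our illustration, not the paper's exact procedure; it assumes labels of 1 for the target device and 0 for non-target devices, with the target false positive rate left as a parameter.

import numpy as np
from sklearn.model_selection import cross_val_predict

def select_threshold(clf, X, y, target_fpr=0.01, folds=5):
    # Collect out-of-fold posteriors of the target class and pick the
    # threshold that keeps the false positive rate on the non-target
    # training examples at the requested level.
    proba = cross_val_predict(clf, X, y, cv=folds,
                              method='predict_proba')[:, 1]
    return np.quantile(proba[y == 0], 1.0 - target_fpr)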
IV. EXPERIMENTAL EVALUATION

To validate our approach, we conducted an extensive experimental evaluation, which covered both of the source attribution problems introduced in Section III-B: device identification and verification. In addition to the proposed descriptor, we considered two baseline approaches: (1) a 12-dimensional vector of MFCC coefficients; and (2) Kraetzer's 609-dimensional audio forensic feature (AFF) vector [17].

[Fig. 2: an 11×8 grid of descriptor plots; rows correspond to devices #1-#31 in steps of 3 (iPhone, Xiaomi, Oppo and Huawei models), columns to configurations M1@A, F1@A, M2@A, F2@A, M1@C, F1@C, M2@C and F2@C.]
Fig. 2. Example band energy difference (BED) descriptors for 11 smart-phones (successive rows) obtained in various recording conditions (successive columns): two different locations (denoted as A and C) and 4 different speakers - two males (M1, M2) and two females (F1, F2).

In the following description, the above approaches will be referred to as MFCC and Kraetzer's AFF, respectively.

A. Datasets

Noting the lack of large-scale datasets in previous studies, we prepared two novel datasets for our evaluation: (1) a controlled-conditions dataset with 31 devices (CC dataset); and (2) an uncontrolled-conditions dataset with 141 devices (UC dataset). A brief description of the structure and the acquisition protocols is presented below. A detailed list of the recording devices can be found in the supplementary materials.

Most of the presented experiments were performed on the CC dataset, which guarantees that our feature set can capture device-specific characteristics. It also allows us to assess sensitivity to the recording conditions. The UC dataset demonstrates the scalability of the proposed descriptor, even in the challenging conditions of numerous similar devices and recording contexts.

a) Controlled-conditions dataset: This dataset contains uncompressed³ recordings of human speech, acquired using 31 different smart-phones. All recordings were acquired in mono mode with a 44,100 Hz sampling rate. We recorded a 2-minute clip for each of 4 individual speakers (2 males and 2 females) in 4 locations (office 1, office 2, hall, platform), leading to 16 unique configurations. For each speaker, we recorded a different utterance. To ensure that each phone recorded the same content, we compiled a 9-minute-long reference clip that was played back and recorded with all smart-phones simultaneously (in batches of 5 phones per session). To ensure good quality, we recorded the initial speech with a Takstar TS-6800 wireless professional microphone and used speakers from a Logitech Lapdesk N550 for playback. Individual speakers in the reference clip were separated by a distinct monotone beep, padded with a few seconds of silence. The final clips were obtained by automatic segmentation and synchronization of the recordings based on the beeps.

³Due to recording software limitations, we could not record raw WAV audio on iPhones. Instead, we used the M4A format with the default bit-rate.
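The paper does not spell out its segmentation routine; an illustrative energy-ratio beep detector is sketched below under our own assumptions (the 1 kHz beep frequency, the ±50 Hz analysis band and the 0.5 energy threshold are hypothetical parameters).

import numpy as np

def find_beep_onsets(s, sr, beep_hz=1000.0, frame=1024, hop=512, thresh=0.5):
    # Slide a window over the signal and flag positions where most of
    # the spectral energy is concentrated around the beep frequency.
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    band = np.abs(freqs - beep_hz) < 50.0
    win = np.hanning(frame)
    onsets = []
    for start in range(0, len(s) - frame, hop):
        spec = np.abs(np.fft.rfft(s[start:start + frame] * win))
        if spec[band].sum() / (spec.sum() + 1e-9) > thresh:
            onsets.append(start)
    return onsets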
b) Uncontrolled-conditions dataset: This dataset contains compressed (MP3/M4A at 128 kbps) recordings of human speech, acquired using 141 different smart-phones. Similar to the CC dataset, all recordings were obtained in mono mode with 44,100 Hz sampling. The recordings are 10 minutes long and represent various content, speakers, and recording locations (some locations/speakers are featured in many recordings).

[Fig. 3: a 2-D scatter plot with per-device clusters; point labels include iPhone, MI (Xiaomi), Meizu, Zuk, Nexus, Samsung, OPPO, Xperia, Vivo, Letv and Huawei devices.]
Fig. 3. Visualization of the discriminative power of the proposed BED descriptor: 2-D t-SNE projection of BED descriptors for 31 smart-phones in a selected location for a selected speaker; it can be observed that even in the projected 2-D space, the devices are easily separated.

[Fig. 4: block diagram of the two source attribution scenarios.]
Fig. 4. Two main source attribution scenarios: a) device identification with an N-class classification problem; b) device verification with a two-class classification problem.

This dataset contains multiple recordings obtained with the same models/brands of phones. The most notable groups include 44 iPhones, 25 Huawei phones, and 25 Xiaomi phones.

B. Parameter Selection

Computation of the proposed descriptor involves choosing the frame length N. To determine the best frame length, we performed both device identification and verification on the UC and CC datasets. Based on the obtained results (Table I), we chose the frame length N = 256 for further experiments, since it delivered the best results in device identification (averaged over the UC and CC datasets) and nearly the best results in device verification, while requiring a reasonable training time (the reported values correspond to the typical range for both the device identification and verification scenarios).

TABLE I
AVERAGE CLASSIFICATION ACCURACY FOR DIFFERENT FRAME LENGTHS; TRAINING TIME IS REPORTED RELATIVE TO THE SHORTEST FRAME.

       Device identification [%]    Device verification [%]    Training
N      UC dataset    CC dataset     UC dataset    CC dataset   time
512    98.54         96.64          98.04         96.80        7-11×
256    99.07         96.59          97.64         96.53        3.5-4.5×
128    98.80         95.14          97.62         95.86        1.5-2×
64     96.70         93.01          97.15         95.67        1×

C. Generalization Capability of the BED Descriptor

To investigate the robustness of the proposed descriptor, we considered a cross-configuration scenario where the training samples are chosen from one context (i.e., a {speaker, location} pair) and the testing samples from another. The results for the device identification scenario are collected in Fig. 5. The reported values correspond to the average classification accuracy. It can be observed that when the training and testing configurations match, the identification accuracy is nearly 100% (see the diagonal). It can also be observed that the proposed descriptor is fairly robust to the speakers (see the red/orange squares around the diagonal) but more sensitive to the background noise of the recording environment. Note that despite the obvious performance deterioration, the accuracy is always considerably better than random guessing (≈3.2%).

[Fig. 5: a 16×16 accuracy matrix over training/testing configurations for locations A-D (office 1, office 2, hall, platform) and speakers M1, F1, M2, F2.]
Fig. 5. Classification accuracy in a cross-configuration scenario where training is performed in different conditions than testing; configurations correspond to the Cartesian product of 4 recording locations (A, B, C and D) and 4 speakers: male 1 (M1), male 2 (M2), female 1 (F1) and female 2 (F2).

D. Training Protocols

While it is relatively easy to replicate the recording environment, acquiring more speech samples from a specific person may be more challenging. Moreover, a single trained model should be able to handle challenging scenarios with multiple speakers. Hence, to address the sensitivity of the proposed descriptor to the speaker's voice, we considered a cross-speaker⁴ training protocol, i.e., we designate a target speaker for testing and train the classifier on speech examples from the remaining 3 speakers. Such an approach gives a more representative assessment of expected model performance, and hence was used in the experiments described below. For the sake of statistical sufficiency, we repeated the training 20 times with random division of the data into training and testing samples.

⁴We also considered an analogous cross-environment scenario, where 3 environments are used for training and the remaining one is used for testing. See Appendix B for the obtained device identification results.
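The cross-speaker protocol amounts to a leave-one-speaker-out split. A minimal sketch is given below; the helper and its argument (per-segment speaker labels) are our own illustration.

def cross_speaker_split(speakers, target):
    # Train on segments from all speakers except the target one and
    # test on the target speaker only (leave-one-speaker-out).
    train = [i for i, spk in enumerate(speakers) if spk != target]
    test = [i for i, spk in enumerate(speakers) if spk == target]
    return train, test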
E. Device Identification Performance

1) CC Dataset: Tab. II compares the average identification accuracy of the considered descriptors. The best result for each configuration is marked in bold. The proposed BED descriptor delivers the best performance for most of the configurations (15 out of 16). In the remaining configuration, it is only slightly worse than the state-of-the-art Kraetzer's AFF set. The raw MFCC descriptor is significantly inferior for all configurations. A detailed confusion matrix for the proposed BED descriptor is shown in Tab. III. The average identification accuracy is 96.6%, with the worst per-phone accuracy of 88% and a median of 98%.

2) UC Dataset: In this experiment, we performed device identification on the large-scale UC dataset. A single SVM classifier was trained to distinguish between 141 devices. The obtained average classification accuracies for the MFCC, Kraetzer's AFF and BED descriptors are 93.8%, 98.0% and 99.1%, respectively. The distributions of the scores along the diagonal of the confusion matrix are shown in Fig. 7.

F. Device Verification Performance

1) CC Dataset: The device verification scenario is a two-class classification problem. Hence, training a classifier involves choosing non-target devices. This process is expected to introduce additional variations into the analysis, and hence we repeated the experiment 20 times to assess the strength of this variation. In the experiments on the CC dataset, we used 10 non-target devices for model training, and treated the remaining 20 devices as unseen samples for validation.

[Fig. 6: per-device bar plots of true/false positive rates for devices 1-31; panel titles: MFCC - TPR (96.9%) / FPR (9.7%) and TPR (36.7%) for FPR=1%; Kraetzer's AFF - TPR (99.4%) / FPR (8.8%) and TPR (54.0%) for FPR=1%; BED - TPR (99.8%) / FPR (6.8%) and TPR (88.9%) for FPR=1%.]
Fig. 6. Comparison of true/false positive rates in the device verification scenario: (top) MFCC features; (middle) Kraetzer's AFF; (bottom) proposed descriptor; (left) SVM decision for a default decision threshold; (right) true positive rates for a fixed false positive rate of 0.01.

TABLE II
COMPARISON OF DEVICE IDENTIFICATION PERFORMANCE WITH CROSS-SPEAKER TRAINING

Identification accuracy [%] per target speaker

Loc.        M1            F1            M2            F2            Av.
MFCC
Office 1    89.6 ± 0.14   89.9 ± 0.18   83.7 ± 0.16   73.4 ± 0.06   84.2
Office 2    86.5 ± 0.27   93.9 ± 0.33   79.1 ± 0.24   77.4 ± 0.34   84.2
Hall        84.9 ± 0.04   91.3 ± 0.08   88.6 ± 0.11   79.8 ± 0.09   86.1
Platform    86.7 ± 0.06   92.6 ± 0.12   85.0 ± 0.34   77.8 ± 0.19   85.5
Kraetzer's AFF
Office 1    92.5 ± 0.30   94.3 ± 0.19   85.1 ± 0.15   86.2 ± 0.19   89.5
Office 2    90.5 ± 0.24   89.3 ± 0.31   79.4 ± 0.58   82.0 ± 0.39   85.3
Hall        89.9 ± 0.31   94.8 ± 0.12   89.6 ± 0.22   86.7 ± 0.27   90.3
Platform    94.8 ± 0.25   95.8 ± 0.07   88.9 ± 0.06   92.9 ± 0.32   93.1
Proposed BED descriptor
Office 1    96.3 ± 0.10   96.1 ± 0.06   96.2 ± 0.08   94.7 ± 0.16   95.8
Office 2    96.3 ± 0.09   97.1 ± 0.07   95.1 ± 0.08   96.7 ± 0.09   96.3
Hall        97.3 ± 0.03   92.7 ± 0.07   97.2 ± 0.04   97.5 ± 0.02   96.2
Platform    98.0 ± 0.02   97.9 ± 0.02   98.3 ± 0.08   98.1 ± 0.05   98.1

[Fig. 7: box plots of per-device correct detection rates (roughly 70-100%) for the three descriptors.]
Fig. 7. Distribution of correct classification rates for the MFCC, Kraetzer's AFF and BED descriptors on the large-scale UC dataset.

TABLE III
CONFUSION MATRIX OF THE PROPOSED BED DESCRIPTOR IN THE DEVICE IDENTIFICATION SCENARIO WITH CROSS-SPEAKER TRAINING

     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

1 90 . 3 . . 2 . 5 . . . . . . . . . . . . . . . . . . . . . . .
2 . 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3 6 . 89 . . 1 . 3 1 . . . . . . . . . . . . . . . . . . . . . .
4 . . . 100 . . . . . . . . . . . . . . . . . . . . . . . . . . .
5 . . . . 99 . . . . . . . 1 . . . . . . . . . . . . . . . . . .
6 1 . . . . 97 . 2 . . . . . . . . . . . . . . . . . . . . . . .
7 . . . . . . 91 . 9 . . . . . . . . . . . . . . . . . . . . . .
8 3 . 1 . . 2 . 93 . . . . . . . . . . . . . . . . . . . . . . .
9 . . . . . . 2 . 97 . . . . . . . . . . . . . . . . . . . . . .
10 . . . . . 1 . 1 . 98 . . . . . . . . . . . . . . . . . . . . .
11 . . . . . . . . . . 96 . . . 3 . . . . . . . . . . . . . . . .
12 . . . . . . . . 2 . . 97 . . . . . . 1 . . . . . . . . . . . .
13 . . . . 3 . . . . . . . 90 . . . . . . . 3 2 . . . . . . . . .
14 . . . . . . . . . . . . . 99 . . . . . . . . . . . . . . . . .
15 . . . . . . . . . . 1 . . . 98 . . . . . . . . . . . . . 1 . .
16 . . . . . . . . . . . . . . . 98 . . . . . . . . 1 . . . . . .
17 . . . . . . . . . . . . 1 . . . 94 . . . 3 . . . . . . . 1 . .
18 . . . . . . . . . . . . . 1 . . . 99 . . . . . . . . . . . . .
19 . . . . . . . . . . . 1 . . . . . . 99 . . . . . . . . . . . .
20 . . . . . . . . . . . . . . . . . . . 98 . . . . . . . . . . .
21 . 1 . . 1 . . . . . . . 1 . . . . . . . 88 2 . . . 1 . . 5 . .
22 . . . . . . . . . . . . . . . . . . . . . 99 . . . . . . . . .
23 . . . . . . . . . . . . . . . . . . . . 3 . 95 . . . . . . . .
24 . . . . . . . . . . . 2 . . . . . . . . . . . 97 . . . . . . .
25 . . . . . . . . . . . . . . . 1 . . . . . . . . 99 . . . . . .
26 . . . . . . . . . . . . . . . . . . . . . . . . . 100 . . . . .
27 . . . . . . . . . . . . . . . . . . . . . . . . . . 100 . . . .
28 . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 . . .
29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 . .
30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 .
31 . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . . . 98

The impact of choosing a different number of devices, including devices from the same brand, will be explored in detail below.

Fig. 6 shows variations in the true/false positive rates for each of the phones. The left column illustrates the default threshold chosen by the SVM training algorithm. Similar to previous experiments, the raw MFCC descriptor is clearly inferior, while Kraetzer's AFF and the proposed BED descriptor seem to provide comparable performance, with somewhat better results for the latter. While the true positive rates exhibit very good performance, the false positive rates reach unacceptable levels and exhibit significant variations across devices.

The right column of Fig. 6 shows a more practical setting, where the threshold for each model is individually adapted to fix a constant false positive rate of 1%. In this case, the proposed BED descriptor achieved an average TPR of 88.9% and significantly outperformed both the MFCC descriptor (TPR 36.7%) and Kraetzer's AFF (TPR 56.6%). The trade-off for all descriptors can be clearly observed on the receiver operating characteristics (ROC) in Fig. 8.

[Fig. 8: ROC curves (true positive rate vs. false positive rate in the range 0.01-0.1) for the MFCC, Kraetzer's AFF and BED descriptors.]
Fig. 8. Comparison of receiver operating characteristics (ROC) in the device verification scenario; the curves are averaged over all frames from all configurations and 20 random repetitions of the experiment.
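Such fixed-FPR operating points can be read off the empirical ROC curve. The sketch below uses scikit-learn's roc_curve, which is our tooling choice, to interpolate the TPR at a requested false positive rate.

import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, target_fpr=0.01):
    # Compute the empirical ROC curve and interpolate the true positive
    # rate at the requested false positive rate (1% in the experiments).
    fpr, tpr, _ = roc_curve(y_true, scores)
    return float(np.interp(target_fpr, fpr, tpr))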
2) UC Dataset: In this experiment, we evaluated device verification performance on the large-scale UC dataset. In contrast to the previous evaluation, we focused on two main aspects: (1) the impact of the number of non-target phones; and (2) the impact of same-brand phones used in the construction of the non-target class. The results for both scenarios are shown in Fig. 9.

In the former scenario, we randomly chose 5, 10, or 20 devices to form the non-target class. The remaining 135, 130, and 120 devices were used for verification. The experiment was repeated 20 times with randomized device selection. We used the default threshold for the SVM classifier. It can be observed that the true positive rate saturates, and even slightly deteriorates, after a certain number of devices is reached. However, even for 20 non-target devices, the median TPR exceeds 99%. More importantly, adding more devices consistently improves the false positive rate, which is more important in the device verification setting.

In the second scenario, we randomly chose 10 devices of the same brand to form the non-target class. We used the remaining devices of the same brand for verification. We show the results for phones produced by Apple, Huawei and Xiaomi, which are represented in our dataset by 44, 25, and 25 devices, respectively. The results are shown in the right column of Fig. 9.

Although, as expected, the most numerous group of Apple phones yields the worst performance, the proposed BED descriptor still delivers true positive rates of nearly 99%, with false positive rates of approximately 6%. For the remaining brands, we observed even better results.

[Fig. 9: box plots of true positive rates (97-100%) and false positive rates (0-12%), grouped by the number of non-target phones (5, 10, 20) on the left and by the brand of non-target phones (Apple, Huawei, Xiaomi) on the right.]
Fig. 9. Device verification performance on the UC dataset: (left) grouped by the number of devices that form the non-target class; (right) grouped by phone brand when the non-target class is trained with same-brand phones only.

V. CONCLUSIONS AND FUTURE WORK

Our study investigated a novel construction of spectral-domain audio features in the context of audio forensics. We focused on the problem of source attribution, which involves either identification or verification of the recording device. During our evaluation, the proposed band energy difference descriptor clearly outperformed state-of-the-art audio forensic feature extractors. We observed the greatest improvement in the device verification setting for fixed low false positive rates, which is a key requirement in practical applications.

In our future work, we plan to address the robustness of the proposed descriptor. Our ultimate goal is to develop a reliable training and verification protocol, capable of trustworthy authentication with a single trained model. To address this issue, however, it is crucial to utilize large datasets with a broad range of recording conditions; such datasets are currently not available in the research community.

Our current evaluation was performed on two new datasets, created specifically for this study. Both datasets are significantly larger than their counterparts in previous studies [11, 17, 18] and contain 31 and 141 devices, respectively. While these datasets already constitute a major step forward with respect to the existing literature, we acknowledge that even larger datasets will be necessary for reliable large-scale evaluation.

VI. ACKNOWLEDGMENTS

We would like to thank Dr. C. Kraetzer for providing us with the source code of their forensic feature extractor.

APPENDIX A
IMPACT OF THE WINDOW FUNCTION

In our experiments, we used the popular Hamming window. We have also experimented with other window functions, but observed negligible differences in performance. Below, we report the results obtained with three other window functions: Hanning, Chebyshev, and Gaussian windows. For each window, we performed a complete evaluation for both the device identification and device verification scenarios. The experiment was repeated 20 times with randomized division into training and testing sets. The obtained results are collected in Table IV. The numbers are reported relative to the default choice of the Hamming window. The observed differences are negligible, which shows that the proposed descriptor is not affected by the window function.

TABLE IV
AVERAGE CLASSIFICATION ACCURACY DIFFERENCE WITH RESPECT TO THE DEFAULT HAMMING WINDOW.

              Device identification [%]    Device verification [%]
Window name   UC dataset    CC dataset     UC dataset    CC dataset
Hanning       -0.14         +0.27          +0.15         +0.31
Chebyshev     -0.09         +0.29          +0.12         +0.18
Gaussian      -0.03         +0.13          +0.06         +0.14

APPENDIX B
CROSS-ENVIRONMENT TRAINING

The results reported in the main body of the paper were obtained with cross-speaker training, which involved training a classifier on speech samples of 3 speakers, and testing on the remaining 4th speaker, the target speaker. We consider such an approach the most appropriate since: (a) recordings with multiple speakers should ideally be handled by a single trained model; and (b) it is often more practical to replicate the recording environment than the speaker (who may not be available, or may decline to participate).

For the sake of completeness, in this appendix we report the obtained device identification results (on the CC dataset) for an analogous cross-environment approach. The results (Table V) confirm our observations from Section IV-C on descriptor generalization, which revealed certain similarities between the locations Office 1 and Hall, and distinct characteristics of the Platform environment. The results are consistent for all speakers. Although we observed some loss in the descriptor's discrimination capability, even the worst observed accuracies (≈62%) are significantly greater than random chance (≈3.2%).

TABLE V
COMPARISON OF DEVICE IDENTIFICATION PERFORMANCE WITH CROSS-ENVIRONMENT TRAINING

Identification accuracy [%] per target environment

Speaker   Office 1       Office 2       Hall           Platform
Proposed BED descriptor
M1        92.14 ± 0.07   77.42 ± 0.15   92.95 ± 0.08   61.74 ± 0.28
F1        95.68 ± 0.26   74.48 ± 0.10   88.92 ± 0.37   63.88 ± 0.21
M2        92.51 ± 0.13   74.13 ± 0.23   92.09 ± 0.16   68.27 ± 0.17
F2        94.67 ± 0.19   78.66 ± 0.12   92.19 ± 0.20   62.45 ± 0.39
REFERENCES

[1] S. Bayram, H. T. Sencar, N. Memon, and I. Avcibas, "Source camera identification based on CFA interpolation," in Proc. of IEEE Int. Conf. Image Processing, 2005, pp. III-69-72.
[2] K. S. Choi, E. Y. Lam, and K. Y. Wong, "Automatic source camera identification using the intrinsic lens radial distortion," Optics Express, vol. 14, no. 24, pp. 11551-11565, 2006.
[3] J. Lukáš, J. Fridrich, and M. Goljan, "Digital camera identification from sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205-214, 2006.
[4] A. E. Dirik, H. T. Sencar, and N. Memon, "Source camera identification based on sensor dust characteristics," in IEEE Workshop on Signal Processing Applications for Public Security and Forensics, 2007.
[5] C.-T. Li, "Source camera identification using enhanced sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 280-287, 2010.
[6] X. Kang, Y. Li, Z. Qu, and J. Huang, "Enhancing source camera identification performance with a camera reference phase sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 393-402, 2012.
[7] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš, "Determining image origin and integrity using sensor noise," IEEE Transactions on Information Forensics and Security, vol. 3, no. 1, pp. 74-90, 2008.
[8] S. Bayram, H. Sencar, and N. Memon, "Efficient techniques for sensor fingerprint matching in large image and video databases," in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2010, p. 754109.
[9] N. Khanna, A. K. Mikkilineni, G. T. C. Chiu, J. P. Allebach, and E. J. Delp, "Scanner identification using sensor pattern noise," in Proc. SPIE 6505, Security, Steganography, and Watermarking of Multimedia Contents IX, Feb. 2007, p. 65051K.
[10] N. Khanna, A. K. Mikkilineni, and E. J. Delp, "Scanner identification using feature-based processing and analysis," IEEE Transactions on Information Forensics and Security, vol. 4, no. 1, pp. 123-139, 2009.
[11] C. Kraetzer and J. Dittmann, "Microphone forensics," in Handbook of Digital Forensics of Multimedia Data and Devices, A. Ho and S. Li, Eds. John Wiley & Sons, Ltd, 2015.
[12] F. Marra, G. Poggi, C. Sansone, and L. Verdoliva, "A study of co-occurrence based local features for camera model identification," Multimedia Tools and Applications, pp. 1-17, 2016.
[13] C. Grigoras, "Digital audio recording analysis: the electric network frequency criterion," International Journal of Speech Language and the Law, vol. 12, no. 1, pp. 63-76, 2005.
[14] M. Goljan, J. Fridrich, and T. Filler, "Large scale test of sensor fingerprint camera identification," in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2009, p. 72540I.
[15] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, "Digital audio forensics: A first practical evaluation on microphone and environment classification," in Proc. of the Workshop on Multimedia and Security, Dallas, Texas, USA, 2007, pp. 63-74.
[16] C. Kraetzer, M. Schott, and J. Dittmann, "Unweighted fusion in microphone forensics using a decision tree and linear logistic regression models," in ACM Workshop on Multimedia and Security, 2009, pp. 49-56.
[17] C. Krätzer, Statistical pattern recognition for audio-forensics: empirical investigations on the application scenarios audio steganalysis and microphone forensics, Ph.D. thesis, Universität Magdeburg, 2013.
[18] C. Kraetzer, K. Qian, M. Schott, and J. Dittmann, "A context model for microphone forensics and its application in evaluations," in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2011, p. 78800P.
[19] C. Hanilçi, F. Ertaş, T. Ertaş, and Ö. Eskidere, "Recognition of brand and models of cell-phones from recorded speech signals," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 625-634, 2012.
[20] C. Hanilçi and F. Ertaş, "Optimizing acoustic features for source cell-phone recognition using speech signals," in ACM Workshop on Information Hiding and Multimedia Security, 2013, pp. 141-148.
[21] D. Garcia-Romero and C. Espy-Wilson, "Automatic acquisition device identification from speech recordings," in Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2010, pp. 1806-1809.
[22] C. Kotropoulos and S. Samaras, "Mobile phone identification using recorded speech signals," in IEEE Int. Conf. Digital Signal Processing, 2014, pp. 586-591.
[23] L. Zou, Q. He, and X. Feng, "Cell phone verification from speech recordings using sparse representation," in Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2015, pp. 1787-1791.
[24] Ö. Eskidere, "Source microphone identification from speech recordings based on a Gaussian mixture model," Turkish Journal of Electrical Engineering & Computer Sciences, vol. 22, no. 3, pp. 754-767, 2014.
[25] Y. Panagakis and C. Kotropoulos, "Automatic telephone handset identification by sparse representation of random spectral features," in ACM Proceedings on Multimedia and Security, 2012, pp. 91-96.
[26] Y. Panagakis and C. Kotropoulos, "Telephone handset identification by feature selection and sparse representations," in IEEE Int. Workshop on Information Forensics and Security, 2012, pp. 73-78.
[27] Y. Panagakis and C. Kotropoulos, "Telephone handset identification by collaborative representations," Int. Journal of Digital Crime and Forensics, vol. 5, no. 4, pp. 1-14, 2013.
[28] M. Jahanirad, A. Wahab, N. B. Anuar, M. Idris, and M. N. Ayub, "Blind source mobile device identification based on recorded call," Engineering Applications of Artificial Intelligence, vol. 36, pp. 320-331, 2014.
[29] M. Jahanirad, N. B. Anuar, and A. Wahab, "Blind source computer device identification from recorded VoIP calls for forensic investigation," Forensic Science International, vol. 272, pp. 111-126, 2017.
[30] L. Cuccovillo, S. Mann, M. Tagliasacchi, and P. Aichroth, "Audio tampering detection via microphone classification," in IEEE Int. Workshop on Multimedia Signal Processing, 2013, pp. 177-182.
[31] L. Cuccovillo, S. Mann, P. Aichroth, M. Tagliasacchi, and C. Dittmar, "Blind microphone analysis and stable tone phase analysis for audio tampering detection," in International AES Convention, NY, USA, Oct. 2013.
[32] L. Cuccovillo and P. Aichroth, "Open-set microphone classification via blind channel analysis," in IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2016, pp. 2074-2078.
[33] R. Buchholz, C. Kraetzer, and J. Dittmann, "Microphone classification using Fourier coefficients," in Proc. of the International Workshop on Information Hiding, Darmstadt, Germany, June 2009.
[34] H. Malik and J. Miller, "Microphone identification using higher-order statistics," in Proc. AES Int. Conf. Audio Forensics, 2012, pp. 2-5.
[35] Ö. Eskidere, "Identifying acquisition devices from recorded speech signals using wavelet-based features," Turkish Journal of Electrical Engineering & Computer Sciences, vol. 24, no. 3, pp. 1942-1954, 2016.
[36] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
[37] J. Kodovsky, J. Fridrich, and V. Holub, "Ensemble classifiers for steganalysis of digital media," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 432-444, 2012.