Available online at www.sciencedirect.com

Speech Communication 76 (2016) 82–92

www.elsevier.com/locate/specom

Modeling F0 trajectories in hierarchically structured deep neural networks


Xiang Yin a,b, Ming Lei b, Yao Qian b, Frank K. Soong b, Lei He b, Zhen-Hua Ling a,*, Li-Rong Dai a

a National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui 230027, PR China
b Microsoft, Beijing, China
Received 31 March 2015; received in revised form 21 October 2015; accepted 29 October 2015
Available online 28 November 2015

Abstract
This paper investigates F0 modeling of speech in deep neural networks (DNN) for statistical parametric speech synthesis (SPSS). Recently,
DNN has been applied to the acoustic modeling of SPSS and has shown good performance in characterizing complex dependencies between
contextual features and acoustic observations. However, the additive nature and long-term suprasegmental property of F0 features have not been fully exploited in existing DNN-based SPSS. Two model structures, a cascade DNN and a parallel DNN, are proposed to embody the hierarchical and additive properties of F0 in DNN-based prosody modeling. In the cascade structure, the DNN-predicted F0 contours of higher
levels are used as input to the DNN of the current level. In the parallel structure, F0 components corresponding to different prosody levels are
separately generated by DNNs and added together to obtain the final F0 contour. An optimized discrete cosine transform (DCT) is used to extract
long-term F0 features at syllable, word, and phrase levels. The experimental results show that our approach yields better subjective performance
than either the conventional HMM or DNN approaches. Among all compared systems, the parallel DNN achieves the best objective and subjective
performance.
© 2015 Elsevier B.V. All rights reserved.
Keywords: Speech synthesis; Hidden Markov model; Fundamental frequency; Deep neural network; Discrete cosine transform.

1. Introduction
In recent years, the hidden Markov model (HMM) based
statistical parametric speech synthesis (SPSS) method has become the mainstream in speech synthesis (Tokuda et al., 2000;
Yoshimura et al., 1999). In HMM training, acoustic features
of spectrum, F0 and segment duration are modeled simultaneously in a unified HMM framework. In synthesis, for any given
text, acoustic feature trajectories are predicted by HMMs via
the maximum likelihood parameter generation (MLPG) algorithm (Tokuda et al., 1995). Finally, the generated trajectories
are fed into a vocoder to synthesize output speech waveforms.
This approach can synthesize highly intelligible and relatively
smooth speech (Ling et al., 2006; Zen et al., 2007a). However,


the synthesized speech still carries some machine-like quality, particularly in its prosody. Also, the generated spectral and F0 features tend to be overly smoothed, resulting in a somewhat
muffled voice and not very lively prosody (Zen et al., 2009).
Research work has been done to alleviate this smoothing effect
via some new acoustic-modeling and parameter-generation algorithms, e.g. incorporating a global variance (GV) model into
the parameter generation (Toda and Tokuda, 2005); reformulating HMM as a trajectory model (Zen et al., 2007b); acoustic
modeling via a restricted Boltzmann machine (RBM) or deep
belief networks (DBN) (Ling et al., 2013); and minimum generation error (MGE) based model training (Wu et al., 2007). These
methods can enhance the naturalness of synthesized speech, but
mainly at the spectral or the segmental level.
In conventional HMM-based speech synthesis, spectral and F0 features are modeled as context-dependent phone HMMs with frame-wise observations. However, F0 has a longer context dependency than the spectral features, which can be fairly well modeled with the contexts of neighboring states/phones. As a
result, the F0 trajectory should be modeled in its own time or

context scale. It has also long been conjectured that F0 trajectories exhibit a hierarchical and additive nature (Fujisaki, 2008).
As a result, F0 contours can be decomposed into several components, and each one is controlled by its own prosodic unit
descriptors. Also, the long-term effects of F0 in an utterance suggest that suprasegmental features, such as tones of syllables in a tonal language like Mandarin, stresses of words, and intonation patterns of phrases, are encoded hierarchically (Wightman et al.,
1992). The F0 contour shapes of speech segments which are
longer than a frame or phone, i.e., syllables, words, and phrases,
play critical roles in human perception of speech prosody.
However, the conventional HMM-based speech synthesis method does not adequately consider these long-term effects in both the production and the perception of speech in F0 modeling. Quite a few research attempts have been made to address this issue. First, improved model topologies have been proposed to reflect the hierarchical and additive nature of F0 generation (Lei et al., 2010; Qian et al., 2008; Wu and Soong, 2012; Zen and Braunschweiler, 2009). In Qian et al. (2008), an additive decision tree-based F0 modeling method was proposed in which a group of decision trees were built sequentially to minimize the prediction residual. In Zen and Braunschweiler (2009), F0 features were modeled by decision trees that were built iteratively in a unified approach with different contextual feature sets at different prosodic levels. In Lei et al. (2010) and Wu and Soong (2012), F0 observations were decomposed into several components according to their prosodic levels (i.e., states, phones, syllables, and words), which were modeled separately using the contextual features related to each prosodic level; the model parameters were then optimized simultaneously under the MGE training criterion. Second, long-term F0 features have also been proposed and
integrated into F0 modeling and prediction (Qian et al., 2011;
Teutenberg et al., 2008; Wang et al., 2008; Wu et al., 2008). In Wang et al. (2008), the mean of log F0 of voiced frames within
each syllable was extracted and modeled context-dependently. In
synthesis, the F0 contour was generated by maximizing the combined likelihood functions of the original model at a frame-scale
and the new model at syllable-scale. Discrete cosine transform
(DCT) coefficients were proposed to represent the long-term F0
patterns (Latorre and Akamine, 2008; Obin et al., 2011; Qian
et al., 2011; Teutenberg et al., 2008; Wu et al., 2008). In Qian et al. (2011), DCT coefficients were used to parameterize the F0
contours of syllables and phrases. Models were trained for these
DCT coefficients and combined with the original frame-level
model in parameter generation.
Recently, neural network-based approaches have been proposed where contextual features are mapped to the acoustic features directly, either in a deep neural network (DNN) (Qian et al.,
2014; Zen et al., 2013) or in a recurrent neural network (RNN)
(Fan et al., 2014; Fernandez et al., 2014; Zen et al., 2014). Compared with the conventional HMM-based approach, which implements context-dependent HMMs with decision tree-based clustering, a DNN models the conditional distribution of acoustic features given the contexts and can capture complex context dependencies while avoiding the intrinsic data-fragmentation problem of a large decision tree (Zen et al., 2013). However, in the proposed
DNN-based approaches, F0 and spectral features are still treated

SPEECH
SPEECH
DATABASE
DATABASE

83
Speech

Training
Speech
Speech analysis
analysis
Acoustic features

Context features

HMM
HMM training
training

Decision
tree

Question
Question set
set

Clustered
HMMs

Synthesis
TEXT

Text
Text
analysis
analysis

HMM
HMM
sequence
sequence
decision
decision

Acoustic
Acoustic
parameter
parameter
generation
generation

Parametric
Parametric
synthesizer
synthesizer

SYNTHESIZED
SPEECH

Fig. 1. Diagram of a typical HMM-based parametric speech synthesis system.

in the same time scale. We have proposed a method to model


F0 trajectory of intonation phrases (IP) by a decision tree or a
DNN with optimized DCT analysis (Yin et al., 2014). The decision tree or DNN for IP-level F0 modeling was combined with
state-level F0 models to predict final F0 contours in a two-level
hierarchical framework. Experimental results showed the effectiveness of long-term F0 modeling. However, the approach only
utilized two prosodic levels, and no joint effort was performed
to optimize the parameters of two different levels. In this paper, we investigate how to improve DNN-based F0 modeling
by considering the additive and long-term properties of F0 at
four prosodic levels. First, two model structures, cascade DNN
and parallel DNN, are proposed to embody the hierarchical and
additive nature of F0 generation. The cascade structure uses the
F0 contours predicted by the DNN of the higher prosodic level
as input to the DNN of the current prosodic level. In the parallel
structure, F0 components of each prosodic level are individually
predicted from separate DNNs with the contextual features of
corresponding levels. Then, these predicted F0 components are
added together to form the final F0 contour. In addition, an improved DCT analysis (Yin et al., 2014) is adopted to extract F0
vectors for syllable, word, and phrase-scale speech segments.
These F0 vectors are used as the target features of DNN prediction in both cascade and parallel structures.
The rest of the paper is organized as follows. Section 2 reviews the conventional HMM-based speech synthesis. Section 3
briefly introduces DNN-based speech synthesis. Section 4 describes the details of our proposed new approach. The experimental setups and results are shown in Section 5. Conclusions
are given in Section 6.
2. HMM-based parametric speech synthesis
Fig. 1 illustrates the block diagram of a typical HMM-based
speech synthesis system. In training, acoustic features of F0
and spectral parameters in D dimensions are first extracted from
the speech database. An acoustic observation feature sequence $\mathbf{o} = [\mathbf{o}_1^\top, \mathbf{o}_2^\top, \ldots, \mathbf{o}_T^\top]^\top$ is first constructed from the extracted parameters, where $(\cdot)^\top$ denotes matrix transpose and $T$ is the total number of frames. The observed feature vector $\mathbf{o}_t \in \mathbb{R}^{3D}$ of the $t$-th frame consists of the static acoustic parameters $\mathbf{c}_t \in \mathbb{R}^D$ and the corresponding delta and delta-delta components as

$$\mathbf{o}_t = [\mathbf{c}_t^\top, \Delta\mathbf{c}_t^\top, \Delta^2\mathbf{c}_t^\top]^\top. \qquad (1)$$

The feature sequence $\mathbf{o}$ can be rewritten as

$$\mathbf{o} = \mathbf{W}\mathbf{c}, \qquad (2)$$

where $\mathbf{c} = [\mathbf{c}_1^\top, \mathbf{c}_2^\top, \ldots, \mathbf{c}_T^\top]^\top$ denotes the static feature sequence,
and $\mathbf{W}$ is the linear transformation matrix that computes the dynamic components (Tokuda et al., 2000). Next, a set of HMMs $\lambda$ for context-dependent phones is estimated by maximizing the likelihood function $p(\mathbf{o}\,|\,\lambda)$ over all training data. A decision
tree-based model clustering technique is applied to address the
data-sparsity problem and to determine the model parameters of
unseen contextual features in synthesis. The size of the decision
trees is controlled by the minimum description length (MDL)
criterion (Shinoda and Watanabe, 2000). Because F0 only exists
in voiced speech, a multi-space probability distribution (MSD)
HMM (Tokuda et al., 1999) is applied to characterize the distribution for F0 in a rigorous probabilistic framework. It builds
a continuous probability space in Gaussian distributions of F0
values for voiced frames and a discrete probability space for
unvoiced frames.
In synthesis, the front-end text analysis is used both to generate the contextual features for each phone in the sentence and to determine the sentence HMM $\lambda$. A state sequence $\mathbf{q} = \{q_1, q_2, \ldots, q_T\}$ is predicted using both the contextual features and the trained duration probabilities (Yoshimura et al., 1999). Next, the spectral and F0 features are generated by maximizing the output probability of the sentence HMM under the constraints between static and dynamic features as

$$\mathbf{c}^* = \arg\max_{\mathbf{c}} P(\mathbf{W}\mathbf{c} \mid \lambda, \mathbf{q}), \qquad (3)$$

where $\mathbf{c}^*$ is the output feature sequence. For F0 prediction, a voicing decision for each frame is made before generating F0 contours for voiced segments using (3) (Tokuda et al., 1999). Finally, the generated acoustic features are sent to a vocoder, the STRAIGHT synthesizer (Kawahara et al., 1999), to synthesize the final speech waveforms.
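As a concrete illustration of Eq. (3), under Gaussian state-output distributions the MLPG solution has a well-known closed form, $\mathbf{c}^* = (\mathbf{W}^\top\mathbf{\Sigma}^{-1}\mathbf{W})^{-1}\mathbf{W}^\top\mathbf{\Sigma}^{-1}\boldsymbol{\mu}$. The following is a minimal NumPy sketch of that closed form, not the authors' implementation; the variable names and the diagonal-covariance assumption are ours.

```python
import numpy as np

def mlpg(mu, var, W):
    """Minimal sketch of maximum-likelihood parameter generation (Eq. (3)).

    mu  : (3TD,) means of the static/delta/delta-delta features along the
          predicted state sequence
    var : (3TD,) the corresponding diagonal (co)variances
    W   : (3TD, TD) matrix appending dynamic features, so that o = W c

    Returns c* = (W' S^-1 W)^-1 W' S^-1 mu, the maximizer of N(W c; mu, S)
    for a diagonal covariance S.
    """
    P = W.T / var                        # equals W' S^-1 when S is diagonal
    return np.linalg.solve(P @ W, P @ mu)
```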
3. DNN-based parametric speech synthesis
DNN-based parametric speech synthesis has recently been
proposed (Qian et al., 2014; Zen et al., 2013) to improve the
conventional HMM-based approach. Fig. 2 shows the framework of this approach. In training, acoustic features, including
the F0 and spectral features with their velocity and acceleration
components, are extracted as in the HMM-based method. Missing F0 values in unvoiced frames are interpolated (Chen et al.,
1997) between voiced frames to preserve the continuity of F0
in modeling and a binary voicing flag is added to the acoustic
feature vector at each frame. Let $\mathbf{y}_t$ denote the $N$-dimensional acoustic feature vector of the $t$-th frame. A contextual feature vector $\mathbf{x}_t$ of $M$ dimensions is also constructed for the $t$-th frame from the extracted contextual features. Each context vector includes binary elements $\mathbf{b}_t$ for categorical linguistic features (e.g., the identity of the current phone) and numeric elements $\mathbf{n}_t$ for numerical

linguistic contexts (e.g., the number of syllables in the current word) as

$$\mathbf{x}_t = [\mathbf{b}_t^\top, \mathbf{n}_t^\top]^\top. \qquad (4)$$

Fig. 2. Framework of DNN-based parametric speech synthesis.
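As a toy illustration of Eq. (4), the snippet below assembles one frame's context vector from a one-hot encoding of a categorical feature and a few numeric features. The phone inventory and the chosen features are hypothetical placeholders, not the full feature set of this paper.

```python
import numpy as np

PHONES = ["sil", "aa", "ae", "b", "d"]          # toy phone inventory (assumption)

def context_vector(phone, n_syls_in_word, frame_pos_in_phone):
    b = np.zeros(len(PHONES))                   # binary part b_t (categorical contexts)
    b[PHONES.index(phone)] = 1.0                # e.g., identity of the current phone
    n = np.array([n_syls_in_word,               # numeric part n_t (numerical contexts)
                  frame_pos_in_phone])
    return np.concatenate([b, n])               # x_t = [b_t, n_t]

x_t = context_vector("ae", n_syls_in_word=2, frame_pos_in_phone=0.4)
```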

Next, DNN weights are trained with pairs of contextual feature


vectors and acoustic feature vectors as model input and output,
respectively, under a minimum mean square error (MMSE) criterion.
In synthesis, acoustic features are predicted by the trained
DNN from the input contextual features obtained from the frontend text analysis and are then set as mean vectors of Gaussian distributions. Along with pre-computed covariance matrices from all training data, the parameter generation algorithm in
(3) is applied to generate the static acoustic feature sequence. Finally, a vocoder such as the one used in the HMM-based method
is used to synthesize waveforms from the generated acoustic features.
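The PyTorch sketch below shows the shape of such a system: sigmoid hidden layers, a linear output layer, and MMSE training with mini-batch SGD. It is a schematic reconstruction rather than the authors' code; the layer sizes echo the DNN-Baseline configuration reported later in Section 5.2.

```python
import torch
from torch import nn

M, N, H = 355, 4, 1400                      # dims follow the DNN-Baseline in Section 5.2

dnn = nn.Sequential(
    nn.Linear(M, H), nn.Sigmoid(),          # sigmoid hidden layers
    nn.Linear(H, H), nn.Sigmoid(),
    nn.Linear(H, H), nn.Sigmoid(),
    nn.Linear(H, N),                        # linear output layer
)
optimizer = torch.optim.SGD(dnn.parameters(), lr=0.01)
mmse = nn.MSELoss()                         # the MMSE training criterion

def train_step(x, y):                       # x: (B, M) contexts, y: (B, N) acoustics
    optimizer.zero_grad()
    loss = mmse(dnn(x), y)
    loss.backward()                         # back-propagation
    optimizer.step()                        # mini-batch SGD update
    return loss.item()
```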
This DNN-based speech synthesis approach could address
the limitations of the conventional decision tree-clustered HMM-based approach, such as ignoring some weak contextual features, assuming independence among feature dimensions, and splitting data in decision tree-based clustering (Zen et al., 2013). Although significant improvement is achieved by the use of DNNs
for TTS, DNN-based speech synthesis has limitations that are
similar to those of HMM-based speech synthesis. In particular,
the additive nature and long-term property of F0 features are not
fully exploited in the frame-scale F0 modeling.
4. DNN-based hierarchical F0 modeling
4.1. Basic ideas
This paper investigates DNN-based hierarchical F0 modeling
for SPSS. Considering the additive nature of F0 generation, a
group of DNNs is utilized to describe the contributions made by the prosodic contexts to the observed F0 contour. A prosodic
structure of four levels, i.e., intonation phrase (IP), word, syllable, and phone, is adopted for F0 modeling of English speech.
Four sets of contextual features are designed corresponding to

each prosodic level, as listed in Table 1, based on Lei et al. (2010), where IP-set, W-set, S-set, and P-set denote the sets of contextual features at the IP, word, syllable, and phone levels, respectively.

Table 1
Four sets of contextual features used for hierarchical F0 modeling of English speech in this paper. IP-set, W-set, S-set, and P-set are the sets of contextual features for the IP level, word level, syllable level, and phone level, respectively.

Name    Contextual features
IP-set  Type of sentence
        Position of IP in sentence
        Final boundary tone of ToBI in IP
        Number of syllables, words in IP
        Number of syllables, words, IPs in sentence
        Duration of IP and sentence
W-set   Features in IP-set
        POS of word
        Break index of word
        Position of word in IP
        Final boundary tone of ToBI in word
        Number of syllables in word
        Number of syllables to high or low tone
        Number of syllables to left or right bound
        Duration of word
S-set   Features in W-set
        Accent of syllable
        Stress of syllable
        Position of syllable in word, IP
        Number of phones in syllable
        Duration of syllable
P-set   Features in S-set
        Identity of phone
        Position of phone in syllable
        Durations of phone and state
        Position of frame in phone
Each set of a given level comprises the contextual features of
all higher prosodic levels, i.e., S-set contains W-set, which in turn contains IP-set. As a result, P-set is equivalent to the full set of contextual features used in conventional HMM- or DNN-based speech synthesis. Considering the long-term property of F0 perception, F0 vectors of fixed dimensions extracted by DCT analysis are adopted as the targets of DNN prediction for prosodic levels higher than the phone; the details will be
cascade and parallel, are proposed and compared.

4.2. DCT-based F0 vector extraction and contour recovery


Because frame observations cannot effectively represent
long-term F0 patterns, F0 contours composed of multiple frames
are used as observations for training DNNs beyond phone-level
in our proposed cascade and parallel structures. However, it is difficult to model and predict F0 contours by DNNs directly for two reasons: first, F0 contours at each prosodic scale have variable lengths, while the output layer of a DNN has a fixed dimension; second, F0 contours are discontinuous due to unvoiced frames. Therefore, we propose to convert F0 contours into fixed-dimension vectors using a fixed number of DCT coefficients. An F0 vector is then a length-normalized vector which can represent a continuous F0 contour of variable length.
The F0 vector extraction module in our proposed method converts an IP-scale, word-scale, or syllable-scale F0 contour into an F0 vector with a fixed dimension $D_{IP}$, $D_W$, or $D_S$. First, an F0 contour containing possible unvoiced frames is parameterized by $N$-order DCT coefficients. Then, an F0 vector of $D_{IP}$, $D_W$, or $D_S$ dimensions is calculated from these DCT coefficients by inverse DCT. After these two steps, F0 contours are converted into F0 vectors which are continuous and have fixed dimensions for each prosodic scale. The order $N$ of the DCT analysis determines the degree of detail represented by the F0 vectors. In the F0 contour recovery module, an F0 vector is first parameterized by DCT coefficients and then converted back to an F0 contour with a given number of frames using inverse DCT.
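A minimal sketch of these two modules is given below, assuming the inverse-DCT basis of Eq. (6) that follows; the DCT fit here uses all frames, while the voiced-frame-only fit of Eqs. (5)-(7) is sketched at the end of this subsection.

```python
import numpy as np

def dct_basis(T, N):
    """Rows D(t) of the inverse-DCT basis for a length-T contour, N coefficients."""
    t = np.arange(T)[:, None]                     # frame indices 0..T-1
    n = np.arange(N)[None, :]                     # DCT orders 0..N-1
    return np.cos(np.pi * n * (t + 0.5) / T)      # shape (T, N)

def vector_from_contour(contour, N, D):
    """F0 vector extraction: contour -> N DCT coefficients -> D-dim F0 vector."""
    e, *_ = np.linalg.lstsq(dct_basis(len(contour), N), contour, rcond=None)
    return dct_basis(D, N) @ e                    # length-normalized F0 vector

def contour_from_vector(f0_vec, N, T):
    """F0 contour recovery: F0 vector -> N DCT coefficients -> T-frame contour."""
    e, *_ = np.linalg.lstsq(dct_basis(len(f0_vec), N), f0_vec, rcond=None)
    return dct_basis(T, N) @ e
```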
To alleviate the influence of unvoiced frames in estimating
DCT coefficients, an optimized DCT analysis has been proposed
in our previous work where DCT coefficients of an F0 contour are calculated by minimizing the sum of the square errors
between the original and reconstructed F0 contours, but computed only at voiced frames (Yin et al., 2014). Assuming that
$\mathbf{s} = [s_0, s_1, \ldots, s_{T-1}]^\top$ denotes an F0 contour with $T$ frames, its $N$-order DCT coefficients $\mathbf{e} = [e_0, e_1, \ldots, e_{N-1}]^\top$ are calculated as

$$\mathbf{e}^* = \arg\min_{\mathbf{e}} \sum_{t \in V} (s_t - \hat{s}_t)^2 = \arg\min_{\mathbf{e}} \sum_{t \in V} \big(s_t - \mathbf{D}(t)\,\mathbf{e}\big)^2, \qquad (5)$$

where $V$ denotes the set of frame indices of voiced frames in $\mathbf{s}$; $\hat{\mathbf{s}} = [\hat{s}_0, \hat{s}_1, \ldots, \hat{s}_{T-1}]^\top$ denotes the F0 contour reconstructed from the DCT coefficients $\mathbf{e}$; and

$$\mathbf{D}(t) = \Big[\,1,\ \cos\Big(\frac{\pi}{T}\big(t+\tfrac{1}{2}\big)\Big),\ \ldots,\ \cos\Big(\frac{\pi}{T}(N-1)\big(t+\tfrac{1}{2}\big)\Big)\Big] \qquad (6)$$

is the row vector for calculating the standard inverse DCT at the $t$-th frame (Makhoul, 1980). Eq. (5) can be solved by setting $\partial \sum_{t \in V} (s_t - \hat{s}_t)^2 / \partial \mathbf{e} = \mathbf{0}$, whose solution is

$$\mathbf{e}^* = \Big(\sum_{t \in V} \mathbf{D}(t)^\top \mathbf{D}(t)\Big)^{-1} \sum_{t \in V} \mathbf{D}(t)^\top s_t. \qquad (7)$$

This optimized DCT analysis is adopted in this paper for


computing forward DCT during F0 vector extraction and contour
recovery. For inverse DCT, the standard formula by Makhoul
(1980) is used.
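The optimized analysis of Eqs. (5)-(7) amounts to a least-squares fit restricted to voiced frames. A sketch under our assumptions, reusing dct_basis from the snippet earlier in this subsection:

```python
def optimized_dct(s, voiced, N):
    """Eqs. (5)-(7): fit N DCT coefficients to contour s on voiced frames only.

    s      : (T,) F0 contour (values at unvoiced frames are ignored)
    voiced : (T,) boolean mask of the voiced-frame set V
    """
    Dv = dct_basis(len(s), N)[voiced]        # rows D(t) for t in V
    # lstsq solves the normal equations e* = (sum D'D)^-1 sum D's of Eq. (7)
    e, *_ = np.linalg.lstsq(Dv, s[voiced], rcond=None)
    return e

# Toy usage (the contour and voicing pattern below are synthetic assumptions):
T = 200
s = 200 + 30 * np.cos(2 * np.pi * np.arange(T) / T)   # toy F0 contour in Hz
voiced = np.ones(T, dtype=bool); voiced[60:80] = False
e = optimized_dct(s, voiced, N=5)
f0_vec = dct_basis(225, 5) @ e               # fixed-dimension IP-scale F0 vector
```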
4.3. F0 modeling using cascade DNNs
The motivation for developing this cascade structure is to
gradually introduce the contextual features corresponding to
different prosodic levels into F0 prediction by concatenating
a group of DNN predictors. The F0 contours predicted by the
DNN of the upper prosodic level are used as the input features for
the DNN of the current level. Compared with the conventional DNN-based modeling method set forth in Fig. 2, this structure is expected to better reflect the contribution of higher prosodic levels to the process of F0 generation.

Fig. 3. Training process of hierarchical F0 modeling using a cascade DNN.

Fig. 4. Generation process of hierarchical F0 modeling in a cascade DNN. IP, W, S, and P stand for intonation phrase, word, syllable, and phone, respectively.
Fig. 3 depicts the training process for the proposed hierarchical DNN F0 modeling in a cascade form. First, the training process for conventional HMM-based speech synthesis is
conducted to obtain the phonetic segmentation of training data.
Then, the four DNNs are trained successively from the highest
to the lowest prosodic levels. Taking the word-level DNN as an example, the word-scale F0 vectors extracted from the F0 features of the training data, i.e., the observed word-scale F0 vectors in Fig. 3, are adopted as the regression targets of the word-level DNN. For each word in the training set, the W-set and the initial word-scale F0 vector derived from the predicted IP-scale F0 vector comprise the input features of the word-level DNN. Given
the paired input and target features, the model parameters of the
word-level DNN are estimated by back-propagation (BP) training with a mini-batch-based stochastic gradient descent (SGD)
algorithm. Furthermore, word-scale F0 vectors are predicted using the estimated model to obtain the input features for training
the syllable-level DNN at the next lower level.
F0 generation via the trained cascade DNNs is shown in
Fig. 4. Given the text of a sentence to be synthesized, the IP-set
for each IP, the W-set for each word, the S-set for each syllable,
and the P-set for each frame can be derived using the results
of text analysis and phone duration prediction according to the
definitions in Table 1. The IP-set of each IP is first fed into
the trained IP-level DNN to generate an IP-scale F0 vector for
the current IP. This F0 vector is of a fixed dimension and is
then renormalized (i.e., stretched or contracted) to the correct
length (number of frames) of the training data in the current IP
by the optimized DCT and inverse DCT. After segmenting the
IP-scale F0 contour into words, an initial word-scale F0 vector
is extracted for each word and then combined with the W-set
to compose the inputs of the word-level DNN. This procedure is conducted successively through the word, syllable, and phone levels to produce the final F0 contour.
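The loop below sketches one cascade step (IP level to word level) of this generation pass under our assumptions; ip_dnn and w_dnn stand in for the trained DNNs' forward functions, and vector_from_contour and contour_from_vector are the modules sketched in Section 4.2.

```python
import numpy as np

def cascade_ip_to_words(ip_dnn, w_dnn, ip_set, w_sets, word_lens, N=5, D_W=70):
    """One cascade generation step (Fig. 4), IP level -> word level (sketch).

    ip_set    : context features of the current IP
    w_sets    : list of W-set context vectors, one per word in the IP
    word_lens : predicted frame counts of the words (duration model output)
    """
    ip_vec = ip_dnn(ip_set)                          # predicted IP-scale F0 vector
    T = sum(word_lens)
    ip_contour = contour_from_vector(ip_vec, N, T)   # stretch to the IP's length
    word_vecs, start = [], 0
    for w_set, L in zip(w_sets, word_lens):
        seg = ip_contour[start:start + L]            # segment the IP contour by word
        start += L
        init_vec = vector_from_contour(seg, N, D_W)  # initial word-scale F0 vector
        word_vecs.append(w_dnn(np.concatenate([w_set, init_vec])))
    return word_vecs                                 # inputs for the syllable level
```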

Fig. 5. Training process within one iteration of hierarchical F0 modeling using parallel DNNs: (a) IP, word, and syllable levels; (b) phone level.

4.4. F0 modeling using parallel DNNs


To explicitly represent the additive process of F0 generation, a parallel DNN structure is investigated in this paper. The original F0 contours are decomposed into components corresponding to the prosodic levels; the DNN of each level is trained iteratively, and the components predicted at each level are added together to obtain the final F0 contour. Because the generation process runs in parallel, this structure will be referred to as a parallel DNN structure.

In training a parallel structure, the four DNNs are trained iteratively. Within each iteration, the model parameters of the IP-level, word-level, syllable-level, and phone-level DNNs are successively updated. Fig. 5 shows the flowchart of model training for each individual DNN in one iteration.

Input features of each DNN correspond to the contextual features of each prosodic level defined in Table 1. When estimating the DNN at a particular level, the residual F0 contour is first calculated by removing, from the given natural F0 contour (in voiced frames), the contributions of all other levels. When training the IP-level DNN in the first iteration, the natural F0 contour itself is used because the other DNNs are not yet available. Next, the targets of DNN prediction are derived from the residual F0 contours. For the IP-level, word-level, and syllable-level DNNs, the output features are F0 vectors of fixed dimensions, while the output features of the phone-level DNN are frame-scale residual F0 values. After the DNN training at the current level, a generation procedure is conducted to estimate the contribution of the current level to the natural F0 contour, which is then subtracted from the natural F0 contour (in voiced frames) to derive the regression targets for training the DNNs of the lower levels. This means that the output features of the current DNN correspond to the contribution of the current level to the natural F0 contour. The F0 vector extraction and contour recovery modules were introduced in Section 4.2 and are the same as those used in the cascade DNN.

A two-step model-training strategy is adopted to accelerate the convergence of the model parameters and to improve the prediction performance of the trained models. The first few iterations focus on estimating the DNN parameters of each level separately, while the following iterations focus on estimating the inter-dependencies between the DNNs. Accordingly, in the first few iterations, hundreds of epochs are used for each DNN to serve as an initialization of the parallel DNN structure; in the following iterations, only one epoch is used for each DNN in order to estimate the inter-dependencies between the levels of the parallel DNN structure.
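A schematic of one such iteration is sketched below; train_level and predict_level are stand-ins for the BP/SGD training and forward passes, and the dictionary layout of each utterance is our own assumption.

```python
import numpy as np

LEVELS = ["IP", "W", "S", "P"]

def train_level(dnn, utts, level, epochs):      # stand-in: BP/SGD on this level's targets
    pass

def predict_level(dnn, utt, level):             # stand-in: this level's predicted contour
    return np.zeros_like(utt["natural_f0"])

def parallel_iteration(dnns, utts, epochs):
    """One iteration of residual training for the parallel DNNs (Fig. 5, sketch)."""
    for level in LEVELS:
        for utt in utts:
            # Residual = natural F0 minus all other levels' current contributions
            # (the natural contour itself for the IP level in the first iteration).
            others = sum((utt["contribution"][l] for l in LEVELS
                          if l != level and l in utt["contribution"]),
                         np.zeros_like(utt["natural_f0"]))
            utt["target"][level] = utt["natural_f0"] - others  # voiced frames in practice
        train_level(dnns[level], utts, level, epochs)
        for utt in utts:                        # refresh this level's contribution
            utt["contribution"][level] = predict_level(dnns[level], utt, level)

# Two-step schedule (values as reported later in Section 5.4):
# for _ in range(5):   parallel_iteration(dnns, utts, epochs=500)
# for _ in range(600): parallel_iteration(dnns, utts, epochs=1)
```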
F0 generation with the parallel DNNs is shown in Fig. 6. In the parallel structure, the four sets of contextual features generate four individual F0 components, which are then added together to produce the final F0 contour. The key difference between the cascade and parallel structures is that the additive process of F0 generation is not explicitly considered in the cascade structure. Furthermore, the cascade structure involves no iterative, joint training of the DNNs corresponding to the different prosodic levels. Due to the iterative model training, the computational cost of training the parallel DNNs is much higher than that of training the cascade ones.

Fig. 6. Generation procedure of hierarchical F0 modeling using parallel DNNs.

5. Experiments

5.1. Experimental conditions


In our experiments, we used a corpus recorded in a soundproof room by a female native speaker of American English. The corpus consisted of 5,100 phonetically and prosodically balanced utterances (approximately 5 h of speech) together with segmental and prosodic labels. Overall, 4,590 sentences were selected randomly as the training set and the remaining 510 sentences were reserved as the test set. The phonetic transcriptions were generated by Microsoft's TTS front end and then checked manually. The part of speech (POS) was labeled by Microsoft's automatic POS tagging tools without manual checking. Speech waveforms were recorded in 16 kHz/16-bit format. The spectral analysis was performed with a 25-ms window shifted every 5 ms. The extracted acoustic features included logarithmic F0 and 41st-order line spectral pairs (LSPs, including a gain), together with their velocity and acceleration components. F0 was extracted by the robust algorithm for pitch tracking (RAPT) (Talkin, 1995).
5.2. Baseline systems
An HMM-based SPSS system, named HMM-Baseline in our experiments, was built using the training database. In this system, a 5-state, left-to-right, no-skip structure was used to train HMMs for context-dependent phones. Each HMM state was modeled by a single Gaussian distribution with diagonal covariance. The contextual features (i.e., the P-set in Table 1) were used as the question set, and the minimum description length (MDL) criterion (Shinoda and Watanabe, 2000) was followed to build decision trees for model clustering. The HMM parameters were first trained under the maximum likelihood (ML) criterion and then refined by minimum generation error (MGE) training (Wu and Wang, 2006).
A conventional DNN-based speech synthesis system, as shown in Fig. 2, was constructed for F0 prediction and denoted DNN-Baseline. In this system, we adopted a network structure of 3 hidden layers with 1,400 nodes per layer. The input feature vector extracted from the P-set had 355 dimensions, 319 of which were binary features for categorical linguistic contexts and the rest of which were numerical features for numerical linguistic contexts. The 4-dimensional output feature vector contained a voicing flag, log F0, and its dynamic counterparts; the voicing flag was a binary indicator of the voicing status of the current frame. An exponential decay function (Chen et al., 1997) was used to interpolate the missing F0 values in unvoiced frames. Approximately 80% of the silence frames were removed from the training data, both to balance the training data and to reduce the computational cost. The input and output features of the training data were normalized to zero mean and unit variance. A sigmoid activation function was used for the hidden layers and a linear function was used for the output
layer. The weights were trained using a back-propagation (BP) procedure with a mini-batch-based stochastic gradient descent
(SGD) algorithm. The entire modeling process was similar to that of Qian et al. (2014); the difference was that the spectral
parameters and the gain were not included in the output features
of the DNN. In synthesis, the duration and spectral features were
predicted in the same manner as in the HMM-Baseline system.


5.3. Two-level F0 modeling system


A system named DNN-IPStt was constructed for comparison using our previous two-level hierarchical F0 modeling method with DNNs (Yin et al., 2014). In this system, the input included 10 binary and 11 numerical features derived from the IP-set; the output features were the first 5 optimized DCT coefficients of the IP. The DNN weights were initialized by layer-wise RBM pre-training and then optimized to minimize the mean squared error between the output features of the training data and the predicted values using a mini-batch, SGD-based back-propagation algorithm. Input features were normalized to zero mean and unit variance, whereas output features were normalized to the range [0.01, 0.99] based on their minimum and maximum values in the training data. These settings were chosen heuristically according to Fan et al. (2014) and Qian et al. (2014). The sigmoid activation function was used for the hidden layers and the linear function was used for the output layer. The number of hidden layers was set to two and the number of hidden nodes per layer was tuned to 1,000. The F0 residuals of the IP-level DNN prediction were then modeled by state-level context-dependent Gaussian distributions, with the P-set used as the contextual features.
5.4. System construction using proposed methods

Fig. 7. The natural F0 contour of an IP and its recovered counterparts using 5 DCT coefficients calculated by the conventional DCT analysis and the optimization-based DCT analysis.
Table 2
Configurations of the DNN-Cascade system. #In-dim and #Out-dim indicate the dimensions of the input and output features, respectively. #Layer and #Nodes indicate the number of hidden layers and the number of nodes per layer.

Level      #In-dim   #Out-dim   #Layer   #Nodes
IP-level      23       225        2       1,000
W-level      178        70        2       1,000
S-level      167        40        2       1,000
P-level      364         4        3         600

Table 3
Configurations of the DNN-Parallel system.

Level      #In-dim   #Out-dim   #Layer   #Nodes
IP-level      23       225        2       1,000
W-level      108        70        2       1,200
S-level      127        40        2       1,500
P-level      361         4        3       1,000

Two systems using our proposed DNN-based hierarchical F0 modeling methods were built for comparison:

DNN-Cascade. The F0 modeling and prediction of this system follow the method introduced in Section 4.3.
DNN-Parallel. The F0 modeling and prediction of this system follow the method introduced in Section 4.4.

As introduced in Section 4.2, F0 vectors were used as the


output features of the DNNs above phone-level in both systems.
In the F0 vector extraction module, the optimization-based DCT
analysis (Yin et al., 2014) was adopted and the first 5 DCT coefficients were used to derive the F0 vectors in our experiments.
For the IP, word, and syllable levels, the numbers of frames of the F0 vectors, $D_{IP}$, $D_W$, and $D_S$, were set to 225, 70, and 40, respectively.
Fig. 7 shows the natural F0 contour of an IP and its recovered
counterparts using 5 DCT coefficients calculated according to
the conventional DCT analysis and the optimization-based DCT
analysis. F0 interpolation at unvoiced frames is necessary before
the conventional DCT analysis. The F0 contour interpolated using an exponential decay function (Chen et al., 1997) is also
shown in Fig. 7. It is observed that the optimization-based DCT
method is more effective in describing the overall shape of the F0

contour than the conventional DCT method. For the phone level,
the frame-scale F0 was taken as the output feature and was interpolated with an exponential decay function (Chen et al., 1997).
The detailed configurations of DNN structures for both systems are shown in Tables 2 and 3, which were tuned heuristically in some preliminary experiments. The sigmoid activation
function was used for hidden layers and the linear activation
function was used for output layers. When training each DNN,
the input features were normalized to zero mean and unit variance, whereas the output features were normalized to the range
of [0.01, 0.99] based on their minimum and maximum values in
the training set. At the training stage of the DNN-Cascade system, the number of epochs used to fine-tune each DNN model was set to 500. When training the DNN-Parallel system, 500 epochs were conducted in each of the first 5 iterations, and another 600 iterations were then carried out with one epoch for each DNN
model in each iteration. Fig. 8 shows the Root Mean Square
Error (RMSE) on the training and test sets when training the DNN-Parallel system. In synthesis, the duration and spectral features
of these two systems were predicted in the same manner as the
HMM-Baseline system.

Fig. 8. Change of RMSEs on the training and test sets during (a) the first 5 iterations and (b) the following 600 iterations when training the DNN-Parallel system.

Fig. 9. RMSEs of predicted F0 contours using the four systems on the training set and test set. An 'x' indicates that the difference between two systems is insignificant; otherwise, the difference is significant.

5.5. Experimental results


Objective and subjective tests were conducted to evaluate the F0 prediction performance of the proposed methods. The test sentences were first segmented by forced alignment using the acoustic models of the HMM-Baseline system. Then, the segmental durations were used to predict F0 features with the different systems. Note that the complete set of contexts (i.e., the P-set) was provided to all systems; therefore, it was fair to compare their capacity in F0 modeling and prediction.

To evaluate the objective performance of each system, the RMSE on the test set was calculated by comparing the natural and predicted F0 values of voiced frames. A t-test was conducted to determine whether the differences between the compared systems were significant (p < 0.05). In subjective tests, preferences by human subjects were determined. In each preference test, 20 test sentences were randomly selected and synthesized using the paired systems for comparison,1 such as HMM-Baseline and DNN-Baseline. The listening tests were conducted by crowdsourcing on Amazon Mechanical Turk (https://www.mturk.com). The workers were asked to focus on the prosody differences of each synthetic-speech pair, played to them in random order. In our experiments, each preference test was taken by at least nine listeners whose native language was English.

Fig. 10. Preference scores with 95% confidence intervals between HMM-Baseline and DNN-Baseline.

1 Some examples of the synthetic speech using the five systems can be found at http://home.ustc.edu.cn/~byx1030/demo-1.html.
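The objective metric and significance test described above can be sketched as follows; the paired t-test over per-utterance RMSEs is our reading of the evaluation, since the paper does not spell out the exact test formulation.

```python
import numpy as np
from scipy import stats

def voiced_rmse(f0_nat, f0_pred, voiced):
    """RMSE (Hz) between natural and predicted F0, computed on voiced frames only."""
    d = f0_nat[voiced] - f0_pred[voiced]
    return float(np.sqrt(np.mean(d ** 2)))

def significantly_different(rmses_a, rmses_b, alpha=0.05):
    """Paired t-test on two systems' per-utterance RMSEs (assumed formulation)."""
    _, p = stats.ttest_rel(rmses_a, rmses_b)
    return p < alpha
```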
5.5.1. Comparison between two baseline systems
The RMSEs of F0 prediction using the four systems are summarized in Fig. 9. From this figure, we can see that the DNN-Baseline system is not as good as the HMM-Baseline system in
the objective test. Similar results have been presented by Zen
et al. (2013), where the RMSE in log F0 of the DNN-based
system was higher than that of the HMM-based system. One
possible reason is that an interpolation was conducted to generate F0 values for the unvoiced frames during the preparation
of training data in the DNN-Baseline system, which may introduce some noise to the DNN training. As to their subjective performance, Fig. 10 shows the results of the preference test between the
HMM-Baseline and DNN-Baseline systems. We can see that the
DNN-Baseline system is preferred over the HMM-Baseline. The
subjective evaluation result shows the advantages of using DNNs
to describe the complicated relationship between contexts and
observations over the decision trees used in the conventional HMM-based system. Table 4 lists the normalized variances (VARN) of
F0 contours predicted using the two baseline systems on the test
set. The VARN was calculated as the average ratio between the predicted variance and the natural variance of the F0 contour for each utterance. From the results, the VARN of the DNN-Baseline system was slightly better than that of the HMM-Baseline, but the difference between the two systems was not significant according to the t-test. In other words, the DNN-Baseline system can generate F0 contours with more variation and therefore a larger dynamic range, which can help human subjects perceive the synthesized speech as more natural.

Table 4
Normalized variances (VARN) of F0 contours predicted using the two baseline systems on the test set.

System          VARN
HMM-Baseline    0.68
DNN-Baseline    0.73

Fig. 11. Preference scores with 95% confidence intervals between DNN-Baseline and DNN-Cascade.

Fig. 12. Preference scores with 95% confidence intervals between DNN-IPStt and DNN-Cascade.
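A small sketch of the VARN metric described above (with variances taken over each utterance's voiced F0 values, by our assumption):

```python
import numpy as np

def varn(predicted, natural):
    """Mean over utterances of var(predicted F0) / var(natural F0)."""
    ratios = [np.var(p) / np.var(n) for p, n in zip(predicted, natural)]
    return float(np.mean(ratios))
```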

Fig. 13. Preference scores with 95% confidence intervals between DNN-Baseline and DNN-Parallel.

5.5.2. The performance of cascade DNNs


By comparing the second and third groups in Fig. 9, it is
observed that the DNN-Cascade system has achieved a lower
RMSE of F0 on the test set than the DNN-Baseline system,
but it is still worse than the HMM-Baseline system. A preference test was also conducted between the DNN-Baseline system
and the DNN-Cascade system. The preference scores in Fig. 11
show that the DNN-Cascade system achieved higher preference
than the DNN-Baseline system. In the DNN-Baseline system,
the frame-scale F0 values are predicted using the full set of
contextual features. In the DNN-Cascade system, the final F0
values are generated by the phone-level DNN, whose input features are a combination of the full set of contextual features and
the F0 features predicted using the DNNs at higher prosodic
levels. Therefore, such improvement can be attributed to the
cascade structure and the boosting effects of the contextual features that correspond to higher prosodic levels. Fig. 12 shows that the DNN-Cascade system is preferred over the DNN-IPStt system in the preference tests. This finding implies that considering more prosodic levels benefits F0 modeling.
5.5.3. Performance of parallel DNNs
As shown in Fig. 9, the RMSEs of the DNN-Parallel system on the training set and test sets are 11.37 Hz and 12.62 Hz,
respectively, which are the lowest among all of the evaluated systems. A preference test between the DNN-Baseline system and
the DNN-Parallel system was also conducted, and the results are

Fig. 14. Preference scores with 95% confidence intervals between DNN-Cascade and DNN-Parallel.

shown in Fig. 13. All of these results indicate the effectiveness


of the proposed parallel structure for DNN-based F0 modeling,
in which the additive nature of F0 production at four prosodic
levels is considered. From Fig. 14, we observe that the preference gain achieved by the parallel structure is more significant
than that achieved by the cascade structure. This is reasonable
because an explicit additive process of F0 generation is embedded in the parallel structure. Furthermore, the DNN-Parallel
system achieved better performance than the DNN-IPStt system
as shown in Fig. 15. This indicates the power of considering
more prosodic levels in F0 modeling or the importance of joint
training at all prosodic levels.
Finally, a preference test between the HMM-Baseline system
and the DNN-Parallel system was conducted. Fig. 16 shows the
evaluation results. It is observed that the DNN-Parallel system

Fig. 15. Preference scores with 95% confidence intervals between DNN-IPStt and DNN-Parallel.

Fig. 16. Preference scores with 95% confidence intervals between HMM-Baseline and DNN-Parallel.

Fig. 17. Comparison of F0 contours generated by HMM-Baseline and DNN-Parallel.

significantly outperformed the HMM-Baseline system. A comparison of the F0 contours generated by these two systems is shown in Fig. 17. The figure shows that our proposed approach can represent the overall shapes of F0 contours (e.g., from frame 100 to frame 130) and capture the micro-properties of F0 dynamics (e.g., from frame 150 to frame 200) better than the baseline system.
6. Conclusions
We have proposed to model F0 trajectories for TTS synthesis with DNNs of two hierarchical structures: cascade and parallel. The cascade structure takes the predicted F0 contours from the DNN of the upper prosodic level, i.e., longer prosodic units, as input features to the DNN of the current level. In the parallel structure, separate F0 components are first independently predicted via DNNs at different levels and then added together to form the final F0 contour. The F0 vectors derived with the optimized DCT analysis are used as the observations of the DNNs above the phone level. Experimental results show that both structures can achieve better subjective performance than the conventional DNN-based method, whereas the parallel DNN achieves the best performance, both objectively and subjectively. We will continue to work on the parallel structure by exploiting other model-training criteria, e.g., the distortion of dynamic F0 features and context-dependent distortion weighting. Moreover, we will also try to model spectral and F0 features simultaneously with either hierarchical DNNs or recurrent neural networks (RNNs).

Acknowledgements

The authors would like to thank Wenping Hu, Yuchen Fan, and other colleagues from Microsoft Research Asia and Microsoft China for their kind help and discussions. This work was partially funded by the National Natural Science Foundation of China (Grant no. 61273032).
References
Chen, C.J., Gopinath, R.A., Monkowski, M.D., Picheny, M.A., Shen, K., 1997. New methods in continuous Mandarin speech recognition. In: Eurospeech, pp. 1543–1546.
Fan, Y.C., Qian, Y., Soong, F.K., 2014. TTS synthesis with bidirectional LSTM based recurrent neural networks. In: Interspeech, pp. 1964–1968.
Fernandez, R., Rendel, A., Ramabhadran, B., Hoory, R., 2014. Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In: Interspeech, pp. 2268–2272.
Fujisaki, H., 2008. In search of models in speech communication research. In: Interspeech, pp. 1–10.
Kawahara, H., Masuda-Katsuse, I., de Cheveigné, A., 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun. 27 (3), 187–208.
Latorre, J., Akamine, M., 2008. Multilevel parametric-base F0 model for speech synthesis. In: Proceedings of Interspeech, pp. 2274–2277.
Lei, M., Wu, Y.J., Soong, F.K., Ling, Z.H., Dai, L.R., 2010. A hierarchical F0 modeling method for HMM-based speech synthesis. In: Interspeech, pp. 2170–2173.
Ling, Z.H., Li, D., Yu, D., 2013. Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Trans. Audio, Speech Language Proc. 21 (10), 2129–2139.
Ling, Z.-H., Wu, Y.J., Wang, Y.P., Qin, L., Wang, R.H., 2006. USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method. In: Proceedings of Blizzard Challenge Workshop.
Makhoul, J., 1980. A fast cosine transform in one and two dimensions. IEEE Trans. Acoustics, Speech Signal Proc. 28 (1), 27–34.
Obin, N., Lacheret, A., Rodet, X., 2011. Stylization and trajectory modelling of short and long term speech prosody variations. In: Proceedings of Interspeech, pp. 2029–2032.
Qian, Y., Fan, Y.C., Hu, W.-P., Soong, F.K., 2014. On the training aspects of deep neural network (DNN) for parametric TTS synthesis. In: Proceedings of ICASSP, pp. 3857–3861.
Qian, Y., Liang, H., Soong, F.K., 2008. Generating natural F0 trajectory with additive trees. In: Proceedings of Interspeech, pp. 2126–2129.
Qian, Y., Wu, Z., Gao, B., Soong, F.K., 2011. Improved prosody generation by maximizing joint probability of state and longer units. IEEE Trans. Audio, Speech, Language Proc. 19 (6), 1702–1710.
Shinoda, K., Watanabe, T., 2000. MDL-based context-dependent sub-word modeling for speech recognition. J. Acoust. Soc. Jpn. (E) 21 (2), 79–86.
Talkin, D., 1995. A robust algorithm for pitch tracking (RAPT). Speech Coding Synthesis, 495–518.
Teutenberg, J., Watson, C., Riddle, P., 2008. Modelling and synthesising F0 contours with the discrete cosine transform. In: Proceedings of ICASSP, pp. 3973–3976.
Toda, T., Tokuda, K., 2005. Speech parameter generation algorithm considering global variance for HMM-based speech synthesis. In: Proceedings of Eurospeech, pp. 1315–1318.
Tokuda, K., Kobayashi, T., Imai, S., 1995. Speech parameter generation from HMM using dynamic features. In: Proceedings of ICASSP, pp. 660–663.
Tokuda, K., Masuko, T., Miyazaki, N., Kobayashi, T., 1999. Hidden Markov models based on multi-space probability distribution for pitch pattern modeling. In: Proceedings of ICASSP, pp. 229–232.
Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kitamura, T., 2000. Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of ICASSP, pp. 1315–1318.
Wang, C.C., Ling, Z.H., Zhang, B.F., Dai, L.R., 2008. Multi-layer F0 modeling for HMM-based speech synthesis. In: Proceedings of ISCSLP, pp. 129–132.
Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J., 1992. ToBI: a standard for labeling English prosody. In: Proceedings of ICSLP, pp. 12–16.
Wu, Y.J., Soong, F.K., 2012. Modeling pitch trajectory by hierarchical HMM with minimum generation error training. In: Proceedings of ICASSP, pp. 4017–4020. doi:10.1109/ICASSP.2012.6288799.
Wu, Y.J., Wang, R.H., 2006. Minimum generation error training for HMM-based speech synthesis. In: Proceedings of ICASSP, pp. 89–92. doi:10.1109/ICASSP.2006.1659964.
Wu, Y.J., Wang, R.H., Soong, F.K., 2007. Full HMM training for minimizing generation error in synthesis. In: ICASSP, pp. 517–520.
Wu, Z., Qian, Y., Soong, F.K., Zhang, B., 2008. Modeling and generating tone contour with phrase intonation for Mandarin Chinese speech. In: Proceedings of ISCSLP, pp. 1–4. doi:10.1109/CHINSL.2008.ECP.42.
Yin, X., Lei, M., Qian, Y., Soong, F.-K., He, L., Ling, Z.H., Dai, L.R., 2014. Modeling DCT parameterized F0 trajectory at intonation phrase level with DNN or decision tree. In: Proceedings of Interspeech, pp. 2273–2277.
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T., 1999. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In: Proceedings of Eurospeech, 6, pp. 2347–2350.
Zen, H., Braunschweiler, N., 2009. Context-dependent additive log F0 model for HMM-based speech synthesis. In: Proceedings of Interspeech, pp. 2091–2094.
Zen, H., Sak, H., Graves, A., Senior, A., 2014. Statistical parametric speech synthesis based on recurrent neural networks. In: Proceedings of UK Speech Conference.
Zen, H., Senior, A., Schuster, M., 2013. Statistical parametric speech synthesis using deep neural networks. In: Proceedings of ICASSP, pp. 7962–7966.
Zen, H., Toda, T., Nakamura, M., Tokuda, K., 2007a. Details of Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Trans. Inf. Syst. E90-D (1), 325–333.
Zen, H., Tokuda, K., Black, A., 2009. Statistical parametric speech synthesis. Speech Commun. 51, 1039–1064. doi:10.1016/j.specom.2009.04.004.
Zen, H., Tokuda, K., Kitamura, T., 2007b. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Comput. Speech Language 21 (1), 153–173.
