
Science and technology research (section title to be decided by the Editorial Board)

REAL-TIME IMPLEMENTATION OF MELP VOCODER ON TI FIXED-POINT TMS320C55X DSP
Phm Vn Hu*, inh Vn Ngc*, Nguyn Anh c**, Thi Trung Kin*
Abstract: This paper presents a real-time, full-duplex implementation of the 2400 bits per
second (bps) Mixed Excitation Linear Prediction (MELP) vocoder on TMS320C55x
digital signal processors (DSPs). It briefly reviews the MELP algorithm and the procedure
used to realize and evaluate the implementation on the selected hardware platforms. The
speech quality of the developed MELP coder is evaluated with both English and Vietnamese
voice samples using direct listening assessment and the ITU-T P.862 PESQ objective method.
The results indicate that this realization not only meets the requirements specified by
the MELP standard (MIL-STD-3005), but also performs comparably to some commercial
MELP vocoder products available on the market.
Keywords: Speech coding, vocoder, MELP, speech quality evaluation, real-time DSP

1. INTRODUCTION
The Mixed Excitation Linear Prediction (MELP) vocoder is one of the most recognized
and widely used speech coding methods owing to its speech quality, compression
rate, and robustness to adverse working conditions such as ambient noise or
imperfect transmission channels, a key requirement for military applications.
Its applications include digital voice in high frequency (HF) transceivers and
secure voice systems. MELP was standardized by the US Department of Defense in
1997 as MIL-STD-3005 [1]. The vocoder was improved and re-standardized between
1998 and 2001 under the name MELPe (enhanced MELP) with several additional
features: a new rate of 1200 bits per second, improvements in the coding and
decoding processes, noise preprocessing to remove background noise, transcoding
between the 2400 bits/s and 1200 bits/s rates, and a new post filter [8]. This
paper studies only the original MELP version.
To gauge the demand for the MELP vocoder, a survey was conducted on high-quality
HF transceivers that comply with NATO standards and are either currently in
service with Vietnamese defense forces or produced by established HF
manufacturers. The survey found that most high-end HF transceivers use the MELP
coding standard, for example the HF6000 of Tadiran Communications Ltd, the
TR2400 of Grintek, and Codan's NGT SRx of Codan Radio [10]. This observation
indicates the quality and prevalence of MELP relative to other vocoders used in
HF transceivers. It is worth noting that MELP has been evaluated with many
different languages, such as English, French, and German, spoken in North
Atlantic Treaty Organization (NATO) countries. MELP has also been used for
Vietnamese with no major problems reported in practice; however, there have been
no official reports on this.
Among hardware platforms for MELP implementations, the Texas Instruments
(TI) C5000 DSP family is a good candidate; virtually all commercial MELP
products available on the market support this platform. The TMS320C5000(TM) DSP
family provides fixed-point, low-power 16-bit DSPs with clock rates up to 300 MHz.

Journal of Military Science and Technology Research, No. 23, 02 - 2013


This DSP family is also rich in peripherals and has a large amount of on-chip
memory, which reduces the overall system cost. For these reasons, C5000 devices
fit a variety of low-power and cost-effective signal processing solutions,
including portable devices in audio, voice, medical and biometric applications [9].
Texas Instruments provides not only the C5000 DSP chips but also a large set of
supporting hardware and software resources that help developers accomplish their
tasks rapidly, including a variety of DSK (DSP starter kit) and EVM (evaluation
module) boards, the TMS320C55x DSP Library (DSPLIB), the C5000 Chip Support
Library (CSL), and numerous application reports.
Given the strong demand for low-bit-rate MELP speech coding and the survey of
available hardware platforms, the research group decided to study and implement a
real-time MELP vocoder system on the C55x, specifically on the C5509 and C5510.
The results are presented in this paper, which is structured as follows.
Section 1 presents the importance of MELP for military applications and briefly
introduces the low-power, low-cost TI C5000 DSPs. Section 2 briefly describes
the MELP algorithm. Section 3 analyzes the C5000 systems used to develop the
MELP speech coder. Section 4 presents the evaluation of the system with detailed
experimental results. Finally, Section 5 gives conclusions and future work.
2. MELP VOCODER ALGORITHM DESCRIPTION
MELP belongs to the group of vocoders based on the Linear Predictive Coding
(LPC) model. Well-known coders in this group include CELP, LPC-10, LPC-10e and
MELP. MELP provides equivalent or better performance than the 4800 bits per
second CELP coder (Federal Standard 1016) at a lower bit rate [4]. MELP was
developed from LPC-10 (FS-1015, STANAG 4198) with five major changes: mixed
excitation, aperiodic pulses (a new voicing state for jittery voiced frames),
pulse dispersion, adaptive spectral enhancement, and Fourier magnitude
modeling [1-3].
The mixed excitation is a combination of a pulse train and random noise, which
distinguishes MELP from the conventional LPC model, in which the excitation
source is either a pulse train or noise at any given time. This combination is
implemented using a multi-band mixing model that simulates frequency-dependent
voicing strengths. The goal of this multi-band mixed excitation is to reduce the
buzz usually associated with LPC vocoders, especially in broadband acoustic
noise [1].
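To make the idea of mixing a pulse train with noise concrete, the following is a minimal single-band sketch in Python. It is not the multi-band scheme of MIL-STD-3005: the function name `mixed_excitation`, the linear gain rule, and the uniform noise source are all assumptions chosen for illustration, and a real MELP coder applies a separate voicing strength in each frequency band.

```python
import random


def mixed_excitation(num_samples, pitch_period, voicing_strength, seed=0):
    """Blend a pulse train with white noise (single-band simplification).

    voicing_strength lies in [0, 1]: 1.0 yields a pure pulse train and
    0.0 yields pure noise. The linear weighting is an assumption made
    for this sketch, not the standard's multi-band filter design.
    """
    if not 0.0 <= voicing_strength <= 1.0:
        raise ValueError("voicing_strength must lie in [0, 1]")
    rng = random.Random(seed)
    # Scale pulses so that the pulse train has unit average power.
    pulse_amplitude = pitch_period ** 0.5
    # Uniform noise on [-sqrt(3), sqrt(3)] has unit variance.
    noise_limit = 3.0 ** 0.5
    excitation = []
    for n in range(num_samples):
        pulse = pulse_amplitude if n % pitch_period == 0 else 0.0
        noise = rng.uniform(-noise_limit, noise_limit)
        sample = voicing_strength * pulse + (1.0 - voicing_strength) * noise
        excitation.append(sample)
    return excitation


if __name__ == "__main__":
    frame = mixed_excitation(num_samples=180, pitch_period=60, voicing_strength=0.7)
    print(len(frame), frame[:3])
```

With `voicing_strength=1.0` the output reduces to the classical LPC voiced excitation, and with `0.0` to the unvoiced one, which is the either/or behaviour that the mixed model is meant to relax.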
Aperiodic pulses are used in the excitation model, in which voiced speech is
classified as either voiced (periodic) or jittery voiced (aperiodic). Jittery
voiced speech is often observed in the transition regions between voiced and
unvoiced segments of the speech signal. This feature allows the synthesizer to
reproduce erratic glottal pulses without introducing tonal noise [1]. Pulse
dispersion is implemented using a fixed pulse dispersion filter based on a
spectrally flattened triangle pulse. This filter spreads the excitation energy
within a pitch period, which in turn reduces the harsh quality of the synthetic
speech [1].
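The difference between periodic and jittery voiced excitation can be sketched by perturbing the spacing between pulses. The Python example below is illustrative only: the function name `pulse_positions` and the default jitter of 25% of the pitch period are assumptions for this sketch, and the exact jitter rule should be taken from [1].

```python
import random


def pulse_positions(num_samples, pitch_period, aperiodic, jitter=0.25, seed=0):
    """Return the sample indices at which excitation pulses are placed.

    For periodic voicing the spacing is exactly pitch_period. For jittery
    (aperiodic) voicing each spacing is scaled by a random factor in
    [1 - jitter, 1 + jitter]. The 0.25 default is an assumed value.
    """
    rng = random.Random(seed)
    positions = []
    position = 0
    while position < num_samples:
        positions.append(position)
        factor = 1.0 + rng.uniform(-jitter, jitter) if aperiodic else 1.0
        # Guard against a zero step for very small pitch periods.
        position += max(1, round(pitch_period * factor))
    return positions


if __name__ == "__main__":
    print(pulse_positions(180, 60, aperiodic=False))
    print(pulse_positions(180, 60, aperiodic=True))
```

Irregular spacing removes the strict harmonic structure of the pulse train, which is one way to see why jittery frames avoid the tonal artifacts mentioned above.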


The adaptive spectral enhancement filter enhances the formant structure of the
synthetic speech; it is constructed from the poles of the LPC vocal tract
filter. This filter improves the match between synthetic and natural bandpass
waveforms and gives the speech output a more natural quality [1]. Besides the
improvements already mentioned, another notable feature is the use of Fourier
magnitudes, which provide a more accurate excitation source and thus model the
speech production process better than conventional LPC models [1].
Block diagrams of the MELP coding (analysis) and decoding (synthesis) processes,
taken from [1], are presented in Figures 1 and 2 respectively. In the analysis
process, one heavy and important procedure that is used repeatedly is pitch
determination, which includes integer pitch search and fractional pitch
refinement [1, section A5.2.4], [5]. Together with pitch determination, the
quantization of the LPC coefficients, which consists of converting the LPC
coefficients to Line Spectral Frequency (LSF) form [1, 7] and multi-stage
vector quantization (MSVQ) of the LSFs [1, 6], is the most computationally
demanding part of the MELP algorithm. It should also be noted that the Fourier
magnitudes of the first 10 pitch harmonics are computed from the prediction
residual generated by the quantized prediction coefficients (the quantized LSFs
are converted back to LPC coefficients). Therefore, this step has to be done
after the LPC quantization. In the decoding process, pitch is decoded first
since it contains the mode information: voiced, unvoiced, or frame erasure. If a
frame is detected as an erasure, either from the pitch information or by error
detection, a frame repeat mechanism is applied: all the parameters for the
current frame are replaced with the parameters from the previous frame. The
decoding process generally proceeds in the reverse order of the coding process,
with the difference that it interpolates parameters pitch-synchronously for each
synthesized pitch period. The interpolated parameters are the gain (in dB),
LSFs, pitch, jitter, Fourier magnitudes, bandpass voicing strengths, and the
spectral tilt coefficient for the adaptive spectral enhancement filter.
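Because pitch determination dominates the analysis cost, it helps to see what an integer pitch search involves. The sketch below uses a normalized autocorrelation over candidate lags; it is a simplified illustration, not the procedure of [1, section A5.2.4]. The function name `integer_pitch_search` and the lag range of 40 to 160 samples are assumptions, and the input filtering and fractional refinement steps are omitted.

```python
import math


def integer_pitch_search(signal, min_lag=40, max_lag=160):
    """Return (best_lag, best_correlation) from a normalized autocorrelation.

    For each candidate lag the first `window` samples are correlated with
    the same window shifted by `lag`, and the result is normalized by the
    energies of both windows so that it lies in [-1, 1].
    """
    window = len(signal) - max_lag
    if window <= 0:
        raise ValueError("signal must be longer than max_lag")
    energy_ref = sum(x * x for x in signal[:window])
    best_lag, best_corr = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        cross = sum(signal[n] * signal[n + lag] for n in range(window))
        energy_lag = sum(x * x for x in signal[lag:lag + window])
        denom = math.sqrt(energy_ref * energy_lag)
        corr = cross / denom if denom > 0.0 else 0.0
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr


if __name__ == "__main__":
    # Synthetic test tone with a period of 100 samples (80 Hz at 8 kHz).
    tone = [math.sin(2.0 * math.pi * n / 100.0) for n in range(400)]
    print(integer_pitch_search(tone))
```

The two nested loops over lags and window samples are multiply-accumulate operations, which is why this step maps well onto the dual-MAC architecture discussed in Section 3.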
[Figure 1 block diagram. Blocks: input speech; pitch calculation; bandpass voicing analysis; LPC residual calculation; peakiness calculation; compute LSFs from LPC coefficients; quantize gain, pitch, LSFs, bandpass voicing; fractional pitch refinement; final pitch calculation; compute Fourier magnitudes and quantize; aperiodic flag; pitch doubling check; linear prediction analysis; gain calculation; pack bits into frames and apply error protection; average pitch update; MELP frame.]

Figure 1. MELP coder block diagram


MELP uses a frame size of 22.5 ms, which corresponds to 180 samples at a sampling
rate of 8000 samples per second; each sample has a resolution of 16 bits. The
recommended analog voice bandwidth is 100 Hz to 3800 Hz.

Figure 2. MELP decoder block diagram


The transmitted MELP frame format is presented in Table 1. Each 22.5 ms frame
carries 54 bits, giving a bit rate of 54 * 1000 / 22.5 = 2400 bits/s [1].
Table 1. MELP bit allocation
Parameters                 Voiced   Unvoiced
LSFs                       25       25
Fourier magnitudes         8        -
Gain (two per frame)       8        8
Pitch, overall voicing     7        7
Bandpass voicing           4        -
Aperiodic flag             1        -
Error protection           -        13
Sync bit                   1        1
Total bits/22.5 ms frame   54       54
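The frame arithmetic and the bit allocation of Table 1 can be cross-checked with a few lines of Python. This is only a consistency check on the numbers quoted above; the constant and function names are hypothetical, and a dash in the table is represented as 0 bits.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 22.5
BITS_PER_SAMPLE = 16

# Bit allocation copied from Table 1 (0 where the table shows a dash).
VOICED_BITS = {
    "LSFs": 25,
    "Fourier magnitudes": 8,
    "Gain (two per frame)": 8,
    "Pitch, overall voicing": 7,
    "Bandpass voicing": 4,
    "Aperiodic flag": 1,
    "Error protection": 0,
    "Sync bit": 1,
}
UNVOICED_BITS = {
    "LSFs": 25,
    "Fourier magnitudes": 0,
    "Gain (two per frame)": 8,
    "Pitch, overall voicing": 7,
    "Bandpass voicing": 0,
    "Aperiodic flag": 0,
    "Error protection": 13,
    "Sync bit": 1,
}


def samples_per_frame():
    """Number of input samples covered by one frame."""
    return round(SAMPLE_RATE_HZ * FRAME_MS / 1000)


def bit_rate(bits_per_frame):
    """Coded bit rate in bits per second for a given frame payload."""
    return bits_per_frame * 1000 / FRAME_MS


if __name__ == "__main__":
    voiced_total = sum(VOICED_BITS.values())
    unvoiced_total = sum(UNVOICED_BITS.values())
    pcm_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
    print(samples_per_frame(), voiced_total, unvoiced_total)
    print(bit_rate(voiced_total), pcm_rate / bit_rate(voiced_total))
```

Both frame types sum to 54 bits, so the coded rate is 2400 bits/s regardless of voicing, compared with 128000 bits/s for the uncompressed 16-bit input.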

3. REAL-TIME MELP IMPLEMENTATION ON TMS320VC5509 AND TMS320VC5510


Among the C5000 devices, the C5509 (full name TMS320VC5509A) and C5510 (full name
TMS320VC5510A) are among the most high-end products. With a sophisticated DSP
architecture focused on parallelism and power reduction, the C5000 can run
high-complexity algorithms efficiently and in real time. Key hardware features
include: a complex internal bus structure composed of one program bus, three
data read buses, two data write buses, and additional buses dedicated to
peripheral and DMA activity, which together allow up to three data reads and two
data writes in a single cycle; two multiply-accumulate (MAC) units, each capable
of a 17-bit x 17-bit multiplication in a single cycle; a central 40-bit
arithmetic/logic unit (ALU) supported by an additional 16-bit ALU; and a fully
protected pipeline structure with predictive branching capability. Both the
C5509 and C5510 have a set of valuable peripherals, such as timers (2), McBSPs
(3), DMA (6 channels), and a programmable phase-locked loop clock generator. The
C5509A additionally offers USB 1.1 and I2C interfaces, while the C5510 has more
on-chip memory: 64 KB of dual-access RAM (DARAM) and 256 KB of single-access RAM
(SARAM), compared with 64 KB of DARAM and 192 KB of SARAM on the C5509A [9].
Documentation and guidance on hardware design and software programming for these
two devices are extensive and easy to find, which greatly helps those starting
to work with TI DSPs.
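The MAC units and the 40-bit accumulator matter because a fixed-point vocoder spends most of its time in dot products over 16-bit data. As a rough illustration only, the Python sketch below emulates such a dot product. It assumes the Q15 number format, which the paper does not state, and the function names `float_to_q15` and `q15_dot` are hypothetical; it is neither the authors' code nor a TI DSPLIB routine.

```python
Q15_MAX = 32767
Q15_MIN = -32768
ACC_MAX = (1 << 39) - 1   # largest value of a signed 40-bit accumulator
ACC_MIN = -(1 << 39)


def float_to_q15(value):
    """Convert a float in [-1, 1) to a saturated 16-bit Q15 integer."""
    scaled = round(value * 32768)
    return max(Q15_MIN, min(Q15_MAX, scaled))


def q15_dot(a, b):
    """Dot product of two Q15 vectors, returned as a Q15 value.

    Each 16-bit x 16-bit product is a Q30 value that is summed in a wide
    accumulator. The final sum is clamped to the 40-bit range, rounded
    back to Q15, and saturated to 16 bits.
    """
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    acc = max(ACC_MIN, min(ACC_MAX, acc))
    result = (acc + (1 << 14)) >> 15   # round to nearest, Q30 -> Q15
    return max(Q15_MIN, min(Q15_MAX, result))


if __name__ == "__main__":
    a = [float_to_q15(0.5)] * 4
    b = [float_to_q15(0.25)] * 4
    print(q15_dot(a, b) / 32768)   # 4 * 0.5 * 0.25 in floating point
```

The wide accumulator is what allows many products to be summed before any rounding or saturation takes place, which keeps the quantization error of long filters small.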

Figure 3. System used to develop speech coder MELP


For all the above reasons, the research group decided to implement the MELP
vocoder on the C5509 and C5510. The complete hardware platforms used were the
C5509 and C5510 DSP Starter Kits (DSKs), which are provided either directly by TI
or by its close partner Spectrum Digital (http://www.spectrumdigital.com).
These DSKs not only have a C5509 or a C5510 DSP at the heart of the system, but
also provide several helpful peripherals, such as a codec (TLV320AIC23B) with
four 3.5 mm audio jacks (microphone, line-in, speaker, line-out), DIP switches,
LEDs, and a large amount of external SDRAM. The selected integrated development
environment (IDE) was Code Composer Studio (CCS) version 3.3, provided by TI.
CCS 3.3 includes compilers for each of TI's device families, a source code
editor, a project build environment, a debugger, a profiler, simulators, a
real-time operating system (DSP/BIOS), and many other features. CCS 3.3 also
offers compiler optimization with a range of options that help developers speed
up the implemented algorithms [9]. The system used to develop the MELP vocoder
and the online real-time model are presented in Figures 3 and 4 respectively.

Figure 4. Online real-time model


4. PERFORMANCE EVALUATION
Numerous assessment methods have been proposed in the literature to evaluate
the quality of processed speech. They are either subjective measures (with
human listener participation), e.g., Mean Opinion Score (MOS) [12], or objective
measures (without human listener participation), e.g., Perceptual Evaluation of
Speech Quality (PESQ) [11]. One of the most widely used subjective measures is
the MOS, in which trained and experienced listeners rate the quality of the test
speech signal on a five-point numerical scale from 1 to 5 (see Table 2). The
final score of the test signal is obtained by averaging the scores given by all
listeners (hence the name Mean Opinion Score).
Table 2. MOS rating scale [12, p. 491].
Rating   Speech Quality   Level of Distortion
5        Excellent        Imperceptible
4        Good             Just perceptible, but not annoying
3        Fair             Perceptible and slightly annoying
2        Poor             Annoying but not objectionable
1        Bad              Very annoying and objectionable
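The averaging step behind a MOS value is simple enough to state in code. The sketch below is illustrative: the function name `mean_opinion_score` and the example ratings are made up for the demonstration and are not data from this study.

```python
# Labels of the five-point scale in Table 2.
MOS_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def mean_opinion_score(ratings):
    """Average a list of integer listener ratings on the 1 to 5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in MOS_LABELS:
            raise ValueError(f"rating {rating!r} is outside the 1 to 5 scale")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Hypothetical ratings from five listeners for one test signal.
    example_ratings = [4, 3, 4, 5, 4]
    print(mean_opinion_score(example_ratings))
```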


Figure 5. General diagram of a typical measure.

Although subjective assessment methods are perhaps the most reliable, they are
time-consuming and require trained listeners as well as suitable listening
conditions. For these reasons, objective measures are often used. One
disadvantage of existing objective measures is that they still require the
original clean speech as a reference, owing to the limited understanding of
human hearing perception, especially under noisy conditions. Despite these
limitations, such measures have been found useful and show a good correlation
with subjective listening tests, e.g., MOS scores.
In our research, PESQ [11] was chosen for the evaluation task. PESQ was designed
specifically to assess voice quality in telecommunications. This measure shows a
high correlation with the Mean Opinion Score. The PESQ score is on a scale of 1
to 4.5; a higher score means better quality.
The general diagram of a typical measure is presented in Figure 5. In our
experiments, however, no additive noise was applied (so no SNR was set), and
channels were assumed to be ideal (the blocks in dashed lines in Figure 5 were
not considered).
Since test vectors for assessing MELP implementations are hard to obtain
publicly, and owing to limited time, only a small set of corpora in English and
Vietnamese was used to evaluate the MELP implemented on the C55x, hereafter
called C55x MELP. It should be noted that there is not yet any official
Vietnamese database for measuring the performance of speech processing
algorithms in general and speech coding in particular. Therefore, the research
group had to record some Vietnamese sentences on its own, making its best effort
to meet the input requirements. In contrast, there are several standardized
English databases that are large and widely used in research publications, e.g.,
AURORA, TIMIT, and ITU P.50. However, to be able to compare the performance of
the C55x MELP with some other MELP products sold on the market [13, 14], some
English samples were taken directly from these products' websites, including
both the original clean and the processed speech.
In detail, 6 samples at a sampling rate of 8000 samples per second with 16-bit
quantization were evaluated; their properties are presented in Table 3.
Vn_M.wav and Vn_F.wav are short, while the remaining files are long enough to
cover a variety of different sounds.
Table 3. Speech samples for evaluation
Order   Filename                   Language     Male/Female
1       Eng_M.wav [13]             English      Male
2       Eng_F.wav [13]             English      Female
3       Vn_M.wav                   Vietnamese   Male
4       Vn_F.wav                   Vietnamese   Female
5       Vov1.wav                   Vietnamese   Female
6       reference_64p0k.wav [14]   English      Both

The PESQ scores obtained by using the C55x MELP implementation in comparison
with commercial products are shown in Table 4.
Table 4. PESQ scores of C55x MELP implementation
Order   Filename              C55x MELP   Commercial products
1       Eng_M.wav             2.641       2.666 [13]
2       Eng_F.wav             2.384       2.445 [13]
3       Vn_M.wav              2.631       Unavailable
4       Vn_F.wav              2.267       Unavailable
5       Vov1.wav              2.713       Unavailable
6       reference_64p0k.wav   3.106       2.970 (*)

(*): Scored with the ITU P.862 tool [11]


Figures 6 and 7 show the original and C55x MELP processed speech for Vn_M.wav
and Vn_F.wav, with the respective phrases "Nào sẵn sàng chưa các thanh niên" and
"vậy sự giải thích của họ là có lý".

[Figure 6 waveform plots. Panels: "Vn_M original speech" and "MELP processed Vn_M speech".]

Figure 6. Original and C55x MELP processed Vietnamese male speech,
"Nào sẵn sàng chưa các thanh niên"

[Figure 7 waveform plots. Panels: "Vn_F original speech" and "MELP processed Vn_F speech".]

Figure 7. Original and C55x MELP processed Vietnamese female speech,
"vậy sự giải thích của họ là có lý"
The experimental results show that the C55x MELP performed about the same as
some MELP products currently sold on the market for the given input speech.
Specifically, the C55x MELP scored higher than the Vocal product but lower than
the Signalogic product; the differences are marginal, at only around 0.1 PESQ
points. Further tests with other speech corpora confirmed the quality of the
implemented C55x MELP, both in PESQ scores and in direct listening assessments.
With the online real-time configuration illustrated in Figure 4, the system ran
stably and provided the expected voice quality. This also means that the system
is capable of working in full-duplex mode, in which the speech coding and
decoding processes run concurrently.
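The size of the gaps mentioned above can be read directly from Table 4. The short Python sketch below recomputes them for the three files that have a commercial reference score; the dictionary and function names are hypothetical, and the attribution of each file to Signalogic [13] or Vocal [14] follows the citations in Table 3.

```python
# (C55x MELP score, commercial product score) taken from Table 4.
PESQ_SCORES = {
    "Eng_M.wav": (2.641, 2.666),            # commercial sample from [13]
    "Eng_F.wav": (2.384, 2.445),            # commercial sample from [13]
    "reference_64p0k.wav": (3.106, 2.970),  # commercial sample from [14]
}


def pesq_differences(scores):
    """Return C55x MELP minus commercial PESQ, rounded to 3 decimals."""
    return {
        name: round(ours - theirs, 3)
        for name, (ours, theirs) in scores.items()
    }


if __name__ == "__main__":
    for name, diff in pesq_differences(PESQ_SCORES).items():
        print(f"{name}: {diff:+.3f}")
```

A negative value means the commercial product scored higher. The largest gap in either direction is for reference_64p0k.wav, where the C55x MELP scores higher by about 0.14.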
5. CONCLUSIONS AND FUTURE WORK
This paper described a real-time MELP implementation on the Texas Instruments
fixed-point TMS320C55x platform. Our evaluation showed that the coder is capable
of full-duplex real-time operation and produces speech at 2400 bps of a quality
comparable to that of some commercial products. Several problems remain for
future work. First, more tests with a significantly larger speech database,
especially in Vietnamese, should be conducted to verify the performance of the
implementation. The working conditions, either in simulation or in reality, will
be expanded to cover different background noise types at diverse signal-to-noise
ratios (SNRs), as specified in [1], Appendix B. Since MELP is widely used in HF
transceivers and secure voice applications, the transmission channel
characteristics should also receive attention. Next, this paper judged only
speech quality when comparing the C55x MELP with other commercial products;
complexity (on the same platform) and resource consumption should be considered
in future research. Finally, the implementation needs to be refined and
optimized to improve the speech quality, speed up execution, and save hardware
resources, which could lead to a system running multiple MELP instances on a
single DSP. The enhanced MELP (MELPe, NATO STANAG-4591), which provides lower
data rates (not only 2400 bps, but also 1200 bps and 600 bps) and better speech
quality, is a promising direction for the development of this work.


ACKNOWLEDGMENTS
This work is supported by project 118/2013/H NT (2013-2014), funded by the
Ministry of Science and Technology of Vietnam.
REFERENCES
[1] U.S. DoD, MIL-STD-3005, Department of Defense Telecommunications
Systems Standard, 1999.
[2] A. V. McCree and T. P. Barnwell III, "A mixed excitation LPC vocoder model
for low bit rate speech coding," IEEE Transactions on Speech and Audio
Processing, vol. 3, no. 4, pp. 242-250, 1995.
[3] L. M. Supplee, R. P. Cohn, J. S. Collura, and A. V. McCree, "MELP: the new
federal standard at 2400 bps," in Proc. IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP-97), 1997, vol. 2, pp. 1591-1594.
[4] M. Kohler, "A comparison of the new 2400 bps MELP federal standard with
other standard coders," in Proc. IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP-97), 1997, vol. 2, pp. 1587-1590.
[5] Y. Medan, E. Yair, and D. Chazan, "Super resolution pitch determination of
speech signals," IEEE Transactions on Signal Processing, vol. 39, no. 1,
pp. 40-48, 1991.
[6] W. P. LeBlanc, B. Bhattacharya, S. A. Mahmoud, and V. Cuperman, "Efficient
search and design procedures for robust multi-stage VQ of LPC parameters for 4
kb/s speech coding," IEEE Transactions on Speech and Audio Processing, vol. 1,
no. 4, pp. 373-385, 1993.
[7] P. Kabal and R. P. Ramachandran, "The computation of line spectral
frequencies using Chebyshev polynomials," IEEE Transactions on Acoustics,
Speech and Signal Processing, vol. 34, no. 6, pp. 1419-1426, 1986.
[8] MELP and MELPe Vocoder on Wikipedia
http://en.wikipedia.org/wiki/Mixed-excitation_linear_prediction
[9] TI official websites on C5000, TMS320VC5510A, TMS320VC5509A, and CCS 3.3
[10] Technical specifications of Tadiran HF6000, Grintek TR2400, Codan's
NGT SRx
[11] ITU P.862, "Perceptual evaluation of speech quality (PESQ), an objective
method for end-to-end speech quality assessment of narrowband telephone
networks and speech codecs," ITU Recommendation P.862, 2000.
[12] P. C. Loizou, Speech Enhancement: Theory and Practice, 1st ed. CRC Press,
2007.
[13] MELP commercial product provided by Signalogic
http://www.signalogic.com/index.pl?page=codec_samples
[14] MELP commercial product provided by Vocal
http://www.vocal.com/audio-examples/other-speech-coder-audio-examples/

Address:
* Institute of Information Technology / Academy of Military Science and Technology
** High Technology Center / Signal Command
