
SESSION XIX: SPEECH PROCESSING

FAM 19.3: A Single-Chip CMOS Speech Synthesis Chip*


Kazuo Inoue, Kenji Wakabayashi, Yoshinobu Yoshikawa, Shigeaki Masuzawa, Kenji Sano and Seiji Kimura

Sharp Corp.

Nara, Japan
A CMOS SPEECH SYNTHESIZER LSI circuit, organized as a special purpose
microcomputer containing program ROM, RAM, 32K of speech data ROM and a
D/A converter on a single chip, and supporting speech synthesis
techniques, will be reported. By using data compression techniques¹
based on adaptive differential pulse code modulation (ADPCM), the chip
is able to generate high quality speech, reproducing the natural
inflection and intonation of the original speaker. It produces about 30
words of speech from 32K of internal data ROM. Moreover, additional ROM,
up to a maximum of 128K, may be added without any interface circuits to
increase the vocabulary.

Any voice, adult or child, male or female, can be synthesized, and the
chip can also synthesize music. This system uses comprehensive data
compression techniques, based on sampling and coding of the speech
signal at twice its highest frequency.

A distinction is made between voiced and unvoiced utterances. The
waveform of unvoiced utterances is encoded using a zero-cross technique
with added amplitude information consisting of two bits per word. This
variable amplitude has been shown to provide valuable auditory
information.

The compression technique for voiced utterances is based on the
subjective discarding of redundant speech information. Redundant pitch
periods and redundant phonemes are removed, and representative pitch
periods are extracted from successive waveforms to replace N similar
periods.

To obtain large enough values of N, while still maintaining the correct
envelope, amplitude information and peak values are encoded separately
and interpolated. This contributes to improved data compression and, as
a result, the average value of N may be as large as 13 for voiced
waveforms.

An adaptive differential pulse code modulation technique is employed for
the encoding of the representative pitch periods. Differences in
amplitude between successive samples are encoded according to a 4b ADPCM
rule. The quantizing unit value for each pitch period, which is also
encoded in 4b, is selected to maximize the signal-to-noise ratio between
the original and the decoded signal. Using this method a 25dB S/N ratio
was obtained. Figure 1 shows the original and the regenerated waveforms
using these techniques.

The speech data condensed by these compression techniques are stored in
the 32K data ROM. Figure 2 shows the ROM data format for both voiced and
unvoiced utterances.

A special purpose 8b microcomputer was adopted for the speech
synthesizer; its architecture, instruction set, arithmetic and address
capabilities were functionally optimized to perform the above synthesis
techniques.

The chip is self-contained and includes all of the circuits required to
regenerate the voice signal on a single chip. A microcomputer design is
more useful than a hard-wired chip because it can be tailored to handle
a variety of time domain data compression techniques such as PCM, DPCM,
DM or more complex compression techniques by the relatively simple
procedure of changing the program.

A block diagram of the synthesizer chip is shown in Figure 3.

The control program, which recreates the speech according to the
regenerating algorithm (that is, demodulation of the ADPCM data,
repeating of the representative pitch period, amplitude interpolation
and demodulation of the zero-cross data), occupies 1K bytes of control
program ROM area. The compressed speech data are stored in the 32K of
data ROM.

The chip performs signal processing on the data in the 32K ROM under the
control of the control program and regenerates the speech signal.

For this kind of data processing, it is important that the chip has the
ability to perform fast data transfers and arithmetic operations. To
achieve these goals a 33-instruction set using long bit instructions was
adopted. As a result, speech sampled at 8kHz could be generated using an
8μs instruction cycle time. The use of long bit length instructions,
which offer fast operation rather than fast cycle time, is better from
the point of view of power dissipation and chip design.

Almost all of the circuits except the RAM, I/O latch, DAC, etc., have
been designed as ratioless dynamic type CMOS; these occupy approximately
90% of the total chip area. This design method serves to reduce power
dissipation and to minimize chip size.

The chip can be brought into the standby mode during the non-generation
of speech by a halt instruction. In this mode the oscillator and system
clock signal are halted and the only power dissipated is a very low
leakage current. However, because of their static design, the contents
of the RAM and latch circuits are held.

The chip has 8 terminals for control or key input and 6 terminals for
output. Additionally, the chip can be directly connected to up to 128K
of external ROM.

An 8b D/A converter consisting of a resistor ladder network builds the
analog signal and feeds it to the preamplifier.

Table 1 summarizes the main features and performance of the chip.

The chip has been fabricated using metal-gate CMOS technology and about
33,000 transistors are integrated in an area 5.1mm by 5.01mm.

Figure 4 shows a microphotograph of the chip.

Acknowledgments

The authors would like to acknowledge and thank K. Yamashita of Osaka
City University for his assistance in the development of the synthesis
algorithm. They also wish to thank K. Okano, Corporate Director and
Group General Manager, for his helpful advice on this project.

*Japanese Patent No. Tokukaisho 55-111995.
[See page 337 for Figures 2, 3.]
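
The voiced-speech regeneration steps named in the text (demodulation of the 4b ADPCM data, repeating of the representative pitch period, and amplitude interpolation) can be illustrated with a minimal Python sketch. It is not the chip's control program. The sign-magnitude layout of the 4b code, the use of the quantizing unit value directly as a step size, the linear gain interpolation, and the function names `decode_adpcm_period` and `repeat_with_envelope` are all assumptions made for illustration.

```python
def decode_adpcm_period(codes, unit, start=0):
    """Decode 4-bit ADPCM codes of one pitch period into 8-bit samples.

    Assumption (not from the paper): bit 3 of each code is the sign and
    bits 0-2 are the magnitude; the sample-to-sample difference is
    magnitude * unit, where `unit` is the quantizing unit value chosen
    for this pitch period.
    """
    samples = []
    level = start
    for code in codes:
        magnitude = code & 0x7
        sign = -1 if code & 0x8 else 1
        level += sign * magnitude * unit
        # Clamp to the signed range of an 8-bit D/A converter.
        level = max(-128, min(127, level))
        samples.append(level)
    return samples


def repeat_with_envelope(period, repeats, gain_start, gain_end):
    """Repeat one representative pitch period `repeats` times.

    Assumption: the envelope is restored by moving the gain linearly
    from gain_start toward gain_end across the repeats.
    """
    out = []
    for k in range(repeats):
        gain = gain_start + (gain_end - gain_start) * k / repeats
        out.extend(int(round(s * gain)) for s in period)
    return out


# Hypothetical data: four codes, quantizing unit value 4, two repeats.
period = decode_adpcm_period([0x1, 0x2, 0x9, 0x0], unit=4)
voiced = repeat_with_envelope(period, repeats=2, gain_start=1.0, gain_end=0.5)
```

Storing one period plus a repeat count and envelope data, rather than every period, is what lets N similar periods collapse into a single stored waveform.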

Control                        8 bits parallel
Instruction Set                35 instructions
Instruction Cycle Time         8 μs (TYP)
ROM Capacity                   4 KB for speech data
                               1 KB for control program
RAM Capacity                   24 B
I/O      Input Port            8 terminals
         Output Port           6 terminals
         D/O Port              8 terminals
Additional ROM                 16 KB max
D/A Converter                  8 bits (Ladder Network)
Sampling Frequency of Speech   8 kHz (TYP)
Technology                     Metal gate CMOS
Number of Transistors          33,000
Power Supply                   2.7 ~ 5.5 V
Power Dissipation (at 3V)
         Standby Power         3 μW
         Operational Power     4.5 mW
Die Size                       5.10 x 5.01 mm
Package                        48 pins flat package

TABLE 1-Summary of hardware features and performance.

[Figure 1 panel labels: compressed waveform (4 repeats, 8 repeats, 9 repeats; using zero crossing); (b) Regenerated Waveform]

FIGURE 1-An example of the original waveform and regenerated waveform using the speech synthesis techniques.
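
The figures quoted in the text and in Table 1 imply a tight per-sample instruction budget and a small per-word storage budget. The arithmetic below is only a back-of-the-envelope check. It assumes that "32K" means 32 x 1024 bits (consistent with the 4 KB listed in Table 1) and that "about 30 words" is exactly 30.

```python
SAMPLE_RATE_HZ = 8_000          # sampling frequency of speech (Table 1)
INSTRUCTION_CYCLE_US = 8        # typical instruction cycle time (Table 1)
SPEECH_ROM_BITS = 32 * 1024     # assumption: "32K" = 32 Kbit = 4 KB
WORDS_STORED = 30               # "about 30 words" taken as exactly 30

# Time available to produce one output sample, in microseconds.
sample_period_us = 1_000_000 / SAMPLE_RATE_HZ

# Instruction cycles that fit into one sample period.
cycles_per_sample = sample_period_us / INSTRUCTION_CYCLE_US

# Storage figures derived from the ROM size.
speech_rom_bytes = SPEECH_ROM_BITS // 8
bits_per_word = SPEECH_ROM_BITS / WORDS_STORED

print(f"{sample_period_us} us per sample, {cycles_per_sample} cycles per sample")
print(f"{speech_rom_bytes} bytes of speech ROM, ~{bits_per_word:.0f} bits per word")
```

Under these assumptions the control program has roughly 15 instruction cycles per sample, which is consistent with the paper's preference for long, powerful instructions over a faster clock.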

FIGURE 4-Microphotograph of the CMOS speech synthesis chip.
[Figure 2 field labels: (a) Unvoiced Data: amplitude information for zero crossing; number of samples for crossing; data for zero crossing. (b) Voiced Data: number of repeats; envelope slope (1 for +, 0 for -); first quantizing unit value; increase in quantizing unit value; ADPCM data (4).]

FIGURE 2-Condensed speech data format in data ROM.
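
The unvoiced fields listed for Figure 2 (a sample count between zero crossings plus two bits of amplitude information) suggest how the zero-cross data might be demodulated. The Python sketch below is one possible reading, not the chip's actual algorithm. The rectangular output shape, the polarity flip at each crossing, the `levels` lookup table, and the name `regenerate_unvoiced` are hypothetical.

```python
def regenerate_unvoiced(intervals, levels=(16, 32, 64, 127)):
    """Rebuild an unvoiced waveform from zero-crossing data.

    intervals: list of (n_samples, amp_code) pairs, where n_samples is the
    number of samples between two zero crossings and amp_code is the
    2-bit amplitude information.
    levels: hypothetical mapping from the 2-bit code to an 8-bit amplitude.

    Assumption: each interval is rendered as a flat segment whose polarity
    alternates at every zero crossing.
    """
    out = []
    polarity = 1
    for n_samples, amp_code in intervals:
        out.extend([polarity * levels[amp_code & 0x3]] * n_samples)
        polarity = -polarity
    return out


# Hypothetical data: three intervals with different lengths and amplitudes.
noise = regenerate_unvoiced([(2, 0), (3, 1), (1, 3)])
```

Only the interval lengths and a coarse amplitude are stored, which is why this representation costs far fewer bits than coding every sample.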

FIGURE 3-Block diagram of the CMOS speech synthesis chip.
