The influence of glottal excitation functions on the quality of synthetic speech


Material Information

Title:
The influence of glottal excitation functions on the quality of synthetic speech
Physical Description:
vi, 168 leaves : ill. ; 28 cm.
Language:
English
Creator:
Yea, Jing-Jong, 1955-
Publication Date:

Subjects

Subjects / Keywords:
Speech synthesis   ( lcsh )
Speech synthesis -- Computer programs   ( lcsh )
Glottis   ( lcsh )
Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1983.
Bibliography:
Includes bibliographical references (leaves 165-167).
Statement of Responsibility:
by Jing-Jong Yea.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 000466462
notis - ACN0758
oclc - 11588425
System ID:
AA00003442:00001

Full Text









THE INFLUENCE OF GLOTTAL EXCITATION FUNCTIONS
ON THE QUALITY OF SYNTHETIC SPEECH














BY


JING-JONG YEA


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY


UNIVERSITY OF FLORIDA


1983













ACKNOWLEDGEMENTS


I would like to express my gratitude to my adviser, Dr. Donald G.

Childers, whose constant encouragement and advice have made the completion

of this dissertation possible. I would also like to thank Dr. G. P. Moore and

Dr. D. Hicks for their valuable suggestions in designing the listening

test and evaluating the quality of synthetic speech. Thanks are due to

Mr. Ralph Haskew for his excellent job in typing and to my fellow students

in the Mind-Machine Interaction Laboratory for their help in many ways.

Finally, I would like to thank my parents for motivating my curiosity

for knowledge and my wife for her love, encouragement, and support.













TABLE OF CONTENTS

                                                                     Page

ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . .   ii

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    v

CHAPTERS

1  INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . .     1

     Overview of Speech Synthesis . . . . . . . . . . . . . . . . .     3
     Research Problem . . . . . . . . . . . . . . . . . . . . . . .    11
     Proposed Research  . . . . . . . . . . . . . . . . . . . . . .    15

2  SPEECH ANALYSIS  . . . . . . . . . . . . . . . . . . . . . . . .    19

     Model of Speech Production . . . . . . . . . . . . . . . . . .    19
       Sound Sources  . . . . . . . . . . . . . . . . . . . . . . .    19
       The Vocal Tract Filter . . . . . . . . . . . . . . . . . . .    21
       Radiation Effect . . . . . . . . . . . . . . . . . . . . . .    22
     Pitch and Voicing Analysis . . . . . . . . . . . . . . . . . .    23
       The Cepstrum Pitch Detection Method  . . . . . . . . . . . .    23
       The Modified Autocorrelation Method  . . . . . . . . . . . .    26
       Comparison of the Two Algorithms . . . . . . . . . . . . . .    30
     Formant Analysis . . . . . . . . . . . . . . . . . . . . . . .    33
       Methods  . . . . . . . . . . . . . . . . . . . . . . . . . .    33
       Results and Discussion . . . . . . . . . . . . . . . . . . .    39
     Glottal Inverse Filtering  . . . . . . . . . . . . . . . . . .    40
     Analysis of a Sentence for Synthesis . . . . . . . . . . . . .    49

3  SPEECH SYNTHESIS . . . . . . . . . . . . . . . . . . . . . . . .    55

     A Cascade/Parallel Formant Synthesizer . . . . . . . . . . . .    55
     Synthesis Strategy . . . . . . . . . . . . . . . . . . . . . .    63
       Synthesis of Vowels  . . . . . . . . . . . . . . . . . . . .    63
       Synthesis of Consonants  . . . . . . . . . . . . . . . . . .    63
     The Glottal Excitation for the Formant Synthesizer . . . . . .    70
       Impulse Excitation Source  . . . . . . . . . . . . . . . . .    73
       The Glottal Volume Velocity Source . . . . . . . . . . . . .    73
       The Glottal Area Function Excitation Source  . . . . . . . .    77
     Synthesis of Sentences . . . . . . . . . . . . . . . . . . . .    89
       Synthesis of the Child's Sentence  . . . . . . . . . . . . .    89
       Synthesis of the Female's Sentence . . . . . . . . . . . . .    97
       Synthesis of the Male's Sentence . . . . . . . . . . . . . .   109

4  EVALUATION OF THE QUALITY OF SYNTHETIC SPEECH  . . . . . . . . .   116

     The Concept of Speech Quality  . . . . . . . . . . . . . . . .   116
     Design of the Listening Test . . . . . . . . . . . . . . . . .   119
     Results and Discussion . . . . . . . . . . . . . . . . . . . .   122
       Response of Listener Group One . . . . . . . . . . . . . . .   123
       Results for Listener Group Two . . . . . . . . . . . . . . .   126

5  CONCLUSION AND SUGGESTIONS FOR FURTHER RESEARCH  . . . . . . . .   133

APPENDICES

A  ILLUSTRATION OF THE SPEECH ANALYSIS AND SYNTHESIS PROGRAMS . . .   140

B  AN ALGORITHM FOR DETECTING NASAL SOUNDS  . . . . . . . . . . . .   157

C  LISTENING TEST SETTING . . . . . . . . . . . . . . . . . . . . .   163

LIST OF REFERENCES  . . . . . . . . . . . . . . . . . . . . . . . .   165

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . .   168













Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy


THE INFLUENCE OF GLOTTAL EXCITATION FUNCTIONS
ON THE QUALITY OF SYNTHETIC SPEECH


By

Jing-Jong Yea

August 1983

Chairman: Donald G. Childers
Major Department: Electrical Engineering

One of the factors affecting the quality of synthetic speech is the

improper modeling of the glottal excitation function. This includes the

use of an impulse or a stylized waveform as the glottal excitation, and

neglecting the effect of source-tract interaction.

The results of our research improve the quality of synthetic speech

by using a formant synthesis scheme which adopts a more accurate model of

the glottal excitation. This is accomplished by using a "glottal area"

excitation function. In this scheme, we use the glottal area function to

control the time-varying glottal impedance in an equivalent circuit of

the vocal system. The output of the circuit is a time-varying "glottal

volume velocity" function which includes the effect of source-tract inter-

action. This glottal volume velocity function is then used as the glottal

excitation of the formant synthesizer. We used this technique to synthe-

size sentences which sounded as if they were spoken, respectively, by a male,

a female, and a child. We used two other glottal excitation functions as







well to synthesize the same sentences for comparison purposes. The vocal

tract transfer functions are kept the same regardless of the forms of the

glottal excitation. The only difference is that one excitation waveform

is an impulse and the other a stylized waveform.

The sentences generated by the three glottal excitation functions are

compared in a formal listening test using the natural sentence as the

reference signal. The result of the listening test shows that the proposed

glottal excitation function can produce more natural sounding speech than

the other two excitation waveforms. Thus, the results of our study suggest

that source-tract interaction is an important factor for synthesizing high

quality speech.













CHAPTER 1
INTRODUCTION


Speech synthesis has a long and interesting history. The first

speech synthesizer was built by von Kempelen in 1791 [1]. His speech

synthesizer was mechanical in nature and capable of producing only a few

vowels. At that time the interest in speech synthesis was academic or

for entertainment purposes, which remained the case until the turn of this

century.

The modern speech synthesizer is electrical in nature. The evolution

of electronic technology coupled with the widespread use of computers

caused the interest in speech synthesis to assume a broader basis. There

are three types of applications for speech synthesis.

The first is academic interest in the physiology and acoustics of

speech production. The speech synthesizer is now a standard research tool

in phonetics and speech perception experiments because of the convenience

of controlling the parameters and reproducing the results.

The second concerns efficient coding of speech information for communi-

cation at a distance. In this application, the speech synthesizer is a

part of the speech analysis/synthesis system which reconstructs speech from

a set of parameters (Figure 1). In this way, the transmission bit rate

(bandwidth) can be greatly reduced (by roughly a factor of ten).

The third type of application is in computer voice response. Because

of the widespread use of computers, there is a need for a more natural means















[Figure: transmitted speech → Speech Analyzer → transmission channel (parameters) → Speech Synthesizer → received speech]

Figure 1. Speech Synthesizer as a part of a communication system.







of communication between human and machine. Speech synthesizers enable

the computer to communicate in terms of human speech. In this case, a set

of rules are stored in the computer memory to generate parameters which

in turn control a speech synthesizer to generate speech (Figure 2).

One common characteristic of speech synthesizers is that they require

only a small number of parameters to generate a large number of speech

samples. The price we have to pay for this data compression is that the

quality of synthetic speech is generally not comparable to that of human

speech. The quality of synthetic speech is generally considered to consist

of two factors:

(1) Intelligibility--does the synthetic speech convey the intended

message correctly?

(2) Naturalness--does the synthesized speech sound like human gene-

rated speech?

Many previous papers have been concerned with improving the intelligibility

of synthetic speech. The result is that synthetic speech is highly intel-

ligible but less natural. In this research, we are concerned with the

problem of improving the naturalness of synthetic speech. We are particularly

interested in investigating the effect of the glottal source waveform on the

quality of speech. Before we go into the details of the research problem,

let us first discuss the general background of speech synthesis.


Overview of Speech Synthesis


The term "speech synthesis" has been used vaguely in the literature as

"a procedure to produce or reproduce speech from some representation of

speech." In this sense, speech reproduction from such waveform coding

schemes as Pulse Code Modulation (PCM) and Delta Modulation (DM) can also

be viewed as a speech synthesis process. In this study, however, we shall

















[Figure: request → computer (stored rules generate parameters) → Speech Synthesizer → voice response]

Figure 2. Speech Synthesizer as a part of the computer voice response system.


adopt a more restricted view of speech synthesis. We will define the term

"speech synthesis" as "the procedure to produce speech from a parametric

model of speech." Under this definition, the speech synthesis schemes can

be classified in terms of their underlying model. Modern speech synthesizers

fall into three basic categories: articulatory synthesis, formant synthesis,

and linear prediction (LP) synthesis.

Articulatory synthesis is based on a physiological model of speech

production [1,2,3]. As shown in Figure 3, this model attempts to simulate

the mechanical motions of the articulators. The model then controls an

equivalent circuit of the vocal system (Figure 4) to produce speech. The

articulatory study is very useful for phonetic studies because it directly

relates the articulator movements with the resultant acoustic events. The

problems with articulatory synthesis are that a) knowledge about the artic-

ulator positions is not easily available for continuous speech [4], and b)

the computation is more complex than for the other two synthesis schemes.

These two difficulties have deferred the practical application of the artic-

ulatory synthesis in communication and computer voice response.

Formant synthesis is based on an acoustic model of speech production.

This approach capitalizes on the fact that the formants, or vocal tract

resonances, are the most important factors in deciding the content of speech.

The formant synthesizer consists of two parts (Figure 5): 1) a set of

resonance circuits which decides the speech content, and 2) an excitation

source which decides the voicing information. The resonant circuits can be

connected in serial (cascade formant synthesizer) or in parallel (parallel

formant synthesizer).

Both the formant and the voicing information can be directly measured

from the speech signal by digital speech processing techniques. This makes




























Figure 3. Articulatory model of the human vocal tract.





























[Figure: equivalent circuit of the vocal system, with elements determined by the vocal tract area A]

Figure 4. Equivalent circuit for the vocal system.




















[Figure: Voice Source and Noise Source, selected by the voicing decision, driving resonators F1 through F3 to produce speech s(n)]

Figure 5. Block diagram of formant synthesizer.






the formant synthesis approach suitable for applications in communication

and computer voice response.

Linear prediction (LP) synthesis is based on a mathematical model of

the speech signal. The basic assumption of this model is that the speech

signal is the output of a linear time invariant recursive filter (Figure 6).

This is equivalent to predicting the current sample of speech by a linear

combination of previous samples, hence the name linear prediction. It has

been shown that this model is exact for voiced speech [5], but only an ap-

proximation for unvoiced and nasal sound. Linear prediction analysis/synthesis

is becoming one of the most popular speech processing techniques because

the linear prediction coefficients (ak) can be obtained from the speech

signal by a very efficient algorithm. Thus, linear prediction analysis/

synthesis is finding many applications in different areas of speech research.

The disadvantage is that unlike formant or articulatory synthesis, there is

no direct physical correspondence for the linear prediction coefficients.

We have briefly discussed the three dominant speech synthesis strategies

with their advantages and disadvantages. The choice of a particular speech

synthesis strategy depends on the application on hand. In studying the

physiology and acoustics of speech production, the formant synthesis or the

articulatory synthesis should be used because they relate the speech to the

physical parameters directly. For communication applications, the LP

synthesis or formant synthesis should be used because of the ease of obtaining

the parameters from the speech signal. The LP analysis is particularly

suitable for this purpose because the analysis and synthesis can be imple-

mented in real time in this case. For computer voice response, all three

are suitable depending on the particular application with the LP and formant

synthesis techniques being more suitable for the limited vocabulary case,



















[Figure: linear prediction model, s(n) = Σ_{k=1}^{M} a_k s(n-k) + e(n)]

Figure 6. Block diagram of the linear prediction model of speech.







while the formant and articulatory synthesis techniques are more suitable

for unlimited text-to-speech synthesis.

In this study, we are interested in studying the effect of glottal

excitation on the quality of synthetic speech. Since the linear prediction

model is a mathematical representation, the concept of glottal excitation

is meaningless. This leaves us the choice of formant or articulatory syn-

thesis. In both of these two synthesis schemes, the glottal excitation

function is an explicit part of the synthesizer. Since formant synthesis

has the advantages of 1) computational efficiency and 2) ease of obtaining

the synthesis parameters directly from speech, we decided to use the formant

synthesis scheme in our study.


Research Problem

The purpose of any speech synthesis study is usually to improve the

quality of synthetic speech. It includes the improvement of intelligibility

and naturalness.

To date, hundreds of speech synthesis schemes have been proposed, but

few if any of them are able to generate natural sounding synthetic speech.

This difficulty in synthesizing natural sounding speech is caused

mainly by inadequacies in the modeling of the human vocal system. The ways

in which formant models usually differ from the human vocal system include

the following [6]:

(a) Nasalized vowels are either not treated differently from nonnasal-

ized vowels, or are generated using one additional pole-zero pair

added to the transfer function without any corresponding change

of formant bandwidth due to additional damping.

(b) No attempt is made to copy the natural glottal excitation waveform

for voiced speech. Usually a pulse with minimum-phase spectrum








shaping is used. Otherwise, a stylized glottal pulse is used

that approximates well to typical natural shapes in general

features, but not in details.

(c) The modification of formant frequencies and bandwidths by source-

tract interaction is ignored.

(d) Many consonant sounds are dealt with by special arrangements that

do not closely approximate the acoustic production system.

(e) Bandwidth limitations on control signals prevent very rapid changes

of amplitude (such as occur on stop burst), or formant frequency

(such as at certain consonant-to-vowel boundaries).

(f) Mixed excitation for voiced fricatives and stops is often not

provided.


Items a, d, e, and f are related to the intelligibility problem of

speech synthesis, while items b and c are concerned with the naturalness

of synthesis. Since we are mainly interested in the naturalness of synthetic

speech, we shall compare previous studies which addressed items b and c.

Rosenberg [7] first reported a study comparing the effect of different

glottal source waveforms on the naturalness of synthetic speech. He used a

pitch synchronous pole-zero analysis technique [8] to extract formant fre-

quencies, bandwidths and glottal volume velocity waveforms. He resynthesized

a set of utterances (words) using natural glottal volume velocity and some

idealized pulses. The results of his study showed that the listeners pre-

ferred a particular idealized pulse to the real glottal waveform.

Holmes [6] used a parallel formant synthesizer to synthesize sentences.

His goal was to generate speech indistinguishable subjectively from the

human generated speech. The glottal waveforms he used included the real







glottal waveform and idealized pulses suggested by Rosenberg. He also

incorporated the source-tract interaction by adjusting the formant bandwidth

during the glottal opening period. The results of an informal listening

test showed that 60% of his subjects preferred the natural glottal waveform

over the idealized waveform under the most critical listening conditions,

i.e., using earphones.

There is a contradiction between the above two studies. Rosenberg's

study showed that synthetic speech generated by an idealized waveform sounds

better than the synthetic speech generated by the real glottal volume veloc-

ity. Holmes' study, however, concluded the opposite. We have carefully

studied the above two papers and found that there are some differences which

may contribute to the contradiction.

(a) Rosenberg used a serial formant synthesizer while Holmes used a

parallel formant synthesizer. The comparison of the serial and

parallel formant synthesizers has been studied before [6]. In

general, the serial formant synthesizer is capable of generating

a natural spectrum envelope for vowels without any additional

controls. The disadvantage of the serial formant synthesizer is

that it is not very good in synthesizing consonants. The parallel

formant synthesizer, on the other hand, needs additional amplitude

control for each formant so that the spectrum envelope matches

that of natural speech. The advantage is that the parallel formant

synthesizer is suitable for generating both vowels and consonants.

(b) A more important distinction between the two studies is the assump-

tion about the dependency between the glottal source and the vocal

tract. Rosenberg assumed that the source and tract are independent

of each other and so the glottal waveforms should be similar for

all vowels. He then used the same glottal volume velocity for







synthesizing different words. This procedure was known to intro-

duce undesired quality to the synthetic speech [9]. Holmes, on

the other hand, incorporated source-tract interaction in the

synthesis by manually adjusting the formant bandwidths.


The above arguments seem to be in favor of Holmes' conclusion that the

real glottal volume velocity waveform is important for the generation of

natural sounding speech. Also, they indicate that the source-tract inter-

action is a significant factor in deciding the quality of synthetic speech [10].

Other studies [11,12] investigated the usefulness of the glottal area

function as the excitation for speech synthesizers. The glottal area function

is a measure of the opening of the glottis during voiced speech. It is ob-

tained from ultra high speed films of the glottis. The formant data were

extracted from digitized speech using the linear prediction analysis. The

speech (vowel) was re-synthesized using a volume velocity waveform which was

derived from the glottal area function without taking account of the source-

tract interaction. The results of this study are

(a) The synthetic vowel and its natural counterpart were distinguishable.

(b) The synthetic vowel did not sound natural.

(c) The real speech also suffered from a lack of naturalness. This

could be attributed to the abnormal condition under which the

speech was recorded (with the laryngeal mirror inserted into the

mouth).

According to the author, three important factors for the unnaturalness

of the synthetic speech were the interpolation of the glottal area (which

was required due to low camera speed); the absence of source-tract interaction;

and the lack of higher order formants, which contribute perceptually to

naturalness.






From the above discussion, it is clear that the source-tract inter-

action is an important factor for the generation of natural sounding speech.

Recently, there were a series of papers on the analysis of the glottal source

excitation and the source-tract interaction [13,14]. They all agreed that

the source-tract interaction affects the voice quality. In particular, they

studied the skewing of the glottal volume velocity waveform due to the

source-tract interaction and related it to the quality of voice. Another

aspect of source-tract interaction, such as the formant ripples found in the

inverse-filtered waveform [15], also influences the quality of speech. We

will refer to this latter aspect as the source-tract interaction later in

our discussion.

From the above discussion, we concluded that the glottal source waveform

and the source-tract interaction are both important to the generation of

natural sounding speech. Thus, it is important that we include these features

in the glottal source of a speech synthesizer.


Proposed Research


In this research, we propose to study the influence of the glottal

source function on the quality of synthetic speech. We do this by comparing

three types of glottal excitation functions:

(a) Impulse excitation with glottal shaping is an excitation source

which approximates the spectral characteristics of a real glottal

source excitation by a proper glottal shaping filter. This

excitation is similar to the glottal excitation for the LP syn-

thesizer.

(b) Idealized glottal volume velocity waveform was first proposed by

Rosenberg [7] and later used by Holmes [6] and Fant [13] in their

studies. This excitation function approximates both the spectral







and time-domain characteristics of the real glottal excitation.

We will discuss this excitation source in Chapter 3.

(c) The glottal area excitation source, unlike the glottal area

excitation used in the Nadal-Suris study [11], includes the

effect of source-tract interaction. This is done by transforming

the glottal area function into glottal volume velocity using an

equivalent circuit of the vocal system [15]. Thus, besides approxi-

mating the general spectral and time-domain characteristics of

the glottal source, this excitation also contains the details

of source-tract interaction.


The three types of glottal excitations are summarized in Figure 7.

The main difference between the first two glottal excitations is that the

first one only approximates the spectral characteristics of the glottal

source (i.e., a -12 dB/octave drop in the spectral envelope), while the

second excitation approximates both the spectral and time-domain character-

istics (triangular-like waveshape). The difference between the second and

third excitation function is that the second does not include the source-

tract interaction (formant ripples) while the third one does.

Except for the excitation functions, the formant synthesizer is similar

to Klatt's formant synthesizer [16]. We will use these three types of

excitation functions to synthesize sentences. Three sentences will be

synthesized. The synthesis parameters are extracted from sentences produced

by an adult male, an adult female, and a female child, respectively. In

this way, a wide range of pitch variations are covered so we can look into

the influence of pitch on the quality of synthesis.

The human generated sentences will be used as reference sentences in

a formal listening test to evaluate the quality of synthetic sentences.





















[Figure panels: IMPULSE EXCITATION; GLOTTAL VOLUME VELOCITY EXCITATION; GLOTTAL AREA EXCITATION]

Figure 7. Three types of glottal excitation functions used in this study.






The result of this study will contribute to our understanding of the

glottal source function. In particular, we will be able to learn the

relative importance of spectral characteristics, time-domain characteristics,

and source-tract interaction on the quality of speech. The study will also

contribute to our knowledge about the difference between male/female and

adult/child voices. Such knowledge will be useful for the construction of

speech synthesizers which are capable of generating natural sounding speech.















CHAPTER 2
SPEECH ANALYSIS


As we defined in Chapter 1, speech synthesis is a procedure to produce

speech from a parametric model. Thus, before synthesizing speech, we have

to obtain the parameters of the model. The process of obtaining the param-

eters from speech is called "speech analysis." Speech analysis is the

basis of all speech research. In this chapter we will discuss the analysis

of speech to obtain parameters necessary for formant synthesis. These

parameters include the pitch/voicing information, the formant frequencies

and bandwidths, and the glottal source waveform. Before we go into the

details of speech analysis, however, let us discuss the acoustic model of

speech production.


Model of Speech Production


Formant synthesis is based on the acoustic model of speech production

[17]. In this model, the speech signal is the response of the vocal tract

filter system to one or more sound sources. The source-filter theory of

speech production is exemplified by the equivalent circuit in Figure 8 in

which the voltage is the equivalent of the acoustic sound pressure and the

current is the equivalent of the volume velocity.

Sound Sources


There are two types of sound sources: the glottal sound source and the

fricative sound source. For the glottal sound source, Es is the lung pressure





















[Figure: source Es with internal impedance Zs driving the vocal tract (input impedance Zv), terminated by the radiation inductance and the radiation resistance R]

Figure 8. Equivalent circuit of speech production.







which pushes air through the vocal folds to make them vibrate and thus

create a modulated air flow. For the fricative sound, Es represents the

turbulent air pressure generated by pushing air through the vocal tract

constriction. The term Zs represents the internal impedance of the source

and the impedance of the cavities behind the source. The vocal tract input

impedance is represented by Zv. In the linear model of speech production

it is assumed that Zv is negligible compared to Zs, so the source volume

velocity is determined by Es and Zs only; i.e., the source and the vocal

tract can be treated separately. However, there is evidence from inverse

filtering experiments [18] that the glottal source volume velocity contains

small formant ripples due to the loading of the vocal tract input impedance.

This loading effect is referred to as the "source-tract interaction" in the

literature. It is believed that the source-tract interaction will affect

the quality of synthetic speech.

Since we are interested in synthesizing high quality speech, it is

important that we include the source-tract interaction in the glottal source

model.

Besides the source-tract interaction, the sound sources are characterized

by two parameters, voicing and pitch. The voicing parameter decides whether

the sound is voiced or fricative. The pitch, or fundamental frequency

parameter, decides the vibration period of the glottal sound source. Both

pitch and voicing can be determined from the speech signal as we will see in

the next section.


The Vocal Tract Filter


The vocal tract filter models the transmission between the source

volume velocity, Is, and the mouth (or nose for nasal sound) volume velocity,

Im. Since the human vocal tract cannot change too rapidly, the vocal tract








filter can be viewed as a linear time-invariant filter for a short period

of time (20 ms). In the production of vowels, the transmission character-

istics can be modeled as a cascade of resonant circuits (corresponding to

peaks in frequency response), referred to as formants in the literature.

In the production of nasals and fricatives, the transmission characteristics

have to be modeled as a combination of resonances and antiresonances (peaks

and valleys, respectively, in the frequency response), called formants and

antiformants.

Formants are the most important parameters of the vocal tract filter

because the human ears are more sensitive to peaks in the sound spectrum

[19]. Each formant is characterized by its frequency and bandwidth. For

the speech signal sampled at 10 KHz, the vocal tract filter can be repre-

sented by five formants for adult males, or four formants for adult females

and children. We will discuss the algorithm for extracting formants in the

third section of this chapter.

Antiformants are very important for synthesizing nasalized sounds.

However, there are no effective methods for estimating the antiformants. We

have developed a method for detecting the existence of nasal sounds which

is useful in our speech synthesis study. This algorithm can also be used as

an intermediate step for estimating the antiformant frequency. We will

present this algorithm later in Appendix B.

Radiation Effect


The transformation between the mouth volume velocity and sound pressure

at the listener's ear is called the radiation effect. This effect can be

approximated by a first order differentiation [19]. In the circuit of

Figure 8, the radiation effect is represented by a simple R-L circuit.
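
As a discrete-time illustration of this approximation (a sketch, not a model taken from the text), the radiated pressure can be computed as a first difference of the mouth volume velocity; the coefficient slightly below one is an illustrative assumption:

```python
import numpy as np

def radiation(mouth_flow, alpha=0.98):
    """Approximate lip radiation as a first-order differentiation:
    p(n) = u(n) - alpha * u(n-1). alpha is an illustrative choice."""
    u = np.asarray(mouth_flow, dtype=float)
    return u - alpha * np.concatenate(([0.0], u[:-1]))
```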







Pitch and Voicing Analysis


Pitch and voicing analysis is one of the most important problems in

speech processing. Because of its importance, many solutions to this

problem have been proposed. All of the proposed schemes have their limita-

tions, and it is safe to say that no presently available pitch detection

scheme can be expected to give perfectly satisfactory results across a wide

range of speakers, applications, and operating environments. The difficulty

of pitch/voicing detection (or for speech analysis as a whole) stems from

the fact that the speech signal is the convolution of the source waveform

and the filter response. The effect of the tract will then interfere with

the process of pitch/voicing determination, and vice-versa. Thus, most

pitch detection algorithms first attempt to separate the source and the filter

characteristics. We have implemented two algorithms for pitch/voicing

detection: the cepstrum method, and the modified autocorrelation method.

We will briefly discuss the principles of these two algorithms and then

compare their performance in terms of real speech data.


The Cepstrum Pitch Detection Method


This approach is accomplished in the frequency domain. The cepstrum,

defined as the power spectrum of the logarithm of the power spectrum, has

a peak corresponding to the pitch period of the voiced-speech segment being

analyzed [20]. Also, by detecting the presence or absence of the peak in

the expected pitch dynamic range, we can decide whether the speech segment

is voiced or unvoiced.

The way in which the cepstrum method separates the source and the

filter characteristics is explained as follows. Since the voiced speech is

the response of the vocal tract filter to the glottal volume velocity, we







can express the voiced speech as

s(t) = h(t) * u(t)                                            (1)

or

S(w) = H(w) U(w)                                              (2)

where h(t) is the vocal tract impulse response,

H(w) is the Fourier transform of h(t),

u(t) is the glottal volume velocity, and

U(w) is the Fourier transform of u(t).

The power spectrum of speech is the magnitude square of the Fourier trans-

form, and can be expressed as

|S(w)|^2 = |H(w)|^2 |U(w)|^2                                  (3)

One simple way to separate the contributions of the source and the tract is

to take the logarithm of |S(w)|^2, thereby changing the multiplication into

addition,

log|S(w)|^2 = log|H(w)|^2 + log|U(w)|^2                       (4)

and then take the Fourier transform, which gives

F[log|S(w)|^2] = F[log|H(w)|^2] + F[log|U(w)|^2]              (5)

The source and tract effects are now additive rather than multiplicative.

The importance of this can be explained with the assistance of Figure 9.

The effect of the vocal tract is to produce a "low frequency" spectral

envelope in the logarithm spectrum, while the periodicity of the glottal

source manifests itself as "high frequency" ripples in the logarithm spectrum.

Therefore, the spectrum of the logarithm power spectrum has a sharp peak

corresponding to the high frequency source ripples in the logarithm power

spectrum and a broader peak corresponding to the low-frequency formant





















[Figure panels: logarithm power spectrum with the VOCAL TRACT CHARACTERISTICS envelope (top); CEPSTRUM with the PITCH peak marked (bottom)]

Figure 9. Logarithm power spectrum (top) of a voiced speech segment
showing a spectral periodicity resulting from the pitch
periodicity of speech. The power spectrum of the logarithm
spectrum, or cepstrum (bottom), therefore has a sharp peak
corresponding to this spectral periodicity.







structure in the logarithm spectrum. We can make the peak corresponding to

the source periodicity more pronounced by taking a square, and hence obtain

the cepstrum.

The cepstrum of speech can be computed on a general purpose computer

using the Fast Fourier Transform (FFT) algorithm. The location and ampli-

tude of the peak can then be determined, and an algorithm can be used to

decide whether it corresponds to the pitch period or not.

The cepstrum pitch detection algorithm has been implemented on a

Data General NOVA 4 computer. The program is called "PITCH." Appendix A

has a sample dialog of using this program. We will compare the performance

of the cepstrum method with the modified autocorrelation method later in

this section.
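
To make the procedure concrete, the following Python sketch computes the cepstrum of one speech frame and searches for a peak in the expected pitch range. This is not a transcription of the PITCH program; the sampling rate, search range, and voicing threshold are illustrative assumptions.

```python
import numpy as np

def cepstrum_pitch(frame, fs=10000, fmin=60.0, fmax=500.0):
    """Estimate the pitch period (samples) of one frame via the cepstrum,
    defined here as the power spectrum of the log power spectrum."""
    frame = frame * np.hanning(len(frame))         # taper to reduce leakage
    spectrum = np.abs(np.fft.rfft(frame)) ** 2     # |S(w)|^2
    log_spec = np.log(spectrum + 1e-12)            # log|S(w)|^2
    cep = np.abs(np.fft.irfft(log_spec)) ** 2      # "spectrum of log spectrum"

    # Look for a peak in the quefrency range of the expected pitch period.
    lo = int(fs / fmax)                            # shortest period, samples
    hi = min(int(fs / fmin), len(cep) - 1)         # longest period, samples
    k = lo + np.argmax(cep[lo:hi])
    voiced = cep[k] > 4.0 * np.mean(cep[lo:hi])    # crude illustrative threshold
    return k, voiced
```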

The Modified Autocorrelation Method


The modified autocorrelation method is a time domain approach for pitch

detection [21]. This method first removes the formant structure (vocal

tract characteristics) by a technique called center clipping. The autocor-

relation of the center clipped speech is then used to determine the pitch

period and voicing information.

The center clipped speech signal is obtained in the manner illustrated

in Figure 10. A segment of speech to be used in pitch detection is shown

in the upper plot. For this segment, the maximum amplitude, Amax, is found

and the clipping level, CL, is set equal to a fixed percentage of Amax.

From Figure 10 it can be seen that for samples above CL, the output of the

center clipper is equal to the input minus the clipping level. For samples

below the clipping level the output is zero. The center clipped speech is

shown in the lower plot of Figure 10. It can be seen that the formant



















[Figure panels: input speech with the clipping levels +CL and -CL marked relative to Amax (top); center clipped speech (bottom)]

Figure 10. Illustration of the process of center clipping.







oscillation is removed from the center clipped speech. Thus the autocorre-

lation function is free from the interference of the formant oscillations.

This is very important for the purpose of pitch detection, because formant

oscillations manifest themselves as peaks in the autocorrelation function

and sometimes these peaks are greater than the peak due to the pitch

periodicity. Thus the simple procedure of picking the largest peak in the

autocorrelation function fails in these cases.

Another difficulty with the autocorrelation method is that a large

amount of computation is required even for the center clipped speech. A

simple modification of the center clipping function leads to great simpli-

fication of the autocorrelation function with essentially no degradation

in pitch detection. This modification is shown in Figure 11. As indicated

there, the output is +1 if the input is greater than CL, and -1 if the input

is less than -CL. Otherwise, the output is zero. This clipping function

will be called a three level center clipper.

The computation of the autocorrelation function for a three level center

clipped signal is particularly simple. If we denote the output of a three

level center clipper as y(n), then the product terms y(n+m) y(n+m+k) in

the autocorrelation function

R(k) = Σ_{m=0}^{N-k-1} y(n+m) y(n+m+k)                        (6)

can have only three different values

y(n+m) y(n+m+k) =  0, if y(n+m) = 0 or y(n+m+k) = 0

                = +1, if y(n+m) = y(n+m+k) ≠ 0

                = -1, if y(n+m) ≠ y(n+m+k), both nonzero      (7)















[Figure: input-output characteristic mapping inputs above +CL to +1, inputs below -CL to -1, and inputs in between to 0]

Figure 11. Illustration of the 3-level center clipper.






Thus, all that is required is some simple combinatory logic and increment/

decrement instead of multiplications.

We have implemented the modified autocorrelation pitch detection

algorithm in a computer program "AUTOC." A three level center clipper with

clipping level set to 65% of maximum value is used. The autocorrelation

is computed and the peak location decided. If the peak value exceeds a

threshold value, then the speech segment is voiced; otherwise, it is

unvoiced. The threshold value is usually chosen as 30% of R(0).
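
The following Python sketch mirrors the procedure just described: a three-level center clipper at 65% of the frame maximum, an autocorrelation of the clipped signal, and a voicing decision at 30% of R(0). It is a sketch rather than a transcription of AUTOC; the pitch search range is an illustrative assumption, and np.correlate is used for clarity even though the ±1 products need no true multiplications.

```python
import numpy as np

def autoc_pitch(frame, fs=10000, fmin=60.0, fmax=500.0):
    """Modified autocorrelation pitch detector (three-level center clipping)."""
    cl = 0.65 * np.max(np.abs(frame))              # clipping level CL
    y = np.zeros(len(frame))
    y[frame > cl] = 1.0                            # +1 above +CL
    y[frame < -cl] = -1.0                          # -1 below -CL

    n = len(y)
    r = np.correlate(y, y, mode="full")[n - 1:]    # R(0), R(1), ...

    lo, hi = int(fs / fmax), min(int(fs / fmin), n - 1)
    k = lo + np.argmax(r[lo:hi])                   # candidate pitch period
    voiced = r[k] > 0.3 * r[0]                     # 30% of R(0) threshold
    return k, voiced
```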


Comparison of the Two Algorithms


We have tested both algorithms on real speech data. One typical result

is shown in Figure 12. The speech utterance is a sentence "We were away a

year ago" by a child subject. The silence interval is represented by a

pitch period value of -5. The unvoiced speech segment is represented by a

pitch period value of 0. Otherwise, the speech segment is voiced with the

pitch period shown.

One interesting feature of this subject's speech is that there are

sudden jumps in pitch either at the beginning or at the end of voicing

intervals. This phenomenon is evidenced by the speech waveform in Figure 13.

The top figure shows the speech waveform at the beginning of voicing. The

bottom figure shows the speech waveform at the end of voicing.

Now let us compare the pitch period contours in Figure 12. We can see

that the two algorithms give the same result over a large portion of voiced

speech. The major difference between the two algorithms occurs at the

beginning or the end of voicing intervals where the pitch period has changed

suddenly. We see that the modified autocorrelation method has followed the

change in pitch period, while the cepstrum method did not follow the pitch

change and labeled the voiced segment as unvoiced.































[Figure: pitch period contours from the two algorithms versus time for the child's sentence]

Figure 12. Comparison of the performance of the two pitch detection
algorithms.














[Figure panels: speech waveform at voicing onset (top) and at the end of voicing (bottom), each with the pitch break marked]

Figure 13. Speech waveform at the beginning (top) and the end (bottom) of
voicing showing the "pitch break" phenomena.







The problem with the cepstrum method is that it is a frequency domain

approach. When there is a rapid change in pitch, the ripples in the log-

arithm spectrum are weakened, and so is the pitch peak in the cepstrum.

Thus the pitch period is usually not correctly determined.

Because of the difficulty of the cepstrum method in the above situations,

we decided to use the modified autocorrelation method in this research.


Formant Analysis
Methods

Formant data are the most important parameters in distinguishing dif-
ferent speech sounds. Thus formant analysis is useful not only for speech

synthesis purposes but also for speech recognition. The two techniques

being widely used at present for formant estimation are based on cepstral

analysis [22] and linear prediction [23,24]. We have implemented the linear

prediction formant analysis algorithm because it offers the advantage of

minimal computation and maximal accuracy in formant estimation. The algo-

rithm we implemented is based on McCandless' algorithm [24].

Linear prediction analysis is a way to estimate the transfer function

of the vocal tract filter. According to the theory of linear prediction, a

sample of speech can be predicted from a linear combination of M previous

samples of speech,
s(n) = Σ_{k=1}^{M} a_k s(n-k) + e(n)                          (8)
where e(n) is the error of prediction. Expressed in terms of the Z-transform,

the prediction equation becomes

S(z) = Σ_{k=1}^{M} a_k S(z) z^(-k) + E(z)                     (9)







and the transfer function between S(z) and E(z) is

H(z) = S(z)/E(z) = 1 / (1 - Σ_{k=1}^{M} a_k z^(-k)) = 1/A(z)  (10)

where the denominator polynomial A(z) = 1 - Σ_{k=1}^{M} a_k z^(-k) is usually called
the predictor polynomial.

Because the basic property of linear prediction requires that the error
sequence, e(n), has a flat spectral envelope, the function H(z) can provide
a good estimate of the transfer function of the vocal tract filter. There
are two ways of estimating the formants using H(z). One is to compute the
frequency response of H(z),


|H(e^(jwT))| = 1 / |A(e^(jwT))| = 1 / |1 - Σ_{k=1}^{M} a_k e^(-jwkT)|   (11)

where T is the sampling period of the speech. The peaks in the frequency
response, |H(e^(jwT))|, can then be determined and used as estimates of the formant
locations. Another approach is to find the poles of H(z), or equivalently
the roots of the polynomial A(z). The frequencies and bandwidths of the
complex pole pairs will then correspond to the frequencies and bandwidths
of the formants. To be more specific, the formant frequency (F) and band-
width (B) corresponding to the complex pole P is given by

F = Im(log P) / (2πT)   and   B = -Re(log P) / (πT)           (12)


where Im(·) and Re(·) represent the imaginary part and real part, respectively.
The block diagram of formant estimation is given in Figure 14.
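
As an illustration of the root-extraction path, the following Python sketch fits the predictor polynomial A(z) by the autocorrelation method and converts each complex pole into a formant frequency and bandwidth using Equation (12). The predictor order, the windowing, and the direct solve of the normal equations are illustrative assumptions (Levinson recursion would be the efficient choice mentioned in the text).

```python
import numpy as np

def lp_coefficients(frame, order=12):
    """Autocorrelation-method LP coefficients a_1..a_M (Equation 8)."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz normal equations; a direct solve suffices for a sketch.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])      # s(n) ~ sum a_k s(n-k)

def formants_from_roots(a, fs=10000):
    """Frequencies and bandwidths from the roots of A(z) = 1 - sum a_k z^-k."""
    T = 1.0 / fs                                   # sampling period
    roots = np.roots(np.concatenate(([1.0], -a)))  # coefficients of A(z)
    roots = roots[np.imag(roots) > 0]              # one of each conjugate pair
    F = np.angle(roots) / (2 * np.pi * T)          # F = Im(log P) / (2*pi*T)
    B = -np.log(np.abs(roots)) / (np.pi * T)       # B = -Re(log P) / (pi*T)
    idx = np.argsort(F)
    return F[idx], B[idx]
```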















[Figure: LP analysis of the speech frame, followed by peak picking or root extraction and a decision block that assigns peaks (or poles) to formants]

Figure 14. Block diagram of formant estimation.

Notice that in Figure 14 there is a block labeled "decision" before

the formant estimates are obtained. This is because the number of spectral

peaks (or complex pole pairs of H(z)) usually is not equal to the number

of formants. In the case of the root extraction method, the number of

complex pole pairs exceeds the number of formants because additional poles

are needed to account for the source characteristics. In the peak-picking

method, merging of closely-spaced peaks can also cause a problem in formant

estimation. Thus the additional decision logic is needed to assign peaks

(or poles) to formants.

The most common way of assigning peaks (or poles) to formants is using

the "continuity" constraint. This means the formant frequencies of adjacent

speech frames (usually defined as 20 ms of speech) cannot differ by a large

amount because of the physical constraint of the human vocal system. Thus

if the formant estimations of the neighboring speech frame are available,

the formant estimates of the current segment can be obtained by assigning

the peak (or pole) to the formant that is closest in frequency. This is

the essence of the McCandless algorithm.

The details of the McCandless algorithm are as follows. We will discuss


the algorithm for the peak-picking method only.


The algorithm for the root


extraction method is very similar, except that the bandwidth value will be

utilized rather than peak amplitude.

Since the continuity constraint is used to assign peaks to formants,

we have to make sure that the initial estimates of formant locations are

correct. Otherwise, the formant trajectory may follow a wrong track. The

correctness of the initial formant estimates is guaranteed by starting the

processing at the stationary portion of a vowel (called anchor point), and

then branching outward in both directions using the most recent formant

frequency estimates as the next reference. The anchor points can be defined







automatically using the voicing and intensity information. In our imple-

mentation, however, the anchor points are defined by the users. This

makes the program more flexible, and thus more accurate results will be

obtained.

Once the anchor point is defined, the analysis will proceed in the

manner shown in Figure 15. Processing of the backward branch is begun at

the next anchor point and continued until an unvoiced frame is encountered,

or until a frame is encountered which has already been processed by a

previous forward branch. Then the forward branch from the same anchor point

is begun and continued until an unvoiced frame is encountered, or until a

new subdivision boundary is reached. At this point, processing jumps to

the next anchor point and begins again with a backward branch, and so forth,

until the processing is complete. Notice that the unvoiced region is not

processed.

At each frame, peaks in the spectral envelope, Pi's, must be mapped into

formants, Fi's. This is done by the following steps.

Step 1: Fill Slots. Fill each formant slot Si, i = 1 to 4, with the

best candidate Pj by placing the peak Pj closest in frequency to the estimate

ESTi into slot i.

Step 2: Remove Duplicates. If the same peak Pj fills more than one

slot Si, keep it only in the slot Sk which corresponds to the estimate ESTk

closest in frequency, and remove it from any other slots.

Step 3: Deal with Unassigned Peaks. If all peaks Pj have been assigned

to formant slots, go to Step 4. Otherwise try to fill empty slots with

values not assigned in Step 1 as follows:

(a) If there is an unassigned peak Pk, and an unfilled slot Sk,

fill the slot with the peak and go to Step 4. If Pk is unassigned, but

slot Sk is filled, check the amplitude (amp) corresponding to Pk as follows:





















































if amp(Pk) < amp(peak assigned to Sk) throw Pk away and go to Step 4.

Otherwise, go to (b).

(b) If Pk is unassigned, and Sk+1 is unfilled, move the peak in Sk

to Sk+1 and put Pk in Sk. Go to Step 4.

(c) If Pk is still unassigned, but Sk-1 is unfilled, move the peak

in Sk to Sk-1, and put Pk in Sk. Go to Step 4. If (a), (b), and (c) all
fail, ignore Pk.

Step 4: Deal with Unfilled Slots. If S1, S2, and S3 are all filled,

go to Step 5. (F4 may or may not be filled.) Otherwise, recompute the

spectrum on a circle with radius less than unity to hopefully separate

merged peaks. Fetch the peaks and go to Step 1.

Step 5: Update Estimates. Accept formant slot contents as formant

estimates for this frame, i.e., Fi = Si, i = 1, 2, 3. Also, use formant

slot contents as estimates for next frame, i.e., ESTi = Si, i = 1, 2, 3, 4.

(If a slot is empty, keep the original formant estimate for that formant.)

There will still be the possibility that a formant slot has not been

filled or that the formant value is grossly out of line for one or several

frames. Thus, a final smoothing procedure is needed to resolve these

situations. We will not go into the details of final smoothing here.
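
A minimal Python sketch of Steps 1 and 2 under an assumed data layout (peak frequencies and per-slot estimates as plain arrays); Steps 3 through 5 and the final smoothing are omitted for brevity.

```python
import numpy as np

def fill_slots(peaks, estimates):
    """Assign spectral peaks to formant slots by frequency continuity.

    peaks:     frequencies of the peaks found in this frame
    estimates: formant estimates ESTi carried over from a neighboring frame
    Returns one entry per slot: an assigned peak frequency or None.
    """
    peaks = np.asarray(peaks, dtype=float)
    slots = [None] * len(estimates)

    # Step 1: fill each slot with the peak closest to its estimate.
    for i, est in enumerate(estimates):
        if len(peaks):
            slots[i] = float(peaks[np.argmin(np.abs(peaks - est))])

    # Step 2: if one peak fills several slots, keep it only in the slot
    # whose estimate is closest in frequency.
    for p in set(s for s in slots if s is not None):
        owners = [i for i, s in enumerate(slots) if s == p]
        if len(owners) > 1:
            best = min(owners, key=lambda i: abs(p - estimates[i]))
            for i in owners:
                if i != best:
                    slots[i] = None
    return slots
```

For example, fill_slots([310, 2290, 3010], [300, 870, 2240, 3300]) assigns 310 to the first slot, 2290 to the third, and 3010 to the fourth, and leaves the second slot empty for Steps 3 and 4 to handle.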

Results and Discussion

Both the root extraction method and the peak picking method have been

implemented as computer programs. In the peak picking method, the formant

frequency and amplitude data are estimated. The root extraction method,

on the other hand, analyzes the formant frequency and bandwidth. The

formant amplitude information is useful in parallel formant synthesis. The

formant bandwidth information is useful in cascade formant synthesis.








The two programs were used to analyze one test sentence. The results

are shown in Figure 16. The sentence is "It is a bird." We see that the

two methods give similar results. Since the root-extraction method obtains

the formant bandwidth information which is important for synthesizing high

quality speech, we will use this method in our research.


Glottal Inverse Filtering


According to the acoustic model of speech production, the speech signal

is the output of a linear time-invariant filter. The model of speech pro-

duction is drawn in Figure 17 in block diagram form. Here we restrict our

attention to voiced sounds, so the excitation to the vocal tract filter is

the glottal volume velocity.

According to linear system theory, it is possible to reverse the process

of speech generation to obtain the glottal volume velocity. This process is

called glottal inverse filtering. The glottal volume velocity waveform can

then be used in formant synthesis for high quality speech. Another appli-

cation of the glottal volume velocity waveform is detection of laryngeal

pathology because the glottal volume velocity is affected by the condition

of the larynx [25].

The process of glottal inverse filtering is illustrated in Figure 18.

The speech signal is first passed through a filter whose transfer function

is the inverse of the vocal tract filter, then through an integrator which

offsets the radiation effect to obtain the glottal volume velocity.

As we discussed before, the vocal tract filter can be modeled as a

cascade of resonant circuits for vowel sounds. The nth resonant circuit

has a transfer function (in terms of Z-transform)
















[Figure: formant frequency contours versus time for the test sentence, one panel per method (PEAK-PICKING METHOD, top; ROOT-EXTRACTION METHOD, bottom)]

Figure 16. Comparison of results of two formant tracking algorithms.


[Figure: glottal volume velocity exciting the vocal tract filter and radiation model to produce speech s(n)]

Figure 17. Model of speech production.


[Figure: speech s(n) passed through the inverse vocal tract filter and an integrator to recover the glottal volume velocity Is(n)]

Figure 18. Block diagram of glottal inverse filtering.







Vn(z) = [1 - 2e^(-πBnT) cos(2πFnT) + e^(-2πBnT)] / [1 - 2e^(-πBnT) cos(2πFnT) z^(-1) + e^(-2πBnT) z^(-2)]     (13)

where Fn is the nth formant frequency and Bn is the nth formant bandwidth.

Notice that the numerator constant is chosen such that the DC gain is 0 dB.

The overall transfer function for the vocal tract filter is the product of

these terms,

V(z) = K Π_{n=1}^{M} Vn(z)                                    (14)

where M is the number of formants needed to account for the transmission

characteristics of the vocal tract filter in the frequency range of interest,

and K is a gain constant. The transfer function V(z) has poles only, so the

vocal tract filter for the vowels is an all-pole filter. The transfer func-

tion of the inverse vocal tract filter is given by

A(z) = 1/V(z) = (1/K) Π_{n=1}^{M} [1 - 2e^(-πBnT) cos(2πFnT) z^(-1) + e^(-2πBnT) z^(-2)] / [1 - 2e^(-πBnT) cos(2πFnT) + e^(-2πBnT)]     (15)

Thus, the inverse vocal tract filter is an all zero filter which is guaran-

teed to be stable.
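
The following Python sketch builds the inverse filter of Equation (15) section by section and, following Figure 18, integrates the residue to offset the radiation effect. scipy.signal.lfilter is assumed available; the slightly leaky integrator (leak = 0.99) is an illustrative assumption that keeps low-frequency drift bounded, in the spirit of the 60 Hz high-pass filtering discussed later.

```python
import numpy as np
from scipy.signal import lfilter

def glottal_inverse_filter(speech, formants, bandwidths, fs=10000, leak=0.99):
    """Recover the glottal volume velocity by inverse filtering (Figure 18)."""
    T = 1.0 / fs
    residue = np.asarray(speech, dtype=float)
    for F, B in zip(formants, bandwidths):
        c1 = 2 * np.exp(-np.pi * B * T) * np.cos(2 * np.pi * F * T)
        c2 = np.exp(-2 * np.pi * B * T)
        gain = 1.0 - c1 + c2                       # unity DC gain (Eq. 13)
        # One section of A(z): (1 - c1 z^-1 + c2 z^-2) / (1 - c1 + c2)
        residue = lfilter([1.0, -c1, c2], [gain], residue)
    # Leaky integration to undo the first-order radiation differentiation.
    volume_velocity = lfilter([1.0], [1.0, -leak], residue)
    return volume_velocity, residue
```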

In order to construct the inverse vocal tract filter, we have to know

the formant frequencies and bandwidths. For the glottal inverse filtering

problem, the estimation of the formant frequencies and bandwidths should

be handled with great caution. The integration operation in the

glottal inverse filtering process will amplify small errors in the inverse

vocal tract filter and result in a distorted glottal volume velocity waveform.

An accurate method of estimating the formant frequencies and bandwidths is
called the closed-phase analysis [26]. This method applies the LP analysis







in an interval of speech where the glottis is closed (or the glottal volume

velocity is zero). Since there is no excitation during this interval, the

linear prediction model is exact and so is the analysis result. The defi-

nition of the closed-phase interval and its relationship to the speech wave-

form is illustrated in Figure 19. Notice that the largest negative peak

in the speech waveform usually occurs at the instant the glottal volume

velocity becomes zero. Thus, closed-phase analysis can be applied to the

speech signal after this instant. The roots of the predictor polynomial

obtained by the closed-phase analysis are then used as an estimate of

formant frequencies and bandwidths.

The inverse filtering procedure discussed above is applied to a real

speech signal. Figure 20 shows a typical result. The vowel is /OW/ as in

"rose" spoken by a female subject. The LP analysis is applied to the

speech segment from 140 samples to 170 samples which is right after the

glottal closing point. The roots of the predictor polynomial are shown

in the middle of the figure. Except for the roots with frequency equal to

0 and 3193.10 Hz, all the other roots correspond to formants. We used these

formants to construct the inverse filter. The bottom figure shows the

results of inverse filtering. The solid line is the glottal volume velocity

waveform and the dotted line is the output (residue) of the inverse vocal

tract filter.

There are two things worth noting in the glottal waveform. They are

(a) The flat portion of the glottal volume velocity is not located

at the absolute zero. This is because the differentiation effect intro-

duced by radiation has destroyed the zero reference for the volume velocity

waveform.

(b) The flat portion of the glottal volume velocity is about 10 samples

(1 ms) long, while we have used 30 samples in the closed-phase analysis.





























Figure 19. The definition of the closed-phase interval and its
relationship to the speech waveform.






























Figure 20. An example of inverse filtering. The speech waveform is shown
in the top figure. The pole locations are shown in the middle.
The bottom figure shows the inverse-filtered waveforms.













[Figure panels: speech waveform (top); pole frequencies and bandwidths (middle, reproduced below); inverse-filtered waveforms (bottom), with the glottal volume velocity drawn as a solid line]

POLE    FREQUENCY (HZ)    BANDWIDTH (HZ)
  1           0.00            6593.98
  2           0.00            2030.36
  3        3193.10            2673.08
  4        3193.10            2673.08
  5        -669.85             191.58
  6         669.85             191.58
  7       -3747.16             440.46
  8        3747.16             440.46
  9        1608.64             176.04
 10        1608.64             176.04
 11        2697.32             127.10
 12        2697.32             127.10







Thus the assumption of closed-phase analysis is not satisfied. This situa-

tion is usually found in female or child speech. In some cases, the glottal

volume velocity waveform will be distorted; in other cases (as in the

example) a good volume velocity waveform can still be obtained.

As discussed before, the glottal inverse filtering procedure is very

sensitive to low frequency distortion because of the integration operation.

One source of distortion is the phase distortion introduced by the tape

recorder. Another type of distortion is the 60 Hz noise due to imperfect

grounding. Our solution to both problems is to filter out the frequency com-

ponents below 60 Hz completely.

The glottal volume velocity waveform can be used as the excitation

function for a formant synthesizer. Also, other types of excitation func-

tions can be derived from the glottal volume velocity function. We will

discuss the details of constructing the glottal excitation functions in

Chapter 3.


Analysis of a Sentence for Synthesis


Since the main objective of this research is speech synthesis, we

will conclude this chapter by discussing an example of how we analyze a

sentence to extract the necessary parameters for synthesis. The speech

analysis procedure is shown in the block diagram in Figure 21. The

necessary parameters for speech synthesis are the intensity, the pitch/

voicing information, the formant frequency/bandwidth, and the glottal volume

velocity. Except for the intensity, the analysis of the other parameters

has been discussed in the previous sections. The intensity is, by defi-

nition, the root-mean-square (rms) value of the speech. Thus, the intensity

contour for a sentence can be obtained by computing the rms value of every















[Figure 21 block diagram: the speech is segmented and passed to three analysis stages: pitch detection, yielding the pitch/voicing information; formant analysis, yielding the formants {Fn} and bandwidths {Bn}; and inverse filtering, yielding the glottal volume velocity.]

Figure 21. Summary of the speech analysis procedure.







10 ms segment of speech. To smooth out the discontinuity in the intensity

contour due to the positioning of the speech segment, we have applied a 30 ms Hanning window* to the speech segment. The Hanning window is centered

around the speech segment as illustrated in Figure 22.
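As a concrete sketch (our own illustration, with the 10 kHz sampling rate assumed as elsewhere in this work), the intensity contour computation looks roughly like this:

```python
# Intensity contour: rms of every 10 ms segment, smoothed by weighting
# each segment with a 30 ms Hanning window centered on it.
import numpy as np

def intensity_contour(speech, fs=10000, step_ms=10, win_ms=30):
    step, win = int(fs * step_ms / 1000), int(fs * win_ms / 1000)
    w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(win) / win))  # Hanning
    half = (win - step) // 2
    padded = np.pad(speech.astype(float), half)  # keeps windows centered
    contour = []
    for start in range(0, len(speech) - step + 1, step):
        seg = padded[start:start + win] * w
        contour.append(np.sqrt(np.mean(seg ** 2)))  # rms value
    return np.array(contour)
```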

The analysis procedure is applied to a sentence "It is a bird." This

sentence was used in our previous studies [27,28]. The dialog for using the analysis program is shown in Appendix A. The analysis results are

shown in Figure 23. They are, from top to bottom, the intensity contour,

the pitch period contour, and the formant frequency contours. Comparing the intensity contour with the pitch period contour, we find that the low-intensity portions match the unvoiced/silence regions. This reflects the fact that fricatives are usually of lower intensity than vowels. Another

feature of the pitch period contour is that there is a break in pitch at

the end of the sentence. This phenomenon is also observed in the spectro-

gram in Figure 24, where the spacing between the vertical striations is

proportional to the pitch period.

The formant frequency contours are shown at the bottom of Figure 23; they are similar to the spectrogram in Figure 24.

The formant bandwidths are also estimated. We did not show them here

because the formant bandwidth estimation is usually less accurate than the

formant frequency estimation. Also, the formant bandwidth is not guaranteed

to be continuous, so the bandwidth contours may jump up and down and cross

each other. This makes it difficult to interpret the formant bandwidth

information. In our synthesis, we usually have to correct the formant band-

width values by applying the closed-phase analysis.

*The Hanning window is defined by W(n) = (1/2)(1 − cos(2πn/L)) for 0 ≤ n ≤ L, and W(n) = 0 otherwise.
















[Figure 22 sketch: a 30 ms Hanning window centered over a 10 ms speech segment.]

Figure 22. Application of the Hanning window for estimating the intensity contour.















[Figure 23 panels: INTENSITY CONTOUR (top); PITCH PERIOD CONTOUR (middle), with unvoiced and silence regions marked; FORMANT CONTOUR (bottom), showing F1, F2, and F3; time axes 0 to 1.2 s, aligned with the words "IT IS A BIRD".]

Figure 23. Intensity and pitch contours (top) and formant contours (bottom) of the sentence "It is a bird."



























[Figure 24 image: spectrogram, time axis 0.1 to 1.1 s.]


Figure 24. Spectrogram of the sentence "It is a bird."















CHAPTER 3
SPEECH SYNTHESIS


This chapter concerns the synthesis of natural-sounding speech with different types of glottal excitation functions. In the first section we discuss the speech synthesizer configuration, i.e., a cascade/parallel formant synthesizer. We will illustrate the basic principles of speech

synthesis by synthesizing simple vowels and consonants. The next section

is devoted to the discussion of different glottal excitation models and

the derivation of these models from speech. The third section describes

the speech data used in this research, i.e., three different sentences

produced by a male, a female, and a child, respectively. These sentences

were analyzed to extract parameters necessary for our synthesis study.

The sentences were then resynthesized using different glottal excitation

functions. We will discuss the difficulties we encountered in synthesizing

each sentence. We also compare the differences between male, female, and

child speech. Finally, the spectrograms of the natural and synthetic speech

are compared.


A Cascade/Parallel Formant Synthesizer

The formant synthesizer used in our research is based on the cascade/

parallel formant synthesizer by Klatt [16]. The Klatt synthesizer was first

implemented on the Nova 4 computer by B. George for her master's thesis [29].

Later we modified the glottal excitation portion of the original synthesizer

which will be discussed in detail in the next section. For the present,






we are going to discuss the overall configuration of the synthesizer and

the basic principles of synthesizing speech.

The block diagram of the cascade/parallel formant synthesizer is

shown in Figure 25. The two major components of the speech synthesizer

are the source excitation and the vocal tract filter. There are three

source excitations: the voicing source for voiced sounds such as vowels,

the aspiration source for synthesizing the aspirative, and the frication

source for synthesizing the fricatives such as /s/, /f/, etc. The aspi-

ration source and frication source are simply random noise generators.

The voicing source is more complex and will be discussed in the next

section.

The vocal tract filter consists of two portions. The first is a

cascade resonator branch which represents the vocal tract transfer function

for laryngeal sources (voicing source and aspiration source). It has been

shown by Fant [30] that for the cascade connection, the relative amplitudes

of formant peaks for vowels will come out just right without the need for

individual amplitude controls. The details of the cascade branch are shown

in Figure 26, where R1 to R5 represent the five formant resonators. Notice

that there is an additional RNZ-RNP pair which represents a resonance/anti-

resonance pair for synthesizing nasalized sounds. As for the non-nasal

sounds, the frequency of RNZ is set equal to the frequency of RNP, so the

two cancel each other.

The second portion is a parallel resonator branch which represents

the vocal tract transfer function for the frication source. The detailed

configuration is shown in Figure 27. Notice that each resonator has an

amplitude control which adjusts the relative amplitude of the formant peak.

As illustrated in Equation 16, the summation of the two all-pole functions

will create additional zeros, so the summation of the resonator transfer















[Figure 25. Block diagram of the cascade/parallel formant synthesizer: the voicing, aspiration, and frication sources feed the cascade and parallel vocal tract branches.]



























input -> RNZ -> RNP -> R1 -> R2 -> R3 -> R4 -> R5 -> output

Figure 26. Configuration of the cascade branch.





















[Figure 27 diagram: the input feeds the amplitude controls and resonators in parallel (including the sixth formant resonator R6 and the AB bypass path); the branch outputs are summed to form the output.]

Figure 27. Configuration of the parallel branch.






1/(z - p1) + 1/(z - p2) = [2z - (p1 + p2)] / [(z - p1)(z - p2)]          (16)

functions will create antiresonances. Thus with proper amplitude controls,

the parallel branch is capable of approximating any rational transfer

function. This makes the parallel branch suitable for synthesizing frica-

tives which have both poles and zeros in their transfer functions. Other

features of the parallel branch are the addition of a 6th formant resonator

(R6) and a bypass path with amplitude control AB. The 6th formant frequency

is fixed at 4900 Hz for the synthesis of very high frequency noise in [s,z].

The bypass path is present because the transfer functions for [f, v, θ, ð, p, b]

contain no prominent peaks, and the synthesizer should include a means of

bypassing all the resonators to produce a flat transfer function.
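To make Equation 16 concrete, the following throwaway check (our own illustration) adds two one-pole transfer functions as polynomial ratios and confirms that the combined numerator has a zero at the midpoint of the two poles:

```python
# Verify Eq. 16: summing two one-pole sections creates a zero at (p1+p2)/2.
import numpy as np

p1, p2 = 0.9, 0.5                      # two example pole locations
# 1/(z-p1) + 1/(z-p2) = [(z-p2) + (z-p1)] / [(z-p1)(z-p2)]
numerator = np.polyadd(np.poly([p2]), np.poly([p1]))   # 2z - (p1+p2)
print(np.roots(numerator))             # -> [0.7], i.e., (p1+p2)/2
```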

Besides the sound sources and the vocal tract filter, a first order

difference filter is needed to simulate the effect of the radiation. This

completes our discussion of the cascade/parallel formant synthesizer. The detailed

configuration of the synthesizer is shown in Figure 28, and the parameters

are summarized in Table 1.

The cascade/parallel formant synthesizer has been implemented as an

interactive speech synthesis program "HANDSY3" on a Nova 4 minicomputer.

The program provides the following functions:

1) Reading parameter values from an analysis file,

2) Modifying parameter values interactively on a graphics terminal, and

3) Accepting parameter values entered from the keyboard.

Thus the program can be used in both formant vocoder studies and in synthesis-

by-rule studies. The dialog for using the interactive speech synthesis

program is listed in Appendix A with a graphic plot illustrating the para-

meter modification process.












































Table 1. List of control parameters for Klatt's formant synthesizer.


N V/C Sym Name Min Max Typ

1 V AV Amplitude of voicing (dB) 0 80 0
2 V AF Amplitude of frication (dB) 0 80 0
3 V AH Amplitude of aspiration (dB) 0 80 0
4 V AVS Amplitude of sinusoidal voicing (dB) 0 80 0
5 V FO Fundamental freq. of voicing (Hz) 0 500 0
6 V F1 First formant frequency (Hz) 150 900 450
7 V F2 Second formant frequency (Hz) 500 2500 1450
8 V F3 Third formant frequency (Hz) 1300 3500 2450
9 V F4 Fourth formant frequency (Hz) 2500 4500 3300
10 V FNZ Nasal zero frequency (Hz) 200 700 250

11 C AN Nasal formant amplitude (dB) 0 80 0
12 C Al First formant amplitude (dB) 0 80 0
13 V A2 Second formant amplitude (dB) 0 80 0
14 V A3 Third formant amplitude (dB) 0 80 0
15 V A4 Fourth formant amplitude (dB) 0 80 0
16 V A5 Fifth formant amplitude (dB) 0 80 0
17 V A6 Sixth formant amplitude (dB) 0 80 0
18 V AB Bypass path amplitude (dB) 0 80 0
19 V B1 First formant bandwidth (Hz) 40 500 50
20 V B2 Second formant bandwidth (Hz) 40 500 70

21 V B3 Third formant bandwidth (Hz) 40 500 110
22 C SW Cascade/parallel switch O(CASC) 1(PARA) 0
23 C FGP Glottal resonator 1 frequency (Hz) 0 600 0
24 C BGP Glottal resonator 1 bandwidth (Hz) 100 2000 100
25 C FGZ Glottal zero frequency (Hz) 0 5000 1500
26 C BGZ Glottal zero bandwidth (Hz) 100 9000 6000
27 C B4 Fourth formant bandwidth (Hz) 100 500 250
28 V F5 Fifth formant frequency (Hz) 3500 4900 3750
29 C B5 Fifth formant bandwidth (Hz) 150 700 200
30 C F6 Sixth formant frequency (Hz) 4000 4999 4900

31 C B6 Sixth formant bandwidth (Hz) 200 2000 1000
32 C FNP Nasal pole frequency (Hz) 200 500 250
33 C BNP Nasal pole bandwidth (Hz) 50 500 100
34 C BNZ Nasal zero bandwidth (Hz) 50 500 100
35 C BGS Glottal resonator 2 bandwidth 100 1000 200
36 C SR Sampling rate 5000 20000 10000
37 C NWS Number of waveform samples per chunk 1 200 50
38 C GO Overall gain control (dB) 0 80 47
39 C NFC Number of cascaded formants 4 6 5






Synthesis Strategy


Now let us discuss the principles of synthesizing speech using the

cascade/parallel formant synthesizer. We will illustrate these principles

by synthesizing simple vowels and consonants. The synthesis of connected

speech (words, sentences) will be considered later.


Synthesis of Vowels


The parameters that are usually varied to generate an isolated vowel

are the amplitude of voicing (AV), the fundamental frequency of vocal fold

vibrations (FO), the lowest three formant frequencies (Fl, F2, and F3) and

bandwidths (Bl, B2, and B3). The fourth and fifth formant frequencies may

be varied to simulate the spectral details, but this is not essential for

good intelligibility.

The control parameters for synthesizing the vowel /i/ are shown in

Table 2. The length of the utterance is 300 ms. The amplitude of the

voicing source (AV) is set to about 60 dB for a stressed vowel, and falls gradually by 5 dB near the end of the syllable. The fundamental frequency

(F0) follows a linear contour falling from 130 to 100 Hz. The first three

formant frequencies are fixed at 310 Hz, 2020 Hz, and 2960 Hz, respectively.

The first three formant bandwidths are also fixed at 45 Hz, 200 Hz, and

400 Hz, respectively. Figure 29 shows the waveform and the spectrum of

the synthetic vowel.
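A minimal sketch of this vowel synthesis in Python follows. It uses the standard second-order digital resonator (as in Klatt-style synthesizers) for each cascaded formant and a simple impulse train at F0 as a stand-in voicing source; the gain handling is simplified relative to the real program.

```python
# Sketch: synthesize a steady vowel with a cascade of digital resonators.
import numpy as np
from scipy.signal import lfilter

def resonator(x, f, bw, fs=10000):
    """Second-order resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    C = -np.exp(-2 * np.pi * bw / fs)
    B = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * f / fs)
    A = 1 - B - C                                  # unity gain at dc
    return lfilter([A], [1, -B, -C], x)

fs, dur = 10000, 0.3
f0 = np.linspace(130, 100, int(fs * dur))          # falling pitch contour
phase = np.cumsum(f0) / fs
source = np.diff(np.floor(phase), prepend=0.0)     # impulse at each pitch period
speech = source
for f, bw in [(310, 45), (2020, 200), (2960, 400)]:   # Table 2 values
    speech = resonator(speech, f, bw)
speech = np.diff(speech, prepend=0.0)              # radiation (first difference)
```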


Synthesis of Consonants


English consonants can be further categorized into fricatives, plosives,

and nasals. The nasal consonant is voiced, while the fricatives and plosives

can be either voiced or voiceless.
















TABLE 2. Parameter Values for Vowel /i/



F1 310 Hz B1 45 Hz

F2 2020 Hz B2 200 Hz

F3 2960 Hz B3 400 Hz


AV 60 dB

FO 130 Hz
















[Figure 29 panels: WAVEFORM OF VOWEL /i/ (amplitude vs. time in samples) and SPECTRUM OF VOWEL /i/ (amplitude in dB vs. frequency in Hz).]

Figure 29. Waveform (top) and spectrum (bottom) of the synthetic vowel /i/.






The nasal consonants are synthesized using the cascade branch.

Besides the parameters used in synthesizing vowels, an additional pole-

zero pair (FNP-FNZ) is added to the cascade branch for synthesizing nasal

consonants or nasalized vowels. The pole-zero pair is called the nasal

formant and antiformant. The nasal formant frequency is usually fixed

at 270 Hz. The antiformant frequency is varied according to the degree

of nasalization. For nasal consonants, the antiformant frequency is

very close to the first formant frequency, so that the first formant is

almost cancelled. For nasalized vowels, the antiformant frequency is

equal to the mean value of the nasal formant frequency and the first

formant frequency. The control parameters for synthesizing a nasal

consonant /n/ are shown in Table 3. The amplitude of voicing, AV, is

set at 55 dB. The fundamental frequency, FO, is set to 100 Hz. The

first three formant frequencies are 480 Hz, 1340 Hz, and 2470 Hz. The

first three formant bandwidths are 40 Hz, 300 Hz, and 300 Hz, respec-

tively. The nasal formant frequency, FNP, is 270 Hz. The nasal anti-

formant frequency is 450 Hz. Figure 30 shows the waveform and spectrum

of the nasal consonant /n/. Notice that the nasal consonant is character-

ized by a dominant nasal formant peak in the spectrum.
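The nasal zero itself is produced by an antiresonator. A minimal sketch (following the usual construction in Klatt-style synthesizers, with our own variable names) obtains the antiresonator by inverting the resonator's transfer function:

```python
# Sketch: antiresonator (spectral zero) as the inverse of a resonator.
import numpy as np
from scipy.signal import lfilter

def antiresonator(x, f, bw, fs=10000):
    """Notch at frequency f with bandwidth bw: the FIR inverse of the
    corresponding second-order resonator (coefficients A, B, C)."""
    C = -np.exp(-2 * np.pi * bw / fs)
    B = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * f / fs)
    A = 1 - B - C
    # Inverting A / (1 - B z^-1 - C z^-2) gives the FIR filter below.
    return lfilter([1 / A, -B / A, -C / A], [1.0], x)

# Example (Table 3 frequencies, 100 Hz bandwidths assumed from Table 1):
# x = antiresonator(x, 450, 100)   # nasal zero FNZ
# x = resonator(x, 250, 100)       # nasal pole FNP, resonator as sketched above
```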

The voiceless fricatives are synthesized using the parallel branch

because the vocal tract transfer function contains both poles and zeros.

The frication source, AF, is used to synthesize fricatives. The formants

excited by the frication source are determined by the amplitude controls

A2, A3, A4, A5, A6, and AB. The voiced fricatives are synthesized using both

the voicing source, AV, and the frication source, AF. The control parameters

for synthesizing a voiceless fricative /s/ are shown in Table 4.














TABLE 3. Parameter Values for Nasal Consonant /n/





F1 480 Hz B1 40 Hz

F2 1340 Hz B2 300 Hz

F3 2470 Hz B3 300 Hz



FNZ 450 Hz

FNP 250 Hz

AV 55 dB

F0 100 Hz





































[Figure 30 panels: waveform (amplitude vs. time in samples) and spectrum (amplitude in dB vs. frequency in Hz).]

Figure 30. Waveform (top) and spectrum (bottom) of the synthetic nasal consonant /n/.
















TABLE 4. Parameter Values for Fricative Consonant /s/




F1 320 Hz B1 200 Hz A2 0 dB

F2 1390 Hz B2 80 Hz A3 0 dB

F3 2530 Hz B3 200 Hz A4 0 dB

F4 3300 Hz A5 0 dB

F5 3750 Hz A6 52 dB

F6 4900 Hz


AF 40 dB







The amplitude of frication is 40 dB. Among the amplitude controls, A6 is set to 52 dB; the rest are set equal to 0 dB because the fricative /s/ mainly

consists of high frequency noise. The waveform and spectrum of the syn-

thetic speech are shown in Figure 31. Notice that there are no pitch

harmonics in the spectrum due to the noise-like property of the fricatives.

If we combine the fricative /s/, the vowel /i/, and the nasal /n/ with

proper formant transitions, the resultant speech is the syllable /sin/.

The spectrogram of the syllable /sin/ is shown in Figure 32.

Another class of consonant sounds is the plosives. The plosives are

characterized by a strong sudden burst of noise and very fast formant tran-

sitions. Klatt used a step function to simulate the sudden burst of noise.

We found that this will introduce a dc bias in the speech which may be

perceived as low frequency noise. So, instead of using a step function,

we simply change the frication amplitude, AF, from 0 to 50 dB in a short

time interval (say 5 ms). This will create a sudden burst of noise without

any dc bias problem. The identity of the plosives is determined primarily

by the formant transition patterns. Thus, to synthesize a plosive consonant

we must know the formant frequencies of both the plosive and the following

vowel.
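As a small illustration of this burst strategy (our own sketch; the 5 ms ramp and the dB values come from the discussion above), a noise burst whose amplitude rises over 5 ms, rather than stepping, avoids the dc bias problem:

```python
# Sketch: a noise burst whose amplitude rises over 5 ms instead of
# stepping instantaneously, avoiding the dc bias of a step function.
import numpy as np

def noise_burst(fs=10000, ramp_ms=5.0, hold_ms=10.0, peak_db=50.0):
    rng = np.random.default_rng(0)
    ramp = np.linspace(0.0, 1.0, int(fs * ramp_ms / 1000))  # 5 ms rise
    hold = np.ones(int(fs * hold_ms / 1000))                # sustained burst
    envelope = np.concatenate([ramp, hold]) * 10 ** (peak_db / 20)
    return envelope * rng.uniform(-1.0, 1.0, envelope.size) # zero-mean noise
```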

We have discussed the synthesis of only a few types of consonants.

There are other types of consonants such as aspiratives, affricates, etc.

The basic principles of synthesizing these consonants are similar to what

we have discussed above. So we will not pursue this matter further.

The Glottal Excitations for the Formant Synthesizer

The objective of this research is to study the influence of glottal

excitation functions on the quality of speech. So, it is worthwhile spending

some time to discuss the glottal excitation functions. Three types of glottal
















[Figure 31 panels: WAVEFORM OF FRICATIVE /s/ (amplitude vs. time in samples) and SPECTRUM OF FRICATIVE /s/ (amplitude in dB vs. frequency in Hz).]

Figure 31. Waveform (top) and spectrum (bottom) of the synthetic fricative /s/.


























[Figure 32 image: spectrogram, time axis 0.1 to 0.6 s.]

Figure 32. Spectrogram of synthesized utterance /sin/.








excitation functions are used in this research. They are

1) Impulse excitation

2) Glottal volume velocity (Fant's model)

3) Glottal area function (Guerin's model).

Impulse Excitation Source


What we call the "impulse excitation" is actually the glottal excitation of Klatt's original synthesizer, as shown in Figure 28. The impulse is

first shaped by a glottal shaping filter before exciting the vocal tract

filter. The block diagram of the glottal shaping filter is shown in Figure 33.

The impulse train is filtered by RGP, a low pass filter. The low pass filter

has a double pole with a bandwidth equal to 100 Hz. This gives the excitation

function a spectrum that falls off smoothly at approximately -12 dB per

octave above 50 Hz. The waveform thus generated does not have the same

phase spectrum as a typical glottal pulse, nor does it contain spectral zeros

of the kind that often appear in natural voicing.

The antiresonator RGZ (the glottal zero of Table 1) is used to modify the detailed shape of the spectrum

of the excitation function with greater precision than would be possible

using only a single low pass filter. It is clear from the above discussion

that the impulse excitation source is a model of the spectral characteristics

of the actual glottal source.
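A sketch of this source in Python (our own illustration; the double pole with 100 Hz bandwidth follows the description above, realized as two cascaded one-pole sections):

```python
# Sketch: impulse train shaped by a double-pole (100 Hz bandwidth) low-pass,
# i.e., a resonator at f = 0 Hz applied twice, giving about -12 dB/octave.
import numpy as np
from scipy.signal import lfilter

def impulse_source(f0=100.0, dur=0.3, fs=10000, bw=100.0):
    n = int(fs * dur)
    x = np.zeros(n)
    x[::int(fs / f0)] = 1.0                    # impulse train at F0
    r = np.exp(-np.pi * bw / fs)               # real pole from bandwidth
    for _ in range(2):                         # double pole (RGP twice)
        x = lfilter([1 - r], [1, -r], x)       # one-pole low pass, unity dc gain
    return x
```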

The Glottal Volume Velocity Source


The second glottal excitation function is the glottal volume velocity.

As discussed in Chapter 2, the glottal volume velocity function can be

derived from the speech signal by the inverse filtering process. This is

the ideal glottal excitation function according to the model of speech

production. But it is impractical to use this time varying glottal volume







velocity for the following reasons. First, the glottal inverse filtering

is very sensitive to the low frequency distortion which often occurs in the

recording process. Second, it is too costly to extract the glottal volume

velocity pitch synchronously because of the root extraction and the poly-

nomial multiplication needed in the inverse filtering process.

Thus we have to use a single time-invariant glottal volume velocity

waveform. Again, we cannot use the glottal volume velocity waveform derived

by the inverse filtering process to synthesize sentences because the glot-

tal volume velocity derived by inverse filtering contains formant ripples

due to the effect of source-tract interaction. So the glottal volume velocity

waveform differs from one vowel to another. If we use the glottal volume

velocity derived from a vowel to synthesize another vowel, the synthetic

speech may have an undesirable quality [9].

So the glottal volume velocity waveform used to synthesize a sentence

must not contain any formant ripples (or source-tract interaction). One

simple waveform which approximates the general characteristics of the glottal

volume velocity and yet does not have formant ripples is given by


U(n) = (A/2)[1 - cos(πn/T1)],          0 ≤ n ≤ T1
     = A cos[π(n - T1)/(2 T2)],        T1 < n ≤ T1 + T2
     = 0,                              T1 + T2 < n < T0
The waveform is shown in Figure 34, where A is the maximum volume velocity,

T0 is the pitch period, T1 is the duration of the opening phase, and T2 is

the duration of the closing phase. Rosenberg [7] first used this waveform

as glottal excitation for a formant synthesizer and showed that the resultant

synthetic speech had good quality. Fant [13] also used this waveform as



















glottal shaping filter:  H(z) = z² / (z − e^(−0.01π))²

Figure 33. Block diagram of the impulse excitation source.























[Figure 34 sketch: the waveform rises over the opening phase (0 to T1), falls over the closing phase (T1 to T1 + T2), and stays at zero over the closed phase (to T0).]

Figure 34. An idealized glottal volume velocity waveform.






a model of the glottal volume velocity and showed that it resembled the

general waveshape of the actual glottal volume velocity.

Now all we have to do is determine the waveform parameters A, T1,

and T2. These parameters can be measured from the actual glottal volume

velocity waveform.

Since the model waveform also has a -12 dB/oct fall in spectral

envelope, it provides a good approximation to both the time and spectral characteristics of the actual glottal volume velocity.
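A minimal sketch of this pulse generator (our reading of the piecewise model above, with T1 and T2 as the opening- and closing-phase durations in samples):

```python
# Sketch: Rosenberg-style glottal pulse, one pitch period of T0 samples.
import numpy as np

def glottal_pulse(A, T0, T1, T2):
    n = np.arange(T0, dtype=float)
    u = np.zeros(T0)
    opening = n <= T1
    closing = (n > T1) & (n <= T1 + T2)
    u[opening] = (A / 2) * (1 - np.cos(np.pi * n[opening] / T1))
    u[closing] = A * np.cos(np.pi * (n[closing] - T1) / (2 * T2))
    return u                                   # zero during the closed phase

# e.g. a 100 Hz pulse train at 10 kHz: np.tile(glottal_pulse(1.0, 100, 40, 16), 30)
```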


The Glottal Area Function Excitation Source


Rosenberg's glottal waveform model provides a good approximation of

the actual glottal volume velocity except for the source-tract interaction

(formant ripples). How can we incorporate the source-tract interaction in

the glottal excitation model? In this section we discuss one possible

approach--through the glottal area function.

The glottal area function measures the opening of the glottis during

the production of voiced speech. Figure 35 shows a picture of the vocal

folds; the area of the dark portion defines the glottal area function.

Since the glottal area function is a measure of vocal fold mechanical move-

ments, it is believed to be unaffected by the acoustic interaction between

the glottal source and the vocal tract. Thus the glottal area function

should remain the same over the course of a sentence. In this sense, a

single glottal area function can be used to represent the glottal source

characteristics over the course of a sentence.

The question is, how do we transform the glottal area function into the

glottal volume velocity which includes the effect of source-tract interaction?

This transformation is accomplished through an impedance circuit (Figure 36)

proposed by Guerin et al. [15]. The circuit represents the input port of


















































Figure 35. Picture of the human vocal folds.



















[Figure 36 circuit: the subglottal pressure source Ps drives the glottal impedance (Rk, Rv, Lg) in series with two parallel RLC sections (R1 L1 C1 and R2 L2 C2) representing the first two formant loads; Ug is the glottal volume velocity.]

Figure 36. Impedance circuit for computing the glottal volume velocity.






the vocal tract filter shown in Figure 8. The vocal tract input impedance

for each formant is modeled as a parallel RLC circuit. Since only the
first two formant loading effects are significant, there are two RLC

circuits in the impedance circuit. The glottal impedance is a function of
glottal area (Ag) and glottal volume velocity (Ug). The experimental value
for the glottal impedance has been obtained by van den Berg et al. [31].
The constant voltage source (Ps) represents the subglottal pressure. When

the subglottal pressure, the formant frequencies, and the glottal area

function are known, we can derive the glottal volume velocity (Ug) by solving

the following set of differential equations.


Ps = Ug(t) [Rk + Rv] + Lg dUg(t)/dt + P1 + P2

P1 = L1 dU1(t)/dt

Ug(t) = U1(t) + P1/R1 + C1 dP1(t)/dt

P2 = L2 dU2(t)/dt

Ug(t) = U2(t) + P2/R2 + C2 dP2(t)/dt

Since Rk, Rv, and Lg are functions of Ag and Ug, this is a set of nonlinear, time-varying differential equations. So the glottal volume

velocity can only be solved by using numerical methods. We have used the

Runge-Kutta method to solve the equations [32]. Figure 37 shows a compari-
son of the glottal area function and the derived glottal volume velocity.
The most prominent difference is that formant ripples are introduced into the glottal volume velocity. This phenomenon agrees with observations from inverse filtering experiments.
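A sketch of this computation in Python follows. It integrates the five equations with a state vector (Ug, P1, U1, P2, U2) using SciPy's Runge-Kutta solver; the glottal impedance function below is only a placeholder (the actual van den Berg forms depend on glottal geometry constants not repeated in this section), as are the numeric defaults.

```python
# Sketch: derive the glottal volume velocity Ug from a glottal area
# function Ag(t) via Guerin's impedance circuit (state: Ug, P1, U1, P2, U2).
import numpy as np
from scipy.integrate import solve_ivp

def simulate(Ag, Ps, fs, R, L, C, Lg=1e-3):
    """R, L, C are two-element arrays for the first two formant loads."""
    def glottal_R(ag, ug):                  # Rk + Rv: placeholder form only
        ag = max(ag, 1e-6)
        return 1.0 / ag**2 + 0.5 * abs(ug) / ag**2

    def rhs(t, y):
        Ug, P1, U1, P2, U2 = y
        ag = np.interp(t, np.arange(len(Ag)) / fs, Ag)
        dUg = (Ps - Ug * glottal_R(ag, Ug) - P1 - P2) / Lg
        dP1 = (Ug - U1 - P1 / R[0]) / C[0]  # current balance at load 1
        dU1 = P1 / L[0]
        dP2 = (Ug - U2 - P2 / R[1]) / C[1]  # current balance at load 2
        dU2 = P2 / L[1]
        return [dUg, dP1, dU1, dP2, dU2]

    t_end = (len(Ag) - 1) / fs
    sol = solve_ivp(rhs, (0, t_end), np.zeros(5), method="RK45",
                    t_eval=np.arange(len(Ag)) / fs)
    return sol.y[0]                          # Ug(t)
```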
Now that we realize that the glottal excitation function can be
derived from the glottal area function, let us see how we obtain the glottal
































[Figure 37 plot: glottal area and derived glottal volume velocity vs. time, 0 to 8 ms.]

Figure 37. Comparison of the glottal area function and the glottal volume velocity derived from the glottal area function.







area function. There are two approaches for measuring the glottal area

function. One approach is to measure the glottal area directly, the other

is to estimate the glottal area function from the speech signal.

The technique we used to directly measure the glottal area is called

ultra high speed cinematography [25]. The picture of the vocal folds, as

shown in Figure 35, was taken at a frame rate of 5,000 frames per second.

The glottal area for each frame is measured by an interactive computer

image processing system. The glottal area as a function of frames is shown

in Figure 38. The measured glottal area function is sampled 5,000 times

per second, while the sampling rate of the speech signal is 10 kHz; thus

a 1:2 interpolation has to be applied to the glottal area function to syn-

chronize it with the speech signal. The disadvantage of the direct method

is that it is an invasive method and cannot be applied to continuous speech,

so we have to rely on the indirect method.

The indirect approach for obtaining the glottal area function is to

estimate the glottal area function from the glottal volume velocity. This

is the inverse of the transformation from the glottal area function to the

glottal volume velocity.

Unfortunately, the transformation is a nonlinear process, so the

inverse transformation cannot be carried out. An alternate way is minimum

mean-square error fitting, as illustrated in Figure 39. The glottal volume velocity, U', for a stylized glottal area function is computed and compared with the glottal volume velocity, U, obtained by inverse filtering the

speech signal. The mean squared error between the spectra of the two

normalized glottal volume velocities is computed as

E = (1/M) Σ_{m=1}^{M} [U(m) − U'(m)]²





























[Figure 38 plot: glottal area vs. time in frames, 0 to 140 frames.]

Figure 38. Glottal area function obtained by the high speed filming technique.














[Figure 39 diagram: a stylized glottal area function is transformed to a glottal volume velocity, which is compared against the inverse-filtered glottal volume velocity to find the minimum error.]

Figure 39. The procedure for estimating the glottal area function from speech.







where U'(m) is the discrete Fourier transform (DFT) of the normalized volume velocity, U'g/E'g, with E'g being the energy of the waveform U'g; similarly, U(m) is the DFT of the normalized volume velocity, Ug/Eg.

The errors are calculated for a set of stylized glottal area functions

which are obtained by varying the opening and closing phase of a model

waveform shown in Figure 40. The stylized waveform corresponding to the

minimum error is chosen as the best estimate of the glottal area function.

Figure 41 shows a three dimensional plot of the inverted error function

versus the opening phase and the closing phase of the glottal area function.

The peak in this plot corresponds to the minimum error or the best estimate

of the glottal area function. The bottom of Figure 41 shows the comparison

of the inverse-filtered waveform and the model-generated waveform.
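A compact sketch of this fitting loop (our own illustration; `area_model` builds the stylized waveform of Figure 40 from candidate opening/closing phases, and `derive_ug` stands in for the circuit solver sketched earlier; both are hypothetical helpers):

```python
# Sketch: grid search for the stylized glottal area function whose derived
# volume velocity best matches the inverse-filtered one (spectral MSE).
import numpy as np

def spectral_mse(u, u_ref):
    """Mean squared error between energy-normalized DFT magnitudes."""
    U = np.abs(np.fft.rfft(u / np.sqrt(np.sum(u ** 2))))
    Uref = np.abs(np.fft.rfft(u_ref / np.sqrt(np.sum(u_ref ** 2))))
    return np.mean((U - Uref) ** 2)

def fit_area_function(u_ref, opening_grid, closing_grid, derive_ug, area_model):
    best = (np.inf, None)
    for t_open in opening_grid:
        for t_close in closing_grid:
            ag = area_model(t_open, t_close, len(u_ref))  # stylized Ag
            err = spectral_mse(derive_ug(ag), u_ref)      # circuit + compare
            if err < best[0]:
                best = (err, (t_open, t_close))
    return best   # minimum error and its opening/closing phases
```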

One last parameter we are going to discuss in this section is the

subglottal pressure, Ps. The subglottal pressure is the driving force for

the glottal air flow, as shown in Figure 36. During the production of a

sentence, the subglottal pressure is not constant. We need to know the

time variation of Ps for the simulation of Guerin's circuit. Previous

studies [33,34,35] have shown that the subglottal pressure is related to

both the intensity and the pitch of speech. For medium pitch phonation,

the intensity of speech is roughly proportional to the 3.3 ± 0.7 power of the subglottal pressure. On the other hand, the pitch increases about 2.5 Hz with each 1 cm-H2O increase in subglottal pressure. Thus, it seems that the

subglottal pressure is a nonlinear function of both intensity and pitch of

speech. Fortunately, it was found from two-mass model simulations of vocal cord motion that the fundamental frequency mainly depends on the

vocal cord tension [36]. Thus, we adopted a simple rule for estimating

the subglottal pressure from the sound intensity (I)


























[Figure 40 sketch: stylized glottal area waveform parameterized by its opening phase and closing phase.]

Figure 40. The model of the glottal area function.




[Figure 41 top panel: three-dimensional plot of a typical error function for fitting the model-generated waveform to the glottal volume velocity obtained by inverse filtering, plotted against the opening phase and closing phase. Bottom panel: glottal volume velocity vs. time, 0 to 9 ms; solid line: inverse filtered; dotted line: Guerin's model.]

Figure 41. The three-dimensional plot of the inverted error function (top). Comparison of the inverse-filtered waveform and the model-generated waveform (bottom).






Ps = 10^((I + C)/40)   (cm-H2O)


where C is a constant. This relation has not been proved theoretically, but it is workable and seems to produce good results when synthesizing speech.


Synthesis of Sentences


The only way to show that one speech synthesis algorithm works better

than others is by comparing the quality of the synthesized speech. There

is a problem in evaluating the quality of vowels because the vowels are

meaningless, while the listener usually tries to associate a meaning with what he hears. Therefore, we decided to synthesize sentences for the

purpose of speech quality evaluation.

We have synthesized three sentences for this research. One sentence

is "We were away a year ago" by a female child. The second sentence is

"The boy was there when the sun rose" by an adult female. The third sen-

tence is "We were away a year ago" by an adult male. In this section we

are going to discuss the details of synthesizing the above three sentences

case by case. We will point out the important principles of speech

synthesis involved. We will also discuss the differences between child and adult speech and between male and female speech.


Synthesis of the Child's Sentence


The child subject is STA. The sentence "We were away a year ago" was

used as a test sentence in several previous speech analysis-synthesis

experiments. This sentence is used widely in speech research since it is

characterized by fast formant transitions which cause problems in both

analysis and synthesis. Thus the success of any algorithm with this sen-

tence is an indication of its usefulness.







The most distinct feature of a child's speech is its high pitch or

fundamental frequency. In this particular case, the fundamental frequency

was 250 Hz (a pitch period of 4 ms). The pitch period contour was

extracted using the modified autocorrelation method (see Figure 42). Note

the sudden jump in pitch period (called pitch break) both at the beginning

and the end of voicing. This is caused by the switch of pitch control

mechanisms from the subglottal pressure to vocal cord tension, and vice

versa. The intensity contour is superimposed on the pitch period contour

as a dotted line.

Another characteristic of the child's speech is that only four formants

were found in the frequency range of interest, i.e., 0-5 kHz. This is due to the shorter vocal tract length of children; thus, the average frequency spacing between formants becomes larger, resulting in fewer formants. The formant contours for the entire sentence are shown in Figure 43. Notice

that the formants (especially the second formant) are changing very rapidly

due to the semivowels /w/, /r/, and /j/.

One problem we encountered in estimating the formant frequencies of

the child's speech is that the first formant estimation is strongly influenced

by the harmonics of the fundamental frequency. Because of the high-pitched

characteristics of the child's speech, the error in the first formant fre-

quency estimation can easily exceed 100 Hz (or 20% error). According to

Flanagan [19] this amount of error will induce a noticeable change in the

quality of speech, e.g., the just noticeable difference in formant frequency

is 3-5%. The effect of the fundamental frequency on the formant frequency

estimation is illustrated in Figure 44. The figure shows both the short

time spectrum (dotted line) and the linear prediction spectral envelope

(solid line). Notice that there are two peaks in the spectral envelope




















Figure 42. The pitch period contour and intensity contour of the
sentence "We were away a year ago."

























Figure 43. The formant contours of the sentence "We were away a
year ago."
































[Figures 42 and 43 plots: pitch period/intensity contours and formant contours (F1, F2, F3) vs. time, 0 to 1.6 s, aligned with the words "WE WERE AWAY A YEAR AGO".]


[Figure 44 plot: amplitude in dB vs. frequency, 0 to 5000 Hz; dotted line: short-time spectrum; solid line: linear prediction spectral envelope.]

Figure 44. Spectrum of vowel /o/ for the child subject.







corresponding to the second and the third harmonics of the fundamental

frequency. Either one may be labelled as the first formant, while the

"real" first formant may lie somewhere between the two. The higher formants

could also be in error. But the percentage of the error with respect to

the formant frequency is usually less than 5% (the just noticeable differ-

ence) so it will not affect the quality of the synthetic speech.

A way to reduce the error in the first formant frequency estimate is

to use the pitch synchronous linear prediction analysis. This reduces the

influence of the pitch pulses on the estimation of the autocorrelation

function, and hence increases the accuracy of the formant estimation.
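A sketch of what pitch-synchronous analysis means in practice (our own illustration, using a covariance-style least-squares fit over a single closed-phase stretch; the order of 12 matches the 12-pole analysis seen in Figure 20):

```python
# Sketch: pitch-synchronous LP -- fit the predictor over one closed-phase
# segment only, so pitch pulses do not bias the estimates.
import numpy as np

def closed_phase_lp(speech, start, stop, order=12):
    """Least-squares predictor coefficients for the model
    s[n] = sum_k a[k] * s[n-k], fit on samples start..stop only."""
    rows = [speech[n - 1:n - order - 1:-1] for n in range(start, stop)]
    X = np.array(rows, dtype=float)            # past samples (covariance method)
    y = speech[start:stop].astype(float)       # samples to predict
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.concatenate(([1.0], -a))         # inverse filter A(z) coefficients
    # Formant candidates follow from np.roots(...) of these coefficients.

# e.g., for the closed phase of Figure 20: A = closed_phase_lp(x, 140, 170)
```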

The glottal volume velocity waveform can be obtained by inverse

filtering the speech waveform. Figure 45 shows the glottal volume velocity

waveform of vowel /i/ for the child's speech. Notice that the closed phase

interval is very short in this case. This phenomenon is usually observed

in children's speech and is associated with the breathy quality of the

speech. The glottal volume velocity waveform can be used to construct the

glottal excitation functions for the formant synthesizer as discussed in

the previous section. The glottal excitation functions used to synthesize

the child's speech are shown in Figure 46.

The parameters and excitation functions obtained by the analysis are

used to resynthesize the speech. Since the sentence consists of non-nasal

sounds throughout, no additional effort for setting the nasal formant/anti-

formant is needed. The only thing we have to adjust in this sentence is

the amplitude of frication and aspiration for the voiced plosive /g/ as

in "ago". In this case, we set the amplitude of frication to 49 dB for

10 ms to give a sudden burst of noise, and then set the amplitude of

aspiration to 30 dB for 10 ms before the voicing onset. This parameter