Citation
An articulatory speech synthesizer

Material Information

Title:
An articulatory speech synthesizer
Creator:
Bocchieri, Enrico Luigi, 1956- ( Dissertant )
Childers, Donald G. ( Thesis advisor )
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
Copyright Date:
1983
Language:
English
Physical Description:
vi, 169 leaves : ill. ; 28 cm.

Subjects

Subjects / Keywords:
Glottal consonants ( jstor )
Modeling ( jstor )
Signals ( jstor )
Simulations ( jstor )
Speech production ( jstor )
Spoken communication ( jstor )
Velocity ( jstor )
Vibration ( jstor )
Vocal cords ( jstor )
Vowels ( jstor )
Dissertations, Academic -- Electrical Engineering -- UF
Electrical Engineering thesis Ph. D
Speech synthesis ( lcsh )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Abstract:
Linear prediction and formant synthesizers are based on a rather approximate model of speech production physiology, using analysis or "identification" algorithms of natural speech to overcome the model limitations and to synthesize good quality speech. On the contrary, articulatory synthesizers are based on a more exact speech production model, and do not use identification algorithms to derive the model parameters directly from the natural speech waveform. This dissertation shows that the amount of physiological detail captured by the articulatory synthesis method is sufficient for the generation of high quality synthetic speech and for the simulation of physiological and pathological aspects of speech that are reported in the literature. Articulatory synthesis of speech represents the acoustic properties of the vocal cavities by means of modeling and numerical simulation techniques that are reported in Chapters 3 and 4. We have been able to guarantee the stability of the numerical method and to halve the number of differential equations that must be solved for the simulation of the sound propagation in the vocal tract (Chapter 4). In the Appendix we present a new and more efficient algorithm for the simulation of the vocal cavity acoustics which can be efficiently implemented with parallel processing hardware. Interactive graphic software (Chapter 5) has been developed to represent the configurations of the vocal cavities and to provide us with a convenient interface for the manipulation of the geometric model of the vocal cavities. Chapter 6 employs the developed articulatory synthesis system for the simulation of different aspects of speech processing, for modeling speech physiology, and testing theories of linguistics reported in the literature. We discuss and illustrate such cases as source tract interaction, EGG modeling, onset spectra of voiced stops at consonantal release, the effects of yielding walls on phonation, sound intensity reduction during nasalization, and glottal least squares inverse filtering.
Thesis:
Thesis (Ph. D.)--University of Florida, 1983.
Bibliography:
Includes bibliographic references (leaves 158-168).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Enrico Luigi Bocchieri.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Enrico Luigi Bocchieri. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
000437194 ( ALEPH )
11216733 ( OCLC )
ACJ7267 ( NOTIS )

AN ARTICULATORY SPEECH SYNTHESIZER


BY



ENRICO LUIGI BOCCHIERI


A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF DOCTOR OF PHILOSOPHY


UNIVERSITY OF FLORIDA


1984


ACKNOWLEDGMENTS


I would like to express my sincere appreciation to my

advisory committee for their help and guidance throughout

this work.

I would like to give special thanks to my committee

chairman, Dr. D. G. Childers for his competent advice, for

his support, both financial and moral, and for providing an

educational atmosphere without which this research would

never have been possible.

Special gratitude is also expressed to Dr. E. R.

Chenette for his guidance, encouragement and financial

support.

To my parents and family I am forever indebted. Their

unceasing support and encouragement made it all possible and

worthwhile.



TABLE OF CONTENTS


ACKNOWLEDGMENTS ........................................... ii

ABSTRACT ................................................... v

CHAPTER                                                   Page

1  INTRODUCTION: SPEECH SYNTHESIS APPLICATIONS.
   RESEARCH GOALS .......................................... 1

2  SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS .......... 7

   2.1) Speech Physiology and the Source-Filter Model ...... 8
   2.2) Linear Prediction ................................. 13
   2.3) Formant Synthesis ................................. 16
   2.4) Articulatory Synthesis ............................ 18

3  ACOUSTIC MODELS OF THE VOCAL CAVITIES .................. 23

   3.1) Sound Propagation in the Vocal Cavities ........... 23
        3.1.a) Derivation of the Model .................... 23
        3.1.b) Modeling the Yielding Wall Properties ...... 32
        3.1.c) Nasal Coupling ............................. 37
   3.2) Excitation Modeling ............................... 37
        3.2.a) Subglottal Pressure ........................ 37
        3.2.b) Voiced Excitation .......................... 38
        3.2.c) Unvoiced Excitation ........................ 45
   3.3) Radiation Load .................................... 47
   3.4) Remarks and Other Acoustic Models ................. 48

4  NUMERICAL SOLUTION OF THE ACOUSTIC MODEL ............... 51

   4.1) Requirements of the Numerical Solution Procedure .. 51
   4.2) Runge-Kutta Methods ............................... 54
        4.2.a) Derivation of the Method ................... 54
        4.2.b) Control of the Step Size with Runge-Kutta .. 57
        4.2.c) Order Selection ............................ 59
   4.3) Multistep Methods ................................. 60
        4.3.a) Implicit and Explicit Methods .............. 60
        4.3.b) Derivation of the Methods .................. 61
        4.3.c) Characteristics of Multistep Methods ....... 64
   4.4) Method Selection .................................. 66

5  THE ARTICULATORY MODEL AND ITS
   INTERACTIVE GRAPHIC IMPLEMENTATION ..................... 71

   5.1) Definition of the Articulatory Model .............. 71
   5.2) The Graphic Editor ................................ 74
   5.3) Display of the Time Variations of the Model ....... 80
   5.4) Simultaneous and Animated Display of the
        Articulatory and Acoustic Characteristics
        of the Vocal Cavities ............................. 84

6  SIMULATION RESULTS ..................................... 88

   6.1) Speech Synthesis .................................. 88
   6.2) Source Tract Interaction .......................... 91
   6.3) Onset Spectra of Voiced Stops ..................... 95
   6.4) Glottal Inverse Filtering of Speech ............... 99
   6.5) Simulation of Wall Vibration Effects ............. 105
        6.5.a) Vocal Cords Vibration During Closure ...... 105
        6.5.b) Formant Shift ............................. 110
   6.6) Pathology Simulation: Reduction of Sound
        Intensity During Nasalization .................... 112
   6.7) Coarticulation ................................... 117
   6.8) Interpretation of the EGG Data with
        the Two Mass Model of the Vocal Cords ............ 124

7  CONCLUSIONS ........................................... 133

   7.1) Summary .......................................... 133
   7.2) Suggestions for Future Research .................. 135

APPENDIX

   AN EFFICIENT ARTICULATORY SYNTHESIS ALGORITHM
   FOR ARRAY PROCESSOR IMPLEMENTATION .................... 137

   A.1) Wave Propagation in Concatenated
        Lossless Tubes ................................... 138
   A.2) Modifications of Kelly-Lochbaum Algorithm ........ 141
        A.2.a) Fricative Excitation ...................... 141
        A.2.b) Yielding Wall Simulation .................. 146
   A.3) Boundary Conditions .............................. 150
        A.3.a) Glottal Termination ....................... 151
        A.3.b) Radiation Load ............................ 153
        A.3.c) Nasal Coupling ............................ 155

REFERENCES ............................................... 158

BIOGRAPHICAL SKETCH ...................................... 169


Abstract of Dissertation Presented to the Graduate
Council of the University of Florida in Partial
Fulfillment of the Requirements for the
Degree of Doctor of Philosophy

AN ARTICULATORY SPEECH SYNTHESIZER


By

Enrico Luigi Bocchieri

April 1984

Chairman: Dr. D. G. Childers
Major Department: Electrical Engineering

Linear prediction and formant synthesizers are based on

a rather approximate model of speech production physiology,

using analysis or "identification" algorithms of natural

speech to overcome the model limitations and to synthesize

good quality speech.

On the contrary, articulatory synthesizers are based on

a more exact speech production model, and do not use

identification algorithms to derive the model parameters

directly from the natural speech waveform.

This dissertation shows that the amount of

physiological detail captured by the articulatory synthesis

method is sufficient for the generation of high quality

synthetic speech and for the simulation of physiological and

pathological aspects of speech that are reported in the

literature.



Articulatory synthesis of speech represents the

acoustic properties of the vocal cavities by means of

modeling and numerical simulation techniques that are

reported in Chapters 3 and 4.

We have been able to guarantee the stability of the

numerical method and to halve the number of differential

equations that must be solved for the simulation of the

sound propagation in the vocal tract (Chapter 4).

In the Appendix we present a new and more efficient

algorithm for the simulation of the vocal cavity acoustics

which can be efficiently implemented with parallel

processing hardware.

Interactive graphic software (Chapter 5) has been

developed to represent the configurations of the vocal

cavities and to provide us with a convenient interface for

the manipulation of the geometric model of the vocal

cavities.

Chapter 6 employs the developed articulatory synthesis

system for the simulation of different aspects of speech

processing, for modeling speech physiology, and testing

theories of linguistics reported in the literature. We

discuss and illustrate such cases as source tract

interaction, EGG modeling, onset spectra of voiced stops at

consonantal release, the effects of yielding walls on

phonation, sound intensity reduction during nasalization,

and glottal least squares inverse filtering.


CHAPTER 1
INTRODUCTION: SPEECH SYNTHESIS APPLICATIONS. RESEARCH GOALS.

In the last three or four decades both the engineering and

medical communities have devoted considerable research

effort to the problem of speech synthesis, i.e., the

generation of voice by artificial, electrical or mechanical

means.

The earliest attempts to construct talking machines can

be traced to the late 18th century. One of the first speech

synthesis devices was Kempelen's talking machine [1-2] which

in a demonstration in Vienna in 1791 was capable of

imitating the sounds of vowels and of many consonants.

Perhaps the greatest motivation for speech synthesis

research came from the development of telecommunications and

from the consequent engineering interest in efficient

methods for speech transmission. Moreover, the recent

progress in circuit integration, microprocessors and

digital computers have made the implementation of high

performance speech transmission systems technologically

feasible [3-6]. This type of application requires a scheme

known as speech synthesis by analysis.



In its simplest form, speech communication is achieved

by modulating an electrical magnitude (for example, the

current in a transmission line) with the air pressure during

speech production. With this straightforward approach a

copy, in electrical terms, of the speech waveform can be

transmitted on a communication channel with a typical

bandwidth of about 3 kHz.

However, there appears to be a mismatch between the

information content of speech and the channel capacity. In

fact, the information content of written text may be

estimated at about 50 bit/sec [7] while the channel capacity

of a 3 kHz bandwidth channel with a typical signal-to-noise ratio is

about 30,000 bits/sec. Similar bit rates are also

encountered in conventional PCM speech transmission. Even

though spoken speech contains more information (such as

intonation and stress) than its written counterpart, the

above mentioned mismatch indicates that a smaller channel

bandwidth can be used for a more efficient transmission of

speech. Using different tradeoffs between the

intelligibility and naturalness of speech transmission on

one side and bit rate on the other, engineers have been able

to transmit speech with bit rates varying from 150 to 30,000

bit/s [8-11].
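
As a back-of-the-envelope check of these figures (assuming a
signal-to-noise ratio of about 30 dB, a typical telephone-channel
value not stated above), Shannon's capacity formula gives

    C = B log2(1 + SNR) ≈ 3000 · log2(1 + 1000) ≈ 3000 · 10
      ≈ 30,000 bit/s,

consistent with the channel capacity quoted for a 3 kHz bandwidth.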

The reduction of channel bandwidth has been obtained by

means of analysis-synthesis systems. Before transmission,



speech analysis algorithms are used to extract relevant
information about the speech waveform. This information is

then encoded and transmitted (hopefully at low bit rates)

along the communication channel. At the receiver a

synthesis algorithm is used to reconstruct the speech
waveform from the transmitted information.

This synthesis by analysis process is useful not only

in voice communication systems; for example, in automatic

voice answering systems, words or sentences are stored for

successive playbacks. In addition synthesis by analysis can
be used to reduce the memory usage. Texas Instruments'
learning aid, Speak and Spell, is an example of this type of

application.

Synthesis by rule or text to speech synthesis is a

different type of application that has received considerable
attention lately [12-13]. In this case the problem is not

to "regenerate" synthetic speech after an analysis phase of

its natural counterpart. Instead synthetic speech is

automatically produced according to certain linguistic rules

which transform a string of discrete input symbols directly
into speech [14] (see Figure 1.1). Applications of text to
speech synthesis include reading machines for the blind

[15], automatic answering systems, ... man-machine
communication.

The medical community is interested in speech synthesis
systems for different reasons. Speech synthesizers are


Figure 1.1. Text to speech synthesis: stored rules and data drive a
synthesis strategy that converts discrete input symbols into control
signals for a speech synthesizer.


often used in psychoacoustic and perceptual experiments

[16-18] in which the acoustic characteristics of speech must

be precisely and systematically controlled. Moreover the

vocal system is not easily accessible; therefore speech

physiologists and pathologists may use computer models as an

aid for the investigation of the physiology of the vocal

system and the diagnosis of voice disorders [19-20].

The purpose of this research is to apply speech

synthesis techniques for the simulation of the physiological

process of speech articulation in relation to the acoustic

characteristics of the speech signal.

Chapter 2 reviews the speech synthesis strategies used

most often and explains why the so-called "articulatory

synthesis" method has been selected for our research.

Speech generation depends on the acoustic properties of the vocal

cavities, which are physiologically determined during

speech articulation by the geometric configuration of the

vocal system.

The model of the acoustic characteristics of the vocal

cavities is explained in detail in Chapter 3, together with

its implementation by means of numerical simulation

techniques in Chapter 4. Chapter 5 focuses on the geometry

or spatial model of the vocal tract together with the

interactive graphic techniques that have been used for its

representation.



Synthesis and simulation results are presented in

Chapter 6. Chapter 7 provides a discussion of our findings

along with conclusions and suggestions for future research.



CHAPTER 2
SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS

As indicated in the introduction (Chapter 1) there are

many different applications that motivate research in the

area of speech synthesis. However, different goals usually

require different approaches for the solution of the

problem. This chapter will briefly consider the three most

popular and best documented techniques for speech synthesis

(namely linear prediction, formant and articulatory

synthesis), their relative "advantages" and "disadvantages"

and the applications for which they are most suitable. The

purpose is to review the available speech synthesis

techniques and to justify the choice of articulatory

synthesis for our research.

Every speech synthesis strategy is based on a more or

less complete model of the physiology of speech production

and ultimately its performance is determined by the amount

of acoustic and linguistic knowledge that the model can

capture.

In the first section of this chapter we therefore

discuss the basic notions of the physiology of speech

together with the source-filter production model upon which

both linear prediction and formant synthesis are based.



2.1) Speech Physiology and the Source-Filter Model.

The acoustic and articulatory features of speech

production can be most easily discussed by referring to

Figure 2.1, which shows the cross-section of the vocal

apparatus.

The thoracic and abdominal musculatures are the

source of energy for the production of speech. The

contraction of the rib cage and the upward movement of the

diaphragm increase the air pressure in the lungs and expel

air through the trachea to provide an acoustic excitation of

the supraglottal vocal cavities, i.e., the pharynx, mouth

and nasal passage.

The nature of speech sounds is mostly determined by the

vocal cords and by the supraglottal cavities. The vocal

cords are two lips of ligament and muscle located in the

larynx; the supraglottal cavities are the oral and nasal

cavities that are vented to the atmosphere through the mouth

and nostrils.

Physically, speech sounds are an acoustic pressure wave

that is radiated from the mouth and from the nostrils and is

generated by the acoustic excitation of the vocal cavities

with the stream of air that is coming from the lungs during

exhalation.

An obvious and important characteristic of speech is

that it is not a continuous type of sound but instead it is


Figure 2.1. Schematic diagram of the human vocal
mechanism (from [7]). By permission
of Springer-Verlag.



perceived as a sequence of speech units or segments. In

general the different types of sound that occur during

speech production are generated by changing the manner of

excitation and the acoustic response of the vocal cavities.

As a first order approximation we can distinguish

between a "voiced" and a "fricative" or "unvoiced"

excitation of the vocal cavities.

The voiced excitation is obtained by allowing the vocal

cords to vibrate so that they modulate the stream of air

that is coming from the lungs, producing an almost periodic

signal. For example, vowels are generated in this way and

they are perceived as continuous non-hissy sounds because

the excitation is essentially periodic.

In contrast, unvoiced or fricative excitation is

achieved by forcing the air flow through a constriction in

the vocal tract with a sufficiently high Reynolds number,

thereby causing turbulence. This excitation has random or

"noisy" characteristics and, therefore, the resulting speech

sounds will be hissy or fricative (friction-like) as in the

case of the consonants /s/ and /f/.

Both voiced and unvoiced excitation signals have a

rather wide spectrum. Typically the power spectrum of the

voiced excitation decreases with an average slope of

12 dB/octave [7] while the unvoiced spectrum can be

considered white over the speech frequencies [21].



The spectral characteristics of the excitation are

further modified by the acoustic transfer function of the

vocal cavities. Sound transmission is more efficient at the

resonance frequencies of the supraglottal vocal system and,

therefore, the acoustic energy of the radiated speech sound

is concentrated around these frequencies (formant

frequencies). During the generation of connected speech,

the shape and acoustic characteristics of the vocal cavities

are continuously changed by precisely timed movements of the

lips, tongue and of the other vocal organs. This process of

adjustment of the vocal cavity shape to produce different

types of speech sounds is called articulation.

These considerations about speech physiology lead to

the simple but extremely useful source-tract model of speech

production [22], which has been explicitly or implicitly

used since the earliest work in the area of speech synthesis

[23-25]. This model is still employed in linear prediction

and formant synthesis. It consists (see Figure 2.2) of a

filter whose transfer function models the acoustic response

of the vocal cavities and of an excitation source that

generates either a periodic or a random signal for the

production of voiced or unvoiced sounds, respectively. The

operation of the source and the filter transfer function can

be determined by external control parameters to obtain an

output signal with the same acoustic properties as speech.
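
The source-tract model lends itself to a very compact numerical
illustration. The following sketch is not part of the dissertation's
synthesizer: it drives a single hypothetical 500 Hz resonance (a
stand-in for the full transfer function h(t)) with either a 100 Hz
impulse train or white noise; all numerical values are illustrative
assumptions.

    import numpy as np
    from scipy.signal import lfilter

    fs = 10000                      # sampling rate (Hz)
    n = fs                          # one second of excitation

    # Voiced source g(t): impulse train at a hypothetical 100 Hz pitch.
    voiced = np.zeros(n)
    voiced[::fs // 100] = 1.0

    # Unvoiced source g(t): white noise.
    unvoiced = np.random.randn(n)

    # Vocal tract filter h(t): one resonance at 500 Hz, 80 Hz bandwidth.
    f_res, bw = 500.0, 80.0
    r = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * f_res / fs
    a = [1.0, -2.0 * r * np.cos(theta), r * r]   # resonator denominator

    vowel_like = lfilter([1.0], a, voiced)       # s(t) = g(t) * h(t)
    hiss_like = lfilter([1.0], a, unvoiced)

Switching the source between the two generators while keeping the
filter fixed is exactly the separability assumption of Figure 2.2.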



Figure 2.2. The source-tract speech production model: a voiced or
unvoiced sound source g(t) excites the vocal tract transfer function
h(t), giving the sound output s(t) = g(t)*h(t).


2.2) Linear Prediction

The simplest and most widely used implementation of the

source tract model of speech production is Linear Prediction

synthesis. It is a synthesis by analysis method that was

first proposed by Atal and Hanauer [26] and it has been

investigated for a great variety of speech applications.

The method is particularly suitable for digital

implementation and it assumes a time discrete model of

speech production, typically with a sampling frequency

between 7 and 10 kHz. It consists (see Figure 2.3) of two

signal generators of voiced and unvoiced excitation and of

an all pole transfer function



    H(z) = 1 / (1 + Σ_{k=1}^{p} a_k z^(-k))                 (2.2.1)


to represent the acoustic response of the vocal cavities.

Mathematically, the transfer function H(z) is determined by

the predictor coefficients a_k. The great advantage of

linear prediction is that an estimate of the predictor

parameters can be efficiently obtained using an analysis

phase of natural speech. The literature presents several

algorithms to perform this analysis. Perhaps the schemes

most used for speech synthesis applications are the

autocorrelation method [27] and, for hardware

implementation, the PARCOR algorithm [28].
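
As an illustration of the autocorrelation method, the sketch below
estimates the predictor coefficients a_k of (2.2.1) from a windowed
frame of speech samples using the Levinson-Durbin recursion; it is a
generic textbook formulation, not the specific implementation of [27].

    import numpy as np

    def lpc_autocorrelation(frame, order):
        """Predictor coefficients [1, a_1, ..., a_p] of (2.2.1) by the
        autocorrelation method (Levinson-Durbin recursion)."""
        n = len(frame)
        r = np.array([np.dot(frame[:n - k], frame[k:])
                      for k in range(order + 1)])
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]                    # prediction error energy
        for i in range(1, order + 1):
            k = -np.dot(a[:i], r[i:0:-1]) / err   # reflection coefficient
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
            err *= (1.0 - k * k)
        return a, err

The reflection coefficients k computed inside the loop are the PARCOR
parameters mentioned above in connection with hardware
implementations [28].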



Figure 2.3. Linear prediction speech production model: a pitch-driven
pulse generator and a white noise generator, selected by a V/UV
switch, excite the all pole transfer function H(z) with predictor
parameters a_k.


During speech articulation the vocal cavity transfer

function is continuously changing and, ideally, the result

of the analysis of natural speech is a time varying estimate

of the linear predictor parameters (or of an equivalent

representation) as they change during speech production.

Also, the literature reports many algorithms which can

be applied to natural speech to extract the fundamental

(vocal cord oscillation) frequency and to perform the

voiced/unvoiced decision [29-30]. Some of them have been

implemented on special purpose integrated circuits for real

time applications [3] [5].

Therefore, linear prediction provides a complete

synthesis by analysis method in which all the control

parameters of Figure 2.3 can be derived directly from

natural speech.

The shortcoming of linear prediction is that the

transfer function (2.2.1) cannot properly model the

production of nasal, fricative and stop consonants. The all

pole approximation of the vocal tract transfer function is

in fact theoretically justified only for vowel sounds, and

even in this case the linear prediction model assumes a

minimum phase excitation signal.

In spite of these disadvantages, however, linear

prediction performs well for speech synthesis applications

because the approximations introduced by the model of




Figure 2.3 do not severely affect the perceptual properties

of the speech sound. In fact the human hearing system

appears to be especially sensitive to the magnitude of the

short-time spectrum of speech [31], which is usually

adequately approximated by the linear prediction transfer

function [32]. Perhaps the minimum phase approximation is

responsible for a characteristic "buzziness" [33] of the

synthetic speech. A great deal of research is being

dedicated to improve the quality of linear prediction

synthesis by using a more suitable excitation than impulse

sequences [34-35].

Linear prediction of speech is, therefore, most useful

in those applications that require a fully automated

synthesis by analysis process. Speech compression, linear

prediction vocoders, very low bit rate transmissions are

typical examples. Also linear prediction has application in

speech products where speech may be recorded for later

playback with stringent memory constraints.



2.3) Formant Synthesis

Similar to linear prediction, formant synthesis is

based on the source-tract speech production model. However,

in this case the filter that models the vocal tract is not

implemented by an all pole digital filter but it consists of

a number of resonators whose transfer function is controlled

by their resonance (formant) frequencies and bandwidths.



Among the many formant synthesizers reported in the

literature [16] [36-37] two general configurations are

common. In one type of configuration, the formant

resonators that simulate the transfer function of the vocal

tract are connected in parallel. Each resonator is followed

by an amplitude gain control which determines the spectral

peak level. In the other type of synthesizer the resonators

are in cascade. The advantage here is that the relative

amplitudes of formant peaks for the generation of vowels are

produced correctly with no need for individual amplitude

control for each formant [16].

Formant synthesis was developed prior to linear

prediction synthesis probably because it is amenable to

analog hardware implementation. However, many synthesizers

have been implemented on general purpose digital computers

to obtain a more flexible design. The most recent and

perhaps most complete synthesizer reported in the literature

has been designed by Klatt [16]; it consists of cascade

and parallel configurations, which are used for the

production of vowels and consonants, respectively.

An advantage of formant synthesizers is their

flexibility. In fact, thanks to the parallel configuration,

the filter transfer function may have both zeroes and

poles. Klatt's synthesizer, for example, has 39 control

parameters which not only control the filter transfer



function in terms of formant frequencies and bandwidths but

also determine different types of excitation such as voiced,

unvoiced, mixed, sinusoidal for the generation of

consonantal murmurs, and burst-like for the simulation of

stop release.

Formant synthesizers are particularly useful in

psychoacoustic studies since the synthetic speech waveform

can be precisely controlled by parameters which are more

directly related to the acoustic characteristics of speech

than the linear prediction coefficients. For the same

reason they are also more suitable for speech synthesis by

rule.

Synthesis by analysis with formant synthesizers

requires the extraction of the formant information from the

speech signal. For this purpose the literature presents

several formant analysis algorithms [38-41]. However linear

prediction analysis is simpler and more efficient and

therefore linear prediction is usually preferred in

synthesis by analysis applications.



2.4) Articulatory Synthesis

Linear prediction and formant synthesis are not

completely suitable for our research since they do not

faithfully model the human speech production mechanism.



A first disadvantage, which is inherent to the source-

filter model, is the assumption of separability between the

excitation and the acoustic properties of the vocal tract.

Clearly this assumption is not valid for the production of

fricative sounds in which the excitation depends on the

vocal tract constriction or for the generation of stops in

which the tract closure arrests the glottal air flow.

Source tract separability is a first order modeling

approximation even in the production of vowels as documented

by many recent papers concerning the nature of source tract

interaction [42-45]. We will address this issue in

Chapter 6 in more detail.

The second "disadvantage" (for our purpose) is that the

filter in the source-tract model (and also linear prediction

and formant synthesis) accounts only for the acoustic

input/output transfer function of the vocal cavities which

is estimated by means of analysis or "identification"

algorithms. For example, Linear Prediction and Formant

synthesis cannot model the aerodynamic and myoelastic

effects that determine the vocal cord vibration, the air

pressure distribution along the oral and nasal tracts and

the vibration of the vocal cavity walls.

In other words, linear prediction and formant synthesis

define an algorithm for the generation of signals with the

same acoustic features as natural speech but they do not

model the physiological mechanism of speech generation.




These limitations of the source-tract model of speech

production can be overcome by the so-called "articulatory"

synthesis that is based on a physiological model of speech

production. It consists (see Figure 2.4) of at least two

separate components,

1) an articulatory model that has as input the time

varying vocal organ positions during speech

production to generate a description of the

corresponding vocal cavity shape and

2) an acoustic model which, given a certain time

varying vocal cavity configuration, is capable of

estimating not only the corresponding speech

waveform but also the pressure and volume velocity

distribution in the vocal tract, the vibration

pattern of the vocal cords and of the vocal cavity

walls.

The strength of linear prediction and formant

synthesis, namely the existence of analysis algorithms for

natural speech, is however a weak point for articulatory

synthesis. Even if several methods for estimating the vocal

tract configuration are presented in the literature [46-51],

these procedures cannot be easily applied to all the

different types of speech sounds. This estimation is made

even more difficult to achieve by the fact that the acoustic

to articulatory transformation is not unique [52-53]. This




Figure 2.4. Articulatory synthesis of speech: vocal organ positions
feed an articulatory model that produces a description of the vocal
cavity shape; a vocal cavity acoustic model then computes the
synthetic speech, the vibration of the vocal cords, and the volume
velocity and pressure distribution in the vocal cavity.


disadvantage, together with high computational requirements,

limits the use of articulatory synthesis in speech

applications, a use that has recently been investigated by

Flanagan et al. [54].

In the following, Chapters 3 and 4 will discuss the

acoustic modeling of the vocal cavities and its

implementation with numerical simulation techniques.

Chapter 5 concentrates on the articulation model and the

computer graphic techniques used for its implementation.


CHAPTER 3
ACOUSTIC MODELS OF THE VOCAL CAVITIES

The qualitative descriptions of the human speech

production mechanism and of the articulatory method of

speech synthesis that we have given in Section 2.1 cannot be

directly implemented on a digital computer. This knowledge

must be transformed into an analytical representation of the

physics of sound generation and propagation in the vocal

cavities.

The mathematical model of the vocal cavity acoustics

can be conveniently interpreted by means of equivalent

circuits. In such a representation the electrical voltage

and current correspond respectively to the air pressure and

volume velocity in the vocal cavities. In the following we

will always express all the physical dimensions in C.G.S.

units.



3.1) Sound Propagation in the Vocal Cavities.

3.1.a) Derivation of the Model.

Sound is nearly synonymous with vibration. Sound waves

are originated by mechanical vibrations and are propagated

in air or other media by vibrating the particles of the

media. The fundamental laws of mechanics, such as momentum,

mass and energy conservation and of fluid dynamics can be

applied to the compressible, low viscosity medium (air) to

quantitatively account for sound propagation.

The oral and nasal tracts are three-dimensional lossy

cavities of non-uniform cross-sections and non-rigid

walls. Their acoustic characteristics are described by a

three-dimensional Navier-Stokes partial differential

equation with boundary conditions appropriate to the yielding

walls. However, in practice, the solution of this

mathematical formulation requires

"an exhorbitant amount of computation, and we do not
even know the exact shape of the vocal tract and the
characteristics of the walls to take advantage of such a
rigorous approach." [55]

The simplifying assumption commonly made in the

literature is plane wave propagation. Most of the sound

energy during speech is contained in the frequency range

between 80 and 8000 Hz [56] but speech quality is not

significantly affected if only the frequencies below 5 kHz

are retained [16]. In this frequency range the cross

sectional dimensions of the vocal tract are sufficiently

small compared to the sound wavelength so that the departure

from plane wave propagation is not significant.

Thanks to the plane wave propagation assumption, the

geometric modeling of the vocal cavities can be greatly

simplified. Acoustically the vocal tract becomes equivalent

to a circular pipe of non-uniform cross-section (see



Figure 3.1) whose physical dimensions are completely

described by its cross-sectional area, A(x), as a function

of the distance x along the tube (area function). The sound

propagation can now be modeled by a one dimensional wave

equation. If the losses due to viscosity and thermal

conduction either in the bulk of the fluid or at the walls

of the tube are neglected, the following system of

differential equations accurately describes the wave

propagation [29]



    ∂p(x,t)/∂x + (ρ / A(x,t)) ∂u(x,t)/∂t = 0                (3.1.1a)

    ∂u(x,t)/∂x + (1 / ρc²) ∂(p(x,t) A(x,t))/∂t = -∂A(x,t)/∂t
                                                            (3.1.1b)

where

    x       = displacement along the axis of the tube
    p(x,t)  = pressure in the tube as function of time
              and displacement
    u(x,t)  = air volume velocity in the tube
    A(x,t)  = area function of the tube
    ρ       = air density
    c       = sound velocity.

The first and second equations (3.1.1) correspond to

Newton's and continuity law, respectively.


Figure 3.1. The vocal tract, from the glottis to the mouth,
represented by a non-uniform pipe and its area function A(x)
(cm² versus distance x, from 0 to L cm).


Equations (3.1.1) indicate that the area function

A(x,t) is varying with time. Physiologically this is caused

by two different phenomena,

a) the voluntary movements of the vocal organs during

speech articulation, and

b) the vibration of the vocal cavity walls that are

caused by the variations of the vocal tract

pressure during speech.

We therefore represent the vocal tract area function as

a summation of two components



    A(x,t) = A0(x,t) + δA(x,t) = A0(x) + δA(x,t)            (3.1.2)



The first component A0(x,t) is determined by a) above and it

represents the "nominal" cross-sectional area of the vocal

tract. Since the movements of the vocal organs are slow

compared to the sound propagation, this component can be

considered time invariant with a good approximation. The

second component δA(x,t) represents the perturbation of the

cross-sectional area of the vocal tract that is caused by b)

above. Its dynamics cannot be neglected as compared with

the acoustic propagation. Nevertheless, this component has

a relatively small magnitude compared to A0(x).

If we substitute (3.1.2) into (3.1.1) and if we neglect

second order terms we obtain


    ∂p(x,t)/∂x + (ρ / A0(x)) ∂u(x,t)/∂t = 0                 (3.1.3a)

    ∂u(x,t)/∂x + (A0(x) / ρc²) ∂p(x,t)/∂t = -∂(δA(x,t))/∂t  (3.1.3b)

The partial differential equations (3.1.3) can be

approximated with a system of ordinary differential

equations. Let the acoustic pipe be represented by a

sequence of N elemental lengths with circular and uniform

cross-sections Ai, i = 1, ..., N. This is equivalent to

approximating the area function A(x) in a stepwise fashion,

as shown in Figure 3.2.

If each elemental length is sufficiently shorter than

the sound wavelength, we can suppose that the pressure and

volume velocity are independent of the position in the

elemental length itself. Instead of the functions p(x,t)

and u(x,t), we need to consider a finite number of time

functions only



    pi(t), ui(t);    i = 1, ..., N



that represent the pressure and volume velocity in the ith

elemental section as a function of time. The partial

derivatives can now be approximated by finite differences



Figure 3.2. Stepwise approximation of the area function
(cm² versus distance along the tract, 0 to L cm).



    ∂p(x,t)/∂x ≈ (pi(t) - pi-1(t)) / Δx

    ∂u(x,t)/∂x ≈ (ui+1(t) - ui(t)) / Δx

where

    pi(t), ui(t) = pressure and volume velocity in the ith
                   vocal tract section

    Δx = L/N = length of each elemental section.


Therefore equations (3.1.3) become


    d ui(t)/dt = -(Ai / ρ Δx) (pi(t) - pi-1(t))             (3.1.4a)

    d pi(t)/dt = -(ρc² / Ai Δx) (ui+1(t) - ui(t) + Δx dδAi(t)/dt)
                                                            (3.1.4b)

    i = 1, ..., N
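
For illustration, equations (3.1.4) can be stepped forward in time
directly. The sketch below integrates a uniform, lossless,
rigid-walled tube (dδAi/dt = 0) with a constant glottal inflow and a
closed mouth end; the areas, step size and boundary conditions are
illustrative assumptions, not the synthesizer's actual configuration
(the numerical method actually used is the subject of Chapter 4).

    import numpy as np

    rho, c = 1.14e-3, 3.5e4   # air density (g/cm^3), sound speed (cm/s)
    L_tube, N = 17.0, 20      # tube length (cm), number of sections
    dx = L_tube / N
    A = np.full(N, 5.0)       # uniform 5 cm^2 area function (illustrative)

    p = np.zeros(N)           # p_i(t), pressure in each section
    u = np.zeros(N + 1)       # u_i(t), volume velocity at boundaries
    u[0] = 100.0              # constant glottal inflow (cm^3/s);
                              # u[N] = 0 models a closed mouth end

    h = 1.0e-6                # time step (s), small enough for stability
    for step in range(1000):
        # (3.1.4a): du_i/dt = -(A_i/(rho*dx)) * (p_i - p_{i-1})
        u[1:N] += h * (-(A[1:] / (rho * dx)) * (p[1:] - p[:-1]))
        # (3.1.4b) with rigid walls:
        # dp_i/dt = (rho*c^2/(A_i*dx)) * (u_i - u_{i+1}).
        # Updating u first and then p (semi-implicit Euler) keeps the
        # lossless line from blowing up.
        p += h * (rho * c**2 / (A * dx)) * (u[:-1] - u[1:])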



Equations (3.1.4) can be represented by the equivalent

electrical circuit as shown in Figure 3.3. Inductor Li and

capacitor Ci represent the inertance and compressibility of

the air in the ith elemental length of the vocal tract.

They are defined in terms of the cross-sectional area Ai as




Figure 3.3. Vocal tract elemental length and its equivalent circuit:
series inductance Li, shunt capacitance Ci, and the yielding-wall
branch carrying the current uWi.



    Ci = Ai Δx / ρc²,    Li = ρ Δx / Ai,    i = 1, ..., N

The component Δx dδAi(t)/dt in equation (3.1.4b), which represents

the vibration of the cavity walls, is modeled by the current in the

impedance ZWi, as will be shown in the next section.



3.1.b) Modeling the Yielding Wall Properties

From equation (3.1.4b) we can see that the effect of

wall vibration is to generate in the ith elemental section

of the vocal tract an additional volume velocity component

equal to

    Δx dδAi(t)/dt


which is represented in Figure 3.3 by the current uWi in the

impedance ZWi. We will now consider how ZWi is related to

the mechanical properties of the vibrating walls.

Consider Figure 3.4 which shows an elemental length of

the pipe in which one wall is allowed to move under the

forcing action of the pressure p(t) in the pipe itself. Let

m, k and d represent the mass, elastic constant and damping

factor of a unit surface. Since the total vibrating surface

is l Δx, and since we assume the walls to be locally

reacting, then the total mass, elastic constant and damping

factor of the vibrating wall are


Figure 3.4. Mechanical model of an elemental length Δx of the vocal
tract with a yielding surface of width l.


m(l Δx), k(l Δx), d(l Δx),



respectively.

According to Newton's law the forcing action on the

wall is


    F = l Δx p(t) = m l Δx d²y(t)/dt² + d l Δx dy(t)/dt
        + k l Δx y(t)                                       (3.1.5)


where y(t) is the wall displacement from the neutral

position (when the pressure in the tract is equal to

naught).

The airflow generated by the wall motion is

    uW(t) = l Δx dy(t)/dt                                   (3.1.6)

and by substitution of (3.1.6) into (3.1.5) we obtain

    p(t) = (m / l Δx) duW(t)/dt + (d / l Δx) uW(t)
           + (k / l Δx) ∫ uW(t) dt                          (3.1.7)

In the frequency domain, (3.1.7) can be equivalently

represented by the impedance,


    ZW(s) = s LW + RW + 1 / (s CW)                          (3.1.8)

with

    P(s) = ZW(s) UW(s)

    LW = m / l Δx;    RW = d / l Δx;    CW = l Δx / k       (3.1.9)


Such an impedance appears to be inversely related to the

vibrating surface (l Δx) and its components are

1. an inductance LW proportional to the mass of the

unit surface of the vibrating wall,

2. a resistor RW proportional to the viscous damping

per unit surface of the vibrating walls, and

3. a capacitor CW inversely proportional to the

elastic constant per unit surface.

Direct measurements of the mechanical properties of

different types of human tissues have been reported in the

literature [57], however such measurements are not directly

available for the vocal tract tissues. Moreover, it is

difficult to estimate the lateral surface of the vocal

cavities that is required to compute the numerical value of

the impedance ZW according to the above derivation.












We, therefore, use a slightly different approach

proposed by Sondhi [55]. He assumed that the vibrating

surface is proportional to the volume of the vocal cavities

and he estimated the mechanical properties of the vibrating

surface on the basis of acoustic measurements. The modeling

results match well with the direct measurements of the

mechanical impedance of human tissues performed by Ishizaka

et al. [57]. The model can be formulated as follows,

Adjust the inductive reactance LWi in the ith elemental

section to match the observed first formant frequency

for the closed mouth condition of about 200 Hz [22] [58]


    LWi = 0.0858 / Vi = 0.0858 / (Δx Ai)

Next, adjust the wall loss component RWi to match the

closed glottis formant bandwidths [59]

    RWi = 130 π LWi

Choose a value of CWi to obtain a resonance frequency of

the wall compatible with the direct measurements of

Ishizaka et al. [57]

    CWi = 1 / ((2π · 30)² LWi)

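The three fitting rules above translate directly into per-section
element values. A minimal sketch, with the section volume
Vi = Δx·Ai taken from the area function as assumed in the derivation:

    import numpy as np

    def wall_elements(A, dx):
        """Yielding-wall elements LWi, RWi, CWi per section, following
        Sondhi's fit (Section 3.1.b). A: area function in cm^2,
        dx: section length in cm; C.G.S. units."""
        V = A * dx                                    # section volume V_i
        LW = 0.0858 / V                               # ~200 Hz closed-mouth F1
        RW = 130.0 * np.pi * LW                       # closed-glottis bandwidths
        CW = 1.0 / ((2.0 * np.pi * 30.0) ** 2 * LW)   # ~30 Hz wall resonance
        return LW, RW, CW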


3.1.c) Nasal Coupling

During the production of nasal consonants and nasalized

vowels the nasal cavity is acoustically coupled to the oral

cavity by lowering the soft palate and opening the

velopharyngeal orifice.

The coupling passage can be represented as a

constriction of variable cross-sectional area, about 1.5 cm

long [22], that can be modeled by an inductor (to account for

the air inertance) in series with a resistor (to account for

viscous losses in the passage).

The wave propagation in the nasal tract can be modeled

in terms of the nasal tract cross-sectional area as

discussed in Section (3.l.a) for the vocal tract. However,

we will use a different equivalent circuit to better account

for the nasal tract losses as we will discuss later in

Section 6.6.



3.2) Excitation Modeling

3.2.a) Subglottal Pressure

The source of energy for speech production lies in the

thoracic and abdominal musculatures. Air is drawn into the

lungs by enlarging the chest cavity and lowering the

diaphragm. It is expelled by contracting the rib cage and

increasing the lung pressure. The subglottal or lung

pressure typically ranges from 4 cm H2O for the production



of soft sounds to 20 cm H2O or more for the generation of

very loud, high pitched speech.

During speech the lung pressure is slowly varying in

comparison with the acoustic propagation in the vocal

cavities. We can, therefore, represent the lung pressure

with a continuous voltage generator whose value is

controlled in time by an external parameter Ps(t).



3.2.b) Voiced Excitation

During speech, the air is forced from the lungs through

the trachea into the pharynx or throat cavity. On top of

the trachea is mounted the larynx (see Figure 3.5), a

cartilaginous structure that houses two lips of ligament

and muscle called the vocal cords or vocal folds.

The vocal cords are posteriorly supported by the

arytenoid cartilages (see Figure 3.5). Their position and

the dimension of the opening between them (the glottis) can

be controlled by voluntary movements of the arytenoid

cartilages.

For the generation of voiced sounds the vocal cords are

brought close to each other so that the glottal aperture

becomes very small. As air is expelled from the lungs,

strong aerodynamic effects put the vocal cords into a rapid

oscillation. Qualitatively, when the vocal cords are close

to each other during the oscillation cycle, the subglottal




Figure 3.5. Cut-away view of the human larynx (from [7]).
VC vocal cords. AC arytenoid cartilages.
TC thyroid cartilage.


pressure forces them apart. This, however, increases the

air flow in the glottis and the consequent Bernoulli

pressure drop between the vocal cords draws the vocal

cords together again. In this way a mechanical "relaxation"

oscillator is developed which modulates the airflow from the

lungs into a quasiperiodic voiced excitation.

The vocal fold oscillation frequency determines

important perceptual characteristics of voiced speech and it

is called the "pitch" frequency or fundamental frequency,

F0.
The first quantitative studies of the aerodynamics of

the larynx were carried out by van den Berg et al. who made

steady flow measurements from plaster casts of a "typical"

larynx [60-61].

The first quantitative self-oscillating model of the

vocal folds was proposed by Flanagan and Landgraf [62] after

Flanagan and Meinhart's studies concerning source tract

interaction [63]. The fundamental idea was to combine van

den Berg's results with the 2nd order mechanical model of

the vocal cords shown in Figure 3.6. An acoustic-mechanical

relaxation oscillator was obtained.

Bilateral symmetry was assumed and only a lateral

displacement x of the mass was allowed. Therefore, only a

second order differential equation was needed to describe the

motion of the mass


Figure 3.6. One mass model of the vocal cords, coupled between the
trachea and lungs below and the vocal tract above, with
position-dependent stiffness K(x).








    M d²x/dt² + B(x) dx/dt + K(x) x = F(t)



The mechanical damping B(x) and elastic constants K(x) are

properly defined functions of the vocal cord position x.

The forcing action F(t) depends on the air pressure

distribution along the glottis which was estimated according

to van den Berg's results. A modified version of the one

mass model was also designed by Mermelstein [64].
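
A minimal numerical sketch of this oscillator equation, with constant
coefficients standing in for the position-dependent B(x) and K(x) and
a prescribed forcing F(t); the values of M, B and K below are
hypothetical, chosen only to make the sketch self-contained, and do
not reproduce the coefficient functions of [62].

    # Forward-Euler integration of  M x'' + B x' + K x = F(t)
    M, B, K = 1.0e-4, 2.0e-2, 30.0   # hypothetical mass, damping, stiffness

    def one_mass_step(x, v, F, h):
        """Advance position x and velocity v of the mass by one step h."""
        acc = (F - B * v - K * x) / M
        return x + h * v, v + h * acc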

Even though the one mass model was able to simulate

important physiological characteristics of vocal cord

vibration, it presented several shortcomings.

1) The frequency of vocal fold vibration was too

dependent on the vocal tract shape, indicating too

great a source tract interaction.

2) The model was unable to account for phase

differences between the upper and lower edge of the

cords.

3) The model was unable to oscillate with a capacitive

vocal tract input impedance.

These difficulties were overcome by the two mass model

of the vocal cords that is shown in Figure 3.7. It was

designed by Ishizaka and Matsudaira [65] and first

implemented by Ishizaka and Flanagan [66-67]. A mechanical

coupling between the two masses is represented by the spring

constant kc. The springs s1 and s2 were given a non-linear



Figure 3.7. Two mass model of the vocal cords and glottis equivalent
circuit: elements Rc, Lg1, RV1, R12, RV2, Lg2 and Re connect the
subglottal pressure Ps to the vocal tract input pressure P1 across
the contraction, glottis, and expansion regions.


characteristic according to the stiffness measured on

excised human vocal cords. As in the case of the one mass

model, the viscous damping was changing during the vocal

cord vibration period.

For computer simulation it is convenient to represent

the pressure distribution along the two masses with the

voltage values in an equivalent circuit. In Figure 3.7

resistance Rc accounts for the Bernoulli pressure drop and

"vena contract" effect at the inlet of the glottis.

Resistances RV1 and RV2 model the viscous losses in the

glottis. Resistance R12 accounts for the pressure

difference between the two masses caused by the Bernoulli

effect. The inductors model the air inertance in the

glottis.

The two mass model of the vocal cords, that has been

used in connection with vocal tract synthesizers [67-68],

uses as control parameters the glottal neutral area Ag0 and

the cord tension Q.

Ag0 determines the glottal area in the absence of phonation

and it is physiologically related to the position of the

arytenoid cartilages. Q controls the values of the elastic

constant of the model and greatly affects the two mass model

oscillation period. The suitability of the two mass model

and of its control parameters for speech synthesis has been

further validated in [69] and [70].



The acoustic synthesizer that we have implemented uses

the two mass model to provide voiced excitation. We

therefore account for source tract interaction since the

current in the equivalent circuit of the glottis (see

Figure 3.7) is dependent on the voltage P1 that models the

pressure in the vocal tract just above the vocal cords.



3.2.c) Unvoiced Excitation

Speech sounds are generally excited by modulating the

air flow through a constriction of the glottal and

supraglottal system. For voiced sounds this modulation is

obtained through rapid changes of the glottal constriction

as explained in the review section. For fricative sounds,

the modulation comes from flow instabilities which arise by

forcing the air through a constriction with a sufficiently

high Reynold's number. In this case the classical

hypothesis of separability between the source and the tract

greatly limits the realism that can be incorporated into the

synthesizer. In fact, unvoiced excitation is greatly

dependent on the constricted area of the vocal tract itself.

The fricative self-excitation of the vocal cavities was

first modeled by Flanagan and Cherry [71]. The idea was to

use a resistor RNi and noise generator VNi in the equivalent

circuit of the ith elemental length of the vocal tract (see

Figure 3.8). The values of the resistor and of the noise


Figure 3.8. Equivalent circuit of vocal tract elemental length with
fricative excitation: a noise generator VNi and resistance RNi are
inserted in the series branch between Li and Li+1, with the
yielding-wall branch in shunt.


generator variance depend on the Reynolds number of the

flow in the ith section [71]. The spectrum of the turbulent

noise VNi can be assumed white with a good approximation

[21].

In our simulation we have modeled the fricative

excitation by means of two sources. One is always located

in the first vocal tract section to generate aspirated

sounds. The second is not bound to a fixed position but can

be moved along with the vocal tract constriction location.
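
A sketch of how such a section-level noise source can be driven by
the local flow conditions, in the spirit of [71]: the Reynolds number
formula is standard, but the critical value and the gain below are
illustrative placeholders rather than the dissertation's constants.

    import numpy as np

    def fricative_noise_sample(u, A, rho=1.14e-3, mu=1.86e-4,
                               re_crit=1800.0, gain=1.0e-12):
        """One sample of turbulence noise for a section with volume
        velocity u (cm^3/s) and cross-sectional area A (cm^2)."""
        d = 2.0 * np.sqrt(A / np.pi)             # equivalent diameter
        re = rho * abs(u) * d / (mu * A)         # Reynolds number of the flow
        excess = max(re**2 - re_crit**2, 0.0)    # turbulence above threshold
        return gain * excess * np.random.randn() # white noise, scaled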



3.3) Radiation Load

The radiation effects at the mouth and nostrils can be

accounted for by modeling the mouth and nostrils as a

radiating surface placed on a sphere (the head) with a

radius of about 9 cm.

Flanagan [7] has proposed a simplified equivalent

circuit for the radiation load model by using a parallel

combination of an inductor and a resistor with values


    RR = 128 / (9π²),    LR = 8a / (3πc)


where a is the radius of the (circular) radiating surface

(mouth or nostrils). Titze [19] has shown that this

approximation, which we are using here, is valid even at

relatively high frequencies, where the speech wavelength is

of the same order of magnitude as the mouth radius.
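
In element values, the parallel R-L load can be computed per opening.
In the sketch below the normalization by the characteristic impedance
ρc/(πa²) is an assumption of ours, consistent with Flanagan's
piston-in-a-sphere formulation [7] but not spelled out above.

    import numpy as np

    def radiation_load(a, rho=1.14e-3, c=3.5e4):
        """Parallel RR-LR radiation load at a circular opening of
        radius a (cm); C.G.S. units."""
        z0 = rho * c / (np.pi * a**2)   # characteristic acoustic impedance
        RR = (128.0 / (9.0 * np.pi**2)) * z0
        LR = (8.0 * a / (3.0 * np.pi * c)) * z0
        return RR, LR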



Our model is also able to account for the sound

pressure component that is radiated through the vibration of

the vocal cavity walls. The contribution to this component

from each elementary length of the vocal cavities is

represented as a voltage drop across a suitable impedance

[67] in series to the equivalent circuit of the yielding

wall defined in Section 3.1b.



3.4) Remarks and Other Acoustic Models

In this chapter we discussed the derivation of an

electrical circuit that models the sound propagation in the

vocal cavities. These considerations can be summarized in

Figure 3.9.

The vocal and nasal tracts are represented by two

circular pipes with non-uniform cross-section (plane wave

propagation assumption). Their equivalent circuit (nasal

and vocal tract networks in Figure 3.9) are made by a chain

of elementary circuits. Each circuit models the wave

propagation as a short length of cavity according to the

derivation of Sections 3.1.a, 3.1.b, and 3.2.c.

The two mass model of the vocal cords, its control

parameters Ag0 and Q, and the glottal impedance have been

treated in detail in Section 3.2.b.

Sections 3.2.a, 3.1.c, 3.3 have been concerned with the

subglottal pressure Ps, the velar coupling ZV and the




Figure 3.9. Equivalent circuit of the vocal cavities: the cord model
and the vocal and nasal tract networks, driven by the muscle force,
with control parameters Q(t) (cord tension), Ag0 (rest area), NC(t)
(nasal coupling), A(x,t) (area function), and Ps(t) (subglottal
pressure).


radiation impedances at the mouth and nostrils, which are

shown in Figure 3.9.

The approach to the acoustic modeling of the vocal

cavities that we have just reviewed is not the only one

reported in the literature. In Section 3.2.b we considered

the two mass models of the vocal cords. A more complete

model of vocal cord dynamics has been designed by Titze

[19]. He divided each cord into two vertical levels, one

level corresponding to the mucous membrane, the other to the

vocalis muscle. Each level was further divided into eight

masses which were allowed to move both vertically and

horizontally. We did not use this model because its

simulation is computationally more expensive than Flanagan's

two mass model.

Different acoustic models of the vocal tract were

designed by Kelly and Lochbaum [72] and Mermelstein [73].

The latter has been recently implemented for an articulatory

synthesizer [74].

However, these modeling approaches account for vocal

tract losses in a phenomenological way and they do not model

source tract interaction and fricative self excitation.



CHAPTER 4
NUMERICAL SOLUTION OF THE ACOUSTIC MODEL

4.1) Requirements of the Numerical Solution Procedure

The software implementation of the acoustic model of

the vocal cavities that we have derived in the previous

chapter requires the solution of a system of ordinary

differential equations with assigned initial values.

In general we will use the notation


    y'(t) = f(y(t), t)
                                                            (4.1.1)
    y(0) = y0


The numerical approach employed to solve the above problem

consists of approximating the solution y(t) as a sequence of

discrete points called mesh points. The mesh points are

assumed to be equally spaced and we indicate with h the time

interval between them. In other words, the numerical

integration procedure will give us a sequence of values y0,

y1, ..., yn which closely approximate the actual solution y(t)
at the times t0 = 0, t1 = h, ..., tn = nh.

In the area of ordinary differential equations the

first step toward the solution of the problem is the

selection of that particular technique among the many

available which will serve the solution best.













In our specific case the most stringent requirement is

the stability of the numerical method. Since the

integration of (4.1.1) is going to involve a large number of

mesh points we need a method which, for a sufficiently small

step size h, guarantees that the perturbation in one of the

mesh values yn does not increase in the subsequent values

ym, m > n.

In the following discussion, as in [75], we use a "test

equation"



    y'(t) = λ y(t)



where λ is a complex constant. We introduce the concept of

an absolute stability region, which is the set of real,

nonnegative values of h and λ for which a perturbation in a

value yn does not increase from step to step.
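
For a concrete instance, the classical 4th order Runge-Kutta method
(derived in the next section) applied to the test equation multiplies
yn at each step by the growth factor R(hλ) = 1 + hλ + (hλ)²/2! +
(hλ)³/3! + (hλ)⁴/4!, so absolute stability requires |R(hλ)| ≤ 1.
A minimal check of this condition in Python (an illustration, not
part of the dissertation's software):

    def rk4_stable(h, lam):
        """True if a perturbation does not grow for y' = lam*y under
        the classical 4th order Runge-Kutta method (lam may be
        complex)."""
        z = h * lam
        R = 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24   # growth factor
        return abs(R) <= 1.0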

When the stability requirement is satisfied, we should

select the fastest integration method for our particular

application. This second requirement is very important

since we are dealing with a large system of differential

equations (about 100 differential equations of the 1st

order) and it takes about five hours to generate one second

of synthetic speech on our Eclipse S/130 minicomputer.

The integration speed is directly related to the step

size h of the numerical integration method. The larger the












step size the faster the integration. Unfortunately, the

precision of the numerical solution decreases when the step

size is increased and the method may become unstable. The

program to solve (4.1.1) must therefore implement an

automatic procedure for changing the step size to achieve

the maximum integration speed compatible with precision and

stability requirements. A variable step size is

particularly convenient in our case. In fact the time

constants of (4.1.1) change with the shape of the vocal

cavities and the motion of the vocal cords. A variable

control of the step size allows one to obtain the maximum

integration speed which is allowed by the method and by the

time variant differential equations. In view of these

requirements we have considered both a 4th order Runge-Kutta method with control of the step size and a multistep method with control of the step size and order.

The Runge-Kutta method requires the computation of

derivatives at a higher rate than multistep methods.

However, it needs less overhead for each derivative

computation [75]. In the next two sections the

characteristics of Runge-Kutta and multistep (also called

predictor-corrector) methods will be considered.

Section 4.4 will give a comparative discussion leading

to the selection of the Runge-Kutta method. Also, we will

describe a modification of the Runge-Kutta method, which

exploits the fact that wall vibration effects have larger












time constants than the propagation in the cavities. This

allows the reduction of the number of first order equations

to be solved by this method by almost a factor of two.



4.2) Runge-Kutta Methods

4.2.a) Derivation of the Method

Runge-Kutta methods are stable numerical procedures for

obtaining an approximate numerical solution of a system of

ordinary differential equations given by


    y'(t) = f(y(t), t),    y(0) = y_0        (4.2.1)

The method consists of approximating the Taylor expansion



    y(t_0 + h) = y(t_0) + h y'(t_0) + (h^2/2!) y''(t_0) + ...        (4.2.2)



so that, given an approximation of the solution at time t_0, the solution at the next mesh point (t_0 + h) can be estimated.

To avoid the computation of higher order derivatives,

it is convenient to express (4.2.1) in integral form as


    y(t_0 + h) = y(t_0) + ∫_(t_0)^(t_0+h) f(y(t), t) dt        (4.2.3)

We can approximate the above definite integral by computing f(y(t),t) at four different points of the interval (t_0, t_0 + h), by defining


    K_1 = h f(y(t_0), t_0)

    K_2 = h f(y(t_0) + β K_1, t_0 + α h)
                                                                (4.2.4)
    K_3 = h f(y(t_0) + β_1 K_1 + γ_1 K_2, t_0 + α_1 h)

    K_4 = h f(y(t_0) + β_2 K_1 + γ_2 K_2 + δ_2 K_3, t_0 + α_2 h)


and then setting

    y(t_0 + h) − y(t_0) = ∫_(t_0)^(t_0+h) f(y(t), t) dt ≈ μ_1 K_1 + μ_2 K_2 + μ_3 K_3 + μ_4 K_4        (4.2.5)


The problem is now to determine the α's, β's, γ's, δ_2 and μ's so that (4.2.5) is equivalent to the Taylor expansion (4.2.2) up to the highest possible power of h. We substitute (4.2.5) into (4.2.3) and we choose the undefined parameters so that the powers h^i (i = 0, ..., 4) have the same coefficients as in (4.2.2).











We obtain a system of 8 equations in ten unknowns [76]

    μ_1 + μ_2 + μ_3 + μ_4 = 1

    μ_2 α + μ_3 α_1 + μ_4 α_2 = 1/2

    μ_2 α^2 + μ_3 α_1^2 + μ_4 α_2^2 = 1/3

    μ_2 α^3 + μ_3 α_1^3 + μ_4 α_2^3 = 1/4
                                                                (4.2.6)
    μ_3 α γ_1 + μ_4 (α γ_2 + α_1 δ_2) = 1/6

    μ_3 α^2 γ_1 + μ_4 (α^2 γ_2 + α_1^2 δ_2) = 1/12

    μ_3 α α_1 γ_1 + μ_4 (α γ_2 + α_1 δ_2) α_2 = 1/8

    μ_4 α γ_1 δ_2 = 1/24


which has two extra degrees of freedom that must be set

arbitrarily.

If we define α = α_1 = 1/2, the solution of (4.2.6) leads to the formulas of Kutta [76]. If α = 1/2 and δ_2 = 1 we obtain Runge's formula [76], which is equivalent to Simpson's rule











    ∫_(t_0)^(t_0+h) f(t) dt = (h/6) [f(t_0) + 4 f(t_0 + h/2) + f(t_0 + h)]

when y'(t) = f(t).

We use a calculation procedure derived from (4.2.6) by

Gill [76] which minimizes the memory requirements and allows

us to compensate the round off errors accumulated at each

step.
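As a concrete sketch (ours; it shows the standard fourth order step, not Gill's storage and round off optimized arrangement of the same arithmetic):

```python
def rk4_step(f, y, t, h):
    """One step of the standard 4th order Runge-Kutta formulas for
    y' = f(y, t), i.e. (4.2.4)-(4.2.5) with alpha = alpha_1 = 1/2,
    alpha_2 = 1 and weights mu = (1/6, 1/3, 1/3, 1/6), the solution
    of (4.2.6) that is equivalent to Simpson's rule."""
    k1 = h * f(y, t)
    k2 = h * f(y + 0.5 * k1, t + 0.5 * h)
    k3 = h * f(y + 0.5 * k2, t + 0.5 * h)
    k4 = h * f(y + k3, t + h)
    return y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
```

For the test equation y' = λy this step reproduces the Taylor series of e^(λh) through the h^4 term, which is the sense in which the method is of fourth order.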



4.2.b) Control of the Step Size with Runge-Kutta

In the previous derivation we have emphasized how the Runge-Kutta method approximates the Taylor expansion (4.2.2) up to the 4th power of h. It is therefore a fourth order method with a local truncation error of order h^5:

    (y^(5)(t_0) / 5!) h^5 + O(h^6)


This accuracy is obtained without explicitly computing the

derivatives of orders higher than one, at the expense of

four evaluations of the first derivative for each mesh

point. This is a disadvantage with respect to multistep

methods (to be discussed later) which use fewer

computations of the first derivative to obtain the same

truncation error. The number of derivative evaluations per step increases to an average of 5.5 when a variable control of the step size is used with Runge-Kutta.












The step size, h, should in fact be chosen so that the

local truncation error is less than a certain maximum

acceptable value specified by the user. Unfortunately, the

truncation error cannot be directly estimated because the

Runge-Kutta procedure does not provide any information about

higher order derivatives.

A practical solution [76] is based on the results of

numerical integration with steps h and 2h, respectively,

i.e., the computation is performed a first time using h_1 = h and then it is repeated using h_2 = 2h.

Let

    C_2 h_2^5 denote the truncation error using step h_2 = 2h,

    C_1 h_1^5 denote the truncation error using step h_1 = h,

    y_(2) denote the value "obtained" at (t_0 + 2h) using step h_2 = 2h,

    y_(1) denote the value "obtained" at (t_0 + 2h) using step h_1 = h twice,

    Y denote the true value of y at time (t_0 + 2h),

then

    Y − y_(2) = C_2 h_2^5

    Y − y_(1) = 2 C_1 h_1^5












But for small h, C_1 = C_2 (assuming that the sixth derivative of f(y(t),t) is continuous) and therefore we obtain the local truncation error estimate

    Y − y_(1) ≈ (y_(1) − y_(2)) / 15        (4.2.7)



If (4.2.7) is greater than a given tolerance, say ε_1, the increment h is halved and the procedure starts again at the last computed mesh point t_0. If it is less than ε_1, y_(1)(t_0 + h) and y_(1)(t_0 + 2h) are assumed correct. Furthermore, if it is less than ε_1/50, the next step will be tried with a doubled increment.

Unfortunately, on the average this method requires a

total of 5.5 function (derivative) evaluations as opposed to

four if the step size is not automatically controlled.
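The control logic just described can be sketched as follows (our illustration; it reuses the rk4_step routine from Section 4.2.a and simplifies the bookkeeping of the actual program):

```python
import numpy as np

def rk4_adaptive(f, y, t, h, eps1):
    """Advance from t to t + 2h with step size control by doubling.

    The local truncation error is estimated from one step of size 2h
    and two steps of size h, as in (4.2.7); h is halved on failure
    and doubled for the next step when the error is below eps1/50."""
    y_coarse = rk4_step(f, y, t, 2.0 * h)                  # one step, 2h
    y_fine = rk4_step(f, rk4_step(f, y, t, h), t + h, h)   # two steps, h
    err = float(np.max(np.abs(y_fine - y_coarse))) / 15.0  # estimate (4.2.7)
    if err > eps1:
        return rk4_adaptive(f, y, t, 0.5 * h, eps1)        # halve and retry
    h_next = 2.0 * h if err < eps1 / 50.0 else h           # maybe double
    return y_fine, t + 2.0 * h, h_next
```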



4.2.c) Order Selection

In Section 4.2.a we derived the 4th order Runge-Kutta

method but Runge-Kutta methods of different orders could be

derived as well.

This observation leads to the question, "How does the

choice of the order affect the amount of work required to

integrate a system of ordinary differential equations?".

For example, small approximation errors can be more

efficiently achieved with high order methods, while for low












accuracy requirements lower order methods are to be

preferred [75].

We would, therefore, like to have an automatic

mechanism for the selection of the order of the method.

This mechanism should evaluate the truncation error

corresponding to different integration orders and choose the

order which allows for the maximum step size and integration

speed compatible with the required precision.

Unfortunately, for the same reason discussed in

relation to the problem of step size control, namely the

absence of higher order derivative estimates, the Runge-

Kutta method does not provide an efficient procedure for

automatic order selection. Therefore we always use the

"standard" 4th order Runge-Kutta method.



4.3) Multistep Methods

4.3.a) Implicit and Explicit Methods

Those methods, like Runge-Kutta, which given an approximation of y(t) at t = t_(n-1) (say y_(n-1)) provide a technique for computing y_n ≈ y(t_n) are called one step methods. More general K-step methods require the values of the dependent variable y(t) and of its derivative at K different mesh points t_(n-1), t_(n-2), ..., t_(n-K) to approximate the solution at time t_n.












The well known rules of "forward differentiation", "backward differentiation" and the "trapezoidal rule" are one step methods. They will be automatically considered in the following discussion as particular cases.

The general expression of a multistep (K-step) method

is


    y_n = Σ_(i=1)^K (a_i y_(n-i) + β_i h y'_(n-i)) + β_0 h y'_n
                                                                (4.3.1)
    h y'_n = h f(y_n, t_n)



If β_0 is equal to zero the method is explicit because it provides an explicit way of computing y_n and h y'_n from the values of y and its derivatives at preceding mesh points. If β_0 is different from zero, then (4.3.1) defines an implicit multistep method because it is in general a non-linear equation involving the function f(y_n, t_n) that must be solved for the unknown y_n.



4.3.b) Derivation of the Methods

We have considered the Adams-Bashforth and Adams-

Moulton [75] methods which are respectively explicit and

implicit methods with












    a_1 = 1
                                                                (4.3.2)
    a_i = 0 if i ≠ 1


Both of these methods can be obtained from the integral

relation

    y(t_n) = y(t_(n-1)) + ∫_(t_(n-1))^(t_n) f(y(t), t) dt

The integral is estimated by approximating f(y(t),t) with an interpolating polynomial (for example, Newton's backward difference formula) through a number of known values at t = t_(n-1), t_(n-2), ..., t_(n-K) in the explicit case, or through the values at times t_n, t_(n-1), ..., t_(n-K) in the implicit case.

Therefore, for the explicit Adams-Bashforth case, the

equation (4.3.1) takes the form


    y_n = y(t_(n-1)) + h Σ_(i=1)^K β_(K,i) f(y(t_(n-i)), t_(n-i))        (4.3.4a)


or with the equivalent representation in terms of finite

differences


    y_n = y(t_(n-1)) + h Σ_(j=0)^(K-1) γ_j ∇^j f(y(t_(n-1)), t_(n-1))        (4.3.4b)












where the operator ∇ is defined by

    ∇^j f_m = ∇^(j-1) f_m − ∇^(j-1) f_(m-1)
                                                                (4.3.5)
    ∇^0 f_m = f_m


The values of y_n differ from the real solution y(t_n) by a local truncation error which is of order (K + 1) in the step size h

    Error_Adams-Bashforth = γ_K h^(K+1) y^(K+1)(t)        (4.3.6)

The values of the γ_j and β_(K,i) coefficients are available directly from the literature [75].

In the implicit Adams-Moulton case equation (4.3.1) takes the form

    y_n = y(t_(n-1)) + h Σ_(i=0)^K β*_(K,i) f(y(t_(n-i)), t_(n-i))        (4.3.7)

or with the equivalent representation in terms of backward differences

    y_n = y(t_(n-1)) + h Σ_(j=0)^K γ*_j ∇^j f(y(t_n), t_n)        (4.3.8)

where the ∇ operator has been defined in (4.3.5).













In (4.3.8) the value of y_n differs from y(t_n) by a local truncation error that is of the order K + 2 in the step size h.

    Error_Adams-Moulton = γ*_(K+1) h^(K+2) y^(K+2)(t)        (4.3.9)


The γ*'s and β*'s coefficient values are available from the

literature [75]. In particular the one step Adams-Bashforth

method corresponds to the forward differentiation rule while

the zero and one step Adams-Moulton methods are the backward

and trapezoidal rule respectively.



4.3.c) Characteristics of Multistep Methods

An important characteristic of multistep methods is

that they require only one computation of the derivative for

each step as can be seen from equation (4.3.1). This is a

great advantage over the Runge-Kutta method that requires at

least four computations of the function f(y(t),t), and it

has been the motivation for our experimentation with

multistep methods.

Another feature of multistep methods is that they allow

for a rather efficient implementation of automatic control

of the step size and order of the method itself.

A complete treatment of this subject would require too

long a discussion. Our intuitive explanation can be

obtained by observing that the step size and order selection












require an estimate of the local truncation error in terms

of different step sizes and integration orders. The order

that allows for the largest step size compatible with the

user defined upper limit for the truncation error is then

selected.

The local truncation error in the Adams-Bashforth and

Adams-Moulton methods (see (4.3.6) and (4.3.9)) is related

to high order derivatives which can be easily obtained in

terms of the same backward differences



    ∇^j h f(y(t_n), t_n) = ∇^j h y'_n = h^(j+1) y^(j+1)



that are used in the implementation of the method itself (see

(4.3.4) and (4.3.8)). A comparison between the explicit and

implicit multistep methods is necessary to complete this

discussion.

One difference between the methods is that the γ* coefficients of the implicit methods are smaller than the γ coefficients of the explicit methods. This leads to smaller

truncation errors for the same order for the implicit case

(see (4.3.6) and (4.3.9)).

Another advantage of the implicit methods is that

K-step methods have a truncation error of order (K+2) in the

step size h (see (4.3.9)) to be compared with a truncation

error of order (K+1) for the explicit method (see












(4.3.6)). The reason for this fact is evident when we consider that (4.3.7) has (K+1) coefficients β*_(K,i), i = 0, ..., K, while in the explicit method (4.3.4a) β_(K,0) has been set to zero.

A disadvantage of implicit methods is that the non-linear equation (4.3.7) in the unknown y_n must be solved iteratively. Usually a first "guess" of y_n is obtained by means of an explicit method and then (4.3.7) is iterated.

However, for reasonable values of the step size h, no more

than two or three iterations are usually required, and this

extra effort is more than compensated for by the better

stability properties of the implicit methods. In fact with

respect to the "test" equation y' = λy, the range of h

values for which implicit methods are stable is at least one

order of magnitude greater than in the explicit case [75].

Since the truncation errors of implicit methods are smaller,

the implicit methods can be used with a step size that is

several times larger than that of the explicit method. The

allowed increase in step size more than offsets the

additional effort of performing 2 or 3 iterations.
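A two step instance of this predictor-corrector scheme can be sketched as follows (our illustration; the actual implementation uses higher orders with automatic order and step size control):

```python
def adams_pc_step(f, y_prev, y, t, h, iterations=3):
    """One predictor-corrector step, with y_prev = y(t - h), y = y(t).

    Predictor: 2-step Adams-Bashforth, per (4.3.4a).
    Corrector: 2-step Adams-Moulton, per (4.3.7), iterated a small
    fixed number of times on the unknown y(t + h)."""
    f_prev, f_curr = f(y_prev, t - h), f(y, t)
    y_next = y + h * (1.5 * f_curr - 0.5 * f_prev)   # explicit first guess
    for _ in range(iterations):                       # implicit sweeps
        y_next = y + h * (5.0 * f(y_next, t + h)
                          + 8.0 * f_curr - f_prev) / 12.0
    return y_next
```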



4.4) Method Selection

We have implemented the numerical simulation of the

acoustic model of the vocal cavities by means of both Runge-

Kutta and implicit multistep methods [77]. The Runge-Kutta












method runs about 2 to 3 times faster than the Adams-Moulton

method. This fact is at first rather surprising since we

have seen in the previous two sections that the Runge-Kutta

method requires a larger number of derivative evaluations

for each integration step.

However, in our case the evaluation of the derivatives

is not extremely time consuming. In fact, with the

exception of four state variables describing the motion of

the two mass models, the remaining system of differential

equations is essentially "uncoupled" (or characterized by a

"sparse matrix"), thanks to the "chain" structure of the

vocal cavity acoustic model that can be immediately observed

from Figure 3.9. In these conditions the high number of

derivative evaluations of the Runge-Kutta method is more

than compensated for by the limited overhead in comparison

with Adams predictor-corrector method [75].

We have modified the integration method to take

advantage of the dynamic properties of the vibrating walls.

From Figure 3.8, which represents the equivalent

circuit of the ith elemental section of the vocal tract as

discussed in Chapter 3, we have


    dp_i(t)/dt = (1/C_i) (u_i(t) − u_(i+1)(t) − u_(wi)(t))        (4.4.1)

    du_i(t)/dt = (1/L_i) (p_(i-1)(t) − p_i(t) − VN_i − RN_i u_i(t))        (4.4.2)

    du_(wi)(t)/dt = (1/LW_i) (p_i(t) − v_(wi)(t) − RW_i u_(wi)(t))        (4.4.3)

    dv_(wi)(t)/dt = (1/CW_i) u_(wi)(t)        (4.4.4)




The first two equations represent the pressure and

volume velocity propagation in the ith elemental length of

the vocal tract while (4.4.3) and (4.4.4) model the wall

vibration effects.

The dynamics of u_(wi)(t) and v_(wi)(t) in (4.4.3) and (4.4.4) are characterized by the time constants (see Section 3.1.b)

    LW_i / RW_i = 1/(130 π),        1/√(CW_i LW_i) = 2π · 30

which are very large with respect to the time of the wave

propagation in each elemental length of the vocal tract.











In fact, if we divide the vocal tract into 20 elemental sections of approximately 0.875 cm each, the time of wave propagation in each is 2.5·10^-5 sec. This time gives the order of magnitude of the largest step size for the integration of equations (4.4.1) and (4.4.2), which in fact is usually achieved with variable step sizes between 2.5·10^-5 and 1.25·10^-5 sec.
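The figure follows directly from the section length and the sound velocity; assuming the customary value c = 35,000 cm/sec for the sound velocity in the vocal tract (our assumption, consistent with the 0.875 cm and 2.5·10^-5 sec figures above):

```latex
\Delta t = \frac{\Delta x}{c} = \frac{0.875\ \mathrm{cm}}{35\,000\ \mathrm{cm/sec}} = 2.5 \cdot 10^{-5}\ \mathrm{sec}
```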

On the other hand equations (4.4.3) and (4.4.4) may

employ a larger integration step.

We, therefore, integrate equations (4.4.1) and (4.4.2)

together with the two mass model equations using a Runge-

Kutta method with variable control of the step size and

assuming u_(wi)(t) in (4.4.1) constant during this procedure. Every 5·10^-5 seconds, i.e., at a frequency of 20 kHz, we update the values of u_(wi)(t) and v_(wi)(t) by means of a simple backward differentiation rule based on equations (4.4.3) and (4.4.4).

At this time we also update the turbulent noise sources VN_i and RN_i according to the Reynolds number of the flow in the cavity, as explained in Section 3.2.c, to provide a fricative excitation.

In this way we halve the number of derivatives that

must be computed by the Runge-Kutta method to account for

vocal tract propagation and we save about 50% of the

integration time.
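The split-rate organization can be sketched as follows (our pseudocode level illustration: the routine names, the state layout, and f_fast are hypothetical, and rk4_adaptive is the step control routine sketched in Section 4.2.b):

```python
def simulate(f_fast, update_walls, update_noise, y, walls, t_end, eps1,
             h0=1.25e-5):
    """Split-rate integration of the vocal tract model.

    The propagation equations (4.4.1)-(4.4.2) and the two mass model
    (state y) are advanced by adaptive Runge-Kutta with the wall
    variables u_wi, v_wi (state walls) frozen; every 5e-5 s (20 kHz)
    the walls are updated by a backward differentiation rule based on
    (4.4.3)-(4.4.4) and the noise sources VN_i, RN_i are refreshed."""
    T_WALL = 5e-5
    t, h, t_wall = 0.0, h0, T_WALL
    while t < t_end:
        # Never let the 2h advance of rk4_adaptive cross the wall update.
        step = min(h, 0.5 * (t_wall - t))
        y, t, h = rk4_adaptive(lambda yy, tt: f_fast(yy, walls, tt),
                               y, t, step, eps1)
        if t_wall - t < 1e-12:
            walls = update_walls(walls, y, T_WALL)  # backward differentiation
            update_noise(y)                         # Reynolds number check
            t_wall += T_WALL
            h = h0                                  # fresh step size estimate
    return y
```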











However, the numerical procedure is still correct, as

we will show in Chapter 6 where several effects associated

with cavity wall vibration are simulated.


















CHAPTER 5
THE ARTICULATORY MODEL AND ITS
INTERACTIVE GRAPHIC IMPLEMENTATION

The acoustic characteristics of the vocal cavities,

that we have modeled by means of an equivalent circuit in

Chapter 3, are greatly affected by the geometrical

configuration of the vocal cavities themselves. Therefore,

the acoustic and perceptual properties of speech depend on

the position of the lips, tongue, jaws and of the other

vocal organs that determine the shape of the vocal tract.

The physiological mechanism of speech production or

"articulation" involves precisely timed movements of the

vocal organs to produce the acoustic wave that we perceive

as connected speech.

This chapter is concerned with the definition and implementation of a geometric or "articulatory" model of the vocal tract that can be used to describe the configuration of the vocal cavities during speech production.



5.1) Definition of the Articulatory Model

All the articulatory models presented in the literature

[78-81] are two dimensional representations of the vocal

cavities which closely match the midsagittal section of the












vocal tract, even if they do not resolve individual

muscles. The articulatory models that have been designed by

Coker [78] and Mermelstein [79], are probably the most

suitable for speech synthesis applications, since their

configuration is determined by a small number of control

parameters.

Figure 5.1 shows the articulatory model that has been

designed by Mermelstein. We can distinguish between a fixed

and a movable structure of the model. The fixed structure

consists of the pharyngeal wall (segments GS and SR in Figure 5.1), the soft palate (arc VM), the hard palate (arc MN), and the alveolar ridge (segment NV).

The configuration of the movable structure is

determined by external control parameters, that we call

articulatory parameters and that are represented in

Figure 5.1 by arrows. For example, the tongue body is drawn

as the arc of a circle (PQ in Figure 5.1) whose position is

determined by the coordinates x and y of its center. Other

parameters are used to control the location of the tip of

the tongue, of the jaws, of the velum, of the hyoid bone and

the lip protrusion and width.

Mermelstein has shown that this model can match very

closely the midsagittal X-ray tracings of the vocal tract

that have been observed during speech production [82]. The

model of Figure 5.1 can therefore be used for speech







































Figure 5.1. Articulatory model of the vocal cavities.












synthesis if we can estimate the cross-sectional area of the

vocal tract (area function). The area function, in fact,

can be used to derive the equivalent circuit of the vocal

cavities, as discussed in Chapter 3.

In practical terms we superimpose a grid system, as

shown in Figure 5.2, on the articulatory model to obtain the

midsagittal dimensions of the vocal cavities at different

points, and then we convert this information into cross-

sectional area values by means of analytical relationships defined in the literature [79].

We use a variable grid system, dependent on the tongue body position, to make sure that each grid line in Figure 5.2 is always "almost" orthogonal to the vocal tract center line regardless of the model configuration; to further correct the unavoidable misalignment of each grid line, we multiply the cross-sectional area estimate by the cosine of the angle α (see Figure 5.2).
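A minimal sketch of the conversion (ours: the power law form and its constants below are illustrative stand-ins for the region dependent relationships of [79], which we do not reproduce):

```python
import math

def midsagittal_to_area(d_cm, misalignment_rad, alpha=1.6, beta=1.5):
    """Convert a midsagittal width d_cm, measured along one grid line,
    into a cross-sectional area estimate in cm^2.

    The power law A = alpha * d**beta stands in for the published
    relationships; the cosine factor corrects the residual
    misalignment between the grid line and the vocal tract center
    line, as described in the text."""
    return alpha * d_cm ** beta * math.cos(misalignment_rad)
```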



5.2) The Graphic Editor

Our computer implementation of the articulatory model

of the vocal cavities has been designed to be fast and easy

to use.

Traditionally, human-computer interaction employs

textual (alphanumeric) communication via on-line keyboard

terminals. This approach is satisfactory for many










































Figure 5.2. Grid system for the conversion of mid-sagittal
dimensions to cross-sectional area values.












applications but is being replaced by menu selection, joystick cursors, light pens, touch sensitive terminals, and other devices.

The conventional keyboard entry method is particularly

cumbersome if the data structure is not easily manipulated

via an alphanumeric selection process. Such an example

arises with pictorial or graphic images, as in computer

aided design. Here the user may communicate with the

computer by means of a graphic model. The system interprets

the model, evaluates its properties and characteristics, and

recognizes the user's changes to the model. The results are

presented graphically to the operator for further

interactive design and test.

Using a similar approach, we have implemented an "interactive graphic editor" on a Tektronix 4113 graphics terminal interfaced to a DGC Eclipse S/130 minicomputer; the editor is used to manipulate the articulatory model.

The user may alter the configuration of the model by

means of a simple interaction. Each articulatory parameter

of Figure 5.1, and the corresponding vocal organ's position,

can be set to the desired value by means of the graphic

cursor of the Tektronix 4113 terminal. This allows a rapid

definition of the desired articulatory configuration.

But the power of an interactive graphic system lies in

its ability to extract relevant information from the model












for further analysis and processing. The acoustic

properties of the vocal tract are determined, as discussed

in Chapter 3, by its area function, which is the cross-

sectional area of the cavity, as a function of the distance,

x, from the larynx.

When the user has defined a new articulatory parameter

value by means of the cross-hair cursor, the system

estimates the area function of the vocal tract by means of

the grid system of Figure 5.2. The resonance or formant

frequencies are also estimated. This information is

immediately displayed (see Figure 5.3) for the user as a

first order approximation of the acoustic properties of the

graphic model of the vocal tract.

The interaction cycle is shown in Figure 5.4. Commands

are available not only to modify the displayed vocal tract

shape but also to store it on and read it from disk memory.

These commands are useful not only to generate a "data base"

of vocal tract configurations, but also to create back up

files before using the interactive graphic commands.

But the interaction depicted in Figure 5.4 is not

sufficient to define a specific articulatory pattern. In

fact, the articulation of connected speech is a dynamic

process which consists of precisely timed movements of the

vocal organs. We have introduced the temporal dimension in

the system by means of an animation frame technique.






































Figure 5.3. The articulatory model implemented on the
Tektronix 4113 graphic terminal.




























Figure 5.4. Interaction cycle for the generation of an animated articulatory pattern. (a) Sign-on: the user indicates the name of the articulatory pattern. (b) Interaction cycle: the user modifies the displayed vocal cavity, and the system computes and displays the area function and the formant frequencies, until the user is satisfied. (c) The user issues the Store With Time (SWT) command.












When the user signs on (Figure 5.4a), he is asked to

indicate the name of the articulatory pattern that he wishes

to edit. The user may, at this point, create a new file or

modify an existing file. The user then interacts with the

model (Figure 5.4b) until he has achieved the desired vocal

tract configuration.

Next, a "Store With Time" (SWT) command is issued

(Figure 5.4c). This attaches a time label to the displayed

vocal tract configuration, which is also memorized as a

"frame" of the articulatory pattern that the user has

indicated at sign-on. Another interaction cycle is then

entered, which will lead to the definition of another frame.

The frames defined by the SWT command appear as

"targets" which must be reached at the indicated time. The

articulatory model is guided between consecutive targets by

means of an interpolation algorithm to achieve smooth

transitions. This is particularly important because the articulatory model of the vocal cavities must be interfaced with the speech synthesizer, and a continuous variation of the synthesizer input parameters is required to obtain good quality speech.
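A minimal sketch of the target guidance (ours; the interpolation algorithm is not spelled out at this point in the text, so the linear ramp below is only an illustrative choice):

```python
def interpolate_pattern(targets, t):
    """Guide the articulatory parameters between consecutive targets.

    targets: list of (time_label, parameter_list) pairs sorted by
    time, one per frame stored with the Store With Time command.
    Returns the parameter list interpolated at time t."""
    if t <= targets[0][0]:
        return list(targets[0][1])               # before the first target
    for (t0, p0), (t1, p1) in zip(targets, targets[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)             # 0 at t0, 1 at t1
            return [(1.0 - w) * a + w * b for a, b in zip(p0, p1)]
    return list(targets[-1][1])                  # hold the final target
```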



5.3) Display of the Time Variations of the Model

The interaction cycle described above generates an

articulatory pattern by means of an animation frame












technique, which is used as an input to the speech

synthesizer. However, during the interaction cycle, only

the particular frame being manipulated is visible on the

terminal display. Consequently, the user has difficulty

visualizing the global time varying characteristics of the

articulatory pattern. To overcome this disadvantage, we use

an on-line animation of the model.

The animation frames are computed by means of

interpolation and stored in the memory of the 4113 terminal

as graphic segments. Then each frame is briefly displayed

in sequence, creating the animation effect.

Figure 5.5 shows a typical animation frame. Here the

contour filling capability of the terminal is not used,

allowing higher display frequency. Using this technique, we

are able to obtain a live animation effect with only a

slight flickering phenomenon. The maximum frame display

frequency is about 5 Hz.

We may also view the movements of the vocal organs,

defined with the graphic editor, in three dimensions, as may

be seen in Figure 5.6. This effect is achieved by using

many consecutive animation frames as sections of a three

dimensional object, with the third dimension being time.

The advantage of this technique is that the time

evolution of the model can be observed at a glance;

moreover, a three dimensional rigid rotation allows the user




















































Figure 5.5. A typical animation frame.




















































Figure 5.6. Three dimensional views of the vocal tract.












to choose the most convenient view angle. Different colors (one every five frames) are used to mark specific time events.

Figure 5.6 also shows that the hidden lines have been

removed. This is achieved very efficiently by means of the

contour filling capability of the terminal. In this 3-D

representation all the frames belong to planes parallel to

each other. It is, therefore, very simple to determine for

a given angle of rotation, which frame is in "front" and

which one is "behind". To remove the hidden lines the

contour capability of the terminal is used with the "ink

eradicator" color, starting from the frame which is the

farthest from the observer.
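The procedure amounts to a painter's algorithm over parallel planes, as in the sketch below (ours; the frame objects and drawing primitives are hypothetical placeholders for the 4113 terminal operations):

```python
def draw_3d_view(frames, rotation, project, fill_opaque, draw_outline):
    """Hidden-line removal for frames lying in parallel planes.

    One depth value per frame (its time plane under the current
    rotation) fixes the occlusion order, so frames are drawn back to
    front: each is first filled with the opaque 'ink eradicator'
    color, erasing the lines of the frames behind it, then outlined."""
    ordered = sorted(frames, key=lambda fr: fr.depth(rotation), reverse=True)
    for frame in ordered:
        outline = project(frame, rotation)
        fill_opaque(outline)    # erase whatever this frame occludes
        draw_outline(outline)
```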



5.4) Simultaneous and Animated Display
of the Articulatory and Acoustic Characteristics
of the Vocal Cavities

When a certain articulatory pattern has been edited

with the interactive graphic model, we may estimate the

corresponding acoustic events, e.g., the speech waveform,

the pressure and air volume-velocity distribution in the

vocal cavities, the motion of the vocal cords, the vibration

of the cavity walls, etc. by means of the acoustic model

defined in Chapter 3.

Figure 5.7 shows the final configuration of the

system. The articulatory or "muscular" events generated by










































Figure 5.7. System configuration. The articulatory events generated by the interactive graphic editor are input into the acoustic model of the vocal cavities. Both articulatory and acoustic events are later displayed simultaneously with a computer generated movie.












the graphic model are the (off-line) input to the

synthesizer, which computes the corresponding acoustic

waveforms. Later a simultaneous and animated representation

of the articulatory and acoustic events is displayed on the

Tektronix 4113 terminal.

Figure 5.8 illustrates a typical frame of a sample

animation. The figure below the vocal tract model

represents the two vocal cords. Each cord (left and right)

is schematically represented by two masses as proposed by

Flanagan [67]. The graphs on the right part of the screen

represent different acoustic events over the same time

interval.

Both the vocal tract and vocal cord models are

animated. During the animation a sliding green line runs

along the borders of the graphs to "mark the time". This

assists the viewer in relating the information displayed in

the graphs to the vocal cord and the vocal tract motion. A

live animation effect is obtained in this manner.






















































Figure 5.8. A typical animation frame including the
vocal tract and the vocal cords. The
various data waveforms calculated by
the model are also shown.


















CHAPTER 6
SIMULATION RESULTS

6.1) Speech Synthesis

As explained in Chapter 2, linear prediction and

formant synthesis are based on a rather approximate model of

speech production. However, the quality of the synthetic

speech may be very good because the synthesis algorithm uses

the information derived from an analysis phase of natural

speech that captures the most important perceptual features

of the speech waveform.

On the contrary, articulatory synthesis employs a more

detailed model of the human speech production mechanism, but

cannot exploit a good analysis algorithm to derive the

articulatory information directly from natural speech.

In this section we want to show that the amount of

physiological detail captured by our articulatory and

acoustic models of the vocal cavities is sufficient for the

generation of good quality English sentences.

In fact, Figure 6.1 shows the spectrogram of the

sentence "Goodbye Bob" that we have synthesized with our

computer programs. The quality of this sample compares favorably with that of other synthesis techniques.









Figure 6.1. Spectrogram of "Goodbye Bob", synthetic.

Figure 6.2. Spectrogram of "Goodbye Bob", natural.













As the first step of the synthesis procedure we should

obtain the spectrogram of natural speech, which is shown in

Figure 6.2. We do not attempt to faithfully match the

synthetic spectrogram with its natural counterpart.

However, the natural spectrogram is useful to obtain a good

estimate of the required duration of each segment of the

synthetic sentence.

The articulatory information is obtained, in a rather

heuristic way, from phonetic considerations and from X-ray

data available in the literature [22] [82-83]. For example, we know that a labial closure is required for the production of the /b/ and /p/ consonants, or that the tongue position must be "low" and "back" for the production of the /a/ sound.

Using this linguistic knowledge, we can therefore use

the "graphic editor" described in Section 5.2 to define the

articulatory configurations that are necessary to synthesize

the desired sentence.

As described in Sections 3.2.a and 3.2.b, the subglottal and vocal cord models are controlled by three parameters: the glottal neutral area Ago, the cord tension Q, and the subglottal pressure Ps.

We set the glottal neutral area to 0.5 cm^2 or 0.05 cm^2 for the generation of unvoiced or voiced synthetic speech, respectively. The values of the cord tension and subglottal pressure can be estimated after a pitch and short time energy analysis of natural speech [67] [70].












The procedure for the definition of the time evolution

of the articulatory model that we have described above is,

however, rather "heuristic". After a first trial,

adjustments of the articulatory model configuration are

usually necessary to improve the quality of the synthetic

speech.

In our opinion a development of this research should be the definition of vocal tract evolution for different

English allophones, as a first step toward an automatic

speech synthesis by rule system based on an articulatory

model. The solution of this problem is not at all

trivial. Section 6.7 illustrates the difficulties and

reviews part of the literature related to this subject.



6.2) Source Tract Interaction

The classical source-tract speech production model that

we have discussed in Section 2.1 is based on the assumption

that the glottal volume velocity during speech production is

independent of the acoustic properties of the vocal tract.

Evidently this source-tract separability assumption holds

only as a first order approximation. In fact the glottal

volume velocity depends on the transglottal pressure that is

related to the subglottal and vocal tract pressure.

The effects of source-tract interaction have been

modeled and analyzed by Guerin [44], Rothenberg [42],













Ananthapadmanabha and Fant [43]. Yea [34] has carried out a

perceptual investigation of source tract interaction using

Guerin's model to provide the excitation of a formant

synthesizer [16].

Source tract interaction, which is well represented by

the model discussed in Chapter 3, can be discussed with

reference to Figure 6.3. The glottal volume velocity UG

depends not only on the subglottal pressure Ps and on the

glottal impedance ZG, that is varying during the glottal

cycle, but also on the vocal tract input impedance Zin'

Source-tract separability holds only if the magnitude of Zin

is much smaller than the magnitude of ZG, since in this case

ZG and Ps are equivalent to an ideal current generator.

Therefore the amount of source tract interaction depends on

the magnitude of ZG with respect to Zin.
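In lumped circuit terms (our restatement of Figure 6.3, valid only to first order since ZG varies within the glottal cycle):

```latex
U_G \;=\; \frac{P_s}{Z_G + Z_{in}} \;\approx\; \frac{P_s}{Z_G}
\qquad \text{when} \quad |Z_{in}| \ll |Z_G| ,
```

so that scaling ZG up, as in the experiment described next, reduces the relative influence of Zin on the glottal volume velocity.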

We have experimented with different amounts of source

tract interaction using the following procedure.

At first we have synthesized the word "goodbye" using

our model of the vocal tract and of the vocal cords. The

obtained glottal volume velocity is shown in the middle part

of Figure 6.4.

Then, to reduce source tract interaction, we have used

the same vocal tract configuration, but we have multiplied

by a factor of two the glottal impedance throughout the

entire synthesis of the word "goodbye". We have,

























Figure 6.3. Source-tract interaction model. Ps subglottal pressure. Ug glottal volume velocity. Zg glottal impedance. Zin vocal tract input impedance.

















Figure 6.4. Three different glottal excitations: too much, just right, and too little source-tract interaction.

Figure 6.5. Spectrogram of "good bye", synthetic.




Full Text
160
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
Stevens, K. N. ^ "Airflow and turbulence noise for
fricative and stop consonants: static
considerations," J. Acoust. Soc. Amer., vol. 50,
pp. 1180-1192, 1971.
Fant, G. Acoustic Theory of Speech Production,
Mouton and Co., The Hague, 1960.
Dunn, H. K. "The calculation of vowel resonances
and an electrical vocal tract," J. Acoust. Soc.
Amer., vol. 22, pp. 740-753, 1950.
Stevens, K. N., Kasowski, S., and Fant, G. M., "An
electrical analog of the vocal tract," J. Acoust.
Soc. Amer., vol. 25, pp. 734-742, 1953.
Stevens, K. N., and House, A. S., "Development of a
quantitative description of vowel articulation,"
J. Acoust. Soc. Amer., vol. 27, pp. 484-493, 1955.
Atal, B. S., and Hanauer, S. L., "Speech analysis
and synthesis by linear prediction of the speech
wave," J. Acoust. Soc. Amer., vol. 50, pp. 637-655,
1971.
Markel, J. D., and Gray, A. H., "On the
autocorrelation with application to speech
analysis," IEEE Trans. Audio Electroacoust.,
vol. 21, pp. 69-79, 1973.
Itakura, F., and Saita, S., "On the optimum
quantization of feature parameters in the PARCOR
speech synthesizer," Proceedings 1972 Conference
Speech Commun. Proces., Tokyo, pp. 434-437, 1972.
Rabiner, L. R., and Schafer, R. W., Digital
Processing of Speech Signal, Prentice Hall,
Englewood Cliffs, New Jersey, 1979.
Markel, J. D., and Gray, A. H., Linear Prediction
of Speech, Springer-Verlag, New York, 1976.
Schroeder, M. R., "Models of hearing," Proceedings
of the IEEE, vol. 63, pp. 1332-1350, 1975.
Makhoul, J., "Linear prediction: a tutorial
review," Proceedings of the IEEE, vol. 63, pp. 561-
580, 1975.


37
3.1.c) Nasal Coupling
During the production of nasal consonants and nasalized
vowels the nasal cavity is acoustically coupled to the oral
cavity by lowering the soft palate and opening the
velopharyngeal orifice.
The coupling passage can be represented as a
constriction of variable cross-sectional area and 1.5 cm
long [22] that can be modeled by an inductor (to account for
the air inertance) in series with a resistor (to account for
viscous losses in the passage).
The wave propagation in the nasal tract can be modeled
in terms of the nasal tract cross-sectional area as
discussed in Section (3.1.a) for the vocal tract. However,
we will use a different equivalent circuit to better account
for the nasal tract losses as we will discuss later in
Section 6.6.
3.2) Excitation Modeling
3.2.a) Subglottal Pressure
The source of energy for speech production lies in the
thoracic and abdominal musculatures. Air is drawn into the
lungs by enlarging the chest cavity and lowering the
diaphragm. It is expelled by contracting the rib cage and
increasing the lung pressure. The subglottal or lung
pressure typically ranges from 4 cm ^0 for the production


28
dp (x, t) P &u(x,t)
+ xjtr -~5t
0
(3.1.3a)
du(x,t) Vx) dp(x,t) d(A(x,t))
5x n 2 3t 5t
Pc
The partial differential equations (3.1.3) can be
approximated with a system of ordinary differential
equations. Let the acoustic pipe be Represented by a
sequence of N elemental lengths with circular and uniform
cross-sections A^, i = 1, . .N. This is equivalent to
approximating the area function A(x) using a stepwise method
as shown in Figure 3.2.
If each elemental length is sufficiently shorter than
the sound wavelength, we can suppose that the pressure and
volume velocity are independent of the position in the
elemental length itself. Instead of the functions p(x,t)
and u(x,t), we need to consider a finite number of time
functions only
p(t) u^t) ; i = 1,N
that represent the pressure and volume velocity in the ith
elemental section as a function of time. The partial
derivatives can now be approximated by finite differences


72
vocal tract, even if they do not resolve individual
muscles. The articulatory models that have been designed by
Coker [78] and Mermelstein [79], are probably the most
suitable for speech synthesis applications, since their
configuration is determined by a small number of control
parameters.
Figure 5.1 shows the articulatory model that has been
designed by Mermelstein. We can distinguish between a fixed
and a movable structure of the model. The fixed structure
consists of the pharyngeal wall (segments GS and SR in
Figure 5.1), the soft palate (arc VM), and hard palate (arc
MN) and the alveolar ridge (segment NV).
The configuration of the movable structure is
determined by external control parameters, that we call
articulatory parameters and that are represented in
Figure 5.1 by arrows. For example, the tongue body is drawn
as the arc of a circle (PQ in Figure 5.1) whose position is
determined by the coordinates x and y of its center. Other
parameters are used to control the location of the tip of
the tongue, of the jaws, of the velum, of the hyoid bone and
the lip protrusion and width.
Mermelstein has shown that this model can match very
closely the midsagittal X-ray tracings of the vocal tract
that have been observed during speech production [82]. The
model of Figure 5.1 can therefore be used for speech


83
Figure 5.6. Three dimensional views of the vocal tract.


124
consonant is different for the two cases. However, this
difference turns out to be perceptually indistinguishable.
The most relevant acoustic differences between the two
articulatory patterns appears to be the transition of the
second formant which starts ahead of consonantal closure
when "Ohman's" hypothesis is simulated. Since these two
samples of synthetic speech are perceptually equivalent, the
differences between the coarticulation patterns reported by
Ohman and Gay may be accounted for by differences between
speakers.
6.8) Interpretation of the EGG Data
With the Two Mass Model of the Vocal Cords
The relative inaccessibility of the larynx makes it
difficult to directly observe the vocal cord vibratory
motion in vivo. One must resort to various indirect
observation techniques such as ultrahigh speed photography,
photoglottography, ultrasound, electroglottography, X-ray
and inverse filtering of the acoustic speech.
Among these techniques electroglottography offers the
advantages of being non-invasive, inexpensive and simple to
use.
The electroglottograph is essentially an impedance
measuring device. It consists of two electrodes that are
placed on the opposite sides of the larynx, and of an R-F
modulator and detector. The electroglottograph (EGG)


Systea configuration.
The ariculatory events generated by the graphic
editor are input into the acoustic model of the
vocal cavities.
Both articulatory and acoustic events are later
displayed with a coaputer generated aovie.
Figure 5.7. System configuration


100
0 580 1000 1588 2008 2588 3888 3580 4888
HERTZ
Figure 6.7. Onset spectra of /B/, /D/, /G/ followed by /A/.


22
disadvantage, together with high computational requirements,
limits the use of articulatory synthesis for speech
applications which has been recently investigated by
Flanagan et al. [54].
In the following, Chapters 3 and 4 will discuss the
acoustic modeling of the vocal cavities and its
implementation with numerical simulation techniques.
Chapter 5 concentrates on the articulation model and the
computer graphic techniques used for its implementation.


103
Tz) =
K
I
i = l
R (z) = 1 -z
1+ Z *iZ
-i
Figure 6.9. Speech production model for glottal least
squares inverse filtering.


40
pressure forces them apart. This, however, increases the
air flow in the glottis and the consequent Bernoulli
pressure drop between the vocal cords approximates the vocal
cords again. In this way a mechanical "relaxation"
oscillator is developed which modulates the airflow from the
lungs into a quasiperiodic voiced excitation.
The vocal fold oscillation frequency determines
important perceptual characteristics of voiced speech and it
is called the "pitch" frequency or fundamental frequency,
F .
o
The first quantitative studies of the areodynamics of
the larynx were carried out by van den Berg et al. who made
steady flow measurements from plaster casts of a "typical"
larynx [60-61].
The first quantitative self-oscillating model of the
vocal folds was proposed by Flanagan and Landgraf [62] after
Flanagan and Meinhart's studies concerning source tract
interaction [63]. The fundamental idea was to combine van
den Berg's results with the 2nd order mechanical model of
the vocal cords shown in Figure 3.6. An acoustic-mechanic
relaxation oxcillator was obtained.
Bilateral symmetry was assumed and only a lateral
displacement x of the masses was allowed. Therefore, only a
2nd order differential equation was needed to describe the
motion of the mass


79
START
/
b <
Figure 5.4. Interaction cycle for the generation of
an animated articulatory pattern,
a sign on time, b interaction cycle,
c Store With Time command.


151
A.3.a) Glottal Termination
In the acoustic pipe model of the vocal tract the
transglottal pressure is, by definition, the difference
between the subglottal pressure Pg and the pressure in the
first tube of the model p-^(0,t). During speech production,
if we neglect the viscous losses, the transglottal pressure
consists of a kinetic component proportional to the square
-of the glottal volume velocity Ug(t) and of an inertive
component proportional to the time derivative of Ug(t).
Therefore
du (t) ~
Ps p^c^t) = L + a u (t) + V0 (A.3.1)
Vq in (A.3.1) is a random value which models the fricative
excitation at the glottis, occurring during aspirated
sounds. Observe that (A.3.1) accounts for source-tract
interaction since the glottal volume-velocity u (t) depends
on the pressure p^(0,t) in the vocal tract.
We need (A.3.1) to obtain a relationship between the
positive going and negative going pressure waves at the
beginning of the first tube of the vocal tract model. To
2
eliminate the square term u (t) from (A. 3.1) it is
convenient to use the identity
2 2
Ug(t) = (U (t-T) + (Ug(t) Ug(t-T))


68
dui(t)
dt
L.
i
- Pi(t)
VN. RN. u.(t))
l l l' '
(4.4.2)
^UWi^ ^ 1
dt = lw7 (Pi(t) vwi(t) RWi "wi^
(4.4.3)
dvWi(t) 1
dt
CW. Wi
l
uilT, (t)
(4.4.4)
The first two equations represent the pressure and
volume velocity propagation in the ith elemental length of
the vocal tract while (4.4.3) and (4.4.4) model the wall
vibration effects.
The dynamics of uwj.(t) and in (4.4.3) and
(4.4.4), are characterized by the time constants (see
Section 3.1.b)
LW.
l 1
RWi 130*7r'
/CW. LW.
l l
2 ir*30
which are very large with respect to the time of the wave
propagation in each elemental length of the vocal tract.


135
- We suggested a modification of the two mass models of
the vocal cords on the basis of EGG and ultra-high
speed laryngeal film observations.
7.2 Suggestions for Future Research
During this study we have identified two possible
directions for future research concerning the implementation
of the vocal cavity acoustic model and text to speech
synthesis with the articulatory model.
An important continuation of this research concerns the
definition of a more efficient algorithm to represent the
wave propagation in the vocal cavities. We are currently
investigating (see Appendix) a modification of the Kelly-
Lochbaum's algorithm [72] to be able to account for the
yielding characteristics of the vocal cavity walls and for
the fricative self excitation of the vocal tract. A
preliminary implementation runs an order of magnitude faster
than the now adopted Runge-Kutta method and it is amenable
for an array processor implementation.
Text to speech synthesis is a potential application of
articulatory synthesis. Perceptual research should be
devoted to the definition of an articulatory encoding method
of the English allophones. We have briefly reviewed the
problems concerned with this issue in Section 6.7. The
prosodic rules employed with other text to speech synthesis


65
require an estimate of the local truncation error in terms
of different step sizes and integration orders. The order
that allows for the largest stepsize compatible with the
user defined upper limit for the truncation error is then
selected.
The local truncation error in the Adams-Bashforth and
Adams-Moulton methods (see (4.3.6) and (4.3.9)) is related
to high order derivatives which can be easily obtained in
terms of the same backward differences
V^hf (y(tn),tn) = V^hy' = h-^^y ^ + ^
that are used in the implementation method itself (see
(4.3.4) and (4.3.8)). A comparison between the explicit and
implicit multistep methods is necessary to complete this
discussion.
One difference between the methods, is that the y*
coefficients of the implicit methods are smaller than the y
coefficient of the explicit methods. This leads to smaller
truncation errors for the same order for the implicit case
(see (4.3.6) and (4.3.9)).
Another advantage of the implicit methods is that
K-step methods have a truncation error of order (K+2) in the
step size h (see (4.3.9)) to be compared with a truncation
error of order (K+l) for the explicit method (see


82
Figure 5.5. A typical animation frame.


APPENDIX
AN EFFICIENT ARTICULATORY SYNTHESIS ALGORITHM
FOR ARRAY PROCESSOR IMPLEMENTATION
The computational burden for articulatory synthesis is
a major limitation for real time applications such as text
to speech synthesis.
We describe a new algorithm for the simulation of the
acoustic characteristics of the vocal cavities that is
currently being implemented. This algorithm offers several
advantages.
Similar to Flanagan's model of the vocal tract it can
1) accomodate the effects of yielding walls, 2) model the
fricative self excitation of the vocal cavities, 3) account
for source tract interaction and for the radiation load.
The losses in the vocal cavities are correctly modeled and
they are not phenomenologically represented as, for example,
with Kelly-Lochbaum's [72] or Mermelstein1s algorithms [73].
This algorithm has been designed as a modification of Kelly-
Lochbaum s procedure which models the wave propagation in
concatenated lossless tubes. Our algorithm runs an order of
magnitude faster than the currently employed Runge-Kutta
method. More importantly, this algorithm can easily take
advantage of parallel processing capabilities, as with array
processors, to further reduce the execution time.
137


CHAPTER
Page
5 THE ARTICULATORY MODEL AND ITS
INTERACTIVE GRAPHIC IMPLEMENTATION. 71
5.1) Definition of the Articulatory Model 71
5.2) The Graphic Editor 74
5.3) Display of the Time Variations of the Model....80
5.4) Simultaneous and Animated Display of the
Articulatory and Acoustic Characteristics
of the Vocal Cavities 84
6 SIMULATION RESULTS 88
6.1) Speech Synthesis 88
6.2) Source Tract Interaction 91
6.3) Onset Spectra of Voiced Stops 95
6.4) Glottal Inverse Filtering of Speech 99
6.5) Simulation of Wall Vibration Effects 105
6.5.a) Vocal Cords Vibration During Closure 105
6.5.b) Formant Shift 110
6.6) Pathology Simulation: Reduction of Sound
Intensity During Nasalization 112
6.7) Coarticulation ........117
6.8) Interpretation of the EGG Data with
the Two Mass Model of the Vocal Cords 124
7 CONCLUSIONS 133
7.1) Summary 133
7.2) Suggestions for Future Research 135
APPENDIX
AN EFFICIENT ARTICULATORY SYNTHESIS ALGORITHM
FOR ARRAY PROCESSOR IMPLEMENTATION 137
A.1) Wave Propagation in Concatenated
Lossless Tubes 138
A.2) Modifications of Kelly-Lochbaum Algorithm 141
A.2.a) Fricative Excitation .......141
A.2.b) Yielding Wall Simulation 146
A. 3) Boundary Conditions 150
A.3.a) Glottal Termination ...151
A.3.b) Radiation Load 153
A. 3 c) Nasal Coupling 155
REFERENCES 158
BIOGRAPHICAL SKETCH 169
iv


117
This phenomenon has been investigated by several
authors often with inconsistent results. Bernthal and
Benkelman [101] controlled the velopharyngeal orifice
condition in two normal subjects by means of prosthetic
appliances. To minimize speech intensity variations caused
by different vocal efforts the subglottal pressure was also
monitored. The observed intensity level reduction using a
.0.5 cm velar orifice ranged from 2 to 4 dB.
We have duplicated the same experiment using the just
discussed model of the nasal cavity acoustic losses and we
have obtained an intensity reduction of 4 dB which matches
the data reported by Bernthal.
Figure 6.16 shows the comparison of the speech sound
(vowel /a/) with and without nasal coupling from our
acoustic synthesizer.
6.7) Coarticulation
The relationship between articulation and the acoustic
properties of speech is the traditional field of study of
phoneticians and linguists. But during the last 20 years
this research has found applications in speech synthesis
[67] [78] and speech coding and transmission [54].
Unfortunately, not every aspect of speech articulation
has been fully understood. The problem is that connected
speech is perceived as a sequence of almost independent


143
r'vJUL'-
-vQJly-r-vMv-
Ajlib-i
Figure A.3. Lumped element equivalent circuit of two
adjacent lossless tubes.


TD A
140
p+Ct)
k
p+
p+it-'V)
k
-=>
t)
p,+ (t-r)
k + l
'K
flK+l
K
(t)
p-it+Y)
<-
p ft) p
FK+1 pK+1
(t+r)
Figure A.l. The Kth and (K+l)st tubes with positive and
negative going pressure waves.


Articulatory synthesis of speech represents the
acoustic properties of the vocal cavities by means of
modeling and numerical simulation techniques that are
reported in Chapters 3 and 4.
We have been able to guarantee the stability of the
numerical method and to halve the number of differential
equations that must be solved for the simulation of the
sound propagation in the vocal tract (Chapter 4).
In the Appendix we present a new and more efficient
algorithm for the simulation of the vocal cavity acoustics
which can be efficiently implemented with parallel
processing hardware.
Interactive graphic software (Chapter 5) has been
developed to represent the configurations of the vocal
cavities and to provide us with a convenient interface for
the manipulation of the geometric model of the vocal
cavities.
Chapter 6 employs the developed articulatory synthesis
system for the simulation of different aspects of speech
processing, for modeling speech physiology, and testing
theories of linguistics reported in the literature. We
discuss and illustrate such cases as source tract
interaction, EGG modeling, onset spectra of voiced stops at
consonantal release, the effects of yielding walls on
phonation, sound intensity reduction during nasalization,
and glottal least squares inverse filtering.
vi


29
Figure 3.2. Stepwise approximation of the area function.


120
Left to right coarticulation effects have been observed

by Ohman [105], Stevens et al. [109] and many other authors.
Kozhevnikov and Chistovich [110] tried to explain
coarticulation by speculating that consonant-vowel (CV) or
consonant-consonant-vowel (CCV) type syllables are the
minimum encoding units of speech that are transformed by the
articulatory system into articulatory gestures. This model
is not adequate to account for anticipatory effects observed
with nasal consonants (CWN) [108].
Ohman [80] [105] describes the articulation of vowel-
consonant-vowel (V^CVg) utterances as a dipthongal
transition from the first to the second vowel with the
consonantal gesture for vocal tract closure superimposed on
it.
More generally, MacNeilage [111] assumed a phoneme
sized input unit, a system of articulatory targets and a
closed loop control system which tries to achieve the
relative invariant motor goal from varying initial
positions.
Henke [112] and Moll and Daniloff [108] proposed
"feature based" models to explain the articulation
process. Each phoneme was characterized by a set of
articulatory features. Each feature is binary valued with
the possibility of a compatibility or "don't care"
condition. The assumption is that at every instant each
portion of the vocal apparatus is seeking a goal which is


73
Figure 5.1. Articulatory model of the vocal cavities.


105
by means of an autocovariance analysis [30] performed during
the closed phase.
The above procedure is based on the speech production
model of Figure 6.9 that is theoretically correct for the
production of vowels if the cavity walls are assumed rigid.
However Figure 6.10 shows the result of the inverse
filtering procedure performed on our synthetic speech when
the yielding wall properties are modeled.
The recovered glottal volume velocity is compared with
the actual glottal volume velocity generated during speech
synthesis by the two mass models of the vocal cords.
We can see that the two waveforms match almost
perfectly. We conclude that the vibrations of the vocal
tract walls do not appreciably affect the inverse filtering
result, especially in comparison with other sources of error
that are present when inverse filtering is applied to real
speech, such as ambient room noise, low frequency bias, and
tape recorder distortion [90].
6.5) Simulation of Wall Vibration Effects
6.5.a) Vocal Cords Vibration During Closure
According to the myoelastic-aerodynamic theory of
phonation the vocal cords oscillate when
1) they are properly adducted and tensed, and


Figure 3.5. Cut-away view of the human larynx (from [7]).
VC vocal cords. AC arytenoid cartilages.
TC thyroid cartilage.


p_K(x,t) = p_K^+(t − x/c) + p_K^−(t + x/c)

u_K(x,t) = (1/Z_K) (p_K^+(t − x/c) − p_K^−(t + x/c))        (A.1.1)

where ρ is the air density and Z_K = ρc/A_K is the characteristic impedance of the pipe. Therefore, when we represent the vocal tract by a series of concatenated uniform lossless pipes, we may use (A.1.1) to model the wave propagation in the Kth section. The relationship between the traveling waves in adjacent tubes can be obtained by applying the physical principle that pressure and volume velocity must be continuous in both time and space everywhere in the system. This provides boundary conditions that can be applied at both ends of each tube. Consider Figure A.1, which shows the junction between the Kth and the (K+1)st pipes with cross-sectional areas A_K and A_{K+1} respectively. Let Δx be the length of each tube and τ = Δx/c the wave propagation time.

Applying the continuity conditions of pressure and volume velocity at the junction gives

p_K(Δx,t) = p_{K+1}(0,t)

u_K(Δx,t) = u_{K+1}(0,t)        (A.1.2)


In (4.3.8) the value of y_n differs from y(t_n) by a local truncation error that is of order K + 2 in the step size h. The values of the α_i and β_i coefficients in (4.3.9) are available from the literature [75]. In particular, the one step Adams-Bashforth method corresponds to the forward differentiation rule, while the zero and one step Adams-Moulton methods are the backward and trapezoidal rules respectively.
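As an illustration of how an explicit Adams formula can be paired with an implicit one, the following Python sketch (illustrative names; the test problem y' = −y is arbitrary) applies a two step Adams-Bashforth predictor followed by a single trapezoidal (one step Adams-Moulton) correction:

    def ab2_am1_step(f, t, h, y, f_prev):
        f_n = f(y, t)
        y_pred = y + h * (1.5 * f_n - 0.5 * f_prev)       # 2-step Adams-Bashforth predictor
        y_corr = y + 0.5 * h * (f_n + f(y_pred, t + h))   # trapezoidal corrector, one fixed-point pass
        return y_corr, f_n

    f = lambda y, t: -y
    t, h, y = 0.0, 0.01, 1.0
    f_prev = f(y, t - h)          # crude start-up; a one step method would normally bootstrap
    for _ in range(100):
        y, f_prev = ab2_am1_step(f, t, h, y, f_prev)
        t += h
    print(y)                      # close to exp(-1)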
4.3.c) Characteristics of Multistep Methods
An important characteristic of multistep methods is
that they require only one computation of the derivative for
each step as can be seen from equation (4.3.1). This is a
great advantage over the Runge-Kutta method that requires at
least four computations of the functions f(y(t),t) and it
has been the motivation for our experimentation with
multistep methods.
Another feature of multistep methods is that they allow
for a rather efficient implementation of automatic control
of the step size and order of the method itself.
A complete treatment of this subject would require too long a discussion. An intuitive explanation can be obtained by observing that the step size and order selection


In its simplest form, speech communication is achieved
by modulating an electrical magnitude (for example, the
current in a transmission line) with the air pressure during
speech production. With this straightforward approach a
copy, in electrical terms, of the speech waveform can be
transmitted on a communication channel with a typical
bandwidth of about 3 kHz.
However, there appears to be a mismatch between the
information content of speech and the channel capacity. In
fact, the information content of written text may be estimated at about 50 bit/sec [7], while the channel capacity of a 3 kHz bandwidth with a typical signal-to-noise ratio is about 30,000 bit/sec. Similar bit rates are also
encountered in conventional PCM speech transmission. Even
though spoken speech contains more information (such as
intonation and stress) than its written counterpart, the
above mentioned mismatch indicates that a smaller channel
bandwidth can be used for a more efficient transmission of
speech. Using different tradeoffs between the
intelligibility and naturalness of speech transmission on
one side and bit rate on the other, engineers have been able
to transmit speech with bit rates varying from 150 to 30,000
bit/s [8-11].
The reduction of channel bandwidth has been obtained by
means of analysis-synthesis systems. Before transmission,


The spectral characteristics of the excitation are
further modified by the acoustic transfer function of the
vocal cavities. Sound transmission is more efficient at the
resonance frequencies of the supraglottal vocal system and,
therefore, the acoustic energy of the radiated speech sound
is concentrated around these frequencies (formant
frequencies). During the generation of connected speech,
the shape and acoustic characteristics of the vocal cavities
are continuously changed by precisely timed movements of the
lips, tongue and of the other vocal organs. This process of
adjustment of the vocal cavity shape to produce different
types of speech sounds is called articulation.
These considerations about speech physiology lead to
the simple but extremely useful source-tract model of speech
production [22], which has been explicitly or implicitly
used since the earliest work in the area of speech synthesis
[23-25]. This model is still employed in linear prediction
and formant synthesis. It consists (see Figure 2.2) of a
filter whose transfer function models the acoustic response
of the vocal cavities and of an excitation source that
generates either a periodic or a random signal for the
production of voiced or unvoiced sounds, respectively. The
operation of the source and the filter transfer function can be determined by external control parameters to obtain an output signal with the same acoustic properties as speech.


Figure A.7. Three port junction to model the acoustic
coupling between the oral and nasal tracts.


u_K(x,t) = u_{Kr}(x,t)

p_K(x,t) = p_{Kr}(x,t) − (ρc²/A_K) δA_K(t)        (A.2.4)

Equation (A.2.4) could also be explained by intuitive and physical reasoning.

Now it is useful to substitute (A.1.1) into (A.2.4) to obtain

p_K^+(t − x/c) = p_{Kr}^+(t − x/c) − (ρc²/2A_K) δA_K(t)

p_K^−(t + x/c) = p_{Kr}^−(t + x/c) − (ρc²/2A_K) δA_K(t)        (A.2.5)

Equations (A.2.5) suggest a simple modification of (A.2.2) to account for vocal wall vibrations. We can in fact add the terms −(ρc²/2A_K) δA_K(τ) and −(ρc²/2A_{K+1}) δA_{K+1}(τ) to V_{KM1} and V_{KM2} respectively for every evaluation of (A.2.2) (see also Figure A.5).

We must realize that δA_K(τ) represents the variation of the cross-sectional area during the time interval τ. Since the time constants of the yielding wall vibrations are much larger than the propagation time τ, we can use the value of δA_K(τ) that has been evaluated at the previous integration step.
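A sketch of this correction step follows; since (A.2.5) is reconstructed from a damaged printing, the exact coefficients below should be treated as assumptions, and all names are illustrative:

    # Apply the yielding-wall correction once per propagation step.
    RHO, C = 1.14e-3, 3.5e4   # assumed air density (g/cm^3) and sound speed (cm/s), CGS

    def apply_wall_correction(p_plus, p_minus, A, dA):
        # p_plus, p_minus: traveling pressure waves per section;
        # A: section areas; dA: wall area change over the previous step.
        for k in range(len(A)):
            corr = RHO * C * C * dA[k] / (2.0 * A[k])
            p_plus[k] -= corr
            p_minus[k] -= corr
        return p_plus, p_minus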




increased vocal cavity volume can in fact accommodate part
of the air volume velocity transmitted during the glottal
pulse and therefore it facilitates vocal cord vibration.
Cinefluorographic data supporting this hypothesis have been
reported by Perkell [82], Kent and Moll [96] and more
recently by Westbury [93].
For similar reasons the yielding properties of the
vocal tract walls may facilitate voicing [93]. As soon as
there is a pressure build-up behind the vocal tract
constriction, the volume of the vocal cavities is increased
and a certain amount of air can flow through the glottis to
"fill up" the extra volume.
We have confirmed this explanation with our computer
implementation of the vocal cords and vocal tract models.
Different yielding wall parameters corresponding to "tensed
cheeks", "relaxed cheeks" [57], and rigid walls have been
used.
Figure 6.11 shows the supraglottal pressure, vocal cord
motions and the glottal volume velocity before, during and
after closure with relaxed vocal cavity walls, and constant
subglottal pressure. Evidently during closure the average
value of the vocal tract pressure increases but still there
is enough pressure drop across the glottis to maintain a
reduced but sufficient glottal volume velocity for vocal
cord oscillation. It can also be observed that the


therefore, obtained the glottal waveform in the bottom of
Figure 6.4.
To increase source tract interaction (see top part of
Figure 6.4) we have reduced the glottal impedance by a
factor of two.
The glottal pulse with the greatest source-tract
interaction is characterized by a more pronounced first
formant ripple during the rising slope and by a steeper
final slope that increases the higher frequency components
of the glottal excitation [42].
The spectra of the glottal waveforms show that the greater source-tract interaction produces an increase of about 8 dB around 3 kHz. This supports the fact that
source/tract interaction can be used (for example by
singers) to modify the voice quality [45], [84].
The spectrogram of the synthetic "goodbye" obtained
with the "just right" source tract interaction is shown in
Figure 6.5.
The perceptual quality of the synthetic speech with "too much", "just right" and "too little" source tract interaction is judged to be only slightly different.
6.3) Onset Spectra of Voiced Stops
In the recent literature there has been considerable
interest in the correlation between place of articulation of


2) a sufficient transglottal pressure and glottal air flow are present.
Obviously the second condition is always met during the
production of vowels and sonorant consonants when the vocal
tract is vented to the atmosphere.
However, during a stop consonant, the closure of the
vocal tract blocks the air flow and causes the pressure drop
and volume velocity across the glottis to decrease. In
spite of this, vocal cord vibration is commonly observed
during the closure period of voiced stops [92].
Several factors may in fact contribute to vocal cord
oscillation during closure [93].
For example, the vocal cord tension could be decreased
to facilitate voicing. However there are no physiological
data showing that speakers make such an adjustment during
voiced stop closure.
To sustain an adequate transglottal pressure the
subglottal pressure could be increased or the nasopharyngeal
orifice could be partially opened during consonantal
closure. However, these possible voicing mechanisms appear
to be very unlikely in light of tracheal flow
measurements [94-95].
Another mechanism which would allow on-going
transglottal flow during vocal tract closure is a muscularly
activated enlargement of the supraglottal cavity. The


The coefficients C_K of (A.2.2) are functions of the cross-sectional areas A_K and A_{K+1}, of the loss resistance R_{K,K+1}, and of ρc. One can easily check that (A.2.2) becomes equivalent to (A.1.3) when R_K and V_K are equal to zero. The schematic representation of (A.2.2) is shown in Figure A.5.
Observe that the computation of the CK coefficients and
the evaluation of (A. 2.2) can be done in parallel for all
the junctions between uniform pipes. Equation (A.2.2) is
therefore efficiently implemented using an array processor.
A.2.b) Yielding Wall Simulation

Given certain conditions in the acoustic pipes at time t, the relationships (A.2.2) are used to calculate the wave propagation at time t + τ, as is schematically shown in Figure A.5. This calculation models the wave propagation and the fricative excitation in the lossless tubes. Here we introduce an additional computational step to account for the yielding wall effects during speech production.


[Block diagram: discrete input symbols drive a synthesis strategy, supported by stored rules and stored data, which produces the speech control signals for the speech synthesizer.]
Figure 1.1. Text to speech synthesis.


When the user signs on (Figure 5.4a), he is asked to
indicate the name of the articulatory pattern that he wishes
to edit. The user may, at this point, create a new file or
modify an existing file. The user then interacts with the
model (Figure 5.4b) until he has achieved the desired vocal
tract configuration.
Next, a "Store With Time" (SWT) command is issued
(Figure 5.4c). This attaches a time label to the displayed
vocal tract configuration, which is also memorized as a
"frame" of the articulatory pattern that the user has
indicated at sign-on. Another interaction cycle is then
entered, which will lead to the definition of another frame.
The frames defined by the SWT command appear as
"targets" which must be reached at the indicated time. The
articulatory model is guided between consecutive targets by
means of an interpolation algorithm to achieve smooth
transitions. This is particularly important when the articulatory model of the vocal cavities is interfaced with the speech synthesizer, since a continuous variation of the synthesizer input parameters is required to obtain good quality speech.
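The interpolation idea can be sketched in a few lines of Python; linear interpolation between target frames is used here purely for illustration, and the actual algorithm may differ:

    def interpolate(frames, t):
        # frames: list of (time, [articulatory parameters]) sorted by time.
        for (t0, p0), (t1, p1) in zip(frames, frames[1:]):
            if t0 <= t <= t1:
                w = (t - t0) / (t1 - t0)
                return [a + w * (b - a) for a, b in zip(p0, p1)]
        return frames[-1][1]      # clamp outside the defined range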
5.3) Display of the Time Variations of the Model
The interaction cycle described above generates an
articulatory pattern by means of an animation frame


We, therefore, use a slightly different approach
proposed by Sondhi [55]. He assumed that the vibrating
surface is proportional to the volume of the vocal cavities
and he estimated the mechanical properties of the vibrating
surface on the basis of acoustic measurements. The modeling
results match well with the direct measurements of the
mechanical impedance of human tissues performed by Ishizaka
et al. [57]. The model can be formulated as follows.

Adjust the inductive component LW_i in the ith elemental section to match the observed first formant frequency for the closed mouth condition of about 200 Hz [22] [58]:

LW_i = 0.0858 / (A_i Δx)

Next, adjust the wall loss component RW_i to match the closed glottis formant bandwidths [59]:

RW_i = 130 π LW_i

Finally, choose a value of CW_i to obtain a resonance frequency of the wall compatible with the direct measurements of Ishizaka et al. [57]:

CW_i = 1 / ((2π · 30)² LW_i)


Figure 6.3. Source-tract interaction model. P_s subglottal pressure. U_G glottal volume velocity. Z_G glottal impedance.


step size the faster the integration. Unfortunately, the
precision of the numerical solution decreases when the step
size is increased and the method may become unstable. The
program to solve (4.1.1) must therefore implement an
automatic procedure for changing the step size to achieve
the maximum integration speed compatible with precision and
stability requirements. A variable step size is
particularly convenient in our case. In fact the time
constants of (4.1.1) change with the shape of the vocal
cavities and the motion of the vocal cords. A variable
control of the step size allows one to obtain the maximum
integration speed which is allowed by the method and by the
time variant differential equations. In view of these
requirements we have considered both a 4th order Runge-Kutta method with control of the step size and multistep methods with control of the step size and order.
The Runge-Kutta method requires the computation of
derivatives at a higher rate than multistep methods.
However, it needs less overhead for each derivative
computation [75]. In the next two sections the
characteristics of Runge-Kutta and multistep (also called
predictor-corrector) methods will be considered.
Section 4.4 will give a comparative discussion leading
to the selection of the Runge-Kutta method. Also, we will
describe a modification of the Runge-Kutta method, which
exploits the fact that wall vibration effects have larger


2.1) Speech Physiology and the Source-Filter Model.
The acoustic and articulatory features of speech
production can be most easily discussed by referring to
Figure 2.1, which shows the cross-section of the vocal
apparatus.
The thoracic and abdominal musculatures are the
source of energy for the production of speech. The
contraction of the rib cage and the upward movement of the
diaphragm increase the air pressure in the lungs and expel
air through the trachea to provide an acoustic excitation of
the supraglottal vocal cavities, i.e., the pharynx, mouth
and nasal passage.
The nature of speech sounds is mostly determined by the
vocal cords and by the supraglottal cavities. The vocal
cords are two lips of ligament and muscle located in the
larynx; the supraglottal cavities are the oral and nasal
cavities that are vented to the atmosphere through the mouth
and nostrils.
Physically, speech sounds are an acoustic pressure wave
that is radiated from the mouth and from the nostrils and is
generated by the acoustic excitation of the vocal cavities
with the stream of air that is coming from the lungs during
exhalation.
An obvious and important characteristic of speech is
that it is not a continuous type of sound but instead it is


Figure 6.14. Spectra of two synthetic nasal murmurs.


stop consonants and the acoustic properties at consonantal
release.
Kewley-Port [85] and Searle et al. [86] have used a speech signal transformation based on peripheral auditory filters approximated by analog 1/3-octave filters to obtain a three dimensional running spectra display which is used for speech recognition.
Stevens [87] and Stevens and Blumstein [17], in their
investigation of onset spectra of stop consonant release
have determined characteristic patterns for each place of
articulation (labial, alveolar, and velar). They have also
tested the perceptual relevance of their hypothesis with a
formant synthesizer.
Here we discuss the spectral characteristics of
consonant release using our articulatory synthesizer. The
formant spectrum of vowels is completely determined by the
formant frequencies and bandwidths [16]. For example, with
a uniform vocal tract we typically have equally spaced
formants at 500, 1500, 2500, 3500 Hz and all of them have
the same spectral peak values if their bandwidths are the
same.
When the vocal tract becomes constricted, as occurs
during the production of stops, the first formant frequency
is always lowered while the behavior of the other formants
depends on the position of the constriction. For a labial


Figure 5.3. The articulatory model implemented on the Tektronix 4113 graphic terminal.


radiation impedances at the mouth and nostrils, which are
shown in Figure 3.9.
The approach to the acoustic modeling of the vocal
cavities that we have just reviewed is not the only one
reported in the literature. In Section 3.2.b we considered
the two mass models of the vocal cords. A more complete
model of vocal cord dynamics has been designed by Titze
[19]. He divided each cord into two vertical levels, one
level corresponding to the mucous membrane, the other to the
vocalis muscle. Each level was further divided into eight
masses which were allowed to move both vertically and
horizontally. We did not use this model because its
simulation is computationally more expensive than Flanagan's
two mass models.
Different acoustic models of the vocal tract were
designed by Kelly and Lochbaum [72] and Mermelstein [73].
The latter has been recently implemented for an articulatory
synthesizer [74].
However, these modeling approaches account for vocal
tract losses in a phenomenological way and they do not model
source tract interaction and fricative self excitation.


Figure 6.18. EGG signal, differentiated EGG and glottal area.
1-2 VOCAL FOLDS MAXIMALLY CLOSED. COMPLETE CLOSURE MAY NOT BE OBTAINED. FLAT PORTION IDEALIZED.
2-3 FOLDS PARTING, USUALLY FROM LOWER MARGINS TOWARD UPPER MARGINS.
3 WHEN THIS BREAK POINT IS PRESENT, THIS USUALLY CORRESPONDS TO FOLDS OPENING ALONG UPPER MARGIN.
3-4 UPPER FOLD MARGINS CONTINUE TO OPEN.
4-5 FOLDS APART, NO LATERAL CONTACT. IDEALIZED.
4-6 OPEN PHASE.
5-6 FOLDS CLOSING.
6 FOLD CLOSURE OCCURS ALONG LOWER OR CENTRAL MARGIN. COMPLETE CLOSURE MAY NOT OCCUR.
6-1 RAPID INCREASE IN VOCAL FOLD CONTACT.
Figure 6.19. Schematic model of the EGG signal.


These limitations of the source-tract model of speech
production can be overcome by the so called "articulatory"
synthesis that is based on a physiological model of speech
production. It consists (see Figure 2.4) of at least two
separate components,
1) an articulatory model that has as input the time
varying vocal organ positions during speech
production to generate a description of the
corresponding vocal cavity shape and
2) an acoustic model which, given a certain time
varying vocal cavity configuration, is capable of
estimating not only the corresponding speech
waveform but also the pressure and volume velocity
distribution in the vocal tract, the vibration
pattern of the vocal cords and of the vocal cavity
walls.
The strength of linear prediction and formant synthesis, namely the existence of analysis algorithms for natural speech, is, however, a weak point for articulatory
synthesis. Even if several methods for estimating the vocal
tract configuration are presented in the literature [46-51],
these procedures cannot be easily applied to all the
different types of speech sounds. This estimation is made
even more difficult to achieve by the fact that the acoustic
to articulatory transformation is not unique [52-53]. This


we assume a plastic collision between the right and left vocal cords, the length of vertical contact Δx of Figure 6.21 can be easily computed and the contact area is estimated as

A = L · Δx
Figure 6.22 shows the EGG signal and its derivative,
-estimated with the aid of this model, together with the
glottal area (opening area between the vocal cords).
This simple model of the EGG signal is a good indicator
of glottal opening and glottal closure but it does not
account for the different rates of variations at glottal
closure and glottal opening (compare with Figure 6.18).
An apparent limitation of the two mass model of the vocal cords that we have just used to account for the EGG signal is that it does not account for longitudinal or horizontal variations along the vocal cords. High speed films
of the vocal cords show that there exists a phase difference
along the length of the vocal cords during their vibration
and therefore during the closing (opening) phase. Contact
(opening) between the folds first occurs over a small
portion of their length.
"In succeeding frames [of ultrahigh speed laryngeal
film] this contact (opening) proceeds zipper-like
along the length of the folds until the whole
glottis is closed (open). This behaviour is more


We obtain a system of 8 equations in ten unknowns [76]:

Σ_i b_i = 1
Σ_i b_i c_i = 1/2
Σ_i b_i c_i² = 1/3
Σ_i b_i c_i³ = 1/4
Σ_{i,j} b_i a_{ij} c_j = 1/6
Σ_{i,j} b_i a_{ij} c_j² = 1/12
Σ_{i,j} b_i c_i a_{ij} c_j = 1/8
Σ_{i,j,k} b_i a_{ij} a_{jk} c_k = 1/24        (4.2.6)

which has two extra degrees of freedom that must be set arbitrarily.

If we set both free parameters equal to 1/2, the solution of (4.2.6) leads to the formulas of Kutta [76]. If they are set to 1/2 and 1, we obtain Runge's formula [76], which is equivalent to Simpson's rule.
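For reference, the classical fourth order formulas that result from one such choice of the free parameters can be written as the following Python sketch (names are illustrative):

    def rk4_step(f, y, t, h):
        # One step of the classical 4th order Runge-Kutta formulas.
        k1 = f(y, t)
        k2 = f(y + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(y + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(y + h * k3, t + h)
        return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)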


Figure 3.1) whose physical dimensions are completely
described by its cross-sectional area, A(x), as a function
of the distance x along the tube (area function). The sound
propagation can now be modeled by a one dimensional wave equation. If the losses due to viscosity and thermal conduction either in the bulk of the fluid or at the walls of the tube are neglected, the following system of differential equations accurately describes the wave propagation [29]:

−∂p(x,t)/∂x = (ρ / A(x,t)) ∂u(x,t)/∂t        (3.1.1a)

−∂u(x,t)/∂x = (1 / ρc²) ∂(p(x,t) A(x,t))/∂t + ∂A(x,t)/∂t        (3.1.1b)

where

x = displacement along the axis of the tube
p(x,t) = pressure in the tube as a function of time and displacement
u(x,t) = air volume velocity in the tube
A(x,t) = area function of the tube
ρ = air density
c = sound velocity.

The first and second equations of (3.1.1) correspond to Newton's law and the continuity law, respectively.


selected among the articulatory features of the input phonemic string by a scan ahead mechanism. The compatibility criterion allows a feature to be realized at
the articulatory level earlier than its parent phoneme if it
does not contradict the articulatory requirements of the
segment that is currently being produced. The advantage of
feature based models is their simplicity. The disadvantages
are an incomplete modeling of the timing of speech
articulation and a too rigid definition of the compatibility
principle [103].
A different approach to the explanation of speech
articulation has been proposed recently by Fowler [113].
His criticism of the above mentioned theories, which he calls extrinsic timing theories, is that they exclude time from the representation in the talker's articulatory plan and propose instead that an utterance is given coherence in time only by its actualization. Fowler hypothesizes that coarticulation should not be treated as an adjustment of the characteristic properties of a segment to its neighbors but should be viewed as the overlapping production (coproduction) of successive, continuous time segments.
Our articulatory synthesizer can be used to simulate
articulation patterns that correspond to different
coarticulation hypotheses.


Figure 6.10. Above: glottal volume velocity simulated by the acoustic model. Below: estimate of the glottal volume velocity obtained by inverse filtering of synthetic speech.


AN ARTICULATORY SPEECH SYNTHESIZER
BY
ENRICO LUIGI BOCCHIERI
A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1984


CHAPTER 4
NUMERICAL SOLUTION OF THE ACOUSTIC MODEL
4.1) Requirements of the Numerical Solution Procedure
The software implementation of the acoustic model of
the vocal cavities that we have derived in the previous
chapter requires the solution of a system of ordinary
differential equations with assigned initial values.
In general we will use the notation

y'(t) = f(y(t),t),    y(0) = y_0        (4.1.1)

The numerical approach employed to solve the above problem consists of approximating the solution y(t) as a sequence of discrete points called mesh points. The mesh points are assumed to be equally spaced and we indicate with h the time interval between them. In other words, the numerical integration procedure will give us a sequence of values y_0, y_1, ..., y_n which closely approximate the actual solution y(t) at the times t_0 = 0, t_1 = h, ..., t_n = nh.
In the area of ordinary differential equations the
first step toward the solution of the problem is the
selection of that particular technique among the many
available which will serve the solution best.


However, the numerical procedure is still correct, as
we will show in Chapter 6 where several effects associated
with cavity wall vibration are simulated.


CHAPTER 1
INTRODUCTION.
SPEECH SYNTHESIS APPLICATIONS. RESEARCH GOALS.
In the last 3-4 decades both the engineering and
medical communities have devoted considerable research
effort to the problem of speech synthesis, i.e., the
generation of voice by artificial, electrical or mechanical
means.
The earliest attempts to construct talking machines can
be traced to the late 18th century. One of the first speech
synthesis devices was Kempelen's talking machine [1-2] which
in a demonstration in Vienna in 1791 was capable of
imitating the sounds of vowels and of many consonants.
Perhaps the greatest motivation for speech synthesis
research came from the development of telecommunications and
from the consequent engineering interest in efficient
methods for speech transmission. Moreover, the recent
progresses in circuit integration, microprocessors and
digital computers have made the implementation of high
performance speech transmission systems technologically
feasible [3-6]. This type of application requires a scheme
known as speech synthesis by analysis.


method runs about 2 to 3 times faster than the Adams-Moulton
method. This fact is at first rather surprising since we
have seen in the previous two sections that the Runge-Kutta
method requires a larger number of derivative evaluations
for each integration step.
However, in our case the evaluation of the derivatives
is not extremely time consuming. In fact, with the
exception of four state variables describing the motion of
the two mass models, the remaining system of differential
equations is essentially "uncoupled" (or characterized by a
"sparse matrix"), thanks to the "chain" structure of the
vocal cavity acoustic model that can be immediately observed
from Figure 3.9. In these conditions the high number of
derivative evaluations of the Runge-Kutta method is more
than compensated for by the limited overhead in comparison with the Adams predictor-corrector method [75].
We have modified the integration method to take
advantage of the dynamic properties of the vibrating walls.
From Figure 3.8, which represents the equivalent circuit of the ith elemental section of the vocal tract as discussed in Chapter 3, we have

dp_i(t)/dt = (1/C_i) (u_i(t) − u_{i+1}(t) − uw_i(t))        (4.4.1)


This new procedure has been tested with the synthesis
of the sentence "Goodbye Bob". We have obtained the same
quality as with the more classical Runge-Kutta method. We
are now experimenting with nasal sounds and fricative
excitation.
The classical Kelly-Lochbaum algorithm, which we review in Section A.1, employs a system of concatenated lossless tubes to represent the wave propagation in the vocal tract. Section A.2 considers the algorithm
modifications that we have designed for a more realistic
model of the cavity wall characteristics and fricative
excitation. Section A.3 shows how the glottal termination,
the nasal coupling and the radiation load can be modeled
with our computational approach.
A.1) Wave Propagation in Concatenated Lossless Tubes

The wave propagation in a uniform lossless tube of cross-section A_K can be represented by the superposition of a positive-going pressure wave p_K^+(t − x/c) and a negative-going pressure wave p_K^−(t + x/c), where c is the sound velocity and x is the displacement along the tube. More specifically, the pressure and volume velocity in the uniform pipe are


speech analysis algorithms are used to extract relevant
information about the speech waveform. This information is
then encoded and transmitted (hopefully at low bit rates)
along the communication channel. At the receiver a
synthesis algorithm is used to reconstruct the speech
waveform from the transmitted information.
This synthesis by analysis process is useful not only
in voice communication systems; for example, in automatic
voice answering systems, words or sentences are stored for
successive playbacks. In addition synthesis by analysis can
be used to reduce the memory usage. Texas Instruments'
learning aid, Speak and Spell, is an example of this type of
application.
Synthesis by rule or text to speech synthesis is a
different type of application that has received considerable
attention lately [12-13]. In this case the problem is not
to "regenerate" synthetic speech after an analysis phase of
its natural counterpart. Instead synthetic speech is
automatically produced according to certain linguistic rules
which transform a string of discrete input symbols directly
into speech [14] (see Figure 1.1). Applications of text to
speech synthesis include reading machines for the blind
[15], automatic answering systems, and man-machine communication.
The medical community is interested in speech synthesis
systems for different reasons. Speech synthesizers are


TABLE OF CONTENTS
ACKNOWLEDGMENTS ii
ABSTRACT v

CHAPTER Page

1 INTRODUCTION: SPEECH SYNTHESIS APPLICATIONS.
RESEARCH GOALS 1

2 SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS 7
2.1) Speech Physiology and the Source-Filter Model 8
2.2) Linear Prediction 13
2.3) Formant Synthesis 16
2.4) Articulatory Synthesis 18

3 ACOUSTIC MODELS OF THE VOCAL CAVITIES 23
3.1) Sound Propagation in the Vocal Cavities 23
3.1.a) Derivation of the Model 23
3.1.b) Modeling the Yielding Wall Properties 32
3.1.c) Nasal Coupling 37
3.2) Excitation Modeling 37
3.2.a) Subglottal Pressure 37
3.2.b) Voiced Excitation 38
3.2.c) Unvoiced Excitation 45
3.3) Radiation Load 47
3.4) Remarks and Other Acoustic Models 48

4 NUMERICAL SOLUTION OF THE ACOUSTIC MODEL 51
4.1) Requirements of the Numerical Solution Procedure 51
4.2) Runge-Kutta Methods 54
4.2.a) Derivation of the Method 54
4.2.b) Control of the Step Size with Runge-Kutta 57
4.2.c) Order Selection 59
4.3) Multistep Methods 60
4.3.a) Implicit and Explicit Methods 60
4.3.b) Derivation of the Methods 61
4.3.c) Characteristics of Multistep Methods 64
4.4) Method Selection 66


BIOGRAPHICAL SKETCH
Enrico Luigi Bocchieri was born in Pavia, Italy, on
January 7, 1956. After completing high school in 1974 he
joined the University of Pavia where he received the
"Laurea" in Electrical Engineering in July 1979.
Since September 1979 he has been with the Department of
Electrical Engineering at the University of Florida,
receiving his M.S. degree in August 1981. After completing
his Ph.D. he plans to join Texas Instruments and work in the
area of speech algorithm development and implementation.


[65] Ishizaka, K. and Matsudaira, M., "What makes the vocal cords vibrate," Proceedings Sixth International Congress Acoust., vol. 2, pp. B9-12, 1968.
[66] Ishizaka, K. and Flanagan, J. L., "Synthesis of
voiced speech from a two mass model of the vocal
cords," Bell System Tech. J., vol. 51, pp. 1233-
1268, 1972.
[67] Flanagan, J. L. and Ishizaka, K., "Synthesis of
speech from a dynamical model of the vocal cords
and vocal tract," Bell System Tech. J., vol. 54,
pp. 484-506, 1975.
[68] Flanagan, J. L., and Ishizaka, K., "Automatic
generation of voiceless excitation in a vocal cord,
vocal tract speech synthesizer," IEEE Trans. Audio
Electroacoust., pp. 163-169, 1976.
[69] Flanagan, J. L. and Ishizaka, K. "Computer model
to characterize the air volume displaced by the
vibrating vocal cords," J. Acoust. Soc. Amer.,
vol. 63, pp. 1559-1566, 1978.
[70] Monsen, R. B. Engebretson, A. M. and Vemula,
N. R., "Indirect assessment of the contribution of
subglottal air pressure and vocal fold tension to
changes of fundamental frequency in English,"
J. Acoust. Soc. Amer., vol. 64, pp. 65-81, 1978.
[71] Flanagan, J. L. and Cherry, L., "Excitation of
vocal tract synthesizer," J. Acoust. Soc. Amer.,
vol. 45, pp. 764-769, 1969.
[72] Kelly, J. L. and Lochbaum, C. C., "Speech
synthesis," Proceedings Fourth International
Congress Acoust., pp. 1-4, 1962.
[73] Mermelstein, P., "Calculation of the vocal tract
transfer function for speech synthesis
applications," Proceedings Seventh International
Congress Acoust., pp. 173-176, 1971.
[74] Rubin, P., Baer, T., and Mermelstein, P., "An
articulatory synthesizer for perceptual research,"
J. Acoust. Soc. Amer., vol. 70, pp. 321-328,
1981.


Figure 3.3. Vocal tract elemental length and its equivalent
circuit.


Figure 3.4. Mechanical model of an elemental length of
the vocal tract with a yielding surface.


time constants than the propagation in the cavities. This
allows the reduction of the number of first order equations
to be solved by this method by almost a factor of two.
4.2) Runge-Kutta Methods
4.2.a) Derivation of the Method
Runge-Kutta methods are stable numerical procedures for
obtaining an approximate numerical solution of a system of
ordinary differential equations given by

y'(t) = f(y(t),t),    y(0) = y_0        (4.2.1)

The method consists of approximating the Taylor expansion

y(t_0 + h) = y(t_0) + h y'(t_0) + (h²/2) y''(t_0) + ...        (4.2.2)

so that, given an approximation of the solution at time t_0, the solution at the next mesh point (t_0 + h) can be estimated.

To avoid the computation of higher order derivatives, it is convenient to express (4.2.1) in integral form as

y(t_0 + h) = y(t_0) + ∫_{t_0}^{t_0+h} f(y(t),t) dt        (4.2.3)

We can approximate the above definite integrals by computing


of soft sounds to 20 cm H2O or more for the generation of
very loud, high pitched speech.
During speech the lung pressure is slowly varying in
comparison with the acoustic propagation in the vocal
cavities. We can, therefore, represent the lung pressure
with a continuous voltage generator whose value is
controlled in time by an external parameter Ps(t).
3.2.b) Voiced Excitation
During speech, the air is forced from the lungs through
the trachea into the pharynx or throat cavity. On top of
the trachea is mounted the larynx (see Figure 3.5), a
cartilaginous structure that houses two lips of ligament
and muscle called the vocal cords or vocal folds.
The vocal cords are posteriorly supported by the
arytenoid cartilages (see Figure 3.5). Their position and
the dimension of the opening between them (the glottis) can
be controlled by voluntary movements of the arytenoid
cartilages.
For the generation of voiced sounds the vocal cords are
brought close to each other so that the glottal aperture
becomes very small. As air is expelled from the lungs,
strong aerodynamic effects put the vocal cords into a rapid
oscillation. Qualitatively, when the vocal cords are close
to each other during the oscillation cycle, the subglottal


Our model is also able to account for the sound
pressure component that is radiated through the vibration of
the vocal cavity walls. The contribution to this component
from each elementary length of the vocal cavities is
represented as a voltage drop across a suitable impedance
[67] in series to the equivalent circuit of the yielding
wall defined in Section 3.1b.
3.4) Remarks and Other Acoustic Models
In this chapter we discussed the derivation of an
electrical circuit that models the sound propagation in the
vocal cavities. These considerations can be summarized in
Figure 3.9.
The vocal and nasal tracts are represented by two
circular pipes with non-uniform cross-section (plane wave
propagation assumption). Their equivalent circuit (nasal
and vocal tract networks in Figure 3.9) are made by a chain
of elementary circuits. Each circuit models the wave
propagation as a short length of cavity according to the
derivation of Sections 3.1.a, 3.1.b, 3.2.c.
The two mass model of the vocal cords, its control parameters A_g0 and Q, and the glottal impedance have been treated in detail in Section 3.2.b.

Sections 3.2.a, 3.1.c, and 3.3 have been concerned with the subglottal pressure P_s, the velar coupling Z_V, and the


[Block diagram: a sound source (voiced or unvoiced) g(t) excites the vocal tract transfer function h(t), giving the sound output s(t) = g(t)*h(t).]
Figure 2.2. The source-tract speech production model.


p_N^−(t−τ) = E_1 p_N^+(t−τ) + E_2 (u_R(t−τ) − u_N(Δx,t−τ)) + E_3

u_R(t) = (u_N(Δx,t) − u_N(Δx,t−τ) + u_R(t−τ)) E⁻¹        (A.3.7)

where the coefficients E_1, E_2, E_3, and E are determined by the radiation load elements R_R and L_R.
A.3.c) Nasal Coupling

We suppose that the nasal passage is coupled to the vocal tract model through a 3 port junction placed between the Vth and (V+1)st tubes of the vocal tract, as shown in Figure A.7. The cross-sectional areas of the pipes that are connected to the junction are A_V, A_{V+1} and A_N, and all three pipes have the same length Δx.

When the nasal passage is closed, A_N is equal to zero and the three port junction becomes equivalent to a two port


and substituting (A.1.1) into (A.1.2) gives, after some manipulations,

p_K^−(t+τ) = r_K p_K^+(t−τ) + (1 − r_K) p_{K+1}^−(t)

p_{K+1}^+(t) = (1 + r_K) p_K^+(t−τ) − r_K p_{K+1}^−(t)        (A.1.3)

where

r_K = (A_K − A_{K+1}) / (A_K + A_{K+1})        (A.1.4)

Equations (A.1.3) and (A.1.4) can be graphically represented by Figure A.2. The factor τ represents the propagation delay in the uniform tubes of the positive going and negative going pressure waves. The junction between adjacent pipes is represented by four multiplications as in the classical Kelly-Lochbaum algorithm.
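A direct transcription of (A.1.3) and (A.1.4) into a short Python sketch makes the four-multiplication structure explicit; the variable names are illustrative, not from the original software:

    def kl_junction(pK_plus, pK1_minus, A_K, A_K1):
        r = (A_K - A_K1) / (A_K + A_K1)                   # reflection coefficient (A.1.4)
        pK_minus = r * pK_plus + (1.0 - r) * pK1_minus    # reflected wave in pipe K
        pK1_plus = (1.0 + r) * pK_plus - r * pK1_minus    # transmitted wave into pipe K+1
        return pK_minus, pK1_plus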
A.2) Modifications of Kelly-Lochbaum Algorithm
A.2.a) Fricative Excitation
Figure A.3 shows the lumped element equivalent circuit
of two adjacent lossless pipes. Flanagan and Cherry [71]
suggested a modification of this equivalent circuit to
account for the fricative self-excitation of the vocal tract
as shown in Figure A.4. The resistor R_{K,K+1} and the random


The problem is therefore to estimate the transfer function T(z), i.e., the predictor coefficients a_i. When the estimate T_est(z) ≈ T(z) is known, the glottal volume velocity can be estimated by integrating the data S(z) to recover the mouth volume velocity u(n) and then by inverse filtering with T_est(z) to give

G_est(z) = S(z) / (T_est(z) (1 − z⁻¹))        (6.4.5)

The vocal tract transfer function can be deduced from the analysis of the speech waveform during vocal fold closure, when the glottal volume velocity is equal to zero. Then from (6.4.2) we have

u(n) + a_1 u(n−1) + ... + a_K u(n−K) = 0

and from (6.4.4) we get

s(n) + a_1 s(n−1) + ... + a_K s(n−K) = 0        (6.4.6)

Equation (6.4.6) shows that during the closed phase the speech waveform is a freely decaying oscillation that is determined by the predictor coefficients a_i and K initial values of s(i). The parameters a_i can be exactly estimated


for further analysis and processing. The acoustic
properties of the vocal tract are determined, as discussed
in Chapter 3, by its area function, which is the cross-
sectional area of the cavity, as a function of the distance,
x, from the larynx.
When the user has defined a new articulatory parameter
value by means of the cross-hair cursor, the system
estimates the area function of the vocal tract by means of
the grid system of Figure 5.2. The resonance or formant
frequencies are also estimated. This information is
immediately displayed (see Figure 5.3) for the user as a
first order approximation of the acoustic properties of the
graphic model of the vocal tract.
The interaction cycle is shown in Figure 5.4. Commands
are available not only to modify the displayed vocal tract
shape but also to store and read it from a disk memory.
These commands are useful not only to generate a "data base"
of vocal tract configurations, but also to create back up
files before using the interactive graphic commands.
But the interaction depicted in Figure 5.4 is not
sufficient to define a specific articulatory pattern. In
fact, the articulation of connected speech is a dynamic
process which consists of precisely timed movements of the
vocal organs. We have introduced the temporal dimension in
the system by means of an animation frame technique.


Figure A.2. Graphical representation of Kelly-Lochbaum's
algorithm.


A first disadvantage, which is inherent to the source-
filter model, is the assumption of separability between the
excitation and the acoustic properties of the vocal tract.
Clearly this assumption is not valid for the production of
fricative sounds in which the excitation depends on the
vocal tract constriction or for the generation of stops in
which the tract closure arrests the glottal air flow.
Source tract separability is a first order modeling
approximation even in the production of vowels as documented
by many recent papers concerning the nature of source tract
interaction [42-45]. We will address this issue in
Chapter 6 in more detail.
The second "disadvantage" (for our purpose) is that the
filter in the source-tract model (and also linear prediction
and formant synthesis) accounts only for the acoustic
input/output transfer function of the vocal cavities which
is estimated by means of analysis or "identification"
algorithms. For example, Linear Prediction and Formant
synthesis cannot model the aerodynamic and myoelastic
effects that determine the vocal cords vibration, the air
pressure distribution along the oral and nasal tracts and
the vibration of the vocal cavity walls.
In other words, linear prediction and formant synthesis define an algorithm for the generation of signals with the same acoustic features as natural speech, but they do not model the physiological mechanism of speech generation.


[Block diagram: a V/UV switch selects the excitation that drives the synthesis filter specified by the predictor parameters.]
Figure 2.3. Linear prediction speech production model.


Figure A.6. Boundary conditions at the lips or nostrils.


pronounced during the opening than closing phase."
[114].
We can model this behavior as in Figure 6.23, where the left and right vocal cords are not parallel but are "tilted" with an angle α, and the contact area is proportional to Δl.
Using this model modification with α equal to 0.2 and 2.7 respectively, we obtain the EGG signals of Figure 6.24, which more closely represent the actual EGG sample of Figure 6.18. The observation of ultrahigh speed laryngeal films has thus suggested a simple modification of the two mass model of the vocal cords to better represent the EGG data.
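The tilted contact idea can be sketched as follows; the geometry here is purely illustrative and does not reproduce the exact model of Figure 6.23:

    import math

    def contact_area(depth, cord_length, alpha):
        # depth: vertical contact depth at the closing end; alpha: tilt
        # angle between the cord edges (radians). Illustrative only.
        if depth <= 0.0:
            return 0.0
        l = cord_length if alpha <= 0.0 else min(cord_length, depth / math.tan(alpha))
        return 0.5 * l * depth    # triangular contact patch grows zipper-like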


The small shift at higher formant frequencies shown in
this figure is attributable to the finite resolution of the
autocovariance analysis used to obtain the spectrum.
6.6) Pathology Simulation. Reduction of Sound Intensity
During Nasalization
Physiologically, nasal sounds are generated by lowering
the soft palate and opening the velopharyngeal orifice so
that the nasal cavity communicates with the oral tract
during speech production. The spectral characteristics of
nasal sounds are therefore determined by the acoustic
interaction of the oral and nasal cavities.
Nasal sounds are differentiated by opening or closing
the mouth, e.g., nasalized vowels vs. nasalized consonants
(nasal murmurs). Nasalization usually occurs in those
vowels in the proximity of a nasal consonant and sometimes,
if the nasal consonant is in an unstressed position, the
nasal murmur may disappear and the vowel nasalization
remains the only clue for the existence of a nasal phoneme.
Nasal murmurs are radiated exclusively from the
nostrils. The pharyngeal and nasal cavities constitute the
sound passage that is shunted by the oral cavity. The
formant structure of nasal murmurs is very different from
other speech sounds and it is characterized by the existence
of zeroes (antiformants) that are caused by the shunting
effect of the oral cavity.


C_i = (A_i Δx) / (ρc²)

L_i = (ρ Δx) / A_i        i = 1, ..., N

The component Δx dA_i(t)/dt in equation (3.1.4b) that represents the vibration of the cavity walls is represented by the current in the impedance ZW_i, as will be shown in the next section.
3.1.b) Modeling the Yielding Wall Properties
From equation (3.1.4b) we can see that the effect of wall vibrations is to generate in the ith elemental section of the vocal tract an additional volume velocity component equal to

Δx dA_i(t)/dt

which is represented in Figure 3.3 by the current uw_i in the impedance ZW_i. We will now consider how ZW_i is related to the mechanical properties of the vibrating walls.
Consider Figure 3.4 which shows an elemental length of
the pipe in which one wall is allowed to move under the
forcing action of the pressure p(t) in the pipe itself. Let
m, k and d represent the mass, elastic constant and damping
factor of a unit surface. Since the total vibrating surface
is l Δx, and since we assume the walls to be locally
reacting, then the total mass, elastic constant and damping
factor of the vibrating wall are


As the first step of the synthesis procedure we should
obtain the spectrogram of natural speech, which is shown in
Figure 6.2. We do not attempt to faithfully match the
synthetic spectrogram with its natural counterpart.
However, the natural spectrogram is useful to obtain a good
estimate of the required duration of each segment of the
synthetic sentence.
The articulatory information is obtained, in a rather
heuristic way, from phonetic considerations and from X-ray
data available in the literature [22] [82-83]. For example,
we know that a labial closure is required for the production
of /b/ and /p/ consonant, or that the tongue position must
be "low" and "back" for the production of the /a/ sound.
Using this linguistic knowledge, we can therefore use
the "graphic editor" described in Section 5.2 to define the
articulatory configurations that are necessary to synthesize
the desired sentence.
As described in Sections 3.2.a and 3.2.b, the subglottal and vocal cord models are controlled by three parameters: glottal neutral area A_g0, cord tension Q and subglottal pressure P_s.

We set the glottal neutral area to 0.5 cm² or 0.05 cm² for the generation of unvoiced or voiced synthetic speech, respectively. The values of the cord tension and subglottal pressure can be estimated after a pitch and short time energy analysis of natural speech [67] [70].


Figure 3.9. Equivalent circuit of the vocal cavities.


technique, which is used as an input to the speech
synthesizer. However, during the interaction cycle, only
the particular frame being manipulated is visible on the
terminal display. Consequently, the user has difficulty
visualizing the global time varying characteristics of the
articulatory pattern. To overcome this disadvantage, we use
an on-line animation of the model.
The animation frames are computed by means of
interpolation and stored in the memory of the 4113 terminal
as graphic segments. Then each frame is briefly displayed
in sequence, creating the animation effect.
Figure 5.5 shows a typical animation frame. Here the
contour filling capability of the terminal is not used,
allowing higher display frequency. Using this technique, we
are able to obtain a live animation effect with only a
slight flickering phenomenon. The maximum frame display
frequency is about 5 Hz.
We may also view the movements of the vocal organs,
defined with the graphic editor, in three dimensions, as may
be seen in Figure 5.6. This effect is achieved by using
many consecutive animation frames as sections of a three
dimensional object, with the third dimension being time.
The advantage of this technique is that the time
evolution of the model can be observed at a glance;
moreover, a three dimensional rigid rotation allows the user


Figure 6.20. Top and cross-sectional views of the vocal cords model.

Figure 6.21. Plastic collision between the vocal cords.


Ananthapadmanabha and Fant [43]. Yea [34] has carried out a perceptual investigation of source tract interaction using Guerin's model to provide the excitation of a formant synthesizer [16].
Source tract interaction, which is well represented by the model discussed in Chapter 3, can be discussed with reference to Figure 6.3. The glottal volume velocity U_G depends not only on the subglottal pressure P_s and on the glottal impedance Z_G, which varies during the glottal cycle, but also on the vocal tract input impedance Z_in. Source-tract separability holds only if the magnitude of Z_in is much smaller than the magnitude of Z_G, since in this case Z_G and P_s are equivalent to an ideal current generator. Therefore the amount of source tract interaction depends on the magnitude of Z_G with respect to Z_in.
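At a single frequency this picture reduces to a current divider, U_G = P_s/(Z_G + Z_in). The following Python sketch (with placeholder impedance values, not measured data) shows how shrinking Z_G strengthens the interaction:

    def glottal_flow(P_s, Z_G, Z_in):
        # Single-frequency divider; all values below are placeholders.
        return P_s / (Z_G + Z_in)

    U_weak = glottal_flow(8000.0, 100.0 + 0j, 5.0 + 2.0j)    # |Z_in| << |Z_G|: nearly an ideal current source
    U_strong = glottal_flow(8000.0, 25.0 + 0j, 5.0 + 2.0j)   # smaller Z_G: stronger source-tract interaction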
We have experimented with different amounts of source
tract interaction using the following procedure.
At first we have synthesized the word "goodbye" using
our model of the vocal tract and of the vocal cords. The
obtained glottal volume velocity is shown in the middle part
of Figure 6.4.
Then, to reduce source tract interaction, we have used
the same vocal tract configuration, but we have multiplied
by a factor of two the glottal impedance throughout the
entire synthesis of the word "goodbye". We have,


applications but is being replaced by menu selection, joystick cursors, light pens, touch sensitive terminals, and other devices.
The conventional keyboard entry method is particularly
cumbersome if the data structure is not easily manipulated
via an alphanumeric selection process. Such an example
arises with pictorial or graphic images, as in computer
aided design. Here the user may communicate with the
computer by means of a graphic model. The system interprets
the model, evaluates its properties and characteristics, and
recognizes the user's changes to the model. The results are
presented graphically to the operator for further
interactive design and test.
Using a similar approach, we have implemented an "interactive graphic editor" on a Tektronix 4113 graphics terminal interfaced to a DGC Eclipse S/130 minicomputer; the editor is used to manipulate the articulatory model.
The user may alter the configuration of the model by
means of a simple interaction. Each articulatory parameter
of Figure 5.1, and the corresponding vocal organ's position,
can be set to the desired value by means of the graphic
cursor of the Tektronix 4113 terminal. This allows a rapid
definition of the desired articulatory configuration.
But the power of an interactive graphic system lies in
its ability to extract relevant information from the model


We will now consider an example motivated by a cinefluorographic study of the articulation of V1CV2 utterances reported by Thomas Gay [83], who observed that
the articulators that do not achieve consonantal closure
start the transition movements from the first to the second
vowel during the closure period of the intervocalic
consonant. This suggests that CV components of VCV
sequences may be organized as a basic unit. This hypothesis
is in agreement with the theory of Kozhevnikov and
Chistovich [110]. However, it does not match the
coarticulation model presented by Ohman [80].
According to Ohman the transition from the first to the
second vowel is essentially diphthongal, except for the
articulator effecting closure, and the onset time of all
articulator movements is before consonantal closure.
Therefore, Ohman's study implies an anticipatory effect from
the second to the first vowel of the utterance.
Using the articulatory data presented in [83], we have
simulated both hypotheses for the utterance /iba/. The
intervocalic /b/ consonant has been chosen because it does
not require the use of the tongue for consonantal closure.
The whole tongue is relatively free to coarticulate the /i/
to /a/ transition during the /b/ sound.
Figure 6.17 shows the spectrograms of the synthetic
utterances. The transition between the first vowel and


Figure 2.3 do not severely affect the perceptual properties
of the speech sound. In fact the human hearing system
appears to be especially sensitive to the magnitude of the
short-time spectrum of speech [31], that is usually
adequately approximated by the linear prediction transfer
function [32]. Perhaps the minimum phase approximation is
responsible for a characteristic "buzziness" [33] of the
synthetic speech. A great deal of research is being
dedicated to improve the quality of linear prediction
synthesis by using a more suitable excitation than impulse
sequences [34-35].
Linear prediction of speech is, therefore, most useful
in those applications that require a fully automated
synthesis by analysis process. Speech compression, linear
prediction vocoders, very low bit rate transmissions are
typical examples. Also linear prediction has application in
speech products where speech may be recorded for later
playback with stringent memory constraints.
2.3) Formant Synthesis
Similar to linear prediction, formant synthesis is
based on the source-tract speech production model. However,
in this case the filter that models the vocal tract is not
implemented by an all pole digital filter but it consists of
a number of resonators whose transfer function is controlled
by their resonance (formant) frequencies and bandwidths.


In fact, if we divide the vocal tract into 20 elemental sections of approximately 0.875 cm each, the time of wave propagation in each is 2.5 · 10⁻⁵ sec. This time gives the order of magnitude of the largest step size for the integration of equations (4.4.1) and (4.4.2) which, in fact, is usually achieved with variable step sizes between 2.5 · 10⁻⁵ and 1.25 · 10⁻⁵ sec.
On the other hand equations (4.4.3) and (4.4.4) may
employ a larger integration step.
We, therefore, integrate equations (4.4.1) and (4.4.2)
together with the two mass model equations using a Runge-
Kutta method with variable control of the step size,
assuming u_w(t) in (4.4.1) constant during this procedure.
Every 5·10⁻⁵ seconds, i.e., at a frequency of 20 kHz, we
update the values of u_w(t) and v_w(t) by means of a simple
backward differentiation rule based on equations (4.4.3) and
(4.4.4).
At this time we also update the turbulent noise source
VN_i and resistor RN_i according to the Reynold's number of
the flow in the cavity, as explained in Section 3.2.c, to
provide a fricative excitation.
In this way we halve the number of derivatives that
must be computed by the Runge-Kutta method to account for
vocal tract propagation and we save about 50% of the
integration time.
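The following Python fragment is a minimal sketch of this
two-rate scheme; the function names, the state partitioning
and the fixed inner step are illustrative assumptions rather
than the actual implementation:

    def rk4_step(f, y, t, h):
        # classical 4th order Runge-Kutta step for y' = f(y, t)
        k1 = f(y, t)
        k2 = f(y + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(y + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(y + h * k3, t + h)
        return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    def simulate(f_fast, update_walls, y, walls, t_end):
        # y:     state of (4.4.1)-(4.4.2) plus the two mass model
        # walls: wall variables of (4.4.3)-(4.4.4), frozen while
        #        the fast subsystem is advanced
        t, h = 0.0, 1.25e-5            # h would in fact vary adaptively
        while t < t_end:
            t_sync = t + 5.0e-5        # wall update every 5e-5 s (20 kHz)
            while t < t_sync:
                y = rk4_step(lambda q, s: f_fast(q, s, walls), y, t, h)
                t += h
            # backward differentiation update of the slow wall variables
            walls = update_walls(walls, y, 5.0e-5)
        return y, walls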


74
synthesis if we can estimate the cross-sectional area of the
vocal tract (area function). The area function, in fact,
can be used to derive the equivalent circuit of the vocal
cavities, as discussed in Chapter 3.
In practical terms we superimpose a grid system, as
shown in Figure 5.2, on the articulatory model to obtain the
midsagittal dimensions of the vocal cavities at different
points, and then we convert this information into cross-
sectional area values by means of analytical relationships
defined in the literature [79].
We use a variable grid system, dependent on the tongue
body position, to make sure that each grid in Figure 5.2 is
always "almost" orthogonal to the vocal tract center line
regardless of the model configuration; to further correct
unavoidable misalignments of each grid, we multiply the
cross-sectional area estimate by the cosine of the angle α
(see Figure 5.2).
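For illustration, a minimal Python sketch of this conversion,
assuming a power-law relation A = α d^β between midsagittal
distance d and cross-sectional area, of the general kind
fitted in [79]; the coefficients, distances and angles below
are placeholder values:

    import numpy as np

    def area_function(d, angle, alpha=1.0, beta=1.5):
        # d:     midsagittal distances measured on the grid (cm)
        # angle: angle between each grid line and the normal to the
        #        vocal tract center line (radians)
        d = np.asarray(d, dtype=float)
        area = alpha * d ** beta          # midsagittal-to-area conversion
        return area * np.cos(angle)       # correct residual misalignment

    print(area_function([0.5, 1.2, 2.0], np.radians([5.0, 10.0, 0.0])))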
5.2 The Graphic Editor
Our computer implementation of the articulatory model
of the vocal cavities has been designed to be fast and easy
to use.
Traditionally, human-computer interaction employs
textual (alphanumeric) communication via on-line keyboard
terminals. This approach is satisfactory for many


101
Figure 6.8. Glottal volume velocity (dashed line) and
air volume velocity displaced by the
vibrating walls of the vocal tract (solid
line) for vowels /a/ (above) and /i/
(below).


Synthesis and simulation results are presented in
Chapter 6. Chapter 7 provides a discussion of our findings
along with conclusions and suggestions for future research.


61
The well known rules of "forward differentiation",
"backward differentiation" and "trapezoidal rule" are one
step methods. They will be automatically considered in the
following discussion as particular cases.
The general expression of a multistep (K-step) method
is

    y_n = -Σ_{i=1}^{K} (α_i y_{n-i} - β_i h y'_{n-i}) + β₀ h y'_n,    h y'_n = h f(y_n, t_n)   (4.3.1)

If β₀ is equal to zero the method is explicit, because it
provides an explicit way of computing y_n and h y'_n from the
values of y and its derivatives at preceding mesh points.
If β₀ is different from zero, then (4.3.1) defines an
implicit multistep method, because it is in general a
nonlinear equation involving the function f(y_n, t_n) that
must be solved for the unknown y_n.
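As a concrete instance of (4.3.1), a minimal Python sketch of
the explicit two-step Adams-Bashforth method (α₁ = 1, β₀ = 0,
β₁ = 3/2, β₂ = -1/2), applied to the test problem y' = -y:

    import math

    def ab2(f, y0, t0, h, n_steps):
        # explicit 2-step Adams-Bashforth: beta_0 = 0 in (4.3.1)
        ys = [y0]
        fs = [f(y0, t0)]
        ys.append(y0 + h * fs[0])          # bootstrap with one Euler step
        fs.append(f(ys[1], t0 + h))
        for n in range(2, n_steps + 1):
            y = ys[-1] + h * (1.5 * fs[-1] - 0.5 * fs[-2])
            ys.append(y)
            fs.append(f(y, t0 + n * h))
        return ys

    print(ab2(lambda y, t: -y, 1.0, 0.0, 0.1, 10)[-1], math.exp(-1.0))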
4.3.b) Derivation of the Methods
We have considered the Adams-Bashforth and Adams-
Moulton [75] methods which are respectively explicit and
implicit methods with


15
During speech articulation the vocal cavity transfer
function is continuously changing and, ideally, the result
of the analysis of natural speech is a time varying estimate
of the linear predictor parameters (or of an equivalent
representation) as they change during speech production.
Also, the literature reports many algorithms which can
be applied to natural speech to extract the fundamental
(vocal cord oscillation) frequency and to perform the
voiced/unvoiced decision [29-30]. Some of them have been
implemented on special purpose integrated circuits for real
time applications [3] [5].
Therefore, linear prediction provides a complete
synthesis by analysis method in which all the control
parameters of Figure 2.3 can be derived directly from
natural speech.
The shortcoming of linear prediction is that the
transfer function (2.2.1) cannot properly model the
production of nasal, fricative and stop consonants. The all
pole approximation of the vocal tract transfer function is
in fact theoretically justified only for vowel sounds, and
even in this case the linear prediction model assumes a
minimum phase excitation signal.
In spite of these disadvantages, however, linear
prediction performs well for speech synthesis applications
because the approximations introduced by the model of


110
amplitude of oscillation of the vocal cords is slightly
increased during closure because of the larger pressure in
the tract.
If, in the simulation, the walls are assumed rigid (see
Figure 6.12) the pressure in the vocal tract during closure
soon reaches the subglottal pressure value and the glottal
volume velocity drops to zero.
The vocal cords, therefore, stop vibrating and they are
pushed apart by the subglottal and vocal tract pressure.
Vocal cord oscillation is resumed right after closure.
6.5.b) Formant Shift
An acoustic effect of the yielding properties of the
cavity walls is to increase the first formant frequency.
A simple rule which is given in the literature [55] is
    F₁' (Hz) = √(200² + F₁²)

where F₁ is the formant frequency of the rigid wall vocal
tract and F₁' is the formant frequency when the vibration
properties of the walls are accounted for.
Figure 6.13 shows the formant shift which can be
observed in our simulation when we use soft cavity walls.
The lower formant at 500 Hz is increased by about 50 Hz
which corresponds well to the formula above.
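For instance, a rigid wall first formant of F₁ = 500 Hz
yields F₁' = √(200² + 500²) ≈ 539 Hz, i.e., a predicted
shift of roughly 40 Hz, of the same order as the 50 Hz
observed in the simulation.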


125
detects the impedance variations of the larynx caused by the
vibration of the vocal cords.
The physiological interpretation of the EGG signal has
been recently investigated with the aid of a synchronized
data base consisting of EGG data, ultrahigh speed laryngeal
films and speech waveforms that were recorded from different
speakers performing various types of phonation tasks
[114]. From the analysis of these data it was concluded
that the EGG signal is indicative of the lateral area of
contact between the vocal cords and of the pitch
periodicity. Figure 6.18 shows the EGG signal, the
differentiated EGG data and the glottal area recorded
simultaneously during phonation [114]. We can see that the
fastest variations of the EGG signal occur at the instants
of vocal cord closure and opening. Figure 6.19 shows a
schematic model of the EGG signal [115]. The EGG waveform
has been simulated and interpreted also by Titze and Talkin
[116].
Here we shall show that the EGG data suggests a simple
modification of the two-mass model of the vocal cords.
At first we want to estimate the lateral contact of the
vocal folds with the aid of the two-mass model. We suppose
that the horizontal displacements y₁ and y₂ of the two
masses are indicative of the displacements of the upper and
lower edges of the vocal folds as shown in Figure 6.20. If


84
to choose the most convenient view angle. Different colors
(one every five frames) are used to mark the specific time
events.
Figure 5.6 also shows that the hidden lines have been
removed. This is achieved very efficiently by means of the
contour filling capability of the terminal. In this 3-D
representation all the frames belong to planes parallel to
each other. It is, therefore, very simple to determine, for
a given angle of rotation, which frame is in "front" and
which one is "behind". To remove the hidden lines the
contour capability of the terminal is used with the "ink
eradicator" color, starting from the frame which is the
farthest from the observer.
5.4) Simultaneous and Animated Display
of the Articulatory and Acoustic Characteristics
of the Vocal Cavities
When a certain articulatory pattern has been edited
with the interactive graphic model, we may estimate the
corresponding acoustic events, e.g., the speech waveform,
the pressure and air volume-velocity distribution in the
vocal cavities, the motion of the vocal cords, the vibration
of the cavity walls, etc. by means of the acoustic model
defined in Chapter 3.
Figure 5.7 shows the final configuration of the
system. The articulatory or "muscular" events generated by


87
Figure 5.8. A typical animation frame including the
vocal tract and the vocal cords. The
various data waveforms calculated by
the model are also shown.


41
Figure 3.6. One mass model of the vocal cords.


60
accuracy requirements lower order methods are to be
preferred [75].
We would, therefore, like to have an automatic
mechanism for the selection of the order of the method.
This mechanism should evaluate the truncation error
corresponding to different integration orders and choose the
order which allows for the maximum step size and integration
speed compatible with the required precision.
Unfortunately, for the same reason discussed in
relation to the problem of step size control, namely the
absence of higher order derivative estimates, the Runge-
Kutta method does not provide an efficient procedure for
automatic order selection. Therefore we always use the
"standard" 4th order Runge-Kutta method.
4.3) Multistep Methods
4.3.a) Implicit and Explicit Methods
Those methods, like Runge-Kutta, which given an
approximation of y(t) at t = t_{n-1} (say y_{n-1}) provide a
technique for computing y_n = y(t_n), are called one step
methods. More general K-step methods require the values of
the dependent variable y(t) and of its derivatives at K
different mesh points t_{n-1}, t_{n-2}, ..., t_{n-K} to
approximate the solution at time t_n.


98
Figure 6.6. Above: synthetic speech waveforms at the
release of /g/ with fricative (left) and
without fricative (right) excitation.
Below: spectra at the release of /g/
with fricative (solid line) and without
fricative (dashed line) excitation.


129
Figure 6.22. EGG signal, differentiated EGG signal and
glottal area estimated with the model of
Figure 6.20.


ACKNOWLEDGMENTS
I would like to express my sincere appreciation to my
advisory committee for their help and guidance throughout
this work.
I would like to give special thanks to my committee
chairman, Dr. D. G. Childers for his competent advice, for
his support, both financial and moral, and for providing an
educational atmosphere without which this research would
never have been possible.
Special gratitude is also expressed to Dr. E. R.
Chenette for his guidance, encouragement and financial
support.
To my parents and family I am forever indebted. Their
unceasing support and encouragement made it all possible and
worthwhile.
ii




166
[87] Stevens, K. N., "Acoustic correlates of some
phonetic categories," J. Acoust. Soc. Amer.,
vol. 68, pp. 836-842, 1980.
[88] Fant, G., Speech Sounds and Features, MIT Press,
Cambridge, Massachusetts, 1973.
[89] Rothenberg, M. R. and Zahorian, S., "Nonlinear
inverse filtering technique for estimating the
glottal area waveform," J. Acoust. Soc. Amer.,
vol. 61, pp. 1063-1071, 1977.
[90] Wong, D. Y., Markel, J. D., and Gray, A. H., "Least
squares inverse filtering from the acoustic speech
waveform," IEEE Trans. Acoust. Speech Signal
Proces., vol. 27, 1979.
[91] Rothenberg, M., "Some relations between glottal air
flow and vocal fold contact area," Proceedings of
the Conference on the Assessment of Vocal
Pathology, Bethesda, Maryland, pp. 88-96.
[92] Coker, C. H., and Umeda, N., "The importance of
spectral detail in initial-final contrasts of
voiced stops," J. of Phonetics, vol. 3, pp. 63-68,
1975.
[93] Westbury, J. R., "Enlargement of the supraglottal
cavity and its relation to stop consonant voicing,"
J. Acoust. Soc. Amer., vol. 73, pp. 1322-1336,
1983.
[94] McGlone, R. E., and Shipp, T., "Comparison of
subglottal air pressure associated with /p/ and
/b/," J. Acoust. Soc. Amer., vol. 51, pp. 664-665,
1972.
[95] Lubker, J. F., "Transglottal air flow during stop
consonant production," J. Acoust. Soc. Amer.,
vol. 53, pp. 212-215, 1973.
[96] Kent, R. D., and Moll, K. L., "Vocal tract
characteristics of the stop cognates," J. Acoust.
Soc. Amer., vol. 46, pp. 1549-1555, 1969.
[97] House, A. S., and Stevens, K. N., "Analog studies
of the nasalization of vowels," J. of Speech and
Hearing Disorders, vol. 21, pp. 218-232, 1956.


86
the graphic model is the (off-line) input to the
synthesizer, which computes the corresponding acoustic
waveforms. Later a simultaneous and animated representation
of the articulatory and acoustic events is displayed on the
Tektronix 4113 terminal.
Figure 5.8 illustrates a typical frame of a sample
animation. The figure below the vocal tract model
represents the two vocal cords. Each cord (left and right)
is schematically represented by two masses as proposed by
Flanagan [67]. The graphs on the right part of the screen
represent different acoustic events over the same time
interval.
Both the vocal tract and vocal cord models are
animated. During the animation a sliding green line runs
along the borders of the graphs to "mark the time". This
assists the viewer in relating the information displayed in
the graphs to the vocal cord and the vocal tract motion. A
live animation effect is obtained in this manner.


52
In our specific case the most stringent requirement is
the stability of the numerical method. Since the
integration of (4.1.1) is going to involve a large number of
mesh points we need a method which, for a sufficiently small
step size h, guarantees that a perturbation in one of the
mesh values y_n does not increase in the subsequent values
y_m, m > n.
In the following discussion, as in [75], we use a "test
equation"

    y'(t) = λ y(t)

where λ is a complex constant. We introduce the concept of
an absolute stability region, which is the set of real,
nonnegative values of h and λ for which a perturbation in a
value y_n does not increase from step to step.
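As a concrete instance of this criterion, a short Python
check for the forward (explicit) Euler rule, whose absolute
stability region on the test equation is the set where
|1 + hλ| ≤ 1 (Euler's rule is used here only as the simplest
example):

    def euler_stable(h, lam):
        # a perturbation e_n propagates as e_{n+1} = (1 + h*lam) * e_n,
        # so it does not grow iff |1 + h*lam| <= 1
        return abs(1.0 + h * lam) <= 1.0

    print(euler_stable(0.1, -10.0))   # True:  h*lam = -1
    print(euler_stable(0.3, -10.0))   # False: h*lam = -3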
When the stability requirement is satisfied, we should
select the fastest integration method for our particular
application. This second requirement is very important
since we are dealing with a large system of differential
equations (about 100 differential equations of the 1st
order) and it takes about five hours to generate one second
of synthetic speech on our Eclipse S/130 minicomputer.
The integration speed is directly related to the step
size h of the numerical integration method. The larger the


45
The acoustic synthesizer that we have implemented uses
the two mass model to provide voiced excitation. We
therefore account for source tract interaction since the
current in the equivalent circuit of the glottis (see
Figure 3.7) is dependent on the voltage p₁ that models the
pressure in the vocal tract just above the vocal cords.
3.2.c) Unvoiced Excitation
Speech sounds are generally excited by modulating the
air flow through a constriction of the glottal and
supraglottal system. For voiced sounds this modulation is
obtained through rapid changes of the glottal constriction
as explained in the review section. For fricative sounds,
the modulation comes from flow instabilities which arise by
forcing the air through a constriction with a sufficiently
high Reynold's number. In this case the classical
hypothesis of separability between the source and the tract
greatly limits the realism that can be incorporated into the
synthesizer. In fact, unvoiced excitation is greatly
dependent on the constricted area of the vocal tract itself.
The fricative self-excitation of the vocal cavities was
first modeled by Flanagan and Cherry [71]. The idea was to
use a resistor RN_i and a noise generator VN_i in the
equivalent circuit of the ith elemental length of the vocal
tract (see Figure 3.8). The values of the resistor and of the noise


94
Figure 6.4. Three different glottal excitations.
Figure 6.5. Spectrogram of "good bye", synthetic.


167
[98] Hecker, M. H., "Studies of nasal consonants with an
articulatory speech synthesizer," J. Acoust. Soc.
Amer., vol. 34, pp. 179-188, 1962.
[99] Fujimura, O., "Analysis of nasal consonants,"
J. Acoust. Soc. Amer., vol. 34, pp. 1865-1875,
1962.
[100] Morris, H. L., "Etiological bases for speech
problems," in D. Spriestersbach and D. Sherman
(eds.), Cleft Palate and Communication, Academic
Press, New York, pp. 119-168, 1968.
[101] Bernthal, J. E. and Beukelman, D. R., "The effect
of changes in velopharyngeal orifice area on vowel
intensity," Cleft Palate Journal, vol. 14,
pp. 63-77, 1977.
[102] Daniloff, R. G., and Hammarberg, R. E., "On
defining coarticulation," J. of Phonetics, vol. 1,
pp. 239-248, 1973.
[103] Kent, R. D. and Minifie, F. D., "Coarticulation in
recent speech production models," J. of Phonetics,
vol. 5, pp. 115-133, 1977.
[104] Daniloff, R., and Moll, K., "Coarticulation of lip
rounding," J. of Speech and Hearing Research,
vol. 11, pp. 707-721, 1968.
[105] Ohman, S. E. G., "Coarticulation in VCV utterances:
spectrographic measurements," J. Acoust. Soc.
Amer., vol. 39, pp. 151-168, 1966.
[106] Carney, P. J., and Moll, K. L., "A
cinefluorographic investigation of fricative
consonant-vowel coarticulation," Phonetica,
vol. 23, pp. 193-202, 1971.
[107] Amerman, J. D., Daniloff, R. and Moll, K., "Lip
and jaw coarticulation for the phoneme /ae/," J. of
Speech and Hearing Research, vol. 13, pp. 147-161,
1970.
[108] Moll, K. and Daniloff, R., "Investigation of the
timing of velar movements during speech,"
J. Acoust. Soc. Amer., vol. 50, pp. 678-684, 1971.


123
Figure 6.17. Spectrograms of /iba/ with the simulation
of Ohman's coarticulation model (above)
and Gay's hypothesis (below).


CHAPTER 7
CONCLUSIONS
7.1) Summary
What is the amount of physiological detail captured by
the articulatory synthesis method? Can articulatory
synthesis generate high quality synthetic speech? Can it be
employed for the simulation of physiological and
pathological aspects of speech production?
A positive answer to these questions is the most
significant contribution of the research presented in this
dissertation.
The modeling and numerical simulation techniques used
in this study have been reported in Chapters 3 and 4. We
were able to guarantee the stability of the numerical
simulation method and to halve the number of differential
equations that must be solved for the simulation of the wave
propagation in the vocal tract (Chapter 4).
We also developed interactive graphic software
(Chapter 5) to provide us with a convenient interface for
the manipulation of the articulatory model of the vocal
cavities.
133


CHAPTER 2
SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS
As indicated in the introduction (Chapter 1) there are
many different applications that motivate research in the
area of speech synthesis. However, different goals usually
require different approaches for the solution of the
problem. This chapter will briefly consider the three most
popular and best documented techniques for speech synthesis
(namely linear prediction, formant and articulatory
synthesis), their relative "advantages" and "disadvantages"
and the applications for which they are most suitable. The
purpose is to review the available speech synthesis
techniques and to justify the choice of articulatory
synthesis for our research.
Every speech synthesis strategy is based on a more or
less complete model of the physiology of speech production
and ultimately its performance is determined by the amount
of acoustic and linguistic knowledge that the model can
capture.
In the first section of this chapter we therefore
discuss the basic notions of the physiology of speech
together with the source-filter production model upon which
both linear prediction and formant synthesis are based.
7


9
Figure 2.1. Schematic diagram of the human vocal
mechanism (from [7] ). By permission
of Springer-Verlag.


168
[109] Stevens, K. N., House, A. S., and Paul, A. P.,
"Acoustic description of syllable nuclei: an
interpretation in terms of a dynamic model of
articulation," J. Acoust. Soc. Amer., vol. 40,
pp. 123-132, 1966.
[110] Kozhevnikov, V. A., and Chistovich, L. A., "Speech
Articulation and Perception," Joint Publications
Research Services, Washington D.C., 1965.
[111] MacNeilage, P. F., "Motor control of serial
ordering of speech," Psychological Review, vol. 77,
pp. 182-196, 1970.
[112] Henke, W. L., Dynamic Articulatory Model of Speech
Production Using Computer Simulation, Doctoral
Dissertation, Massachusetts Institute of Technology,
Cambridge, 1966.
[113] Fowler, C. A., "Coarticulation and theories of
extrinsic timing," J. of Phonetics, vol. 8,
pp. 113-133, 1980.
[114] Krishnamurthy, A. K., Study of Vocal Fold Vibration
and the Glottal Sound Source Using Synchronized
Speech, Electroglottography and Ultra-High Speed
Laryngeal Films, Doctoral Dissertation, University
of Florida, Gainesville, Florida, 1983.
[115] Childers, D. G., Moore, G. P., Naik, J. M., Larar,
J. N. and Krishnamurthy, A. K., "Assessment of
laryngeal function by simultaneous, synchronized
measurement of speech, electroglottography and
ultra-high speed film," Transcripts of the Eleventh
Symposium Care of the Professional Voice, The
Juilliard School, New York, pp. 234-244, 1982.
[116] Titze, I. R. and Talkin, D., "Simulation and
interpretation of glottographic waveforms,"
Proceedings of the Conference on the Assessment of
Vocal Pathology, Bethesda, Maryland, 1979.


62
    α₁ = 1,    α_i = 0 if i ≠ 1   (4.3.2)

Both of these methods can be obtained from the integral
relation

    y(t_n) = y(t_{n-1}) + ∫_{t_{n-1}}^{t_n} f(y(t), t) dt

The integral is estimated by approximating f(y(t),t)
with an interpolating polynomial (for example, the Newton's
backward difference formula) through a number of known
values at t = t_{n-1}, ..., t_{n-K} in the explicit case, or
through the values at times t_n, t_{n-1}, ..., t_{n-K} for the
implicit case.
Therefore, for the explicit Adams-Bashforth case,
equation (4.3.1) takes the form

    y_n = y(t_{n-1}) + h Σ_{i=1}^{K} β_i f(y(t_{n-i}), t_{n-i})   (4.3.4a)

or with the equivalent representation in terms of finite
differences


57
    ∫_{t₀}^{t₀+h} f(t) dt = (h/6) [f(t₀) + 4 f(t₀ + h/2) + f(t₀ + h)]

when y'(t) = f(t).
We use a calculation procedure derived from (4.2.6) by
Gill [76] which minimizes the memory requirements and allows
us to compensate the round off errors accumulated at each
step.
4.2.b) Control of the Step Size with Runge-Kutta
In the previous derivation we have emphasized how the
Runge-Kutta method approximates the Taylor expansion 4.2.2
up to the 4th power of h. It is therefore a fourth order
method, with a local truncation error of order h⁵:

    (f⁽⁵⁾(y(t₀), t₀) / 5!) h⁵ + O(h⁶)
This accuracy is obtained without explicitly computing the
derivatives of orders higher than one, at the expense of
four evaluations of the first derivative for each mesh
point. This is a disadvantage with respect to multistep
methods (to be discussed later), which use fewer
computations of the first derivative to obtain the same
truncation error. The number of derivative evaluations per
step increases to 5.5 to obtain a variable control of the
step size with Runge-Kutta.


115
We have modeled each elementary length of the nasal
tract with an equivalent circuit different from the one
discussed in Section 3.1.a for the oral tract. We modeled
the nasal cavity losses with resistors RL_i and RH_i (see
Figure 6.15), which determine the losses at low and high
frequencies respectively.
Their values are determined by the expressions

    RH_i = 20000 L_i        RL_i = 500 L_i

in order to match the bandwidths of the nasal formants at
300 Hz and 2200 Hz that are typical of nasal murmurs [99].
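A minimal sketch of this parameter choice in Python, taking
L_i to be the inductance of the ith elemental section of
Figure 6.15:

    def nasal_loss_resistors(inductances):
        # RH_i and RL_i per elemental section, chosen to match the
        # 300 Hz and 2200 Hz nasal formant bandwidths
        return [(20000.0 * L, 500.0 * L) for L in inductances]

    print(nasal_loss_resistors([1.0e-3, 1.2e-3]))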
We have used this representation of the acoustic
properties of the vocal cavities to model the reduction of
the intensity level of vowel sounds when the velopharyngeal
orifice is not completely sealed.
Speech clinicians have noted a reduction in the speech
loudness level of speakers without adequate velopharyngeal
closure that could be caused either by a voluntary attempt
of the subject to minimize the obviousness of his
communication problem or by additional damping of the speech
signals that results from an increased coupling with the
nasal cavity [100].


Abstract of Dissertation Presented to the Graduate
Council of the University of Florida in Partial
Fulfillment of the Requirements for the
Degree of Doctor of Philosophy
AN ARTICULATORY SPEECH SYNTHESIZER
By
Enrico Luigi Bocchieri
April 1984
Chairman: Dr. D. G. Childers
Major Department: Electrical Engineering
Linear prediction and formant synthesizers are based on
a rather approximate model of speech production physiology,
using analysis or "identification" algorithms of natural
speech to overcome the model limitations and to synthesize
good quality speech.
On the contrary, articulatory synthesizers are based on
a more exact speech production model, and do not use
identification algorithms to derive the model parameters
directly from the natural speech waveform.
This dissertation shows that the amount of
physiological detail captured by the articulatory synthesis
method is sufficient for the generation of high quality
synthetic speech and for the simulation of physiological and
pathological aspects of speech that are reported in the
literature.
v


58
The step size, h, should in fact be chosen so that the
local truncation error is less than a certain maximum
acceptable value specified by the user. Unfortunately, the
truncation error cannot be directly estimated because the
Runge-Kutta procedure does not provide any information about
higher order derivatives.
A practical solution [76] is based on the results of
numerical integration with steps h and 2h, respectively,
i.e., the computation is performed a first time using h₁ = h
and then it is repeated using h₂ = 2h.
Let

    C₂h₂⁵  denote the truncation error using step h₂ = 2h,
    C₁h₁⁵  denote the truncation error using step h₁ = h,
    y₍₂₎   denote the value obtained at (t₀ + 2h) using step h₂ = 2h,
    y₍₁₎   denote the value obtained at (t₀ + 2h) using step h₁ = h twice,
    Y      denote the true value of y at time (t₀ + 2h),

then

    Y - y₍₂₎ = C₂h₂⁵
    Y - y₍₁₎ = 2C₁h₁⁵
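A minimal Python sketch of this step-doubling procedure for a
fourth order method; assuming C₁ ≈ C₂ = C, the two relations
above give err(y₍₁₎) = 2Ch⁵ ≈ (y₍₁₎ - y₍₂₎)/15:

    def rk4_step(f, y, t, h):
        k1 = f(y, t)
        k2 = f(y + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(y + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(y + h * k3, t + h)
        return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    def local_error_estimate(f, y, t, h):
        y_2h = rk4_step(f, y, t, 2.0 * h)                   # one step of 2h
        y_h = rk4_step(f, rk4_step(f, y, t, h), t + h, h)   # two steps of h
        # err(y_h) ~ 2*C*h^5 = (y_h - y_2h) / 15 for a 4th order method
        return abs(y_h - y_2h) / 15.0

    print(local_error_estimate(lambda y, t: -y, 1.0, 0.0, 0.05))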


Figure 3.8. Equivalent circuit of vocal tract elemental
length with fricative excitation.


136
methods could probably be used for articulatory text-to-speech
synthesis.
The availability of a faster implementation of the
vocal cavity acoustic model would greatly help this
research, since a great amount of experimentation based on a
trial and error procedure (see also Section 6.1) is probably
required.


47
generator variance depend on the Reynold's number of the
flow in the ith section [71]. The spectrum of the turbulent
noise VN_i can be assumed white with a good approximation
[21].
In our simulation we have modeled the fricative
excitation by means of two sources. One is always located
in the first vocal tract section to generate aspirated
sounds. The second is not bound to a fixed position but can
be moved along with the vocal tract constriction location.
3.3) Radiation Load
The radiation effects at the mouth and nostrils can be
accounted for by modeling the mouth and nostrils as a
radiating surface placed on a sphere (the head) with a
radius of about 9 cm.
Flanagan [7] has proposed a simplified equivalent
circuit for the radiation load model by using a parallel
combination of an inductor and a resistor with values

    R_R = 128/(9π²)        L_R = 8a/(3πc)

where a is the radius of the (circular) radiating surface
(mouth or nostrils). Titze [19] has shown that this
approximation, which we are using now, is valid also at
relatively high frequencies, when the speech wavelength has
the same order of magnitude as the mouth radius.
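A small Python helper evaluating this load; the scaling of
both elements by the characteristic impedance ρc/(πa²) is our
assumption about the normalization, based on the usual form
of this approximation:

    import math

    RHO, C = 1.14e-3, 3.5e4     # air density (g/cm^3), sound speed (cm/s)

    def radiation_load(a):
        # piston-in-sphere approximation: parallel R_R and L_R
        z0 = RHO * C / (math.pi * a * a)
        R_R = (128.0 / (9.0 * math.pi ** 2)) * z0
        L_R = (8.0 * a / (3.0 * math.pi * C)) * z0
        return R_R, L_R

    print(radiation_load(1.0))  # mouth radius of 1 cm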


147
Figure A.5. Graphic representation of equation (A.2.2)
to account for fricative excitation.


66
(4.3.6)). The reason for this fact is evident when we
consider that (4.3.7) has (K+l) coefficients 3. r, i = 0,K
1, a.
while in the explicit method (4.3.5) 3A has been set to
U / i\
zero.
A disadvantage of implicit methods is that the non
linear equation (4.3.7) in the unknown yn must be solved
iteratively. Usually a first "guess" of yn is obtained by
-means of an explicit method and then (4.3.7) is iterated.
However, for reasonable values of the step size h, no more
than two or three iterations are usually required, and this
extra effort is more than compensated for by the better
stability properties of the implicit methods. In fact with
respect to the "test" equation y' = Xy, the range of h
values for which implicit methods are stable is at least one
order of magnitude greater than in the explicit case [75] .
Since the truncation errors of implicit methods are smaller,
the implicit methods can be used with a step size that is
several times larger than that of the explicit method. The
allowed increase in step size more than offsets the
additional effort of performing 2 or 3 iterations.
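A minimal Python sketch of this predictor-corrector
iteration, using for clarity the explicit Euler rule as
predictor and the one step implicit (backward Euler) rule as
corrector, instead of the higher order Adams pair:

    def implicit_step(f, y, t, h, iters=3):
        y_new = y + h * f(y, t)               # predictor: explicit guess
        for _ in range(iters):                # corrector: iterate the
            y_new = y + h * f(y_new, t + h)   # implicit relation
        return y_new

    # test equation y' = lam*y: two or three iterations already settle
    print(implicit_step(lambda y, t: -5.0 * y, 1.0, 0.0, 0.1))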
4.4) Method Selection
We have implemented the numerical simulation of the
acoustic model of the vocal cavities by means of both Runge-
Kutta and implicit multistep methods [77]. The Runge-Kutta


165
[75] Gear, C. W., Numerical Initial Value Problems in
Ordinary Differential Equations, Prentice Hall,
Englewood Cliffs, New Jersey, 1971.
[76] Ralston, A., and Wilf, H. S., Mathematical Methods
for Digital Computers, John Wiley and Sons, New
York, 1967.
[77] Gear, C. W., "DIFSUB for solution of ordinary
differential equations," Collected Algorithms from
ACM, ACM, Inc., New York, pp. 407P1-407P7, 1980.
[78] Coker, C. H., "A model of articulatory dynamics and
control," Proceedings of the IEEE, vol. 64,
pp. 452-460, 1976.
[79] Mermelstein, P., "Articulatory model for the study
of speech production," J. Acoust. Soc. Amer.,
vol. 53, pp. 1070-1082, 1973.
[80] Ohman, S. E. G., "Numerical model of coarticulation,"
J. Acoust. Soc. Amer., vol. 41, pp. 310-320, 1967.
[81] Lindquist, J., and Sundberg, J., "Acoustic
properties of the nasal tract," STL-QPSR, no. 1,
pp. 13-17, 1972.
[82] Perkell, J. S., Physiology of Speech Production,
MIT Press, Cambridge, 1969.
[83] Gay, T., "Articulatory movements in VCV sequences,"
J. Acoust. Soc. Amer., vol. 62, pp. 183-194, 1977.
[84] Rothenberg, M., "The voice source in singing,"
Research Aspects on Singing, Royal Swedish Academy
of Music, Stockholm, pp. 15-33, 1981.
[85] Kewley-Port, D., "Time varying features as
correlates of place of articulation in stop
consonants," J. Acoust. Soc. Amer., vol. 73,
pp. 322-335, 1983.
[86] Searle, C. L., Jacobson, J. Z., and Rayment, S. G.,
"Stop consonant discrimination based on human
audition," J. Acoust. Soc. Amer., vol. 65, pp. 799-
809, 1979.


89
Figure 6.2. Spectrogram of "Goodbye Bob", natural.


CHAPTER 5
THE ARTICULATORY MODEL AND ITS
INTERACTIVE GRAPHIC IMPLEMENTATION
The acoustic characteristics of the vocal cavities,
that we have modeled by means of an equivalent circuit in
Chapter 3, are greatly affected by the geometrical
configuration of the vocal cavities themselves. Therefore,
the acoustic and perceptual properties of speech depend on
the position of the lips, tongue, jaws and of the other
vocal organs that determine the shape of the vocal tract.
The physiological mechanism of speech production or
"articulation" involves precisely timed movements of the
vocal organs to produce the acoustic wave that we perceive
as connected speech.
This chapter is concerned with the definition and
implementation of a geometric or "articulatory" model of the
vocal tract that can be used to describe the configuration
of the vocal cavities during speech production.
5.1) Definition of the Articulatory Model
All the articulatory models presented in the literature
[78-81] are two dimensional representations of the vocal
cavities which closely match the midsagittal section of the
71


43
CONTRACTION GLOTTIS EXPANSION
Figure 3.7. Two mass model of the vocal cords and
glottis equivalent circuit.


163
[54] Flanagan, J. L., Ishizaka, K. and Shipley, K. L.,
"Signal models for low bit-rate coding of speech,"
J. Acoust. Soc. Amer., vol. 68, pp. 780-791, 1980.
[55] Sondhi, M. M., "A model for wave propagation in a
lossy vocal tract," J. Acoust. Soc. Amer., vol. 55,
pp. 1070-1079, 1974.
[56] Dunn, H. K. and White, S. D., "Statistical
measurements of conversational speech," J. Acoust.
Soc. Amer., vol. 11, pp. 278-288, 1940.
[57] Ishizaka, K., French, J. C., and Flanagan, J. L.,
"Direct determination of the vocal tract wall
impedance," IEEE Trans. Acoust. Speech and Signal
Proces., vol. 23, pp. 370-373, 1975.
[58] Morrow, C. J., "Speech in deep submergence
atmosphere," J. Acoust. Soc. Amer., vol. 50,
pp. 715-728, 1971.
[59] Fujimura, O., and Lindquist, J., "Sweep tone
measurement of vocal tract characteristics,"
J. Acoust. Soc. Amer., vol. 49, pp. 541-558, 1979.
[60] Van den Berg, J., "On the role of the laryngeal
ventricle in voice production," Folia Phoniatrica,
vol. 7, pp. 57-69, 1955.
[61] Van den Berg, J., Zantema, J. T., and Doornenbal,
P., "On the air resistance and the Bernoulli effect
of the human larynx," J. Acoust. Soc. Amer.,
vol. 29, pp. 626-631, 1957.
[62] Flanagan, J. L., and Landgraf, L. L., "Self
oscillating source for vocal tract synthesizers,"
IEEE Trans. Audio Electroacoust., vol. 16, pp. 57-
58, 1968.
[63] Flanagan, J. L., and Meinhart, D., "Source system
interaction in the vocal tract," J. Acoust. Soc.
Amer., vol. 64, pp. 2001(A), 1964.
[64] Mermelstein, P., "An extension of Flanagan's model
of vocal cord oscillations," J. Acoust. Soc. Amer.,
vol. 50, pp. 1208-1210, 1971.


CHAPTER 3
ACOUSTIC MODELS OF THE VOCAL CAVITIES
The qualitative descriptions of the human speech
production mechanism and of the articulatory method of
speech synthesis that we have given in Section 2.1 cannot be
directly implemented on a digital computer. This knowledge
must be transformed into an analytical representation of the
physics of sound generation and propagation in the vocal
cavities.
The mathematical model of the vocal cavity acoustics
can be conveniently interpreted by means of equivalent
circuits. In such a representation the electrical current
and voltage correspond respectively to the air volume
velocity and pressure in the vocal cavities. In the following we
will always express all the physical dimensions in C.G.S.
units.
3.1) Sound Propagation in the Vocal Cavities.
3.1.a) Derivation of the Model.
Sound is nearly synonymous with vibration. Sound waves
are originated by mechanical vibrations and are propagated
in air or other media by vibrating the particles of the
media. The fundamental laws of mechanics, such as momentum,
23


91
The procedure for the definition of the time evolution
of the articulatory model that we have described above is,
however, rather "heuristic". After a first trial,
adjustments of the articulatory model configuration are
usually necessary to improve the quality of the synthetic
speech.
In our opinion a development of this research should be
the definition of vocal tract evolution for different
English allophones, as a first step toward an automatic
speech synthesis by rule system based on an articulatory
model. The solution of this problem is not at all
trivial. Section 6.7 illustrates the difficulties and
reviews part of the literature related to this subject.
6.2) Source Tract Interaction
The classical source-tract speech production model that
we have discussed in Section 2.1 is based on the assumption
that the glottal volume velocity during speech production is
independent of the acoustic properties of the vocal tract.
Evidently this source-tract separability assumption holds
only as a first order approximation. In fact the glottal
volume velocity depends on the transglottal pressure that is
related to the subglottal and vocal tract pressure.
The effects of source-tract interaction have been
modeled and analyzed by Guerin [44], Rothenberg [42],


63
where the operator ∇ is defined by

    ∇^j f_m = ∇^{j-1} f_m - ∇^{j-1} f_{m-1},    ∇⁰ f_m = f_m   (4.3.5)

The values of y_n differ from the real solution y(t_n) by
a local truncation error which is of order (K+1) in the
step size h

    Error_{Adams-Bashforth} = γ_K h^{K+1} y^{(K+1)}(t)   (4.3.6)

The values of the γ_K and β_i coefficients are available
directly from the literature [75].
In the implicit Adams-Moulton case equation (4.3.1)
takes the form

    y_n = y(t_{n-1}) + h Σ_{i=0}^{K} β_i f(y(t_{n-i}), t_{n-i})   (4.3.7)

or with the equivalent representation in terms of backward
differences

    y_n = y(t_{n-1}) + h Σ_{i=0}^{K} γ*_i ∇^i f(y(t_n), t_n)   (4.3.8)

where the ∇ operator has been defined in (4.3.5).


145
voltage generator V_K represent the turbulent losses and
turbulent excitation, respectively. Such an equivalent
circuit, however, represents the "junction" of two uniform
pipes obtained by means of a resistor R_K and a voltage
generator V_K, as shown in the lower part of Figure A.4. The
continuity conditions (A.1.2) become

    p_K(Δx,t) = p_{K+1}(0,t) + R_K u_{K+1}(0,t) + V_K
    u_K(Δx,t) = u_{K+1}(0,t)   (A.2.1)

The substitution of (A.1.1) into (A.2.1) gives, after some
manipulations, the following equations

    | p⁻_K(t)     |   | K11  K12 |   | p⁺_K(t)     |   | KM1 |
    |             | = |          | · |             | + |     | · V_K   (A.2.2)
    | p⁺_{K+1}(t) |   | K21  K22 |   | p⁻_{K+1}(t) |   | KM2 |

where

    D   = A_K + A_{K+1} + R_K A_K A_{K+1}/(ρc)
    K11 = (A_K - A_{K+1} + R_K A_K A_{K+1}/(ρc)) / D
    K12 = 2 A_{K+1} / D
    K21 = 2 A_K / D


10
perceived as a sequence of speech units or segments. In
general the different types of sound that occur during
speech production are generated by changing the manner of
excitation and the acoustic response of the vocal cavities.
As a first order approximation we can distinguish
between a "voiced" and a "fricative" or "unvoiced"
excitation of the vocal cavities.
The voiced excitation is obtained by allowing the vocal
cords to vibrate so that they modulate the stream of air
that is coming from the lungs, producing an almost periodic
signal. For example, vowels are generated in this way and
they are perceived as continuous non-hissy sounds because
the excitation is essentially periodic.
In contrast, unvoiced or fricative excitation is
achieved by forcing the air flow through a constriction in
the vocal tract with a sufficiently high Reynold's number,
thereby causing turbulence. This excitation has random or
"noisy" characteristics and, therefore, the resulting speech
sounds will be hissy or fricative (friction-like) as in the
case of the consonants /s/ and /f/.
Both voiced and unvoiced excitation signals have a
rather wide spectrum. Typically the power spectrum of the
voiced excitation decreases with an average slope of
12 dB/octave [7], while the unvoiced spectrum can be
considered white over the speech frequencies [21].


44
characteristic according to the stiffness measured on
excised human vocal cords. As in the case of the one mass
model the viscous damping was changing during the vocal
cords vibration period.
For computer simulation it is convenient to represent
the pressure distribution along the two masses with the
voltage values in an equivalent circuit. In Figure 3.7
resistance R_c accounts for the Bernoulli pressure drop and
"vena contracta" effect at the inlet of the glottis.
Resistances R_v1 and R_v2 model the viscous losses in the
glottis. Resistance R_12 accounts for the pressure
difference between the two masses caused by the Bernoulli
effect. The inductors model the air inertance in the
glottis.
The two mass model of the vocal cords, that has been
used in connection with vocal tract synthesizers [67-68],
uses as control parameters the glottal neutral area A_g0 and
the cord tension Q.
A_g0 determines the glottal area in the absence of phonation
and it is physiologically related to the position of the
arytenoid cartilages. Q controls the values of the elastic
constant of the model and greatly affects the two mass model
oscillation period. The suitability of the two mass model
and of its control parameters for speech synthesis has been
further validated in [69] and [70].


13
2.2) Linear Prediction
The simplest and most widely used implementation of the
source tract model of speech production is Linear Prediction
synthesis. It is a synthesis by analysis method that was
first proposed by Atal and Hanauer [26] and it has been
investigated for a great variety of speech applications.
The method is particularly suitable for digital
implementation and it assumes a time discrete model of
speech production, typically with a sampling frequency
between 7 and 10 kHz. It consists (see Figure 2.3) of two
signal generators of voiced and unvoiced excitation and of
an all pole transfer function

    H(z) = 1 / (1 + Σ_{K=1}^{p} a_K z^{-K})   (2.2.1)

to represent the acoustic response of the vocal cavities.
Mathematically, the transfer function H(z) is determined by
the predictor coefficients a_K. The great advantage of
linear prediction is that an estimate of the predictor
parameters can be efficiently obtained using an analysis
phase of natural speech. The literature presents several
algorithms to perform this analysis. Perhaps the schemes
most used for speech synthesis applications are the
autocorrelation method [27] and, for hardware
implementation, the PARCOR algorithm [28].
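As an illustration, a minimal numpy sketch of the
autocorrelation method via the Levinson-Durbin recursion,
which also yields the PARCOR (reflection) coefficients as a
by-product; this is a textbook version, not the specific
implementations of [27-28]:

    import numpy as np

    def levinson_durbin(r, p):
        # solve the autocorrelation normal equations for a_1 .. a_p
        a = np.zeros(p + 1)
        a[0] = 1.0
        err, parcor = r[0], []
        for i in range(1, p + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                 # PARCOR coefficient of order i
            a_prev = a.copy()
            for j in range(1, i):
                a[j] = a_prev[j] + k * a_prev[i - j]
            a[i] = k
            err *= 1.0 - k * k
            parcor.append(k)
        return a[1:], parcor

    x = np.random.randn(400)
    r = np.correlate(x, x, "full")[399:410]   # lags 0 .. 10
    print(levinson_durbin(r, 10)[0])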


75
Figure 5.2. Grid system for the conversion of mid-sagittal
dimensions to cross-sectional area values.


116
Figure 6.15. Equivalent circuit of an elemental length
of the nasal tract.


102
The possibility that vibrating walls could influence
the glottal volume velocity estimate has also been
hypothesized by Titze (see discussion in [91]).
Figure 6.9 shows the speech production model on which
the algorithm is based. The mouth volume velocity u(n) and
the glottal waveform g(n) are related by an all pole
transfer function that models the vocal tract

    U(z)/G(z) = T(z) = 1 / (1 + Σ_{i=1}^{K} a_i z^{-i})   (6.4.1)

or in the time domain

    u(n) + a₁ u(n-1) + ... + a_K u(n-K) = g(n)   (6.4.2)

The sound pressure s(n), which is the data supplied to the
inverse filtering algorithm, is equal to the derivative of
the mouth volume velocity

    S(z)/U(z) = R(z) = 1 - z⁻¹   (6.4.3)

or equivalently

    s(n) = u(n) - u(n-1)   (6.4.4)
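A minimal numpy sketch of this processing chain: undo the
radiation by accumulating s(n), estimate the predictor, and
apply (6.4.2) as the inverse filter. For brevity the
predictor is estimated here with the autocorrelation normal
equations over the whole frame, whereas [90] uses a least
squares (closed phase) formulation:

    import numpy as np

    def lpc(u, K):
        # autocorrelation normal equations for a_1 .. a_K
        r = np.correlate(u, u, "full")[len(u) - 1:len(u) + K]
        R = np.array([[r[abs(i - j)] for j in range(K)] for i in range(K)])
        return np.linalg.solve(R, -r[1:K + 1])

    def glottal_inverse_filter(s, K=10):
        u = np.cumsum(s)          # invert (6.4.4): s(n) = u(n) - u(n-1)
        a = lpc(u, K)
        # inverse filter (6.4.2): g(n) = u(n) + a_1 u(n-1) + ... + a_K u(n-K)
        return np.convolve(u, np.concatenate(([1.0], a)))[:len(u)]

    print(glottal_inverse_filter(np.random.randn(300))[:4])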


152
which is approximated by

    u_g²(t) ≈ u_g²(t-τ) (1 + 2 (u_g(t) - u_g(t-τ)) / u_g(t-τ))

since the variation of u_g(t) is usually small during an
interval of duration τ.
Because of continuity conditions at the glottal
termination and (A.1.1) we obtain

    u_g(t) = u₁(0,t) = (A₁/(ρc)) (p₁⁺(t) - p₁⁻(t))   (A.3.3)

    p₁(0,t) = p₁⁺(t) + p₁⁻(t)   (A.3.4)

By substituting (A.3.2), (A.3.3) and (A.3.4) into
(A.3.1), and by approximating the time derivatives with
backward differences, we can obtain a linear relationship
between p₁⁺(t), p₁⁻(t) and the subglottal pressure P_s

    p₁⁺(t) = -D₁ p₁⁻(t) + D₂ (P_s - V₀) + D₃   (A.3.5)

where

    D₁ = (1 - (A₁/(ρc)) (L_g/τ + 2a u_g(t-τ))) D₂
    D₂ = (1 + (A₁/(ρc)) (L_g/τ + 2a u_g(t-τ)))⁻¹


18
function in terms of formant frequencies and bandwidths but
also determine different types of excitation such as voiced,
unvoiced, mixed, sinusoidal for the generation of
consonantal murmurs, and burst-like for the simulation of
stop release.
Formant synthesizers are particularly useful in
psychoacoustic studies since the synthetic speech waveform
can be precisely controlled by parameters which are more
directly related to the acoustic characteristics of speech
than the linear prediction coefficients. For the same
reason they are also more suitable for speech synthesis by
rule.
Synthesis by analysis with formant synthesizers
requires the extraction of the formant information from the
speech signal. For this purpose the literature presents
several formant analysis algorithms [38-41]. However linear
prediction analysis is simpler and more efficient and
therefore linear prediction is usually preferred in
synthesis by analysis type of applications.
2.4) Articulatory Synthesis
Linear prediction and formant synthesis are not
completely suitable for our research since they do not
faithfully model the human speech production mechanism.


148
Consider the partial differential equations (3.1.3),
which we reproduce below and which account for the wave
propagation in a pipe with area function A(x) with a small
time varying perturbation δA(x,t).

    ∂p(x,t)/∂x + (ρ/A(x)) ∂u(x,t)/∂t = 0

    ∂u(x,t)/∂x + (A(x)/(ρc²)) ∂p(x,t)/∂t + ∂(δA(x,t))/∂t = 0

Apply the above relationships to the Kth uniform tube of our
model, where

    A(x) = A_K,    δA(x,t) = δA_K(t)

This gives

    ∂p_K(x,t)/∂x = -(ρ/A_K) ∂u_K(x,t)/∂t

    ∂u_K(x,t)/∂x = -(A_K/(ρc²)) ∂p_K(x,t)/∂t - d(δA_K(t))/dt   (A.2.3)

Let p_Kr(x,t) and u_Kr(x,t) be the solution of (A.2.3) in the
rigid wall case, when δA_K(t) is equal to zero.
Then (A.2.3) shows that the solution p(x,t) and u(x,t)
can be easily obtained from the rigid wall solution


111
Figure 6.13. Linear prediction spectra of synthetic /i/
with yielding cavity walls (solid line)
and rigid cavity walls (dashed line).


CHAPTER 6
SIMULATION RESULTS
6.1) Speech Synthesis
As explained in Chapter 2, linear prediction and
formant synthesis are based on a rather approximate model of
speech production. However, the quality of the synthetic
speech may be very good because the synthesis algorithm uses
the information derived from an analysis phase of natural
speech that captures the most important perceptual features
of the speech waveform.
On the contrary, articulatory synthesis employs a more
detailed model of the human speech production mechanism, but
cannot exploit a good analysis algorithm to derive the
articulatory information directly from natural speech.
In this section we want to show that the amount of
physiological detail captured by our articulatory and
acoustic models of the vocal cavities is sufficient for the
generation of good quality English sentences.
In fact, Figure 6.1 shows the spectrogram of the
sentence "Goodbye Bob" that we have synthesized with our
computer programs. The quality of this sample compares
favorably with other synthesis techniques.
88


118
MSEC
Figure 6.16. Synthetic speech waveforms of non-nasalized
/a/ (solid line) and nasalized /a/ (dashed
line).


161
[33] Sambur, M. R., Rosenberg, A. E., and McGonegal,
C. A., "On reducing the buzz in LPC synthesis,"
J. Acoust. Soc. Amer., vol. 63, 1978.
[34] Yea, J. J., The Influence of Glottal Excitation
Functions on the Quality of Synthetic Speech,
Doctoral Dissertation, University of Florida,
Gainesville, Florida, 1983.
[35] Naik, J., Synthesis and Objective Evaluation of
Natural Sounding Speech Using the Linear Prediction
Analysis-Synthesis Scheme, Doctoral Dissertation,
University of Florida, Gainesville, Florida, 1983.
[36] Holmes, J. N., "The influence of the glottal
waveform on the naturalness of speech from a
parallel formant synthesizer," IEEE Trans. Audio
Electroacoust., pp. 298-305, 1973.
[37] Rabiner, L. R., "Digital-formant synthesizer for
speech synthesis studies," J. Acoust. Soc. Amer.,
vol. 43, pp. 822-828, 1968.
[38] Schafer, R. W. and Rabiner, L. R., "System for
automatic formant analysis of voiced speech,"
J. Acoust. Soc. Amer., vol. 47, pp. 634-648, 1970.
[39] Olive, J., "Automatic formant tracking in a Newton-
Raphson technique," J. Acoust. Soc. Amer., vol. 50,
pp. 661-670, 1971.
[40] Markel, J. D., "Automatic formant and fundamental
frequency from a digital inverse filtering
formulation," Proceedings International Conference
Speech Commun. Proces., Boston, pp. 81-84, 1972.
[41] McCandless, S. S., "An algorithm for automatic
formant extraction using linear prediction
spectra," IEEE Trans. Acoust. Speech Signal
Proces., vol. 22, pp. 135-141, 1974.
[42] Rothenberg, M., "The effect of flow dependence on
source-tract interaction," presented and to appear
in Proceedings of Vocal Folds Physiology and
Biophysics of Voice, University of Iowa, 1983.
[43] Ananthapadmanabha, T. V., and Fant, G., "Calculation
of true glottal flow and its components," STL-QPSR,
no. 1, pp. 1-30, 1982.


99
articulation [17]. Figure 6.7 shows the spectra of /b/,
/d/, /g/ of our synthetic speech using a 256 ms window after
consonant release.
The /g/ spectrum is characterized by an energy
concentration in the 2 kHz range. The /d/ spectrum has a
rising characteristic with maxima around 4 kHz. The /b/
spectrum, instead, is more vowel-like, with a globally flat
trend over the speech frequency range.
6.4 Glottal Inverse Filtering of Speech
The problem of estimating the glottal volume velocity
directly from the speech waveform has interested speech
researchers for a number of years [89-90].
Here we consider the algorithm of least square inverse
filtering of speech [90] that is based on the linear
prediction model of speech production.
We will show that even if this model does not account
for the vibration of the vocal tract walls, nevertheless the
inverse filtering algorithm is capable of giving a correct
estimate of the glottal volume velocity. This is not an
obvious result since the volume velocity that is displaced
by the wall vibration is not small compared to the glottal
volume velocity (see Figure 6.8) and it is present also
during the closed phase.


144
Figure A.4. Equivalent circuit to account for fricative
excitation and its interpretation in terms
of lossless tubes.


113
Studies of nasal consonants with analog articulatory
synthesizers have been reported by House and Stevens [97],
Hecker [98] and Fant [22]. The latter reference uses data
taken directly from X-rays of the vocal system. Sweep-
frequency measurements of the nasal tract transfer function
have been performed by Fujimura and Lindquist [59].
Nasal murmurs appear to have typical spectral
characteristics associated with the nasal tract and
pharynx. Formants occur around 300 Hz, 1000 Hz, 2200 Hz,
3000 Hz. According to Fujimura [99] there is also a formant
between 1000 and 2000 Hz which is heavily dependent on the
oral tract configuration.
An increase in the coupling area of the mouth to the
pharyngeal cavity may shift the antiresonance associated
with the mouth cavity up to 1000 Hz which causes a
neutralization of the 1000 Hz nasal formant [22]. This can
be verified with the aid of our vocal tract synthesizer.
Figure 6.14 shows the spectra of two synthetic nasal
murmurs. The 1000 Hz formant disappears when the mouth
coupling area is increased with respect to the area of the
velar passage.
The formant bandwidths of nasal murmurs are comparable
to or larger than those observed for vowels. Both high and
low frequency components are in fact attenuated by the soft,
deeply convoluted walls of the nasal cavities, which provide
a high factor of viscous damping.


REFERENCES
[1] Flanagan, J. L., "Voices of men and machines,"
J. Acoust. Soc. Amer., vol. 51, pp. 1375-1387,
1972.
[2] Holmes, J. N., Speech Synthesis, Mills and Boon,
London, 1972.
[3] Feldman, J. A., Hofstetter, E. M. and Malpass,
M. L., "A compact flexible LPC vocoder based on a
commercial signal processing microcomputer," IEEE
Trans. Acoust. Speech Signal Proces., vol. 31,
pp. 252-257, 1983.
[4] Cox, R. V., Crochierie, R. E. and Johnston, J. D.,
"Real time implementation of time domain harmonic
scaling of speech for rate modification and
coding," IEEE Trans. Acoust. Speech Signal Proces.,
vol. 31, pp. 258-272, 1983.
[5] Fette, B., Harrison, D., Olson, D., and
Allen, S. P., "A family of special purpose
microprogrammable digital signal processor IC's in
an LPC vocoder system," IEEE Trans. Acoust. Speech
Signal Proces., vol. 31, pp. 273-280, 1983.
[6] Irie, K., Uno, T., Uchimura, K. and Iwata, A., "A
single-chip ADM LSI CODEC," IEEE Trans. Acoust.
Speech Signal Proces., vol. 31, pp. 281-287, 1983.
[7] Flanagan, J. L., Speech Analysis, Synthesis and
Perception, Springer Verlag, New York, 1972.
[8] Sambur, M. R., "An efficient linear prediction
vocoder," Bell System Tech. J. vol. 54, pp. 1693-
1723, 1975.
[9] Tribolet, J. M. and Crochierie, R. E. "Frequency
domain coding of speech," IEEE Trans. Acoust.
Speech and Signal Proces., vol. 27, pp. 512-530,
1979.
158


162
[44] Guerin, B., Mrayati, M. and Carre, R., "A voice
source taking into account of coupling with the
supraglottal cavities," IEEE International
Conference Acoust. Speech Signal Proces., pp. 47-
50, 1976.
[45] Childers, D. G., Yea, J. J., and Bocchieri, E. L.,
"Source-vocal tract interaction in speech and
singing synthesis," presented and to appear in
Proceedings of Stockholm Music Acoust. Conference,
Stockholm, Aug. 1983.
[46] Wakita, H., "Direct estimation of the vocal tract
shape by inverse filtering of acoustic speech
waveforms," IEEE Trans. Audio Electroacoust.,
vol. 21, pp. 417-427, 1973.
[47] Wakita, H., "Estimation of vocal tract shapes from
acoustical analysis of the speech wave: the state
of the art," IEEE Trans. Acoust. Speech Signal
Proces., vol. 27, pp. 281-285, 1979.
[48] Sondhi, M. M., "Estimation of vocal tract areas:
the need for acoustical measurements," IEEE Trans.
Acoust. Speech Signal Proces., vol. 27, pp. 268-
273, 1979.
[49] Sondhi, M. M., and Gopinath, B., "Determination of
vocal tract shape from impulse response at the
lips," J. Acoust. Soc. Amer., vol. 49, pp. 1867-
1873, 1971.
[50] Sondhi, M. M., and Resnick, J. R., "The inverse
problem of the vocal tract: numerical methods,
acoustical experiments and speech synthesis,"
J. Acoust. Soc. Amer., vol. 73, pp. 985-1002, 1983.
[51] Mermelstein, P., "Determination of the vocal tract
shape from measured formant frequencies,"
J. Acoust. Soc. Amer., vol. 41, pp. 1283-1294, 1967.
[52] Atal, B. S., Chang, J. J., Mathews, M. V., and
Tukey, J. W., "Inversion of articulatory to
acoustic transformation by a computer-sorting
technique," J. Acoust. Soc. Amer., vol. 63,
pp. 1535-1555, 1978.
[53] Lieberman, P., Speech Physiology and Acoustic
Phonetics: An Introduction, MacMillan, New York,
1977.


42
    M ẍ + B(x) ẋ + K(x) x = F(t)
The mechanical damping B(x) and elastic constants K(x) are
properly defined functions of the vocal cord position x.
The forcing action F(t) depends on the air pressure
distribution along the glottis which was estimated according
to van den Berg's results. A modified version of the one
mass model was also designed by Mermelstein [64].
Even if the one mass model had been able to simulate
important physiological characteristics of vocal cord
vibration, it presented several inconveniencies.
1) The frequency of vocal fold vibration was too
dependent on the vocal tract shape, indicating too
great a source tract interaction.
2) The model was unable to account for phase
differences between the upper and lower edge of the
cords.
3) The model was unable to oscillate with a capacitive
vocal tract input impedance.
These difficulties were overcome by the two mass model
of the vocal cords that is shown in Figure 3.7. It was
designed by Ishizaka and Matsudaira [65] and first
implemented by Ishizaka and Flanagan [66-67]. A mechanical
coupling between the two masses is represented by the spring
constant k_c. The springs s₁ and s₂ were given a non-linear


24
mass and energy conservation and of fluid dynamics can be
applied to the compressible, low viscosity medium (air) to
quantitatively account for sound propagation.
The oral and nasal tracts are three-dimensional lossy
cavities of non-uniform cross-sections and non-rigid
walls. Their acoustic characteristics are described by a
three dimensional Navier-Stokes partial differential
equation with boundary conditions appropriate to the yielding
walls. However, in practice, the solution of this
mathematical formulation requires
"an exhorbitant amount of computation, and we do not
even know the exact shape of the vocal tract and the
characteristics of the walls to take advantage of such a
rigorous approach." [55]
The simplifying assumption commonly made in the
literature is plane wave propagation. Most of the sound
energy during speech is contained in the frequency range
between 80 and 8000 Hz [56] but speech quality is not
significantly affected if only the frequencies below 5 kHz
are retained [16]. In this frequency range the cross
sectional dimensions of the vocal tract are sufficiently
small compared to the sound wavelength so that the departure
from plane wave propagation is not significant.
Thanks to the plane wave propagation assumption, the
geometric modeling of the vocal cavities can be greatly
simplified. Acoustically the vocal tract becomes equivalent
to a circular pipe of non-uniform cross-section (see


21
Vocal organ positions → ARTICULATORY MODEL → description of
vocal cavity shape → VOCAL CAVITY ACOUSTIC MODEL → synthetic
speech, vibration of the vocal cords, volume velocity and
pressure distribution in the vocal cavities.
Figure 2.4. Articulatory synthesis of speech.


17
Among the many formant synthesizers reported in the
literature [16] [36-37] two general configurations are
common. In one type of configuration, the formant
resonators that simulate the transfer function of the vocal
tract are connected in parallel. Each resonator is followed
by an amplitude gain control which determines the spectral
peak level. In the other type of synthesizer the resonators
are in cascade. The advantage here is that the relative
amplitudes of formant peaks for the generation of vowels are
produced correctly with no need for individual amplitude
control for each formant [16].
Formant synthesis was developed prior to linear
prediction synthesis probably because it is amenable to
analog hardware implementation. However, many synthesizers
have been implemented on general purpose digital computers
to obtain a more flexible design. The most recent and
perhaps most complete synthesizer reported in the literature
has been designed by Klatt [16], which consists of a cascade
and parallel configurations which are used for the
production of vowels and consonants, respectively.
An advantage of formant synthesizers is their
flexibility. In fact, thanks to the parallel configuration,
the filter transfer function may have both zeroes and
poles. Klatt1s synthesizer, for example, has 39 control
parameters which not only control the filter transfer


35
ZW(s) = sLW + RW +
(3.1.8)
with
P(s) = ZW(s) Uw(s)
(3.1.9)
Such an impedance appears to be inversely related to the
vibrating surface (lAx) and its components are
1. an inductance LW proportional to the mass of the
unit surface of the vibrating wall,
2. a resistor RW proportional to the viscous damping
per unit surface of the vibrating walls, and
3. a capacitor CW inversely proportional to the
elastic constant per unit surface.
Direct measurements of the mechanical properties of
different types of human tissues have been reported in the
literature [57], however such measurements are not directly
available for the vocal tract tissues. Moreover, it is
difficult to estimate the lateral surface of the vocal
cavities that is required to compute the numerical value of
the impedance ZW according to the above derivation.


consonant the second and higher formant frequencies are lowered, which also decreases the higher order formant peak amplitudes. An alveolar constriction, instead, shifts the second formant toward higher frequencies, which in turn increases the spectral peaks of the higher order formants. A velar constriction divides the vocal tract into a front and back cavity of similar volumes. The second and third formants are therefore brought closer to each other, causing a spectral peak in the mid frequency range around 2 kHz.
These spectral characteristics of vocal tract closure
are further enhanced by the fricative excitation that may
occur at consonant release.
For example, at the release of the velar consonant /g/, the fricative noise enhances the spectral peak at 2 kHz that is associated with the front cavity (see Figure 6.6). The noise burst of an alveolar consonant (/d/) emphasizes the higher frequency components around 4 kHz that are associated with the small front cavity.
The frication spectrum of the labial consonant /b/ is less significant because there is no cavity in front of the constriction and because the consonant release is very rapid [88].
We see there are different spectral patterns for
various consonant releases, depending on the place of


$$D_3 = u_g(t-\tau)\,\bigl(a\,u_g(t-\tau) + D_2\bigr)$$
Equations (A.3.5) can, therefore, be used to model the
glottal termination of the vocal tract.
A.3.b) Radiation Load
Figure A.6 shows the termination of the last (Nth) section of the vocal tract. As discussed in Section 3.3, $R_R$ and $L_R$ represent the radiation load of the vocal tract at the mouth. The resistor $R_N$ and the random voltage generator $V_N$ model the turbulent losses and the fricative excitation at the lips. A similar equivalent circuit is obviously valid also for the radiation effects at the nostrils.
To obtain a mathematical relation between the positive and negative going pressure waves $p^{+}(t-\tau)$ and $p^{-}(t+\tau)$, one should consider the continuity conditions

$$p_N(\Delta x,t) = R_R\,u_R(t) + R_N\,u_N(\Delta x,t) + V_N$$
$$L_R\,\frac{du_L(t)}{dt} = R_R\,u_R(t) \qquad (A.3.6)$$
$$u_N(\Delta x,t) = u_R(t) + u_L(t)$$
Substituting (A.1.1) into (A.3.6) and approximating the time
derivatives with backward differences, we obtain, after
several manipulations,




Chapter 6 uses the developed synthesis system for the
investigation of different aspects of speech processing
(Sections 6.1, 6.2, 6.3, 6.4) and speech physiology
(Sections 6.5, 6.6, 6.7, 6.8) that are reported in the
literature.
In particular we have been able to show
- Articulatory synthesis is capable of generating high
quality synthetic speech.
- Source tract interaction increases the high frequency
components of the sound source.
- The spectra of voiced stops at consonantal release
provide a cue for the recognition of the place of
articulation.
- The glottal least square inverse filtering of speech
based on the closed phase autocovariance analysis is
not significantly affected by the soft characteristics
of the vocal cavity walls.
- The yielding properties of the cavity walls are
greatly responsible for the vocal cord oscillation
during consonantal closure.
- The reduction of voiced intensity observed in human
subjects as a consequence of nasal-pharyngeal coupling
has been correctly simulated.


For small h (assuming that the sixth derivative of f(y(t),t) is continuous) we therefore obtain the local truncation error estimate

$$y - y^{(1)} \approx \frac{y^{(1)} - y^{(2)}}{15} \qquad (4.2.7)$$

If (4.2.7) is greater than a given tolerance, say $\varepsilon_1$, the increment h is halved and the procedure starts again at the last computed mesh point $t_0$. If it is less than $\varepsilon_1$, $y^{(1)}(t_0+h)$ and $y^{(1)}(t_0+2h)$ are assumed correct. Furthermore, if it is less than $\varepsilon_1/50$, the next step will be tried with a doubled increment.
Unfortunately, on the average this method requires a
total of 5.5 function (derivative) evaluations as opposed to
four if the step size is not automatically controlled.
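As an illustration, the step-doubling control just described can be sketched in a few lines of Python. The function names and the scalar formulation are ours, not the dissertation's code; they only mirror the rules given above.

    def rk4_step(f, t, y, h):
        # One classical 4th order Runge-Kutta step from t to t + h.
        k1 = f(y, t)
        k2 = f(y + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(y + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(y + h * k3, t + h)
        return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

    def advance(f, t, y, h, eps1):
        # Compare two steps of size h with one step of size 2h:
        # halve h if estimate (4.2.7) exceeds eps1, double it for the
        # next step if the estimate is below eps1/50.
        while True:
            y_mid = rk4_step(f, t, y, h)
            y1 = rk4_step(f, t + h, y_mid, h)   # y(1)(t0 + 2h)
            y2 = rk4_step(f, t, y, 2.0 * h)     # y(2)(t0 + 2h)
            err = abs(y1 - y2) / 15.0           # estimate (4.2.7)
            if err > eps1:
                h = 0.5 * h                     # halve the increment and retry
                continue
            h_next = 2.0 * h if err < eps1 / 50.0 else h
            return t + 2.0 * h, y1, h_next      # accept y(1)(t0 + 2h)

The sketch recomputes the first derivative of the double step for clarity; reusing it (it equals the first derivative of the first half step) gives the eleven evaluations per accepted double step, i.e., the 5.5 evaluations per step of size h quoted above.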
4.2.c) Order Selection
In Section 4.2.a we derived the 4th order Runge-Kutta
method but Runge-Kutta methods of different orders could be
derived as well.
This observation leads to the question, "How does the
choice of the order affect the amount of work required to
integrate a system of ordinary differential equations?".
For example, small approximation errors can be more
efficiently achieved with high order methods, while for low


junction between the Vth and (V+1)st tubes of the vocal
tract.
If we assume a lossless junction, the relation between
the pressure waves leaving and entering the junction can be
easily determined with the aid of the continuity conditions

$$p_V(\Delta x,t) = p_C(0,t) = p_{V+1}(0,t) \qquad (A.3.8)$$
$$u_V(\Delta x,t) = u_C(0,t) + u_{V+1}(0,t)$$

Substituting (A.1.1) into (A.3.8), we obtain
$$\begin{bmatrix} p_V^{-}(t-\tau) \\ p_{V+1}^{+}(t) \\ p_C^{+}(t) \end{bmatrix} = [S] \begin{bmatrix} p_V^{+}(t-\tau) \\ p_{V+1}^{-}(t) \\ p_C^{-}(t) \end{bmatrix}$$

where

$$[S] \triangleq \frac{1}{A_V + A_{V+1} + A_C} \begin{bmatrix} A_V - A_{V+1} - A_C & 2A_{V+1} & 2A_C \\ 2A_V & A_{V+1} - A_V - A_C & 2A_C \\ 2A_V & 2A_{V+1} & A_C - A_V - A_{V+1} \end{bmatrix}$$
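For concreteness, the junction matrix reconstructed above can be computed as follows (a small Python sketch under our notation, where a_v, a_v1 and a_c are the areas of the two tract sections and of the coupling section; it is not part of the dissertation's software):

    import numpy as np

    def junction_matrix(a_v, a_v1, a_c):
        # Lossless three-way junction: each row maps the incident
        # pressure waves onto one of the waves leaving the junction.
        total = a_v + a_v1 + a_c
        return np.array([
            [a_v - a_v1 - a_c, 2.0 * a_v1,       2.0 * a_c],
            [2.0 * a_v,        a_v1 - a_v - a_c, 2.0 * a_c],
            [2.0 * a_v,        2.0 * a_v1,       a_c - a_v - a_v1],
        ]) / total

Setting a_c = 0 recovers the familiar two-tube Kelly-Lochbaum reflection coefficient (A_V - A_{V+1})/(A_V + A_{V+1}) in the first row, which is a useful consistency check.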


sound segments at the acoustic level; however, the segments
are not separately articulated [102]. Rather they are
"coarticulated" and neighboring segments overlap and affect
one another in various ways. This phenomenon is usually
denoted as coarticulation.
Coarticulation is a conceptualization of speech
behaviour that implies [103] discrete and invariant units
serving as input to the system of motor (muscular) control
and an eventual obscuration of the boundaries between units
at the articulatory or acoustic levels.
Coarticulation is bidirectional. Given a sequence of
speech segments ABC, if B exerts an influence on C we talk
about left to right or carry over coarticulation; if B
exerts an influence on A we talk about right to left or
anticipatory coarticulation. For example, both effects are
present in the word spoon (/spun/) in which the lip
protrusion characteristic of /u/ extends both to the right
and to the left [104].
Other examples of anticipatory coarticulation have been
reported by a number of authors. Tongue shape for vowels
appears to be influenced by the following vowel and
vice versa [80] [105-106]. Amerman et al. [107] observed the
jaw opening for an open vowel two consonants before the
vowel. Similarly [108], the velopharyngeal opening may
occur two vowels in advance of a nasal consonant.


AN ARTICULATORY SPEECH SYNTHESIZER
BY
ENRICO LUIGI BOCCHIERI
A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1984

ACKNOWLEDGMENTS
I would like to express my sincere appreciation to my
advisory committee for their help and guidance throughout
this work.
I would like to give special thanks to my committee
chairman, Dr. D. G. Childers for his competent advice, for
his support, both financial and moral, and for providing an
educational atmosphere without which this research would
never have been possible.
Special gratitude is also expressed to Dr. E. R.
Chenette for his guidance, encouragement and financial
support.
To my parents and family I am forever indebted. Their
unceasing support and encouragement made it all possible and
worthwhile.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ii

ABSTRACT v

CHAPTER Page

1 INTRODUCTION: SPEECH SYNTHESIS APPLICATIONS. RESEARCH GOALS 1

2 SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS 7
2.1) Speech Physiology and the Source-Filter Model 8
2.2) Linear Prediction 13
2.3) Formant Synthesis 16
2.4) Articulatory Synthesis 18

3 ACOUSTIC MODELS OF THE VOCAL CAVITIES 23
3.1) Sound Propagation in the Vocal Cavities 23
3.1.a) Derivation of the Model 23
3.1.b) Modeling the Yielding Wall Properties 32
3.1.c) Nasal Coupling 37
3.2) Excitation Modeling 37
3.2.a) Subglottal Pressure 37
3.2.b) Voiced Excitation 38
3.2.c) Unvoiced Excitation 45
3.3) Radiation Load 47
3.4) Remarks and Other Acoustic Models 48

4 NUMERICAL SOLUTION OF THE ACOUSTIC MODEL 51
4.1) Requirements of the Numerical Solution Procedure 51
4.2) Runge-Kutta Methods 54
4.2.a) Derivation of the Method 54
4.2.b) Control of the Step Size with Runge-Kutta 57
4.2.c) Order Selection 59
4.3) Multistep Methods 60
4.3.a) Implicit and Explicit Methods 60
4.3.b) Derivation of the Methods 61
4.3.c) Characteristics of Multistep Methods 64
4.4) Method Selection 66

5 THE ARTICULATORY MODEL AND ITS INTERACTIVE GRAPHIC IMPLEMENTATION 71
5.1) Definition of the Articulatory Model 71
5.2) The Graphic Editor 74
5.3) Display of the Time Variations of the Model 80
5.4) Simultaneous and Animated Display of the Articulatory and Acoustic Characteristics of the Vocal Cavities 84

6 SIMULATION RESULTS 88
6.1) Speech Synthesis 88
6.2) Source Tract Interaction 91
6.3) Onset Spectra of Voiced Stops 95
6.4) Glottal Inverse Filtering of Speech 99
6.5) Simulation of Wall Vibration Effects 105
6.5.a) Vocal Cords Vibration During Closure 105
6.5.b) Formant Shift 110
6.6) Pathology Simulation: Reduction of Sound Intensity During Nasalization 112
6.7) Coarticulation 117
6.8) Interpretation of the EGG Data with the Two Mass Model of the Vocal Cords 124

7 CONCLUSIONS 133
7.1) Summary 133
7.2) Suggestions for Future Research 135

APPENDIX
AN EFFICIENT ARTICULATORY SYNTHESIS ALGORITHM FOR ARRAY PROCESSOR IMPLEMENTATION 137
A.1) Wave Propagation in Concatenated Lossless Tubes 138
A.2) Modifications of Kelly-Lochbaum Algorithm 141
A.2.a) Fricative Excitation 141
A.2.b) Yielding Wall Simulation 146
A.3) Boundary Conditions 150
A.3.a) Glottal Termination 151
A.3.b) Radiation Load 153
A.3.c) Nasal Coupling 155

REFERENCES 158

BIOGRAPHICAL SKETCH 169

Abstract of Dissertation Presented to the Graduate
Council of the University of Florida in Partial
Fulfillment of the Requirements for the
Degree of Doctor of Philosophy
AN ARTICULATORY SPEECH SYNTHESIZER
By
Enrico Luigi Bocchieri
April 1984
Chairman: Dr. D. G. Childers
Major Department: Electrical Engineering
Linear prediction and formant synthesizers are based on
a rather approximate model of speech production physiology,
using analysis or "identification" algorithms of natural
speech to overcome the model limitations and to synthesize
good quality speech.
On the contrary, articulatory synthesizers are based on
a more exact speech production model, and do not use
identification algorithms to derive the model parameters
directly from the natural speech waveform.
This dissertation shows that the amount of
physiological detail captured by the articulatory synthesis
method is sufficient for the generation of high quality
synthetic speech and for the simulation of physiological and
pathological aspects of speech that are reported in the
literature.

Articulatory synthesis of speech represents the
acoustic properties of the vocal cavities by means of
modeling and numerical simulation techniques that are
reported in Chapters 3 and 4.
We have been able to guarantee the stability of the
numerical method and to halve the number of differential
equations that must be solved for the simulation of the
sound propagation in the vocal tract (Chapter 4).
In the Appendix we present a new and more efficient
algorithm for the simulation of the vocal cavity acoustics
which can be efficiently implemented with parallel
processing hardware.
Interactive graphic software (Chapter 5) has been
developed to represent the configurations of the vocal
cavities and to provide us with a convenient interface for
the manipulation of the geometric model of the vocal
cavities.
Chapter 6 employs the developed articulatory synthesis
system for the simulation of different aspects of speech
processing, for modeling speech physiology, and testing
theories of linguistics reported in the literature. We
discuss and illustrate such cases as source tract
interaction, EGG modeling, onset spectra of voiced stops at
consonantal release, the effects of yielding walls on
phonation, sound intensity reduction during nasalization,
and glottal least squares inverse filtering.

CHAPTER 1
INTRODUCTION: SPEECH SYNTHESIS APPLICATIONS. RESEARCH GOALS
In the last 3-4 decades both the engineering and
medical communities have devoted considerable research
effort to the problem of speech synthesis, i.e., the
generation of voice by artificial, electrical or mechanical
means.
The earliest attempts to construct talking machines can
be traced to the late 18th century. One of the first speech
synthesis devices was Kempelen's talking machine [1-2] which
in a demonstration in Vienna in 1791 was capable of
imitating the sounds of vowels and of many consonants.
Perhaps the greatest motivation for speech synthesis
research came from the development of telecommunications and
from the consequent engineering interest in efficient
methods for speech transmission. Moreover, the recent
progresses in circuit integration, microprocessors and
digital computers have made the implementation of high
performance speech transmission systems technologically
feasible [3-6]. This type of application requires a scheme
known as speech synthesis by analysis.

In its simplest form, speech communication is achieved
by modulating an electrical magnitude (for example, the
current in a transmission line) with the air pressure during
speech production. With this straightforward approach a
copy, in electrical terms, of the speech waveform can be
transmitted on a communication channel with a typical
bandwidth of about 3 KHz.
However, there appears to be a mismatch between the
information content of speech and the channel capacity. In
fact, the information content of written text may be
estimated at about 50 bit/sec [7] while the channel capacity
of a 3 kHz bandwidth and a typical signal-to-noise ratio is
about 30,000 bits/sec. Similar bit rates are also
encountered in conventional PCM speech transmission. Even
though spoken speech contains more information (such as
intonation and stress) than its written counterpart, the
above mentioned mismatch indicates that a smaller channel
bandwidth can be used for a more efficient transmission of
speech. Using different tradeoffs between the
intelligibility and naturalness of speech transmission on
one side and bit rate on the other, engineers have been able
to transmit speech with bit rates varying from 150 to 30,000
bit/s [8-11].
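The 30,000 bit/sec figure is consistent with Shannon's channel capacity formula if, for example, a 30 dB signal-to-noise ratio is assumed (the specific ratio is our illustrative assumption):

$$C = B \log_2\!\left(1 + \frac{S}{N}\right) = 3000 \cdot \log_2\!\left(1 + 10^{3}\right) \approx 3000 \cdot 9.97 \approx 30{,}000 \ \text{bit/sec}$$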
The reduction of channel bandwidth has been obtained by
means of analysis-synthesis systems. Before transmission,

speech analysis algorithms are used to extract relevant
information about the speech waveform. This information is
then encoded and transmitted (hopefully at low bit rates)
along the communication channel. At the receiver a
synthesis algorithm is used to reconstruct the speech
waveform from the transmitted information.
This synthesis by analysis process is useful not only
in voice communication systems; for example, in automatic
voice answering systems, words or sentences are stored for
successive playbacks. In addition synthesis by analysis can
be used to reduce the memory usage. Texas Instruments'
learning aid, Speak and Spell, is an example of this type of
application.
Synthesis by rule or text to speech synthesis is a
different type of application that has received considerable
attention lately [12-13]. In this case the problem is not
to "regenerate" synthetic speech after an analysis phase of
its natural counterpart. Instead synthetic speech is
automatically produced according to certain linguistic rules
which transform a string of discrete input symbols directly
into speech [14] (see Figure 1.1). Applications of text to
speech synthesis include reading machines for the blind
[15], automatic answering systems, and man-machine
communication.
The medical community is interested in speech synthesis
systems for different reasons. Speech synthesizers are

[Figure 1.1. Text to speech synthesis: discrete input symbols drive a synthesis strategy that uses stored rules and stored data to produce the control signals of the speech synthesizer.]

often used in psychoacoustic and perceptual experiments
[16-18] in which the acoustic characteristics of speech must
be precisely and systematically controlled. Moreover the
vocal system is not easily accessible; therefore speech
physiologists and pathologists may use computer models as an
aid for the investigation of the physiology of the vocal
system and the diagnosis of voice disorders [19-20].
The purpose of this research is to apply speech
synthesis techniques for the simulation of the physiological
process of speech articulation in relation to the acoustic
characteristics of the speech signal.
Chapter 2 reviews the speech synthesis strategies used
most often and explains why the so-called "articulatory
synthesis" method has been selected for our research.
Speech generation depends on the acoustic properties of the vocal cavities, which are physiologically determined during speech articulation by the geometric configuration of the vocal system.
The model of the acoustic characteristics of the vocal
cavities is explained in detail in Chapter 3, together with
its implementation by means of numerical simulation
techniques in Chapter 4. Chapter 5 focuses on the geometry
or spatial model of the vocal tract together with the
interactive graphic techniques that have been used for its
representation.

Synthesis and simulation results are presented in
Chapter 6. Chapter 7 provides a discussion of our findings
along with conclusions and suggestions for future research.

CHAPTER 2
SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS
As indicated in the introduction (Chapter 1) there are
many different applications that motivate research in the
area of speech synthesis. However, different goals usually
require different approaches for the solution of the
problem. This chapter will briefly consider the three most
popular and best documented techniques for speech synthesis
(namely linear prediction, formant and articulatory
synthesis), their relative "advantages" and "disadvantages"
and the applications for which they are most suitable. The
purpose is to review the available speech synthesis
techniques and to justify the choice of articulatory
synthesis for our research.
Every speech synthesis strategy is based on a more or
less complete model of the physiology of speech production
and ultimately its performance is determined by the amount
of acoustic and linguistic knowledge that the model can
capture.
In the first section of this chapter we therefore discuss the basic notions of the physiology of speech
together with the source-filter production model upon which
both linear prediction and formant synthesis are based.

2.1) Speech Physiology and the Source-Filter Model.
The acoustic and articulatory features of speech
production can be most easily discussed by referring to
Figure 2.1, which shows the cross-section of the vocal
apparatus.
The thoracic and abdominal musculatures are the
source of energy for the production of speech. The
contraction of the rib cage and the upward movement of the
diaphragm increase the air pressure in the lungs and expel
air through the trachea to provide an acoustic excitation of
the supraglottal vocal cavities, i.e., the pharynx, mouth
and nasal passage.
The nature of speech sounds is mostly determined by the
vocal cords and by the supraglottal cavities. The vocal
cords are two lips of ligament and muscle located in the
larynx; the supraglottal cavities are the oral and nasal
cavities that are vented to the atmosphere through the mouth
and nostrils.
Physically, speech sounds are an acoustic pressure wave
that is radiated from the mouth and from the nostrils and is
generated by the acoustic excitation of the vocal cavities
with the stream of air that is coming from the lungs during
exhalation.
An obvious and important characteristic of speech is
that it is not a continuous type of sound but instead it is

Figure 2.1. Schematic diagram of the human vocal
mechanism (from [7] ). By permission
of Springer-Verlag.

perceived as a sequence of speech units or segments. In
general the different types of sound that occur during
speech production are generated by changing the manner of
excitation and the acoustic response of the vocal cavities.
As a first order approximation we can distinguish
between a "voiced" and a "fricative" or "unvoiced"
excitation of the vocal cavities.
The voiced excitation is obtained by allowing the vocal
cords to vibrate so that they modulate the stream of air
that is coming from the lungs, producing an almost periodic
signal. For example, vowels are generated in this way and
they are perceived as continuous non-hissy sounds because
the excitation is essentially periodic.
In contrast, unvoiced or fricative excitation is
achieved by forcing the air flow through a constriction in
the vocal tract with a sufficiently high Reynolds number,
thereby causing turbulence. This excitation has random or
"noisy" characteristics and, therefore, the resulting speech
sounds will be hissy or fricative (friction-like) as in the
case of the consonants /s/ and /f/.
Both voiced and unvoiced excitation signals have a
rather wide spectrum. Typically the power spectrum of the
voiced excitation decreases with an average slope of
12 dB/octave [7] while the unvoiced spectrum can be
considered white over the speech frequencies [21].

The spectral characteristics of the excitation are
further modified by the acoustic transfer function of the
vocal cavities. Sound transmission is more efficient at the
resonance frequencies of the supraglottal vocal system and,
therefore, the acoustic energy of the radiated speech sound
is concentrated around these frequencies (formant
frequencies). During the generation of connected speech,
the shape and acoustic characteristics of the vocal cavities
are continuously changed by precisely timed movements of the
lips, tongue and of the other vocal organs. This process of
adjustment of the vocal cavity shape to produce different
types of speech sounds is called articulation.
These considerations about speech physiology lead to
the simple but extremely useful source-tract model of speech
production [22], which has been explicitly or implicitly
used since the earliest work in the area of speech synthesis
[23-25]. This model is still employed in linear prediction
and formant synthesis. It consists (see Figure 2.2) of a
filter whose transfer function models the acoustic response
of the vocal cavities and of an excitation source that
generates either a periodic or a random signal for the
production of voiced or unvoiced sounds, respectively. The
operation of the source and the filter transfer function can
be determined by external control parameters to obtain an
output signal with the same acoustic properties as speech.

[Figure 2.2. The source-tract speech production model: the sound source g(t), voiced or unvoiced, drives the vocal tract transfer function h(t); the sound output is s(t) = g(t)*h(t).]

2.2) Linear Prediction
The simplest and most widely used implementation of the
source tract model of speech production is Linear Prediction
synthesis. It is a synthesis by analysis method that was
first proposed by Atal and Hanauer [26] and it has been
investigated for a great variety of speech applications.
The method is particularly suitable for digital
implementation and it assumes a time discrete model of
speech production, typically with a sampling frequency
between 7 and 10 kHz. It consists (see Figure 2.3) of two
signal generators of voiced and unvoiced excitation and of
an all pole transfer function
$$H(z) = \frac{1}{1 + \sum_{k=1}^{p} a_k z^{-k}} \qquad (2.2.1)$$
to represent the acoustic response of the vocal cavities.
Mathematically, the transfer function H(z) is determined by
the predictor coefficients $a_k$. The great advantage of
linear prediction is that an estimate of the predictor
parameters can be efficiently obtained using an analysis
phase of natural speech. The literature presents several
algorithms to perform this analysis. Perhaps the schemes
most used for speech synthesis applications are the
autocorrelation method [27] and, for hardware
implementation, the PARCOR algorithm [28].
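As a sketch of how the model of Figure 2.3 is used at the receiver, the following Python fragment filters a periodic impulse train through the all pole filter (2.2.1). The two predictor coefficients are placeholders chosen to give a single stable resonance; they are not values estimated from real speech.

    import numpy as np

    def lp_synthesize(a, excitation):
        # Direct-form realization of (2.2.1):
        # s[n] = g[n] - sum_k a[k] * s[n-k]
        p = len(a)
        s = np.zeros(len(excitation))
        for n in range(len(excitation)):
            acc = excitation[n]
            for k in range(1, p + 1):
                if n - k >= 0:
                    acc -= a[k - 1] * s[n - k]
            s[n] = acc
        return s

    # Voiced excitation: impulse train with an 80 sample period
    # (100 Hz fundamental at an 8 kHz sampling frequency).
    g = np.zeros(800)
    g[::80] = 1.0
    s = lp_synthesize([-1.8, 0.97], g)   # one resonance near 530 Hz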

[Figure 2.3. Linear prediction speech production model: external control parameters set the voiced/unvoiced (V/UV) switch, the excitation, and the all pole filter.]

During speech articulation the vocal cavity transfer
function is continuously changing and, ideally, the result
of the analysis of natural speech is a time varying estimate
of the linear predictor parameters (or of an equivalent
representation) as they change during speech production.
Also, the literature reports many algorithms which can
be applied to natural speech to extract the fundamental
(vocal cord oscillation) frequency and to perform the
voiced/unvoiced decision [29-30]. Some of them have been
implemented on special purpose integrated circuits for real
time applications [3] [5].
Therefore, linear prediction provides a complete
synthesis by analysis method in which all the control
parameters of Figure 2.3 can be derived directly from
natural speech.
The shortcoming of linear prediction is that the
transfer function (2.2.1) cannot properly model the
production of nasal, fricative and stop consonants. The all
pole approximation of the vocal tract transfer function is
in fact theoretically justified only for vowel sounds, and
even in this case the linear prediction model assumes a
minimum phase excitation signal.
In spite of these disadvantages, however, linear
prediction performs well for speech synthesis applications
because the approximations introduced by the model of

Figure 2.3 do not severely affect the perceptual properties
of the speech sound. In fact the human hearing system
appears to be especially sensitive to the magnitude of the
short-time spectrum of speech [31], that is usually
adequately approximated by the linear prediction transfer
function [32]. Perhaps the minimum phase approximation is
responsible for a characteristic "buzziness" [33] of the
synthetic speech. A great deal of research is being dedicated to improving the quality of linear prediction
synthesis by using a more suitable excitation than impulse
sequences [34-35].
Linear prediction of speech is, therefore, most useful
in those applications that require a fully automated
synthesis by analysis process. Speech compression, linear
prediction vocoders, very low bit rate transmissions are
typical examples. Also linear prediction has application in
speech products where speech may be recorded for later
playback with stringent memory constraints.
2.3) Formant Synthesis
Similar to linear prediction, formant synthesis is
based on the source-tract speech production model. However,
in this case the filter that models the vocal tract is not
implemented by an all pole digital filter but it consists of
a number of resonators whose transfer function is controlled
by their resonance (formant) frequencies and bandwidths.
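A single digital formant resonator of this kind can be sketched as follows. This is the standard two-pole resonator recursion specified by a formant frequency and bandwidth; the coefficient formulas are the usual textbook ones, not code taken from a specific synthesizer.

    import math

    def resonator(x, f, bw, fs):
        # Two-pole resonator with formant frequency f and bandwidth bw
        # (both in Hz) at sampling frequency fs:
        # y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
        c = -math.exp(-2.0 * math.pi * bw / fs)
        b = 2.0 * math.exp(-math.pi * bw / fs) * math.cos(2.0 * math.pi * f / fs)
        a = 1.0 - b - c                    # normalizes the gain to 1 at 0 Hz
        y1 = y2 = 0.0
        out = []
        for sample in x:
            y = a * sample + b * y1 + c * y2
            y2, y1 = y1, y
            out.append(y)
        return out

Cascading such resonators gives the vowel branch discussed below, while in a parallel configuration each resonator output is scaled by its own amplitude gain before summation.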

Among the many formant synthesizers reported in the
literature [16] [36-37] two general configurations are
common. In one type of configuration, the formant
resonators that simulate the transfer function of the vocal
tract are connected in parallel. Each resonator is followed
by an amplitude gain control which determines the spectral
peak level. In the other type of synthesizer the resonators
are in cascade. The advantage here is that the relative
amplitudes of formant peaks for the generation of vowels are
produced correctly with no need for individual amplitude
control for each formant [16].
Formant synthesis was developed prior to linear
prediction synthesis probably because it is amenable to
analog hardware implementation. However, many synthesizers
have been implemented on general purpose digital computers
to obtain a more flexible design. The most recent and
perhaps most complete synthesizer reported in the literature
has been designed by Klatt [16], which consists of a cascade
and parallel configurations which are used for the
production of vowels and consonants, respectively.
An advantage of formant synthesizers is their
flexibility. In fact, thanks to the parallel configuration,
the filter transfer function may have both zeroes and
poles. Klatt's synthesizer, for example, has 39 control
parameters which not only control the filter transfer

function in terms of formant frequencies and bandwidths but
also determine different types of excitation such as voiced,
unvoiced, mixed, sinusoidal for the generation of
consonantal murmurs, and burst-like for the simulation of
stop release.
Formant synthesizers are particularly useful in
psychoacoustic studies since the synthetic speech waveform
can be precisely controlled by parameters which are more
directly related to the acoustic characteristics of speech
than the linear prediction coefficients. For the same
reason they are also more suitable for speech synthesis by
rule.
Synthesis by analysis with formant synthesizers
requires the extraction of the formant information from the
speech signal. For this purpose the literature presents
several formant analysis algorithms [38-41]. However linear
prediction analysis is simpler and more efficient, and therefore linear prediction is usually preferred in synthesis by analysis applications.
2.4) Articulatory Synthesis
Linear prediction and formant synthesis are not completely suitable for our research since they do not faithfully model the human speech production mechanism.

A first disadvantage, which is inherent to the source-
filter model, is the assumption of separability between the
excitation and the acoustic properties of the vocal tract.
Clearly this assumption is not valid for the production of
fricative sounds in which the excitation depends on the
vocal tract constriction or for the generation of stops in
which the tract closure arrests the glottal air flow.
Source tract separability is a first order modeling
approximation even in the production of vowels as documented
by many recent papers concerning the nature of source tract
interaction [42-45]. We will address this issue in
Chapter 6 in more detail.
The second "disadvantage" (for our purpose) is that the
filter in the source-tract model (and also linear prediction
and formant synthesis) accounts only for the acoustic
input/output transfer function of the vocal cavities which
is estimated by means of analysis or "identification"
algorithms. For example, Linear Prediction and Formant
synthesis cannot model the aerodynamic and myoelastic
effects that determine the vocal cords vibration, the air
pressure distribution along the oral and nasal tracts and
the vibration of the vocal cavity walls.
In other words, linear prediction and formant synthesis
define an algorithm for the generation of signals with the
same acoustic features of natural speech but they do not
model the physiological mechanism of speech generation.

These limitations of the source-tract model of speech
production can be overcome by the so called "articulatory"
synthesis that is based on a physiological model of speech
production. It consists (see Figure 2.4) of at least two
separate components,
1) an articulatory model that has as input the time
varying vocal organ positions during speech
production to generate a description of the
corresponding vocal cavity shape and
2) an acoustic model which, given a certain time
varying vocal cavity configuration, is capable of
estimating not only the corresponding speech
waveform but also the pressure and volume velocity
distribution in the vocal tract, the vibration
pattern of the vocal cords and of the vocal cavity
walls.
The strength of linear prediction and formant synthesis, namely the existence of analysis algorithms for natural speech, is, however, a weak point for articulatory
synthesis. Even if several methods for estimating the vocal
tract configuration are presented in the literature [46-51],
these procedures cannot be easily applied to all the
different types of speech sounds. This estimation is made
even more difficult to achieve by the fact that the acoustic
to articulatory transformation is not unique [52-53]. This

[Figure 2.4. Articulatory synthesis of speech: vocal organ positions are input to the articulatory model, which produces a description of the vocal cavity shape; the vocal cavity acoustic model then generates the synthetic speech, the vibration of the vocal cords, and the volume velocity and pressure distribution in the vocal cavity.]

disadvantage, together with high computational requirements, limits the use of articulatory synthesis for speech applications, a use that has recently been investigated by Flanagan et al. [54].
In the following, Chapters 3 and 4 will discuss the
acoustic modeling of the vocal cavities and its
implementation with numerical simulation techniques.
Chapter 5 concentrates on the articulation model and the
computer graphic techniques used for its implementation.

CHAPTER 3
ACOUSTIC MODELS OF THE VOCAL CAVITIES
The qualitative descriptions of the human speech
production mechanism and of the articulatory method of
speech synthesis that we have given in Section 2.1 cannot be
directly implemented on a digital computer. This knowledge
must be transformed into an analytical representation of the
physics of sound generation and propagation in the vocal
cavities.
The mathematical model of the vocal cavity acoustics
can be conveniently interpreted by means of equivalent
circuits. In such a representation the electrical current
and voltage correspond respectively to the air pressure and
volume velocity in the vocal cavities. In the following we
will always express all the physical dimensions in C.G.S.
units.
3.1) Sound Propagation in the Vocal Cavities.
3.1.a) Derivation of the Model.
Sound is nearly synonymous with vibration. Sound waves
are originated by mechanical vibrations and are propagated
in air or other media by vibrating the particles of the
media. The fundamental laws of mechanics, such as momentum,
mass and energy conservation and of fluid dynamics can be
applied to the compressible, low viscosity medium (air) to
quantitatively account for sound propagation.
The oral and nasal tracts are three-dimensional lossy
cavities of non-uniform cross-sections and non-rigid
walls. Their acoustic characteristics are described by a
three dimensional Navier-Stokes partial differential
equation with boundary conditions appropriate to the yielding
walls. However, in practice, the solution of this
mathematical formulation requires
"an exhorbitant amount of computation, and we do not
even know the exact shape of the vocal tract and the
characteristics of the walls to take advantage of such a
rigorous approach." [55]
The simplifying assumption commonly made in the
literature is plane wave propagation. Most of the sound
energy during speech is contained in the frequency range
between 80 and 8000 Hz [56] but speech quality is not
significantly affected if only the frequencies below 5 kHz
are retained [16]. In this frequency range the cross
sectional dimensions of the vocal tract are sufficiently
small compared to the sound wavelength so that the departure
from plane wave propagation is not significant.
Thanks to the plane wave propagation assumption, the
geometric modeling of the vocal cavities can be greatly
simplified. Acoustically the vocal tract becomes equivalent
to a circular pipe of non-uniform cross-section (see

Figure 3.1) whose physical dimensions are completely
described by its cross-sectional area, A(x), as a function
of the distance x along the tube (area function). The sound
propagation can now be modeled by a one dimensional wave
equation. If the losses due to viscosity and thermal
conduction either in the bulk of the fluid or at the walls
of the tube are neglected, the following system of
differential equations accurately describes the wave
propagation [29]
$$\frac{\partial p(x,t)}{\partial x} + \frac{\rho}{A(x,t)}\,\frac{\partial u(x,t)}{\partial t} = 0 \qquad (3.1.1a)$$

$$\frac{\partial u(x,t)}{\partial x} + \frac{1}{\rho c^{2}}\,\frac{\partial\bigl(p(x,t)\,A(x,t)\bigr)}{\partial t} + \frac{\partial A(x,t)}{\partial t} = 0 \qquad (3.1.1b)$$

where
x = displacement along the axis of the tube
p(x,t) = pressure in the tube as a function of time and displacement
u(x,t) = air volume velocity in the tube
A(x,t) = area function of the tube
$\rho$ = air density
c = sound velocity.

The first and second equations (3.1.1) correspond to Newton's law and the continuity law, respectively.

Figure 3.1. The vocal tract represented by a non-uniform pipe and its area function.

Equations (3.1.1) indicate that the area function
A(x,t) is varying with time. Physiologically this is caused
by two different phenomena,
a) the voluntary movements of the vocal organs during
speech articulation, and
b) the vibration of the vocal cavity walls that are
caused by the variations of the vocal tract
pressure during speech.
We therefore represent the vocal tract area function as
a summation of two components

$$A(x,t) = A_0(x,t) + \delta A(x,t) \approx A_0(x) + \delta A(x,t) \qquad (3.1.2)$$
The first component $A_0(x,t)$ is determined by a) above and it
represents the "nominal" cross-sectional area of the vocal
tract. Since the movements of the vocal organs are slow
compared to the sound propagation, this component can be
considered time invariant with a good approximation. The
second component $\delta A(x,t)$ represents the perturbation of the
cross-sectional area of the vocal tract that is caused by b)
above. Its dynamics cannot be neglected as compared with
the acoustic propagation. Nevertheless, this component has
a relatively small magnitude compared to A(x,t).
If we substitute (3.1.2) into (3.1.1) and if we neglect
2nd order terms we obtain

$$\frac{\partial p(x,t)}{\partial x} + \frac{\rho}{A_0(x)}\,\frac{\partial u(x,t)}{\partial t} = 0 \qquad (3.1.3a)$$

$$\frac{\partial u(x,t)}{\partial x} + \frac{A_0(x)}{\rho c^{2}}\,\frac{\partial p(x,t)}{\partial t} + \frac{\partial\,\delta A(x,t)}{\partial t} = 0 \qquad (3.1.3b)$$
The partial differential equations (3.1.3) can be
approximated with a system of ordinary differential
equations. Let the acoustic pipe be represented by a sequence of N elemental lengths with circular and uniform cross-sections $A_i$, $i = 1,\dots,N$. This is equivalent to
approximating the area function A(x) using a stepwise method
as shown in Figure 3.2.
If each elemental length is sufficiently shorter than
the sound wavelength, we can suppose that the pressure and
volume velocity are independent of the position in the
elemental length itself. Instead of the functions p(x,t)
and u(x,t), we need to consider a finite number of time
functions only
$$p_i(t),\ u_i(t)\,; \qquad i = 1,\dots,N$$
that represent the pressure and volume velocity in the ith
elemental section as a function of time. The partial
derivatives can now be approximated by finite differences

Figure 3.2. Stepwise approximation of the area function.

$$\frac{\partial p(x,t)}{\partial x} \approx \frac{p_i(t) - p_{i-1}(t)}{\Delta x}\,, \qquad \frac{\partial u(x,t)}{\partial x} \approx \frac{u_i(t) - u_{i-1}(t)}{\Delta x}$$

where $p_i(t)$, $u_i(t)$ = pressure and volume velocity in the ith vocal tract section and $\Delta x = L/N$ = length of each elemental section.

Therefore equations (3.1.3) become

$$\frac{du_i(t)}{dt} = -\frac{A_i}{\rho\,\Delta x}\,\bigl(p_i(t) - p_{i-1}(t)\bigr) \qquad (3.1.4a)$$

$$\frac{dp_i(t)}{dt} = -\frac{\rho c^{2}}{A_i\,\Delta x}\,\Bigl(u_i(t) - u_{i-1}(t) + \Delta x\,\frac{dA_i(t)}{dt}\Bigr) \qquad (3.1.4b)$$

$$i = 1,\ \dots,\ N$$
Equations (3.1.4) can be represented by the equivalent
electrical circuit shown in Figure 3.3. Inductor $L_i$ and capacitor $C_i$ represent the inertance and compressibility of the air in the ith elemental length of the vocal tract. They are defined in terms of the cross-sectional area $A_i$ as

Figure 3.3. Vocal tract elemental length and its equivalent
circuit.

$$C_i = \frac{A_i\,\Delta x}{\rho c^{2}}\,, \qquad L_i = \frac{\rho\,\Delta x}{A_i}\,, \qquad i = 1,\ \dots,\ N$$
The component $\Delta x\,\frac{dA_i(t)}{dt}$ in equation (3.1.4b) that represents the vibration of the cavity walls is modeled by the current in the impedance $ZW_i$, as will be shown in the next section.
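In software, the per-section elements follow directly from a stepwise area function. The following Python sketch uses the C.G.S. units adopted throughout; the numerical constants are nominal values for air and are our assumptions, not values quoted in the text.

    RHO = 1.14e-3     # air density, g/cm^3 (assumed nominal value)
    C_SOUND = 3.5e4   # sound velocity, cm/s (assumed nominal value)

    def section_elements(areas, dx):
        # areas: cross sections A_i in cm^2; dx: elemental length in cm.
        inductances = [RHO * dx / a for a in areas]                     # L_i
        capacitances = [a * dx / (RHO * C_SOUND ** 2) for a in areas]   # C_i
        return inductances, capacitances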
3.1.b) Modeling the Yielding Wall Properties
From equation (3.1.4b) we can see that the effect of wall vibrations is to generate in the ith elemental section of the vocal tract an additional volume velocity component equal to $\Delta x\,\frac{dA_i(t)}{dt}$, which is represented in Figure 3.3 by the current $u_{W_i}$ in the impedance $ZW_i$. We will now consider how $ZW_i$ is related to the mechanical properties of the vibrating walls.
Consider Figure 3.4 which shows an elemental length of
the pipe in which one wall is allowed to move under the
forcing action of the pressure p(t) in the pipe itself. Let
m, k and d represent the mass, elastic constant and damping
factor of a unit surface. Since the total vibrating surface
is $l\,\Delta x$, and since we assume the walls to be locally
reacting, then the total mass, elastic constant and damping
factor of the vibrating wall are

Figure 3.4. Mechanical model of an elemental length of
the vocal tract with a yielding surface.

$m(l\,\Delta x)$, $k(l\,\Delta x)$, $d(l\,\Delta x)$,
respectively.
According to Newton's law the forcing action on the wall is

$$F = l\,\Delta x\,p(t) = m\,l\,\Delta x\,\frac{d^{2}y(t)}{dt^{2}} + d\,l\,\Delta x\,\frac{dy(t)}{dt} + k\,l\,\Delta x\,y(t) \qquad (3.1.5)$$

where y(t) is the wall displacement from the neutral position (when the pressure in the tract is equal to naught).
The airflow generated by the wall motion is

$$u_W(t) = \Delta x\,l\,\frac{dy(t)}{dt} \qquad (3.1.6)$$

and by substitution of (3.1.6) into (3.1.5) we obtain

$$p(t) = \frac{m}{l\,\Delta x}\,\frac{du_W(t)}{dt} + \frac{d}{l\,\Delta x}\,u_W(t) + \frac{k}{l\,\Delta x}\int_{-\infty}^{t} u_W(\tau)\,d\tau \qquad (3.1.7)$$

In the frequency domain, (3.1.7) can be equivalently represented by the impedance

$$Z_W(s) = s\,L_W + R_W + \frac{1}{s\,C_W} \qquad (3.1.8)$$
with
$$P(s) = Z_W(s)\,U_W(s) \qquad (3.1.9)$$
Such an impedance appears to be inversely related to the
vibrating surface ($l\,\Delta x$) and its components are
1. an inductance $L_W$ proportional to the mass of the unit surface of the vibrating wall,
2. a resistor $R_W$ proportional to the viscous damping per unit surface of the vibrating walls, and
3. a capacitor $C_W$ inversely proportional to the elastic constant per unit surface.
Direct measurements of the mechanical properties of
different types of human tissues have been reported in the
literature [57]; however, such measurements are not directly
available for the vocal tract tissues. Moreover, it is
difficult to estimate the lateral surface of the vocal
cavities that is required to compute the numerical value of
the impedance ZW according to the above derivation.

We, therefore, use a slightly different approach
proposed by Sondhi [55]. He assumed that the vibrating
surface is proportional to the volume of the vocal cavities
and he estimated the mechanical properties of the vibrating
surface on the basis of acoustic measurements. The modeling
results match well with the direct measurements of the
mechanical impedance of human tissues performed by Ishizaka
et al. [57]. The model can be formulated as follows,
Adjust the inductive reactance $L_{W_i}$ in the ith elemental section to match the observed first formant frequency for the closed mouth condition of about 200 Hz [22] [58]:

$$L_{W_i} = \frac{0.0858}{A_i\,\Delta x}$$

Next, adjust the wall loss component $R_{W_i}$ to match the closed glottis formant bandwidths [59]:

$$R_{W_i} = 130\,\pi\,L_{W_i}$$

Choose a value of $C_{W_i}$ to obtain a resonance frequency of the wall compatible with the direct measurements of Ishizaka et al. [57]:

$$C_{W_i} = \frac{1}{(2\pi \cdot 30)^{2}\,L_{W_i}}$$
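In code form this fitting procedure reads as follows. The sketch simply transcribes the formulas reconstructed above; since those reconstructions are uncertain, the constants should be checked against Sondhi [55] and Ishizaka et al. [57] before use.

    import math

    def wall_parameters(area, dx):
        # Vibrating surface assumed proportional to the section
        # volume area * dx, per Sondhi's approach.
        lw = 0.0858 / (area * dx)                       # wall inertance
        rw = 130.0 * math.pi * lw                       # wall loss term
        cw = 1.0 / ((2.0 * math.pi * 30.0) ** 2 * lw)   # ~30 Hz wall resonance
        return lw, rw, cw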

3.1.c) Nasal Coupling
During the production of nasal consonants and nasalized
vowels the nasal cavity is acoustically coupled to the oral
cavity by lowering the soft palate and opening the
velopharyngeal orifice.
The coupling passage can be represented as a
constriction of variable cross-sectional area and 1.5 cm
long [22] that can be modeled by an inductor (to account for
the air inertance) in series with a resistor (to account for
viscous losses in the passage).
The wave propagation in the nasal tract can be modeled
in terms of the nasal tract cross-sectional area as
discussed in Section (3.1.a) for the vocal tract. However,
we will use a different equivalent circuit to better account
for the nasal tract losses as we will discuss later in
Section 6.6.
3.2) Excitation Modeling
3.2.a) Subglottal Pressure
The source of energy for speech production lies in the
thoracic and abdominal musculatures. Air is drawn into the
lungs by enlarging the chest cavity and lowering the
diaphragm. It is expelled by contracting the rib cage and
increasing the lung pressure. The subglottal or lung
pressure typically ranges from 4 cm H2O for the production

of soft sounds to 20 cm H2O or more for the generation of
very loud, high pitched speech.
During speech the lung pressure is slowly varying in
comparison with the acoustic propagation in the vocal
cavities. We can, therefore, represent the lung pressure
with a continuous voltage generator whose value is
controlled in time by an external parameter Ps(t).
3.2.b) Voiced Excitation
During speech, the air is forced from the lungs through
the trachea into the pharynx or throat cavity. On top of
the trachea is mounted the larynx (see Figure 3.5), a
cartilaginous structure that houses two lips of ligament
and muscle called the vocal cords or vocal folds.
The vocal cords are posteriorly supported by the
arytenoid cartilages (see Figure 3.5). Their position and
the dimension of the opening between them (the glottis) can
be controlled by voluntary movements of the arytenoid
cartilages.
For the generation of voiced sounds the vocal cords are
brought close to each other so that the glottal aperture
becomes very small. As air is expelled from the lungs,
strong aerodynamic effects put the vocal cords into a rapid
oscillation. Qualitatively, when the vocal cords are close
to each other during the oscillation cycle, the subglottal

Figure 3.5. Cut-away view of the human larynx (from [7]).
VC vocal cords. AC arytenoid cartilages.
TC thyroid cartilage.

pressure forces them apart. This, however, increases the
air flow in the glottis and the consequent Bernoulli pressure drop between the vocal cords draws them together again. In this way a mechanical "relaxation"
oscillator is developed which modulates the airflow from the
lungs into a quasiperiodic voiced excitation.
The vocal fold oscillation frequency determines
important perceptual characteristics of voiced speech and it
is called the "pitch" frequency or fundamental frequency, $F_0$.
The first quantitative studies of the aerodynamics of
the larynx were carried out by van den Berg et al. who made
steady flow measurements from plaster casts of a "typical"
larynx [60-61].
The first quantitative self-oscillating model of the
vocal folds was proposed by Flanagan and Landgraf [62] after
Flanagan and Meinhart's studies concerning source tract
interaction [63]. The fundamental idea was to combine van
den Berg's results with the 2nd order mechanical model of
the vocal cords shown in Figure 3.6. An acoustic-mechanical relaxation oscillator was obtained.
Bilateral symmetry was assumed and only a lateral
displacement x of the masses was allowed. Therefore, only a
2nd order differential equation was needed to describe the
motion of the mass

Figure 3.6. One mass model of the vocal cords.

$$M\,\ddot{x} + B(x)\,\dot{x} + K(x)\,x = F(t)$$
The mechanical damping B(x) and elastic constants K(x) are
properly defined functions of the vocal cord position x.
The forcing action F(t) depends on the air pressure
distribution along the glottis which was estimated according
to van den Berg's results. A modified version of the one
mass model was also designed by Mermelstein [64].
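A minimal numerical sketch of this oscillator is given below. Constant b and k coefficients (and a constant forcing) replace the position-dependent B(x) and K(x) of the actual model, so the fragment illustrates only the form of the equation, not the published one mass model.

    def simulate_one_mass(m, b, k, force, dt=1.0e-5, steps=2000):
        # Forward Euler integration of M x'' + B x' + K x = F(t)
        # with constant coefficients (illustrative only).
        x, v = 0.0, 0.0
        trajectory = []
        for n in range(steps):
            a = (force(n * dt) - b * v - k * x) / m
            v += a * dt
            x += v * dt
            trajectory.append(x)
        return trajectory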
Even though the one mass model was able to simulate important physiological characteristics of vocal cord vibration, it presented several inconveniences.
1) The frequency of vocal fold vibration was too
dependent on the vocal tract shape, indicating too
great a source tract interaction.
2) The model was unable to account for phase
differences between the upper and lower edge of the
cords.
3) The model was unable to oscillate with a capacitive
vocal tract input impedance.
These difficulties were overcome by the two mass model
of the vocal cords that is shown in Figure 3.7. It was
designed by Ishizaka and Matsudaira [65] and first implemented by Ishizaka and Flanagan [66-67]. A mechanical coupling between the two masses is represented by the spring constant $k_c$. The springs $s_1$ and $s_2$ were given a non-linear

[Figure 3.7. Two mass model of the vocal cords and glottis equivalent circuit (contraction, glottis, and expansion regions).]

characteristic according to the stiffness measured on
excised human vocal cords. As in the case of the one mass
model the viscous damping was changing during the vocal
cords vibration period.
For computer simulation it is convenient to represent
the pressure distribution along the two masses with the
voltage values in an equivalent circuit. In Figure 3.7
resistance $R_c$ accounts for the Bernoulli pressure drop and "vena contracta" effect at the inlet of the glottis. Resistances $R_{v1}$ and $R_{v2}$ model the viscous losses in the glottis. Resistance $R_{12}$ accounts for the pressure difference between the two masses caused by the Bernoulli effect. The inductors model the air inertance in the glottis.
The two mass model of the vocal cords, that has been
used in connection with vocal tract synthesizers [67-68],
uses as control parameters the glottal neutral area $A_{g0}$ and the cord tension Q.
$A_{g0}$ determines the glottal area in absence of phonation and it is physiologically related to the position of the arytenoid cartilages. Q controls the values of the elastic constants of the model and greatly affects the two mass model
oscillation period. The suitability of the two mass model
and of its control parameters for speech synthesis has been
further validated in [69] and [70].

The acoustic synthesizer that we have implemented uses
the two mass model to provide voiced excitation. We
therefore account for source tract interaction since the
current in the equivalent circuit of the glottis (see
Figure 3.7) is dependent on the voltage $p_1$ that models the
pressure in the vocal tract just above the vocal cords.
3.2.c) Unvoiced Excitation
Speech sounds are generally excited by modulating the
air flow through a constriction of the glottal and
supraglottal system. For voiced sounds this modulation is
obtained through rapid changes of the glottal constriction
as explained in the review section. For fricative sounds,
the modulation comes from flow instabilities which arise by
forcing the air through a constriction with a sufficiently
high Reynolds number. In this case the classical
hypothesis of separability between the source and the tract
greatly limits the realism that can be incorporated into the
synthesizer. In fact, unvoiced excitation is greatly
dependent on the constricted area of the vocal tract itself.
The fricative self-excitation of the vocal cavities was
first modeled by Flanagan and Cherry [71]. The idea was to
use a resistor $RN_i$ and noise generator $VN_i$ in the equivalent
circuit of the ith elemental length of the vocal tract (see
Figure 3.8). The values of the resistor and of the noise

Figure 3.8. Equivalent circuit of vocal tract elemental
length with fricative excitation.

generator variance depend on the Reynolds number of the flow in the ith section [71]. The spectrum of the turbulent noise $VN_i$ can be assumed white with a good approximation
[21].
In our simulation we have modeled the fricative
excitation by means of two sources. One is always located
in the first vocal tract section to generate aspirated
sounds. The second is not bound to a fixed position but can
be moved along with the vocal tract constriction location.
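The following Python fragment sketches this kind of Reynolds-number conditioned noise source. The critical value and the scaling constant are illustrative placeholders of ours; the dependence actually used is given by Flanagan and Cherry [71].

    import math, random

    def fricative_noise(u, area, rho=1.14e-3, mu=1.86e-4, re_crit=1800.0):
        # u: volume velocity (cm^3/s) through a section of the given
        # area (cm^2); the Reynolds number uses the equivalent
        # circular diameter of the section.
        d = 2.0 * math.sqrt(area / math.pi)
        re = rho * abs(u) * d / (mu * area)
        if re <= re_crit:
            return 0.0                      # subcritical flow: no excitation
        # White noise whose variance grows with the supercritical flow.
        return random.gauss(0.0, 1.0) * (re * re - re_crit * re_crit) * 1.0e-6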
3.3) Radiation Load
The radiation effects at the mouth and nostrils can be
accounted for by modeling the mouth and nostrils as a
radiating surface placed on a sphere (the head) with a
radius of about 9 cm.
Flanagan [7] has proposed a simplified equivalent
circuit for the radiation load model by using a parallel
combination of an inductor and a resistor with values

$$R_R = \frac{128}{9\pi^{2}}\,, \qquad L_R = \frac{8a}{3\pi c}$$
where a is the radius of the (circular) radiating surface
(mouth or nostrils). Titze [19] has shown that this
approximation, which we are using now, is valid also at
relatively high frequencies, when the speech wavelength has
the same order of magnitude as the mouth radius.
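A sketch of these radiation load values in Python follows. The values are left normalized, as written above; our reading is that multiplying by $\rho c/A$, with $A = \pi a^2$, converts them to acoustic ohms, which should be checked against Flanagan [7].

    import math

    def radiation_load(a, c=3.5e4):
        # a: radius (cm) of the radiating surface (mouth or nostrils);
        # c: sound velocity (cm/s).
        r_r = 128.0 / (9.0 * math.pi ** 2)   # radiation resistance (normalized)
        l_r = 8.0 * a / (3.0 * math.pi * c)  # radiation inductance (normalized)
        return r_r, l_r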

Our model is also able to account for the sound
pressure component that is radiated through the vibration of
the vocal cavity walls. The contribution to this component
from each elementary length of the vocal cavities is
represented as a voltage drop across a suitable impedance
[67] in series with the equivalent circuit of the yielding wall defined in Section 3.1.b.
3.4) Remarks and Other Acoustic Models
In this chapter we discussed the derivation of an
electrical circuit that models the sound propagation in the
vocal cavities. These considerations can be summarized in
Figure 3.9.
The vocal and nasal tracts are represented by two
circular pipes with non-uniform cross-section (plane wave
propagation assumption). Their equivalent circuits (nasal and vocal tract networks in Figure 3.9) are made of a chain of elementary circuits. Each circuit models the wave propagation in a short length of cavity according to the
derivation of Sections 3.1.a, 3.1.b, 3.2.c.
The two mass model of the vocal cords, its control parameters $A_{g0}$ and Q, and the glottal impedance have been
treated in detail in Section 3.2.b.
Sections 3.2.a, 3.1.c, 3.3 have been concerned with the
subglottal pressure $P_s$, the velar coupling $Z_V$, and the

[Figure 3.9. Equivalent circuit of the vocal cavities; the external controls shown include muscle force, subglottal pressure, cord tension, velar coupling, and the area function.]

radiation impedances at the mouth and nostrils, which are
shown in Figure 3.9.
The approach to the acoustic modeling of the vocal
cavities that we have just reviewed is not the only one
reported in the literature. In Section 3.2.b we considered
the two mass model of the vocal cords. A more complete
model of vocal cord dynamics has been designed by Titze
[19]. He divided each cord into two vertical levels, one
level corresponding to the mucous membrane, the other to the
vocalis muscle. Each level was further divided into eight
masses which were allowed to move both vertically and
horizontally. We did not use this model because its
simulation is computationally more expensive than Flanagan's
two mass model.
Different acoustic models of the vocal tract were
designed by Kelly and Lochbaum [72] and Mermelstein [73].
The latter has been recently implemented for an articulatory
synthesizer [74].
However, these modeling approaches account for vocal
tract losses in a phenomenological way and they do not model
source tract interaction or fricative self-excitation.

CHAPTER 4
NUMERICAL SOLUTION OF THE ACOUSTIC MODEL
4.1) Requirements of the Numerical Solution Procedure
The software implementation of the acoustic model of
the vocal cavities that we have derived in the previous
chapter requires the solution of a system of ordinary
differential equations with assigned initial values.
In general we will use the notation

y'(t) = f(y(t), t),    y(0) = y_0        (4.1.1)
The numerical approach employed to solve the above problem
consists of approximating the solution y(t) as a sequence of
discrete points called mesh points. The mesh points are
assumed to be equally spaced and we indicate with h the time
interval between them. In other words, the numerical
integration procedure will give us a sequence of values y_0,
y_1, ..., y_n which closely approximate the actual solution y(t)
at the times t_0 = 0, t_1 = h, ..., t_n = nh.
In the area of ordinary differential equations the
first step toward the solution of the problem is the
selection of that particular technique among the many
available which will serve the solution best.
In our specific case the most stringent requirement is
the stability of the numerical method. Since the
integration of (4.1.1) is going to involve a large number of
mesh points we need a method which, for a sufficiently small
step size h, guarantees that a perturbation in one of the
mesh values y_n does not increase in the subsequent values
y_m, m > n.
In the following discussion, as in [75], we use a "test
equation"

y'(t) = λ y(t)

where λ is a complex constant. We introduce the concept of
an absolute stability region, which is the set of real,
nonnegative values of h and λ for which a perturbation in a
value y_n does not increase from step to step.
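The practical meaning of the absolute stability region is easily seen on the simplest method. Forward Euler applied to the test equation multiplies the solution, and any perturbation of it, by the factor (1 + hλ) at each step, so a perturbation does not grow only if |1 + hλ| ≤ 1 (h ≤ 2/|λ| for real negative λ). The short sketch below, with an illustrative λ, checks this condition for a few step sizes.

def euler_amplification(h, lam):
    # factor by which forward Euler scales a perturbation per step
    return abs(1.0 + h * lam)

lam = -1000.0                  # an illustrative, rapidly decaying mode
for h in (1e-4, 1e-3, 3e-3):
    g = euler_amplification(h, lam)
    print(h, g, "stable" if g <= 1.0 else "unstable")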
When the stability requirement is satisfied, we should
select the fastest integration method for our particular
application. This second requirement is very important
since we are dealing with a large system of differential
equations (about 100 differential equations of the 1st
order) and it takes about five hours to generate one second
of synthetic speech on our Eclipse S/130 minicomputer.
The integration speed is directly related to the step
size h of the numerical integration method. The larger the

step size the faster the integration. Unfortunately, the
precision of the numerical solution decreases when the step
size is increased and the method may become unstable. The
program to solve (4.1.1) must therefore implement an
automatic procedure for changing the step size to achieve
the maximum integration speed compatible with precision and
stability requirements. A variable step size is
particularly convenient in our case. In fact the time
constants of (4.1.1) change with the shape of the vocal
cavities and the motion of the vocal cords. A variable
control of the step size allows one to obtain the maximum
integration speed which is allowed by the method and by the
time variant differential equations. In view of these
requirements we have considered both a 4th order Runge-Kutta
method with control of the step size and multistep methods
with control of the step size and order.
The Runge-Kutta method requires the computation of
derivatives at a higher rate than multistep methods.
However, it needs less overhead for each derivative
computation [75]. In the next two sections the
characteristics of Runge-Kutta and multistep (also called
predictor-corrector) methods will be considered.
Section 4.4 will give a comparative discussion leading
to the selection of the Runge-Kutta method. Also, we will
describe a modification of the Runge-Kutta method, which
exploits the fact that wall vibration effects have larger

time constants than the propagation in the cavities. This
allows the reduction of the number of first order equations
to be solved by this method by almost a factor of two.
4.2) Runge-Kutta Methods
4.2.a) Derivation of the Method
Runge-Kutta methods are stable numerical procedures for
obtaining an approximate numerical solution of a system of
ordinary differential equations given by
y'(t) = f(y(t), t),    y(0) = y_0        (4.2.1)
The method consists of approximating the Taylor expansion

y(t_0 + h) = y(t_0) + h y'(t_0) + (h^2/2!) y''(t_0) + ...        (4.2.2)

so that, given an approximation of the solution at time t_0,
the solution at the next mesh point (t_0 + h) can be
estimated.
To avoid the computation of higher order derivatives,
it is convenient to express (4.2.1) in integral form as
y(t_0 + h) = y(t_0) + ∫_{t_0}^{t_0+h} f(y(t), t) dt        (4.2.3)
We can approximate the above definite integral by computing

f(y(t), t) at four different points of the interval
(t_0, t_0 + h) by defining

K_1 = h f(y(t_0), t_0)
K_2 = h f(y(t_0) + β K_1, t_0 + α h)
K_3 = h f(y(t_0) + β_1 K_1 + γ_1 K_2, t_0 + α_1 h)
K_4 = h f(y(t_0) + β_2 K_1 + γ_2 K_2 + δ_2 K_3, t_0 + α_2 h)        (4.2.4)

and then setting

y(t_0 + h) - y(t_0) = ∫_{t_0}^{t_0+h} f(y(t), t) dt
                    ≈ μ_1 K_1 + μ_2 K_2 + μ_3 K_3 + μ_4 K_4        (4.2.5)

The problem is now to determine the α's, β's, γ's, δ's and
μ's so that (4.2.5) is equivalent to the Taylor expansion
(4.2.2) up to the highest possible power of h. We
substitute (4.2.5) into (4.2.3) and we choose the undefined
parameters so that the powers of h^i (i = 0, ..., 4) have the same
coefficients as in (4.2.2).

We obtain a system of 8 equations in ten unknowns [76]

μ_1 + μ_2 + μ_3 + μ_4 = 1
μ_2 α + μ_3 α_1 + μ_4 α_2 = 1/2
μ_2 α^2 + μ_3 α_1^2 + μ_4 α_2^2 = 1/3
μ_2 α^3 + μ_3 α_1^3 + μ_4 α_2^3 = 1/4
μ_3 α γ_1 + μ_4 (α γ_2 + α_1 δ_2) = 1/6
μ_3 α^2 γ_1 + μ_4 (α^2 γ_2 + α_1^2 δ_2) = 1/12
μ_3 α α_1 γ_1 + μ_4 α_2 (α γ_2 + α_1 δ_2) = 1/8
μ_4 α γ_1 δ_2 = 1/24        (4.2.6)
which has two extra degrees of freedom that must be set
arbitrarily.
If we define α = α_1 = 1/2, the solution of (4.2.6)
leads to the formulas of Kutta [76]. If α = 1/2 and
α_2 = 1 we obtain Runge's formula [76] which is equivalent
to Simpson's rule

∫_{t_0}^{t_0+h} f(t) dt = (h/6) [f(t_0) + 4 f(t_0 + h/2) + f(t_0 + h)]

when y'(t) = f(t).
We use a calculation procedure derived from (4.2.6) by
Gill [76] which minimizes the memory requirements and allows
us to compensate for the round-off errors accumulated at each
step.
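For reference, the sketch below implements one step of the classical 4th order Runge-Kutta method, i.e., the solution of (4.2.6) with α = α_1 = 1/2 and α_2 = 1. Gill's procedure performs an equivalent computation but reorganizes it to minimize storage and compensate round-off, so this version is only illustrative.

def rk4_step(f, y, t, h):
    """One classical 4th order Runge-Kutta step for y' = f(y, t)."""
    k1 = h * f(y, t)
    k2 = h * f(y + 0.5 * k1, t + 0.5 * h)
    k3 = h * f(y + 0.5 * k2, t + 0.5 * h)
    k4 = h * f(y + k3, t + h)
    return y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0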
4.2.b) Control of the Step Size with Runge-Kutta
In the previous derivation we have emphasized how the
Runge-Kutta method approximates the Taylor expansion (4.2.2)
up to the 4th power of h. It is therefore a fourth order
method with a local truncation error of order h^5,

(f^(5)(y(t_0), t_0) / 5!) h^5 + O(h^6)
This accuracy is obtained without explicitly computing the
derivatives of orders higher than one, at the expense of
four evaluations of the first derivative for each mesh
point. This is a disadvantage with respect to multistep
methods (to be discussed later) which use fewer
computations of the first derivative to obtain the same
truncation error. The number of derivative evaluations per
step increases to 5.5 to obtain a variable control of the
step size with Runge-Kutta.

The step size, h, should in fact be chosen so that the
local truncation error is less than a certain maximum
acceptable value specified by the user. Unfortunately, the
truncation error cannot be directly estimated because the
Runge-Kutta procedure does not provide any information about
higher order derivatives.
A practical solution [76] is based on the results of
numerical integration with steps h and 2h, respectively,
i.e., the computation is performed a first time using h_1 = h
and then it is repeated using h_2 = 2h. Let

C_1 h_1^5 denote the truncation error using step h_1 = h,
C_2 h_2^5 denote the truncation error using step h_2 = 2h,
y_(1) denote the value obtained at (t_0 + 2h) using step h_1 = h twice,
y_(2) denote the value obtained at (t_0 + 2h) using step h_2 = 2h,
Y denote the true value of y at time (t_0 + 2h);

then

Y - y_(1) = 2 C_1 h_1^5,    Y - y_(2) = C_2 h_2^5

But C_1 ≈ C_2 for small h (assuming that the sixth derivative
of f(y(t),t) is continuous) and therefore we obtain the
local truncation error estimate

Y - y_(1) ≈ (y_(1) - y_(2)) / 15        (4.2.7)
If (4.2.7) is greater than a given tolerance, say ε_1,
the increment h is halved and the procedure starts again at
the last computed mesh point t_0. If it is less than ε_1,
y_(1)(t_0 + h) and y_(1)(t_0 + 2h) are assumed correct.
Furthermore, if it is less than ε_1/50, the next step will be
tried with a doubled increment.

Unfortunately, on the average this method requires a
total of 5.5 function (derivative) evaluations as opposed to
four if the step size is not automatically controlled.
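The complete control logic, including the ε_1 and ε_1/50 thresholds just described, can be sketched as follows (rk4_step is a single-step routine such as the one sketched in Section 4.2.a; for a system of equations the absolute value would be replaced by a norm):

def rk4_adaptive_interval(f, y0, t0, h, eps):
    """Advance y over (t0, t0 + 2h) with step-doubling error control."""
    y1 = rk4_step(f, rk4_step(f, y0, t0, h), t0 + h, h)   # two steps of h
    y2 = rk4_step(f, y0, t0, 2.0 * h)                     # one step of 2h
    err = abs(y1 - y2) / 15.0                             # estimate (4.2.7)
    if err > eps:                    # too inaccurate: halve h and retry
        return rk4_adaptive_interval(f, y0, t0, h / 2.0, eps)
    h_next = 2.0 * h if err < eps / 50.0 else h           # try doubling next
    return y1, t0 + 2.0 * h, h_next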
4.2.c) Order Selection
In Section 4.2.a we derived the 4th order Runge-Kutta
method but Runge-Kutta methods of different orders could be
derived as well.
This observation leads to the question, "How does the
choice of the order affect the amount of work required to
integrate a system of ordinary differential equations?".
For example, small approximation errors can be more
efficiently achieved with high order methods, while for low

accuracy requirements lower order methods are to be
preferred [75].
We would, therefore, like to have an automatic
mechanism for the selection of the order of the method.
This mechanism should evaluate the truncation error
corresponding to different integration orders and choose the
order which allows for the maximum step size and integration
speed compatible with the required precision.
Unfortunately, for the same reason discussed in
relation to the problem of step size control, namely the
absence of higher order derivative estimates, the Runge-
Kutta method does not provide an efficient procedure for
automatic order selection. Therefore we always use the
"standard" 4th order Runge-Kutta method.
4.3) Multistep Methods
4.3.a) Implicit and Explicit Methods
Those methods, like Runge-Kutta, which given an
approximation of y(t) at t = t_{n-1} (say y_{n-1}) provide a
technique for computing y_n ≈ y(t_n), are called one step
methods. More general K-step methods require the values of
the dependent variable y(t) and of its derivatives at K
different mesh points t_{n-1}, t_{n-2}, ..., t_{n-K} to approximate the
solution at time t_n.
The well known rules of "forward differentiation",
"backward differentiation" and "trapezoidal rule" are one
step methods. They will be automatically considered in the
following discussion as particular cases.
The general expression of a multistep (K-step) method
is

y_n = Σ_{i=1}^{K} (a_i y_{n-i} + h β_i y'_{n-i}) + h β_0 y'_n,
h y'_n = h f(y_n, t_n)        (4.3.1)

If β_0 is equal to zero the method is explicit because it
provides an explicit way of computing y_n and h y'_n from the
values of y and its derivatives at preceding mesh points.
If β_0 is different from zero, then (4.3.1) defines an
implicit multistep method because it is in general a
nonlinear equation involving the function f(y_n, t_n) that must be
solved for the unknown y_n.
4.3.b) Derivation of the Methods
We have considered the Adams-Bashforth and Adams-
Moulton [75] methods which are respectively explicit and
implicit methods with

a_1 = 1,    a_i = 0 if i ≠ 1        (4.3.2)

Both of these methods can be obtained from the integral
relation

y(t_n) = y(t_{n-1}) + ∫_{t_{n-1}}^{t_n} f(y(t), t) dt        (4.3.3)

The integral is estimated by approximating f(y(t),t)
with an interpolating polynomial (for example, Newton's
backward difference formula) through a number of known
values at t = t_{n-1}, ..., t_{n-K} in the explicit case or
through the values at times t_n, t_{n-1}, ..., t_{n-K} for the
implicit case.

Therefore, for the explicit Adams-Bashforth case, the
equation (4.3.1) takes the form

y_n = y_{n-1} + h Σ_{i=1}^{K} β_i y'_{n-i}        (4.3.4a)

or with the equivalent representation in terms of finite
differences

y_n = y_{n-1} + h Σ_{j=0}^{K-1} γ_j ∇^j f(y(t_{n-1}), t_{n-1})        (4.3.4b)

where the operator ∇ is defined by

∇^j f_m = ∇^{j-1} f_m - ∇^{j-1} f_{m-1},    ∇^0 f_m = f_m        (4.3.5)
The values of y_n differ from the real solution y(t_n) by
a local truncation error which is of order (K + 1) in the
step size h:

Error_Adams-Bashforth ≈ γ_K h^(K+1) y^(K+1)(t)        (4.3.6)

The values of the γ_j and β_i coefficients are available
directly from the literature [75].
In the implicit Adams-Moulton case equation (4.3.1)
takes the form

y_n = y_{n-1} + h Σ_{i=0}^{K} β*_i y'_{n-i}        (4.3.7)

or with the equivalent representation in terms of backward
differences

y_n = y(t_{n-1}) + h Σ_{j=0}^{K} γ*_j ∇^j f(y(t_n), t_n)        (4.3.8)

where the ∇ operator has been defined in (4.3.5).

In (4.3.8) the value of y_n differs from y(t_n) by a
local truncation error that is of the order (K + 2) in the
step size h:

Error_Adams-Moulton ≈ γ*_{K+1} h^(K+2) y^(K+2)(t)        (4.3.9)

The γ*'s and β*'s coefficient values are available from the
literature [75]. In particular the one step Adams-Bashforth
method corresponds to the forward differentiation rule while
the zero and one step Adams-Moulton methods are the backward
differentiation and trapezoidal rules respectively.
4.3.c) Characteristics of Multistep Methods
An important characteristic of multistep methods is
that they require only one computation of the derivative for
each step as can be seen from equation (4.3.1). This is a
great advantage over the Runge-Kutta method that requires at
least four computations of the functions f(y(t),t) and it
has been the motivation for our experimentation with
multistep methods.
Another feature of multistep methods is that they allow
for a rather efficient implementation of automatic control
of the step size and order of the method itself.
A complete treatment of this subject would require too
long a discussion. An intuitive explanation can be
obtained by observing that the step size and order selection

require an estimate of the local truncation error in terms
of different step sizes and integration orders. The order
that allows for the largest step size compatible with the
user defined upper limit for the truncation error is then
selected.

The local truncation error in the Adams-Bashforth and
Adams-Moulton methods (see (4.3.6) and (4.3.9)) is related
to high order derivatives which can be easily obtained in
terms of the same backward differences

∇^K h f(y(t_n), t_n) = ∇^K h y'_n ≈ h^(K+1) y^(K+1) + ...

that are used in the implementation of the method itself (see
(4.3.4) and (4.3.8)). A comparison between the explicit and
implicit multistep methods is necessary to complete this
discussion.
One difference between the methods is that the γ*
coefficients of the implicit methods are smaller than the γ
coefficients of the explicit methods. This leads to smaller
truncation errors for the same order for the implicit case
(see (4.3.6) and (4.3.9)).
Another advantage of the implicit methods is that
K-step methods have a truncation error of order (K+2) in the
step size h (see (4.3.9)), to be compared with a truncation
error of order (K+1) for the explicit method (see
(4.3.6)). The reason for this fact is evident when we
consider that (4.3.7) has (K+1) coefficients β*_i, i = 0, ..., K,
while in the explicit method (4.3.4a) β_0 has been set to
zero.
A disadvantage of implicit methods is that the
nonlinear equation (4.3.7) in the unknown y_n must be solved
iteratively. Usually a first "guess" of y_n is obtained by
means of an explicit method and then (4.3.7) is iterated.
However, for reasonable values of the step size h, no more
than two or three iterations are usually required, and this
extra effort is more than compensated for by the better
stability properties of the implicit methods. In fact with
respect to the "test" equation y' = λy, the range of h
values for which implicit methods are stable is at least one
order of magnitude greater than in the explicit case [75].
Since the truncation errors of implicit methods are smaller,
the implicit methods can be used with a step size that is
several times larger than that of the explicit method. The
allowed increase in step size more than offsets the
additional effort of performing 2 or 3 iterations.
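As an illustration of these mechanics, the sketch below combines a 2-step Adams-Bashforth predictor with the trapezoidal rule (the one step Adams-Moulton method) as corrector, iterated the usual two or three times. The particular orders are chosen for brevity and are not the ones used in our experiments.

def pece_step(f, y, yp_prev, t, h, n_corr=2):
    """One predictor-corrector step for y' = f(y, t); yp_prev is the
    derivative stored from the previous mesh point."""
    yp = f(y, t)                                  # derivative at t_n
    y_next = y + h * (1.5 * yp - 0.5 * yp_prev)   # explicit AB2 prediction
    for _ in range(n_corr):                       # implicit corrector passes
        y_next = y + 0.5 * h * (yp + f(y_next, t + h))
    return y_next, yp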
4.4) Method Selection
We have implemented the numerical simulation of the
acoustic model of the vocal cavities by means of both Runge-
Kutta and implicit multistep methods [77]. The Runge-Kutta

method runs about 2 to 3 times faster than the Adams-Moulton
method. This fact is at first rather surprising since we
have seen in the previous two sections that the Runge-Kutta
method requires a larger number of derivative evaluations
for each integration step.
However, in our case the evaluation of the derivatives
is not extremely time consuming. In fact, with the
exception of four state variables describing the motion of
the two mass models, the remaining system of differential
equations is essentially "uncoupled" (or characterized by a
"sparse matrix"), thanks to the "chain" structure of the
vocal cavity acoustic model that can be immediately observed
from Figure 3.9. In these conditions the high number of
derivative evaluations of the Runge-Kutta method is more
than compensated for by the limited overhead in comparison
with the Adams predictor-corrector method [75].
We have modified the integration method to take
advantage of the dynamic properties of the vibrating walls.
From Figure 3.8, which represents the equivalent
circuit of the ith elemental section of the vocal tract as
discussed in Chapter 3, we have
dp_i(t)/dt = (1/C_i) (u_i(t) - u_{i+1}(t) - u_{w_i}(t))        (4.4.1)

du_i(t)/dt = (1/L_i) (p_{i-1}(t) - p_i(t) - VN_i - RN_i u_i(t))        (4.4.2)

du_{w_i}(t)/dt = (1/LW_i) (p_i(t) - v_{w_i}(t) - RW_i u_{w_i}(t))        (4.4.3)

dv_{w_i}(t)/dt = (1/CW_i) u_{w_i}(t)        (4.4.4)
The first two equations represent the pressure and
volume velocity propagation in the ith elemental length of
the vocal tract while (4.4.3) and (4.4.4) model the wall
vibration effects.
The dynamics of u_{w_i}(t) and v_{w_i}(t) in (4.4.3) and
(4.4.4) are characterized by the time constants (see
Section 3.1.b)

LW_i / RW_i = 1/(130 π),    √(CW_i LW_i) = 1/(2 π · 30)

which are very large with respect to the time of the wave
propagation in each elemental length of the vocal tract.

In fact, if we divide the vocal tract into 20 elemental
sections of approximately 0.875 cm each, the time of wave
propagation in each is 2.5·10^-5 sec. This time gives the
order of magnitude of the largest step size for the
integration of equations (4.4.1) and (4.4.2) which, in fact,
is usually achieved with variable step sizes between
2.5·10^-5 and 1.25·10^-5 sec.
On the other hand equations (4.4.3) and (4.4.4) may
employ a larger integration step.
We, therefore, integrate equations (4.4.1) and (4.4.2)
together with the two mass model equations using a Runge-
Kutta method with variable control of the step size and
assuming u_{w_i}(t) in (4.4.1) constant during this procedure.
Every 5·10^-5 seconds, i.e., at a frequency of 20 kHz, we
update the values of u_{w_i}(t) and v_{w_i}(t) by means of a simple
backward differentiation rule based on equations (4.4.3) and
(4.4.4).
At this time we also update the turbulent noise source
VN_i and the resistance RN_i according to the Reynolds number
of the flow in the cavity, as explained in Section 3.2.c, to
provide a fricative excitation.
In this way we halve the number of derivatives that
must be computed by the Runge-Kutta method to account for
vocal tract propagation and we save about 50% of the
integration time.

However, the numerical procedure is still correct, as
we will show in Chapter 6 where several effects associated
with cavity wall vibration are simulated.
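The two-rate procedure can be summarized in the following structural sketch; integrate_rk4_adaptive and p_of are assumed helper routines (the adaptive Runge-Kutta driver and an accessor for the pressure of section i), and the wall update is the semi-implicit backward Euler form of (4.4.3) and (4.4.4).

T_WALL = 5.0e-5    # wall update interval: 20 kHz

def simulate(acoustic_rhs, state, uw, vw, LW, RW, CW, p_of, t_end):
    t = 0.0
    while t < t_end:
        # fast loop: variable-step RK4 over one wall-update interval,
        # treating uw as a constant source term in (4.4.1)
        state = integrate_rk4_adaptive(acoustic_rhs, state, t, T_WALL, uw)
        t += T_WALL
        # slow loop: backward Euler for the wall equations, per section
        for i in range(len(uw)):
            p_i = p_of(state, i)
            uw[i] = ((uw[i] + (T_WALL / LW[i]) * (p_i - vw[i]))
                     / (1.0 + T_WALL * RW[i] / LW[i]))
            vw[i] = vw[i] + (T_WALL / CW[i]) * uw[i]
    return state, uw, vw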

CHAPTER 5
THE ARTICULATORY MODEL AND ITS
INTERACTIVE GRAPHIC IMPLEMENTATION
The acoustic characteristics of the vocal cavities,
that we have modeled by means of an equivalent circuit in
Chapter 3, are greatly affected by the geometrical
configuration of the vocal cavities themselves. Therefore,
the acoustic and perceptual properties of speech depend on
the position of the lips, tongue, jaws and of the other
vocal organs that determine the shape of the vocal tract.
The physiological mechanism of speech production or
"articulation" involves precisely timed movements of the
vocal organs to produce the acoustic wave that we perceive
as connected speech.
This chapter is concerned with the definition and
implementation of a geometric or "articulatory" model of the
vocal tract that can be used to describe the configuration
of the vocal cavities during speech production.
5.1) Definition of the Articulatory Model
All the articulatory models presented in the literature
[78-81] are two dimensional representations of the vocal
cavities which closely match the midsagittal section of the
vocal tract, even if they do not resolve individual
muscles. The articulatory models that have been designed by
Coker [78] and Mermelstein [79], are probably the most
suitable for speech synthesis applications, since their
configuration is determined by a small number of control
parameters.
Figure 5.1 shows the articulatory model that has been
designed by Mermelstein. We can distinguish between a fixed
and a movable structure of the model. The fixed structure
consists of the pharyngeal wall (segments GS and SR in
Figure 5.1), the soft palate (arc VM), the hard palate (arc
MN), and the alveolar ridge (segment NV).
The configuration of the movable structure is
determined by external control parameters, that we call
articulatory parameters and that are represented in
Figure 5.1 by arrows. For example, the tongue body is drawn
as the arc of a circle (PQ in Figure 5.1) whose position is
determined by the coordinates x and y of its center. Other
parameters are used to control the location of the tip of
the tongue, of the jaws, of the velum, of the hyoid bone and
the lip protrusion and width.
Mermelstein has shown that this model can match very
closely the midsagittal X-ray tracings of the vocal tract
that have been observed during speech production [82]. The
model of Figure 5.1 can therefore be used for speech

Figure 5.1. Articulatory model of the vocal cavities.

synthesis if we can estimate the cross-sectional area of the
vocal tract (area function). The area function, in fact,
can be used to derive the equivalent circuit of the vocal
cavities, as discussed in Chapter 3.
In practical terms we superimpose a grid system, as
shown in Figure 5.2, on the articulatory model to obtain the
midsagittal dimensions of the vocal cavities at different
points, and then we convert this information into cross-
sectional area values by means of analytical relationships
defined in the literature [79].

We use a variable grid system, dependent on the tongue
body position, to make sure that each grid in Figure 5.2 is
always "almost" orthogonal to the vocal tract center line
regardless of the model configuration; to further correct
unavoidable misalignments of each grid, we multiply the
cross-sectional area estimate by the cosine of the angle α
(see Figure 5.2).
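In code, the conversion amounts to a region-dependent power law with the cosine correction applied along each grid line. The sketch below is illustrative only: Mermelstein [79] tabulates coefficients of the form A = α d^β for each region of the vocal tract, and the numeric values used here are placeholders, not his.

import math

# placeholder (alpha, beta) pairs per region: assumed values
COEFFS = {"pharynx": (1.4, 1.5), "oral": (1.6, 1.5), "labial": (1.0, 1.5)}

def cross_section(d, region, alpha_angle=0.0):
    """Convert a midsagittal distance d (cm) measured along one grid line
    into a cross-sectional area (cm^2), correcting for the angle between
    the grid line and the normal to the tract center line."""
    a, b = COEFFS[region]
    return a * (d ** b) * math.cos(alpha_angle)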
5.2) The Graphic Editor
Our computer implementation of the articulatory model
of the vocal cavities has been designed to be fast and easy
to use.
Traditionally, human-computer interaction employs
textual (alphanumeric) communication via on-line keyboard
terminals. This approach is satisfactory for many

applications but is being replaced by menu selection,
joystick cursors, light pens, touch sensitive terminals, and
other devices.

Figure 5.2. Grid system for the conversion of midsagittal
dimensions to cross-sectional area values.
The conventional keyboard entry method is particularly
cumbersome if the data structure is not easily manipulated
via an alphanumeric selection process. Such an example
arises with pictorial or graphic images, as in computer
aided design. Here the user may communicate with the
computer by means of a graphic model. The system interprets
the model, evaluates its properties and characteristics, and
recognizes the user's changes to the model. The results are
presented graphically to the operator for further
interactive design and test.
Using a similar approach, we have implemented an
"interactive graphic editor" on a Tektronix 4113 graphics
terminal interfaced to a DGC Eclipse S/130 minicomputer; the
editor is used to manipulate the articulatory model.
The user may alter the configuration of the model by
means of a simple interaction. Each articulatory parameter
of Figure 5.1, and the corresponding vocal organ's position,
can be set to the desired value by means of the graphic
cursor of the Tektronix 4113 terminal. This allows a rapid
definition of the desired articulatory configuration.
But the power of an interactive graphic system lies in
its ability to extract relevant information from the model

for further analysis and processing. The acoustic
properties of the vocal tract are determined, as discussed
in Chapter 3, by its area function, which is the cross-
sectional area of the cavity, as a function of the distance,
x, from the larynx.
When the user has defined a new articulatory parameter
value by means of the cross-hair cursor, the system
estimates the area function of the vocal tract by means of
the grid system of Figure 5.2. The resonance or formant
frequencies are also estimated. This information is
immediately displayed (see Figure 5.3) for the user as a
first order approximation of the acoustic properties of the
graphic model of the vocal tract.
The interaction cycle is shown in Figure 5.4. Commands
are available not only to modify the displayed vocal tract
shape but also to store it on and read it from disk memory.
These commands are useful not only to generate a "data base"
of vocal tract configurations, but also to create backup
files before using the interactive graphic commands.
But the interaction depicted in Figure 5.4 is not
sufficient to define a specific articulatory pattern. In
fact, the articulation of connected speech is a dynamic
process which consists of precisely timed movements of the
vocal organs. We have introduced the temporal dimension in
the system by means of an animation frame technique.

Figure 5.3. The articulatory model implemented on the
Tektronix 4113 graphic terminal.

Figure 5.4. Interaction cycle for the generation of
an animated articulatory pattern:
(a) sign-on time, (b) interaction cycle,
(c) Store With Time command.

When the user signs on (Figure 5.4a), he is asked to
indicate the name of the articulatory pattern that he wishes
to edit. The user may, at this point, create a new file or
modify an existing file. The user then interacts with the
model (Figure 5.4b) until he has achieved the desired vocal
tract configuration.
Next, a "Store With Time" (SWT) command is issued
(Figure 5.4c). This attaches a time label to the displayed
vocal tract configuration, which is also memorized as a
"frame" of the articulatory pattern that the user has
indicated at sign-on. Another interaction cycle is then
entered, which will lead to the definition of another frame.
The frames defined by the SWT command appear as
"targets" which must be reached at the indicated time. The
articulatory model is guided between consecutive targets by
means of an interpolation algorithm to achieve smooth
transitions. This is particularly important when the
articulatory model of the vocal cavities is interfaced
with the speech synthesizer, since a continuous variation of
the synthesizer input parameters is required to obtain good
quality speech.
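The interpolation between successive SWT targets can be sketched as below. Linear interpolation of the articulatory parameter vectors is assumed here for simplicity; the actual smoothing algorithm is not constrained to this choice.

def interpolate_frames(frames, t):
    """frames: list of (time, parameter_vector) pairs, sorted by time,
    as produced by successive Store With Time commands."""
    if t <= frames[0][0]:
        return frames[0][1]
    for (t0, p0), (t1, p1) in zip(frames, frames[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)        # blend weight between targets
            return [(1.0 - w) * a + w * b for a, b in zip(p0, p1)]
    return frames[-1][1]                    # past the last target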
5.3) Display of the Time Variations of the Model
The interaction cycle described above generates an
articulatory pattern by means of an animation frame

technique, which is used as an input to the speech
synthesizer. However, during the interaction cycle, only
the particular frame being manipulated is visible on the
terminal display. Consequently, the user has difficulty
visualizing the global time varying characteristics of the
articulatory pattern. To overcome this disadvantage, we use
an on-line animation of the model.
The animation frames are computed by means of
interpolation and stored in the memory of the 4113 terminal
as graphic segments. Then each frame is briefly displayed
in sequence, creating the animation effect.
Figure 5.5 shows a typical animation frame. Here the
contour filling capability of the terminal is not used,
allowing a higher display frequency. Using this technique, we
are able to obtain a live animation effect with only a
slight flickering phenomenon. The maximum frame display
frequency is about 5 Hz.
We may also view the movements of the vocal organs,
defined with the graphic editor, in three dimensions, as may
be seen in Figure 5.6. This effect is achieved by using
many consecutive animation frames as sections of a three
dimensional object, with the third dimension being time.
The advantage of this technique is that the time
evolution of the model can be observed at a glance;
moreover, a three dimensional rigid rotation allows the user

Figure 5.5. A typical animation frame.

Figure 5.6. Three dimensional views of the vocal tract.

to choose the most convenient view angle. Different colors
(one every five frames) are used to mark specific time
events.
Figure 5.6 also shows that the hidden lines have been
removed. This is achieved very efficiently by means of the
contour filling capability of the terminal. In this 3-D
representation all the frames belong to planes parallel to
each other. It is, therefore, very simple to determine, for
a given angle of rotation, which frame is in "front" and
which one is "behind". To remove the hidden lines the
contour capability of the terminal is used with the "ink
eradicator" color, starting from the frame which is the
farthest from the observer.
5.4) Simultaneous and Animated Display of the Articulatory
and Acoustic Characteristics of the Vocal Cavities
When a certain articulatory pattern has been edited
with the interactive graphic model, we may estimate the
corresponding acoustic events, e.g., the speech waveform,
the pressure and air volume-velocity distribution in the
vocal cavities, the motion of the vocal cords, the vibration
of the cavity walls, etc. by means of the acoustic model
defined in Chapter 3.
Figure 5.7 shows the final configuration of the
system. The articulatory or "muscular" events generated by

Figure 5.7. System configuration. The articulatory events
generated by the graphic editor are input to the
acoustic model of the vocal cavities; both
articulatory and acoustic events are later
displayed with a computer generated movie.

the graphic model are the (off-line) input to the
synthesizer, which computes the corresponding acoustic
waveforms. Later a simultaneous and animated representation
of the articulatory and acoustic events is displayed on the
Tektronix 4113 terminal.
Figure 5.8 illustrates a typical frame of a sample
animation. The figure below the vocal tract model
.represents the two vocal cords. Each cord (left and right)
is schematically represented by two masses as proposed by
Flanagan [67]. The graphs on the right part of the screen
represent different acoustic events over the same time
interval.
Both the vocal tract and vocal cord models are
animated. During the animation a sliding green line runs
along the borders of the graphs to "mark the time". This
assists the viewer in relating the information displayed in
the graphs to the vocal cord and the vocal tract motion. A
live animation effect is obtained in this manner.

Figure 5.8. A typical animation frame including the
vocal tract and the vocal cords. The
various data waveforms calculated by
the model are also shown.

CHAPTER 6
SIMULATION RESULTS
6.1) Speech Synthesis
As explained in Chapter 2, linear prediction and
formant synthesis are based on a rather approximate model of
speech production. However, the quality of the synthetic
speech may be very good because the synthesis algorithm uses
the information derived from an analysis phase of natural
speech that captures the most important perceptual features
of the speech waveform.
On the contrary, articulatory synthesis employs a more
detailed model of the human speech production mechanism, but
cannot exploit a good analysis algorithm to derive the
articulatory information directly from natural speech.
In this section we want to show that the amount of
physiological detail captured by our articulatory and
acoustic models of the vocal cavities is sufficient for the
generation of good quality English sentences.
In fact, Figure 6.1 shows the spectrogram of the
sentence "Goodbye Bob" that we have synthesized with our
computer programs. The quality of this sample compares
favorably with that of other synthesis techniques.

Figure 6.2. Spectrogram of "Goodbye Bob", natural.
As the first step of the synthesis procedure we should
obtain the spectrogram of natural speech, which is shown in
Figure 6.2. We do not attempt to faithfully match the
synthetic spectrogram with its natural counterpart.
However, the natural spectrogram is useful to obtain a good
estimate of the required duration of each segment of the
synthetic sentence.
The articulatory information is obtained, in a rather
heuristic way, from phonetic considerations and from X-ray
data available in the literature [22] [82-83]. For example,
we know that a labial closure is required for the production
of the /b/ and /p/ consonants, or that the tongue position must
be "low" and "back" for the production of the /a/ sound.
Using this linguistic knowledge, we can therefore use
the "graphic editor" described in Section 5.2 to define the
articulatory configurations that are necessary to synthesize
the desired sentence.
As described in Sections 3.2.a and 3.2.b, the subglottal
and vocal cord models are controlled by three parameters:
the glottal neutral area A_g0, the cord tension Q, and the
subglottal pressure P_s.

We set the glottal neutral area to 0.5 cm^2 or 0.05 cm^2
for the generation of unvoiced or voiced synthetic speech
respectively. The values of the cord tension and subglottal
pressure can be estimated after a pitch and short time
energy analysis of natural speech [67] [70].

The procedure for the definition of the time evolution
of the articulatory model that we have described above is,
however, rather "heuristic". After a first trial,
adjustments of the articulatory model configuration are
usually necessary to improve the quality of the synthetic
speech.
In our opinion a development of this research should be
the definition of vocal tract evolution for different
English allophones, as a first step toward an automatic
speech synthesis by rule system based on an articulatory
model. The solution of this problem is not at all
trivial. Section 6.7 illustrates the difficulties and
reviews part of the literature related to this subject.
6.2) Source Tract Interaction
The classical source-tract speech production model that
we have discussed in Section 2.1 is based on the assumption
that the glottal volume velocity during speech production is
independent of the acoustic properties of the vocal tract.
Evidently this source-tract separability assumption holds
only as a first order approximation. In fact the glottal
volume velocity depends on the transglottal pressure that is
related to the subglottal and vocal tract pressure.
The effects of source-tract interaction have been
modeled and analyzed by Guerin [44], Rothenberg [42], and
Ananthapadmanabha and Fant [43]. Yea [34] has carried out a
perceptual investigation of source-tract interaction using
Guerin's model to provide the excitation of a formant
synthesizer [16].
Source tract interaction, which is well represented by
the model discussed in Chapter 3, can be discussed with
reference to Figure 6.3. The glottal volume velocity U_G
depends not only on the subglottal pressure P_s and on the
glottal impedance Z_G, which varies during the glottal
cycle, but also on the vocal tract input impedance Z_in.
Source-tract separability holds only if the magnitude of Z_in
is much smaller than the magnitude of Z_G, since in this case
Z_G and P_s are equivalent to an ideal current generator.
Therefore the amount of source tract interaction depends on
the magnitude of Z_G with respect to Z_in.
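A toy frequency-domain reading of Figure 6.3 makes the separability condition concrete: treating P_s, Z_G and Z_in as single complex quantities, the glottal flow is U_G = P_s / (Z_G + Z_in), so scaling Z_G up or down (as in the experiment described next) shows how closely U_G approaches the ideal current-source value P_s / Z_G. The numbers below are illustrative, not values from the synthesizer.

def glottal_flow(p_s, z_g, z_in, k=1.0):
    # k scales the glottal impedance: k > 1 weakens interaction,
    # k < 1 strengthens it
    return p_s / (k * z_g + z_in)

p_s, z_g, z_in = 8000.0, 50.0 + 0j, 10.0 + 5j    # illustrative magnitudes
for k in (0.5, 1.0, 2.0):
    print(k, abs(glottal_flow(p_s, z_g, z_in, k)))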
We have experimented with different amounts of source
tract interaction using the following procedure.
At first we have synthesized the word "goodbye" using
our model of the vocal tract and of the vocal cords. The
obtained glottal volume velocity is shown in the middle part
of Figure 6.4.
Then, to reduce source tract interaction, we have used
the same vocal tract configuration, but we have multiplied
the glottal impedance by a factor of two throughout the
entire synthesis of the word "goodbye". We have, therefore,
obtained the glottal waveform in the bottom of Figure 6.4.

Figure 6.3. Source-tract interaction model. P_s subglottal
pressure, U_G glottal volume velocity, Z_G
glottal impedance.

Figure 6.4. Three different glottal excitations.

Figure 6.5. Spectrogram of "good bye", synthetic.
To increase source tract interaction (see top part of
Figure 6.4) we have reduced the glottal impedance by a
factor of two.
The glottal pulse with the greatest source-tract
interaction is characterized by a more pronounced first
formant ripple during the rising slope and by a steeper
final slope that increases the higher frequency components
of the glottal excitation [42].
The spectra of the glottal waveforms show that the
greatest source-tract interaction determines an increase of
about 8 dB around 3 kHz. This supports the fact that
source-tract interaction can be used (for example by
singers) to modify the voice quality [45], [84].
The spectrogram of the synthetic "goodbye" obtained
with the "just right" source tract interaction is shown in
Figure 6.5.
The perceptual quality of the synthetic speech with
"too much", "just right" and "too little" source tract
interaction is judged to be only slightly different.
6.3) Onset Spectra of Voiced Stops
In the recent literature there has been considerable
interest in the correlation between place of articulation of

stop consonants and the acoustic properties at consonantal
release.
Kewley-Port [85] and Searle et al. [86] have used a
speech signal transformation based on peripheral auditory
filters, approximated by analog 1/3-octave filters, to obtain
a three dimensional running spectra display which is used
for speech recognition.
Stevens [87] and Stevens and Blumstein [17], in their
investigation of onset spectra of stop consonant release
have determined characteristic patterns for each place of
articulation (labial, alveolar, and velar). They have also
tested the perceptual relevance of their hypothesis with a
formant synthesizer.
Here we discuss the spectral characteristics of
consonant release using our articulatory synthesizer. The
formant spectrum of vowels is completely determined by the
formant frequencies and bandwidths [16]. For example, with
a uniform vocal tract we typically have equally spaced
formants at 500, 1500, 2500, 3500 Hz and all of them have
the same spectral peak values if their bandwidths are the
same.
When the vocal tract becomes constricted, as occurs
during the production of stops, the first formant frequency
is always lowered while the behavior of the other formants
depends on the position of the constriction. For a labial

consonant the second and higher formant frequencies are
lowered, which also determines a decrease of the higher order
formant peak amplitudes. An alveolar constriction, instead,
shifts the second formant toward higher frequencies which in
turn increases the spectral peak of the higher order
formants. A velar constriction divides the vocal tract into
a front and back cavity of similar volumes. The second and
third formants are therefore brought closer to each other
causing a spectral peak in the mid frequency range around
2 KHz.
These spectral characteristics of vocal tract closure
are further enhanced by the fricative excitation that may
occur at consonant release.
For example, at the release of the velar consonant /g/,
the fricative noise enhances the spectral peak at 2000 Hz
that is associated with the front cavity (see Figure 6.6).
The noise burst of an alveolar (/d/) consonant emphasizes
the higher frequency components around 4 KHz that are
associated with the small front cavity.
The frication spectrum of the labial consonant /b/ is
less significant because there is no cavity in front of the
constriction and because the consonant release is very rapid
[88].
We see there are different spectral patterns for
various consonant releases, depending on the place of

articulation [17].

Figure 6.6. Above: synthetic speech waveforms at the
release of /g/ with fricative (left) and
without fricative (right) excitation.
Below: spectra at the release of /g/
with fricative (solid line) and without
fricative (dashed line) excitation.

Figure 6.7 shows the spectra of /b/, /d/, /g/ of our
synthetic speech using a 256 ms window after consonant
release.
The /g/ spectrum is characterized by an energy
concentration in the 2 KHz range. The /d/ spectrum has a
rising characteristic with maxima around 4 KHz.
The /b/ spectrum, instead, is more vowel-like, with a
globally flat trend over the speech frequency range.
6.4) Glottal Inverse Filtering of Speech
The problem of estimating the glottal volume velocity
directly from the speech waveform has interested speech
researchers for a number of years [89-90].
Here we consider the algorithm of least square inverse
filtering of speech [90] that is based on the linear
prediction model of speech production.
We will show that even if this model does not account
for the vibration of the vocal tract walls, nevertheless the
inverse filtering algorithm is capable of giving a correct
estimate of the glottal volume velocity. This is not an
obvious result since the volume velocity that is displaced
by the wall vibration is not small compared to the glottal
volume velocity (see Figure 6.8) and it is present also
during the closed phase.

Figure 6.7. Onset spectra of /B/, /D/, /G/ followed by /A/.

Figure 6.8. Glottal volume velocity (dashed line) and
air volume velocity displaced by the
vibrating walls of the vocal tract (solid
line) for vowels /a/ (above) and /i/
(below).

The possibility that vibrating walls could influence
the glottal volume velocity estimate has also been
hypothesized by Titze (see discussion in [91]).
Figure 6.9 shows the speech production model on which
the algorithm is based. The mouth volume velocity u(n) and
the glottal waveforms g(n) are related by an all pole
transfer function that models the vocal tract
U(z)/G(z) = T(z) = 1 / (1 + Σ_{i=1}^{K} a_i z^-i)        (6.4.1)

or in the time domain

u(n) + a_1 u(n-1) + ... + a_K u(n-K) = g(n)        (6.4.2)
The sound pressure s(n), which is the data supplied for the
inverse filtering algorithm, is equal to the derivative of
the mouth volume velocity
S(z)/U(z) = R(z) = 1 - z^-1        (6.4.3)

or equivalently

s(n) = u(n) - u(n-1)        (6.4.4)

Figure 6.9. Speech production model for glottal least
squares inverse filtering.

The problem is therefore to estimate the transfer function
T(z), i.e., the predictor coefficients a_i. When the
estimate T_est(z) ≈ T(z) is known, the glottal volume
velocity can be estimated by integrating the data s(n) to
recover the mouth volume velocity u(n) and then by inverse
filtering with T_est(z) to give

G_est(z) = S(z) / (T_est(z) (1 - z^-1))        (6.4.5)
The vocal tract transfer function can be deduced from the
analysis of the speech waveform during vocal folds closure
when the glottal volume velocity is equal to zero. Then
from (6.4.2) we have
u(n) + a_1 u(n-1) + ... + a_K u(n-K) = 0

and from (6.4.4) we get

s(n) + a_1 s(n-1) + ... + a_K s(n-K) = 0        (6.4.6)
Equation (6.4.6) shows that during the closed phase the
speech waveform is a freely decaying oscillation that is
determined by the predictor coefficients a_i and K initial
values of s(i). The parameters a_i can be exactly estimated
by means of an autocovariance analysis [30] performed during
the closed phase.
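The whole procedure can be sketched compactly with numpy. The function names and the framing as a least-squares solve are our own; the text specifies only that the a_i are obtained by covariance analysis over the closed phase and that the data are integrated before inverse filtering.

import numpy as np

def closed_phase_lpc(s, start, length, K):
    """Estimate a_1..a_K from a closed-phase stretch of the signal s by
    solving s(n) = -sum_i a_i s(n-i) in the least-squares sense over
    n = start..start+length-1 (start must exceed K)."""
    X = np.array([[s[n - i] for i in range(1, K + 1)]
                  for n in range(start, start + length)])
    b = -np.asarray(s[start:start + length], dtype=float)
    a, *_ = np.linalg.lstsq(X, b, rcond=None)
    return a

def inverse_filter(s, a):
    """Integrate the pressure data to recover u(n), then apply the
    inverse filter 1 + sum_i a_i z^-i to estimate the glottal flow."""
    u = np.cumsum(np.asarray(s, dtype=float))   # undo R(z) = 1 - z^-1
    g = u.copy()
    for i, ai in enumerate(a, start=1):
        g[i:] += ai * u[:-i]
    return g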
The above procedure is based on the speech production
model of Figure 6.9 that is theoretically correct for the
production of vowels if the cavity walls are assumed rigid.
However Figure 6.10 shows the result of the inverse
filtering procedure performed on our synthetic speech when
the yielding wall properties are modeled.
The recovered glottal volume velocity is compared with
the actual glottal volume velocity generated during speech
synthesis by the two mass models of the vocal cords.
We can see that the two waveforms match almost
perfectly. We conclude that the vibrations of the vocal
tract walls do not appreciably affect the inverse filtering
result, especially in comparison with other sources of error
that are present when inverse filtering is applied to real
speech, such as ambient room noise, low frequency bias, and
tape recorder distortion [90].
6.5) Simulation of Wall Vibration Effects
6.5.a) Vocal Cords Vibration During Closure
According to the myoelastic-aerodynamic theory of
phonation the vocal cords oscillate when
1) they are properly adducted and tensed, and

Figure 6.10. Above: glottal volume velocity simulated
by the acoustic model.
Below: estimate of the glottal volume
velocity obtained by inverse filtering
of synthetic speech.

2) a sufficient transglottal pressure and glottal air
flow is present.
Obviously the second condition is always met during the
production of vowels and sonorant consonants when the vocal
tract is vented to the atmosphere.
However, during a stop consonant, the closure of the
vocal tract blocks the air flow and causes the pressure drop
and volume velocity across the glottis to decrease. In
spite of this, vocal cord vibration is commonly observed
during the closure period of voiced stops [92].
Several factors may in fact contribute to vocal cord
oscillation during closure [93].
For example, the vocal cord tension could be decreased
to facilitate voicing. However there are no physiological
data showing that speakers make such an adjustment during
voiced stop closure.
To sustain an adequate transglottal pressure the
subglottal pressure could be increased or the nasopharyngeal
orifice could be partially opened during consonantal
closure. However, these possible voicing mechanisms appear
to be very unlikely in the presence of tracheal flow
measurements [94-95].
Another mechanism which would allow on-going
transglottal flow during vocal tract closure is a muscularly
activated enlargement of the supraglottal cavity. The

increased vocal cavity volume can in fact accommodate part
of the air volume velocity transmitted during the glottal
pulse and therefore it facilitates vocal cord vibration.
Cinefluorographic data supporting this hypothesis have been
reported by Perkell [82], Kent and Moll [96] and more
recently by Westbury [93].
For similar reasons the yielding properties of the
vocal tract walls may facilitate voicing [93]. As soon as
there is a pressure build-up behind the vocal tract
constriction, the volume of the vocal cavities is increased
and a certain amount of air can flow through the glottis to
"fill up" the extra volume.
We have confirmed this explanation with our computer
implementation of the vocal cords and vocal tract models.
Different yielding wall parameters corresponding to "tensed
cheeks", "relaxed cheeks" [57], and rigid walls have been
used.
Figure 6.11 shows the supraglottal pressure, vocal cord
motions and the glottal volume velocity before, during and
after closure with relaxed vocal cavity walls, and constant
subglottal pressure. Evidently during closure the average
value of the vocal tract pressure increases but still there
is enough pressure drop across the glottis to maintain a
reduced but sufficient glottal volume velocity for vocal
cord oscillation. It can also be observed that the

Figure 6.11. Synthetic supraglottal pressure, vocal cord
displacement and glottal volume velocity
with yielding wall simulation.

Figure 6.12. Synthetic supraglottal pressure, vocal cord
displacement and glottal volume velocity
with rigid wall simulation.

amplitude of oscillation of the vocal cords is slightly
increased during closure because of the larger pressure in
the tract.
If, in the simulation, the walls are assumed rigid (see
Figure 6.12) the pressure in the vocal tract during closure
soon reaches the subglottal pressure value and the glottal
volume velocity drops to zero.
The vocal cords, therefore, stop vibrating and they are
pushed apart by the subglottal and vocal tract pressure.
Vocal cord oscillation is resumed right after closure.
6.5.b) Formant Shift
An acoustic effect of the yielding properties of the
cavity walls is to increase the first formant frequency.
A simple rule which is given in the literature [55] is

F'_1 (Hz) = √(200^2 + F_1^2)

where F_1 is the formant frequency of the rigid wall vocal
tract and F'_1 is the formant frequency when the vibration
properties of the walls are accounted for.
Figure 6.13 shows the formant shift which can be
observed in our simulation when we use soft cavity walls.
The lower formant at 500 Hz is increased by about 50 Hz
which corresponds well to the formula above.
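For example, applying the rule to a rigid-wall first formant F_1 = 500 Hz gives F'_1 = √(200^2 + 500^2) ≈ 539 Hz, i.e., a predicted shift of roughly 40 Hz, of the same order as the shift observed in the simulation.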

Figure 6.13. Linear prediction spectra of synthetic /i/
with yielding cavity walls (solid line)
and rigid cavity walls (dashed line).

The small shift at higher formant frequencies shown in
this figure is attributable to the finite resolution of the
autocovariance analysis used to obtain the spectrum.
6.6) Pathology Simulation. Reduction of Sound Intensity
During Nasalization
Physiologically, nasal sounds are generated by lowering
the soft palate and opening the velopharyngeal orifice so
that the nasal cavity communicates with the oral tract
during speech production. The spectral characteristics of
nasal sounds are therefore determined by the acoustic
interaction of the oral and nasal cavities.
Nasal sounds are differentiated by opening or closing
the mouth, e.g., nasalized vowels vs. nasalized consonants
(nasal murmurs). Nasalization usually occurs in those
vowels in the proximity of a nasal consonant and sometimes,
if the nasal consonant is in an unstressed position, the
nasal murmur may disappear and the vowel nasalization
remains the only clue for the existence of a nasal phoneme.
Nasal murmurs are radiated exclusively from the
nostrils. The pharyngeal and nasal cavities constitute the
sound passage that is shunted by the oral cavity. The
formant structure of nasal murmurs is very different from
other speech sounds and it is characterized by the existence
of zeroes (antiformants) that are caused by the shunting
effect of the oral cavity.

Studies of nasal consonants with analog articulatory
synthesizers have been reported by House and Stevens [97],
Hecker [98] and Fant [22]. The latter reference uses data
taken directly from X-rays of the vocal system. Sweep-
frequency measurements of the nasal tract transfer function
have been performed by Fujimura and Lindquist [59].
Nasal murmurs appear to have typical spectral
characteristics associated with the nasal tract and
pharynx. Formants occur around 300 Hz, 1000 Hz, 2200 Hz,
3000 Hz. According to Fujimura [99] there is also a formant
between 1000 and 2000 Hz which is heavily dependent on the
oral tract configuration.
An increase in the coupling area of the mouth to the
pharyngeal cavity may shift the antiresonance associated
with the mouth cavity up to 1000 Hz which causes a
neutralization of the 1000 Hz nasal formant [22]. This can
be verified with the aid of our vocal tract synthesizer.
Figure 6.14 shows the spectra of two synthetic nasal
murmurs. The 1000 Hz formant disappears when the mouth
coupling area is increased with respect to the area of the
velar passage.
The formant bandwidths of nasal murmurs are comparable
to or larger than those observed for vowels. Both high and
low frequency components are in fact attenuated by the soft,
deeply convoluted walls of the nasal cavities, which provide
a high factor of viscous damping.

Figure 6.14. Spectra of two synthetic nasal murmurs.

We have modeled each elementary length of the nasal
tract with an equivalent circuit different from the one
discussed in Section 3.1.a for the oral tract. We modeled
the nasal cavity losses with resistors RL_i and RH_i (see
Figure 6.15) which determine the losses at low and high
frequencies respectively.

Their values are determined by the expressions

RH_i = 20000 L_i,    RL_i = 500 L_i

in order to match the formant bandwidths of the nasal
formants at 300 Hz and 2200 Hz that are typical of nasal
murmurs [99].
We have used this representation of the acoustic
properties of the vocal cavities to model the reduction of
the intensity level of vowel sounds when the velopharyngeal
orifice is not completely sealed.
Speech clinicians have noted a reduction in the speech
loudness level of speakers without adequate velopharyngeal
closure that could be caused either by a voluntary attempt
of the subject to minimize the obviousness of his
communication problem or by additional damping of the speech
signals that results from an increased coupling with the
nasal cavity [100].

Figure 6.15. Equivalent circuit of an elemental length
of the nasal tract.

This phenomenon has been investigated by several
authors, often with inconsistent results. Bernthal and
Beukelman [101] controlled the velopharyngeal orifice
condition in two normal subjects by means of prosthetic
appliances. To minimize speech intensity variations caused
by different vocal efforts, the subglottal pressure was also
monitored. The observed intensity level reduction using a
0.5 cm velar orifice ranged from 2 to 4 dB.
We have duplicated the same experiment using the model
of the nasal cavity acoustic losses just discussed, and we
have obtained an intensity reduction of 4 dB, which matches
the data reported by Bernthal.
Figure 6.16 shows the comparison of the speech sound
(vowel /a/) with and without nasal coupling from our
acoustic synthesizer.
6.7) Coarticulation
The relationship between articulation and the acoustic
properties of speech is the traditional field of study of
phoneticians and linguists. But during the last 20 years
this research has found applications in speech synthesis
[67] [78] and speech coding and transmission [54].
Unfortunately, not every aspect of speech articulation
has been fully understood. The problem is that connected
speech is perceived as a sequence of almost independent

Figure 6.16. Synthetic speech waveforms of non-nasalized
/a/ (solid line) and nasalized /a/ (dashed
line).

sound segments at the acoustic level; however, the segments
are not separately articulated [102]. Rather, they are
"coarticulated": neighboring segments overlap and affect
one another in various ways. This phenomenon is usually
denoted as coarticulation.
Coarticulation is a conceptualization of speech
behaviour that implies [103] discrete and invariant units
serving as input to the system of motor (muscular) control
and an eventual obscuration of the boundaries between units
at the articulatory or acoustic levels.
Coarticulation is bidirectional. Given a sequence of
speech segments ABC, if B exerts an influence on C we talk
about left to right or carry over coarticulation; if B
exerts an influence on A we talk about right to left or
anticipatory coarticulation. For example, both effects are
present in the word spoon (/spun/) in which the lip
protrusion characteristic of /u/ extends both to the right
and to the left [104].
Other examples of anticipatory coarticulation have been
reported by a number of authors. Tongue shape for vowels
appears to be influenced by the following vowel and
vice versa [80] [105-106]. Amerman et al. [107] observed the
jaw opening for an open vowel two consonants before the
vowel. Similarly [108], the velopharyngeal opening may
occur two vowels in advance of a nasal consonant.

Left to right coarticulation effects have been observed

by Ohman [105], Stevens et al. [109] and many other authors.
Kozhevnikov and Chistovich [110] tried to explain
coarticulation by speculating that consonant-vowel (CV) or
consonant-consonant-vowel (CCV) type syllables are the
minimum encoding units of speech that are transformed by the
articulatory system into articulatory gestures. This model
is not adequate to account for anticipatory effects observed
with nasal consonants (CVVN) [108].
Ohman [80] [105] describes the articulation of vowel-
consonant-vowel (V_1CV_2) utterances as a diphthongal
transition from the first to the second vowel with the
consonantal gesture for vocal tract closure superimposed on
it.
More generally, MacNeilage [111] assumed a phoneme
sized input unit, a system of articulatory targets and a
closed loop control system which tries to achieve the
relatively invariant motor goal from varying initial
positions.
Henke [112] and Moll and Daniloff [108] proposed
"feature based" models to explain the articulation
process. Each phoneme was characterized by a set of
articulatory features. Each feature is binary valued with
the possibility of a compatibility or "don't care"
condition. The assumption is that at every instant each
portion of the vocal apparatus is seeking a goal which is

selected among the articulatory features of the input
phonemic string by a scan-ahead mechanism. The
compatibility criterion allows a feature to be realized at
the articulatory level earlier than its parent phoneme if it
does not contradict the articulatory requirements of the
segment that is currently being produced. The advantage of
feature based models is their simplicity. The disadvantages
..are an incomplete modeling of the timing of speech
articulation and a too rigid definition of the compatibility
principle [103].
A different approach to the explanation of speech articulation has been proposed recently by Fowler [113]. His criticism of the above-mentioned theories, which he calls extrinsic timing theories, is that they exclude time from the representation in the talker's articulatory plan and instead propose that an utterance is given coherence in time only by its actualization. Fowler hypothesizes that coarticulation should not be treated as an adjustment of the characteristic properties of a segment to its neighbors, but should be viewed as the overlapping production (coproduction) of successive, continuous time segments.
Our articulatory synthesizer can be used to simulate
articulation patterns that correspond to different
coarticulation hypotheses.

We will now consider an example motivated by a
cinefluorographic study of the articulation of V1CV2 utterances reported by Thomas Gay [83], who observed that
the articulators that do not achieve consonantal closure
start the transition movements from the first to the second
vowel during the closure period of the intervocalic
consonant. This suggests that CV components of VCV
sequences may be organized as a basic unit. This hypothesis
is in agreement with the theory of Kozhevnikov and
Chistovich [110]. However, it does not match the
coarticulation model presented by Ohman [80].
According to Ohman the transition from the first to the
second vowel is essentially diphthongal, except for the
articulator effecting closure, and the onset time of all
articulator movements is before consonantal closure.
Therefore, Ohman's study implies an anticipatory effect from
the second to the first vowel of the utterance.
Using the articulatory data presented in [83], we have
simulated both hypotheses for the utterance /iba/. The
intervocalic /b/ consonant has been chosen because it does
not require the use of the tongue for consonantal closure.
The whole tongue is relatively free to coarticulate the /i/
to /a/ transition during the /b/ sound.
Figure 6.17 shows the spectrograms of the synthetic
utterances. The transition between the first vowel and

Figure 6.17. Spectrograms of /iba/ with the simulation of Ohman's coarticulation model (above) and Gay's hypothesis (below).

consonant is different for the two cases. However, this
difference turns out to be perceptually indistinguishable.
The most relevant acoustic difference between the two articulatory patterns appears to be the transition of the
second formant which starts ahead of consonantal closure
when "Ohman's" hypothesis is simulated. Since these two
samples of synthetic speech are perceptually equivalent, the
differences between the coarticulation patterns reported by
Ohman and Gay may be accounted for by differences between
speakers.
6.8) Interpretation of the EGG Data
With the Two Mass Model of the Vocal Cords
The relative inaccessibility of the larynx makes it
difficult to directly observe the vocal cord vibratory
motion in vivo. One must resort to various indirect
observation techniques such as ultrahigh speed photography,
photoglottography, ultrasound, electroglottography, X-ray
and inverse filtering of the acoustic speech waveform.
Among these techniques electroglottography offers the
advantages of being non-invasive, inexpensive and simple to
use.
The electroglottograph is essentially an impedance
measuring device. It consists of two electrodes that are
placed on the opposite sides of the larynx, and of an R-F
modulator and detector. The electroglottograph (EGG)

detects the impedance variations of the larynx caused by the
vibration of the vocal cords.
The physiological interpretation of the EGG signal has
been recently investigated with the aid of a synchronized
data base consisting of EGG data, ultrahigh speed laryngeal
films and speech waveforms that were recorded from different
speakers performing various types of phonation tasks
[114]. From the analysis of these data it was concluded
that the EGG signal is indicative of the lateral area of
contact between the vocal cords and of the pitch
periodicity. Figure 6.18 shows the EGG signal, the
differentiated EGG data and the glottal area recorded
simultaneously during phonation [114]. We can see that the
fastest variations of the EGG signal occur at the instants of vocal cord closure and opening. Figure 6.19 shows a
schematic model of the EGG signal [115]. The EGG waveform
has been simulated and interpreted also by Titze and Talkin
[116].
Here we shall show that the EGG data suggests a simple
modification of the two-mass model of the vocal cords.
At first we want to estimate the lateral contact of the
vocal folds with the aid of the two-mass model. We suppose
that the horizontal displacements y1 and y2 of the two masses are indicative of the displacements of the upper and lower edges of the vocal folds, as shown in Figure 6.20.

Figure 6.18. EGG signal, differentiated EGG and glottal area.
Figure 6.19. Schematic model of the EGG signal.
1-2 VOCAL FOLDS MAXIMALLY CLOSED. COMPLETE CLOSURE MAY NOT BE OBTAINED. FLAT PORTION IDEALIZED.
2-3 FOLDS PARTING, USUALLY FROM LOWER MARGINS TOWARD UPPER MARGINS.
3 WHEN THIS BREAK POINT IS PRESENT, THIS USUALLY CORRESPONDS TO FOLDS OPENING ALONG UPPER MARGIN.
3-4 UPPER FOLD MARGINS CONTINUE TO OPEN.
4-5 FOLDS APART, NO LATERAL CONTACT. IDEALIZED.
4-6 OPEN PHASE.
5-6 FOLDS CLOSING.
6 FOLD CLOSURE OCCURS ALONG LOWER OR CENTRAL MARGIN. COMPLETE CLOSURE MAY NOT OCCUR.
6-1 RAPID INCREASE IN VOCAL FOLD CONTACT.

Figure 6.20. Top and cross-sectional views of the vocal cords model.

Figure 6.21. Plastic collision between the vocal cords.

If we assume a plastic collision between the right and left vocal cords, the length of vertical contact $\Delta x$ of Figure 6.21 can be easily computed, and the contact area is estimated as

$$ A = L \, \Delta x $$
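For illustration, the following sketch (in Python, purely as a reading aid and not the implementation used in this study) computes this contact-area estimate; the linear edge profile between the two mass displacements and all parameter names are assumptions of the sketch.

```python
# Minimal sketch of the contact-area estimate A = L * dx from the
# two-mass model.  Assumption: the fold edge varies linearly in the
# vertical direction between the lower-mass displacement y1 and the
# upper-mass displacement y2, and a plastic collision holds the
# colliding portion at the midline.  T (vertical fold thickness) and
# L (fold length) are illustrative parameter names.

def contact_area(y1: float, y2: float, T: float, L: float) -> float:
    """Estimated lateral contact area between the two vocal folds."""
    if y1 >= 0.0 and y2 >= 0.0:      # folds apart: no contact
        return 0.0
    if y1 <= 0.0 and y2 <= 0.0:      # both edges past the midline
        dx = T                        # full vertical contact
    else:
        # one edge in contact: linear interpolation gives the vertical
        # extent dx over which the displacement is negative
        neg, pos = (y1, y2) if y1 < 0.0 else (y2, y1)
        dx = T * (-neg) / (pos - neg)
    return L * dx
```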
Figure 6.22 shows the EGG signal and its derivative, estimated with the aid of this model, together with the glottal area (opening area between the vocal cords).
This simple model of the EGG signal is a good indicator
of glottal opening and glottal closure but it does not
account for the different rates of variation at glottal
closure and glottal opening (compare with Figure 6.18).
An apparent limitation of the two mass model of the vocal cords that we have just used to account for the EGG signal is that it does not account for longitudinal or horizontal variations along the vocal cords. High speed films of the vocal cords show that there exists a phase difference along the length of the vocal cords during their vibration; therefore, during the closing (opening) phase, contact (opening) between the folds first occurs over a small portion of their length.
"In succeeding frames [of ultrahigh speed laryngeal
film] this contact (opening) proceeds zipper-like
along the length of the folds until the whole
glottis is closed (open). This behaviour is more

129
52 54 55 S3 53 52 4 55 53 78 72
MSEC
Figure 6.22. EGG signal, differentiated EGG signal and
glottal area estimated with the model of
Figure 6.20.

130
pronounced during the opening than closing phase."
[114].
We can model this behavior as in Figure 6.23, where the left and right vocal cords are not parallel but are "tilted" by an angle α, and the contact area is proportional to Δl.
Using this model modification, with α equal to 0.2 and 2.7 respectively, we obtain the EGG signal of Figure 6.24, which more closely represents the actual EGG sample of Figure 6.18. The observation of ultrahigh speed laryngeal films has therefore suggested a simple modification of the two mass model of the vocal cords to better represent the EGG data.

Figure 6.23. Top view of the modified model of the vocal cords.

Figure 6.24. EGG signal, differentiated EGG signal and glottal area estimated by means of the model of Figure 6.23.

CHAPTER 7
CONCLUSIONS
7.1) Summary
What is the amount of physiological detail captured by
the articulatory synthesis method? Can articulatory
synthesis generate high quality synthetic speech? Can it be
employed for the simulation of physiological and
pathological aspects of speech production?
A positive answer to these questions is the most
significant contribution of the research presented in this
dissertation.
The modeling and numerical simulation techniques used
in this study have been reported in Chapters 3 and 4. We
were able to guarantee the stability of the numerical
simulation method and to halve the number of differential
equations that must be solved for the simulation of the wave
propagation in the vocal tract (Chapter 4).
We also developed interactive graphic software
(Chapter 5) to provide us with a convenient interface for
the manipulation of the articulatory model of the vocal
cavities.
Chapter 6 uses the developed synthesis system for the
investigation of different aspects of speech processing
(Sections 6.1, 6.2, 6.3, 6.4) and speech physiology
(Sections 6.5, 6.6, 6.7, 6.8) that are reported in the
literature.
In particular we have been able to show that:
- Articulatory synthesis is capable of generating high
quality synthetic speech.
- Source tract interaction increases the high frequency
components of the sound source.
- The spectra of voiced stops at consonantal release
provide a cue for the recognition of the place of
articulation.
- The glottal least square inverse filtering of speech
based on the closed phase autocovariance analysis is
not significantly affected by the soft characteristics
of the vocal cavity walls.
- The yielding properties of the cavity walls are
greatly responsible for the vocal cord oscillation
during consonantal closure.
- The reduction of voiced intensity observed in human
subjects as a consequence of nasal-pharyngeal coupling
has been correctly simulated.

- We suggested a modification of the two mass model of
the vocal cords on the basis of EGG and ultra-high
speed laryngeal film observations.
7.2) Suggestions for Future Research
During this study we have identified two possible
directions for future research concerning the implementation
of the vocal cavity acoustic model and text to speech
synthesis with the articulatory model.
An important continuation of this research concerns the
definition of a more efficient algorithm to represent the
wave propagation in the vocal cavities. We are currently
investigating (see Appendix) a modification of the Kelly-Lochbaum algorithm [72] to account for the yielding characteristics of the vocal cavity walls and for the fricative self excitation of the vocal tract. A preliminary implementation runs an order of magnitude faster than the currently adopted Runge-Kutta method and is amenable to an array processor implementation.
Text to speech synthesis is a potential application of
articulatory synthesis. Perceptual research should be
devoted to the definition of an articulatory encoding method
of the English allophones. We have briefly reviewed the
problems concerned with this issue in Section 6.7. The
prosodic rules employed with other text to speech synthesis

methods could probably be used for articulatory text to
speech synthesis.
The availability of a faster implementation of the
vocal cavity acoustic model would greatly help this
research, since a great amount of experimentation based on a
trial and error procedure (see also Section 6.1) is probably
required.

APPENDIX
AN EFFICIENT ARTICULATORY SYNTHESIS ALGORITHM
FOR ARRAY PROCESSOR IMPLEMENTATION
The computational burden for articulatory synthesis is
a major limitation for real time applications such as text
to speech synthesis.
We describe a new algorithm for the simulation of the
acoustic characteristics of the vocal cavities that is
currently being implemented. This algorithm offers several
advantages.
Similar to Flanagan's model of the vocal tract, it can 1) accommodate the effects of yielding walls, 2) model the fricative self excitation of the vocal cavities, and 3) account for source tract interaction and for the radiation load. The losses in the vocal cavities are correctly modeled; they are not phenomenologically represented as, for example, in Kelly-Lochbaum's [72] or Mermelstein's [73] algorithms.
This algorithm has been designed as a modification of the Kelly-Lochbaum procedure, which models the wave propagation in
concatenated lossless tubes. Our algorithm runs an order of
magnitude faster than the currently employed Runge-Kutta
method. More importantly, this algorithm can easily take
advantage of parallel processing capabilities, as with array
processors, to further reduce the execution time.
This new procedure has been tested with the synthesis
of the sentence "Goodbye Bob". We have obtained the same
quality as with the more classical Runge-Kutta method. We
are now experimenting with nasal sounds and fricative
excitation.
The classical Kelly-Lochbaum algorithm, which we review in Section A.1, employs a system of concatenated lossless tubes to represent the wave propagation in the vocal tract. Section A.2 considers the algorithm
modifications that we have designed for a more realistic
model of the cavity wall characteristics and fricative
excitation. Section A.3 shows how the glottal termination,
the nasal coupling and the radiation load can be modeled
with our computational approach.
A.1) Wave Propagation in Concatenated Lossless Tubes
The wave propagation in a uniform lossless tube of cross-section $A_K$ can be represented by the superposition of a positive going pressure wave $p_K^+(t - x/c)$ and a negative going pressure wave $p_K^-(t + x/c)$, where $c$ is the sound velocity and $x$ is the displacement along the tube. More specifically, the pressure and volume velocity in the uniform pipe are

$$ p_K(x,t) = p_K^+(t - x/c) + p_K^-(t + x/c) $$
$$ u_K(x,t) = \frac{A_K}{\rho c}\left( p_K^+(t - x/c) - p_K^-(t + x/c) \right) \tag{A.1.1} $$

where $\rho$ is the air density and $Z_K = \rho c / A_K$ is the characteristic impedance of the pipe. Therefore, when we represent the vocal tract by a series of concatenated uniform lossless pipes, we may use (A.1.1) to model the wave propagation in the Kth section. The relationship between the traveling waves in adjacent tubes can be obtained by applying the physical principle that pressure and volume velocity must be continuous in both time and space everywhere in the system. This provides boundary conditions that can be applied at both ends of each tube. Consider Figure A.1, which shows the junction between the Kth and the (K+1)st pipes, with cross-sectional areas $A_K$ and $A_{K+1}$ respectively. Let $\Delta x$ be the length of each tube and $\tau = \Delta x / c$ the wave propagation time.
Applying the continuity conditions of pressure and volume velocity at the junction gives

$$ p_K(\Delta x, t) = p_{K+1}(0, t) $$
$$ u_K(\Delta x, t) = u_{K+1}(0, t) \tag{A.1.2} $$

Figure A.1. The Kth and (K+1)st tubes with positive and negative going pressure waves.

and substituting (A.1.1) into (A.1.2) gives, after some manipulations,

$$ \begin{bmatrix} p_K^-(t+\tau) \\ p_{K+1}^+(t) \end{bmatrix} = \begin{bmatrix} r_K & 1 - r_K \\ 1 + r_K & -r_K \end{bmatrix} \begin{bmatrix} p_K^+(t-\tau) \\ p_{K+1}^-(t) \end{bmatrix} \tag{A.1.3} $$

where

$$ r_K = \frac{A_K - A_{K+1}}{A_K + A_{K+1}} \tag{A.1.4} $$
Equations (A.1.3) and (A.1.4) can be graphically represented by Figure A.2. The factor $\tau$ represents the propagation delay of the positive going and negative going pressure waves in the uniform tubes. The junction between adjacent pipes is represented by four multiplications, as in the classical Kelly-Lochbaum algorithm.
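For the reader's convenience, a minimal sketch of this scattering operation follows (in Python, as a reading aid only; function and variable names are illustrative, not those of our implementation).

```python
# One Kelly-Lochbaum scattering junction, equations (A.1.3)-(A.1.4).
# pk_plus is p_K^+(t - tau) arriving from tube K, pk1_minus is
# p_{K+1}^-(t) arriving from tube K+1; the function returns the
# outgoing waves p_K^-(t + tau) and p_{K+1}^+(t).

def scatter(pk_plus: float, pk1_minus: float, A_k: float, A_k1: float):
    r = (A_k - A_k1) / (A_k + A_k1)           # reflection coefficient (A.1.4)
    pk_minus = r * pk_plus + (1.0 - r) * pk1_minus
    pk1_plus = (1.0 + r) * pk_plus - r * pk1_minus
    return pk_minus, pk1_plus                 # four multiplications per junction
```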
A.2) Modifications of Kelly-Lochbaum Algorithm
A.2.a) Fricative Excitation
Figure A.3 shows the lumped element equivalent circuit of two adjacent lossless pipes. Flanagan and Cherry [71] suggested a modification of this equivalent circuit to account for the fricative self-excitation of the vocal tract, as shown in Figure A.4.

Figure A.2. Graphical representation of Kelly-Lochbaum's algorithm.

Figure A.3. Lumped element equivalent circuit of two adjacent lossless tubes.

Figure A.4. Equivalent circuit to account for fricative excitation and its interpretation in terms of lossless tubes.
The resistor $R_K$ and the random voltage generator $V_K$ represent the turbulent losses and the turbulent excitation, respectively. Such an equivalent circuit represents the "junction" of two uniform pipes by means of a resistor $R_K$ and a voltage generator $V_K$, as shown in the lower part of Figure A.4. The continuity conditions (A.1.2) become

$$ p_K(\Delta x, t) = p_{K+1}(0, t) + R_K\, u_{K+1}(0, t) + V_K $$
$$ u_K(\Delta x, t) = u_{K+1}(0, t) \tag{A.2.1} $$

The substitution of (A.1.1) into (A.2.1) gives, after some manipulations, the following equations:

$$ \begin{bmatrix} p_K^-(t+\tau) \\ p_{K+1}^+(t) \end{bmatrix} = \begin{bmatrix} C_{K11} & C_{K12} \\ C_{K21} & C_{K22} \end{bmatrix} \begin{bmatrix} p_K^+(t-\tau) \\ p_{K+1}^-(t) \end{bmatrix} + \begin{bmatrix} V_{KM1} \\ V_{KM2} \end{bmatrix} \tag{A.2.2} $$

where

$$ D = A_K + A_{K+1} + \frac{R_K A_K A_{K+1}}{\rho c} $$
$$ C_{K11} = \frac{A_K - A_{K+1} + R_K A_K A_{K+1}/(\rho c)}{D}, \qquad C_{K12} = \frac{2 A_{K+1}}{D} $$
$$ C_{K21} = \frac{2 A_K}{D}, \qquad C_{K22} = \frac{A_{K+1} - A_K + R_K A_K A_{K+1}/(\rho c)}{D} $$
$$ V_{KM1} = \frac{A_{K+1} V_K}{D}, \qquad V_{KM2} = -\frac{A_K V_K}{D} $$

One can easily check that (A.2.2) becomes equivalent to (A.1.3) when $R_K$ and $V_K$ are equal to zero. The schematic representation of (A.2.2) is shown in Figure A.5.
Observe that the computation of the $C_K$ coefficients and the evaluation of (A.2.2) can be done in parallel for all the junctions between uniform pipes. Equation (A.2.2) is therefore efficiently implemented using an array processor.
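A vectorized sketch of one such parallel evaluation of (A.2.2) is given below (in NumPy rather than on an array processor; the array layout and all names are assumptions of the sketch).

```python
import numpy as np

# One data-parallel update of (A.2.2) for all junctions at once.
# A[k] is the area of tube k; R[j], V[j] are the turbulence resistance
# and random source of junction j (between tubes j and j+1);
# p_plus[k] = p_K^+(t - tau) and p_minus[k] = p_K^-(t); rho_c is the
# product of air density and sound velocity.  Names are illustrative.

def junction_update(p_plus, p_minus, A, R, V, rho_c):
    Ak, Ak1 = A[:-1], A[1:]                    # tubes K and K+1
    D = Ak + Ak1 + R * Ak * Ak1 / rho_c        # common denominator
    C11 = (Ak - Ak1 + R * Ak * Ak1 / rho_c) / D
    C12 = 2.0 * Ak1 / D
    C21 = 2.0 * Ak / D
    C22 = (Ak1 - Ak + R * Ak * Ak1 / rho_c) / D
    out_minus = C11 * p_plus[:-1] + C12 * p_minus[1:] + Ak1 * V / D
    out_plus = C21 * p_plus[:-1] + C22 * p_minus[1:] - Ak * V / D
    return out_minus, out_plus                 # p_K^-(t+tau), p_{K+1}^+(t)
```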
A.2.b) Yielding Wall Simulation
Given certain conditions in the acoustic pipes at time $t$, the relationships (A.2.2) are used to calculate the wave propagation at time $t + \tau$, as is schematically shown in Figure A.5. This calculation models the fricative excitation in lossless tubes. Here we introduce an additional computational step to account for the yielding wall effects during speech production.

Figure A.5. Graphic representation of equation (A.2.2) to account for fricative excitation.

Consider the partial differential equations (3.1.3), which we reproduce below and which account for the wave propagation in a pipe with area function $A(x)$ and a small time varying perturbation $\delta A(x,t)$:

$$ \frac{\partial p(x,t)}{\partial x} + \frac{\rho}{A(x)}\,\frac{\partial u(x,t)}{\partial t} = 0 $$
$$ \frac{\partial u(x,t)}{\partial x} + \frac{A(x)}{\rho c^2}\,\frac{\partial p(x,t)}{\partial t} + \frac{\partial (\delta A(x,t))}{\partial t} = 0 $$

Apply the above relationships to the Kth uniform tube of our model, where

$$ A(x) = A_K, \qquad \delta A(x,t) = \delta A_K(t). $$

This gives

$$ \frac{\partial p_K(x,t)}{\partial x} = -\frac{\rho}{A_K}\,\frac{\partial u_K(x,t)}{\partial t} $$
$$ \frac{\partial u_K(x,t)}{\partial x} = -\frac{A_K}{\rho c^2}\,\frac{\partial p_K(x,t)}{\partial t} - \frac{d(\delta A_K(t))}{dt} \tag{A.2.3} $$

Let $p_{Kr}(x,t)$ and $u_{Kr}(x,t)$ be the solution of (A.2.3) in the rigid wall case, when $\delta A_K(t)$ is equal to zero. Then (A.2.3) shows that the solution $p_K(x,t)$ and $u_K(x,t)$ can be easily obtained from the rigid wall solution:

$$ u_K(x,t) = u_{Kr}(x,t) $$
$$ p_K(x,t) = p_{Kr}(x,t) - \rho c^2\,\frac{\delta A_K(t)}{A_K} \tag{A.2.4} $$

Equation (A.2.4) could also be explained by intuitive and physical reasoning. Now it is useful to substitute (A.1.1) into (A.2.4) to obtain

$$ p_K^+(t - x/c) = p_{Kr}^+(t - x/c) - \frac{\rho c^2}{2}\,\frac{\delta A_K(t)}{A_K} $$
$$ p_K^-(t + x/c) = p_{Kr}^-(t + x/c) - \frac{\rho c^2}{2}\,\frac{\delta A_K(t)}{A_K} \tag{A.2.5} $$

Equations (A.2.5) suggest a simple modification of (A.2.2) to account for vocal wall vibrations. We can in fact add $-\rho c^2\,\delta A_K(\tau)/(2 A_K)$ and $-\rho c^2\,\delta A_{K+1}(\tau)/(2 A_{K+1})$ to $V_{KM1}$ and $V_{KM2}$ respectively for every evaluation of (A.2.2) (see also Figure A.5). We must realize that $\delta A_K(\tau)$ represents the variation of the cross-sectional area during the time interval $\tau$. Since the time constants of the yielding wall vibrations are much larger than the propagation time $\tau$, we can use the value $\delta A_K(\tau)$ that has been evaluated at the previous integration step.
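In code form, the wall correction amounts to one extra additive term per tube, as in the following sketch (Python, illustrative names; the constants rho and c must be supplied in consistent units):

```python
# Yielding-wall correction of (A.2.5): the term added to V_KM1 (for
# tube K) and to V_KM2 (for tube K+1) at every junction evaluation.
# dA[k] is the wall-area variation delta A_k(tau) estimated at the
# previous integration step; rho and c are the air density and sound
# velocity.  Names are illustrative.

def wall_offset(dA, A, rho, c):
    return -0.5 * rho * c * c * dA / A         # one correction per tube
```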

The values of $p_K^+$ and $p_K^-$ can be used to evaluate the pressure in the vocal tract (see (A.1.1)) and to estimate (as discussed in Sections 3.1.b and 4.4) the perturbation $\delta A_K$ to be used in the following integration step.
In summary, in Sections A.2.a and A.2.b we have defined an algorithm, represented in Figure A.5, which models the fricative self-excitation of the vocal tract and the effects of the vibrating walls. This algorithm can be calculated in parallel for each junction between adjacent uniform tubes, computing the wave propagation at time $t + \tau$ when the initial conditions at time $t$ are given. We have indicated (see Sections 3.1.b and 4.4) how to estimate the vibration $\delta A_K(\tau)$ of the yielding walls. Again, this operation can be performed in parallel, using the backward difference rule (see Section 4.4) for each tube of our model.
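Putting the two steps together, one complete parallel update could look like the following sketch; it reuses the junction_update and wall_offset sketches above, leaves the boundary updates of Section A.3 out, and treats the wall-vibration estimate dA as given.

```python
# One full interior time step: scatter all junctions in parallel, then
# add the per-tube yielding-wall corrections to the outgoing waves
# (equivalent to augmenting V_KM1 and V_KM2 in (A.2.2)).  Boundary
# updates at the glottis, lips and nostrils are omitted here.

def time_step(p_plus, p_minus, A, R, V, dA, rho, c):
    out_minus, out_plus = junction_update(p_plus, p_minus, A, R, V, rho * c)
    w = wall_offset(dA, A, rho, c)
    out_minus = out_minus + w[:-1]   # waves re-entering tubes K
    out_plus = out_plus + w[1:]      # waves entering tubes K+1
    return out_minus, out_plus
```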
A.3) Boundary Conditions

In Section A.2 we modeled the wave propagation in uniform tubes and at the junction between adjacent sections. Here we simulate the termination of the vocal cavities at the glottis, lips (or nostrils) and the acoustic coupling between the oral and nasal passages.

A.3.a) Glottal Termination
In the acoustic pipe model of the vocal tract the transglottal pressure is, by definition, the difference between the subglottal pressure $P_s$ and the pressure in the first tube of the model, $p_1(0,t)$. During speech production, if we neglect the viscous losses, the transglottal pressure consists of a kinetic component proportional to the square of the glottal volume velocity $u_g(t)$ and of an inertive component proportional to the time derivative of $u_g(t)$. Therefore

$$ P_s - p_1(0,t) = L\,\frac{d u_g(t)}{dt} + a\, u_g^2(t) + V_g \tag{A.3.1} $$

$V_g$ in (A.3.1) is a random value which models the fricative excitation at the glottis, occurring during aspirated sounds. Observe that (A.3.1) accounts for source-tract interaction, since the glottal volume velocity $u_g(t)$ depends on the pressure $p_1(0,t)$ in the vocal tract.
We need (A.3.1) to obtain a relationship between the positive going and negative going pressure waves at the beginning of the first tube of the vocal tract model. To eliminate the square term $u_g^2(t)$ from (A.3.1) it is convenient to use the identity

$$ u_g^2(t) = \left( u_g(t-\tau) + (u_g(t) - u_g(t-\tau)) \right)^2 $$

which is approximated by

$$ u_g^2(t) \approx u_g^2(t-\tau)\left( 1 + 2\,\frac{u_g(t) - u_g(t-\tau)}{u_g(t-\tau)} \right) \tag{A.3.2} $$

since the variation of $u_g(t)$ is usually small during an interval of duration $\tau$. Because of the continuity conditions at the glottal termination and (A.1.1), we obtain

$$ u_g(t) = u_1(0,t) = \frac{A_1}{\rho c}\left( p_1^+(t) - p_1^-(t) \right) \tag{A.3.3} $$
$$ p_1(0,t) = p_1^+(t) + p_1^-(t) \tag{A.3.4} $$

By substituting (A.3.2), (A.3.3) and (A.3.4) into (A.3.1), and by approximating the time derivatives with backward differences, we can obtain a linear relationship between $p_1^+(t)$, $p_1^-(t)$ and the subglottal pressure $P_s$:

$$ p_1^+(t) = -D_1\, p_1^-(t) + D_2\,(P_s - V_g) + D_3 \tag{A.3.5} $$

where

$$ D_2 = \left( 1 + \frac{A_1}{\rho c}\left( \frac{L}{\tau} + 2 a\, u_g(t-\tau) \right) \right)^{-1} $$
$$ D_1 = \left( 1 - \frac{A_1}{\rho c}\left( \frac{L}{\tau} + 2 a\, u_g(t-\tau) \right) \right) D_2 $$
$$ D_3 = u_g(t-\tau)\left( \frac{L}{\tau} + a\, u_g(t-\tau) \right) D_2 $$

Equation (A.3.5) can, therefore, be used to model the glottal termination of the vocal tract.
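A sketch of this boundary update follows (Python, reading aid only; the carried state ug_prev and all parameter names are assumptions of the sketch).

```python
# Glottal boundary update (A.3.5).  p1_minus = p_1^-(t) is the wave
# arriving at the glottis, ug_prev = u_g(t - tau) the previous glottal
# flow, Ps the subglottal pressure and Vg the aspiration noise sample.
# L_g and a are the inertive and kinetic coefficients of (A.3.1);
# rho_c is air density times sound velocity.  Names are illustrative.

def glottal_boundary(p1_minus, ug_prev, Ps, Vg, A1, tau, L_g, a, rho_c):
    G = (A1 / rho_c) * (L_g / tau + 2.0 * a * ug_prev)
    D2 = 1.0 / (1.0 + G)                       # coefficients of (A.3.5)
    D1 = (1.0 - G) * D2
    D3 = ug_prev * (L_g / tau + a * ug_prev) * D2
    return -D1 * p1_minus + D2 * (Ps - Vg) + D3    # p_1^+(t)
```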
A.3.b) Radiation Load
Figure A.6 shows the termination of the last (Nth) section of the vocal tract. As discussed in Section 3.3, $R_R$ and $L_R$ represent the radiation load of the vocal tract at the mouth. The resistor $R_N$ and the random voltage generator $V_N$ model the turbulent losses and the fricative excitation at the lips. A similar equivalent circuit is obviously valid also for the radiation effects at the nostrils.
To obtain a mathematical relation between the positive and negative going pressure waves $p_N^+(t-\tau)$ and $p_N^-(t+\tau)$, one should consider the continuity conditions

$$ p_N(\Delta x, t) = R_R\, u_R(t) + R_N\, u_N(\Delta x, t) + V_N $$
$$ L_R\,\frac{d u_L(t)}{dt} = R_R\, u_R(t) \tag{A.3.6} $$
$$ u_N(\Delta x, t) = u_R(t) + u_L(t) $$
Substituting (A.1.1) into (A.3.6) and approximating the time
derivatives with backward differences, we obtain, after
several manipulations,

Figure A.6. Boundary conditions at the lips or nostrils.
$$ p_N^-(t+\tau) = E_1\, p_N^+(t-\tau) + E_2\left( u_R(t-\tau) - u_N(\Delta x, t-\tau) \right) + E_3 $$
$$ u_R(t) = \left( u_N(\Delta x, t) - u_N(\Delta x, t-\tau) + u_R(t-\tau) \right) E_4^{-1} \tag{A.3.7} $$

where

$$ E_4 = 1 + \frac{R_R\,\tau}{L_R}, \qquad E_5 = 1 + \left( R_R E_4^{-1} + R_N \right) \frac{A_N}{\rho c}, $$
$$ E_1 = \left( \left( R_R E_4^{-1} + R_N \right) \frac{A_N}{\rho c} - 1 \right) E_5^{-1}, \qquad E_2 = R_R E_4^{-1} E_5^{-1}, \qquad E_3 = V_N E_5^{-1}. $$
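The corresponding update can be sketched as follows (Python, reading aid only; the carried state uR_prev, uN_prev and all names are assumptions of the sketch).

```python
# Radiation-load update (A.3.7) at the lips (or nostrils).  State
# carried between calls: uR_prev = u_R(t - tau), the flow through the
# radiation resistance, and uN_prev = u_N(dx, t - tau), the mouth flow.
# RR, LR: radiation resistance and inductance; RN, VN: turbulence loss
# and noise source; rho_c: air density times sound velocity.

def radiation_boundary(pN_plus, uR_prev, uN_prev, AN, tau,
                       RR, LR, RN, VN, rho_c):
    E4 = 1.0 + RR * tau / LR
    beta = (RR / E4 + RN) * AN / rho_c
    pN_minus = ((beta - 1.0) * pN_plus
                + (RR / E4) * (uR_prev - uN_prev) + VN) / (1.0 + beta)
    uN = (AN / rho_c) * (pN_plus - pN_minus)   # new mouth flow u_N(dx, t)
    uR = (uN - uN_prev + uR_prev) / E4         # new resistive flow u_R(t)
    return pN_minus, uR, uN
```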
A.3.c) Nasal Coupling
We suppose that the nasal passage is coupled to the vocal tract model through a 3 port junction placed between the Vth and (V+1)st tubes of the vocal tract, as shown in Figure A.7. The cross-sectional areas of the pipes that are connected to the junction are $A_V$, $A_{V+1}$ and $A_C$, and all three pipes have the same length $\Delta x$. When the nasal passage is closed, $A_C$ is equal to zero and the three port junction becomes equivalent to a two port junction between the Vth and (V+1)st tubes of the vocal tract.

Figure A.7. Three port junction to model the acoustic coupling between the oral and nasal tracts.
If we assume a lossless junction, the relation between the pressure waves leaving and entering the junction can be easily determined with the aid of the continuity conditions

$$ p_V(\Delta x, t) = p_C(0,t) = p_{V+1}(0,t) $$
$$ u_V(\Delta x, t) = u_C(0,t) + u_{V+1}(0,t) \tag{A.3.8} $$

Substituting (A.1.1) into (A.3.8), we obtain

$$ \begin{bmatrix} p_V^-(t+\tau) \\ p_{V+1}^+(t) \\ p_C^+(t) \end{bmatrix} = [S] \begin{bmatrix} p_V^+(t-\tau) \\ p_{V+1}^-(t) \\ p_C^-(t) \end{bmatrix} $$

where

$$ [S] = \frac{1}{A_V + A_{V+1} + A_C} \begin{bmatrix} A_V - A_{V+1} - A_C & 2 A_{V+1} & 2 A_C \\ 2 A_V & A_{V+1} - A_V - A_C & 2 A_C \\ 2 A_V & 2 A_{V+1} & A_C - A_V - A_{V+1} \end{bmatrix} $$
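A sketch of the junction computation closes the appendix (Python, reading aid only); note how setting $A_C = 0$ recovers the two port behavior described above.

```python
# Lossless 3 port junction of Section A.3.c.  Incoming waves:
# a = p_V^+(t - tau), d = p_{V+1}^-(t), e = p_C^-(t).  The common
# junction pressure P determines the three outgoing waves; with
# A_c = 0 the update reduces to the two port case of (A.1.3).

def three_port(a, d, e, A_v, A_v1, A_c):
    sigma = A_v + A_v1 + A_c
    P = 2.0 * (A_v * a + A_v1 * d + A_c * e) / sigma   # junction pressure
    return P - a, P - d, P - e     # p_V^-(t+tau), p_{V+1}^+(t), p_C^+(t)
```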

REFERENCES
[1] Flanagan, J. L., "Voices of men and machines,"
J. Acoust. Soc. Amer., vol. 51, pp. 1375-1387,
1972.
[2] Holmes, J. N., Speech Synthesis, Mills and Boon, London, 1972.

[3] Feldman, J. A., Hofstetter, E. M., and Malpass, M. L., "A compact flexible LPC vocoder based on a commercial signal processing microcomputer," IEEE Trans. Acoust. Speech Signal Proces., vol. 31, pp. 252-257, 1983.
[4] Cox, R. V., Crochiere, R. E., and Johnston, J. D.,
"Real time implementation of time domain harmonic
scaling of speech for rate modification and
coding," IEEE Trans. Acoust. Speech Signal Proces.,
vol. 31, pp. 258-272, 1983.
[5] Fette, B., Harrison, D., Olson, D., and
Allen, S. P., "A family of special purpose
microprogrammable digital signal processor IC's in
an LPC vocoder system," IEEE Trans. Acoust. Speech
Signal Proces., vol. 31, pp. 273-280, 1983.
[6] Irie, K., Uno, T., Uchimura, K., and Iwata, A., "A
single-chip ADM LSI CODEC," IEEE Trans. Acoust.
Speech Signal Proces., vol. 31, pp. 281-287, 1983.
[7] Flanagan, J. L., Speech Analysis, Synthesis and
Perception, Springer Verlag, New York, 1972.
[8] Sambur, M. R., "An efficient linear prediction
vocoder," Bell System Tech. J. vol. 54, pp. 1693-
1723, 1975.
[9] Tribolet, J. M., and Crochiere, R. E., "Frequency
domain coding of speech," IEEE Trans. Acoust.
Speech and Signal Proces., vol. 27, pp. 512-530,
1979.
[10] Flanagan, J. L. and Christensen, S. W. "Computer
studies on parameteric coding of speech spectra,"
J. Acoust. Soc. Amer., vol. 68, pp. 420-429, 1980.
[11] Roucos, S., Schwartz, R. M., and Makhoul, J., "A
segment vocoder at 150 B/S," Proceedings
International Conference Acoust. Speech Signal
Proces., vol. 1, pp. 61-65, 1983.
[12] Klatt, D. H., "Structure of a phonological rule
component for a synthesis by rules program," IEEE
Trans. Acoust. Speech Signal Proces., vol. 24,
pp. 391-398, 1976.
[13] Klatt, D. H., "Real time speech synthesis by rule,"
J. Acoust. Soc. Amer., vol. 68(S1), pp. S18(a),
1980.
[14] Rabiner, L. R., "A model for synthesizing speech by
rule," IEEE Trans. Audio Electroacoust., vol. 17,
pp. 7-13, 1969.
[15] Cooper, J. S. Gaitenby, J. H. Mattingly, I. G.,
and Umeda, N., "Reading aids for the blind: a
special case of man-machine communication," IEEE
Trans. Audio Electroacoust., vol. 17, pp. 266-270,
1969.
[16] Klatt, D. H., "Software for a cascade/parallel
formant synthesizer," J. Acoust. Soc. Amer.,
vol. 67, pp. 971-995, 1980.
[17] Stevens, K. N., and Blumstein, S. E., "Invariant
cues for place of articulation in stop consonants,"
J. Acoust. Soc. Amer., vol. 64, pp. 1358-1368,
1978.
[18] Abramson, A. S., Nye, P. W. Henderson, J. B. and
Marshall, C. W., "Vowel height and the perception
of consonantal nasality," J. Acoust. Soc. Amer.,
vol. 70, pp. 329-339, 1981.
[19] Titze, I. R., The Human Vocal Cords: A Mathematical
Model, Doctoral Dissertation, Dept. of Physics and Astronomy, Brigham Young University, Provo, Utah, 1973.
[20] Ishizaka, K., and Ishiki, N., "Computer simulation
of pathological vocal cord vibration," J. Acoust.
Soc. Amer., vol. 60, pp. 1193-1198, 1976.

[21] Stevens, K. N., "Airflow and turbulence noise for fricative and stop consonants: static considerations," J. Acoust. Soc. Amer., vol. 50, pp. 1180-1192, 1971.

[22] Fant, G., Acoustic Theory of Speech Production, Mouton and Co., The Hague, 1960.

[23] Dunn, H. K., "The calculation of vowel resonances and an electrical vocal tract," J. Acoust. Soc. Amer., vol. 22, pp. 740-753, 1950.

[24] Stevens, K. N., Kasowski, S., and Fant, G. M., "An electrical analog of the vocal tract," J. Acoust. Soc. Amer., vol. 25, pp. 734-742, 1953.

[25] Stevens, K. N., and House, A. S., "Development of a quantitative description of vowel articulation," J. Acoust. Soc. Amer., vol. 27, pp. 484-493, 1955.

[26] Atal, B. S., and Hanauer, S. L., "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, pp. 637-655, 1971.

[27] Markel, J. D., and Gray, A. H., "On the autocorrelation with application to speech analysis," IEEE Trans. Audio Electroacoust., vol. 21, pp. 69-79, 1973.

[28] Itakura, F., and Saito, S., "On the optimum quantization of feature parameters in the PARCOR speech synthesizer," Proceedings 1972 Conference Speech Commun. Proces., Tokyo, pp. 434-437, 1972.

[29] Rabiner, L. R., and Schafer, R. W., Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, New Jersey, 1979.

[30] Markel, J. D., and Gray, A. H., Linear Prediction of Speech, Springer-Verlag, New York, 1976.

[31] Schroeder, M. R., "Models of hearing," Proceedings of the IEEE, vol. 63, pp. 1332-1350, 1975.

[32] Makhoul, J., "Linear prediction: a tutorial review," Proceedings of the IEEE, vol. 63, pp. 561-580, 1975.

[33] Sambur, M. R., Rosenberg, A. E., and McGonegal, C. A., "On reducing the buzz in LPC synthesis," J. Acoust. Soc. Amer., vol. 63, 1978.
[34] Yea, J. J., The Influence of Glottal Excitation
Functions on the Quality of Synthetic Speech,
Doctoral Dissertation, University of Florida,
Gainesville, Florida, 1983.
[35] Naik, J., Synthesis and Objective Evaluation of
Natural Sounding Speech Using the Linear Prediction
Analysis-Synthesis Scheme, Doctoral Dissertation,
University of Florida, Gainesville, Florida, 1983.
[36] Holmes, J. N., "The influence of the glottal
waveform on the naturalness of speech from a
parallel formant synthesizer," IEEE Trans. Audio
Electroacoust., pp. 298-305, 1973.
[37] Rabiner, L. R., "Digital-formant synthesizer for
speech synthesis studies," J. Acoust. Soc. Amer.,
vol. 43, pp. 822-828, 1968.
[38] Schafer, R. W. and Rabiner, L. R. "System for
automatic formant analysis of voiced speech,"
J. Acoust. Soc. Amer., vol. 47, pp. 634-648, 1970.
[39] Olive, J., "Automatic formant tracking by a Newton-
Raphson technique," J. Acoust. Soc. Amer., vol. 50,
pp. 661-670, 1971.
[40] Markel, J. D., "Automatic formant and fundamental
frequency extraction from a digital inverse filtering
formulation," Proceedings International Conference
Speech Commun. Proces., Boston, pp. 81-84, 1972.
[41] McCandless, S. S., "An algorithm for automatic
formant extraction using linear prediction
spectra," IEEE Trans. Acoust. Speech Signal
Proces., vol. 22, pp. 135-141, 1974.
[42] Rothemberg, M., "The effect of flow dependence on
source-tract interaction," presented and to appear
in Proceedings of Vocal Folds Physiology and
Biophysics of Voice, University of Iowa, 1983.
[43] Ananthapadmanaba, T. V., and Fant, G., "Calculation
of true glottal flow and its components," STL-QPSR,
no. 1, pp. 1-30, 1982.

[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
[44] Guerin, B., Mrayati, M., and Carre, R., "A voice
source taking into account of coupling with the
supraglottal cavities," IEEE International
Conference Acoust. Speech Signal Proces., pp. 47-
50, 1976.
Childers, D. G., Yea, J. J., and Bocchieri, E. L.,
"Source-vocal tract interaction in speech and
singing synthesis," presented and to appear in
Proceedings of Stockholm Music Acoust. Conference,
Stockholm, Aug. 1983.
Wakita, H., "Direct estimation of the vocal tract
shape by inverse filtering of acoustic speech
wave forms," IEEE Trans. Audio Electroacoust.,
vol. 21, pp. 417-427, 1973.
Wakita, H., "Estimation of vocal tract shapes from
acoustical analysis of the speech wave: the state
of the art," IEEE Trans. Acoust. Speech Signal
Proces., vol. 27, pp. 281-285, 1979.
Sondhi, M. M., "Estimation of vocal tract areas:
the need for acoustical measurements," IEEE Trans.
Acoust. Speech Signal Proces., vol. 27, pp. 268-
273, 1979.
Sondhi, M. M., and Gopinath, B., "Determination of
vocal tract shape from impulse response at the
lips," J. Acoust. Soc. Amer., vol. 49, pp. 1867-
1873, 1971.
Sondhi, M. M., and Resnick, J. R., "The inverse
problem of the vocal tract: numerical methods,
acoustical experiments and speech synthesis,"
J. Acoust. Soc. Amer., vol. 73, pp. 985-1002, 1983.
Mermelstein, P., "Determination of the vocal tract
shape from measured formant frequencies,"
J. Acoust. Soc. Amer., vol. 41, 1283-1294, 1967.
Atal, B. S., Chang, J. J., Mathews, M. V., and
Tukey, J. W., "Inversion of articulatory to
acoustic transformation by a computer-sorting
technique," J. Acoust. Soc. Amer., vol. 63,
pp. 1535-1555, 1978.
Lieberman, P., Speech Physiology and Acoustic
Phonetics: An Introduction, MacMillan, New York,
1977.

[54] Flanagan, J. L., Ishizaka, K. and Shipley, K. L.,
"Signal model for low bit rating coding of speech,"
J. Acoust. Soc. Amer., vol. 68, pp. 780-791, 1980.
[55] Sondhi, M. M., "A model for wave propagation in a
lossy vocal tract," J. Acoust. Soc. Amer., vol. 55,
pp. 1070-1079, 1974.
[56] Dunn, H. K. and White, S. D. "Statistical
measurements of conversational speech," J. Acoust.
Soc. Amer., vol. 11, pp. 278-288, 1940.
[57] Ishizaka, K., French, J. C., and Flanagan, J. L.,
"Direct determination of the vocal tract wall
impedance," IEEE Trans. Acoust. Speech and Signal
Proces., vol. 23, pp. 370-373, 1975.
[58] Morrow, C. J., "Speech in deep submergence
atmosphere," J. Acoust. Soc. Amer., vol. 50,
pp. 715-728, 1971.
[59] Fujimura, 0., and Lindquist, J., "Sweep tone
measurement of vocal tract characteristics,"
J. Acoust. Soc. Amer., vol. 49, pp. 541-558, 1971.
[60] Van den Berg, J., "On the role of the laryngeal
ventricle in voice production," Folia Phoniatrica,
vol. 7, pp. 57-69, 1955.
[61] Van den Berg, J., Zantema, J. T., and Doornenbal,
P., "On the air resistance and the Bernoulli effect
of the human larynx," J. Acoust. Soc. Amer.,
vol. 29, pp. 626-631, 1957.
[62] Flanagan, J. L., and Landgraf, L. L., "Self
oscillating source for vocal tract synthesizers,"
IEEE Trans. Audio Electroacoust. vol. 16, pp. 57-
58, 1968.
[63] Flanagan, J. L., and Meinhart, D., "Source system interaction in the vocal tract," J. Acoust. Soc. Amer., vol. 64, pp. 2001(A), 1964.

[64] Mermelstein, P., "An extension of Flanagan's model of vocal cord oscillations," J. Acoust. Soc. Amer., vol. 50, pp. 1208-1210, 1971.

[65] Ishizaka, K., and Matsudaira, M., "What makes the
vocal cords vibrate," Proceedings Sixth
International Congress Acoust., vol. 2, pp. B9-12,
1968.
[66] Ishizaka, K., and Flanagan, J. L., "Synthesis of
voiced speech from a two mass model of the vocal
cords," Bell System Tech. J., vol. 51, pp. 1233-
1268, 1972.
[67] Flanagan, J. L., and Ishizaka, K., "Synthesis of
speech from a dynamical model of the vocal cords
and vocal tract," Bell System Tech. J., vol. 54,
pp. 484-506, 1975.
[68] Flanagan, J. L., and Ishizaka, K., "Automatic
generation of voiceless excitation in a vocal cord,
vocal tract speech synthesizer," IEEE Trans. Audio
Electroacoust., pp. 163-169, 1976.
[69] Flanagan, J. L. and Ishizaka, K. "Computer model
to characterize the air volume displaced by the
vibrating vocal cords," J. Acoust. Soc. Amer.,
vol. 63, pp. 1559-1566, 1978.
[70] Monsen, R. B. Engebretson, A. M. and Vemula,
N. R., "Indirect assessment of the contribution of
subglottal air pressure and vocal fold tension to
changes of fundamental frequency in English,"
J. Acoust. Soc. Amer., vol. 64, pp. 65-81, 1978.
[71] Flanagan, J. L. and Cherry, L., "Excitation of
vocal tract synthesizer," J. Acoust. Soc. Amer.,
vol. 45, pp. 764-769, 1969.
[72] Kelly, J. L. and Lochbaum, C. C., "Speech
synthesis," Proceedings Fourth International
Congress Acoust., pp. 1-4, 1962.
[73] Mermelstein, P., "Calculation of the vocal tract
transfer function for speech synthesis
applications," Proceedings Seventh International
Congress Acoust., pp. 173-176, 1971.
[74] Rubin, P., Baer, T., and Mermelstein, P., "An
articulatory synthesizer for perceptual research,"
J. Acoust. Soc. Amer., vol. 70, pp. 321-328,
1981.

[75] Gear, C. W., Numerical Initial Value Problems in Ordinary Differential Equations, Prentice Hall, Englewood Cliffs, New Jersey, 1971.

[76] Ralston, A., and Wilf, H. S., Mathematical Methods for Digital Computers, John Wiley and Sons, New York, 1967.

[77] Gear, C. W., "DIFSUB for solution of ordinary differential equations," Collected Algorithms from ACM, ACM, Inc., New York, pp. 407P1-407P7, 1980.

[78] Coker, C. H., "A model of articulatory dynamics and control," Proceedings of the IEEE, vol. 64, pp. 452-460, 1976.

[79] Mermelstein, P., "Articulatory model for the study of speech production," J. Acoust. Soc. Amer., vol. 53, pp. 1070-1082, 1973.

[80] Ohman, S. E. G., "Numerical model of coarticulation," J. Acoust. Soc. Amer., vol. 41, pp. 310-320, 1967.

[81] Lindquist, J., and Sundberg, J., "Acoustic properties of the nasal tract," STL-QPSR, no. 1, pp. 13-17, 1972.

[82] Perkell, J. S., Physiology of Speech Production, MIT Press, Cambridge, 1969.

[83] Gay, T., "Articulatory movements in VCV sequences," J. Acoust. Soc. Amer., vol. 62, pp. 183-194, 1977.

[84] Rothemberg, M., "The voice source in singing," Research Aspects on Singing, Royal Swedish Academy of Music, Stockholm, pp. 15-33, 1981.

[85] Kewley-Port, D., "Time varying features as correlates of place of articulation in stop consonants," J. Acoust. Soc. Amer., vol. 73, pp. 322-335, 1983.

[86] Searle, C. L., Jacobson, J. Z., and Rayment, S. G., "Stop consonant discrimination based on human audition," J. Acoust. Soc. Amer., vol. 65, pp. 799-809, 1979.

[87] Stevens, K. N. "Acoustic correlates of some
phonetic categories," J. Acoust. Soc. Amer.,
vol. 68, pp. 836-842, 1980.
[88] Fant, G., Speech Sounds and Features, MIT Press,
Cambridge, Massachusetts, 1973.
[89] Rothemberg, M. R. and Zahorian, S., "Nonlinear
inverse filtering technique for estimating the
glottal area waveform," J. Acoust. Soc. Amer.,
vol. 61, pp. 1063-1071, 1977.
[90] Wong, D. Y., Markel, J. D., and Gray, A. H., "Least
squares inverse filtering from the acoustic speech
waveform," IEEE Trans. Acoust. Speech Signal
Proces., vol. 27, 1979.
[91] Rothemberg, M., "Some relations between glottal air
flow and vocal fold contact area," Proceedings of
the Conference on the Assessment of Vocal
Pathology, Bethesda, Maryland, pp. 88-96,
[92] Coker, C. H., and Umeda, N., "The importance of
spectral detail in initial-final contrasts of
voiced stops," J. of Phonetics, vol. 3, pp. 63-68,
1975.
[93] Westbury, J. R. "Enlargement of the supraglottal
cavity and its relation to stop consonant voicing,"
J. Acoust. Soc. Amer., vol. 73, pp. 1322-1336,
1983.
[94] McGlone, R. E., and Shipp, T., "Comparison of
subglottal air pressure associated with /p/ and
/b/," J. Acoust. Soc. Amer., vol. 51, pp. 664-665,
1972.
[95] Lubker, J. F., "Transglottal air flow during stop
consonant production," J. Acoust. Soc. Amer.,
vol. 53, pp. 212-215, 1973.
[96] Kent, R. D., and Moll, K. L., "Vocal tract
characteristics of the stop cognates," J. Acoust.
Soc. Amer., vol. 46, pp. 1549-1555, 1969.
[97] House, A. S., and Stevens, K. N., "Analog studies of the nasalization of vowels," J. of Speech and Hearing Disorders, vol. 21, pp. 218-232, 1956.

[98] Hecker, M. H. "Studies of nasal consonants with an
articulatory speech synthesizer," J. Acoust. Soc.
Amer., vol. 34, pp. 179-188, 1962.
[99] Fujimura, 0., "Analysis of nasal consonants,"
J. Acoust. Soc. Amer., vol. 34, pp. 1865-1875,
1962.
[100] Morris, H. L., "Etiological bases for speech problems," in D. Spriestersbach and D. Sherman (eds.), Cleft Palate and Communication, Academic Press, New York, pp. 119-168, 1968.
[101] Bernthal, J. E. and Benkelman, D. R. "The effect
of changes in velopharyngeal orifice area on vowel
intensity," Cleft Palate Journal, vol. 14,
pp. 63-77, 1977.
[102] Daniloff, R. G., and Hammarberg, R. E., "On defining coarticulation," J. of Phonetics, vol. 1, pp. 239-248, 1973.
[103] Kent, R. D., and Minifie, F. D., "Coarticulation in
recent speech production models," J. of Phonetics,
vol. 5, pp. 115-133, 1977.
[104] Daniloff, R., and Moll, K., "Coarticulation of lip
rounding," J. of Speech and Hearing Research,
vol. 11, pp. 707-721, 1968.
[105] Ohman, S. E. G., "Coarticulation in VCV utterances.
Spectrographic measurements," J. Acoust. Soc.
Amer., vol. 39, pp. 151-168, 1966.
[106] Carney, P. J., and Moll, K. L., "A
cinefluorographic investigation of fricative
consonant-vowel coarticulation," Phonetica,
vol. 23, pp. 193-202, 1971.
[107] Amerman, J. D., Daniloff, R. and Moll, K., "Lip
and jaw coarticulation for the phoneme /ae/," J. of
Speech and Hearing Research, vol. 13, pp. 147-161,
1970.
[108] Moll, K. and Daniloff, R. "Investigation of the
timing of velar movements during speech,"
J. Acoust. Soc. Amer., vol. 50, pp. 678-684, 1971.

[109] Stevens, K. N., House, A. S., and Paul, A. P., "Acoustic description of syllable nuclei: an interpretation in terms of a dynamic model of articulation," J. Acoust. Soc. Amer., vol. 40, pp. 123-132, 1966.

[110] Kozhevnikov, V. A., and Chistovich, L. A., "Speech Articulation and Perception," Joint Publications Research Services, Washington, D.C., 1965.

[111] MacNeilage, P. F., "Motor control of serial ordering of speech," Psychological Review, vol. 77, pp. 182-196, 1970.

[112] Henke, W. L., Dynamic Articulatory Model of Speech Production Using Computer Simulation, Doctoral Dissertation, Massachusetts Institute of Technology, Cambridge, 1966.

[113] Fowler, C. A., "Coarticulation and theories of extrinsic timing," J. of Phonetics, vol. 8, pp. 113-133, 1980.

[114] Krishnamurthy, A. K., Study of Vocal Fold Vibration and the Glottal Sound Source Using Synchronized Speech, Electroglottography and Ultra-High Speed Laryngeal Films, Doctoral Dissertation, University of Florida, Gainesville, Florida, 1983.

[115] Childers, D. G., Moore, G. P., Naik, J. M., Larar, J. N., and Krishnamurthy, A. K., "Assessment of laryngeal function by simultaneous, synchronized measurement of speech, electroglottography and ultra-high speed film," Transcripts of the Eleventh Symposium Care of the Professional Voice, The Juilliard School, New York, pp. 234-244, 1982.

[116] Titze, I. R., and Talkin, D., "Simulation and interpretation of glottographic waveforms," Proceedings of the Conference on the Assessment of Vocal Pathology, Bethesda, Maryland, 1979.

BIOGRAPHICAL SKETCH
Enrico Luigi Bocchieri was born in Pavia, Italy, on
January 7, 1956. After completing high school in 1974 he
joined the University of Pavia where he received the
"Laurea" in Electrical Engineering in July 1979.
Since September 1979 he has been with the Department of
Electrical Engineering at the University of Florida,
receiving his M.S. degree in August 1981. After completing
his Ph.D. he plans to join Texas Instruments and work in the
area of speech algorithm development and implementation.
