Material Information
Title: An articulatory speech synthesizer
Physical Description: vi, 169 leaves : ill. ; 28 cm.
Language: English
Creator: Bocchieri, Enrico Luigi, 1956-
Publication Date: 1983
Subjects: Speech synthesis (lcsh); Electrical Engineering thesis, Ph.D.; Dissertations, Academic -- Electrical Engineering -- UF
Genre: bibliography (marcgt); non-fiction (marcgt)
Thesis: Thesis (Ph.D.)--University of Florida, 1983.
Bibliography: Bibliography: leaves 158-168.
Statement of Responsibility: by Enrico Luigi Bocchieri.
General Notes: Typescript. Vita.
Source Institution: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
















AN ARTICULATORY SPEECH SYNTHESIZER


BY



ENRICO LUIGI BOCCHIERI





















A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF DOCTOR OF PHILOSOPHY


UNIVERSITY OF FLORIDA


1984


















ACKNOWLEDGMENTS


I would like to express my sincere appreciation to my

advisory committee for their help and guidance throughout

this work.

I would like to give special thanks to my committee

chairman, Dr. D. G. Childers for his competent advice, for

his support, both financial and moral, and for providing an

educational atmosphere without which this research would

never have been possible.

Special gratitude is also expressed to Dr. E. R.

Chenette for his guidance, encouragement and financial

support.

To my parents and family I am forever indebted. Their

unceasing support and encouragement made it all possible and

worthwhile.


















TABLE OF CONTENTS


ACKNOWLEDGMENTS ............................................ ii

ABSTRACT ..................................................... v

CHAPTER                                                     Page

1  INTRODUCTION: SPEECH SYNTHESIS APPLICATIONS.
   RESEARCH GOALS ............................................ 1

2  SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS ............ 7

   2.1) Speech Physiology and the Source-Filter Model ........ 8
   2.2) Linear Prediction ................................... 13
   2.3) Formant Synthesis ................................... 16
   2.4) Articulatory Synthesis .............................. 18

3  ACOUSTIC MODELS OF THE VOCAL CAVITIES .................... 23

   3.1) Sound Propagation in the Vocal Cavities ............. 23
        3.1.a) Derivation of the Model ...................... 23
        3.1.b) Modeling the Yielding Wall Properties ........ 32
        3.1.c) Nasal Coupling ............................... 37
   3.2) Excitation Modeling ................................. 37
        3.2.a) Subglottal Pressure .......................... 37
        3.2.b) Voiced Excitation ............................ 38
        3.2.c) Unvoiced Excitation .......................... 45
   3.3) Radiation Load ...................................... 47
   3.4) Remarks and Other Acoustic Models ................... 48

4  NUMERICAL SOLUTION OF THE ACOUSTIC MODEL ................. 51

   4.1) Requirements of the Numerical Solution Procedure .... 51
   4.2) Runge-Kutta Methods ................................. 54
        4.2.a) Derivation of the Method ..................... 54
        4.2.b) Control of the Step Size with Runge-Kutta .... 57
        4.2.c) Order Selection .............................. 59
   4.3) Multistep Methods ................................... 60
        4.3.a) Implicit and Explicit Methods ................ 60
        4.3.b) Derivation of the Methods .................... 61
        4.3.c) Characteristics of Multistep Methods ......... 64
   4.4) Method Selection .................................... 66

5  THE ARTICULATORY MODEL AND ITS
   INTERACTIVE GRAPHIC IMPLEMENTATION ....................... 71

   5.1) Definition of the Articulatory Model ................ 71
   5.2) The Graphic Editor .................................. 74
   5.3) Display of the Time Variations of the Model ......... 80
   5.4) Simultaneous and Animated Display of the
        Articulatory and Acoustic Characteristics
        of the Vocal Cavities ............................... 84

6  SIMULATION RESULTS ....................................... 88

   6.1) Speech Synthesis .................................... 88
   6.2) Source Tract Interaction ............................ 91
   6.3) Onset Spectra of Voiced Stops ....................... 95
   6.4) Glottal Inverse Filtering of Speech ................. 99
   6.5) Simulation of Wall Vibration Effects ............... 105
        6.5.a) Vocal Cords Vibration During Closure ........ 105
        6.5.b) Formant Shift ............................... 110
   6.6) Pathology Simulation: Reduction of Sound
        Intensity During Nasalization ...................... 112
   6.7) Coarticulation ..................................... 117
   6.8) Interpretation of the EGG Data with
        the Two Mass Model of the Vocal Cords .............. 124

7  CONCLUSIONS ............................................. 133

   7.1) Summary ............................................ 133
   7.2) Suggestions for Future Research .................... 135

APPENDIX

AN EFFICIENT ARTICULATORY SYNTHESIS ALGORITHM
FOR ARRAY PROCESSOR IMPLEMENTATION ......................... 137

   A.1) Wave Propagation in Concatenated
        Lossless Tubes ..................................... 138
   A.2) Modifications of Kelly-Lochbaum Algorithm .......... 141
        A.2.a) Fricative Excitation ........................ 141
        A.2.b) Yielding Wall Simulation .................... 146
   A.3) Boundary Conditions ................................ 150
        A.3.a) Glottal Termination ......................... 151
        A.3.b) Radiation Load .............................. 153
        A.3.c) Nasal Coupling .............................. 155

REFERENCES ................................................. 158

BIOGRAPHICAL SKETCH ........................................ 169
















Abstract of Dissertation Presented to the Graduate
Council of the University of Florida in Partial
Fulfillment of the Requirements for the
Degree of Doctor of Philosophy

AN ARTICULATORY SPEECH SYNTHESIZER


By

Enrico Luigi Bocchieri

April 1984

Chairman: Dr. D. G. Childers
Major Department: Electrical Engineering

Linear prediction and formant synthesizers are based on

a rather approximate model of speech production physiology,

using analysis or "identification" algorithms of natural

speech to overcome the model limitations and to synthesize

good quality speech.

On the contrary, articulatory synthesizers are based on

a more exact speech production model, and do not use

identification algorithms to derive the model parameters

directly from the natural speech waveform.

This dissertation shows that the amount of

physiological detail captured by the articulatory synthesis

method is sufficient for the generation of high quality

synthetic speech and for the simulation of physiological and

pathological aspects of speech that are reported in the

literature.












Articulatory synthesis of speech represents the

acoustic properties of the vocal cavities by means of

modeling and numerical simulation techniques that are

reported in Chapters 3 and 4.

We have been able to guarantee the stability of the

numerical method and to halve the number of differential

equations that must be solved for the simulation of the

sound propagation in the vocal tract (Chapter 4).

In the Appendix we present a new and more efficient

algorithm for the simulation of the vocal cavity acoustics

which can be efficiently implemented with parallel

processing hardware.

Interactive graphic software (Chapter 5) has been

developed to represent the configurations of the vocal

cavities and to provide us with a convenient interface for

the manipulation of the geometric model of the vocal

cavities.

Chapter 6 employs the developed articulatory synthesis

system for the simulation of different aspects of speech

processing, for modeling speech physiology, and for testing

theories of linguistics reported in the literature. We

discuss and illustrate such cases as source tract

interaction, EGG modeling, onset spectra of voiced stops at

consonantal release, the effects of yielding walls on

phonation, sound intensity reduction during nasalization,

and glottal least squares inverse filtering.


















CHAPTER 1
INTRODUCTION:
SPEECH SYNTHESIS APPLICATIONS. RESEARCH GOALS.

In the last 3-4 decades both the engineering and

medical communities have devoted considerable research

effort to the problem of speech synthesis, i.e., the

generation of voice by artificial, electrical or mechanical

means.

The earliest attempts to construct talking machines can

be traced to the late 18th century. One of the first speech

synthesis devices was Kempelen's talking machine [1-2] which

in a demonstration in Vienna in 1791 was capable of

imitating the sounds of vowels and of many consonants.

Perhaps the greatest motivation for speech synthesis

research came from the development of telecommunications and

from the consequent engineering interest in efficient

methods for speech transmission. Moreover, the recent

progresses in circuit integration, microprocessors and

digital computers have made the implementation of high

performance speech transmission systems technologically

feasible [3-6]. This type of application requires a scheme

known as speech synthesis by analysis.












In its simplest form, speech communication is achieved

by modulating an electrical magnitude (for example, the

current in a transmission line) with the air pressure during

speech production. With this straightforward approach a

copy, in electrical terms, of the speech waveform can be

transmitted on a communication channel with a typical

bandwidth of about 3 kHz.

However, there appears to be a mismatch between the

information content of speech and the channel capacity. In

fact, the information content of written text may be

estimated at about 50 bit/sec [7] while the channel capacity

of a 3 kHz bandwidth and a typical signal-to-noise ratio is

about 30,000 bits/sec. Similar bit rates are also

encountered in conventional PCM speech transmission. Even

though spoken speech contains more information (such as

intonation and stress) than its written counterpart, the

above mentioned mismatch indicates that a smaller channel

bandwidth can be used for a more efficient transmission of

speech. Using different tradeoffs between the

intelligibility and naturalness of speech transmission on

one side and bit rate on the other, engineers have been able

to transmit speech with bit rates varying from 150 to 30,000

bit/s [8-11].
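
As a back-of-the-envelope check of this mismatch (our illustration, not the dissertation's), the Shannon capacity C = B log2(1 + S/N) of a 3 kHz channel with an assumed 30 dB signal-to-noise ratio reproduces the quoted figure of about 30,000 bit/s:

import math

# Assumed figures: 3 kHz bandwidth and ~30 dB SNR (a typical telephone line)
bandwidth_hz = 3000.0
snr_linear = 10.0 ** (30.0 / 10.0)

# Shannon capacity: C = B * log2(1 + S/N)
capacity_bps = bandwidth_hz * math.log2(1.0 + snr_linear)

text_rate_bps = 50.0  # estimated information rate of written text [7]

print(capacity_bps)                  # ~29,900 bit/s
print(capacity_bps / text_rate_bps)  # mismatch factor, ~600x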

The reduction of channel bandwidth has been obtained by

means of analysis-synthesis systems. Before transmission,












speech analysis algorithms are used to extract relevant
information about the speech waveform. This information is

then encoded and transmitted (hopefully at low bit rates)

along the communication channel. At the receiver a

synthesis algorithm is used to reconstruct the speech
waveform from the transmitted information.

This synthesis by analysis process is useful not only

in voice communication systems; for example, in automatic

voice answering systems, words or sentences are stored for

successive playbacks. In addition synthesis by analysis can
be used to reduce the memory usage. Texas Instruments'
learning aid, Speak and Spell, is an example of this type of

application.

Synthesis by rule or text to speech synthesis is a

different type of application that has received considerable
attention lately [12-13]. In this case the problem is not

to "regenerate" synthetic speech after an analysis phase of

its natural counterpart. Instead synthetic speech is

automatically produced according to certain linguistic rules

which transform a string of discrete input symbols directly
into speech [14] (see Figure 1.1). Applications of text to
speech synthesis include reading machines for the blind

[15], automatic answering systems, ... man-machine
communication.

The medical community is interested in speech synthesis
systems for different reasons. Speech synthesizers are













[Figure 1.1. Text to speech synthesis. Stored rules and data drive a synthesis strategy that converts discrete input symbols into control signals for a speech synthesizer, producing speech.]











often used in psychoacoustic and perceptual experiments

[16-18] in which the acoustic characteristics of speech must

be precisely and systematically controlled. Moreover the

vocal system is not easily accessible; therefore speech

physiologists and pathologists may use computer models as an

aid for the investigation of the physiology of the vocal

system and the diagnosis of voice disorders [19-20].

The purpose of this research is to apply speech

synthesis techniques for the simulation of the physiological

process of speech articulation in relation to the acoustic

characteristics of the speech signal.

Chapter 2 reviews the speech synthesis strategies used

most often and explains why the so-called "articulatory

synthesis" method has been selected for our research.

Speech generation depends on the vocal cavities' acoustic

properties which are physiologically determined during

speech articulation by the geometric configuration of the

vocal system.

The model of the acoustic characteristics of the vocal

cavities is explained in detail in Chapter 3, together with

its implementation by means of numerical simulation

techniques in Chapter 4. Chapter 5 focuses on the geometry

or spatial model of the vocal tract together with the

interactive graphic techniques that have been used for its

representation.











Synthesis and simulation results are presented in

Chapter 6. Chapter 7 provides a discussion of our findings

along with conclusions and suggestions for future research.



















CHAPTER 2
SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS

As indicated in the introduction (Chapter 1) there are

many different applications that motivate research in the

area of speech synthesis. However, different goals usually

require different approaches for the solution of the

problem. This chapter will briefly consider the three most

popular and best documented techniques for speech synthesis

(namely linear prediction, formant and articulatory

synthesis), their relative "advantages" and "disadvantages"

and the applications for which they are most suitable. The

purpose is to review the available speech synthesis

techniques and to justify the choice of articulatory

synthesis for our research.

Every speech synthesis strategy is based on a more or

less complete model of the physiology of speech production

and ultimately its performance is determined by the amount

of acoustic and linguistic knowledge that the model can

capture.

In the first section of this chapter we therefore

discuss the basic notions of the physiology of speech

together with the source-filter production model upon which

both linear prediction and formant synthesis are based.












2.1) Speech Physiology and the Source-Filter Model.

The acoustic and articulatory features of speech

production can be most easily discussed by referring to

Figure 2.1, which shows the cross-section of the vocal

apparatus.

The thoracic and abdominal musculatures are the

source of energy for the production of speech. The

contraction of the rib cage and the upward movement of the

diaphragm increase the air pressure in the lungs and expel

air through the trachea to provide an acoustic excitation of

the supraglottal vocal cavities, i.e., the pharynx, mouth

and nasal passage.

The nature of speech sounds is mostly determined by the

vocal cords and by the supraglottal cavities. The vocal

cords are two lips of ligament and muscle located in the

larynx; the supraglottal cavities are the oral and nasal

cavities that are vented to the atmosphere through the mouth

and nostrils.

Physically, speech sounds are an acoustic pressure wave

that is radiated from the mouth and from the nostrils and is

generated by the acoustic excitation of the vocal cavities

with the stream of air that is coming from the lungs during

exhalation.

An obvious and important characteristic of speech is

that it is not a continuous type of sound but instead it is

















































[Figure 2.1. Schematic diagram of the human vocal mechanism (from [7]). By permission of Springer-Verlag.]












perceived as a sequence of speech units or segments. In

general the different types of sound that occur during

speech production are generated by changing the manner of

excitation and the acoustic response of the vocal cavities.

As a first order approximation we can distinguish

between a "voiced" and a "fricative" or "unvoiced"

excitation of the vocal cavities.

The voiced excitation is obtained by allowing the vocal

cords to vibrate so that they modulate the stream of air

that is coming from the lungs, producing an almost periodic

signal. For example, vowels are generated in this way and

they are perceived as continuous non-hissy sounds because

the excitation is essentially periodic.

In contrast, unvoiced or fricative excitation is

achieved by forcing the air flow through a constriction in

the vocal tract with a sufficiently high Reynolds number,

thereby causing turbulence. This excitation has random or

"noisy" characteristics and, therefore, the resulting speech

sounds will be hissy or fricative (friction-like) as in the

case of the consonants /s/ and /f/.

Both voiced and unvoiced excitation signals have a

rather wide spectrum. Typically the power spectrum of the

voiced excitation decreases with an average slope of

12 db/octave [7] while the unvoiced spectrum can be

considered white over the speech frequencies [21].











The spectral characteristics of the excitation are

further modified by the acoustic transfer function of the

vocal cavities. Sound transmission is more efficient at the

resonance frequencies of the supraglottal vocal system and,

therefore, the acoustic energy of the radiated speech sound

is concentrated around these frequencies (formant

frequencies). During the generation of connected speech,

the shape and acoustic characteristics of the vocal cavities

are continuously changed by precisely timed movements of the

lips, tongue and of the other vocal organs. This process of

adjustment of the vocal cavity shape to produce different

types of speech sounds is called articulation.

These considerations about speech physiology lead to

the simple but extremely useful source-tract model of speech

production [22], which has been explicitly or implicitly

used since the earliest work in the area of speech synthesis

[23-25]. This model is still employed in linear prediction

and formant synthesis. It consists (see Figure 2.2) of a

filter whose transfer function models the acoustic response

of the vocal cavities and of an excitation source that

generates either a periodic or a random signal for the

production of voiced or unvoiced sounds, respectively. The

operation of the source and the filter transfer function can

be determined by external control parameters to obtain an

output signal with the same acoustic properties of speech.
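
A minimal numerical sketch of this source-tract model (ours, with arbitrary illustrative filter values, not taken from the dissertation) makes the structure of Figure 2.2 concrete:

import numpy as np

fs = 8000                  # sampling frequency, Hz
n = fs // 2                # half a second of samples

def excitation(voiced, f0=100.0):
    """g(t): periodic pulse train (voiced) or white noise (unvoiced)."""
    if voiced:
        g = np.zeros(n)
        g[::int(fs / f0)] = 1.0  # one impulse per pitch period
        return g
    return np.random.randn(n)

# h(t): impulse response of a single damped resonator standing in for
# the vocal tract transfer function (arbitrary illustrative values)
t = np.arange(n) / fs
h = np.exp(-200.0 * t) * np.sin(2.0 * np.pi * 500.0 * t)

# s(t) = g(t)*h(t): the output is the convolution of source and tract
s_voiced = np.convolve(excitation(True), h)[:n]
s_unvoiced = np.convolve(excitation(False), h)[:n]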















[Figure 2.2. The source-tract speech production model. A sound source g(t), either voiced (quasi-periodic pulses) or unvoiced (noise), drives the vocal tract transfer function h(t); the sound output is s(t) = g(t)*h(t).]












2.2) Linear Prediction

The simplest and most widely used implementation of the

source tract model of speech production is Linear Prediction

synthesis. It is a synthesis by analysis method that was

first proposed by Atal and Hanauer [26] and it has been

investigated for a great variety of speech applications.

The method is particularly suitable for digital

implementation and it assumes a time discrete model of

speech production, typically with a sampling frequency

between 7 and 10 kHz. It consists (see Figure 2.3) of two

signal generators of voiced and unvoiced excitation and of

an all pole transfer function



H(z) = \frac{1}{1 + \sum_{k=1}^{p} a_k z^{-k}}          (2.2.1)


to represent the acoustic response of the vocal cavities.

Mathematically, the transfer function H(z) is determined by

the predictor coefficients, a_k. The great advantage of

linear prediction is that an estimate of the predictor

parameters can be efficiently obtained using an analysis

phase of natural speech. The literature presents several

algorithms to perform this analysis. Perhaps the schemes

most used for speech synthesis applications are the

autocorrelation method [27] and, for hardware

implementation, the PARCOR algorithm [28].
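
The following is a hedged sketch of the autocorrelation method [27] in the form of the Levinson-Durbin recursion; the function and variable names are ours, and a real analyzer would also window and pre-emphasize each frame:

import numpy as np

def lpc_autocorrelation(x, p):
    """Predictor coefficients a[1..p] for the all-pole model of (2.2.1)."""
    # Autocorrelation of the speech frame x
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # Reflection (PARCOR) coefficient of order i
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)  # residual prediction error
    return a, err

frame = np.random.randn(240)  # stand-in for a 30 ms frame at 8 kHz
a, err = lpc_autocorrelation(frame, p=10)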














[Figure 2.3. Linear prediction speech production model. A pulse generator (controlled by the pitch) and a white noise generator are selected by a voiced/unvoiced (V/UV) switch and drive the all-pole transfer function H(z), whose predictor parameters a_k are obtained by analysis of natural speech.]












During speech articulation the vocal cavity transfer

function is continuously changing and, ideally, the result

of the analysis of natural speech is a time varying estimate

of the linear predictor parameters (or of an equivalent

representation) as they change during speech production.

Also, the literature reports many algorithms which can

be applied to natural speech to extract the fundamental

(vocal cord oscillation) frequency and to perform the

voiced/unvoiced decision [29-30]. Some of them have been

implemented on special purpose integrated circuits for real

time applications [3] [5].

Therefore, linear prediction provides a complete

synthesis by analysis method in which all the control

parameters of Figure 2.3 can be derived directly from

natural speech.

The shortcoming of linear prediction is that the

transfer function (2.2.1) cannot properly model the

production of nasal, fricative and stop consonants. The all

pole approximation of the vocal tract transfer function is

in fact theoretically justified only for vowel sounds, and

even in this case the linear prediction model assumes a

minimum phase excitation signal.

In spite of these disadvantages, however, linear

prediction performs well for speech synthesis applications

because the approximations introduced by the model of










Figure 2.3 do not severely affect the perceptual properties

of the speech sound. In fact the human hearing system

appears to be especially sensitive to the magnitude of the

short-time spectrum of speech [31], that is usually

adequately approximated by the linear prediction transfer

function [32]. Perhaps the minimum phase approximation is

responsible for a characteristic "buzziness" [33] of the

synthetic speech. A great deal of research is being

dedicated to improve the quality of linear prediction

synthesis by using a more suitable excitation than impulse

sequences [34-35].

Linear prediction of speech is, therefore, most useful

in those applications that require a fully automated

synthesis by analysis process. Speech compression, linear

prediction vocoders, very low bit rate transmissions are

typical examples. Also linear prediction has application in

speech products where speech may be recorded for later

playback with stringent memory constraints.



2.3) Formant Synthesis

Similar to linear prediction, formant synthesis is

based on the source-tract speech production model. However,

in this case the filter that models the vocal tract is not

implemented by an all pole digital filter but it consists of

a number of resonators whose transfer function is controlled

by their resonance (formant) frequencies and bandwidths.













Among the many formant synthesizers reported in the

literature [16] [36-37] two general configurations are

common. In one type of configuration, the formant

resonators that simulate the transfer function of the vocal

tract are connected in parallel. Each resonator is followed

by an amplitude gain control which determines the spectral

peak level. In the other type of synthesizer the resonators

are in cascade. The advantage here is that the relative

amplitudes of formant peaks for the generation of vowels are

produced correctly with no need for individual amplitude

control for each formant [16].
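
A hedged sketch of the cascade branch of such a synthesizer: each formant becomes one two-pole digital resonator set by its frequency F and bandwidth B (a standard resonator design; the formant values below are illustrative, not taken from the dissertation):

import numpy as np

fs = 10000.0  # sampling frequency, Hz

def resonator_coeffs(F, B):
    """Two-pole resonator y[n] = A*x[n] + b*y[n-1] + c*y[n-2]."""
    c = -np.exp(-2.0 * np.pi * B / fs)
    b = 2.0 * np.exp(-np.pi * B / fs) * np.cos(2.0 * np.pi * F / fs)
    A = 1.0 - b - c  # unity gain at zero frequency
    return A, b, c

def cascade(x, formants):
    """Pass the excitation x through the resonators in cascade."""
    for F, B in formants:
        A, b, c = resonator_coeffs(F, B)
        y = np.zeros_like(x)
        for i in range(len(x)):
            y[i] = A * x[i] + b * y[i - 1] + c * y[i - 2]  # y[-1] = y[-2] = 0
        x = y
    return x

# /a/-like formant frequency/bandwidth pairs (illustrative values, Hz)
vowel = cascade(np.random.randn(1000), [(730, 60), (1090, 90), (2440, 120)])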

Formant synthesis was developed prior to linear

prediction synthesis probably because it is amenable to

analog hardware implementation. However, many synthesizers

have been implemented on general purpose digital computers

to obtain a more flexible design. The most recent and

perhaps most complete synthesizer reported in the literature

has been designed by Klatt [16]; it consists of cascade

and parallel configurations which are used for the

production of vowels and consonants, respectively.

An advantage of formant synthesizers is their

flexibility. In fact, thanks to the parallel configuration,

the filter transfer function may have both zeroes and

poles. Klatt's synthesizer, for example, has 39 control

parameters which not only control the filter transfer











function in terms of formant frequencies and bandwidths but

also determine different types of excitation such as voiced,

unvoiced, mixed, sinusoidal for the generation of

consonantal murmurs, and burst-like for the simulation of

stop release.

Formant synthesizers are particularly useful in

psychoacoustic studies since the synthetic speech waveform

can be precisely controlled by parameters which are more

directly related to the acoustic characteristics of speech

than the linear prediction coefficients. For the same

reason they are also more suitable for speech synthesis by

rule.

Synthesis by analysis with formant synthesizers

requires the extraction of the formant information from the

speech signal. For this purpose the literature presents

several formant analysis algorithms [38-41]. However linear

prediction analysis is simpler and more efficient and

therefore linear prediction is usually preferred in

synthesis by analysis type of applications.



2.4) Articulatory Synthesis

Linear prediction and formant synthesis are not

completely suitable for our research since they do not

faithfully model the human speech production mechanism.












A first disadvantage, which is inherent to the source-

filter model, is the assumption of separability between the

excitation and the acoustic properties of the vocal tract.

Clearly this assumption is not valid for the production of

fricative sounds in which the excitation depends on the

vocal tract constriction or for the generation of stops in

which the tract closure arrests the glottal air flow.

Source tract separability is a first order modeling

approximation even in the production of vowels as documented

by many recent papers concerning the nature of source tract

interaction [42-45]. We will address this issue in

Chapter 6 in more detail.

The second "disadvantage" (for our purpose) is that the

filter in the source-tract model (and also linear prediction

and formant synthesis) accounts only for the acoustic

input/output transfer function of the vocal cavities which

is estimated by means of analysis or "identification"

algorithms. For example, Linear Prediction and Formant

synthesis cannot model the aerodynamic and myoelastic

effects that determine the vocal cords vibration, the air

pressure distribution along the oral and nasal tracts and

the vibration of the vocal cavity walls.

In other words, linear prediction and formant synthesis

define an algorithm for the generation of signals with the

same acoustic features of natural speech but they do not

model the physiological mechanism of speech generation.













These limitations of the source-tract model of speech

production can be overcome by the so-called "articulatory"

synthesis that is based on a physiological model of speech

production. It consists (see Figure 2.4) of at least two

separate components,

1) an articulatory model that has as input the time

varying vocal organ positions during speech

production to generate a description of the

corresponding vocal cavity shape and

2) an acoustic model which, given a certain time

varying vocal cavity configuration, is capable of

estimating not only the corresponding speech

waveform but also the pressure and volume velocity

distribution in the vocal tract, the vibration

pattern of the vocal cords and of the vocal cavity

walls.

The strength of linear prediction and formant

synthesis, namely the existence of analysis algorithms for

natural speech is, however, a weak point for articulatory

synthesis. Even if several methods for estimating the vocal

tract configuration are presented in the literature [46-51],

these procedures cannot be easily applied to all the

different types of speech sounds. This estimation is made

even more difficult to achieve by the fact that the acoustic

to articulatory transformation is not unique [52-53]. This

















[Figure 2.4. Articulatory synthesis of speech. The articulatory model maps the vocal organ positions into a description of the vocal cavity shape; the vocal cavity acoustic model then produces the synthetic speech, the vibration of the vocal cords, and the volume velocity and pressure distribution in the vocal cavity.]












disadvantage, together with high computational requirements,

limits the use of articulatory synthesis for speech

applications, although such use has recently been investigated by

Flanagan et al. [54].

In the following, Chapters 3 and 4 will discuss the

acoustic modeling of the vocal cavities and its

implementation with numerical simulation techniques.

Chapter 5 concentrates on the articulation model and the

computer graphic techniques used for its implementation.



















CHAPTER 3
ACOUSTIC MODELS OF THE VOCAL CAVITIES

The qualitative descriptions of the human speech

production mechanism and of the articulatory method of

speech synthesis that we have given in Section 2.1 cannot be

directly implemented on a digital computer. This knowledge

must be transformed into an analytical representation of the

physics of sound generation and propagation in the vocal

cavities.

The mathematical model of the vocal cavity acoustics

can be conveniently interpreted by means of equivalent

circuits. In such a representation the electrical current

and voltage correspond respectively to the air pressure and

volume velocity in the vocal cavities. In the following we

will always express all the physical dimensions in C.G.S.

units.



3.1) Sound Propagation in the Vocal Cavities.

3.1.a) Derivation of the Model.

Sound is nearly synonymous with vibration. Sound waves

are originated by mechanical vibrations and are propagated

in air or other media by vibrating the particles of the

media. The fundamental laws of mechanics, such as momentum,













mass and energy conservation and of fluid dynamics can be

applied to the compressible, low viscosity medium (air) to

quantitatively account for sound propagation.

The oral and nasal tracts are three-dimensional lossy

cavities of non-uniform cross-sections and non-rigid

walls. Their acoustic characteristics are described by a

three dimensional Navier-Stokes partial differential

equation with boundary conditions appropriate to the yielding

walls. However, in practice, the solution of this

mathematical formulation requires

"an exhorbitant amount of computation, and we do not
even know the exact shape of the vocal tract and the
characteristics of the walls to take advantage of such a
rigorous approach." [55]

The simplifying assumption commonly made in the

literature is plane wave propagation. Most of the sound

energy during speech is contained in the frequency range

between 80 and 8000 Hz [56] but speech quality is not

significantly affected if only the frequencies below 5 kHz

are retained [16]. In this frequency range the cross

sectional dimensions of the vocal tract are sufficiently

small compared to the sound wavelength so that the departure

from plane wave propagation is not significant.

Thanks to the plane wave propagation assumption, the

geometric modeling of the vocal cavities can be greatly

simplified. Acoustically the vocal tract becomes equivalent

to a circular pipe of non-uniform cross-section (see












Figure 3.1) whose physical dimensions are completely

described by its cross-sectional area, A(x), as a function

of the distance x along the tube (area function). The sound

propagation can now be modeled by a one dimensional wave

equation. If the losses due to viscosity and thermal

conduction either in the bulk of the fluid or at the walls

of the tube are neglected, the following system of

differential equations accurately describes the wave

propagation [29]



\frac{\partial p(x,t)}{\partial x} + \frac{\rho}{A(x,t)}\,\frac{\partial u(x,t)}{\partial t} = 0          (3.1.1a)

\frac{\partial u(x,t)}{\partial x} + \frac{1}{\rho c^2}\,\frac{\partial \bigl( p(x,t)\,A(x,t) \bigr)}{\partial t} = -\,\frac{\partial A(x,t)}{\partial t}          (3.1.1b)

where

    x      = displacement along the axis of the tube
    p(x,t) = pressure in the tube as function of time and displacement
    u(x,t) = air volume velocity in the tube
    A(x,t) = area function of the tube
    \rho   = air density
    c      = sound velocity.

The first and second equations (3.1.1) correspond to

Newton's and continuity law, respectively.













[Figure 3.1. The vocal tract, from glottis to mouth, represented by a non-uniform pipe of length L and its area function A(x).]












Equations (3.1.1) indicate that the area function

A(x,t) is varying with time. Physiologically this is caused

by two different phenomena,

a) the voluntary movements of the vocal organs during

speech articulation, and

b) the vibration of the vocal cavity walls that are

caused by the variations of the vocal tract

pressure during speech.

We therefore represent the vocal tract area function as

a summation of two components



A(x,t) = A_0(x,t) + \delta A(x,t) \cong A_0(x) + \delta A(x,t)          (3.1.2)



The first component A_0(x,t) is determined by a) above and it

represents the "nominal" cross-sectional area of the vocal

tract. Since the movements of the vocal organs are slow

compared to the sound propagation, this component can be

considered time invariant with a good approximation. The

second component δA(x,t) represents the perturbation of the

cross-sectional area of the vocal tract that is caused by b)

above. Its dynamics cannot be neglected as compared with

the acoustic propagation. Nevertheless, this component has

a relatively small magnitude compared to A_0(x).

If we substitute (3.1.2) into (3.1.1) and if we neglect

2nd order terms we obtain











\frac{\partial p(x,t)}{\partial x} + \frac{\rho}{A_0(x)}\,\frac{\partial u(x,t)}{\partial t} = 0          (3.1.3a)

\frac{\partial u(x,t)}{\partial x} + \frac{A_0(x)}{\rho c^2}\,\frac{\partial p(x,t)}{\partial t} = -\,\frac{\partial \bigl( \delta A(x,t) \bigr)}{\partial t}          (3.1.3b)

The partial differential equations (3.1.3) can be

approximated with a system of ordinary differential

equations. Let the acoustic pipe be represented by a

sequence of N elemental lengths with circular and uniform

cross-sections A_i, i = 1, ..., N. This is equivalent to

approximating the area function A(x) using a stepwise method

as shown in Figure 3.2.

If each elemental length is sufficiently shorter than

the sound wavelength, we can suppose that the pressure and

volume velocity are independent of the position in the

elemental length itself. Instead of the functions p(x,t)

and u(x,t), we need to consider a finite number of time

functions only



p_i(t), u_i(t) ;   i = 1, ..., N



that represent the pressure and volume velocity in the ith

elemental section as a function of time. The partial

derivatives can now be approximated by finite differences
























[Figure 3.2. Stepwise approximation of the area function (cross-sectional area in cm² versus distance along the tract, 0 to L cm).]













\frac{\partial p(x,t)}{\partial x} \approx \frac{p_i(t) - p_{i-1}(t)}{\Delta x}

\frac{\partial u(x,t)}{\partial x} \approx \frac{u_i(t) - u_{i-1}(t)}{\Delta x}

where p_i(t), u_i(t) = pressure and volume velocity in the ith vocal tract section

      \Delta x = L/N = length of each elemental section.

Therefore equations (3.1.3) become

\frac{d u_i(t)}{dt} = -\,\frac{A_i}{\rho\,\Delta x}\,\bigl( p_i(t) - p_{i-1}(t) \bigr)          (3.1.4a)

\frac{d p_i(t)}{dt} = -\,\frac{\rho c^2}{A_i\,\Delta x}\,\Bigl( u_{i+1}(t) - u_i(t) + \Delta x\,\frac{d\,\delta A_i(t)}{dt} \Bigr)          (3.1.4b)

i = 1, ..., N
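
A minimal sketch (ours) of how equations (3.1.4) can be marched forward in time for a rigid-walled, lossless uniform tube; wall vibration, losses and realistic boundary terminations are omitted, and all values are illustrative:

import numpy as np

rho, c = 1.14e-3, 3.5e4  # air density (g/cm^3) and sound speed (cm/s), C.G.S.
L_tract, N = 17.0, 20    # tract length (cm) and number of sections
dx = L_tract / N
A = np.full(N, 5.0)      # uniform area function A_i, cm^2

p = np.zeros(N)          # p_i: pressure in each section
u = np.zeros(N + 1)      # volume velocity at the section boundaries
h = 1.0e-5               # integration step, s (satisfies c*h/dx < 1 here)

for step in range(200):
    u[0] = 100.0         # constant driving flow at the glottis, cm^3/s
    # (3.1.4a): du_i/dt = -(A_i/(rho*dx)) * (p_i - p_{i-1})
    u[1:N] -= h * (A[1:] / (rho * dx)) * (p[1:] - p[:-1])
    u[N] = 0.0           # crude rigid (closed) termination at the lips
    # (3.1.4b) with rigid walls: dp_i/dt = -(rho*c^2/(A_i*dx)) * (u_{i+1} - u_i)
    p -= h * (rho * c * c / (A * dx)) * (u[1:] - u[:-1])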



Equations (3.1.4) can be represented by the equivalent

electrical circuit as shown in Figure 3.3. Inductor L. and
1
capacitor Ci, represent the inertance and compressibility of

the air in the ith elemental length of the vocal tract.

They are defined in terms of the cross-sectional area Ai as


























[Figure 3.3. Vocal tract elemental length and its equivalent circuit: inductor L_i and capacitor C_i for the air in the section, with volume velocities U_i and U_{i+1} at its ends and the wall branch ZW_i (R_W, L_W, C_W) carrying the current U_{Wi}.]











C_i = \frac{A_i\,\Delta x}{\rho c^2}\,, \qquad L_i = \frac{\rho\,\Delta x}{A_i}\,, \qquad i = 1, \ldots, N

The component Δx·dδA_i(t)/dt in equation (3.1.4b), which

represents the vibration of the cavity walls, is represented

by the current in the impedance ZW_i, as will be shown in the

next section.



3.1.b) Modeling the Yielding Wall Properties

From equation (3.1.4b) we can see that the effect of

wall vibrations is to generate in the ith elemental section

of the vocal tract an additional volume velocity component

equal to

Δx·dδA_i(t)/dt


which is represented in Figure 3.3 by the current uWi in the

impedance ZWi. We will now consider how ZWi is related to

the mechanical properties of the vibrating walls.

Consider Figure 3.4 which shows an elemental length of

the pipe in which one wall is allowed to move under the

forcing action of the pressure p(t) in the pipe itself. Let

m, k and d represent the mass, elastic constant and damping

factor of a unit surface. Since the total vibrating surface

is lAx, and since we assume the walls to be locally

reacting, then the total mass, elastic constant and damping

factor of the vibrating wall are
















































[Figure 3.4. Mechanical model of an elemental length Δx of the vocal tract with a yielding surface of width l.]











m(l Ax), k(l Ax), d(l Ax),



respectively.

According to Newton's law the forcing action on the

wall is


F = l\,\Delta x\;p(t) = m\,l\,\Delta x\,\frac{d^2 y(t)}{dt^2} + d\,l\,\Delta x\,\frac{d y(t)}{dt} + k\,l\,\Delta x\;y(t)          (3.1.5)


where y(t) is the wall displacement from the neutral

position (when the pressure in the tract is equal to

naught).

The airflow generated by the wall motion is


u_W(t) = \Delta x\;l\,\frac{d y(t)}{dt}          (3.1.6)


and by substitution of (3.1.6) into (3.1.5) we obtain


p(t) = \frac{m}{l\,\Delta x}\,\frac{d u_W(t)}{dt} + \frac{d}{l\,\Delta x}\,u_W(t) + \frac{k}{l\,\Delta x} \int u_W(t)\,dt          (3.1.7)

In the frequency domain, (3.1.7) can be equivalently

represented by the impedance,












Z_W(s) = s\,L_W + R_W + \frac{1}{s\,C_W}          (3.1.8)

with

P(s) = Z_W(s)\,U_W(s)

L_W = \frac{m}{l\,\Delta x}\,; \qquad R_W = \frac{d}{l\,\Delta x}\,; \qquad C_W = \frac{l\,\Delta x}{k}          (3.1.9)


Such an impedance appears to be inversely related to the

vibrating surface (lAx) and its components are

1. an inductance LW proportional to the mass of the

unit surface of the vibrating wall,

2. a resistor RW proportional to the viscous damping

per unit surface of the vibrating walls, and

3. a capacitor CW inversely proportional to the

elastic constant per unit surface.

Direct measurements of the mechanical properties of

different types of human tissues have been reported in the

literature [57]; however, such measurements are not directly

available for the vocal tract tissues. Moreover, it is

difficult to estimate the lateral surface of the vocal

cavities that is required to compute the numerical value of

the impedance ZW according to the above derivation.












We, therefore, use a slightly different approach

proposed by Sondhi [55]. He assumed that the vibrating

surface is proportional to the volume of the vocal cavities

and he estimated the mechanical properties of the vibrating

surface on the basis of acoustic measurements. The modeling

results match well with the direct measurements of the

mechanical impedance of human tissues performed by Ishizaka

et al. [57]. The model can be formulated as follows,

Adjust the inductive reactance LWi in the ith elemental

section to match the observed first formant frequency

for the closed mouth condition of about 200 Hz [22] [58]


L_{W_i} = \frac{0.0858}{V_i} = \frac{0.0858}{\Delta x\,A_i}


Next, adjust the wall loss component RWi to match the

closed glottis formant bandwidths [59]



R_{W_i} = 130\,\pi\,L_{W_i}


Choose a value of CWi to obtain a resonance frequency of

the wall compatible with the direct measurements of

Ishizaka et al. [57]


C_{W_i} = \frac{1}{(2\pi \cdot 30)^2\,L_{W_i}}
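
In code, these fits give the per-section wall elements directly (a sketch of ours in C.G.S. units; the section length and areas below are assumed):

import numpy as np

dx = 0.85                      # section length, cm (assumed)
A = np.array([5.0, 4.0, 3.0])  # example cross-sectional areas, cm^2

V = dx * A                     # section volumes
LW = 0.0858 / V                # inductance: ~200 Hz closed-mouth F1 [22][58]
RW = 130.0 * np.pi * LW        # loss: closed-glottis formant bandwidths [59]
CW = 1.0 / ((2.0 * np.pi * 30.0) ** 2 * LW)  # ~30 Hz wall resonance [57]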












3.1.c) Nasal Coupling

During the production of nasal consonants and nasalized

vowels the nasal cavity is acoustically coupled to the oral

cavity by lowering the soft palate and opening the

velopharyngeal orifice.

The coupling passage can be represented as a

constriction of variable cross-sectional area and 1.5 cm

long [22] that can be modeled by an inductor (to account for

the air inertance) in series with a resistor (to account for

viscous losses in the passage).

The wave propagation in the nasal tract can be modeled

in terms of the nasal tract cross-sectional area as

discussed in Section 3.1.a for the vocal tract. However,

we will use a different equivalent circuit to better account

for the nasal tract losses as we will discuss later in

Section 6.6.



3.2) Excitation Modeling

3.2.a) Subglottal Pressure

The source of energy for speech production lies in the

thoracic and abdominal musculatures. Air is drawn into the

lungs by enlarging the chest cavity and lowering the

diaphragm. It is expelled by contracting the rib cage and

increasing the lung pressure. The subglottal or lung

pressure typically ranges from 4 cm H20 for the production












of soft sounds to 20 cm H20 or more for the generation of

very loud, high pitched speech.

During speech the lung pressure is slowly varying in

comparison with the acoustic propagation in the vocal

cavities. We can, therefore, represent the lung pressure

with a continuous voltage generator whose value is

controlled in time by an external parameter Ps(t).



3.2.b) Voiced Excitation

During speech, the air is forced from the lungs through

the trachea into the pharynx or throat cavity. On top of

the trachea is mounted the larynx (see Figure 3.5), a

cartilagineous structure that houses two lips of ligament

and muscle called the vocal cords or vocal folds.

The vocal cords are posteriorly supported by the

arytenoid cartilages (see Figure 3.5). Their position and

the dimension of the opening between them (the glottis) can

be controlled by voluntary movements of the arytenoid

cartilages.

For the generation of voiced sounds the vocal cords are

brought close to each other so that the glottal aperture

becomes very small. As air is expelled from the lungs,

strong areodynamic effects put the vocal cords into a rapid

oscillation. Qualitatively, when the vocal cords are close

to each other during the oscillation cycle, the subglottal













































Figure 3.5. Cut-away view of the human larynx (from [7]).
VC vocal cords. AC arytenoid cartilages.
TC thyroid cartilage.













pressure forces them apart. This, however, increases the

air flow in the glottis and the consequent Bernoulli

pressure drop between the vocal cords approximates the vocal

cords again. In this way a mechanical "relaxation"

oscillator is developed which modulates the airflow from the

lungs into a quasiperiodic voiced excitation.

The vocal fold oscillation frequency determines

important perceptual characteristics of voiced speech and it

is called the "pitch" frequency or fundamental frequency,

F_0.
The first quantitative studies of the aerodynamics of

the larynx were carried out by van den Berg et al. who made

steady flow measurements from plaster casts of a "typical"

larynx [60-61].

The first quantitative self-oscillating model of the

vocal folds was proposed by Flanagan and Landgraf [62] after

Flanagan and Meinhart's studies concerning source tract

interaction [63]. The fundamental idea was to combine van

den Berg's results with the 2nd order mechanical model of

the vocal cords shown in Figure 3.6. An acoustic-mechanic

relaxation oscillator was obtained.

Bilateral symmetry was assumed and only a lateral

displacement x of the masses was allowed. Therefore, only a

2nd order differential equation was needed to describe the

motion of the mass






























[Figure 3.6. One mass model of the vocal cords: a single mass with position-dependent stiffness K(x), coupled to the trachea and lungs below and to the vocal tract above.]













M\,\ddot{x} + B(x)\,\dot{x} + K(x)\,x = F(t)



The mechanical damping B(x) and elastic constants K(x) are

properly defined functions of the vocal cord position x.

The forcing action F(t) depends on the air pressure

distribution along the glottis which was estimated according

to van den Berg's results. A modified version of the one

mass model was also designed by Mermelstein [64].
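
A hedged numerical sketch of integrating this equation with constant coefficients illustrates the oscillator structure (in the actual model B and K depend on x and F(t) is coupled to the glottal pressure; all constants here are ours, for illustration only):

import numpy as np

M, B, K = 0.1, 20.0, 8.0e4  # g, dyn*s/cm, dyn/cm (illustrative constants)
h = 1.0e-5                  # time step, s

def force(t):
    return 1.0e4 if t < 0.02 else 0.0  # crude subglottal pressure pulse, dyn

x, v = 0.0, 0.0  # lateral displacement (cm) and velocity (cm/s)
xs = []
for n in range(4000):
    t = n * h
    a = (force(t) - B * v - K * x) / M  # Newton's law for the single mass
    v += h * a                          # semi-implicit Euler update:
    x += h * v                          # new velocity, then new position
    xs.append(x)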

Even if the one mass model had been able to simulate

important physiological characteristics of vocal cord

vibration, it presented several inconveniencies.

1) The frequency of vocal fold vibration was too

dependent on the vocal tract shape, indicating too

great a source tract interaction.

2) The model was unable to account for phase

differences between the upper and lower edge of the

cords.

3) The model was unable to oscillate with a capacitive

vocal tract input impedance.

These difficulties were overcome by the two mass model

of the vocal cords that is shown in Figure 3.7. It was

designed by Ishizaka and Matsudaira [65] and first

implemented by Ishizaka and Flanagan [66-67]. A mechanical

coupling between the two masses is represented by the spring

constant k_c. The springs s_1 and s_2 were given a non-linear






















[Figure 3.7. Two mass model of the vocal cords and glottis equivalent circuit. The pressures Ps, P11, P12, P21, P22, P1 along the contraction, glottis and expansion are modeled by an equivalent circuit with elements Rc, RV1, Lg1, R12, RV2, Lg2, Re carrying the glottal flow Ug.]











characteristic according to the stiffness measured on

excised human vocal cords. As in the case of the one mass

model the viscous damping was changing during the vocal

cords vibration period.

For computer simulation it is convenient to represent

the pressure distribution along the two masses with the

voltage values in an equivalent circuit. In Figure 3.7

resistance Rc accounts for the Bernoulli pressure drop and

"vena contract" effect at the inlet of the glottis.

Resistances RV1 and RV2 model the viscous losses in the

glottis. Resistance R12 accounts for the pressure

difference between the two masses caused by the Bernoulli

effect. The inductors model the air inertance in the

glottis.

The two mass model of the vocal cords, that has been

used in connection with vocal tract synthesizers [67-68],

uses as control parameters the glottal neutral area Ag0 and

the cord tension Q.

Ag0 determines the glottal area in the absence of phonation

and it is physiologically related to the position of the

arytenoid cartilages. Q controls the values of the elastic

constant of the model and greatly affects the two mass model

oscillation period. The suitability of the two mass model

and of its control parameters for speech synthesis has been

further validated in [69] and [70].












The acoustic synthesizer that we have implemented uses

the two mass model to provide voiced excitation. We

therefore account for source tract interaction since the

current in the equivalent circuit of the glottis (see

Figure 3.7) is dependent on the voltage pl that models the

pressure in the vocal tract just above the vocal cords.



3.2.c) Unvoiced Excitation

Speech sounds are generally excited by modulating the

air flow through a constriction of the glottal and

supraglottal system. For voiced sounds this modulation is

obtained through rapid changes of the glottal constriction

as explained in the review section. For fricative sounds,

the modulation comes from flow instabilities which arise by

forcing the air through a constriction with a sufficiently

high Reynolds number. In this case the classical

hypothesis of separability between the source and the tract

greatly limits the realism that can be incorporated into the

synthesizer. In fact, unvoiced excitation is greatly

dependent on the constricted area of the vocal tract itself.

The fricative self-excitation of the vocal cavities was

first modeled by Flanagan and Cherry [71]. The idea was to

use a resistor RNi and noise generator VNi in the equivalent

circuit of the ith elemental length of the vocal tract (see

Figure 3.8). The values of the resistor and of the noise





























[Figure 3.8. Equivalent circuit of vocal tract elemental length with fricative excitation: a noise generator VN_i and series resistor RN_i are added to the elemental network of Figure 3.3.]












generator variance depend on the Reynolds number of the

flow in the ith section [71]. The spectrum of the turbulent

noise VNi can be assumed white with a good approximation

[21].

In our simulation we have modeled the fricative

excitation by means of two sources. One is always located

in the first vocal tract section to generate aspirated

sounds. The second is not bound to a fixed position but can

be moved along with the vocal tract constriction location.
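
A hedged sketch of such a Reynolds-number-conditioned noise source, in the spirit of [71] (the exact functional form and constants of [71] are not reproduced here; the threshold and scaling below are illustrative assumptions):

import numpy as np

rho, mu = 1.14e-3, 1.86e-4  # air density (g/cm^3) and viscosity (poise)
Re_crit = 1800.0            # assumed critical Reynolds number

def noise_sample(u, A, rng):
    """Turbulent noise for volume velocity u (cm^3/s) through area A (cm^2)."""
    d = 2.0 * np.sqrt(A / np.pi)        # equivalent diameter of the constriction
    Re = rho * u * d / (mu * A)         # Reynolds number of the flow
    if Re <= Re_crit:
        return 0.0                      # laminar flow: no fricative excitation
    var = 1.0e-6 * (Re ** 2 - Re_crit ** 2)  # variance grows above threshold
    return rng.normal(0.0, np.sqrt(var))     # white spectrum [21]

rng = np.random.default_rng(0)
vn = [noise_sample(u=400.0, A=0.2, rng=rng) for _ in range(100)]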



3.3) Radiation Load

The radiation effects at the mouth and nostrils can be

accounted for by modeling the mouth and nostrils as a

radiating surface placed on a sphere (the head) with a

radius of about 9 cm.

Flanagan [7] has proposed a simplified equivalent

circuit for the radiation load model by using a parallel

combination of an inductor and a resistor with values


R_R = \frac{128}{9\pi^2}\,, \qquad L_R = \frac{8a}{3\pi c}


where a is the radius of the (circular) radiating surface

(mouth or nostrils). Titze [19] has shown that this

approximation, which we are using now, is valid also at

relatively high frequencies, when the speech wavelength has

the same order of magnitude as the mouth radius.
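
A small sketch (ours) evaluating these radiation elements for an assumed mouth opening; converting them to absolute C.G.S. impedances would additionally require the characteristic impedance of the mouth section, which we do not assume here:

import numpy as np

c = 3.5e4                         # sound speed, cm/s
Am = 3.0                          # mouth opening area, cm^2 (illustrative)
a = np.sqrt(Am / np.pi)           # radius of the circular radiating surface

RR = 128.0 / (9.0 * np.pi ** 2)   # radiation resistance (normalized)
LR = 8.0 * a / (3.0 * np.pi * c)  # radiation inductance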












Our model is also able to account for the sound

pressure component that is radiated through the vibration of

the vocal cavity walls. The contribution to this component

from each elementary length of the vocal cavities is

represented as a voltage drop across a suitable impedance

[67] in series to the equivalent circuit of the yielding

wall defined in Section 3.1b.



3.4) Remarks and Other Acoustic Models

In this chapter we discussed the derivation of an

electrical circuit that models the sound propagation in the

vocal cavities. These considerations can be summarized in

Figure 3.9.

The vocal and nasal tracts are represented by two

circular pipes with non-uniform cross-section (plane wave

propagation assumption). Their equivalent circuit (nasal

and vocal tract networks in Figure 3.9) are made by a chain

of elementary circuits. Each circuit models the wave

propagation as a short length of cavity according to the

derivation of Sections 3.1.a, 3.1.b, 3.2.c.

The two mass models of the vocal cords, its control

parameters, Ago and Q, and the glottal impedance have been

treated in detail in Section 3.2.b.

Sections 3.2.a, 3.1.c, 3.3 have been concerned with the

subglottal pressure Ps, the velar coupling ZV and the

















[Figure 3.9. Equivalent circuit of the vocal cavities: the vocal cord model joins the vocal tract network (sections A1 ... AN of the area function A(x,t)) and the nasal tract network. Control parameters: subglottal pressure Ps(t), cord tension Q(t), cord rest area Ag0(t), nasal coupling NC(t), and the area function, all driven by muscle force.]












radiation impedances at the mouth and nostrils, which are

shown in Figure 3.9.

The approach to the acoustic modeling of the vocal

cavities that we have just reviewed is not the only one

reported in the literature. In Section 3.2.b we considered

the two mass models of the vocal cords. A more complete

model of vocal cord dynamics has been designed by Titze

[19]. He divided each cord into two vertical levels, one

level corresponding to the mucous membrane, the other to the

vocalis muscle. Each level was further divided into eight

masses which were allowed to move both vertically and

horizontally. We did not use this model because its

simulation is computationally more expensive than Flanagan's

two mass models.

Different acoustic models of the vocal tract were

designed by Kelly and Lochbaum [72] and Mermelstein [73].

The latter has been recently implemented for an articulatory

synthesizer [74].

However, these modeling approaches account for vocal

tract losses in a phenomenological way and they do not model

source tract interaction and fricative self excitation.


















CHAPTER 4
NUMERICAL SOLUTION OF THE ACOUSTIC MODEL

4.1) Requirements of the Numerical Solution Procedure

The software implementation of the acoustic model of

the vocal cavities that we have derived in the previous

chapter requires the solution of a system of ordinary

differential equations with assigned initial values.

In general we will use the notation


y'(t) = f(y(t), t),     y(0) = y_0          (4.1.1)


The numerical approach employed to solve the above problem

consists of approximating the solution y(t) as a sequence of

discrete points called mesh points. The mesh points are

assumed to be equally spaced and we indicate with h the time

interval between them. In other words, the numerical integration procedure will give us a sequence of values y_0, y_1, ..., y_n which closely approximate the actual solution y(t) at the times t_0 = 0, t_1 = h, ..., t_n = nh.

In the area of ordinary differential equations the

first step toward the solution of the problem is the

selection of the particular technique, among the many available, which will best serve the solution.













In our specific case the most stringent requirement is

the stability of the numerical method. Since the

integration of (4.1.1) is going to involve a large number of

mesh points we need a method which, for a sufficiently small

step size h, guarantees that a perturbation in one of the mesh values y_n does not increase in the subsequent values y_m, m > n.

In the following discussion, as in [75], we use a "test

equation"



\[
y'(t) = \lambda\, y(t)
\]



where λ is a complex constant. We introduce the concept of an absolute stability region, which is the set of real, nonnegative values of h and complex values of λ for which a perturbation in a value y_n does not increase from step to step.
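For the test equation, a one-step method advances the numerical solution as y_{n+1} = R(hλ)·y_n, where R is the method's growth factor, so absolute stability amounts to |R(hλ)| ≤ 1. The following sketch is illustrative only, not part of the original simulation programs; it evaluates the standard growth factor of the 4th order Runge-Kutta method on the negative real axis, where the method is stable for hλ down to about -2.785.

```python
# Sketch: absolute stability of 4th order Runge-Kutta on y' = lambda*y.
# The growth factor per step is the degree-4 Taylor polynomial of
# exp(z) evaluated at z = h*lambda (a standard result).

def rk4_growth_factor(z: complex) -> complex:
    """Amplification of a perturbation over one RK4 step."""
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24

for hl in (-0.5, -1.0, -2.0, -2.785, -3.0):
    r = abs(rk4_growth_factor(hl))
    print(f"h*lambda = {hl:6.3f}  |R| = {r:.4f}  "
          + ("stable" if r <= 1 else "unstable"))
```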

When the stability requirement is satisfied, we should

select the fastest integration method for our particular

application. This second requirement is very important

since we are dealing with a large system of differential

equations (about 100 differential equations of the 1st

order) and it takes about five hours to generate one second

of synthetic speech on our Eclipse S/130 minicomputer.

The integration speed is directly related to the step

size h of the numerical integration method. The larger the












step size the faster the integration. Unfortunately, the

precision of the numerical solution decreases when the step

size is increased and the method may become unstable. The

program to solve (4.1.1) must therefore implement an

automatic procedure for changing the step size to achieve

the maximum integration speed compatible with precision and

stability requirements. A variable step size is

particularly convenient in our case. In fact the time

constants of (4.1.1) change with the shape of the vocal

cavities and the motion of the vocal cords. A variable

control of the step size allows one to obtain the maximum

integration speed which is allowed by the method and by the

time variant differential equations. In view of these requirements we have considered both a 4th order Runge-Kutta method with control of the step size and multistep methods with control of the step size and order.

The Runge-Kutta method requires the computation of derivatives at a higher rate than multistep methods. However, it needs less overhead for each derivative computation [75]. In the next two sections the

characteristics of Runge-Kutta and multistep (also called

predictor-corrector) methods will be considered.

Section 4.4 will give a comparative discussion leading

to the selection of the Runge-Kutta method. Also, we will

describe a modification of the Runge-Kutta method, which

exploits the fact that wall vibration effects have larger












time constants than the propagation in the cavities. This

allows the reduction of the number of first order equations

to be solved by this method by almost a factor of two.



4.2) Runge-Kutta Methods

4.2.a) Derivation of the Method

Runge-Kutta methods are stable numerical procedures for

obtaining an approximate numerical solution of a system of

ordinary differential equations given by


\[
y'(t) = f(y(t),\, t), \qquad y(0) = y_0 \tag{4.2.1}
\]

The method consists of approximating the Taylor expansion



\[
y(t_0 + h) = y(t_0) + h\, y'(t_0) + \frac{h^2}{2!}\, y''(t_0) + \cdots \tag{4.2.2}
\]



so that, given an approximation of the solution at time to,

the solution at the next mesh point (to + h) can be

estimated.

To avoid the computation of higher order derivatives,

it is convenient to express (4.2.1) in integral form as


\[
y(t_0 + h) = y(t_0) + \int_{t_0}^{t_0 + h} f(y(t),\, t)\, dt \tag{4.2.3}
\]

We can approximate the above definite integral by computing f(y(t),t) at four different points of the interval (t_0, t_0 + h) by defining


\[
\begin{aligned}
K_1 &= h\, f(y(t_0),\; t_0) \\
K_2 &= h\, f(y(t_0) + \alpha K_1,\; t_0 + \alpha h) \\
K_3 &= h\, f(y(t_0) + \beta_1 K_1 + \gamma_1 K_2,\; t_0 + \alpha_1 h) \\
K_4 &= h\, f(y(t_0) + \beta_2 K_1 + \gamma_2 K_2 + \delta_2 K_3,\; t_0 + \alpha_2 h)
\end{aligned}
\tag{4.2.4}
\]

and then setting

\[
y(t_0 + h) - y(t_0) = \int_{t_0}^{t_0 + h} f(y(t),\, t)\, dt = \mu_1 K_1 + \mu_2 K_2 + \mu_3 K_3 + \mu_4 K_4 \tag{4.2.5}
\]


The problem is now to determine the α's, β's, γ's, δ_2 and μ's so that (4.2.5) is equivalent to the Taylor expansion (4.2.2) up to the highest possible power of h. We substitute (4.2.5) into (4.2.3) and we choose the undefined parameters so that the powers of h^i (i = 0, ..., 4) have the same coefficients as in (4.2.2).











We obtain a system of 8 equations in ten unknowns [76]

\[
\begin{aligned}
\mu_1 + \mu_2 + \mu_3 + \mu_4 &= 1 \\
\mu_2 \alpha + \mu_3 \alpha_1 + \mu_4 \alpha_2 &= 1/2 \\
\mu_2 \alpha^2 + \mu_3 \alpha_1^2 + \mu_4 \alpha_2^2 &= 1/3 \\
\mu_2 \alpha^3 + \mu_3 \alpha_1^3 + \mu_4 \alpha_2^3 &= 1/4 \\
\mu_3 \alpha \gamma_1 + \mu_4 (\alpha \gamma_2 + \alpha_1 \delta_2) &= 1/6 \\
\mu_3 \alpha^2 \gamma_1 + \mu_4 (\alpha^2 \gamma_2 + \alpha_1^2 \delta_2) &= 1/12 \\
\mu_3 \alpha \alpha_1 \gamma_1 + \mu_4 (\alpha \gamma_2 + \alpha_1 \delta_2)\, \alpha_2 &= 1/8 \\
\mu_4 \alpha \gamma_1 \delta_2 &= 1/24
\end{aligned}
\tag{4.2.6}
\]


which has two extra degrees of freedom that must be set

arbitrarily.

If we define α = α_1 = 1/2, the solution of (4.2.6) leads to the formulas of Kutta [76]. If α = 1/2 and δ_2 = 1 we obtain Runge's formula [76], which is equivalent to Simpson's rule











\[
\int_{t_0}^{t_0 + h} f(t)\, dt = \frac{h}{6} \left[ f(t_0) + 4 f\!\left(t_0 + \frac{h}{2}\right) + f(t_0 + h) \right]
\]

when y'(t) = f(t).

We use a calculation procedure derived from (4.2.6) by Gill [76], which minimizes the memory requirements and allows us to compensate for the round-off errors accumulated at each step.
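For reference, the following is a minimal sketch of one integration step with Kutta's classical formulas (an α = α_1 = 1/2, α_2 = 1 solution of (4.2.6)); Gill's procedure computes an equivalent step with less storage and with round-off compensation. The sketch is illustrative, not a transcription of our program.

```python
# Sketch: one classical 4th order Runge-Kutta step for y' = f(y, t),
# where y is a vector of state variables.
from typing import Callable, List

def rk4_step(f: Callable, y: List[float], t: float, h: float) -> List[float]:
    k1 = [h * v for v in f(y, t)]
    k2 = [h * v for v in f([yi + 0.5 * k for yi, k in zip(y, k1)], t + 0.5 * h)]
    k3 = [h * v for v in f([yi + 0.5 * k for yi, k in zip(y, k2)], t + 0.5 * h)]
    k4 = [h * v for v in f([yi + k for yi, k in zip(y, k3)], t + h)]
    # y(t0 + h) - y(t0) = (K1 + 2*K2 + 2*K3 + K4)/6, cf. (4.2.5)
    return [yi + (a + 2 * b + 2 * c + d) / 6.0
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
```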



4.2.b) Control of the Step Size with Runge-Kutta

In the previous derivation we have emphasized how the

Runge-Kutta method approximates the Taylor expansion (4.2.2) up to the 4th power of h. It is therefore a fourth order method with a local truncation error of order h^5:

\[
\frac{y^{(5)}(t_0)}{5!}\, h^5 + O(h^6)
\]


This accuracy is obtained without explicitly computing the

derivatives of orders higher than one, at the expense of

four evaluations of the first derivative for each mesh

point. This is a disadvantage with respect to multistep methods (to be discussed later), which use fewer computations of the first derivative to obtain the same truncation error. The number of derivative evaluations per step increases to 5.5, on average, when variable control of the step size is used with Runge-Kutta.












The step size, h, should in fact be chosen so that the

local truncation error is less than a certain maximum

acceptable value specified by the user. Unfortunately, the

truncation error cannot be directly estimated because the

Runge-Kutta procedure does not provide any information about

higher order derivatives.

A practical solution [76] is based on the results of numerical integration with steps h and 2h, respectively, i.e., the computation is performed a first time using h_1 = h and then it is repeated using h_2 = 2h.

Let

  C_2 h_2^5  denote the truncation error using step h_2 = 2h,
  C_1 h_1^5  denote the truncation error using step h_1 = h,
  y_(2)      denote the value obtained at (t_0 + 2h) using step h_2 = 2h,
  y_(1)      denote the value obtained at (t_0 + 2h) using step h_1 = h twice,
  y          denote the true value of y at time (t_0 + 2h);

then

\[
y - y_{(2)} = C_2 h_2^5, \qquad y - y_{(1)} = 2\, C_1 h_1^5
\]












But for small h, C1 = C2 (assuming that the sixth derivative

of f(y(t),t) is continuous) and therefore we obtain the

local truncation error estimate


\[
y - y_{(1)} \approx \frac{y_{(1)} - y_{(2)}}{15} \tag{4.2.7}
\]



If (4.2.7) is greater than a given tolerance, say ε_1, the increment h is halved and the procedure starts again at the last computed mesh point t_0. If it is less than ε_1, y_(1)(t_0 + h) and y_(1)(t_0 + 2h) are assumed correct. Furthermore, if it is less than ε_1/50, the next step will be tried with a doubled increment.

Unfortunately, on average this method requires a total of 5.5 function (derivative) evaluations per step, as opposed to four if the step size is not automatically controlled.
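A sketch of this step-doubling control follows, reusing the rk4_step of the previous sketch; the halving and doubling thresholds implement the rules just stated, and the names are illustrative.

```python
# Sketch: Runge-Kutta step size control by step doubling, per (4.2.7).
# eps1 is the user tolerance on the local truncation error.

def adaptive_step(f, y, t, h, eps1):
    y_small = rk4_step(f, rk4_step(f, y, t, h), t + h, h)  # two steps of h
    y_big = rk4_step(f, y, t, 2 * h)                       # one step of 2h
    err = max(abs(a - b) for a, b in zip(y_small, y_big)) / 15.0
    if err > eps1:                        # too inaccurate: halve h and retry
        return adaptive_step(f, y, t, h / 2, eps1)
    h_next = 2 * h if err < eps1 / 50 else h  # very accurate: try doubling
    return y_small, t + 2 * h, h_next
```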



4.2.c) Order Selection

In Section 4.2.a we derived the 4th order Runge-Kutta

method but Runge-Kutta methods of different orders could be

derived as well.

This observation leads to the question, "How does the

choice of the order affect the amount of work required to

integrate a system of ordinary differential equations?".

For example, small approximation errors can be more

efficiently achieved with high order methods, while for low












accuracy requirements lower order methods are to be

preferred [75].

We would, therefore, like to have an automatic

mechanism for the selection of the order of the method.

This mechanism should evaluate the truncation error

corresponding to different integration orders and choose the

order which allows for the maximum step size and integration

speed compatible with the required precision.

Unfortunately, for the same reason discussed in

relation to the problem of step size control, namely the

absence of higher order derivative estimates, the Runge-

Kutta method does not provide an efficient procedure for

automatic order selection. Therefore we always use the

"standard" 4th order Runge-Kutta method.



4.3) Multistep Methods

4.3.a) Implicit and Explicit Methods

Those methods, like Runge-Kutta, which, given an approximation of y(t) at t = t_{n-1} (say y_{n-1}), provide a technique for computing y_n ≈ y(t_n), are called one step methods. More general K-step methods require the values of the dependent variable y(t) and of its derivatives at K different mesh points t_{n-1}, t_{n-2}, ..., t_{n-K} to approximate the solution at time t_n.












The well known rules of "forward differentiation", "backward differentiation" and "trapezoidal rule" are one

step methods. They will be automatically considered in the

following discussion as particular cases.

The general expression of a multistep (K-step) method

is


\[
y_n = \sum_{i=1}^{K} \left( a_i\, y_{n-i} + \beta_i\, h y'_{n-i} \right) + \beta_0\, h y'_n, \qquad h y'_n = h\, f(y_n,\, t_n) \tag{4.3.1}
\]



If β_0 is equal to zero the method is explicit, because it provides an explicit way of computing y_n and h y'_n from the values of y and its derivatives at preceding mesh points. If β_0 is different from zero, then (4.3.1) defines an implicit multistep method, because it is in general a non-linear equation involving the function f(y_n, t_n) that must be solved for the unknown y_n.



4.3.b) Derivation of the Methods

We have considered the Adams-Bashforth and Adams-

Moulton [75] methods which are respectively explicit and

implicit methods with












\[
a_1 = 1, \qquad a_i = 0 \ \text{if}\ i \neq 1 \tag{4.3.2}
\]


Both of these methods can be obtained from the integral

relation

\[
y(t_n) = y(t_{n-1}) + \int_{t_{n-1}}^{t_n} f(y(t),\, t)\, dt
\]

The integral is estimated by approximating f(y(t),t) with an interpolating polynomial (for example, Newton's backward difference formula) through a number of known values at t = t_{n-1}, t_{n-2}, ..., t_{n-K} in the explicit case, or through the values at times t_n, t_{n-1}, ..., t_{n-K} in the implicit case.

Therefore, for the explicit Adams-Bashforth case, the

equation (4.3.1) takes the form


\[
y_n = y(t_{n-1}) + h \sum_{i=1}^{K} \beta_{K,i}\, f(y(t_{n-i}),\, t_{n-i}) \tag{4.3.4a}
\]


or with the equivalent representation in terms of finite

differences


\[
y_n = y(t_{n-1}) + h \sum_{j=0}^{K-1} \gamma_j\, \nabla^j f(y(t_{n-1}),\, t_{n-1}) \tag{4.3.4b}
\]












where the operator ∇ is defined by

\[
\nabla^j f_m \triangleq \nabla^{j-1} f_m - \nabla^{j-1} f_{m-1}, \qquad \nabla^0 f_m \triangleq f_m \tag{4.3.5}
\]


The values of y_n differ from the real solution y(t_n) by a local truncation error which is of order (K + 1) in the step size h:

\[
\mathrm{Error}_{\text{Adams-Bashforth}} = \gamma_K\, h^{K+1}\, y^{(K+1)}(t) \tag{4.3.6}
\]


The values of the γ_j and β_{K,i} coefficients are available directly from the literature [75].

In the implicit Adams-Moulton case equation (4.3.1)

takes the form


\[
y_n = y(t_{n-1}) + h \sum_{i=0}^{K} \beta^*_{K,i}\, f(y(t_{n-i}),\, t_{n-i}) \tag{4.3.7}
\]


or with the equivalent representation in terms of backward

differences


\[
y_n = y(t_{n-1}) + h \sum_{j=0}^{K} \gamma^*_j\, \nabla^j f(y(t_n),\, t_n) \tag{4.3.8}
\]


where the ∇ operator has been defined in (4.3.5).













In (4.3.8) the value of y_n differs from y(t_n) by a local truncation error that is of order (K + 2) in the step size h:

\[
\mathrm{Error}_{\text{Adams-Moulton}} = \gamma^*_{K+1}\, h^{K+2}\, y^{(K+2)}(t) \tag{4.3.9}
\]


The γ*_j and β*_{K,i} coefficient values are available from the literature [75]. In particular, the one step Adams-Bashforth method corresponds to the forward differentiation rule, while the zero and one step Adams-Moulton methods are the backward differentiation and trapezoidal rules respectively.
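To make the preceding formulas concrete, the following is a sketch of one predictor-corrector step with the two-step Adams pair; the coefficients are the standard tabulated values [75], and the implicit corrector (4.3.7) is solved by fixed-point iteration starting from the predicted value (two or three iterations usually suffice, as noted in Section 4.3.c).

```python
# Sketch: one step of a two-step Adams-Bashforth predictor /
# Adams-Moulton corrector pair for a scalar y' = f(y, t).

def abm2_step(f, y_n, f_n, f_nm1, t_n, h, iterations=2):
    # Predictor, explicit (4.3.4a) with K = 2:
    #   y_{n+1} = y_n + (h/2) * (3 f_n - f_{n-1})
    y_next = y_n + h * (3.0 * f_n - f_nm1) / 2.0
    # Corrector, implicit (4.3.7) with K = 2, iterated a few times:
    #   y_{n+1} = y_n + (h/12) * (5 f_{n+1} + 8 f_n - f_{n-1})
    for _ in range(iterations):
        y_next = y_n + h * (5.0 * f(y_next, t_n + h) + 8.0 * f_n - f_nm1) / 12.0
    return y_next
```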



4.3.c) Characteristics of Multistep Methods

An important characteristic of multistep methods is

that they require only one computation of the derivative for

each step as can be seen from equation (4.3.1). This is a

great advantage over the Runge-Kutta method, which requires at least four computations of the function f(y(t),t), and it has been the motivation for our experimentation with multistep methods.

Another feature of multistep methods is that they allow

for a rather efficient implementation of automatic control

of the step size and order of the method itself.

A complete treatment of this subject would require too long a discussion. An intuitive explanation can be obtained by observing that the step size and order selection












require an estimate of the local truncation error in terms

of different step sizes and integration orders. The order

that allows for the largest stepsize compatible with the

user defined upper limit for the truncation error is then

selected.

The local truncation error in the Adams-Bashforth and

Adams-Moulton methods (see (4.3.6) and (4.3.9)) is related

to high order derivatives which can be easily obtained in terms of the same backward differences

\[
\nabla^j h f(y(t_n),\, t_n) = \nabla^j h y'_n \approx h^{j+1}\, y^{(j+1)}
\]

that are used in the implementation of the method itself (see (4.3.4) and (4.3.8)).

implicit multistep methods is necessary to complete this

discussion.

One difference between the methods is that the γ* coefficients of the implicit methods are smaller than the γ coefficients of the explicit methods. This leads to smaller

truncation errors for the same order for the implicit case

(see (4.3.6) and (4.3.9)).

Another advantage of the implicit methods is that

K-step methods have a truncation error of order (K+2) in the

step size h (see (4.3.9)) to be compared with a truncation

error of order (K+l) for the explicit method (see












(4.3.6)). The reason for this fact is evident when we consider that (4.3.7) has (K+1) coefficients β*_{K,i}, i = 0, ..., K, while in the explicit method (4.3.4a) β_{K,0} has been set to zero.

A disadvantage of implicit methods is that the non-linear equation (4.3.7) in the unknown y_n must be solved iteratively. Usually a first "guess" of y_n is obtained by means of an explicit method and then (4.3.7) is iterated.

However, for reasonable values of the step size h, no more

than two or three iterations are usually required, and this

extra effort is more than compensated for by the better

stability properties of the implicit methods. In fact, with respect to the "test" equation y' = λy, the range of h

values for which implicit methods are stable is at least one

order of magnitude greater than in the explicit case [75].

Since the truncation errors of implicit methods are smaller,

the implicit methods can be used with a step size that is

several times larger than that of the explicit method. The

allowed increase in step size more than offsets the

additional effort of performing 2 or 3 iterations.



4.4) Method Selection

We have implemented the numerical simulation of the

acoustic model of the vocal cavities by means of both Runge-

Kutta and implicit multistep methods [77]. The Runge-Kutta












method runs about 2 to 3 times faster than the Adams-Moulton

method. This fact is at first rather surprising since we

have seen in the previous two sections that the Runge-Kutta

method requires a larger number of derivative evaluations

for each integration step.

However, in our case the evaluation of the derivatives

is not extremely time consuming. In fact, with the

exception of four state variables describing the motion of

the two mass models, the remaining system of differential

equations is essentially "uncoupled" (or characterized by a

"sparse matrix"), thanks to the "chain" structure of the

vocal cavity acoustic model that can be immediately observed

from Figure 3.9. In these conditions the high number of derivative evaluations of the Runge-Kutta method is more than compensated for by its limited overhead in comparison with the Adams predictor-corrector method [75].

We have modified the integration method to take

advantage of the dynamic properties of the vibrating walls.

From Figure 3.8, which represents the equivalent

circuit of the ith elemental section of the vocal tract as

discussed in Chapter 3, we have


\[
\frac{dp_i(t)}{dt} = \frac{1}{C_i} \left( u_i(t) - u_{i+1}(t) - u_{Wi}(t) \right) \tag{4.4.1}
\]

\[
\frac{du_i(t)}{dt} = \frac{1}{L_i} \left( p_{i-1}(t) - p_i(t) - V_{Ni} - R_{Ni}\, u_i(t) \right) \tag{4.4.2}
\]

\[
\frac{du_{Wi}(t)}{dt} = \frac{1}{L_{Wi}} \left( p_i(t) - v_i(t) - R_{Wi}\, u_{Wi}(t) \right) \tag{4.4.3}
\]

\[
\frac{dv_i(t)}{dt} = \frac{1}{C_{Wi}}\, u_{Wi}(t) \tag{4.4.4}
\]




The first two equations represent the pressure and

volume velocity propagation in the ith elemental length of

the vocal tract while (4.4.3) and (4.4.4) model the wall

vibration effects.

The dynamics of u_Wi(t) and v_i(t) in (4.4.3) and (4.4.4) are characterized by the time constants (see Section 3.1.b)


\[
\frac{L_{Wi}}{R_{Wi}} = \frac{1}{130\pi}, \qquad \sqrt{C_{Wi}\, L_{Wi}} = \frac{1}{2\pi \cdot 30}
\]

which are very large with respect to the time of the wave

propagation in each elemental length of the vocal tract.











In fact, if we divide the vocal tract into 20 elemental sections of approximately 0.875 cm each, the time of wave propagation in each section is 2.5·10^-5 sec (0.875 cm divided by a sound velocity of about 35,000 cm/sec). This time gives the order of magnitude of the largest step size for the integration of equations (4.4.1) and (4.4.2), which in fact is usually achieved with variable step sizes between 2.5·10^-5 and 1.25·10^-5 sec.

On the other hand equations (4.4.3) and (4.4.4) may

employ a larger integration step.

We, therefore, integrate equations (4.4.1) and (4.4.2)

together with the two mass model equations using a Runge-

Kutta method with variable control of the step size and

assuming u_Wi(t) in (4.4.1) constant during this procedure. Every 5·10^-5 seconds, i.e., at a frequency of 20 kHz, we update the values of u_Wi(t) and v_i(t) by means of a simple backward differentiation rule based on equations (4.4.3) and (4.4.4).

At this time we also update the turbulent noise sources V_Ni and R_Ni according to the Reynolds number of the flow in

the cavity as explained in Section 3.2.c to provide a

fricative excitation.

In this way we halve the number of derivatives that

must be computed by the Runge-Kutta method to account for

vocal tract propagation and we save about 50% of the

integration time.
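A sketch of this mixed-rate wall update follows; the backward-rule discretization shown is one plausible form (the exact difference equations are not spelled out above), and all names are illustrative.

```python
# Sketch: backward (implicit) Euler update of the wall variables
# u_Wi and v_i of (4.4.3)-(4.4.4), applied every T = 5e-5 s while the
# acoustic variables p_i, u_i advance with the adaptive Runge-Kutta step.

T = 5e-5  # wall update interval, i.e. a 20 kHz update rate

def wall_update(u_w, v, p, Lw, Rw, Cw):
    """Advance (u_Wi, v_i) by T with p_i held fixed. Substituting the
    implicit v update into the implicit u_W update leaves one linear
    equation in the new u_W, solved in closed form."""
    denom = 1.0 + T * Rw / Lw + T * T / (Lw * Cw)
    u_w_new = (u_w + (T / Lw) * (p - v)) / denom
    v_new = v + (T / Cw) * u_w_new
    return u_w_new, v_new
```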











However, the numerical procedure is still correct, as

we will show in Chapter 6 where several effects associated

with cavity wall vibration are simulated.


















CHAPTER 5
THE ARTICULATORY MODEL AND ITS
INTERACTIVE GRAPHIC IMPLEMENTATION

The acoustic characteristics of the vocal cavities,

that we have modeled by means of an equivalent circuit in

Chapter 3, are greatly affected by the geometrical

configuration of the vocal cavities themselves. Therefore,

the acoustic and perceptual properties of speech depend on

the position of the lips, tongue, jaws and of the other

vocal organs that determine the shape of the vocal tract.

The physiological mechanism of speech production or

"articulation" involves precisely timed movements of the

vocal organs to produce the acoustic wave that we perceive

as connected speech.

This chapter is concerned with the definition and

implementation of a geometric or "articulatory" model of the

vocal tract that can be used to describe the configuration

of the vocal cavities during speech production.



5.1) Definition of the Articulatory Model

All the articulatory models presented in the literature

[78-81] are two dimensional representations of the vocal

cavities which closely match the midsagittal section of the












vocal tract, even if they do not resolve individual

muscles. The articulatory models that have been designed by

Coker [78] and Mermelstein [79], are probably the most

suitable for speech synthesis applications, since their

configuration is determined by a small number of control

parameters.

Figure 5.1 shows the articulatory model that has been

designed by Mermelstein. We can distinguish between a fixed

and a movable structure of the model. The fixed structure

consists of the pharyngeal wall (segments GS and SR in Figure 5.1), the soft palate (arc VM), the hard palate (arc MN), and the alveolar ridge (segment NV).

The configuration of the movable structure is

determined by external control parameters, that we call

articulatory parameters and that are represented in

Figure 5.1 by arrows. For example, the tongue body is drawn

as the arc of a circle (PQ in Figure 5.1) whose position is

determined by the coordinates x and y of its center. Other

parameters are used to control the locations of the tongue tip, the jaws, the velum and the hyoid bone, and the lip protrusion and width.

Mermelstein has shown that this model can match very

closely the midsagittal X-ray tracings of the vocal tract

that have been observed during speech production [82]. The

model of Figure 5.1 can therefore be used for speech







































Figure 5.1. Articulatory model of the vocal cavities.












synthesis if we can estimate the cross-sectional area of the

vocal tract (area function). The area function, in fact,

can be used to derive the equivalent circuit of the vocal

cavities, as discussed in Chapter 3.

In practical terms we superimpose a grid system, as

shown in Figure 5.2, on the articulatory model to obtain the

midsagittal dimensions of the vocal cavities at different

points, and then we convert this information into cross-sectional area values by means of analytical relationships defined in the literature [79].

We use a variable grid system, dependent on the tongue

body position, to make sure that each grid in Figure 5.2 is

always "almost" orthogonal to the vocal tract center line

regardless of the model configuration; to further correct the unavoidable misalignment of each grid, we multiply the cross-sectional area estimate by the cosine of the angle α (see Figure 5.2).
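A sketch of this conversion step follows. The power-law form A = a·d^b is a common choice in the literature for relating midsagittal width to area, but the coefficients here are placeholders rather than the region-dependent values actually taken from [79]; the cosine correction is the one just described.

```python
# Sketch: convert one midsagittal grid measurement to a cross-sectional
# area value. Coefficients a, b are illustrative placeholders.
import math

def cross_section_area(d_cm, a, b, alpha_rad):
    """d_cm: midsagittal width on one grid line; alpha_rad: angle between
    the grid line and the normal to the vocal tract center line."""
    return a * (d_cm ** b) * math.cos(alpha_rad)
```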



5.2) The Graphic Editor

Our computer implementation of the articulatory model

of the vocal cavities has been designed to be fast and easy

to use.

Traditionally, human-computer interaction employs

textual (alphanumeric) communication via on-line keyboard

terminals. This approach is satisfactory for many










































Figure 5.2. Grid system for the conversion of mid-sagittal
dimensions to cross-sectional area values.












applications but is being replaced by menu selection, joystick cursors, light pens, touch sensitive terminals, and other devices.

The conventional keyboard entry method is particularly

cumbersome if the data structure is not easily manipulated

via an alphanumeric selection process. Such an example

arises with pictorial or graphic images, as in computer

aided design. Here the user may communicate with the

computer by means of a graphic model. The system interprets

the model, evaluates its properties and characteristics, and

recognizes the user's changes to the model. The results are

presented graphically to the operator for further

interactive design and test.

Using a similar approach we have implemented on a

Tektronix 4113 graphics terminal interfaced to a DGC Eclipse

S/130 minicomputer an "interactive graphic editor" that is

used to manipulate the articulatory model.

The user may alter the configuration of the model by

means of a simple interaction. Each articulatory parameter

of Figure 5.1, and the corresponding vocal organ's position,

can be set to the desired value by means of the graphic

cursor of the Tektronix 4113 terminal. This allows a rapid

definition of the desired articulatory configuration.

But the power of an interactive graphic system lies in

its ability to extract relevant information from the model












for further analysis and processing. The acoustic

properties of the vocal tract are determined, as discussed

in Chapter 3, by its area function, which is the cross-

sectional area of the cavity, as a function of the distance,

x, from the larynx.

When the user has defined a new articulatory parameter

value by means of the cross-hair cursor, the system

estimates the area function of the vocal tract by means of

the grid system of Figure 5.2. The resonance or formant

frequencies are also estimated. This information is

immediately displayed (see Figure 5.3) for the user as a

first order approximation of the acoustic properties of the

graphic model of the vocal tract.
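One way to obtain such a first order formant estimate from the area function alone is to treat the tract as a chain of short lossless tubes and locate the peaks of its volume-velocity transfer function. The sketch below uses standard plane-wave chain matrices with a zero radiation load; it illustrates the idea and is not a transcription of the original program.

```python
# Sketch: formant estimation from an area function, modeling the tract
# as lossless uniform tube sections (plane waves, no wall or radiation
# losses). Peaks of |U_lips / U_glottis| approximate the formants.
import numpy as np

RHO, C = 1.14e-3, 3.5e4  # air density (g/cm^3), sound speed (cm/s)

def formants(areas_cm2, section_len_cm, f_max=5000.0, df=5.0):
    freqs = np.arange(df, f_max, df)
    mags = []
    for f in freqs:
        k = 2 * np.pi * f / C
        cs, sn = np.cos(k * section_len_cm), np.sin(k * section_len_cm)
        M = np.eye(2, dtype=complex)
        for A in areas_cm2:          # chain matrix, glottis to lips
            z0 = RHO * C / A
            M = M @ np.array([[cs, 1j * z0 * sn], [1j * sn / z0, cs]])
        # With P = 0 at the lips, U_glottis = M[1,1] * U_lips, so the
        # transfer magnitude is 1/|M[1,1]|; its peaks are the formants.
        mags.append(1.0 / abs(M[1, 1]))
    return [freqs[i] for i in range(1, len(mags) - 1)
            if mags[i] > mags[i - 1] and mags[i] > mags[i + 1]]

# A uniform 17.5 cm tube yields peaks near 500, 1500, 2500, ... Hz.
print(formants([5.0] * 20, 0.875))
```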

The interaction cycle is shown in Figure 5.4. Commands

are available not only to modify the displayed vocal tract shape but also to store it to and read it from disk memory.

These commands are useful not only to generate a "data base"

of vocal tract configurations, but also to create back up

files before using the interactive graphic commands.

But the interaction depicted in Figure 5.4 is not

sufficient to define a specific articulatory pattern. In

fact, the articulation of connected speech is a dynamic

process which consists of precisely timed movements of the

vocal organs. We have introduced the temporal dimension in

the system by means of an animation frame technique.






































Figure 5.3. The articulatory model implemented on the
Tektronix 4113 graphic terminal.




























Figure 5.4. Interaction cycle for the generation of an animated articulatory pattern. (a) Sign-on: the user indicates the name of the articulatory pattern. (b) Interaction cycle: the user modifies the displayed vocal cavity, and the system computes and displays the area function and the formant frequencies, until the user is satisfied. (c) The user issues the Store With Time (SWT) command.












When the user signs on (Figure 5.4a), he is asked to

indicate the name of the articulatory pattern that he wishes

to edit. The user may, at this point, create a new file or

modify an existing file. The user then interacts with the

model (Figure 5.4b) until he has achieved the desired vocal

tract configuration.

Next, a "Store With Time" (SWT) command is issued

(Figure 5.4c). This attaches a time label to the displayed

vocal tract configuration, which is also memorized as a

"frame" of the articulatory pattern that the user has

indicated at sign-on. Another interaction cycle is then

entered, which will lead to the definition of another frame.

The frames defined by the SWT command appear as

"targets" which must be reached at the indicated time. The

articulatory model is guided between consecutive targets by

means of an interpolation algorithm to achieve smooth

transitions. This is particularly important if the articulatory model of the vocal cavities is to be interfaced with the speech synthesizer, since a continuous variation of the synthesizer input parameters is required to obtain good quality speech.
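A sketch of this target interpolation follows (linear here for simplicity; any scheme that guarantees smooth transitions between consecutive SWT frames would do, and the data layout is illustrative).

```python
# Sketch: interpolation of articulatory parameters between time-labeled
# target frames. Each frame is (time_sec, {parameter_name: value}).

def interpolate(frames, t):
    """Return the articulatory parameter set at time t; frames are the
    SWT targets, sorted by their time labels."""
    if t <= frames[0][0]:
        return dict(frames[0][1])
    for (t0, p0), (t1, p1) in zip(frames, frames[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return {k: (1 - w) * p0[k] + w * p1[k] for k in p0}
    return dict(frames[-1][1])
```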



5.3) Display of the Time Variations of the Model

The interaction cycle described above generates an

articulatory pattern by means of an animation frame












technique, which is used as an input to the speech

synthesizer. However, during the interaction cycle, only

the particular frame being manipulated is visible on the

terminal display. Consequently, the user has difficulty

visualizing the global time varying characteristics of the

articulatory pattern. To overcome this disadvantage, we use

an on-line animation of the model.

The animation frames are computed by means of

interpolation and stored in the memory of the 4113 terminal

as graphic segments. Then each frame is briefly displayed

in sequence, creating the animation effect.

Figure 5.5 shows a typical animation frame. Here the

contour filling capability of the terminal is not used,

allowing higher display frequency. Using this technique, we

are able to obtain a live animation effect with only a

slight flickering phenomenon. The maximum frame display

frequency is about 5 Hz.

We may also view the movements of the vocal organs,

defined with the graphic editor, in three dimensions, as may

be seen in Figure 5.6. This effect is achieved by using

many consecutive animation frames as sections of a three

dimensional object, with the third dimension being time.

The advantage of this technique is that the time

evolution of the model can be observed at a glance;

moreover, a three dimensional rigid rotation allows the user




















































Figure 5.5. A typical animation frame.




















































Figure 5.6. Three dimensional views of the vocal tract.












to choose the most convenient view angle. Different colors

(one every five frames) are used to mark specific time events.

Figure 5.6 also shows that the hidden lines have been

removed. This is achieved very efficiently by means of the

contour filling capability of the terminal. In this 3-D

representation all the frames belong to planes parallel to

each other. It is, therefore, very simple to determine, for a given angle of rotation, which frame is in "front" and

which one is "behind". To remove the hidden lines the

contour capability of the terminal is used with the "ink

eradicator" color, starting from the frame which is the

farthest from the observer.



5.4) Simultaneous and Animated Display
of the Articulatory and Acoustic Characteristics
of the Vocal Cavities

When a certain articulatory pattern has been edited

with the interactive graphic model, we may estimate the

corresponding acoustic events, e.g., the speech waveform,

the pressure and air volume-velocity distribution in the

vocal cavities, the motion of the vocal cords, the vibration

of the cavity walls, etc. by means of the acoustic model

defined in Chapter 3.

Figure 5.7 shows the final configuration of the

system. The articulatory or "muscular" events generated by










































Figure 5.7. System configuration. The articulatory events generated by the graphic editor are the input to the acoustic model of the vocal cavities. Both articulatory and acoustic events are later displayed with a computer generated movie.












the graphic model are the (off-line) input to the

synthesizer, which computes the corresponding acoustic

waveforms. Later a simultaneous and animated representation

of the articulatory and acoustic events is displayed on the

Tektronix 4113 terminal.

Figure 5.8 illustrates a typical frame of a sample

animation. The figure below the vocal tract model

represents the two vocal cords. Each cord (left and right)

is schematically represented by two masses as proposed by

Flanagan [67]. The graphs on the right part of the screen

represent different acoustic events over the same time

interval.

Both the vocal tract and vocal cord models are

animated. During the animation a sliding green line runs

along the borders of the graphs to "mark the time". This

assists the viewer in relating the information displayed in

the graphs to the vocal cord and the vocal tract motion. A

live animation effect is obtained in this manner.






















































Figure 5.8. A typical animation frame including the
vocal tract and the vocal cords. The
various data waveforms calculated by
the model are also shown.


















CHAPTER 6
SIMULATION RESULTS

6.1) Speech Synthesis

As explained in Chapter 2, linear prediction and

formant synthesis are based on a rather approximate model of

speech production. However, the quality of the synthetic

speech may be very good because the synthesis algorithm uses

the information derived from an analysis phase of natural

speech that captures the most important perceptual features

of the speech waveform.

On the contrary, articulatory synthesis employs a more

detailed model of the human speech production mechanism, but

cannot exploit a good analysis algorithm to derive the

articulatory information directly from natural speech.

In this section we want to show that the amount of

physiological detail captured by our articulatory and

acoustic models of the vocal cavities is sufficient for the

generation of good quality English sentences.

In fact, Figure 6.1 shows the spectrogram of the

sentence "Goodbye Bob" that we have synthesized with our

computer programs. The quality of this sample compares

favorably with other synthesis techniques.









Figure 6.1. Spectrogram of "Goodbye Bob", synthetic.

Figure 6.2. Spectrogram of "Goodbye Bob", natural.













As the first step of the synthesis procedure we should

obtain the spectrogram of natural speech, which is shown in

Figure 6.2. We do not attempt to faithfully match the

synthetic spectrogram with its natural counterpart.

However, the natural spectrogram is useful to obtain a good

estimate of the required duration of each segment of the

synthetic sentence.

The articulatory information is obtained, in a rather

heuristic way, from phonetic considerations and from X-ray

data available in the literature [22] [82-83]. For example,

we know that a labial closure is required for the production of the /b/ and /p/ consonants, or that the tongue position must be "low" and "back" for the production of the /a/ sound.

Using this linguistic knowledge, we can therefore use

the "graphic editor" described in Section 5.2 to define the

articulatory configurations that are necessary to synthesize

the desired sentence.

As described in Sections 3.2a and 3.2b, the subglottal

and vocal cord models are controlled by three parameters: the glottal neutral area Ag0, the cord tension Q and the subglottal pressure Ps.

We set the glottal neutral area to 0.5 cm² or 0.05 cm²

for the generation of unvoiced or voiced synthetic speech

respectively. The values of the cord tension and subglottal

pressure can be estimated after a pitch and short time

energy analysis of natural speech [67] [70].












The procedure for the definition of the time evolution

of the articulatory model that we have described above is,

however, rather "heuristic". After a first trial,

adjustments of the articulatory model configuration are

usually necessary to improve the quality of the synthetic

speech.

In our opinion a development of this research should be the definition of vocal tract evolution for different English allophones, as a first step toward an automatic

speech synthesis by rule system based on an articulatory

model. The solution of this problem is not at all

trivial. Section 6.7 illustrates the difficulties and

reviews part of the literature related to this subject.



6.2) Source Tract Interaction

The classical source-tract speech production model that

we have discussed in Section 2.1 is based on the assumption

that the glottal volume velocity during speech production is

independent of the acoustic properties of the vocal tract.

Evidently this source-tract separability assumption holds

only as a first order approximation. In fact the glottal

volume velocity depends on the transglottal pressure that is

related to the subglottal and vocal tract pressure.

The effects of source-tract interaction have been

modeled and analyzed by Guerin [44], Rothenberg [42],













Ananthapadmanabha and Fant [43]. Yea [34] has carried out a

perceptual investigation of source tract interaction using

Guerin's model to provide the excitation of a formant

synthesizer [16].

Source tract interaction, which is well represented by

the model discussed in Chapter 3, can be discussed with

reference to Figure 6.3. The glottal volume velocity UG

depends not only on the subglottal pressure Ps and on the glottal impedance ZG, which varies during the glottal cycle, but also on the vocal tract input impedance Zin. Source-tract separability holds only if the magnitude of Zin is much smaller than the magnitude of ZG, since in this case ZG and Ps are equivalent to an ideal current generator. Therefore the amount of source tract interaction depends on the magnitude of ZG with respect to Zin.
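In circuit terms this can be read off the single loop of Figure 6.3, assuming the lumped representation of the figure:

\[
U_G(\omega) = \frac{P_s}{Z_G(\omega) + Z_{in}(\omega)} \;\approx\; \frac{P_s}{Z_G(\omega)} \quad \text{when } |Z_{in}| \ll |Z_G|,
\]

so that scaling up Z_G reduces the relative weight of Z_in in the loop, and with it the source-tract interaction.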

We have experimented with different amounts of source

tract interaction using the following procedure.

At first we have synthesized the word "goodbye" using

our model of the vocal tract and of the vocal cords. The

obtained glottal volume velocity is shown in the middle part

of Figure 6.4.

Then, to reduce source tract interaction, we have used

the same vocal tract configuration, but we have multiplied

by a factor of two the glottal impedance throughout the

entire synthesis of the word "goodbye". We have,

























Figure 6.3. Source-tract interaction model. Ps: subglottal pressure. UG: glottal volume velocity. ZG: glottal impedance. Zin: vocal tract input impedance.

















Figure 6.4. Three different glottal excitations: too much, just right, and too little source-tract interaction.

Figure 6.5. Spectrogram of "good bye", synthetic (0 to 0.4 sec).



