AN ARTICULATORY SPEECH SYNTHESIZER
ENRICO LUIGI BOCCHIERI
A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
I would like to express my sincere appreciation to my
advisory committee for their help and guidance throughout
I would like to give special thanks to my committee
chairman, Dr. D. G. Childers for his competent advice, for
his support, both financial and moral, and for providing an
educational atmosphere without which this research would
never have been possible.
Special gratitude is also expressed to Dr. E. R.
Chenette for his guidance, encouragement and financial
To my parents and family I am forever indebted. Their
unceasing support and encouragement made it all possible and
TABLE OF CONTENTS
1 INTRODUCTION: SPEECH SYNTHESIS APPLICATIONS.
RESEARCH GOALS........................................ 1
2 SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS......... 7
2.1) Speech Physiology and the Source-Filter Model...8
2.2) Linear Prediction............................... 13
2.3) Formant Synthesis ............................ ..16
2.4) Articulatory Synthesis .........................18
3 ACOUSTIC MODELS OF THE VOCAL CAVITIES............... 23
3.1 Sound Propagation in the Vocal Cavities.........23
3.1.a)' Derivation of the Model ................... 23
3.l.b) Modeling the Yielding Wall Properties..... 32
3.1.c) Nasal Coupling............................. 37
3.2) Excitation Modeling ............................37
3.2.a) Subglottal Pressure .......................37
3.2.b) Voiced Excitation.........................38
3.2.c) Unvoiced Excitation.......................45
3.3) Radiation Load...... ............................ 47
3.4) Remarks and Other Acoustic Models..............48
4 NUMERICAL SOLUTION OF THE ACOUSTIC MODEL.............51
4.1) Requirements of the Numerical Solution
Procedure .......................... .......... 51
4.2) Runge-Kutta Methods.......................... 54
4.2.a) Derivation of the Method..................54
4.2.b) Control of the Step Size
with Runge-Kutta.......................... 57
4.2.c) Order Selection............................59
4.3) Multistep Methods............................... 60
4.3.a) Implicit and Explicit Methods .............60
4.3.b) Derivation of the Methods.................61
4.3.c) Characteristics of Multistep Methods ...... 64
4.4) Method Selection.... .............................66
5 THE ARTICULATORY MODEL AND ITS
INTERACTIVE GRAPHIC IMPLEMENTATION..................71
5.1) Definition of the Articulatory Model...........71
5.2) The Graphic Editor.............................74
5.3) Display of the Time Variations of the Model....80
5.4) Simultaneous and Animated Display of the
Articulatory and Acoustic Characteristics
of the Vocal Cavities.........................84
6 SIMULATION RESULTS....................... ........... ..88
6.1) Speech Synthesis............................... 88
6.2) Source Tract Interaction.......................91
6.3) Onset Spectra of Voiced Stops..................95
6.4) Glottal Inverse Filtering of Speech............99
6.5) Simulation of Wall Vibration Effects............105
6.5.a) Vocal Cords Vibration During Closure.....105
6.5.b) Formant Shift............................ 110
6.6) Pathology Simulation: Reduction of Sound
Intensity During Nasalization.................112
6.7) Coarticulation ................................117
6.8) Interpretation of the EGG Data with
the Two Mass Model of the Vocal Cords.........124
7 CONCLUSIONS . ................................... 133
7.1) Summary........ ............ .......... ... ..... 133
7.2) Suggestions for Future Research...............135
AN EFFICIENT ARTICULATORY SYNTHESIS ALGORITHM
FOR ARRAY PROCESSOR IMPLEMENTATION..................137
A.1) Wave Propagation in Concatenated
Lossless Tubes ...................... ........ 138
A.2) Modifications of Kelly-Lochbaum Algorithm.....141
A.2.a) Fricative Excitation....................141
A.2.b) Yielding Wall Simulation................146
A.3) Boundary Conditions ..........................150
A.3.a) Glottal Termination......................151
A.3.b) Radiation Load ....................... 153
A.3.c) Nasal Coupling ................. ........ 155
REFERENCES ..... ......................... ........... .. 158
BIOGRAPHICAL SKETCH..................................... 169
Abstract of Dissertation Presented to the Graduate
Council of the University of Florida in Partial
Fulfillment of the Requirements for the
Degree of Doctor of Philosophy
AN ARTICULATORY SPEECH SYNTHESIZER
Enrico Luigi Bocchieri
Chairman: Dr. D. G. Childers
Major Department: Electrical Engineering
Linear prediction and formant synthesizers are based on
a rather approximate model of speech production physiology,
using analysis or "identification" algorithms of natural
speech to overcome the model limitations and to synthesize
good quality speech.
On the contrary, articulatory synthesizers are based on
a more exact speech production model, and do not use
identification algorithms to derive the model parameters
directly from the natural speech waveform.
This dissertation shows that the amount of
physiological detail captured by the articulatory synthesis
method is sufficient for the generation of high quality
synthetic speech and for the simulation of physiological and
pathological aspects of speech that are reported in the
Articulatory synthesis of speech represents the
acoustic properties of the vocal cavities by means of
modeling and numerical simulation techniques that are
reported in Chapters 3 and 4.
We have been able to guarantee the stability of the
numerical method and to halve the number of differential
equations that must be solved for the simulation of the
sound propagation in the vocal tract (Chapter 4).
In the Appendix we present a new and more efficient
algorithm for the simulation of the vocal cavity acoustics
which can be efficiently implemented with parallel
Interactive graphic software (Chapter 5) has been
developed to represent the configurations of the vocal
cavities and to provide us with a convenient interface for
the manipulation of the geometric model of the vocal
Chapter 6 employs the developed articulatory synthesis
system for the simulation of different aspects of speech
processing, for modeling speech physiology, and testing
theories of linguistics reported in the literature. We
discuss and illustrate such cases as source tract
interaction, EGG modeling, onset spectra of voiced stops at
consonantal release, the effects of yielding walls on
phonation, sound intensity reduction during nasalization,
and glottal least squares inverse filtering.
SPEECH SYNTHESIS APPLICATIONS. RESEARCH GOALS.
In the last 3 4 decades both the engineering and
medical communities have devoted considerable research
effort to the problem of speech synthesis, i.e., the
generation of voice by artificial, electrical or mechanical
The earliest attempts to construct talking machines can
be traced to the late 18th century. One of the first speech
synthesis devices was Kempelen's talking machine [1-2] which
in a demonstration in Vienna in 1791 was capable of
imitating the sounds of vowels and of many consonants.
Perhaps the greatest motivation for speech synthesis
research came from the development of telecommunications and
from the consequent engineering interest in efficient
methods for speech transmission. Moreover, the recent
progresses in circuit integration, microprocessors and
digital computers have made the implementation of high
performance speech transmission systems technologically
feasible [3-6]. This type of application requires a scheme
known as speech synthesis by analysis.
In its simplest form, speech communication is achieved
by modulating an electrical magnitude (for example, the
current in a transmission line) with the air pressure during
speech production. With this straightforward approach a
copy, in electrical terms, of the speech waveform can be
transmitted on a communication channel with a typical
bandwidth of about 3 kHz.
However, there appears to be a mismatch between the
information content of speech and the channel capacity. In
fact, the information content of written text may be
estimated at about 50 bit/sec  while the channel capacity
of a 3 kHz bandwidth and a typical signal-to-noise ratio is
about 30,000 bits/sec. Similar bit rates are also
encountered in conventional PCM speech transmission. Even
though spoken speech contains more information (such as
intonation and stress) than its written counterpart, the
above mentioned mismatch indicates that a smaller channel
bandwidth can be used for a more efficient transmission of
speech. Using different tradeoffs between the
intelligibility and naturalness of speech transmission on
one side and bit rate on the other, engineers have been able
to transmit speech with bit rates varying from 150 to 30,000
The reduction of channel bandwidth has been obtained by
means of analysis-synthesis systems. Before transmission,
speech analysis algorithms are used to extract relevant
information about the speech waveform. This information is
then encoded and transmitted (hopefully at low bit rates)
along the communication channel. At the receiver a
synthesis algorithm is used to reconstruct the speech
waveform from the transmitted information.
This synthesis by analysis process is useful not only
in voice communication systems; for example, in automatic
voice answering systems, words or sentences are stored for
successive playbacks. In addition synthesis by analysis can
be used to reduce the memory usage. Texas Instruments'
learning aid, Speak and Spell, is an example of this type of
Synthesis by rule or text to speech synthesis is a
different type of application that has received considerable
attention lately [12-13]. In this case the problem is not
to "regenerate" synthetic speech after an analysis phase of
its natural counterpart. Instead synthetic speech is
automatically produced according to certain linguistic rules
which transform a string of discrete input symbols directly
into speech  (see Figure 1.1). Applications of text to
speech synthesis include reading machines for the blind
, automatic answering systems, ... man-machine
The medical community is interested in speech synthesis
systems for different reasons. Speech synthesizers are
Figure 1.1. Text to speech synthesis.
often used in psycoacoustic and perceptual experiments
[16-18] in which the acoustic characteristics of speech must
be precisely and systematically controlled. Moreover the
vocal system is not easily accessible; therefore speech
physiologists and pathologists may use computer models as an
aid for the investigation of the physiology of the vocal
system and the diagnosis of voice disorders [19-20].
The purpose of this research is to apply speech
synthesis techniques for the simulation of the physiological
process of speech articulation in relation to the acoustic
characteristics of the speech signal.
Chapter 2 reviews the speech synthesis strategies used
most often and explains why the so-called articulatoryy
synthesis" method has been selected for our research.
Speech generation depends on the vocal cavities acoustic
properties which are physiologically determined during
speech articulation by the geometric configuration of the
The model of the acoustic characteristics of the vocal
cavities is explained in detail in Chapter 3, together with
its implementation by means of numerical simulation
techniques in Chapter 4. Chapter 5 focuses on the geometry
or spatial model of the vocal tract together with the
interactive graphic techniques that have been used for its
Synthesis and simulation results are presented in
Chapter 6. Chapter 7 provides a discussion of our findings
along with conclusions and suggestions for future research.
SPEECH PRODUCTION MODELS AND SYNTHESIS METHODS
As indicated in the introduction (Chapter 1) there are
many different applications that motivate research in the
area of speech synthesis. However, different goals usually
require different approaches for the solution of the
problem. This chapter will briefly consider the three most
popular and best documented techniques for speech synthesis
(namely linear prediction, formant and articulatory
synthesis), their relative "advantages" and "disadvantages"
and the applications for which they are most suitable. The
purpose is to review the available speech synthesis
techniques and to justify the choice of articulatory
synthesis for our research.
Every speech synthesis strategy is based on a more or
less complete model of the physiology of speech production
and ultimately its performance is determined by the amount
of acoustic and linguistic knowledge that the model can
In the first section of this chapter we therefore
discuss the basic notions of the physiology of speech
together with the source-filter production model upon which
both linear prediction and formant synthesis are based.
2.1) Speech Physiology and the Source-Filter Model.
The acoustic and articulatory features of speech
production can be most easily discussed by referring to
Figure 2.1, which shows the cross-section of the vocal
The thoracical and abdominal musculatures are the
source of energy for the production of speech. The
contraction of the rib cage and the upward movement of the
diaphragm increase the air pressure in the lungs and expell
air through the trachea to provide an acoustic excitation of
the supraglottal vocal cavities, i.e., the pharynx, mouth
and nasal passage.
The nature of speech sounds is mostly determined by the
vocal cords and by the supraglottal cavities. The vocal
cords are two lips of ligament and muscle located in the
larynx; the supraglottal cavities are the oral and nasal
cavities that are vented to the atmosphere through the mouth
Physically, speech sounds are an acoustic pressure wave
that is radiated from the mouth and from the nostrils and is
generated by the acoustic excitation of the vocal cavities
with the stream of air that is coming from the lungs during
An obvious and important characteristic of speech is
that it is not a continuous type of sound but instead it is
Figure 2.1. Schematic diagram of the human vocal
mechanism (from  ). By permission
perceived as a sequence of speech units or segments. In
general the different types of sound that occur during
speech production are generated by changing the manner of
excitation and the acoustic response of the vocal cavities.
As a first order approximation we can distinguish
between a "voiced" and a "fricative" or "unvoiced"
excitation of the vocal cavities.
The voiced excitation is obtained by allowing the vocal
cords to vibrate so that they modulate the stream of air
that is coming from the lungs, producing an almost periodic
signal. For example, vowels are generated in this way and
they are perceived as continuous non-hissy sounds because
the excitation is essentially periodic.
In contrast, unvoiced or fricative excitation is
achieved by forcing the air flow through a constriction in
the vocal tract with a sufficiently high Reynold's number,
thereby causing turbulence. This excitation has random or
"noisy" characteristics and, therefore, the resulting speech
sounds will be hissy or fricative (friction-like) as in the
case of the consonants /s/ and /f/.
Both voiced and unvoiced excitation signals have a
rather wide spectrum. Typically the power spectrum of the
voiced excitation decreases with an average slope of
12 db/octave  while the unvoiced spectrum can be
considered white over the speech frequencies .
The spectral characteristics of the excitation are
further modified by the acoustic transfer function of the
vocal cavities. Sound transmission is more efficient at the
resonance frequencies of the supraglottal vocal system and,
therefore, the acoustic energy of the radiated speech sound
is concentrated around these frequencies formantt
frequencies). During the generation of connected speech,
the shape and acoustic characteristics of the vocal cavities
are continuously changed by precisely timed movements of the
lips, tongue and of the other vocal organs. This process of
adjustment of the vocal cavity shape to produce different
types of speech sounds is called articulation.
These considerations about speech physiology lead to
the simple but extremely useful source-tract model of speech
production , which has been explicitly or implicitly
used since the earliest work in the area of speech synthesis
[23-25]. This model is still employed in linear prediction
and formant synthesis. It consists (see Figure 2.2) of a
filter whose transfer function models the acoustic response
of the vocal cavities and of an excitation source that
generates either a periodic or a random signal for the
production of voiced or unvoiced sounds, respectively. The
operation of the source and the filter transfer function can
be determined by external control parameters to obtain an
output signal with the same acoustic properties of speech.
SOURCE 9(t) TRACT
(VOICED ORI TRANSFER-
g ( t) unvoiced -
Figure 2.2. The source-tract speech production model.
vo iced -
2.2) Linear Prediction
The simplest and most widely used implementation of the
source tract model of speech production is Linear Prediction
synthesis. It is a synthesis by analysis method that was
first proposed by Atal and Hanauer  and it has been
investigated for a great variety of speech applications.
The method is particularly suitable for digital
implementation and it assumes a time discrete model of
speech production, typically with a sampling frequency
between 7 and 10 kHz. It consists (see Figure 2.3) of two
signal generators of voiced and unvoiced excitation and of
an all pole transfer function
H(z) = 1 (2.2.1)
1 + I aKZ
to represent the acoustic response of the vocal cavities.
Mathematically, the transfer function H(z) is determined by
the predictor coefficients, aKs. The great advantage of
linear prediction is that an estimate of the predictor
parameters can be efficiently obtained using an analysis
phase of natural speech. The literature presents several
algorithms to perform this analysis. Perhaps the schemes
most used for speech synthesis applications are the
autocorrelation method  and, for hardware
implementation, the PARCOR algorithm .
Figure 2.3. Linear prediction speech production model.
During speech articulation the vocal cavity transfer
function is continuously changing and, ideally, the result
of the analysis of natural speech is a time varying estimate
of the linear predictor parameters (or of an equivalent
representation) as they change during speech production.
Also, the literature reports many algorithms which can
be applied to natural speech to extract the fundamental
(vocal cord oscillation) frequency and to perform the
voiced/unvoiced decision [29-30]. Some of them have been
implemented on special purpose integrated circuits for real
time applications  .
Therefore, linear prediction provides a complete
synthesis by analysis method in which all the control
parameters of Figure 2.3 can be derived directly from
The shortcoming of linear prediction is that the
transfer function (2.2.1) cannot properly model the
production of nasal, fricative and stop consonants. The all
pole approximation of the vocal tract transfer function is
in fact theoretically justified only for vowel sounds, and
even in this case the linear prediction model assumes a
minimum phase excitation signal.
In spite of these disadvantages, however, linear
prediction performs well for speech synthesis applications
because the approximations introduced by the model of
Figure 2.3 do not severely affect the perceptual properties
of the speech sound. In fact the human hearing system
appears to be especially sensitive to the magnitude of the
short-time spectrum of speech , that is usually
adequately approximated by the linear prediction transfer
function . Perhaps the minimum phase approximation is
responsible for a characteristic "buzziness"  of the
synthetic speech. A great deal of research is being
dedicated to improve the quality of linear prediction
synthesis by using a more suitable excitation than impulse
Linear prediction of speech is, therefore, most useful
in those applications that require a fully automated
synthesis by analysis process. Speech compression, linear
prediction vocoders, very low bit rate transmissions are
typical examples. Also linear prediction has application in
speech products where speech may be recorded for later
playback with stringent memory constraints.
2.3) Formant Synthesis
Similar to linear prediction, formant synthesis is
based on the source-tract speech production model. However,
in this case the filter that models the vocal tract is not
implemented by an all pole digital filter but it consists of
a number of resonators whose transfer function is controlled
by their resonance formantt) frequencies and bandwidths.
Among the many formant synthesizers reported in the
literature  [36-37] two general configurations are
common. In one type of configuration, the formant
resonators that simulate the transfer function of the vocal
tract are connected in parallel. Each resonator is followed
by an amplitude gain control which determines the spectral
peak level. In the other type of synthesizer the resonators
are in cascade. The advantage here is that the relative
amplitudes of formant peaks for the generation of vowels are
produced correctly with no need for individual amplitude
control for each formant .
Formant synthesis was developed prior to linear
prediction synthesis probably because it is amenable to
analog hardware implementation. However, many synthesizers
have been implemented on general purpose digital computers
to obtain a more flexible design. The most recent and
perhaps most complete synthesizer reported in the literature
has been designed by Klatt , which consists of a cascade
and parallel configurations which are used for the
production of vowels and consonants, respectively.
An advantage of formant synthesizers is their
flexibility. In fact, thanks to the parallel configuration,
the filter transfer function may have both zeroes and
poles. Klatt's synthesizer, for example, has 39 control
parameters which not only control the filter transfer
function in terms of formant frequencies and bandwidths but
also determine different types of excitation such as voiced,
unvoiced, mixed, sinusoidal for the generation of
consonantal murmurs, and burst-like for the simulation of
Formant synthesizers are particularly useful in
psycoacoustic studies since the synthetic speech waveform
can be precisely controlled by parameters which are more
directly related to the acoustic characteristics of speech
than the linear prediction coefficients. For the same
reason they are also more suitable for speech synthesis by
Synthesis by analysis with formant synthesizers
requires the extraction of the formant information from the
speech signal. For this purpose the literature presents
several formant analysis algorithms [38-41]. However linear
prediction analysis is simpler and more efficient and
therefore linear prediction is usually preferred in
synthesis by analysis type of applications.
2.4) Articulatory Synthesis
Linear prediction and formant synthesis are not
completely suitable for our research since they do not
faithfully model the human speech production mechanism.
A first disadvantage, which is inherent to the source-
filter model, is the assumption of separability between the
excitation and the acoustic properties of the vocal tract.
Clearly this assumption is not valid for the production of
fricative sounds in which the excitation depends on the
vocal tract constriction or for the generation of stops in
which the tract closure arrests the glottal air flow.
Source tract separability is a first order modeling
approximation even in the production of vowels as documented
by many recent papers concerning the nature of source tract
interaction [42-45]. We will address this issue in
Chapter 6 in more detail.
The second "disadvantage" (for our purpose) is that the
filter in the source-tract model (and also linear prediction
and formant synthesis) accounts only for the acoustic
input/output transfer function of the vocal cavities which
is estimated by means of analysis or "identification"
algorithms. For example, Linear Prediction and Formant
synthesis cannot model the areodynamic and myoelastic
effects that determine the vocal cords vibration, the air
pressure distribution along the oral and nasal tracts and
the vibration of the vocal cavity walls.
In other words, linear prediction and formant synthesis
define an algorithm for the generation of signals with the
same acoustic features of natural speech but they do not
model the physiological mechanism of speech generation.
These limitations of the source-tract model of speech
production can be overcome by the so called articulatoryy"
synthesis that is based on a physiological model of speech
production. It consists (see Figure 2.4) of at least two
1) an articulatory model that has as input the time
varying vocal organ positions during speech
production to generate a description of the
corresponding vocal cavity shape and
2) an acoustic model which, given a certain time
varying vocal cavity configuration, is capable of
estimating not only the corresponding speech
waveform but also the pressure and volume velocity
distribution in the vocal tract, the vibration
pattern of the vocal cords and of the vocal cavity
The strength of linear predication and formant
synthesis, namely the existence of analysis algorithms for
natural speech is, however, a weak point for articulatory
synthesis. Even if several methods for estimating the vocal
tract configuration are presented in the literature [46-51],
these procedures cannot be easily applied to all the
different types of speech sounds. This estimation is made
even more difficult to achieve by the fact that the acoustic
to articulatory transformation is not unique [52-53]. This
vibration of the
Sthe vocal cavity.
Figure 2.4. Articulatory synthesis of speech.
disadvantage, together with high computational requirements,
limits the use of articulatory synthesis for speech
applications which has been recently investigated by
Flanagan et al. .
In the following, Chapters 3 and 4 will discuss the
acoustic modeling of the vocal cavities and its
implementation with numerical simulation techniques.
Chapter 5 concentrates on the articulation model and the
computer graphic techniques used for its implementation.
ACOUSTIC MODELS OF THE VOCAL CAVITIES
The qualitative descriptions of the human speech
production mechanism and of the articulatory method of
speech synthesis that we have given in Section 2.1 cannot be
directly implemented on a digital computer. This knowledge
must be transformed into an analytical representation of the
physics of sound generation and propagation in the vocal
The mathematical model of the vocal cavity acoustics
can be conveniently interpreted by means of equivalent
circuits. In such a representation the electrical current
and voltage correspond respectively to the air pressure and
volume velocity in the vocal cavities. In the following we
will always express all the physical dimensions in C.G.S.
3.1) Sound Propagation in the Vocal Cavities.
3.1.a) Derivation of the Model.
Sound is nearly synonymous with vibration. Sound waves
are originated by mechanical vibrations and are propagated
in air or other media by vibrating the particles of the
media. The fundamental laws of mechanics, such as momentum,
mass and energy conservation and of fluid dynamics can be
applied to the compressible, low viscosity medium (air) to
quantitatively account for sound propagation.
The oral and nasal tracts are three-dimensional lossy
cavities of non-uniform cross-sections and non-rigid
walls. Their acoustic characteristics are described by a
three dimensional Navier-Stokes partial differential
_equation for boundary conditions appropriate to the yielding
walls. However, in practice, the solution of this
mathematical formulation requires
"an exhorbitant amount of computation, and we do not
even know the exact shape of the vocal tract and the
characteristics of the walls to take advantage of such a
rigorous approach." 
The simplifying assumption commonly made in the
literature is plane wave propagation. Most of the sound
energy during speech is contained in the frequency range
between 80 and 8000 Hz  but speech quality is not
significantly affected if only the frequencies below 5 kHz
are retained . In this frequency range the cross
sectional dimensions of the vocal tract are sufficiently
small compared to the sound wavelength so that the departure
from plane wave propagation is not significant.
Thanks to the plane wave propagation assumption, the
geometric modeling of the vocal cavities can be greatly
simplified. Acoustically the vocal tract becomes equivalent
to a circular pipe of non-uniform cross-section (see
Figure 3.1) whose physical dimensions are completely
described by its cross-sectional area, A(x), as a function
of the distance x along the tube (area function). The sound
propagation can now be modeled by a one dimensional wave
equation. If the losses due to viscosity and thermal
conduction either in the bulk of the fluid or at the walls
of the tube are neglected, the following system of
differential equations accurately describes the wave
bp(x,t) + (x,t) = (3 1. la)
x- + "
bu(x,t) 1 b(p(x,t) A(x,t)) A(x,t)
x + P t at
= displacement along the axis of the tube
= pressure in the tube as function of
time and displacement
= air volume velocity in the tube
= area function of the tube
= air density
A sound velocity.
The first and second equations (3.1.1) correspond to
Newton's and continuity law, respectively.
0 L cm
Figure 3.1. The vocal tract represented by a non uniform
pipe and its area function.
Equations (3.1.1) indicate that the area function
A(x,t) is varying with time. Physiologically this is caused
by two different phenomena,
a) the voluntary movements of the vocal organs during
speech articulation, and
b) the vibration of the vocal cavity walls that are
caused by the variations of the vocal tract
pressure during speech.
We therefore represent the vocal tract area function as
a summation of two components
A(x,t) L Ao(x,t) + 6A(x,t) = Ao(x) + 6A(x,t) (3.1.2)
The first component A (x,t) is determined by a) above and it
represents the "nominal" cross-sectional area of the vocal
tract. Since the movements of the vocal organs are slow
compared to the sound propagation, this component can be
considered time invariant with a good approximation. The
2nd component 6A(x,t) represents the perturbation of the
cross-sectional area of the vocal tract that is caused by b)
above. Its dynamics cannot be neglected as compared with
the acoustic propagation. Nevertheless, this component has
a relatively small magnitude compared to A(x,t).
If we substitute (3.1.2) into (3.1.1) and if we neglect
2nd order terms we obtain
p(x, t) + P bu(x,t) = (3.1.3a)
Ox A -(x) 6(t3
u(x,t) + Ao(x) ap(x,t) (6A(x,t)) (3.1 3b)
Ox PC2 Ft t
The partial differential equations (3.1.3) can be
approximated with a system of ordinary differential
equations. Let the acoustic pipe be represented by a
sequence of N elemental lengths with circular and uniform
cross-sections Ai, i = 1, ...N. This is equivalent to
approximating the area function A(x) using a stepwise method
as shown in Figure 3.2.
If each elemental length is sufficiently shorter than
the sound wavelength, we can suppose that the pressure and
volume velocity are independent of the position in the
elemental length itself. Instead of the functions p(x,t)
and u(x,t), we need to consider a finite number of time
pi(t), ui(t) ; i = 1,N
that represent the pressure and volume velocity in the ith
elemental section as a function of time. The partial
derivatives can now be approximated by finite differences
Figure 3.2. Stepwise approximation of the area function.
Sp.(t) p (t)
ap(x,t) ~ Pi(t) Pi 1(t)
au(x,t)~ ui.(t) ui 1(t)
where pi(t), ui(t) = pressure and volume volume velocity
in the ith vocal tract section
Ax = L/N = length of each elemental section.
Therefore equations (3.1.3) become
d ui(t) A.
dt -x (Pi(t) Pi-(t)) (3.1.4a)
d Pi(t) 2 d6A. (t)
dt pi) (u (t) (t) Ax it
1dt x dt
i = 1, ..., N
Equations (3.1.4) can be represented by the equivalent
electrical circuit as shown in Figure 3.3. Inductor L. and
capacitor Ci, represent the inertance and compressibility of
the air in the ith elemental length of the vocal tract.
They are defined in terms of the cross-sectional area Ai as
I I R LW
Figure 3.3. Vocal tract elemental length and its equivalent
A. Ax p Ax
C L.= = ...N
1 2 1
The component Ax dt in equation (3.1.4b) that represents
the vibration of the cavity walls is represented by the
current in the impedance ZWi as will be shown in the next
3.l.b) Modeling the Yielding Wall Properties
From equation (3.1.4b) we can see that the effect of
wall vibrations are to generate in the ith elemental section
of the vocal tract an additional volume velocity component
which is represented in Figure 3.3 by the current uWi in the
impedance ZWi. We will now consider how ZWi is related to
the mechanical properties of the vibrating walls.
Consider Figure 3.4 which shows an elemental length of
the pipe in which one wall is allowed to move under the
forcing action of the pressure p(t) in the pipe itself. Let
m, k and d represent the mass, elastic constant and damping
factor of a unit surface. Since the total vibrating surface
is lAx, and since we assume the walls to be locally
reacting, then the total mass, elastic constant and damping
factor of the vibrating wall are
1 1 Ax
Figure 3.4. Mechanical model of an elemental length of
the vocal tract with a yielding surface.
m(l Ax), k(l Ax), d(l Ax),
According to Newton's law the forcing action on the
F = lAx p(t) = mlAx dy(t) + dlAx t + kly(t) (3.1.5)
where y(t) is the wall displacement from the neutral
position (when the pressure in the tract is equal to
The airflow generated by the wall motion is
u(t) = Ax 1dydt
and by substitution of (3.1.6) into (3.1.5) we obtain
S du (t)
p(t) 1 Ax dt + (1 Ax) Uw(t)
In the frequency domain,
represented by the impedance,
(3.1.7) can be equivalently
ZW(s) = sLW + RW + 1 (3.1.8)
P(s) = ZW(s) UW(s)
A m R d A1Ax
LW ; RW CW A (3.1.9)
1 Ax 1 Ax k
Such an impedance appears to be inversely related to the
vibrating surface (lAx) and its components are
1. an inductance LW proportional to the mass of the
unit surface of the vibrating wall,
2. a resistor RW proportional to the viscous damping
per unit surface of the vibrating walls, and
3. a capacitor CW inversely proportional to the
elastic constant per unit surface.
Direct measurements of the mechanical properties of
different types of human tissues have been reported in the
literature , however such measurements are not directly
available for the vocal tract tissues. Moreover, it is
difficult to estimate the lateral surface of the vocal
cavities that is required to compute the numerical value of
the impedance ZW according to the above derivation.
We, therefore, use a slightly different approach
proposed by Sondhi . He assumed that the vibrating
surface is proportional to the volume of the vocal cavities
and he estimated the mechanical properties of the vibrating
surface on the basis of acoustic measurements. The modeling
results match well with the direct measurements of the
mechanical impedance of human tissues performed by Ishizaka
et al. . The model can be formulated as follows,
Adjust the inductive reactance LWi in the ith elemental
section to match the observed first formant frequency
for the closed mouth condition of about 200 Hz  
S 0.0858 0.0858
1 V. Ax A.
Next, adjust the wall loss component RWi to match the
closed glottis formant bandwidths 
RW. = 130 T LW.
Choose a value of CWi to obtain a resonance frequency of
the wall compatible with the direct measurements of
Ishizaka et al. 
(2 n 30)2
3.1.c) Nasal Coupling
During the production of nasal consonants and nasalized
vowels the nasal cavity is acoustically coupled to the oral
cavity by lowering the soft palate and opening the
The coupling passage can be represented as a
constriction of variable cross-sectional area and 1.5 cm
long  that can be modeled by an inductor (to account for
the air inertance) in series with a resistor (to account for
viscous losses in the passage).
The wave propagation in the nasal tract can be modeled
in terms of the nasal tract cross-sectional area as
discussed in Section (3.l.a) for the vocal tract. However,
we will use a different equivalent circuit to better account
for the nasal tract losses as we will discuss later in
3.2) Excitation Modeling
3.2.a) Subglottal Pressure
The source of energy for speech production lies in the
thoracic and abdominal musculatures. Air is drawn into the
lungs by enlarging the chest cavity and lowering the
diaphragm. It is expelled by contracting the rib cage and
increasing the lung pressure. The subglottal or lung
pressure typically ranges from 4 cm H20 for the production
of soft sounds to 20 cm H20 or more for the generation of
very loud, high pitched speech.
During speech the lung pressure is slowly varying in
comparison with the acoustic propagation in the vocal
cavities. We can, therefore, represent the lung pressure
with a continuous voltage generator whose value is
controlled in time by an external parameter Ps(t).
3.2.b) Voiced Excitation
During speech, the air is forced from the lungs through
the trachea into the pharynx or throat cavity. On top of
the trachea is mounted the larynx (see Figure 3.5), a
cartilagineous structure that houses two lips of ligament
and muscle called the vocal cords or vocal folds.
The vocal cords are posteriorly supported by the
arytenoid cartilages (see Figure 3.5). Their position and
the dimension of the opening between them (the glottis) can
be controlled by voluntary movements of the arytenoid
For the generation of voiced sounds the vocal cords are
brought close to each other so that the glottal aperture
becomes very small. As air is expelled from the lungs,
strong areodynamic effects put the vocal cords into a rapid
oscillation. Qualitatively, when the vocal cords are close
to each other during the oscillation cycle, the subglottal
Figure 3.5. Cut-away view of the human larynx (from ).
VC vocal cords. AC arytenoid cartilages.
TC thyroid cartilage.
pressure forces them apart. This, however, increases the
air flow in the glottis and the consequent Bernoulli
pressure drop between the vocal cords approximates the vocal
cords again. In this way a mechanical "relaxation"
oscillator is developed which modulates the airflow from the
lungs into a quasiperiodic voiced excitation.
The vocal fold oscillation frequency determines
important perceptual characteristics of voiced speech and it
is called the "pitch" frequency or fundamental frequency,
The first quantitative studies of the areodynamics of
the larynx were carried out by van den Berg et al. who made
steady flow measurements from plaster casts of a "typical"
The first quantitative self-oscillating model of the
vocal folds was proposed by Flanagan and Landgraf  after
Flanagan and Meinhart's studies concerning source tract
interaction . The fundamental idea was to combine van
den Berg's results with the 2nd order mechanical model of
the vocal cords shown in Figure 3.6. An acoustic-mechanic
relaxation oxcillator was obtained.
Bilateral symmetry was assumed and only a lateral
displacement x of the masses was allowed. Therefore, only a
2nd order differential equation was needed to describe the
motion of the mass
Figure 3.6. One mass model of the vocal cords.
M x + B(x) : + K(x) x = F(t)
The mechanical damping B(x) and elastic constants K(x) are
properly defined functions of the vocal cord position x.
The forcing action F(t) depends on the air pressure
distribution along the glottis which was estimated according
to van den Berg's results. A modified version of the one
mass model was also designed by Mermelstein .
Even if the one mass model had been able to simulate
important physiological characteristics of vocal cord
vibration, it presented several inconveniencies.
1) The frequency of vocal fold vibration was too
dependent on the vocal tract shape, indicating too
great a source tract interaction.
2) The model was unable to account for phase
differences between the upper and lower edge of the
3) The model was unable to oscillate with a capacitive
vocal tract input impedance.
These difficulties were overcome by the two mass model
of the vocal cords that is shown in Figure 3.7. It was
designed by Ishizaka and Matsuidara  and first
implemented by Ishazaka and Flanagan [66-67]. A mechanical
coupling between the two masses is represented by the spring
constant kc The springs sl and s2 were given a non-linear
PS P11 P12 P21 P22 Ug P1
X.-T-./ \- -------- -/ i ----------
CONTRACTION GLOTTIS EXPANSION
Rc L RV1 Lgl R12 RV2 Lg2 Re
Ps P11 P12 P21 P22 P1
I I I I _II
CONTRACTION GLOTTIS EXPANSION
Figure 3.7. Two mass model of the vocal cords and
glottis equivalent circuit.
characteristic according to the stiffness measured on
excised human vocal cords. As in the case of the one mass
model the viscous damping was changing during the vocal
cords vibration period.
For computer simulation it is convenient to represent
the pressure distribution along the two masses with the
voltage values in an equivalent circuit. In Figure 3.7
resistance Rc accounts for the Bernoulli pressure drop and
"vena contract" effect at the inlet of the glottis.
Resistances RV, and RV2 model the viscous losses in the
glottis. Resistance R12 accounts for the pressure
difference between the two masses caused by the Bernoulli
effect. The inductors model the air inertance in the
The two mass model of the vocal cords, that has been
used in connection with vocal tract synthesizers [67-68],
uses as control parameters the glottal neutral area AgO and
the cord tension Q.
AgO determines the glottal area in absence of phonation
and it is physiologically related to the position of the
arythenoid cartilages. Q controls the values of the elastic
constant of the model and greatly affects the two mass model
oscillation period. The suitability of the two mass model
and of its control parameters for speech synthesis has been
further validated in  and .
The acoustic synthesizer that we have implemented uses
the two mass model to provide voiced excitation. We
therefore account for source tract interaction since the
current in the equivalent circuit of the glottis (see
Figure 3.7) is dependent on the voltage pl that models the
pressure in the vocal tract just above the vocal cords.
3.2.c) Unvoiced Excitation
Speech sounds are generally excited by modulating the
air flow through a constriction of the glottal and
supraglottal system. For voiced sounds this modulation is
obtained through rapid changes of the glottal constriction
as explained in the review section. For fricative sounds,
the modulation comes from flow instabilities which arise by
forcing the air through a constriction with a sufficiently
high Reynold's number. In this case the classical
hypothesis of separability between the source and the tract
greatly limits the realism that can be incorporated into the
synthesizer. In fact, unvoiced excitation is greatly
dependent on the constricted area of the vocal tract itself.
The fricative self-excitation of the vocal cavities was
first modeled by Flanagan and Cherry . The idea was to
use a resistor RNi and noise generator VNi in the equivalent
circuit of the ith elemental length of the vocal tract (see
Figure 3.8). The values of the resistor and of the noise
I N + I
S Li R i VN Li+l
S[ 1 LWi
Figure 3.8. Equivalent circuit of vocal tract elemental
length with fricative excitation.
generator variance depend on the Reynold's number of the
flow in the ith section . The spectrum of the turbulent
noise VNi can be assumed white with a good approximation
In our simulation we have modeled the fricative
excitation by means of two sources. One is always located
in the first vocal tract section to generate aspirated
sounds. The second is not bound to a fixed position but can
be moved along with the vocal tract constriction location.
3.3) Radiation Load
The radiation effects at the mouth and nostrils can be
accounted for by modeling the mouth and nostrils as a
radiating surface placed on a sphere (the head) with a
radius of about 9 cm.
Flanagan  has proposed a simplified equivalent
circuit for the radiation load model by using a parallel
combination of an inductor and a resistor with values
RR = 2' R 3wc
where a is the radius of the (circular) radiating surface
(mouth or nostrils). Titze  has shown that this
approximation, which we are using now, is valid also at
relatively high frequencies, when the speech wavelength has
the same order of magnitude of the mouth radius.
Our model is also able to account for the sound
pressure component that is radiated through the vibration of
the vocal cavity walls. The contribution to this component
from each elementary length of the vocal cavities is
represented as a voltage drop across a suitable impedance
 in series to the equivalent circuit of the yielding
wall defined in Section 3.1b.
3.4) Remarks and Other Acoustic Models
In this chapter we discussed the derivation of an
electrical circuit that models the sound propagation in the
vocal cavities. These considerations can be summarized in
The vocal and nasal tracts are represented by two
circular pipes with non-uniform cross-section (plane wave
propagation assumption). Their equivalent circuit (nasal
and vocal tract networks in Figure 3.9) are made by a chain
of elementary circuits. Each circuit models the wave
propagation as a short length of cavity according to the
derivation of Sections 3.l.a, 3.l.b, 3.2.c.
The two mass models of the vocal cords, its control
parameters, Ago and Q, and the glottal impedance have been
treated in detail in Section 3.2.b.
Sections 3.2.a, 3.1.c, 3.3 have been concerned with the
subglottal pressure Ps, the velar coupling ZV and the
S- VOCAL TRACT
I CORD I NETWORK
I MODEL I
- ;-- 11 i - -
Q(t) A 0 NC(t) A(x,t)
CORD REST NASAL AREA
TENSION AREA COUPLING FUNCTION
Figure 3.9. Equivalent circuit of the vocal cavities.
radiation impedances at the mouth and nostrils, which are
shown in Figure 3.9.
The approach to the acoustic modeling of the vocal
cavities that we have just reviewed is not the only one
reported in the literature. In Section 3.2.b we considered
the two mass models of the vocal cords. A more complete
model of vocal cord dynamics has been designed by Titze
. He divided each cord into two vertical levels, one
level corresponding to the mucous membrane, the other to the
vocalis muscle. Each level was further divided into eight
masses which were allowed to move both vertically and
horizontally. We did not use this model because its
simulation is computationally more expensive than Flanagan's
two mass models.
Different acoustic models of the vocal tract were
designed by Kelly and Lochbaum  and Mermelstein .
The latter has been recently implemented for an articulatory
However, these modeling approaches account for vocal
tract losses in a phenomenological way and they do not model
source tract interaction and fricative self excitation.
NUMERICAL SOLUTION OF THE ACOUSTIC MODEL
4.1) Requirements of the Numerical Solution Procedure
The software implementation of the acoustic model of
the vocal cavities that we have derived in the previous
chapter requires the solution of a system of ordinary
differential equations with assigned initial values.
In general we will use the notation
y'(t) = f(y(t),t)
y(0) = Yo
The numerical approach employed to solve the above problem
consists of approximating the solution y(t) as a sequence of
discrete points called mesh points. The mesh points are
assumed to be equally spaced and we indicate with h the time
interval between them. In other words, the numerical
integration procedure will give us a sequence of values yo,
yl' *"*Yn which closely approximate the actual solution y(t)
at the times tO = 0, tl = h, ...t = nh.
In the area of ordinary differential equations the
first step toward the solution of the problem is the
selection of that particular technique among the many
available which will serve the solution best.
In our specific case the most stringent requirement is
the stability of the numerical method. Since the
integration of (4.1.1) is going to involve a large number of
mesh points we need a method which, for a sufficiently small
step size h, guarantees that the perturbation in one of the
mesh values y does not increase in the subsequent values,
ym' m > n.
In the following discussion, as in , we use a "test
y'(t) = Xy(t)
where X is a complex constant. We introduce the concept of
an absolute stability region, which is the set of real,
nonnegative values of h and X for which a perturbation in a
value yn does not increase from step to step.
When the stability requirement is satisfied, we should
select the fastest integration method for our particular
application. This second requirement is very important
since we are dealing with a large system of differential
equations (about 100 differential equations of the 1st
order) and it takes about five hours to generate one second
of synthetic speech on our Eclipse S/130 minicomputer.
The integration speed is directly related to the step
size h of the numerical integration method. The larger the
step size the faster the integration. Unfortunately, the
precision of the numerical solution decreases when the step
size is increased and the method may become unstable. The
program to solve (4.1.1) must therefore implement an
automatic procedure for changing the step size to achieve
the maximum integration speed compatible with precision and
stability requirements. A variable step size is
particularly convenient in our case. In fact the time
constants of (4.1.1) change with the shape of the vocal
cavities and the motion of the vocal cords. A variable
control of the step size allows one to obtain the maximum
integration speed which is allowed by the method and by the
time variant differential equations. In view of these
requirements we have considered both a 4th order Runge-Kutta
method with control of the step size and order.
The Runge-Kutta method requires the computation of
derivatives at an higher rate than multistep methods.
However, it needs less overhead for each derivative
computation . In the next two sections the
characteristics of Runge-Kutta and multistep (also called
predictor-corrector) methods will be considered.
Section 4.4 will give a comparative discussion leading
to the selection of the Runge-Kutta method. Also, we will
describe a modification of the Runge-Kutta method, which
exploits the fact that wall vibration effects have larger
time constants than the propagation in the cavities. This
allows the reduction of the number of first order equations
to be solved by this method by almost a factor of two.
4.2) Runge-Kutta Methods
4.2.a) Derivation of the Method
Runge-Kutta methods are stable numerical procedures for
obtaining an approximate numerical solution of a system of
ordinary differential equations given by
y'(t) = f(y(t),t)
y(o) = Y
The method consists of approximating the Taylor expansion
y(t + h) = y(t ) + hy'(t ) + Y(t ) + ....(4.2.2)
so that, given an approximation of the solution at time to,
the solution at the next mesh point (to + h) can be
To avoid the computation of higher order derivatives,
it is convenient to express (4.2.1) in integral form as
y(t0 + h) = y(tO) + f o f(y(t),t)dt (4.2.3)
We can approximate the above definite integrals by computing
f(y(t),t) at four different points
(to,tO + h) by defining
of the interval
K1 = hf(y(to),to)
K2 = hf(y(to) + OKl'to + ha)
K3 = hf(y(to) + lK1 + Y1K2 t + a h)
K4 = hf(y + 2K1 + Y2K2 + 2K3' to + a2h)
and then setting
y(to + h) y(to) =
= i1K1 + P2K2 + P 3K3 + I4 K4
The problem is now to determine the a's, B's, y's, 62 and
y's so that (4.2.5) is equivalent to the Taylor expansion
(4.2.2) up to the highest possible power of h. We
substitute (4.2.5) into (4.2.3) and we choose the undefined
parameters so that the powers of hi (i = 0,4) have the same
coefficients as in (4.2.2).
We obtain a system of 8 equations in ten unknowns 
~1 + 2 + [3 + L4 = 1
2a2 + 3 al + P = 1/2
22 31 42
02a2 + 3al + P~4 = 1/3
3 3 3
12a3 + a31 + 4 = 1/4
3a 1Y1 + 14(a Y2 + a162) = 1/6
P3a2Y + 4(2Y2 + a262) = 1/12
3LaalY + (a Y2 + a62 )a = 1/8
P4 Y 6 = 1/24
4 1 2
which has two extra degrees of freedom that must be set
If we define a = a1 = 1/2, the solution of (4.2.6)
leads to the formulas of Kutta . If a = 1/2 and
62 = 1 we obtain Runge's formula  which is equivalent
to Simpson's rule
S f(t)dt = Z[f(to) + 4hf(tO + ) + f(toth)]
when y (t) = f(t).
We use a calculation procedure derived from (4.2.6) by
Gill  which minimizes the memory requirements and allows
us to compensate the round off errors accumulated at each
4.2.b) Control of the Step Size with Runge-Kutta
In the previous derivation we have emphasized how the
Runge-Kutta method approximates the Taylor expansion 4.2.2
up to the 4th power of h. It is therefore a fourth order
method with a local truncation error of order h5
5f (5) t-o h5 + O(h6)
This accuracy is obtained without explicitly computing the
derivatives of orders higher than one, at the expense of
four evaluations of the first derivative for each mesh
point. This is a disadvantage with respect to multistep
methods (to be discussed later) which uses fewer
computations of the first derivative to obtain the same
truncation error. The number of derivative evaluations per
step increases to 5.5 to obtain a variable control of the
step size with Runge-Kutta.
The step size, h, should in fact be chosen so that the
local truncation error is less than a certain maximum
acceptable value specified by the user. Unfortunately, the
truncation error cannot be directly estimated because the
Runge-Kutta procedure does not provide any information about
higher order derivatives.
A practical solution  is based on the results of
numerical integration with steps h and 2h, respectively,
i.e., the computation is performed a first time using hi = h
and then it is repeated using h2 = 2h.
Ch5 denote the truncation error using step
h2 = 2h
C h denote the truncation error using step
hI = h
y(2) denote the value "obtained" at (t + 2h)
using step h2 = 2h
Y(1) denote the value "obtained" at (to + 2h)
using step hI = h twice
Y denote the true value of y at time
(tt + 2h),
Y (2) C2 2
Y- y(1) = 2C1hl
But for small h, C1 = C2 (assuming that the sixth derivative
of f(y(t),t) is continuous) and therefore we obtain the
local truncation error estimate
~ (1) ( ^2)
Y Y(1) Y(15 Y2) (4.2.7)
If (4.2.7) is greater than a given tolerance, say el,
-the increment h is halved and the procedure starts again at
the last computed mesh point to. If it is less than el'
y(1)(to+h) and Y(l)(to + 2h) are assumed correct.
Furthermore, if it is less than el/50, the next step will be
tried with a doubled increment.
Unfortunately, on the average this method requires a
total of 5.5 function (derivative) evaluations as opposed to
four if the step size is not automatically controlled.
4.2.c) Order Selection
In Section 4.2.1 we derived the 4th order Runge-Kutta
method but Runge-Kutta methods of different orders could be
derived as well.
This observation leads to the question, "How does the
choice of the order affect the amount of work required to
integrate a system of ordinary differential equations?".
For example, small approximation errors can be more
efficiently achieved with high order methods, while for low
accuracy requirements lower order methods are to be
We would, therefore, like to have an automatic
mechanism for the selection of the order of the method.
This mechanism should evaluate the truncation error
corresponding to different integration orders and choose the
order which allows for the maximum step size and integration
speed compatible with the required precision.
Unfortunately, for the same reason discussed in
relation to the problem of step size control, namely the
absence of higher order derivative estimates, the Runge-
Kutta method does not provide an efficient procedure for
automatic order selection. Therefore we always use the
"standard" 4th order Runge-Kutta method.
4.3) Multistep Methods
4.3.a) Implicit and Explicit Methods
Those methods, like Runge-Kutta, which given an
approximation of y(t) at t = tn-1 (say Yn_-) provide a
technique for computing yn = (tn ) are called one step
methods. More general K-step methods require the values of
the dependent variables y(t) and of its derivatives at K
different mesh points tn-l, tn_2, .**tn-K to approximate the
solution at time tn.
The well known rules of "forward differentiation",
"backward differentiation" and trapezoidall rule" are one
step methods. They will be automatically considered in the
following discussion as particular cases.
The general expression of a multistep (K-step) method
Yn = (aiYn-i + -i ) + oYn
hy' = hf(y n,tn)
If 8 is equal to zero the method is explicit because it
provides an explicit way of computing yn and hy'n from the
values of y and its derivatives at preceding mesh points.
If 0 is different from zero, then (4.3.1) defines an
implicit multistep method because it is in general a non-
linear equation involving the function f(yn,t ) that must be
solved for the unknown yn'
4.3.b) Derivation of the Methods
We have considered the Adams-Bashforth and Adams-
Moulton  methods which are respectively explicit and
implicit methods with
a. = 0 if i 1
Both of these methods can be obtained from the integral
y(tn) = Y(tnl) + f f(y(t),t)dt
The integral is estimated by approximating f(y(t),t)
with an interpolating polynomial (for example, the Newton's
backward difference formula) through a number of known
values at t = tn-, tn-2, ..., tn-K in the explicit case or
through the values at times tn, tn-l' **tn-K for the
Therefore, for the explicit Adams-Bashforth case, the
equation (4.3.1) takes the form
Yn = y(tn-) + h Kif((t n),t ) (4.3.4a)
or with the equivalent representation in terms of finite
Yn = y(tn-l) + h K yVJ f(y(tnl),tn-l) (4.3.4b)
where the operator V is defined by
J A J-l J-1
Vf =V f f
m m m-1
V f A f
The values of Yn differ from the real solution y(tn) by
-a local truncation error which is of order (K + 1) in the
step size h
ErrorAdams-Bashforth = K h y(t)
The values of y 's and PKi's coefficients are available
directly from the literature .
In the implicit Adams-Moulton case equation (4.3.1)
takes the form
y = y(t-1 ) +
* f(y(t ), t-i)
K,i n-i n-i
or with the equivalent representation in terms of backward
S= Y(t)) + h y* V f(Y(t ), t)
where the V operator has been defined in (4.3.5).
In (4.3.8) the value of Yn differs from y(t ) by a
local truncation error that is of the order K + 2 in the
step size h.
=Error K+2 (K+2)(t) (4.3.9)
The y*'s and P* 's coefficient values are available from the
literature . In particular the one step Adams-Bashforth
method corresponds to the forward differentiation rule while
the zero and one step Adams-Moulton methods are the backward
and trapezoidal rule respectively.
4.3.c) Characteristics of Multistep Methods
An important characteristic of multistep methods is
that they require only one computation of the derivative for
each step as can be seen from equation (4.3.1). This is a
great advantage over the Runge-Kutta method that requires at
least four computations of the functions f(y(t),t) and it
has been the motivation for our experimentation with
Another feature of multistep methods is that they allow
for a rather efficient implementation of automatic control
of the step size and order of the method itself.
A complete treatment of this subject would require too
long a discussion. Our intuitive explanation can be
obtained by observing that the step size and order selection
require an estimate of the local truncation error in terms
of different step sizes and integration orders. The order
that allows for the largest stepsize compatible with the
user defined upper limit for the truncation error is then
The local truncation error in the Adams-Bashforth and
Adams-Moulton methods (see (4.3.6) and (4.3.9)) is related
to high order derivatives which can be easily obtained in
terms of the same backward differences
VJhf(y(tn),tn) = VJhy' = hJ+ly(j+l)
that are used in the implementation method itself (see
(4.3.4) and (4.3.8)). A comparison between the explicit and
implicit multistep methods is necessary to complete this
One difference between the methods, is that the y*
coefficients of the implicit methods are smaller than the y
coefficient of the explicit methods. This leads to smaller
truncation errors for the same order for the implicit case
(see (4.3.6) and (4.3.9)).
Another advantage of the implicit methods is that
K-step methods have a truncation error of order (K+2) in the
step size h (see (4.3.9)) to be compared with a truncation
error of order (K+l) for the explicit method (see
(4.3.6)). The reason for this fact is evident when we
consider that (4.3.7) has (K+l) coefficients i,K, i = O,K
while in the explicit method (4.3.5) 80,K has been set to
A disadvantage of implicit methods is that the non-
linear equation (4.3.7) in the unknown y must be solved
iteratively. Usually a first "guess" of Yn is obtained by
-means of an explicit method and then (4.3.7) is iterated.
However, for reasonable values of the step size h, no more
than two or three iterations are usually required, and this
extra effort is more than compensated for by the better
stability properties of the implicit methods. In fact with
respect to the "test" equation y' = Xy, the range of h
values for which implicit methods are stable is at least one
order of magnitude greater than in the explicit case .
Since the truncation errors of implicit methods are smaller,
the implicit methods can be used with a step size that is
several times larger than that of the explicit method. The
allowed increase in step size more than offsets the
additional effort of performing 2 or 3 iterations.
4.4) Method Selection
We have implemented the numerical simulation of the
acoustic model of the vocal cavities by means of both Runge-
Kutta and implicit multistep methods . The Runge-Kutta
method runs about 2 = 3 times faster than the Adams-Moulton
method. This fact is at first rather surprising since we
have seen in the previous two sections that the Runge-Kutta
method requires a larger number of derivative evaluations
for each integration step.
However, in our case the evaluation of the derivatives
is not extremely time consuming. In fact, with the
exception of four state variables describing the motion of
the two mass models, the remaining system of differential
equations is essentially "uncoupled" (or characterized by a
"sparse matrix"), thanks to the "chain" structure of the
vocal cavity acoustic model that can be immediately observed
from Figure 3.9. In these conditions the high number of
derivative evaluations of the Runge-Kutta method is more
than compensated for by the limited overhead in comparison
with Adams predictor-corrector method .
We have modified the integration method to take
advantage of the dynamic properties of the vibrating walls.
From Figure 3.8, which represents the equivalent
circuit of the ith elemental section of the vocal tract as
discussed in Chapter 3, we have
_dt 1 (4.4.1)
dt C. (u (t) i+t) uwi(t))
dt i il
dt (p i(t) Pi(t) VN. RN. ui(t))
dt L (Pi (t) v- RWi uWi(t))
dt CW. UWi(t) (4.4.4)
The first two equations represent the pressure and
volume velocity propagation in the ith elemental length of
the vocal tract while (4.4.3) and (4.4.4) model the wall
The dynamics of uwi(t) and vi (t) in (4.4.3) and
(4.4.4), are characterized by the time constants (see
L 1 1 1
RW. 130*w' CW LW. 2Tr*30
which are very large with respect to the time of the wave
propagation in each elemental length of the vocal tract.
In fact, if we divide the vocal tract into 20 elemental
sections, of approximately 0.875 cm each, the time of wave
propagation in each is 2.5 10-5 sec. This time gives the
order of magnitude of the largest step size for the
integration of equations (4.4.1) and (4.4.2) that, in fact,
is usually achieved with variable step sizes between
2.5 10-5 and 1.25 10-5 sec.
On the other hand equations (4.4.3) and (4.4.4) may
employ a larger integration step.
We, therefore, integrate equations (4.4.1) and (4.4.2)
together with the two mass model equations using a Runge-
Kutta method with variable control of the step size and
assuming uWi(t) in (4.4.1) constant during this procedure.
Every 5.10-5 seconds, i.e., at a frequency of 20 KHz, we
update the values of uWi(t) and vWi(t) by means of a simple
backward differentiation rule based on equations (4.4.3) and
At this time we also update the turbulent noise source
VNi and RNi according to the Reynold's number of the flow in
the cavity as explained in Section 3.2.c to provide a
In this way we halve the number of derivatives that
must be computed by the Runge-Kutta method to account for
vocal tract propagation and we save about 50% of the
However, the numerical procedure is still correct, as
we will show in Chapter 6 where several effects associated
with cavity wall vibration are simulated.
THE ARTICULATORY MODEL AND ITS
INTERACTIVE GRAPHIC IMPLEMENTATION
The acoustic characteristics of the vocal cavities,
that we have modeled by means of an equivalent circuit in
Chapter 3, are greatly affected by the geometrical
configuration of the vocal cavities themselves. Therefore,
the acoustic and perceptual properties of speech depend on
the position of the lips, tongue, jaws and of the other
vocal organs that determine the shape of the vocal tract.
The physiological mechanism of speech production or
"articulation" involves precisely timed movements of the
vocal organs to produce the acoustic wave that we perceive
as connected speech.
This chapter is concerned with the definition and
implementation of a geometric or articulatoryy" model of the
vocal tract that can be used to describe the configuration
of the vocal cavities during speech production.
5.1) Definition of the Articulatory Model
All the articulatory models presented in the literature
[78-81] are two dimensional representations of the vocal
cavities which closely match the midsagittal section of the
vocal tract, even if they do not resolve individual
muscles. The articulatory models that have been designed by
Coker  and Mermelstein , are probably the most
suitable for speech synthesis applications, since their
configuration is determined by a small number of control
Figure 5.1 shows the articulatory model that has been
designed by Mermelstein. We can distinguish between a fixed
and a movable structure of the model. The fixed structure
consists of the pharyngeal wall (segments GS and SR in
Figure 5.1), the soft palate (arc VM), and hard palate (arc
MN) and the alveolar ridge (segment NV).
The configuration of the movable structure is
determined by external control parameters, that we call
articulatory parameters and that are represented in
Figure 5.1 by arrows. For example, the tongue body is drawn
as the arc of a circle (PQ in Figure 5.1) whose position is
determined by the coordinates x and y of its center. Other
parameters are used to control the location of the tip of
the tongue, of the jaws, of the velum, of the hyoid bone and
the lip protrusion and width.
Mermelstein has shown that this model can match very
closely the midsagittal X-ray tracings of the vocal tract
that have been observed during speech production . The
model of Figure 5.1 can therefore be used for speech
Figure 5.1. Articulatory model of the vocal cavities.
synthesis if we can estimate the cross-sectional area of the
vocal tract (area function). The area function, in fact,
can be used to derive the equivalent circuit of the vocal
cavities, as discussed in Chapter 3.
In practical terms we superimpose a grid system, as
shown in Figure 5.2, on the articulatory model to obtain the
midsagittal dimensions of the vocal cavities at different
points, and then we convert this information into cross-
sectional area values by means of analytical relationship
defined in the literature .
We use a variable grid system, dependent on the tongue
body position, to make sure that each grid in Figure 5.2 is
always "almost" orthogonal to the vocal tract center line
regardless of the model configuration; to further correct
unavoidable misalignment of each grid, we multiply the
cross-sectional area estimate by the cosine of the angle a
(see Figure 5.2).
5.2 The Graphic Editor
Our computer implementation of the articulatory model
of the vocal cavities has been designed to be fast and easy
Traditionally, human-computer interaction employs
textual (alphanumeric) communication via on-line keyboard
terminals. This approach is satisfactory for many
Figure 5.2. Grid system for the conversion of mid-sagittal
dimensions to cross-sectional area values.
applications but is being replaced by menu selection, joy
stick cursor, a light pen, touch sensitive terminals, or
The conventional keyboard entry method is particularly
cumbersome if the data structure is not easily manipulated
via an alphanumeric selection process. Such an example
arises with pictorial or graphic images, as in computer
aided design. Here the user may communicate with the
computer by means of a graphic model. The system interprets
the model, evaluates its properties and characteristics, and
recognizes the user's changes to the model. The results are
presented graphically to the operator for further
interactive design and test.
Using a similar approach we have implemented on a
Tektronix 4113 graphics terminal interfaced to a DGC Eclipse
S/130 minicomputer an "interactive -graphic editor" that is
used to manipulate the articulatory model.
The user may alter the configuration of the model by
means of a simple interaction. Each articulatory parameter
of Figure 5.1, and the corresponding vocal organ's position,
can be set to the desired value by means of the graphic
cursor of the Tektronix 4113 terminal. This allows a rapid
definition of the desired articulatory configuration.
But the power of an interactive graphic system lies in
its ability to extract relevant information from the model
for further analysis and processing. The acoustic
properties of the vocal tract are determined, as discussed
in Chapter 3, by its area function, which is the cross-
sectional area of the cavity, as a function of the distance,
x, from the larynx.
When the user has defined a new articulatory parameter
value by means of the cross-hair cursor, the system
estimates the area function of the vocal tract by means of
the grid system of Figure 5.2. The resonance or formant
frequencies are also estimated. This information is
immediately displayed (see Figure 5.3) for the user as a
first order approximation of the acoustic properties of the
graphic model of the vocal tract.
The interaction cycle is shown in Figure 5.4. Commands
are available not only to modify the displayed vocal tract
shape but also to store and read it from a disk memory.
These commands are useful not only to generate a "data base"
of vocal tract configurations, but also to create back up
files before using the interactive graphic commands.
But the interaction depicted in Figure 5.4 is not
sufficient to define a specific articulatory pattern. In
fact, the articulation of connected speech is a dynamic
process which consists of precisely timed movements of the
vocal organs. We have introduced the temporal dimension in
the system by means of an animation frame technique.
FIRS FOR FRAT
Figure 5.3. The articulatory model implemented on the
Tektronix 4113 graphic terminal.
THE USER INDI RTES THE NAME
OF THE ARTICULATORY PATTERN
THE USER MODIFIES THE
DISPLAYED VOCAL CAVITY
THE SYSTEM COMPUTES AND
b DISPLAYS THE AREA FUNCTION
< ND THE FORMANT FREQUENCIES
THE USER ISSUES SWT COMMAND
Interaction cycle for the generation of
an animated articulatory pattern.
a sign on time. b interaction cycle.
c Store With Time command.
When the user signs on (Figure 5.4a), he is asked to
indicate the name of the articulatory pattern that he wishes
to edit. The user may, at this point, create a new file or
modify an existing file. The user then interacts with the
model (Figure 5.4b) until he has achieved the desired vocal
Next, a "Store With Time" (SWT) command is issued
(Figure 5.4c). This attaches a time label to the displayed
vocal tract configuration, which is also memorized as a
"frame" of the articulatory pattern that the user has
indicated at sign-on. Another interaction cycle is then
entered, which will lead to the definition of another frame.
The frames defined by the SWT command appear as
"targets" which must be reached at the indicated time. The
articulatory model is guided between consecutive targets by
means of an interpolation algorithm to achieve smooth
transitions. This is particularly important, so that the
articulatory model of the vocal cavities may be interfaced
with the speech synthesizer; since a continuous variation of
the synthesizer input parameters is required to obtain good
5.3) Display of the Time Variations of the Model
The interaction cycle described above generates an
articulatory pattern by means of an animation frame
technique, which is used as an input to the speech
synthesizer. However, during the interaction cycle, only
the particular frame being manipulated is visible on the
terminal display. Consequently, the user has difficulty
visualizing the global time varying characteristics of the
articulatory pattern. To overcome this disadvantage, we use
an on-line animation of the model.
The animation frames are computed by means of
interpolation and stored in the memory of the 4113 terminal
as graphic segments. Then each frame is briefly displayed
in sequence, creating the animation effect.
Figure 5.5 shows a typical animation frame. Here the
contour filling capability of the terminal is not used,
allowing higher display frequency. Using this technique, we
are able to obtain a live animation effect with only a
slight flickering phenomenon. The maximum frame display
frequency is about 5 Hz.
We may also view the movements of the vocal organs,
defined with the graphic editor, in three dimensions, as may
be seen in Figure 5.6. This effect is achieved by using
many consecutive animation frames as sections of a three
dimensional object, with the third dimension being time.
The advantage of this technique is that the time
evolution of the model can be observed at a glance;
moreover, a three dimensional rigid rotation allows the user
Figure 5.5. A typical animation frame.
Figure 5.6. Three dimensional views of the vocal tract.
to choose the most convenient view angle. Different colors
(one every five frames) are used to. mark the specific time
Figure 5.6 also shows that the hidden lines have been
removed. This is achieved very efficiently by means of the
contour filling capability of the terminal. In this 3-D
representation all the frames belong to planes parallel to
each other. It is, therefore, very simple to determine for
a given angle of rotation, which frame is in "front" and
which one is "behind". To remove the hidden lines the
contour capability of the terminal is used with the "ink
eradicator" color, starting from the frame which is the
farthest from the observer.
5.4) Simultaneous and Animated Display
of the Articulatory and Acoustic Characteristics
of the Vocal Cavities
When a certain articulatory pattern has been edited
with the interactive graphic model, we may estimate the
corresponding acoustic events, e.g., the speech waveform,
the pressure and air volume-velocity distribution in the
vocal cavities, the motion of the vocal cords, the vibration
of the cavity walls, etc. by means of the acoustic model
defined in Chapter 3.
Figure 5.7 shows the final configuration of the
system. The articulatory or "muscular" events generated by
W VN COMPUTER MOVIE.
JSTIC MODEL t D-ISPLAY OF
OF THE ARTICULATORY
aL CAVITIES AND ACOUSTIC
The ariculatory events generated by the graphic
editor are input into the acoustic model of the
Both articulatory and acoustic events are later
displayed with a computer generated movie.
Figure 5.7. System configuration.
the graphic model is the (off-line) input to the
synthesizer, which computes the corresponding acoustic
waveforms. Later a simultaneous and animated representation
of the articulatory and acoustic events is displayed on the
Tektronix 4113 terminal.
Figure 5.8 illustrates a typical frame of a sample
animation. The figure below the vocal tract model
represents the two vocal cords. Each cord (left and right)
is schematically represented by two masses as proposed by
Flanagan . The graphs on the right part of the screen
represent different acoustic events over the same time
Both the vocal tract and vocal cord models are
animated. During the animation a sliding green line runs
along the borders of the graphs to "mark the time". This
assists the viewer in relating the information displayed in
the graphs to the vocal cord and the vocal tract motion. A
live animation effect is obtained in this manner.
Figure 5.8. A typical animation frame including the
vocal tract and the vocal cords. The
various data waveforms calculated by
the model are also shown.
6.1) Speech Synthesis
As explained in Chapter 2, linear prediction and
formant synthesis are based on a rather approximate model of
speech production. However, the quality of the synthetic
speech may be very good because the synthesis algorithm uses
the information derived from an analysis phase of natural
speech that captures the most important perceptual features
of the speech waveform.
On the contrary, articulatory synthesis employs a more
detailed model of the human speech production mechanism, but
cannot exploit a good analysis algorithm to derive the
articulatory information directly from natural speech.
In this section we want to show that the amount of
physiological detail captured by our articulatory and
acoustic models of the vocal cavities is sufficient for the
generation of good quality English sentences.
In fact, Figure 6.1 shows the spectrogram of the
sentence "Goodbye Bob" that we have synthesized with our
computer programs. The quality of this sample compares
favorably with respect to other synthesis techniques.
Figure 6.1. Spectrogram of "Goodbye Bob", synthetic.
Spectrogram of "Goodbye Bob",
As the first step of the synthesis procedure we should
obtain the spectrogram of natural speech, which is shown in
Figure 6.2. We do not attempt to faithfully match the
synthetic spectrogram with its natural counterpart.
However, the natural spectrogram is useful to obtain a good
estimate of the required duration of each segment of the
The articulatory information is obtained, in a rather
heuristic way, from phonetic considerations and from X-ray
data available in the literature  [82-83]. For example,
we know that a labial closure is required for the production
of /b/ and /p/ consonant, or that the tongue position must
be "low" and "back" for the production of the /a/ sound.
Using this linguistic knowledge, we can therefore use
the "graphic editor" described in Section 5.2 to define the
articulatory configurations that are necessary to synthesize
the desired sentence.
As described in Sections 3.2a and 3.2b, the subglottal
and vocal cord models are controlled by three parameters:
glottal neutral area Ago, cord tension Q and subglottal
We set the glottal neutral area to 0.5 cm2 or 0.05 cm2
for the generation of unvoiced or voiced synthetic speech
respectively. The values of the cord tension and subglottal
pressure can be estimated after a pitch and short time
energy analysis of natural speech  .
The procedure for the definition of the time evolution
of the articulatory model that we have described above is,
however, rather "heuristic". After a first trial,
adjustments of the articulatory model configuration are
usually necessary to improve the quality of the synthetic
In our opinion a development of this research should be
- the definition of vocal tract evolution for different
English allophones, as a first step toward an automatic
speech synthesis by rule system based on an articulatory
model. The solution of this problem is not at all
trivial. Section 6.7 illustrates the difficulties and
reviews part of the literature related to this subject.
6.2) Source Tract Interaction
The classical source-tract speech production model that
we have discussed in Section 2.1 is based on the assumption
that the glottal volume velocity during speech production is
independent of the acoustic properties of the vocal tract.
Evidently this source-tract separability assumption holds
only as a first order approximation. In fact the glottal
volume velocity depends on the transglottal pressure that is
related to the subglottal and vocal tract pressure.
The effects of source-tract interaction have been
modeled and analyzed by Geurin , Rothenberg ,
Ananthapadmanaba and Fant . Yea  has carried out a
perceptual investigation of source tract interaction using
Guerin's model to provide the excitation of a formant
Source tract interaction, which is well represented by
the model discussed in Chapter 3, can be discussed with
reference to Figure 6.3. The glottal volume velocity UG
depends not only on the subglottal pressure Ps and on the
glottal impedance ZG, that is varying during the glottal
cycle, but also on the vocal tract input impedance Zin'
Source-tract separability holds only if the magnitude of Zin
is much smaller than the magnitude of ZG, since in this case
ZG and Ps are equivalent to an ideal current generator.
Therefore the amount of source tract interaction depends on
the magnitude of ZG with respect to Zin.
We have experimented with different amounts of source
tract interaction using the following procedure.
At first we have synthesized the word "goodbye" using
our model of the vocal tract and of the vocal cords. The
obtained glottal volume velocity is shown in the middle part
of Figure 6.4.
Then, to reduce source tract interaction, we have used
the same vocal tract configuration, but we have multiplied
by a factor of two the glottal impedance throughout the
entire synthesis of the word "goodbye". We have,
sT Zin TRACT R
Figure 6.3. Source-tract interaction model.
P subglottal pressure. U glottal
volume velocity. Z glottal impedance.
S\ Too /
Cou \ Much --
> WI a4I I --'---- i ,---1 ,-I
1i .4-. t
-1' Just /,,',-'-
\ ight ti
!- /" \ U- "
*o, ---.--,.---- -- ----.,____
,-- j \ little
Figure 6.4. Three different glottal excitations.
8 .1 .2 .3 .4 SEC
Good b y e
Figure 6.5. Spectrogram of "good bye", synthetic.