
Title: The incorporation of glottal source-vocal tract interaction effects to improve the naturalness of synthetic speech
Permanent Link: http://ufdc.ufl.edu/UF00082179/00001
 Material Information
Title: The incorporation of glottal source-vocal tract interaction effects to improve the naturalness of synthetic speech
Physical Description: vi, 130 leaves : ill. ; 28 cm.
Language: English
Creator: Wong, Chun-Fan, 1961-
Publication Date: 1991
Subject: Speech synthesis   ( lcsh )
Speech processing systems   ( lcsh )
Glottis -- Mathematical models   ( lcsh )
Vocal cords -- Mathematical models   ( lcsh )
Formants (Speech) -- Computer simulation   ( lcsh )
Electrical Engineering thesis Ph. D
Dissertations, Academic -- Electrical Engineering -- UF
Genre: bibliography   ( marcgt )
non-fiction   ( marcgt )
Thesis: Thesis (Ph. D.)--University of Florida, 1991.
Bibliography: Includes bibliographical references (leaves 124-129).
Statement of Responsibility: by Chun-Fan Wong.
General Note: Typescript.
General Note: Vita.
 Record Information
Bibliographic ID: UF00082179
Volume ID: VID00001
Source Institution: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: aleph - 001683422
oclc - 25034730
notis - AHZ5393









ACKNOWLEDGMENTS

I would like to thank my advisor and committee

chairman, Dr. Donald G. Childers, for giving me the

opportunity to do research in the Mind-Machine Interaction

Research Center, for his invaluable guidance and assistance,

and for bearing with me during the course of my research. I

am also thankful for his giving me some financial support.


I am also grateful to Dr. Leon W. Couch, II, for his

discussion of my research subject, and for taking the time

to serve on my committee.

I would also like to thank Dr. A. Antonio Arroyo, Dr.

Jose C. Principe, and Dr. Howard B. Rothman for their

invaluable time and interest in serving on my supervisory committee.




TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

1 INTRODUCTION

The Structure of Speech
The Conventional Speech Production Model
Speech Analysis/Synthesis
Introduction to Source-Tract Interaction

2 SOURCE-TRACT INTERACTION

Introduction
Speech Production Models
Effects of Source-Tract Interaction
Literature Review
Glottal Impedance
Simulation Experiments
Theoretical Studies
Summary of All Source-Tract Interaction Effects

3 INCORPORATION OF SOURCE-TRACT INTERACTION

Introduction
Modeling Source-Tract Interaction
Source-Tract Interaction Incorporation
Measurement of Model Parameters
Electroglottograph (EGG)
Glottal Inverse Filtering
Literature Review
Automatic Glottal Inverse Filtering
Source Models

4 IMPLEMENTATION AND RESULTS

Introduction
Experimental Data Base
Implementation of Proposed Speech Production Model
Correction of Distorted Speech Signals
Inverse Filtering Results
Modeling of Glottal Flow Waveform

5 DISCUSSION

Summary
Future Research Directions
Further Work

APPENDIX

REFERENCES

BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
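Requirements for the Degree of Doctor of Philosophy

THE INCORPORATION OF GLOTTAL SOURCE-VOCAL TRACT INTERACTION EFFECTS
TO IMPROVE THE NATURALNESS OF SYNTHETIC SPEECH

By

Chun-Fan Wong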

May, 1991

Chairman: Dr. D.G. Childers
Major Department: Electrical Engineering

In the conventional speech production model, voiced

sounds are generated by exciting the vocal tract filter with

a train of quasi-periodic glottal pulses. The glottal

source and the vocal tract are assumed to be linearly

separable. A fixed stylized glottal pulse shape is adopted

as the source for all voiced sounds. Speech synthesized

with this approach can be highly intelligible, but often

does not sound natural.

It is believed that glottal source-vocal tract

interaction, or simply source-tract interaction, is

important for synthesizing natural sounding speech. The

glottal source and the vocal tract interact with each other,

and different glottal pulse shapes should be used for

different voiced sounds.

A number of source-tract interaction effects have been

identified. The interaction causes the volume-velocity

signal to be skewed to the right with respect to the glottal

area. A ripple in the volume-velocity waveform may appear,

which is thought to be caused primarily by the vocal tract

first-formant frequency. The ripple is a result of glottal

damping during the glottal open interval.

To incorporate the source-tract interaction effects

into the conventional speech production model, a variable

glottal pulse model was proposed. The glottal pulse shape

is defined by a few parameters, and represents the smoothed

glottal volume-velocity signal. The ripple effect was

approximated by increasing the first-formant bandwidth of

the closed-phase vocal tract filter during the glottal open interval.


An analysis/synthesis system was implemented to test

the performance of the proposed model. Automatic glottal

inverse filtering was used to obtain the glottal

volume-velocity waveform and the closed-phase vocal tract

filter. A least squared error fit was performed on the

glottal volume-velocity waveform to obtain the glottal

source parameters. Speech synthesized with source-tract

interaction sounds more natural than speech synthesized

without source-tract interaction. Hence, source-tract

interaction should be used to generate natural sounding speech.


1 INTRODUCTION

Speech is the most natural way for human beings to communicate,

and it is a very efficient process. With the rapid advancement of computer technology,

speech communication between human and machine is also

becoming feasible. This has the advantage of freeing our

hands and eyes to do other important tasks. Also, a higher

input rate than typing can be achieved.

Digital techniques and computer technology have opened

vast possibilities for machine assistance to humans. Not

the least of these is an immense potential for storage of

and rapid access to voluminous amounts of information.

Automatic voice readout of computer-stored data allows the

machine to meet the individual on human terms. Using voice

response, the machine can provide sophisticated information

in a form easily assimilated by the human user. Moreover,

this information can be accessed and transported to the

human user over ubiquitous telephone circuits.

Typical uses of computer voice output are automatic

information services, computer-based instruction, reading

machines for the blind, and spoken status reports from aircraft

and space-vehicle systems.

At present, the intelligibility of synthetic speech

attainable from the conventional linear speech production

model is quite high, but naturalness still needs to be

improved. The research objective of this dissertation is to

improve the naturalness of the synthetic speech, using a

more accurate model of speech production that includes

source-tract interaction.

The Structure of Speech

(Rabiner and Schafer, 1978)

Speech signals are composed of a sequence of sounds.

These sounds and the transitions between them serve as a

symbolic representation of information. The arrangement of

these sounds (symbols) is governed by the rules of language.

The study of these rules and their implications in human

communication is the domain of linguistics, and the study

and classification of the sounds of speech is called phonetics.


Speech sounds can be classified into three distinct

classes according to their mode of excitation. Voiced

sounds are produced by forcing air through the glottis with

the tension of the vocal folds adjusted so that they vibrate

as a relaxation oscillation, thereby producing

quasi-periodic pulses of air that excite the vocal tract.

Unvoiced sounds are generated by forming a constriction at

some point in the vocal tract (usually toward the mouth),

and forcing air through the constriction at a high enough

velocity to produce turbulence. This creates a

broad-spectrum noise source to excite the vocal tract.

Plosive sounds result from making a complete closure (again,

usually toward the front of the vocal tract), building up

pressure behind the closure, and abruptly releasing it.

Most languages, including English, can be described in

terms of a set of distinctive sounds, or phonemes. In

particular, for American English, there are about 42

phonemes including vowels, diphthongs, semivowels and

consonants. Each of the phonemes can be classified as

either a continuant, or a noncontinuant sound. Continuant

sounds are produced by a fixed (non-time-varying) vocal

tract configuration excited by the appropriate source. The

class of continuant sounds includes the vowels, the

fricatives (both unvoiced and voiced), and the nasals. The

remaining sounds are produced by a changing vocal tract

configuration. These are therefore classed as noncontinuants.


Vowels are produced by exciting a fixed vocal tract

with quasi-periodic pulses of air caused by vibration of the

vocal folds. If the nasal tract is effectively coupled to

the vocal tract during the production of a vowel, the vowel

becomes nasalized. Each vowel sound can be characterized by

the vocal tract configuration that is used in its

production. An alternative representation is in terms of

the resonance frequencies of the vocal tract (formants).

The Conventional Speech Production Model

A linear model of speech production was developed by

Fant in the late 1950's (Fant, 1960). Its discrete-time

version is shown in Figure 1-1. The speech signal is

modeled as the output of a linear quasi-time-invariant

filter excited by quasi-periodic pulses for voiced sounds,

or by random noise for unvoiced sounds. Acoustic theory

shows that the transmission characteristics of the vocal

tract are well approximated by a cascade of uncoupled

resonators and antiresonators whose bandwidths and center

frequencies may be independently controlled.
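
As an illustration of the cascade structure described above, the following Python sketch excites a chain of second-order digital resonators with an impulse train. It is a minimal sketch under stated assumptions and not the synthesizer used in this work: the resonator difference equation follows the form given by Klatt (1980), and the function names, sampling rate, pitch period, and formant values are hypothetical and chosen only for illustration.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at DC)."""
    c = -math.exp(-2.0 * math.pi * bw_hz / fs_hz)
    b = 2.0 * math.exp(-math.pi * bw_hz / fs_hz) * math.cos(
        2.0 * math.pi * freq_hz / fs_hz)
    a = 1.0 - b - c
    return a, b, c

def resonate(x, freq_hz, bw_hz, fs_hz):
    """Pass the sequence x through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = a * xn + b * y1 + c * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out

def impulse_train(n_samples, period):
    """Idealized voiced excitation: one unit impulse per pitch period."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def synthesize_voiced(formants, n_samples, period, fs_hz):
    """Cascade the resonators; formants is a list of (frequency, bandwidth) in Hz."""
    signal = impulse_train(n_samples, period)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs_hz)
    return signal

# Assumed, illustrative values: 10 kHz sampling, 100 Hz pitch, three formants.
vowel = synthesize_voiced([(700.0, 60.0), (1200.0, 90.0), (2500.0, 150.0)],
                          n_samples=500, period=100, fs_hz=10000.0)
```

Because each resonator has its own center frequency and bandwidth, changing one pair in the list leaves the other resonators untouched, which is the independent control referred to above.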

Speech Analysis/Synthesis

Efforts in speech analysis and synthesis frequently

aim at the efficient encoding and transmission of speech

information. Another important motivation is to acquire

basic understanding of speech communication. Speech

synthesis systems also play a fundamental role in learning

about the process of human speech production.

Speech synthesis is the process of producing an

acoustic signal by controlling the model for speech

production with an appropriate set of parameters. If the

model is sufficiently accurate and the parameters are

accurately determined, the resulting output of the model is

in some cases indistinguishable from natural speech.




Figure 1-1. General discrete-time model of speech
production. (From Rabiner and Schafer, 1978)

Most synthesizers use speech bandwidths between 3 kHz

(for telephone applications) and 5 kHz (for higher quality).

The perception of some consonants is slightly impaired if

frequencies between 3 and 5 kHz are omitted. Frequencies

above 5 kHz are perhaps useful to improve the speech clarity

and naturalness, but do little to aid speech understanding.

Current speech synthesizers represent trade-offs among the

conflicting demands of maximizing speech quality while

minimizing memory space, algorithmic complexity, and

computation time.

Speech analysis is simply the process of estimating

the (time-varying) parameters of the model for speech

production from a speech signal that is assumed to be the

output of that model. Continuous speech is usually analyzed

by performing these operations on short segments of speech

which are selected at equally-spaced time intervals,

typically 10-20 msec apart.

Spectral analysis of speech came of age, so to speak,

with the development of the Sound Spectrograph. This device

provides a convenient means for permanently displaying the

short-time spectrum of a sizeable duration of signal. In

this device, a short speech utterance repeatedly modulates a

variable frequency oscillator. The modulated signal is

input to a bandpass filter. The average energy in the

output of the bandpass filter at a given time and frequency

is a crude measure of the time-dependent Fourier transform.

This energy is recorded by an ingenious electromechanical

system on Teledeltos paper. The result, called a

spectrogram, is a two-dimensional representation of the

time-dependent spectrum in which the vertical dimension on

the paper represents frequency and the horizontal dimension

represents time. The spectrum magnitude is represented by

the darkness of the marking on the paper. If the bandpass

filter has a wide bandwidth (300 Hz) the spectrogram

displays good temporal resolution and poor frequency

resolution. On the other hand, if the bandpass filter has a

narrow bandwidth (45 Hz), the spectrogram has good frequency

resolution and poor time resolution.
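
The trade-off between the two filter settings can be made concrete with a rule of thumb that is assumed here and is not taken from the text: a bandpass filter of bandwidth B Hz averages the signal over roughly 1/B seconds. The helper names below and the 100 Hz pitch used in the usage note are hypothetical.

```python
def effective_window_ms(bandwidth_hz):
    """Approximate averaging time of the analysis filter (rule of thumb: 1/B)."""
    return 1000.0 / bandwidth_hz

def resolves_pitch_pulses(bandwidth_hz, f0_hz):
    """True if the averaging time is shorter than one pitch period."""
    return effective_window_ms(bandwidth_hz) < 1000.0 / f0_hz

def resolves_harmonics(bandwidth_hz, f0_hz):
    """True if the filter is narrower than the harmonic spacing (f0)."""
    return bandwidth_hz < f0_hz
```

For a voice with a 100 Hz pitch, the 300 Hz filter averages over about 3.3 msec, so individual pitch pulses are resolved but the harmonics are not; the 45 Hz filter averages over about 22 msec, so the harmonics are resolved but the pitch pulses are smeared together.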

With the advance of the digital computer, more

sophisticated digital signal processing techniques have been

developed to analyze the speech signal. One of the most

powerful speech analysis techniques is the method of linear

predictive analysis. This method has become the predominant

technique for estimating the basic speech parameters, e.g.,

pitch, formants, spectra, vocal tract area functions, and

for representing speech for low bit rate transmission or

storage. Simplicity of the model and the efficient

non-iterative algorithm for determination of the model

parameters are the main reasons for its use in many speech

applications. The disadvantage is its excessive sensitivity

to the details of the fine structure of the approximated

speech spectrum. The fine structure reflects the effect of

excitation, the effects of shape, length and position of the

time window and the effect of noise disturbances in the speech signal.


The basic idea behind linear predictive analysis is

that a speech sample can be approximated as a linear

combination of past speech samples. By minimizing the sum

of the squared differences (over a finite interval) between

the actual speech samples and the linearly predicted ones, a

unique set of predictor coefficients can be determined.

(The predictor coefficients are the weighting coefficients

used in the linear combination.)
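
The minimization described above can be sketched in a few lines of Python. This is a minimal sketch of the autocorrelation method with the Levinson-Durbin recursion, one common way of solving for the predictor coefficients; it is not necessarily the procedure used in this work, and the function names are chosen for illustration only.

```python
def autocorrelation(x, max_lag):
    """R[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients.

    Returns (a, err), where the prediction is
    s_hat[n] = a[0]*s[n-1] + a[1]*s[n-2] + ... and err is the
    remaining squared prediction error.
    """
    a = [0.0] * (order + 1)          # a[0] is unused padding
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def lpc(x, order):
    """Predictor coefficients of the given order for the segment x."""
    return levinson_durbin(autocorrelation(x, order), order)

def all_pole_filter(excitation, a):
    """s[n] = e[n] + sum over k of a[k-1] * s[n-k] (used here to make test data)."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s
```

As a check, applying `lpc` with order 2 to the impulse response of a known two-pole filter, generated with `all_pole_filter`, returns the coefficients of that filter to within rounding error.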

Speech synthesis falls into three broad categories:

1. LPC Synthesis (Markel and Gray, 1976)

Linear prediction synthesis is based on a mathematical

model of the speech signal. The basic assumption is that

speech samples can be predicted from a linear weighting of

past speech values, that is, the speech waveform can be

modeled as the output of an all-pole digital filter, which

is excited by an impulse train for voiced speech or a random

noise sequence for unvoiced speech. The actual voiced

excitation waveform resembles a half-rectified sine wave,

but its spectral shaping is included with the filter

response. The glottal shaping is accounted for by the

inclusion of two poles in the system (since the glottal

excitation has an approximate -12 dB/octave roll-off). The filter

is usually represented by 10-12 coefficients and is updated

every 10-30 ms frame. In addition, two parameters

representing the filter excitation are updated every frame:

an amplitude factor and pitch period. The time-varying

prediction coefficients can be obtained by solving a set of

matrix equations which minimizes the mean square error

between the original speech waveform and the synthesized waveform.


The LPC envelope matches the high-amplitude portions

of the spectrum, at the expense of a poorer model between

formants. So it is only an approximation for unvoiced and

nasal sounds which contain both poles and zeros. The

procedure often underestimates bandwidths.

Its main advantage is that there is a very efficient

algorithm for finding the linear prediction coefficients.

The disadvantage inherent in the LPC method is that an

all-pole model is used to model the speech spectrum. It is

not clear how the bandwidth of a root is related to the

actual formant bandwidth, because the bandwidth of the root

is sensitive to the frame duration, frame position, and

method of analysis. Also there is no direct correspondence

between LP coefficients and physiology of speech production.

2. Formant Synthesis

The formant synthesizer is based on an acoustic model

of speech production. It can be designed and realized as

an analog circuit, in digital hardware, or on a computer.


The transfer function of the vocal tract is simulated

using a number of formant resonators. In the cascade

connection the relative amplitudes of formant peaks for

vowels come out just right without the need for individual

amplitude controls for each formant. But a parallel formant

configuration is still needed for the generation of

fricatives and plosive bursts. So a hybrid cascade/parallel

formant synthesizer is the best choice. Nasal murmurs and

vowel nasalization are approximated by the insertion of an

additional resonator and anti-resonator into the cascade

vocal tract model.

Formant synthesizers have an advantage over LPC

systems in that bandwidths can be more easily manipulated,

and zeros can be directly introduced into the filter

simulating the vocal tract (Klatt, 1980). Formant

synthesizers are capable of providing speech of good

quality, provided good input data are available. Parameters

such as the formant frequencies, formant

bandwidths, and pitch can be measured directly from the

speech signal. Locating the poles and zeros automatically

in natural speech is, however, a difficult task.

3. Articulatory Synthesis

Articulatory synthesizers are based on a physiological

model of speech production. They attempt to model

faithfully the mechanical motions of the articulators and

the resulting distributions of volume velocity and sound

pressure in the lungs, larynx, and vocal and nasal tracts.

The anatomical structures that shape the vocal tract

impose constraints on the area functions, and knowledge of

these constraints enables us to simplify our specification of

the shape of the vocal tract.

In effect, the X-ray studies indicate that during the

articulation of vowels the dimensions of the vocal tract

along the length of the tongue are controlled primarily by

the position of the tongue constriction and by the degree of

tongue constriction. A minimum set of parameters may be the

position of the tongue constriction, the size of the

constriction formed by the tongue, and the dimensions in the

vicinity of the mouth opening.

Input to this type of speech synthesizer is in terms

of symbols designating target configurations of the vocal

tract. Then a synthesis program interprets these as

discrete sets of data for an articulatory model, and

generates a sequence of changing area functions. The

formants are then calculated and used to control a speech synthesizer.


Articulatory models of the vocal tract can provide

very natural structural representations of the underlying

anatomy and physiology of the vocal tract which give rise to

the speech signal. Many effects of coarticulation, vowel

reduction, and place allophones are produced automatically,

without the need for special rules. The articulatory model

is very useful for phonetic studies. The disadvantages are

that knowledge about the articulator positions is not easily

available for continuous speech, and the computation is more

complex than for other synthesis methods.

Introduction to Source-Tract Interaction

In the conventional speech production model, the

source and the vocal tract system are considered to be

independent of each other (i.e., varying vocal tract

configuration will not have any effect on the source), which

is true when the glottis is closed, or almost closed. But

when the glottis is open, loading of the vocal tract system

on the source occurs. It has been shown that the loading of

the vocal tract can have appreciable effects on the glottal

pulse shape (Ishizaka and Flanagan, 1972). This is called

glottal source-vocal tract interaction or simply

source-tract interaction. It is believed that the

intelligibility of synthetic speech depends on the ability

to reproduce dynamic (transient) sounds such as stops,

whereas the naturalness of voice is mainly determined by the

true reproduction of voiced segments (Childers et al.,

1985). Source-tract interaction has been conjectured to be

important for synthesizing high quality, natural sounding

speech. Speech synthesized with source-tract interaction

sounds more natural than speech generated without such

interaction (Allen and Strong, 1985; Childers et al., 1983).

The acoustics of speech production is based on the

concept of a source and a filter function. In current

models, the source of voiced sounds is represented by a

quasiperiodic succession of pulses of air emitted through

the glottis, as the vocal folds open and close, and the

supraglottal vocal tract acts as a filter that shapes the

spectrum of the glottal flow to produce different sounds.

The filter function is assumed to be linear and short-time

invariant. When the glottis is closed, the vocal tract

filter is typically thought to be best represented by an

all-pole model of order 10 to 14 (Ananth et al., 1985). The

phonetic message in speech is mainly conveyed by the

transfer function of the filter representing the vocal tract

system. For example, silent articulation with an external

throat vibrator can produce intelligible speech. Not much

work has been done on the description of the voice source

with reference to speaker specifics and to contextual

factors. Speech synthesis has gained a fair quality on the

basis of conventional idealizations, such as a -12 dB/oct

average spectrum slope and uniform shape. Source and filter

are assumed to be linearly separable.

The shape and periodicity of the vocal fold wave can

vary considerably. The extent to which variability in

period and shape affect speech naturalness and quality is an

important research question. In many existing electrical

synthesizers, the properties of the vocal fold source are

approximated only in a gross form. It is customary to

specify the vocal pitch as a smooth, continuous function and

to use a fixed glottal wave shape whose amplitude spectrum

falls at about -12 dB/octave. In many synthesizers the

source is produced by repeated impulse excitation of a

fixed, spectral-shaping network. An impulse train filtered

by a 2-pole low-pass filter has about the right average

spectrum, but the phase of this waveform is wrong: primary

excitation of the vocal tract filter occurs at a time

corresponding to the instant the folds open, rather than at

closure. Furthermore, the spectrum envelope is perfectly

regular (i.e. monotonically decreasing at 12 dB per octave),

which contrasts with evidence indicating the presence of

zeros in the spectra of normal voicing waveforms. Such lack

of fidelity in duplicating actual glottal characteristics

undoubtedly detracts from speech naturalness and the ability

to simulate a given voice.
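
The spectral slope of the two-pole source approximation discussed above can be checked numerically. The sketch below is an illustration under assumed values, not a measurement from this work: it evaluates the magnitude response of a digital low-pass filter with a double real pole, with a 50 Hz cutoff and a 10 kHz sampling rate chosen arbitrarily, and compares the gain one octave apart.

```python
import math

def two_pole_lowpass_gain_db(freq_hz, cutoff_hz, fs_hz):
    """Gain in dB of H(z) = (1 - p)^2 / (1 - p z^-1)^2, a double real pole."""
    p = math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)
    w = 2.0 * math.pi * freq_hz / fs_hz
    one_pole_sq = 1.0 - 2.0 * p * math.cos(w) + p * p   # |1 - p e^{-jw}|^2
    gain = (1.0 - p) ** 2 / one_pole_sq                 # two identical poles
    return 20.0 * math.log10(gain)

def slope_db_per_octave(freq_hz, cutoff_hz, fs_hz):
    """Change in gain between freq_hz and twice freq_hz."""
    return (two_pole_lowpass_gain_db(2.0 * freq_hz, cutoff_hz, fs_hz)
            - two_pole_lowpass_gain_db(freq_hz, cutoff_hz, fs_hz))

# Assumed values: slope between 500 Hz and 1000 Hz, 50 Hz cutoff, 10 kHz sampling.
slope = slope_db_per_octave(500.0, 50.0, 10000.0)
```

Well above the cutoff and well below half the sampling rate, the slope is close to -12 dB per octave. The response falls smoothly and contains no zeros, which is the regularity that the text contrasts with natural voicing spectra.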

It has been shown that the glottal pulse waveform can

have important effects on the quality of synthetic speech

(the voice produced). Rosenberg (1971) studied the effect

on naturalness of the variation of glottal pulse shape. He

used simulated waveforms with pulse shapes differing in the

number and location of slope discontinuities. His result

indicated that simulated excitations with pulse shapes with

a single slope discontinuity at closure were preferred.

Holmes (1973) has pursued this area and shown that under

certain listening conditions, the use of glottal pulses

derived from speech significantly improves the naturalness

over fixed glottal models. Wong and Markel (1978) have

recently shown that retaining the phase characteristics of a

typical glottal pulse can improve LPC synthesis quality.

To study the glottal source, glottal inverse

filtering is used. Glottal inverse filtering is a technique

by which the flow past the time varying glottal constriction

is estimated by a filtering operation on the acoustic signal

in human speech. The filtering operation removes the

effects of the vocal tract resonances to reveal the

underlying voice source signal. Almost all techniques for

inverse filtering the speech signal to obtain the glottal

volume-velocity are based on the linear model. The source

is assumed to be a periodic waveform generator which outputs

pulses of volume-velocity. The volume-velocity is input to

a linear, time invariant vocal tract filter. The transfer

function of the vocal tract filter is determined by the

supraglottal articulators. The output of this filter is

then passed through a second filter that models the

radiation at the lips, and is finally output as speech. The

vocal tract filter for vowel sounds is usually modeled as an

all-pole filter; this can be theoretically justified on the

basis of acoustic tube modeling of the vocal tract. The

inverse of the vocal tract filter therefore contains only

zeros or anti-resonances. If the radiated speech wave is

passed through this inverse vocal tract filter, the output

will be the differentiated glottal volume-velocity. A

simple integration of this signal will yield the glottal

volume-velocity. Inverse filtering studies on different

vowels have revealed two important phenomena:

(a) Skewing

High speed and stroboscopic motion pictures of the

glottis during normal voice have generally yielded a rather

symmetrical, triangular waveform for the projected glottal

area (Rothenberg, 1981a). On the other hand, measurements

of the glottal flow waveform by inverse filtering the sound

pressure or the flow at the mouth have often shown a

markedly asymmetrical waveform, with a slowly-rising

glottal opening phase and a sharply terminating glottal

closing phase (for example, see Miller, 1959; Holmes, 1962;

and Rothenberg, 1973). This asymmetry of the glottal flow

waveform can be an important determinant of voice quality in

that it increases the high frequency energy of the waveform,

as compared to the projected area waveform, and hence

affects the levels of the formants. It also concentrates

the energy in the glottal closed phase, during which the

vocal tract is most efficient (Fant, 1979a). The degree of

skewness differs for different vowels, and it has been shown

that the first formant load is the most important in

determining the degree of skewing (Ananthapadmanabha and

Fant, 1982).

(b) Ripple

Ripples are sometimes observed in the open phase of

the glottal flow waveforms. It is generally realized that

there can be appreciable first formant energy absorbed by

the glottis during the open phase of the glottal cycle, when

the glottal impedance is finite (Rothenberg, 1981a). This

is called truncation, which means the termination of formant

oscillations by excessive damping within the glottal open

period (Fant and Ananthapadmanabha, 1982). Glottal damping

causes a truncation of formant amplitudes and changes the

formant frequencies and increases their bandwidths during

the glottal open interval. This effect is especially

apparent for sounds with a high F1. The vowel /a/

typically has a truncated speech waveform. Its equivalent

effect is to cause oscillations on the glottal flow

waveform. The main perceptual effect of truncation is a

reduction of the loudness level of the formant (Fant and

Liljencrants, 1979). The first formant load has been found

to be the most important in determining the degree of this effect.


Other source-tract interaction effects have also been

identified by computer simulations, theoretical studies and

inverse filtering experiments.
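
The glottal inverse filtering procedure described earlier in this chapter (an all-zero inverse of the vocal tract filter followed by a simple integration) can be sketched as follows. This is a minimal illustration under the linear model only: the vocal tract coefficients are assumed to be known, lip radiation is approximated by a first difference, and all function names are hypothetical. In practice the filter must be estimated from the speech signal itself.

```python
def all_pole_filter(excitation, a):
    """Vocal tract model: s[n] = e[n] + sum over k of a[k-1] * s[n-k]."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s

def inverse_filter(speech, a):
    """All-zero inverse of the vocal tract: e[n] = s[n] - sum a[k-1] * s[n-k]."""
    out = []
    for n, s_n in enumerate(speech):
        acc = s_n
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * speech[n - k]
        out.append(acc)
    return out

def first_difference(x):
    """Assumed lip radiation model: d[n] = x[n] - x[n-1]."""
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def integrate(x):
    """Running sum, which undoes the first difference."""
    total = 0.0
    out = []
    for v in x:
        total += v
        out.append(total)
    return out

def estimate_glottal_flow(speech, a):
    """Inverse filter the speech, then integrate the result."""
    return integrate(inverse_filter(speech, a))
```

If a known flow pulse is differenced with `first_difference` and passed through `all_pole_filter`, then `estimate_glottal_flow` with the same coefficients recovers the pulse to within rounding error. The sketch says nothing about how well the linear model holds for real speech, which is the subject of the following chapters.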



2 SOURCE-TRACT INTERACTION

Introduction

In this chapter, we first describe the different types

of speech production models and derive the conventional

speech production model. Then we explain what source-tract

interaction is and what its effects are. A literature

review of source-tract interaction, which includes

theoretical and simulation studies, is given next. Finally,

a summary of all the source-tract interaction effects known

is given.

Speech Production Models

The conventional source-filter model of speech

production includes many inherent limitations and

assumptions which may not always be true. The source-filter

theory of voice production is the theory of its

approximations. Time-varying and nonlinear functions

introduce difficulties which obscure insight into the

essentials unless we resort to simplified and approximate

transformations. To completely understand what effects they

have on the resultant synthesized speech, we shall describe

how the conventional speech production model is derived.

(a) Physical Model

The physical model in Figure 2-1 attempts to

simulate the actual physical processes involved in human

speech generation. The physically most correct approach

would be to regard the lung pressure and its associated

muscular back-up as the source and derive the output from a

set of differential equations representing the state of flow

and pressure in the combined subglottal, glottal, and

supraglottal systems in extremely short successive intervals

of time (Fant, 1981). This is actually the principle of the

Ishizaka-Flanagan (1972) model which has become most

influential in speech research. The parameters used are

physiological and include the lung pressure, the vocal tract

area function, etc. The two-mass model for the vocal folds

is shown here, but other relevant models can be used as


The vocal system may be represented in terms of

incremental contiguous sections of a lossy cylindrical pipe.

The shape of the vocal tract may deviate markedly from that

of a straight acoustic tube. For frequencies corresponding

to wavelengths that are long compared to the dimensions of

the vocal tract (less than about 4000 Hz), it is reasonable

to assume plane wave propagation along the axis of the tube.

It is difficult to synthesize any speech signal from

this model; synthesis can probably only be done by digital

simulation on a computer. The extensive computation

required makes this approach impractical.



Figure 2-1. Physical model of speech production. (From
Flanagan et al., 1975)

Figure 2-2. Electrical equivalent circuit. (From Flanagan
et al., 1975)


(b) Electrical Equivalent

Sound pressure and volume velocity for plane wave

propagation in a uniform tube satisfy the same wave equation

as do voltage and current on a uniform transmission line

(Flanagan, 1972). Hence, sound pressure and volume velocity

can be considered to be analogous to the voltage and

current, respectively, in an electrical line. The uniform

transmission line can be represented by an equivalent

T-network, and an equivalent electrical circuit (Figure 2-2)

can be established for the vocal system. This has an

additional advantage of being able to use the network

theory, but it is still too complex.
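The analogy can be written out term by term. The following is a sketch for a lossless uniform tube of cross-sectional area A, using the standard per-unit-length acoustic inertance and compliance. The symbols c (speed of sound), L, C, v and i are introduced here for illustration and are not taken from the passage above; losses and the time variation of the glottis are ignored.

```latex
% Acoustic plane-wave equations in a lossless uniform tube of area A
-\frac{\partial p}{\partial x} = \frac{\rho}{A}\,\frac{\partial U}{\partial t},
\qquad
-\frac{\partial U}{\partial x} = \frac{A}{\rho c^{2}}\,\frac{\partial p}{\partial t}

% Telegraph equations of a lossless transmission line
-\frac{\partial v}{\partial x} = L\,\frac{\partial i}{\partial t},
\qquad
-\frac{\partial i}{\partial x} = C\,\frac{\partial v}{\partial t}

% Term-by-term correspondence
p \leftrightarrow v, \quad U \leftrightarrow i, \quad
L = \frac{\rho}{A}, \quad C = \frac{A}{\rho c^{2}}
```

Each short section of the tube then maps onto one T-network whose series and shunt elements follow from L and C for that section's area.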

(c) Simplified Electrical Equivalent

The effects of the subglottal and vocal tract systems

for voiced sounds are to introduce resonances, called

formants, in the frequency domain, which have to be

accurately represented in order to have high quality,

natural sounding speech. Hence, limiting ourselves to the

nonnasalized voiced sounds only, a simplified electrical

equivalent circuit (Figure 2-3) can be obtained using

parallel resonant circuits as loads, with the parameters

chosen so as to represent the formants and to approximate

the impedances of the subglottal and the vocal tract

systems. The cascade arrangement of resonators is shown to

yield vowel sounds having formants of proper amplitude when

information specifying the formant frequencies only is


Figure 2-3. Simplified electrical equivalent circuit for
voiced sounds.




Figure 2-4. Source-filter model for voice production.

known. The effect of the radiation has been incorporated

into the formant loads. The time-varying, nonlinear glottal

impedance may be calculated from the measured or modeled

glottal area function, and simulation experiments can be run

to see how the glottal flow varies with different vocal

tract loads.

(d) The Conventional Source-Filter Model

The time-varying and nonlinear glottal impedance still

makes the solution difficult to obtain. To simplify even

further, the subglottal system is completely ignored,

because the pressure drop across the bronchi and trachea is

small, and the subglottal pressure is maintained sensibly

constant over the duration of several pitch periods by the

low-impedance lung reservoir. Also, since the glottal

impedance is generally very large when compared to the vocal

tract input impedance, because of the relatively small

opening of the glottis which separates the subglottal and

supraglottal regions, the effects of the vocal tract loading

on the glottal source are ignored also. This is similar to

having an ideal current source. Here, the glottal source

for the supraglottal region is generally defined as the

volume velocity of air flow passing the vocal folds. The

supraglottal vocal tract acts as a filter that shapes the

spectrum of the glottal flow to produce different sounds.

That is, the source does not depend on the vocal tract

shape. This gives the voiced part of the linear

source-filter model of speech production. The model for

voiced sound production is shown in Figure 2-4.
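The linear source-filter model can be sketched in a few lines of code. The example below is a minimal illustration, not the synthesizer used in this work: it assumes a cascade of second-order digital resonators of the kind commonly used in formant synthesizers, an impulse train as a stand-in for the glottal source, and illustrative formant frequencies and bandwidths for an /a/-like vowel. All function names and numerical values are hypothetical.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] with unity gain at DC."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(x, freq_hz, bw_hz, fs_hz):
    """Pass the sequence x through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y0 = a * sample + b * y1 + c * y2
        out.append(y0)
        y2, y1 = y1, y0
    return out

def synthesize_vowel(source, formants, fs_hz):
    """Filter a source signal through a cascade of formant resonators.

    The source is computed independently of the formants, which is exactly
    the no-interaction assumption of the conventional model.
    """
    signal = list(source)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs_hz)
    return signal

def impulse_train(f0_hz, fs_hz, n_samples):
    """Crude stand-in for the glottal source: one impulse per pitch period."""
    period = int(round(fs_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

FS = 10000  # sampling rate in Hz (assumed)
FORMANTS_A = [(700.0, 60.0), (1100.0, 90.0), (2500.0, 120.0)]  # illustrative only
speech = synthesize_vowel(impulse_train(100.0, FS, 1000), FORMANTS_A, FS)
```

Because the source is generated before, and independently of, the filter, changing the formant list never alters the glottal waveform. That is the simplification the rest of this chapter examines.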

As we look for more precise models of the voice

source, whether this be for higher quality synthesis of

speech or singing, or for the study of unusual or

pathological voice qualities, the effects of the vocal tract

load on the glottal flow cannot be neglected (Rothenberg,

1981b), i.e., the glottal source and the vocal tract

interact with each other. This is called source-tract interaction.


In recent years the demand for greater naturalness in

speech synthesis has renewed interest in studies

of the voice source and source-filter interactions. This

demand, which is paralleled by a general descriptive need in

acoustic phonetics, e.g., with respect to speech prosody and

individual voice qualities and speaking habits, is revealed

when we attempt to synthesize female speech by rule. Voice

pathology and theory of singing are related topics, where

the need for improved voice source models is apparent

(Fant, 1981).

Effects of Source-Tract Interaction

The major effects of source-tract interaction are

described below:

a) Skewing

The glottal flow waveforms obtained from inverse

filtering are usually skewed to the right when compared with

the glottal area functions (Figure 2-5). This effect has

been studied by Rothenberg (1981a). Using a triangular

shape for the glottal conductance, he was able to get

volume-velocity skewing by loading the conductance with the

inertance of the subglottal and supraglottal tracts. Due to

inertia effects the glottal flow generally builds up more

slowly than implied by the glottal area function but

terminates more steeply to meet the requirement of zero flow

after closure. If the lowest supraglottal resonance (the

first formant F1) is higher than the voice fundamental

frequency F0, as is usually the case, the supraglottal

loading will be inertive at frequencies between F0 and F1,

which most strongly influence the overall waveshape of the

glottal pulse. The same applies for the lowest subglottal

formant. Hence, a pure inertance can be used as the load

for finding the glottal pulse shape. Since different vowels

will have different inertances for the vocal tract system,

the degree of skewing will be different for different

vowels. Skewing can be an important determinant of voice

quality because it varies the high frequency energy of the

waveform, and thus affects the levels of the formant


For female voices, the fundamental frequency (F0) is

higher and the ratio F1/F0 is smaller than that for male





Figure 2-5. Glottal flow skewed to the right.


Figure 2-6. Ripple superimposed on glottal flow waveform.

voices, so the female glottal flow waveforms tend to be more

symmetrical, as has been seen in the measurements of the

glottal flow waveforms (Monsen and Engebretson, 1977).

Acoustic interaction can cause the glottal source

waveform to vary widely as a function of vowel value and F0,

since the first formant must be high compared to F0 in order

for the supraglottal loading to be inertive at an

appreciable number of glottal harmonics, and because the

magnitude of the impedance below F1 varies as a function of

vowel value.

b) Ripple

The glottal flow waveforms obtained from inverse

filtering may have some superimposed ripples (Figure 2-6),

especially when the steady state is reached. One aspect of

formant ripple in the glottal pulse is the increasing rate

of decay of the F1 oscillation during glottal opening, which may

cause a smaller or greater degree of "truncation" of the

oscillation preventing the carry-over of oscillatory energy

to the next period (Fant, 1982a). This has been studied

extensively by Fant (Ananthapadmanabha and Fant, 1982; Fant

and Ananthapadmanabha, 1982; Fant, 1982a; Fant, 1982b) and

can be explained as follows. In the glottal open state, the

sub- and supraglottal parts of the vocal tract are

acoustically coupled through the time variable and nonlinear

glottal impedance, whereas when the glottis is closed the

sub- and supraglottal systems execute approximately free and

separate oscillations. Resonance frequencies and especially

bandwidths may differ in the two states. The glottal open

period can be regarded as a phase of acoustical energy

charging, followed by a discharge at glottal closure.

During the glottal closed conditions, the formants prevail

with a constant bandwidth, i.e., a constant rate of decay,

followed by a relatively faster decay of amplitude during

the next glottal open interval, when the glottal impedance

becomes finite. Thus, glottal damping can cause a

truncation of formant amplitudes during the interval of

glottal opening. The glottal waveform of the vowel /a/ is

typical of truncation. This reflects the higher sensitivity

to glottal damping of open back vowels, such as /a/,

compared to vowels with a wide pharynx cavity. The high

glottal damping is also a general characteristic of vowels

produced with incomplete glottal closure as in a breathy

voice (Fant and Ananthapadmanabha, 1982).

With truncation, the wave shape of glottal pulses may

display irregularities due to superimposed formant ripple.

The presence of such a ripple is one consequence of glottal

damping. It is convenient to use the differentiated glottal

flow to represent the source, thereby including the

frequency characteristics of radiation transfer. The

perturbations of glottal flow induced by formant ripple

become more apparent after differentiation. In the initial

positive part of the differentiated glottal flow, there is

often seen a double peak. The double peak in the

differentiated flow does not correspond to a constant

frequency F1 ripple, because of the time-varying, nonlinear

glottal impedance. In addition to truncation there occurs a

spectral broadening due to F1 modulation within the glottal

open period. The F1 ripple starts at a higher frequency and

ends at a lower frequency. The spectrum displays sharp

zeros of appreciable magnitude, the first located above F1.

Considering truncation alone, without F1 modulation,

a zero would have been found at F1. Recently, Fant et al.

(1985b) have shown that the double peak glottal flow

derivative has a spectral correlate of a broad maximum at,

or somewhat lower than, 2F1 intersected by a sharp zero.

The intraglottal variations in the system function are only

approximately definable by time varying frequencies and

bandwidths of vocal resonances.

Effective bandwidth due to truncation increases with

the open quotient of the glottal cycle and with F0 at

constant open quotient (Fant, 1981) and is highly specific

to vowel formant and phonatory mode. Boves and Cranen

(1983) have tried to estimate the changes in the formant

parameters when the glottis opens, using the covariance

linear prediction (LP) analysis. However, LP analysis over

the open phase always failed, so they compared formant

parameters obtained from the closed glottal interval with

corresponding measures obtained from a complete period.

Their results showed that generally the bandwidths obtained

from a complete cycle exceeded those from the closed glottal


The main perceptual effect of truncation is a

reduction of the loudness level of the formant. According

to Fant and Liljencrants (1979) the truncated envelope

produces a loudness equal to that of an untruncated

exponential which has the same initial amplitude and the

same mean (not r.m.s.) amplitude within the complete glottal

cycle. The spectral 3 dB-bandwidth of the truncated signal

is much larger than this effective bandwidth. Perceptual

matching experiments on truncated periodic F1 responses with

untruncated responses suggested that subjects tend to match

for an equal loudness level of F1, which implies an equal mean

value of the envelope.
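The equal-mean-amplitude criterion can be turned into a small calculation. The sketch below assumes an idealized envelope that decays as exp(-pi*B*t), with one bandwidth during the closed phase and a larger one during the open phase, and finds by bisection the single "effective" bandwidth whose untruncated exponential has the same initial amplitude and the same mean over the period. The two-bandwidth envelope, the function names, and the numerical values are assumptions made for illustration; they are not taken from Fant and Liljencrants (1979).

```python
import math

def mean_truncated_envelope(bw_closed, bw_open, t_closed, period):
    """Mean over one period of an envelope decaying with bw_closed, then bw_open."""
    a, b = math.pi * bw_closed, math.pi * bw_open
    closed_part = (1.0 - math.exp(-a * t_closed)) / a
    amp_at_opening = math.exp(-a * t_closed)
    open_part = amp_at_opening * (1.0 - math.exp(-b * (period - t_closed))) / b
    return (closed_part + open_part) / period

def mean_exponential(bw, period):
    """Mean over one period of a single untruncated exponential envelope."""
    a = math.pi * bw
    return (1.0 - math.exp(-a * period)) / (a * period)

def effective_bandwidth(bw_closed, bw_open, t_closed, period, tol=1e-9):
    """Bandwidth of the untruncated exponential with the same mean amplitude.

    Assumes 0 < bw_closed < bw_open, so the answer lies between the two.
    """
    target = mean_truncated_envelope(bw_closed, bw_open, t_closed, period)
    lo, hi = bw_closed, bw_open
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_exponential(mid, period) > target:
            lo = mid  # decays too slowly: more bandwidth is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative values: F0 = 100 Hz, glottis closed for half the period,
# 50 Hz bandwidth while closed and 300 Hz while open.
b_eff = effective_bandwidth(50.0, 300.0, 0.005, 0.010)
```

Lengthening the open phase (a smaller t_closed) moves the result toward the open-phase bandwidth, which is consistent with the statement that effective bandwidth increases with the open quotient.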

The first formant is reported to be the most susceptible

to interaction effects. Examination of the first formant

could indicate the presence of any non-uniform damping.

This would identify the presence of source-tract

interaction. Its effect on quality is not clear, but

it is known that the first formant is very important

for vowel quality.

Other source-tract interaction effects have also been

identified (Nord et al., 1984), such as nonlinear

superposition, in which the superposed ripple may

alter the glottal flow derivative at the closing

discontinuity and thus change the excitation level.

c) Linear Superposition

Superposition, which requires a lower degree of

glottal damping, is in a sense the opposite of truncation.

Superposition results from the carry over of formant energy

from one fundamental period to the next. Superposition

provides a maximally positive net effect when a formant

coincides with a harmonic, and a maximally negative effect

when the corresponding resonant peak is located halfway

between two harmonics. In addition, superposition of

formant ripple within the glottal open period may change the

source excitation waveform and spectrum. This is a

nonlinear source-filter interaction.
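The harmonic dependence of linear superposition can be checked numerically. The sketch below assumes each glottal closure excites an identical damped sinusoid, exp(-pi*B*t) * cos(2*pi*F*t), and sums the tails carried over from all earlier periods as a geometric series. The function name and the example values are hypothetical, and the model is purely linear: it leaves out the change in the source itself mentioned at the end of the paragraph above.

```python
import cmath
import math

def superposition_gain(formant_hz, bandwidth_hz, f0_hz):
    """Steady-state amplitude of a periodically excited formant, relative
    to the response to a single isolated excitation.

    The carry-over from one period to the next is the complex factor
    z = exp((-pi*B + j*2*pi*F) / F0); summing z**k over all earlier
    periods gives 1 / (1 - z).
    """
    period = 1.0 / f0_hz
    carry = cmath.exp(complex(-math.pi * bandwidth_hz,
                              2.0 * math.pi * formant_hz) * period)
    return abs(1.0 / (1.0 - carry))

# Illustrative values: 50 Hz bandwidth, F0 = 100 Hz.
on_harmonic = superposition_gain(500.0, 50.0, 100.0)    # formant on the 5th harmonic
between = superposition_gain(550.0, 50.0, 100.0)        # halfway between harmonics
```

With heavier damping the carry-over factor shrinks and both gains approach one, which is the sense in which superposition is the opposite of truncation.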

d) Nonlinear Superposition

Under conditions of small supraglottal damping and

high F0, the F1 ripple may prevail in the latter part of the

glottal opening interval, just before closure, thus

affecting the strength of the main excitation. The ripple

components may affect the phase of the signal (residue),

thus causing either an increase or a decrease in the slope

at closure. This will be referred to as a nonlinear

superposition. It is concluded that formant amplitude

changes at varying F0 are a function of both linear

superposition and changes in the scale factor and spectrum

of the glottal source.

Informal experimentation with an electrical analog

version of an interactive voice source model (Rothenberg,

1981b) has shown that as the ratio F1/F0 gets smaller than

about three, the value of this ratio is increasingly

significant in determining voice quality. When F1/F0 is

near integral values, energy from previous glottal cycles

tends to cause a decrease in supraglottal pressure as the

glottis is closing (in addition to any component caused by

the low frequency vocal tract inertance). This decrease in

pressure raises the transglottal pressure and causes a

sharper drop in flow at closure. Likewise, values of F1/F0

that fall about halfway between integral values tend to

decrease transglottal pressure during glottal closure and

cause a less sharp drop in flow at the instant of complete

closure. Thus, if the ratio F1/F0 is low, the high

frequency energy generated by the glottal closure is

determined by both the vocal tract inertance at low

frequencies and the value of F1. Fant and Ananthapadmanabha

(1982) estimated that the condition for increased excitation

force would be satisfied by F1 = (n + 1/4) F0, whereas F1 =

(n - 1/4) F0 would have the opposite effect.

The interaction between F1 and F0 should be

distinguished from the interaction predicted by linear

superposition. In linear superposition, a formant is

maximally strengthened when it is an exact multiple of the

fundamental frequency, while the value of F1/F0 for maximum

transglottal pressure during the glottal closure may not be

an exact integer. Of more significance is the fact that

linear superposition predicts that the coincidence of F1 and

a multiple of F0 will strengthen only F1 and not the higher

order formants. For nonlinear superposition the ratio of F1

to F0 can have a significant effect on all formants.

Rothenberg (1987) has studied the case of soprano

voice when F1/F0 is approximately equal to one. He showed

that a phonation with F1 close to F0 minimized air


A typical feature found in most simulations is that

because of superposition interaction, one glottal pulse

differs somewhat in shape, peak amplitude, and even in

duration from neighboring pulses even at constant F0 and

other phonatory conditions held constant. The period by

period perturbations of glottal flow waveshape probably add

to the naturalness element in normal speech as opposed to

the regularity stereotype of conventional speech

synthesizers. Pathologically extreme perturbations such as

vocal fry affect both periodicity and the gross shape of

glottal flow pulses including mechanical interaction.

Source-filter interaction theory may prove to be especially

important for the analysis of the voice excitation mechanism

at higher fundamental frequencies as in female and

children's voices.

Glottal air flow as well as vocal fold vibratory

patterns may be affected by changes in subglottal pressure

and supraglottal articulations. It is not inconceivable

that the glottal flow pattern may influence the mechanical

motion of the folds and thus of the glottal area function.

One would expect a small shortening of the closure time as a

response to a negative supraglottal pressure peak preceding

the termination of the glottal pulse.

Literature Review

A number of authors have investigated the problem of

source-tract interaction and have tried to elucidate the

effect of the vocal tract on the waveform of glottal volume

flow by means of theoretical analysis or simulation based on

models of speech production.

Glottal Impedance

The glottis is a narrow constriction formed between

two vibrating vocal folds. The excess air pressure in the

lungs causes air to flow through the trachea and the

glottis. This pulsating air flow not only causes acoustic

excitation of vocal cavity resonances but is also

responsible for sustaining vocal fold vibration.

Van den Berg et al. (1957) proposed an empirical

formula for predicting the steady (dc) flow based on

extensive experiments on static larynx models with uniform

glottis. Their formula for the pressure drop is

$$\Delta P = k \frac{\rho U^2}{2A^2} + \frac{12 \mu D l^2 U}{A^3}$$

where $\rho$ is the density of air, $\mu$ is the coefficient of

viscosity, U is the volume velocity flow, $\Delta P$ is the excess

subglottal pressure, A is the glottal area, D is the

glottal depth, and l is the length of the glottis. The

constant k = (k1 - k2) is determined empirically. The

constants k1 and k2 are the so-called entry drop and exit

recovery coefficients. The form of the above equation is

guided by aerodynamic theories. The first term is called

the kinetic resistance drop and the second term is called

the viscous drop. According to van den Berg, k1 = 1.375

and k2 = 0.5. Thus, the volume velocity of air flow, U, can

be calculated for given glottal dimensions and pressure.

Recent investigations suggest a revision of the value for

k2, which is much smaller, of the order of 0.01 to 0.05.
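Since the pressure drop is quadratic in U, the flow for a given pressure and glottal geometry follows from the positive root of that quadratic. The sketch below is only an illustration. It assumes the formula has the form kinetic term k*rho*U^2/(2*A^2) plus viscous term 12*mu*D*l^2*U/A^3, takes k as k1 - k2, and uses assumed CGS values for the air constants and the glottal dimensions; none of these numbers are measurements from this work.

```python
import math

RHO_AIR = 1.14e-3               # g/cm^3, assumed density of moist air
MU_AIR = 1.86e-4                # dyne*s/cm^2, assumed viscosity of air
K1_ENTRY, K2_EXIT = 1.375, 0.5  # entry-drop and exit-recovery coefficients from the text

def pressure_terms(flow, area, depth, length, k=K1_ENTRY - K2_EXIT):
    """Return (kinetic drop, viscous drop) in dyne/cm^2 for a given flow."""
    kinetic = k * RHO_AIR * flow ** 2 / (2.0 * area ** 2)
    viscous = 12.0 * MU_AIR * depth * length ** 2 * flow / area ** 3
    return kinetic, viscous

def glottal_flow(delta_p, area, depth, length, k=K1_ENTRY - K2_EXIT):
    """Steady volume velocity (cm^3/s) for a pressure drop delta_p (dyne/cm^2).

    Solves kinetic_coeff*U**2 + viscous_coeff*U - delta_p = 0 for U > 0.
    """
    kinetic_coeff = k * RHO_AIR / (2.0 * area ** 2)
    viscous_coeff = 12.0 * MU_AIR * depth * length ** 2 / area ** 3
    disc = viscous_coeff ** 2 + 4.0 * kinetic_coeff * delta_p
    return (-viscous_coeff + math.sqrt(disc)) / (2.0 * kinetic_coeff)

# Illustrative case: 8 cm H2O across a glottis of area 0.1 cm^2,
# depth 0.3 cm and length 1.2 cm (all assumed values).
delta_p = 8.0 * 980.665
flow = glottal_flow(delta_p, area=0.1, depth=0.3, length=1.2)
kinetic, viscous = pressure_terms(flow, area=0.1, depth=0.3, length=1.2)
```

Repeating the calculation with a much smaller area shifts the balance toward the viscous term, because that term grows as 1/A^3 while the kinetic term grows only as 1/A^2.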

The above discussion is restricted to the glottal flow

through a uniform glottis. Experiments on nonuniform

glottal shapes strongly indicate the dependence of glottal

resistance on the shape of the glottis. We can, to a first

order approximation, estimate the glottal resistance of a

nonuniform glottis to be the same as that of the glottal

resistance of a uniform glottis of width corresponding to

the minimum width of the nonuniform glottis, but with the

viscous drop term omitted. Thus A in the equation

corresponds to the minimum area of the glottis. Owing to

the flow dependent resistance, the glottal damping is larger

at low voice efforts than at high voice efforts.

During phonation the glottal area changes continually.

For small dimensions of the glottis, incompressibility is a

valid assumption. With the air mass in the glottis, the

glottal impedance now incorporates an additional term

corresponding to the glottal inertance. Assuming a constant

glottal depth D the glottal inertance is

$$L_g = \frac{\rho D}{A}$$

Since $L_g$ changes in time with A, the pressure drop across $L_g$ is

$$\Delta P_L = L_g \frac{dU}{dt} + U \frac{dL_g}{dt}$$

The effect of glottal inertance is rather small.
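The two terms of the inertive pressure drop follow from the product rule once the inertance is allowed to vary with the glottal area. The short derivation below is a sketch that assumes the inertance of the glottal air plug is L_g = rho*D/A with constant depth D, as described above.

```latex
\Delta P_L = \frac{d}{dt}\left( L_g U \right)
           = L_g \frac{dU}{dt} + U \frac{dL_g}{dt},
\qquad
L_g = \frac{\rho D}{A(t)}
\;\Longrightarrow\;
\frac{dL_g}{dt} = -\frac{\rho D}{A^{2}} \frac{dA}{dt}

\Delta P_L = \frac{\rho D}{A} \frac{dU}{dt}
           - \frac{\rho D\, U}{A^{2}} \frac{dA}{dt}
```

While the glottis is opening (dA/dt > 0) the second term is negative, so it partly offsets the first.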

For typical values of vocal subglottal pressure, the

viscous term approximately equals the kinetic term when the

glottal areas are just a fraction (< 1/5) of the maximum

area. In other words, over most of the open cycle of the

vocal folds the glottal resistance is determined by the

kinetic term.

The Lg/Rg time constant is small compared with the

fundamental period, so the glottal volume flow may be

considered as a series of consecutively established steady


Simulation Experiments

Ishizaka and Flanagan (1972) proposed their two-mass

model of the vocal folds and succeeded in explaining the

glottal source-vocal tract interaction in terms of

physiological and acoustic parameters of the vocal folds and

vocal tract.

Guerin et al. (1976) developed a glottal source model,

which is shown in Figure 2-7, to generate a volume velocity

signal that includes the interaction with the vocal

cavities. The model is loaded by an equivalent circuit for

the driving point impedance of the vocal tract. This

circuit is controlled dynamically by the first two formant

frequencies. The synthesis of this circuit was based on the

results they have obtained from a computer simulation of a

lossy vocal tract transmission-line model. Rv and Rk are

the viscous and kinetic components of the glottal

resistance, Lg is the glottal inertance and the area

generating box represents the mechanical one (or two) mass

model. They calculated the volume velocity function U(t) of

different vowels by a computer simulation of the circuit,

using the same glottal area function A(t). Compared with

A(t), the glottal flow has additional fluctuations at

frequencies near the first formant frequency of the

considered vowel. A "coupling index" was defined as a

measure of the degree of source-tract interaction which was

found to be important over the entire first formant range.

In the work of Ananthapadmanabha and Fant (1982), the

acoustic modeling of voice production started by assuming a

specific glottal area function within a fundamental period

and a specific lung pressure. Recent modeling efforts have

incorporated more realistic circuits to simulate the

impedances of the subglottal and supraglottal tracts,

leading to rather complex time-varying relationships between

F1(t): first formant; F2(t): second formant.

Figure 2-7. Glottal source model loaded by an equivalent
circuit of the vocal tract input impedance. (From Guerin
et al., 1976)

area and glottal flow. The glottal area function in general

depends on phonetic context, phonatory mode, voice

intensity, pitch etc. The flow and pressure states in other

parts of the system were then calculated by techniques

similar to those of Ishizaka and Flanagan (1972) leading to

numerical determinations of the glottal flow, the output

flow at the lips, and the sound pressure at a distance from

the speaker's mouth.

Speech synthesized with source-tract interaction

sounds more natural than speech generated without such

interaction (Childers et al., 1983; Childers and Wu, 1990).

This was confirmed by employing a glottal area function

which controls the time varying glottal impedance in an

equivalent circuit of the vocal system. The output of this

circuit is a time-varying glottal volume velocity function

which includes the effect of source-tract interaction.

Glottal air flow as well as vocal folds vibratory

patterns may be affected by changes in subglottal pressure

and supraglottal articulations. Guerin (1985) has studied

the effect of source-tract interaction of different vowels

on the fundamental frequency of the vocal folds oscillation,

using a two-mass model of the vocal folds loaded by an

equivalent circuit of the vocal tract input impedance. His

results showed that F0 increased slightly with F1, which was

contrary to the measurements of intrinsic fundamental

frequency made on natural speech. Hence, the intrinsic

value of the fundamental frequency of different vowels is

not determined by the acoustic coupling, but by other more

important effects which largely compensate for the effects

of the acoustic coupling.

Cheng and Guerin (1987) have studied the strategy for

controlling the male and female glottal sources, based on

the research on source-tract interaction. Physiological

parameters make the control easier but need heavy

calculations, while acoustic parameters lead to simpler

calculations, but do not allow an accurate control of the

individual characteristics. So they used mixed

physiological-acoustic parameters in order to combine the

advantages of the two kinds of parameters. The three

independent parameters adopted were lung pressure,

vocal-fold tension and first supraglottal formant which led

to the three dependent parameters of fundamental frequency,

asymmetry quotient and open quotient.

Theoretical Studies

Detailed physical-acoustic models of the subglottal

systems have been proposed that can generate patterns of

pressure and air flow that seem quite realistic (Ishizaka

and Flanagan, 1972). However, such detailed models often do

not make clear which aspects of the interaction between the

glottal source and vocal tract are most active in

determining the quality of the voice. In order to

understand the way in which voice quality is affected by the

source-tract interaction it is desirable to formulate a

model or models that break down this interaction into its

more important and less important components.

The studies will be restricted to acoustic interaction

and not include the effect that the supraglottal pressure

variations might have on the motion of the vocal folds.

(a) Rothenberg's Studies

Rothenberg (1981b) described an interactive model for

the voice source which includes the acoustic interaction

between the glottal source and the subglottal and

supraglottal acoustic systems. He concentrated on the

development of a model which is valid for the more open

vocalic sounds that comprise most of speech and singing. In

such sounds the (dissipative) supraglottal flow resistance

is small compared to the glottal flow resistance and the

frequency of the first formant is appreciably greater than

the voice fundamental frequency. His studies of the glottal

flow have indicated that for such unconstricted vocal tract

configurations the influence of the vocal tract acoustics on

the glottal flow waveform stems primarily from two factors.

The first is the subglottal and supraglottal pressure

variations caused by the inertive components of the

subglottal and supraglottal vocal tract impedances at the

voice fundamental frequency F0 and its lower harmonics, and

the second is the supraglottal pressure oscillations at the

lowest vocal tract resonance. The subglottal pressure

oscillations at the lowest subglottal resonance may also be

significant at the higher ranges of fundamental frequency

used in singing and some types of speech, but this factor

has not been included explicitly in his model.

When the ratio of the first formant frequency F1 to F0

is high, say more than about three, the formant energy

carried over between glottal cycles is small enough so that

the inertive loading tends to be the more significant

factor, tilting the glottal flow pulse to the right, and

causing the sharp slope discontinuity at the instant of

glottal closure which generates most of the higher frequency

energy in voiced speech.

For the purpose of this simplified discussion, the

glottal constriction can be thought of as a purely

dissipative flow resistance which is inversely proportional

to the glottal area. The glottal area waveform is

represented by a roughly triangular pulse. This pulse is

similar in shape to many recordings of projected glottal

area (the area of the opening that would be seen from

directly above or below the glottis) that have been made

using photoglottographic techniques. In addition, the

acoustic impedance of the supraglottal and subglottal

systems can be approximated by an inertive reactance at F0

and those glottal harmonics falling below F1 (for the

supraglottal system) and below the lowest subglottal

acoustic resonance (for the subglottal system). The

justification for this simplified representation is that the

supraglottal acoustic impedance as seen by the glottis is

inertive for frequencies more than a few percent less than

F1, and the subglottal acoustic impedance as seen by the

glottis also tends to be inertive for frequencies between

the highest respiratory tissue resonance, which is of the

order-of-magnitude of 10 Hz in adults, and the lowest

acoustic resonance, which is roughly 300 to 400 Hz in


Since the subglottal and supraglottal air masses can

be considered to be more inertive (mass-like) than compliant

(compressible) under his assumptions, if the vocal folds

open after being closed a long time, there will be a delay

or lag in the build-up of air flow relative to the increase

in area, as the lung pressure acts to overcome the inertia

of the combined air mass. (The inertance of the air mass in

the glottis acts differently because it is time-varying and

will be neglected in this simplified discussion.) If we

assume a linear-system viewpoint, the opening phase of the

glottal air flow, until about 3/4 of the glottal area pulse

has passed, shows a time lag, or shift to the right, due to

the time constant Lt/Rg, where Lt is the tract inertance at

F0 and its lowest harmonics and Rg is the (time-varying)

glottal resistance. This time constant also causes an

appreciable rounding or smoothing of the top of the air flow

pulse, since the time constant is near its largest value at

that time due to the low value of Rg.

However, the linear system analogy breaks down during

the final 1/4 of the glottal pulse, since the closing vocal

folds force the glottal resistance to be infinite at the

closure (assuming perfect closure), and thereby force the

flow to zero in a relatively short time. During that time

interval (the last 1/4 or so of the glottal pulse) the

tracheal pressure can be found to have a significant

increase due to the inertance of the subglottal flow, and

the pharyngeal pressure a significant decrease due to the

inertance of the supraglottal flow. Thus, the transglottal

pressure during this interval is much higher than during the

rest of the glottal pulse, and acts to support the glottal

air flow until the actual instant of glottal closure is


Figure 2-8 shows the solution of the nonlinear

differential equation that results when the glottis is

represented by a time-varying resistance and the subglottal

and supraglottal acoustic systems by a single constant

inertance (Rothenberg, 1981a). The subglottal pressure will

be considered to be constant and the glottal inertance will

be considered to be zero. The glottal admittance will then

be a pure conductance. This conductance will be considered

to have a symmetrical, triangular shape, as the vocal folds

open and close during the glottal air pulse. Though this

representation does not properly reflect the flow dependency

of the glottal resistance, it should yield a good

approximation to the actual glottal flow for small values of

supraglottal impedance, if the variation in projected

glottal area is approximately triangular.

Figure 2-8. Glottal air flow resulting from a symmetrical,
triangular variation of glottal admittance, assuming the
simplified interactive model shown in the figure for the
glottal source and vocal tract. (From Rothenberg, 1981b)

The system is shown in the figure in its analogous

electrical circuit form, where

Yg = 1/Rg = the glottal conductance

PL = the average alveolar pressure in the lungs

Lt = the sum of subglottal and supraglottal

inertance near F0

Ug = the glottal volume velocity

The form of the resulting current pulse is determined

by the "normalized vocal tract inertance" $\bar{L}_t$, defined as

$$\bar{L}_t = \frac{2 L_t Y_{g\mathrm{MAX}}}{\tau_p}$$

where $\tau_p$ is the duration of the glottal pulse, and $Y_{g\mathrm{MAX}}$ is

the maximum glottal conductance.

The major feature of the air flow waveforms in

Figure 2-8 is that there is a critical range for the

normalized inertance $\bar{L}_t$, from about 0.2 to 1.0, in which the

glottal flow changes from a roughly symmetrical triangle to

a rounded "sawtooth" having one major point of slope

discontinuity at the instant of closure. In fact, the

mathematical solution to this idealized case shows that the

slope of the flow waveform becomes infinite at closure for

all values of the normalized inertance $\bar{L}_t$ larger than unity. Since the high

frequency energy produced at a discontinuity of slope tends

to be proportional to the change of slope at the

discontinuity, the high frequency energy produced at the

termination of the glottal pulse can be greatly increased by

inertive loading.
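
The skewing described above can be reproduced numerically. The following is a minimal sketch, not the solution method of Rothenberg (1981a): it integrates PL = Ug/Yg + Lt dUg/dt by backward Euler for a symmetrical triangular Yg(t), in normalized units (PL = YgMAX = pulse duration = 1). It assumes that the normalized inertance equals 2 Lt YgMAX divided by the pulse duration; the function names and the step count are illustrative choices.

```python
def triangular_conductance(t, tau_p, y_max):
    """Symmetrical triangular glottal conductance Yg(t) over one pulse."""
    if t <= 0.0 or t >= tau_p:
        return 0.0
    half = tau_p / 2.0
    return y_max * (t / half if t <= half else (tau_p - t) / half)


def glottal_flow(l_norm, p_lung=1.0, tau_p=1.0, y_max=1.0, n_steps=2000):
    """Integrate PL = Ug/Yg + Lt*dUg/dt with backward Euler (l_norm > 0)."""
    # Assumed normalization: l_norm = 2 * Lt * y_max / tau_p.
    l_t = l_norm * tau_p / (2.0 * y_max)
    dt = tau_p / n_steps
    times, flow = [0.0], [0.0]
    u = 0.0
    for n in range(1, n_steps + 1):
        t = n * dt
        y = triangular_conductance(t, tau_p, y_max)
        if y <= 0.0:
            u = 0.0  # closed glottis: infinite resistance forces zero flow
        else:
            # Implicit step, stable even as Yg approaches zero at closure.
            u = (u + dt * p_lung / l_t) / (1.0 + dt / (l_t * y))
        times.append(t)
        flow.append(u)
    return times, flow


def peak_time(l_norm):
    """Time of maximum flow; 0.5 would mean no skewing."""
    times, flow = glottal_flow(l_norm)
    return times[flow.index(max(flow))]


for l_norm in (0.2, 0.5, 1.0):
    print(l_norm, round(peak_time(l_norm), 3))
```

In this sketch the flow peak moves later in the pulse as the normalized inertance grows, which is the skewing to the right discussed in the text.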

The pulse skewing effect increases with increasing

total vocal tract inductance, with increasing glottal

conductance, decreasing glottal pulse duration and

decreasing lung pressure.

The flow dependency of the glottal conductance was not

considered in the analysis, but it was discussed in

Rothenberg (1985b). Also, Fant (1982a) gave an exact closed

form solution for the flow given lung pressure, a triangular

glottal area function, and a representation of the sub- and

supraglottal impedance by inductance elements only.

Inverse filtering studies and motion pictures of the

glottis during voicing have shown that there is often a

patent air path between the arytenoid cartilages. The

general result of this glottal leakage is to cause a gradual

onset of the glottal flow pulse, and a more gradual offset,

with less high frequency energy produced at both locations.

The reduction of high frequency energy at the instant of

glottal closure is of special interest because of its strong

potential effect on voice quality.

The R, L model in Figure 2-8 does not include the

interaction with the first formant. To include a

first-order approximation to the action of the first

formant, the model can be modified by adding an oral

compliance, Co, as shown in Figure 2-9. This oral

compliance can be considered a lumped approximation to the







Figure 2-9. An interactive model for the glottal source and
vocal tract that includes the effect of the first
formant. (From Rothenberg, 1981b)

compressibility of the supraglottal air and, at lower values

of F1, a small component due to the effective compliance of

the walls of the supraglottal tract. In this model, the

supraglottal inertance is split into two parts, one on

either side of the oral compliance. The forward or oral

component is the prime determinant of F1, in combination

with Co, while the rear or pharyngeal component is more

important in determining the overall asymmetry or tilting of

the glottal air flow waveform, since it acts directly on the

glottis, without the "cushioning" effect of an intermediate

compliance. In this model, a back vowel such as /a/ would

have a high value for the pharyngeal inertance and a low

value for the oral component, while the reverse would hold

for a front vowel such as /i/. Naturally, if this model is

to be useful, a more detailed definition would have to be

worked out from these general principles.

The dissipative elements associated with the vocal

tract, ROC, ROL, and RON, are shown dashed, since not all may

be needed in a simple model. ROC primarily represents the

dissipation associated with the compressibility of the air

flow and the compliance of the cavity walls; ROL represents

the dissipation associated with the velocity of the air flow

(boundary layer effects, etc.); and RON represents any

shunting effects, such as a small velopharyngeal leakage.

For non-nasal vowels with a high value of F1, the main

effect of oral dissipation is to determine the damping of F1

during the period of glottal closure, and since the total

dissipative loss is generally very small in this case, any

one of these three components can be used. However, for low

values of F1 or for nasalized vowels, the placement and

distribution of the dissipative loss elements should be


Though an inertive loading of the glottal source would

produce flow waveforms very much like those observed by

using standard inverse filtering techniques, given a

symmetrical variation of glottal admittance, it still

remains to be shown that this type of glottal-supraglottal

interaction is significant in such activities as speech or

singing. The most direct method might be to measure the

supraglottal impedance, but this requires the measurement of

the pressure just above the glottis, and is difficult to

implement. Using another approach, a nonlinear inverse

filter in which some of the effects of the

glottal-supraglottal interaction are removed (Rothenberg and

Zahorian, 1977) can be implemented. It can be seen that in

this case the supraglottal impedance did cause an alteration

of the glottal flow similar to that produced by inertive


Another approach in measuring the effect of

supraglottal load is to change it while keeping the glottal

area function approximately invariant. One way this can be

done is by changing the vowel. It was found that

vowels with a constriction closer to the glottis and a

higher first formant tended to have a glottal flow waveform

which was more skewed to the right, with a steeper flow

termination, and therefore might be expected to have more

high frequency energy generated by the termination of the

glottal closure.

Finally, the presence of supraglottal loading effects

can be tested by comparing vocalizations made with air and

with a large proportion of helium mixed with the air. By

reducing the acoustic inertance in the vocal tract, the

helium would be expected to reduce any supraglottal loading

effect, if present. It was found that the symmetry of the

waveform increases significantly with helium displacing some

of the air.

(b) Fant's Studies

Ananthapadmanabha and Fant (1982) calculated the true

glottal flow and its components, based on the assumption

that the source-filter interaction was mainly confined to

the first formant of the vocal tract and that the glottal

area function was given. They derived a linear differential

equation describing the variation of pressure drop across

the first formant load, and solved the equation by making

use of the Laplace transform. In their analysis, the

glottal flow was assumed to be linearly related to the

pressure drop across the first formant load. According to

Koizumi et al. (1985), this was not the case and they solved

a nonlinear differential equation with the glottal flow

properly expressed as a polynomial in the pressure drop

across the first formant load. In the time domain, the

glottal flow has two components:

(a) source residue component, which describes the main

pulse shape of the true glottal flow, and

(b) ripple component.

Finite initial conditions of vocal tract energy

storage add to the ripple component. Since the source

residue component is a smooth function, it can be described


The simplest way to understand the ripple phenomenon

is to consider the single-formant model (Figure 2-10). This

model has been used extensively to study source-tract

interaction effects. The vocal tract is represented to a

first approximation by the RLC resonant circuit for the

first formant. Rg(t) and Lg(t) are the nonlinear and

time-varying glottal resistance and inductance,

respectively, and are controlled by the glottal area

function Ag(t) as well as the current flowing through them,

Ug(t). If the impedance due to Rg(t) and Lg(t) is much

larger than the vocal tract input impedance Zt for all t,

then the glottal volume velocity Ug(t) will be essentially

independent of Zt and there will be no source-tract

interaction. This is true, however, only when Ag(t) is very

small or zero.

During the glottal closed conditions, the formants

prevail with a constant bandwidth, i.e., a constant rate of

decay. During the next glottal open interval, the glottal



Figure 2-10. A simple model to study the source-tract
interaction effects.


impedance becomes finite and acts as an additional path for

the parallel resonant circuit. The bandwidth increases

because of the extra damping, and the formant decays at a

relatively faster rate. Thus, glottal damping can cause a

truncation of formant amplitudes during the interval of

glottal opening. This truncation effect is particularly

apparent in maximally open vowels, especially back vowels with

high F1, such as /a/.
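
As a rough illustration of the truncation effect, the envelope of a formant with bandwidth B decays as exp(-pi B t), so a larger open-phase bandwidth means a larger loss over the same interval. The sketch below only evaluates that envelope; the bandwidths and the open-phase duration are hypothetical values chosen for illustration, not measurements from this work.

```python
import math


def decay_db(bandwidth_hz, duration_s):
    """Envelope change in dB of a formant with the given bandwidth."""
    envelope = math.exp(-math.pi * bandwidth_hz * duration_s)
    return 20.0 * math.log10(envelope)


# Hypothetical values: 50 Hz bandwidth with the glottis closed,
# 200 Hz with the glottis open, and a 4 ms open phase.
closed_bw_hz = 50.0
open_bw_hz = 200.0
open_phase_s = 0.004

print("closed-phase bandwidth:", round(decay_db(closed_bw_hz, open_phase_s), 2), "dB")
print("open-phase bandwidth:  ", round(decay_db(open_bw_hz, open_phase_s), 2), "dB")
```

Because the decay in dB is proportional to the bandwidth, a fourfold increase in bandwidth gives a fourfold larger drop over the open phase.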

Summary of All Source-Tract Interaction Effects

On the basis of transient theory analysis, the

following major effects of source-filter interaction have

been identified. These are (1) skewing, (2) truncation, (3)

dispersion, (4) superposition - linear, (5) superposition -

nonlinear, (6) superposition - mechanical, and (7)

supraglottal.


(1) Skewing: Due to the load on the glottis, the flow

pulse is skewed compared to the area function. This skewing

depends on the input inductance of the load. Acoustically,

the main effect of skewing is to uniformly increase the

level of all the formants.

(2) Truncation and (3) Dispersion: The vocal tract

resonances and bandwidths undergo continuous modulation over

the glottal open phase due to coupling to the subglottal

system and the time-varying nonlinear glottal impedance.

The bandwidth varies considerably, resulting in an

exponential ringing over the glottal closed phase and an

almost complete truncation over the open phase due to the

increased glottal damping.

(4) Superposition - linear: In those cases (notably

high pitched voices) where the truncation is not

significant, energy will be carried over from one glottal

period to the next. The degree of the superposition depends

on the relation between pitch frequency and formant


(5) Superposition - nonlinear: The superposition

within the open phase may affect the glottal flow derivative

at the instant of the closing discontinuity and, thus, change

the excitation level.

(6) Superposition - mechanical: It is suspected that

the presence of a superposition component may, at times,

affect the mechanical vibrations of the vocal folds.

(7) Supraglottal: It is evident that a supraglottal

constriction will affect the transglottal pressure drop and,

thus, the pattern of vocal fold vibrations. This is typical

of voiced fricatives. Bickley and Stevens (1987) have

studied the effects of a vocal tract constriction on the

glottal source. Their results showed that the more

constricted configurations resulted in longer open times

for the glottis compared to that during the open vowel




In this chapter, we first describe the different possible

ways to synthesize speech that includes source-tract

interaction. Then, our method of incorporating source-tract

interaction effects into the conventional speech production

model is presented. The glottal flow waveforms required for

the model are estimated by the technique of glottal inverse

filtering, and a review of its literature is given. Source

models to fit the glottal flow waveforms are discussed.

Modeling Source-Tract Interaction

Thanks to the linear separability of voice source and

vocal tract system, the conventional model allows for an

unequivocal definition of both the source and the system.

For models in which voice source and vocal tract interact,

source and system can no longer be defined unambiguously,

nor can they be totally separated. There exist various

combinations of a source and a filter function that will

produce one and the same or approximately the same output.

(a) The most complete model of speech production is that of

Ishizaka and Flanagan (1972) who have developed a two-mass

model of the vocal folds incorporated in a complete vocal

tract network including the subglottal system (i.e., the

lungs, bronchi, and trachea) and the supraglottal system,

including a finite cavity wall impedance and the nasal system.

Self-oscillation is ensured by appropriate feedback of

pressure/flow states affecting the mechanical system

function of vocal fold vibration, the main power deriving

from the expiratory force as represented by the lung

pressure. This model does not have a specific source in the

linear network sense. The two-mass vibratory pattern is

accordingly under the influence of physiological parameters,

i.e., lung pressure, muscular tensions, the rest position of

the vocal folds, and the mass of vibrating parts to which

are added the acoustic load of the sub- and supraglottal

systems. As long as the flow is not reduced by a

supraglottal constriction comparable to that at the glottis,

the glottal area function remains rather insensitive to

articulatory variations.

This model incorporates both mechanical and acoustical

source-tract interaction. However, the knowledge of

aerodynamic forces within the glottis is not sufficiently

well developed and the vibratory pattern of vocal folds is

too complex and too detailed to be directly modeled in a

practical synthesis scheme.

(b) The glottal area as a function of time is an

alternative to the vocal-fold model. One could thus start

out with lung pressure and glottal area function and solve

for any flow and pressure within the complete system, e.g.,

the output flow at the lips or the input flow to the

supraglottal system, i.e., the glottal volume velocity flow.

Another possibility is to produce a two-stage

synthesis. In the first stage, a voltage proportional to the

glottal area is fed into a Foster-type cascade of parallel

resonant circuits representing the F1 and F2 vocal

tract-impedance load. The resulting current function is the

input to the second stage which is a conventional cascade

formant synthesizer.

If the two-dimensional projected glottal area function

is used, it cannot represent the complex three-dimensional

movements of the vocal folds. The lower and upper lips of

the folds do not move in phase, the longitudinal component

producing alternately converging and diverging shapes.

(c) In the third, approximate model, the source function

is the already smoothed version of the glottal flow, i.e.,

with the ripple excluded. It feeds into a vocal tract network

whose input is shunted by the glottal impedance (in a more

general model, in series with the subglottal impedance). The

shunt draws current during the glottal open phase, which

produces truncation and corresponds to the ripple component

of the true glottal flow.

Alternatively, the smoothed glottal flow can be

combined with rules for bandwidth broadening to represent

truncation or, even better, the intraglottal variation of

formant bandwidths and possibly frequencies.

This model includes the acoustical source-tract

interaction only, but not the mechanical source-tract


Source-Tract Interaction Incorporation

To incorporate the source-tract interaction effects

into the conventional speech production model, the simplest

way is to adopt a variable glottal pulse model. The

smoothed glottal flow can be modeled here, which may be

represented by a few glottal source parameters. To simulate

the ripple effects, an open-phase vocal tract filter with

higher formant bandwidths, and possibly higher formant

frequencies, than its closed-phase counterpart is used for

voiced sounds.

The proposed speech production model is shown in Figure 3-1.
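
The idea of switching between closed-phase and open-phase vocal tract filters can be sketched with a single second-order digital resonator whose coefficients change with the glottal phase. This is an illustration under stated assumptions, not the synthesizer of Figure 3-1: one formant only, a unit impulse at closure standing in for the fitted glottal pulse model, and hypothetical values for the sampling rate, formant frequency, bandwidths, and phase durations.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of y[n] = x[n] + a1*y[n-1] + a2*y[n-2] for one formant."""
    r = math.exp(-math.pi * bw_hz / fs_hz)
    return 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs_hz), -r * r


def synthesize(excitation, is_open, closed_formant, open_formant, fs_hz):
    """Filter the excitation, using the open-phase formant while is_open."""
    closed = resonator_coeffs(closed_formant[0], closed_formant[1], fs_hz)
    opened = resonator_coeffs(open_formant[0], open_formant[1], fs_hz)
    y1 = y2 = 0.0  # filter state is carried across the phase boundary
    out = []
    for x, open_now in zip(excitation, is_open):
        a1, a2 = opened if open_now else closed
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def energy(samples):
    return sum(v * v for v in samples)


FS_HZ = 10000.0                         # assumed sampling rate
excitation = [1.0] + [0.0] * 99         # impulse at the instant of closure
is_open = [False] * 40 + [True] * 60    # 4 ms closed, then 6 ms open

# (formant frequency in Hz, bandwidth in Hz); all values hypothetical.
fixed = synthesize(excitation, is_open, (700.0, 50.0), (700.0, 50.0), FS_HZ)
switched = synthesize(excitation, is_open, (700.0, 50.0), (700.0, 200.0), FS_HZ)

print("open-phase energy, fixed bandwidth:   ", energy(fixed[40:]))
print("open-phase energy, switched bandwidth:", energy(switched[40:]))
```

The two outputs are identical during the closed phase; once the glottis opens, the broadened bandwidth makes the formant oscillation die away faster.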

Measurement of Model Parameters

Measurements of model parameters should be made from

the speech signal, but with the additional knowledge of the

electroglottographic (EGG) signal, more robust processing

can be performed.

For voiced sounds, the closed phase vocal tract

parameters can be obtained from the closed phase covariance

LPC analysis. The glottal flow waveform can then be





Figure 3-1. Speech production model with source-tract

obtained by inverse filtering. A source model can be fitted

to the glottal flow waveform to obtain the glottal source

parameters. To estimate the open phase vocal tract

parameters, a new method has to be developed.

Electroglottograph (EGG)

The EGG measures the radiofrequency (RF) impedance

across the larynx and hence the area of contact of the vocal

folds (Childers and Krishnamurthy, 1985). It monitors the

vibratory motion of the vocal folds.

A pair of electrodes is applied to the neck at the

level of the larynx. A high frequency (about 5 MHz) current

passes from one electrode through the neck and is picked up

by the other electrode. As the subject phonates, the

opening and closing of the vocal folds change the electrical

impedance of the neck in the region of the electrodes. This

modulates the RF current, which is

demodulated using a detector to yield the EGG

signal.

The EGG indicates the electrical impedance through the

neck at the level of the larynx and thus monitors variations

in vocal fold contact: glottal closure is associated with a

reduction in tissue impedance. A time lag for the acoustic

propagation delay from the glottis to the microphone is

applied to the EGG signal when it is compared with the

speech signal. The EGG signal represents inverse vocal fold

contact, so that an increase in amplitude denotes glottal


The steep negative slope of the EGG signal associated

with glottal closure occurs in one or two sample points.

Glottal opening occurs more slowly and makes the true

opening point more difficult to determine accurately

(Childers and Krishnamurthy, 1985).
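
Because closure shows up as a steep negative slope lasting only a sample or two, closure instants can be located from the first difference of the EGG signal. The sketch below is one simple way to do this, not the procedure used in this work: the relative threshold, the synthetic test waveform, and the 10 kHz sampling rate are all assumptions made for illustration.

```python
def closure_instants(egg, rel_threshold=0.5):
    """Return sample indices where the EGG falls most steeply (closures)."""
    diff = [egg[i + 1] - egg[i] for i in range(len(egg) - 1)]
    most_negative = min(diff)
    if most_negative >= 0.0:
        return []
    threshold = rel_threshold * most_negative
    instants = []
    i = 0
    while i < len(diff):
        if diff[i] <= threshold:
            start = i
            while i < len(diff) and diff[i] <= threshold:
                i += 1
            # Within a run of steep samples, keep the steepest one.
            steepest = min(range(start, i), key=lambda k: diff[k])
            instants.append(steepest + 1)
        else:
            i += 1
    return instants


def synthetic_egg(n_samples, period=100):
    """Toy EGG: abrupt fall at closure, flat closed phase, slow rise."""
    out = []
    for n in range(n_samples):
        m = (n + 10) % period
        if m == 0:
            out.append(0.5)
        elif m <= 40:
            out.append(0.0)
        else:
            out.append((m - 40) / (period - 41))
    return out


FS_HZ = 10000.0  # assumed sampling rate
instants = closure_instants(synthetic_egg(400))
periods = [b - a for a, b in zip(instants, instants[1:])]
print(instants, [FS_HZ / p for p in periods])
```

Differences between successive closure instants give the pitch period on a period-by-period basis.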

The EGG signal is immune to the surrounding acoustic

disturbances and provides a robust pitch detector. However,

the pitch period cannot be reliably estimated from the raw EGG

waveform by determining the zero-crossing interval because

of noise perturbations and dc level shifts. Therefore, a

351-point FIR linear-phase high-pass filter with a cutoff

frequency of 80 Hz is used to remove the low-frequency

variation.
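
A filter with these specifications can be sketched with the windowed-sinc method. The text gives only the length, the cutoff, and the linear-phase property; the Hamming window, the spectral-inversion construction, and the 10 kHz sampling rate below are assumptions, so this should be read as one plausible design rather than the filter actually used.

```python
import math


def highpass_fir(num_taps=351, cutoff_hz=80.0, fs_hz=10000.0):
    """Linear-phase high-pass FIR: unit impulse minus a windowed-sinc low-pass."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd for this construction")
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    lowpass = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        lowpass.append(ideal * window)
    total = sum(lowpass)
    lowpass = [c / total for c in lowpass]  # unity gain at dc
    highpass = [-c for c in lowpass]
    highpass[mid] += 1.0                    # spectral inversion
    return highpass


def gain(coeffs, freq_hz, fs_hz):
    """Magnitude response of the FIR filter at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    re = sum(c * math.cos(w * n) for n, c in enumerate(coeffs))
    im = sum(c * math.sin(w * n) for n, c in enumerate(coeffs))
    return math.hypot(re, im)


h = highpass_fir()
for f in (0.0, 5.0, 80.0, 500.0):
    print(f, gain(h, f, 10000.0))
```

The symmetric coefficients give a constant delay of 175 samples, which has to be compensated when the filtered EGG is aligned with the speech signal.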

The EGG can be used as a tool for validating speech

processing algorithms and for estimating possible lower

bounds for both computation and performance of these

algorithms. It can also be used for "bench marking" how

well certain speech analysis algorithms perform. Further,

with the aid of the EGG as an exploratory tool one may be

able to invent new algorithms which will work on the speech

signal alone.

The EGG channel can help solve the deconvolution

problem of inverse filtering the speech signal, improve

voiced, unvoiced, and silence detection and fundamental

frequency estimation, and facilitate spectral estimation and

formant tracking (Krishnamurthy and Childers, 1986). The

two-channel (speech, EGG) speech analysis technique leads to

computational and performance improvements over "speech

only" analysis methods due to the added EGG channel. The

EGG-based pitch detection scheme provides the pitch on a

period-by-period basis.

The EGG does not reflect a direct measure of glottal

area. It is postulated that the tissue impedance is

inversely proportional to lateral contact area of the vocal

folds. Both the glottal volume-velocity and EGG waveforms

capture features of the vocal fold motion, but in a

complementary fashion. The EGG waveform provides more

information during glottal closure, whereas volume flow

provides more information during the open portion of the

cycle. The two taken together are superior in many ways to

the glottal area function determined by photographic means,

in which information is apparently lost in the process of

projecting the three-dimensional glottis onto a horizontal


Glottal Inverse Filtering

The process of estimating the glottal volume velocity

by removal of vocal tract resonances from speech signals is

known as glottal inverse filtering. The objective is to

find the shape of the vocal fold wave (source of

excitation), since the final speech signal is a function of

both the source waveshape and the vocal tract transfer

function. The vocal folds are located in a rather

inaccessible place as far as making measurements is

concerned, particularly when they are in operation. The

output waves which are generated by the vocal folds have

been thoroughly hidden by the complex acoustical nature of

the vocal tract.

If a closed glottal phase does not actually occur, a

unique glottal volume-velocity cannot be determined from the

speech waveform. Similarly, if zeros are present in the

vocal tract system, as during nasalized speech, the

resulting time lag of the input waveform makes a unique

determination of the glottal volume-velocity impossible

because the glottal waveshape zeros cannot be separated

unambiguously from the vocal tract zeros. Thus, an all-pole

model of the vocal tract is used for glottal inverse

filtering; care is taken to avoid nasalized speech, and a

closed glottal phase is assumed.
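
With an all-pole vocal tract model 1/A(z), the inverse filter is simply the FIR filter A(z). The sketch below checks this on synthetic data: a known source is passed through two cascaded resonators and then recovered by the matching inverse filters. The formant values, the source shape, and the sampling rate are hypothetical, lip radiation is ignored, and the true filter coefficients are assumed known, which is not the case in practice.

```python
import math


def resonator(freq_hz, bw_hz, fs_hz):
    """Predictor coefficients [a1, a2] of a second-order all-pole section."""
    r = math.exp(-math.pi * bw_hz / fs_hz)
    return [2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs_hz), -r * r]


def all_pole_filter(source, a):
    """y[n] = source[n] + sum_k a[k] * y[n-1-k], i.e. filtering by 1/A(z)."""
    out = []
    for n, x in enumerate(source):
        acc = x
        for k, coeff in enumerate(a):
            if n - 1 - k >= 0:
                acc += coeff * out[n - 1 - k]
        out.append(acc)
    return out


def inverse_filter(speech, a):
    """e[n] = speech[n] - sum_k a[k] * speech[n-1-k], i.e. filtering by A(z)."""
    out = []
    for n, s in enumerate(speech):
        acc = s
        for k, coeff in enumerate(a):
            if n - 1 - k >= 0:
                acc -= coeff * speech[n - 1 - k]
        out.append(acc)
    return out


FS_HZ = 10000.0                                # assumed sampling rate
formants = [(500.0, 60.0), (1500.0, 90.0)]     # hypothetical (Hz, Hz)

# Toy glottal pulse: closed, slow rise, fast fall, closed; three periods.
one_period = ([0.0] * 20 + [k / 30.0 for k in range(30)]
              + [1.0 - k / 10.0 for k in range(10)] + [0.0] * 40)
source = one_period * 3

speech = source
for f, b in formants:
    speech = all_pole_filter(speech, resonator(f, b, FS_HZ))

estimate = speech
for f, b in formants:
    estimate = inverse_filter(estimate, resonator(f, b, FS_HZ))

max_error = max(abs(e - s) for e, s in zip(estimate, source))
print(max_error)
```

The recovery is exact only because the inverse filter matches the synthesis filter; with estimated formants, any mismatch appears as residual formant ripple on the recovered pulse.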

Inverse filtering provides a picture of the true

glottal flow provided that it is an exact inverse of the

transfer function from glottal flow to the speech wave. This implies

that frequencies and bandwidths of the inverse filter be set

to represent glottal closed conditions. An accurate

estimation of either the glottal volume-velocity or the

vocal-tract filter allows a determination of the other

quantity to within the limits of the assumed model. Due to

the lip radiation impedance which introduces a zero at dc,

the glottal volume velocity baseline cannot be recovered


Low frequency noise is an ever present problem in

glottal inverse filtering. It was impossible to avoid the

presence of very low frequency noise in the microphone

output due to slight variations in room air pressure, and

air movements caused by breathing and slight body movements.

This noise made it essential to limit the low frequency

response of the system by means of a.c. couplings in the

amplifiers, but even so some remaining noise, mainly in the

region of 10 Hz, caused slight variations of the base line

on the measured laryngeal waves.

The glottal volume-velocity reflects the action of the

vocal folds and is thus an important indicator of laryngeal

function. Any abnormality of the larynx that affects the

vibrational pattern of the vocal folds and the audible

quality of the speech will be evident in the glottal

volume-velocity waveform. However, the resonances created

by the vocal tract make the degree of an abnormality

difficult to quantify in the resultant acoustic speech

waveform. The estimation of the glottal volume-velocity

waveform from acoustic speech thus has an important

application in the study of laryngeal pathology and in the

detection and diagnosis of laryngeal disorders.

To recover the glottal pulse shape accurately the

original speech signal has to be recorded and sampled

without phase distortion throughout the frequencies of

interest. Because standard tape recorders distort phase,

the recording must be made with an FM system or by direct

digital conversion. Standard studio microphones also

introduce considerable phase distortion, so the speech

pressure wave has to be recorded with an instrumentation

condenser microphone. Instrumentation condenser microphones

have the necessary phase characteristics and a frequency

response that permits the recording of considerable detail

in the glottal pulse, but when recording in a free field they

cannot capture the dc component of the airflow. This means

that the free-field microphone technique will miss the

continuous part of the airflow that occurs when the glottis

vibrates without making a complete closure during the cycle.

Other drawbacks include the fact that the amplitude of

glottal flow cannot be calibrated and that the microphone is

sensitive to very low frequency changes in pressure in the

environment in which the recording is made.

Time domain inverse filtering reveals more of the

overall pulse shapes relevant for a lower part of the

spectrum than spectral levels in formant regions. Frequency

domain inverse filtering does not demand an extended low

frequency response and may thus be performed from ordinary

tape recordings. One alternative, preserving spectral

amplitude information only, is to start out from a narrow

band harmonic spectrum and subtract on a dB scale the

all-pole transfer function given an estimate of formant

frequencies and bandwidths.
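
The dB-scale subtraction can be sketched as follows. Each formant is modeled as a conjugate pole pair with unit gain at zero frequency, and the summed gain in dB is subtracted from the level of each harmonic. This is only an illustration: the formant values and harmonic levels are made up, and corrections for higher poles and for lip radiation are left out.

```python
import math


def formant_gain_db(f_hz, formants):
    """Gain in dB at f_hz of cascaded conjugate pole pairs (0 dB at dc)."""
    gain = 0.0
    for freq, bw in formants:
        half = bw / 2.0
        num = freq * freq + half * half
        den = (math.sqrt((f_hz - freq) ** 2 + half * half)
               * math.sqrt((f_hz + freq) ** 2 + half * half))
        gain += 20.0 * math.log10(num / den)
    return gain


def source_spectrum_db(harmonic_levels_db, f0_hz, formants):
    """Subtract the all-pole transfer function from each harmonic level."""
    return [level - formant_gain_db((k + 1) * f0_hz, formants)
            for k, level in enumerate(harmonic_levels_db)]


# Hypothetical data: F0 = 100 Hz, three formants, and a source whose
# harmonics fall by 12 dB per octave.
F0_HZ = 100.0
formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]
true_source_db = [-12.0 * math.log2(k + 1) for k in range(30)]
speech_db = [s + formant_gain_db((k + 1) * F0_HZ, formants)
             for k, s in enumerate(true_source_db)]

recovered_db = source_spectrum_db(speech_db, F0_HZ, formants)
print([round(v, 2) for v in recovered_db[:5]])
```

Only amplitude information is used here, so the phase of the source, and hence the pulse shape, cannot be recovered by this route.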

Literature Review

Miller (1959) used an analog inverse network to try to

cancel the first formant. The parameters of the network

were adjusted so as to obtain zero current. His results

indicated that the main excitation of the higher resonances

occurs at the point of vocal fold closure and that the

magnitude of this excitation can be controlled by the talker

over wide ranges.

Holmes (1962) used an inverse filter consisting of a

cascade of five similar networks whose antiresonance

frequencies and bandwidths can be adjusted to remove the

formants. The frequencies and bandwidths of the zeros are

adjusted to produce minimum formant frequency ripple in the

output waveform. He showed that in many cases there were

also well defined instants of excitation of the second and

higher formants at other points in the laryngeal wave (the

instant of opening).

Rothenberg (1973) derived the volume velocity waveform

at the glottis during voiced speech by inverse-filtering the

volume velocity waveform at the mouth. The volume-velocity

waveform at the mouth was captured by means of a

circumferentially vented pneumotachograph mask. Unlike the

technique of inverse-filtering radiated acoustic pressure,

this method provides a signal that is accurate down to zero

frequency, not susceptible to low-frequency noise, and

easily calibrated in amplitude by a constant air flow.

Sondhi (1975) employed a reflectionless uniform tube

to neutralize the effects of the vocal tract transfer

function, so as to measure the glottal airflow directly.

Variations in the male and female glottal wave with

different speakers and different phonetic conditions have

been studied by Monsen and Engebretson (1977) with this

tube. Analysis of the data indicated a wide variation of

the glottal waveform shape, its rms intensity and

fundamental frequency, phase spectrum, and intensity

spectrum. It was observed that as the fundamental frequency

changes over time, the glottal source varies in one of two

different ways. In one type of change, the harmonic

relations in the glottal spectrum become steeper as the

fundamental frequency rises. In a different type of

glottal-wave change, relations between harmonics tend to

remain the same despite a change in the fundamental

frequency; the source spectrum in this case is simply

shifted along the frequency and amplitude axes as a function

of fundamental frequency. To account for these variations

in the glottal source, at least three factors must be known:

the sex of the speaker, the voice register in which he

phonates, and the linguistic context in which the phonation

occurs. The slope of the spectrum for males is less steep

than that of the females. The female glottal waveshape

tends to be more symmetrical. Loud voice typically has a

higher F0 than normal voice. Soft voice has a more

symmetrical waveform and a more steeply declining spectrum.

For loud voice, the closing portion of the wave is brief and

abrupt, and there is a consequent increase of energy

particularly in the higher frequencies.

Holmes (1976) used partial inverse filtering to

produce waveforms representing single formants of voiced

speech. His results confirmed the well known fact that the

main formant excitation normally occurs at glottal closure.

However, there is, frequently, evidence of additional

excitation, not only at glottal opening and during the open

phase, but also after closure.

Hunt et al. (1978) described an interactive digital

inverse filtering system in which the advantages of analog

and digital methods are combined to provide a facility with

much greater convenience and power than either.

Wong et al. (1979) suggested a straightforward (but

computationally expensive) approach for performing glottal

inverse filtering from the acoustic speech waveform by

analyzing the normalized linear prediction error sequence

obtained by calculation of the p-pole total linear

predictive error on an M-point window of the speech

waveform. Both the moment of glottal closure and opening

can be determined from the normalized total squared error

with proper choices of analysis window length and filter

order. The window is moved through the speech waveform one

point at a time and after energy normalization the

total-error sequence represents a measure of the fit of a

p-pole model to segments of the waveform. The total error

is at a minimum (ideally, zero) during analysis of a

completely closed phase segment. However, in cases of

high-pitched or breathy speech where the closed phase is of

shorter duration, this method may not provide unambiguous

local minimal ranges of the normalized error sequence that

are necessary to indicate closed phase. This problem occurs

when the duration of the closed phase is less than the

length of the analysis window. To estimate the actual

volume velocity waveform, the filter for a single period was

chosen as that corresponding to the minimum normalized
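
The sliding-window error measure described above can be sketched in a few lines. This is a simplified illustration and not the implementation of Wong et al. (1979): the test signal is a single resonance excited by one impulse per period, so "closed phase" here just means an excitation-free stretch, and the window length, model order, and signal parameters are hypothetical.

```python
import math


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting (non-singular systems only)."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = aug[row][n] - sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = acc / aug[row][row]
    return x


def normalized_error(s, start, window, order):
    """Covariance-method LPC error on s[start:start+window], over its energy."""
    idx = range(start + order, start + window)
    phi = [[sum(s[n - i] * s[n - j] for n in idx) for j in range(order + 1)]
           for i in range(order + 1)]
    coeffs = solve_linear([row[1:] for row in phi[1:]],
                          [phi[i][0] for i in range(1, order + 1)])
    error = phi[0][0] - sum(c * phi[i + 1][0] for i, c in enumerate(coeffs))
    return error / phi[0][0]


def resonance(excitation, freq_hz, bw_hz, fs_hz):
    """Two-pole resonator driven by the excitation."""
    r = math.exp(-math.pi * bw_hz / fs_hz)
    a1, a2 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs_hz), -r * r
    y = []
    for n, x in enumerate(excitation):
        y.append(x + (a1 * y[n - 1] if n >= 1 else 0.0)
                 + (a2 * y[n - 2] if n >= 2 else 0.0))
    return y


# Impulses every 100 samples (F0 = 100 Hz at an assumed 10 kHz rate).
excitation = [1.0 if n % 100 == 0 else 0.0 for n in range(300)]
speech = resonance(excitation, 700.0, 100.0, 10000.0)

WINDOW, ORDER = 20, 2
errors = [normalized_error(speech, w, WINDOW, ORDER)
          for w in range(len(speech) - WINDOW + 1)]
print(errors[30], errors[90])
```

Windows lying entirely between excitations give an error near zero, while windows that straddle an excitation do not, which is the property used to locate the closed phase.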


Fant (1982a) has studied the covariation of flow

parameters with voice intensity and pitch using analog

inverse filtering. There appear to be two modes available

to produce an intensity increase. One is a rise in the

overall scale factor of glottal flow pulses which is a main

consequence of increased subglottal pressure. The other is

an adduction of the vocal folds physiologically induced by a

medial compression, which may increase the steepness of the

closing branch of glottal pulses while maintaining or even

reducing the amount of air contained in a single pulse. For

F0 variations, the flow amplitude shows a maximum around 115

Hz and then decreases in inverse proportion to F0. The flow

derivative, indicative of formant amplitudes, shows a

maximum at F0 = 118 Hz and then exhibits a fall-rise contour

indicating increased efficiency at higher pitch.

Veeneman and BeMent (1985) developed an automated

on-line method to determine the glottal volume-velocity

waveform from both speech and EGG signals. A high pass

filtering operation on the speech pressure waveform, recorded

using an electret condenser microphone, was combined with a

compensating low pass operation to maintain a flat closed

phase in the derived glottal volume-velocity.

Milenkovic (1986) described a linear model of a

glottal pulse waveform along with a procedure for jointly

determining an AR model of the vocal tract response together

with the parameters of the glottal pulse model. Closed

glottal LPC analysis is based on an implied model for the

glottal pulse, and the inverse filter coefficients are

optimized with reference to that model. Very simply stated,

the model specifies that the glottal pulse is zero valued

over the interval one has chosen to minimize the square of

the inverse filter output signal. An alternative philosophy

to the one underlying closed glottal analysis is to employ a

voice source model which describes the voice source signal

over an entire pitch period, and to optimize the inverse

filter using data from an entire pitch period. Use of a

whole pitch period provides the increased number of speech

samples required for proper analysis of high fundamental

frequency speech. In addition, a model fit that includes

the entire pitch period may be less sensitive to incomplete

glottal closure than a fit that relies on the closed glottal

interval alone. The assumption was made that during the

glottal closure interval, the voice source waveform is flat.

The assumption was also made that the glottal wave begins

and ends at the same amplitude in the course of the open

phase. These assumptions put a constraint on the basis

functions that the time integral of each basis function

over a pitch period is equal to zero.

Hunt (1987) studied the glottal excitation in steady

vowels of several modes of phonation obtained by interactive

inverse filtering. Electroglottograph waveforms were shown

together with simultaneous glottal airflow waveforms and

waveforms for individual formants. Modal voice shows

formant excitation concentrated on the instant of closure.

Falsetto voice shows a triangular or sinusoidal airflow

waveform. Breathy voice shows appreciable formant

excitation both on closure and at the center of the open

phase. Creaky voice shows appreciable excitation at the

start of the open phase as well as its end, and there is

often an alternation in spectral content of the excitation

from cycle to cycle causing the relative intensities of

formants to vary. In examples of extreme creak the airflow

waveforms are complex and difficult to interpret, but they

are similar to the electroglottograph waveforms.

Lee (1988) used a two-pass method to perform the

inverse filtering on the speech signal. In the first pass,

the locations of the main pulses of the LP error signal were

identified. Then, using these main pulses as indicators of

glottal closure, a "pseudo closed phase" is selected as the

analysis interval for a pitch-synchronous covariance LP

analysis to estimate the vocal tract filter, which in turn

is used to obtain the desired glottal volume-velocity


Automatic Glottal Inverse Filtering

As long as inverse filtering studies are undertaken in

order to obtain a basic understanding of the nature of the

glottal pulses, extremely high reliability may not be

required. If, however, the inverse filter outputs are

brought to bear to test models of the aerodynamics and

acoustics of speech production, reliability and accuracy

become central issues.

At first, the inverse filtering was done with an

analog inverse filter adjusted to cancel vocal resonances of

recorded signals. The parameter settings are supposed to be

optimal if the output glottal waveform contains a

ripple-free flat interval representing the closed glottal

interval where the air flow should be zero (if the glottis

closes completely) or fairly constant (if, in breathy voice,

an opening in the cartilaginous part of the glottis

remains). Usually this requires the adjustment of the

parameters of the inverse filter by a skilled operator and

the reliability and accuracy of this is in question.

However, with the advent of the digital computer, a number

of (quasi-)automatic inverse filtering implementations have

been reported which do not need operator intervention.

Inverse filtering outputs can also be compared with (almost)

simultaneous physiological registrations to study its

performance (Krishnamurthy and Childers, 1981).

With the additional knowledge of the

electroglottographic (EGG) signal, the glottal closed phase

can be identified more easily, and fully automatic glottal

inverse filtering becomes possible. Glottal closure is

associated with a rapid reduction in tissue impedance.

Glottal opening occurs more slowly and makes the true

opening point more difficult to determine accurately.

The inverse filter is determined by a linear

prediction covariance analysis on the closed phase region of

the speech waveform, as identified from an EGG signal. The

bounds of the potential closed phase region determined from

the EGG will define the maximum frame size. To select a

closed phase analysis region from the many possible, a

closed phase "flatness" measure is used. This yields the

closed phase region that gives the minimum variance glottal

volume-velocity waveform over all the closed phase regions

derived from the sample epoch.
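The closed-phase covariance analysis described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the function names `covariance_lpc` and `residual_energy` are hypothetical, and the normal equations are solved by plain Gaussian elimination for clarity. The window consists of `p` history samples followed by `length` samples over which the squared prediction error is minimized.

```python
def covariance_lpc(s, start, length, p):
    """Covariance-method linear prediction (illustrative sketch).

    Minimizes sum of e[n]^2 for n in [start, start+length), where
    e[n] = s[n] + a[1]*s[n-1] + ... + a[p]*s[n-p]. Requires start >= p.
    Returns [1, a[1], ..., a[p]].
    """
    def phi(i, j):
        return sum(s[n - i] * s[n - j] for n in range(start, start + length))

    # Normal equations: sum_k a[k]*phi(k, i) = -phi(0, i), for i = 1..p
    A = [[phi(k, i) for k in range(1, p + 1)] + [-phi(0, i)]
         for i in range(1, p + 1)]

    # Gaussian elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p + 1):
                A[r][k] -= f * A[c][k]

    # Back substitution
    a = [0.0] * p
    for r in range(p - 1, -1, -1):
        acc = sum(A[r][k] * a[k] for k in range(r + 1, p))
        a[r] = (A[r][p] - acc) / A[r][r]
    return [1.0] + a


def residual_energy(s, start, length, coeffs):
    """Total squared prediction error over the minimization region."""
    total = 0.0
    for n in range(start, start + length):
        e = sum(c * s[n - k] for k, c in enumerate(coeffs))
        total += e * e
    return total
```

When the window lies entirely inside an interval with no excitation (an ideal closed phase), the residual energy is essentially zero and the true all-pole coefficients are recovered, which is the property the closed-phase method relies on.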

Since vocal tract resonant poles appear only as

complex-conjugate pairs, any real roots of the polynomial

should be removed after factoring. The real pole at zero

frequency will typically occur due to low-frequency

recording noise or a non-zero mean in the short analysis

window. Real poles may also occur when the required filter

order is overspecified. The first effect is avoided by

high-pass filtering the speech data, but the second cause

may still lead to a real zero. If it is not removed, "jags"

at the points of glottal closure will occur. A real pole

may also occur at the half-sampling frequency. When it is

of narrow bandwidth, it generally indicates a formant

location nearby, and thus should be retained. If a real

pole occurs due to spectral shaping requirements in the

analysis, without there being a nearby resonance, it will

generally be of wide bandwidth. Including such a pole in

the inverse filter will have a minimal effect on the

results. Therefore, as a practical matter, poles at the

half-sampling frequency are not removed, if they occur. The

derived vocal tract filter was then used to obtain the

volume velocity by inverse filtering the speech signal.

A number of limitations can be raised for closed

glottal interval LPC analysis. The first is that the closed

glottal interval is difficult to locate. For a real-time

voice source analysis in a vocoder, this is a valid

limitation, but in clinical speech situations, human

intervention to identify closed glottal intervals with the

aid of an electroglottograph (EGG) is not an unrealistic

requirement. A more serious limitation is that with higher

pitch voices, the closed glottal interval contains too few

samples to permit an effective least-squares determination

of the AR coefficients. With some voice types, the vocal

folds may not even fully close, and these voice types are of

high clinical interest.

Source Models

Recent efforts to characterize the essential features

of the voicing source waveform for different male and female

voices have led to several new parametric models of glottal

output. Source models can be used to represent the glottal

flow pulse shape in a few parameters. They should model the

skewing effect quite well, and hence the smoothed glottal


a) Fant's Model (3 parameters)

This model for the glottal flow was introduced by Fant

(1979a) in support of inverse filtering studies with analog

instrumentation. It is given in Figure 3-2. In addition to

the voice fundamental frequency $F_0 = 1/T_0$, its basic

parameters are the peak flow $U_0$, the glottal frequency

$F_g = \omega_g/2\pi$ and the asymmetry factor $K$. The pulse may be

ascribed a starting point $t = T_1$. The rising branch

$$U = \tfrac{1}{2} U_0 \left[1 - \cos \omega_g (t - T_1)\right], \qquad T_1 < t < T_2$$

reaches the peak value $U_0$ at $t = T_2$, where

$$T_2 - T_1 = \frac{\pi}{\omega_g} = \frac{1}{2 F_g}.$$




Figure 3-2. Fant's model.

Figure 3-3. The LF-model of differentiated glottal flow.

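The falling branch

$$U = U_0 \left[K \cos \omega_g (t - T_2) - K + 1\right], \qquad T_2 < t < T_3$$

hits the zero line after a time

$$T_3 - T_2 = \frac{1}{\omega_g} \cos^{-1}\left(\frac{K-1}{K}\right).$$

Providing $K > 0.5$ the termination is abrupt with a slope

$$U'_3 = \left.\frac{dU}{dt}\right|_{t = T_3} = -U_0\, \omega_g \sqrt{2K - 1}.$$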

If K > 1 this is the maximum slope during the course of the

falling branch. For 0.5 < K < 1 the maximum slope occurs

prior to closure. At K = 0.5 the falling branch is

symmetrical to the rising branch. This is the lower bound

of K in the present use of the model and represents a lowest

degree of excitation strength. The obvious shortcoming of a

model with abrupt flow termination is that it does not allow

for an incomplete closure or for a residual phase of

progressing closure after the major discontinuity.
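As an illustration only, the two branches of Fant's model can be evaluated with a few lines of Python. The function names `fant_pulse` and `closure_slope` are hypothetical; the expressions assume a rising branch of the form (1/2) U0 [1 - cos wg (t - T1)] and a falling branch U0 [K cos wg (t - T2) - K + 1], with the closure slope obtained by differentiating the falling branch at T3.

```python
import math


def fant_pulse(t, U0, Fg, K, T1=0.0):
    """Glottal flow of Fant's three-parameter model at time t (sketch)."""
    if K < 0.5:
        raise ValueError("K is assumed to be at least 0.5")
    wg = 2.0 * math.pi * Fg
    T2 = T1 + math.pi / wg                     # instant of peak flow
    T3 = T2 + math.acos((K - 1.0) / K) / wg    # instant of closure
    if T1 < t <= T2:                           # rising branch
        return 0.5 * U0 * (1.0 - math.cos(wg * (t - T1)))
    if T2 < t < T3:                            # falling branch
        return U0 * (K * math.cos(wg * (t - T2)) - K + 1.0)
    return 0.0                                 # closed glottis


def closure_slope(U0, Fg, K):
    """Slope dU/dt of the falling branch at the instant of closure T3."""
    wg = 2.0 * math.pi * Fg
    return -U0 * wg * math.sqrt(2.0 * K - 1.0)
```

With K = 0.5 the closure slope is zero and the pulse is symmetric about the peak; larger K gives a steeper, more abrupt termination.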

b) LF-Model (4 parameters)

The properties of the LF-model, which is shown in

Figure 3-3, have been fully described in Fant et al.

(1985a). The four parameters are used to model the

differentiated flow rather than the real glottal flow. The

differentiated flow is commonly used in speech synthesis,

and includes the effect of radiation at the lips. The model

consists of two parts. The first part is an exponentially

growing sinusoid to which three of the four parameters of

the model pertain. This segment is

$$\frac{dU_g(t)}{dt} = E(t) = E_0\, e^{\alpha t} \sin \omega_g t, \qquad t_o < t < t_e$$

modeling the flow from glottal opening until the main

excitation occurs (the moment of maximum discontinuity in

the glottal airflow function, which normally coincides with

the moment of maximum negative flow derivative). As opposed

to most other models of glottal flow, the LF-model is a

continuous function until the main excitation, and therefore

does not introduce additional excitations. In comparison,

Fant's model is composed of two different segments; a rising

branch up to maximum flow and a falling branch down to

complete closure. The discontinuity between the two

segments introduces a secondary weak excitation at the flow


The three parameters pertaining to the first segment

of the LF-model are

(1) $E_0$ which is merely a scale factor.

(2) $\alpha = -B\pi$ where $B$ is the "negative bandwidth" of the

exponentially growing amplitude.

(3) $\omega_g = 2\pi F_g$ where $F_g = 1/(2 t_p)$ and $t_p$ is the rise time (the

time from glottal opening to maximum flow).

The second part of the model is an exponential segment

that allows a residual flow (dynamic leakage) after the main

discontinuity, at time te, when the vocal folds close. The

segment used for this "return phase" is

$$E(t) = -\frac{E_e}{\varepsilon t_a}\left[e^{-\varepsilon (t - t_e)} - e^{-\varepsilon (t_c - t_e)}\right], \qquad t_e < t < t_c$$

where $t_a$ is the fourth parameter of the model. The

parameter $t_a$ is the time constant of the exponential curve

and is determined by the projection on the time axis of the

derivative at time $t_e$. The parameter $\varepsilon$ can iteratively be

determined from

$$\varepsilon t_a = 1 - e^{-\varepsilon (t_c - t_e)}$$

and for small values of $t_a$, $\varepsilon$ is approximately equal to $1/t_a$.

$E_e$ is the negative amplitude of the excitation spike and $t_c$

is the moment when complete closure is reached.

The effect of the return phase on the source spectrum

is, due to its exponential waveshape, approximately a first

order low-pass filter with a cutoff frequency $F_a = 1/(2\pi t_a)$.

This means that the longer the return phase, the lower the

cutoff frequency, and the larger the high frequency


By convention $t_c = t_o$, the time of glottal opening for

the forthcoming pulse period. This implies that the model

lacks a closed phase. In practice this is no drawback: for

normal (small) values of $t_a$, the exponential curve will fit

closely to the zero line, providing, for all intents and

purposes, a closed phase. The smaller number of parameters

makes the implementation of the model simpler.

Apart from the four parameters, there is a requirement

of area balance,

$$\int_{t_o}^{t_c} E(t)\, dt = 0$$

which keeps the zero flow line from drifting.
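The following Python sketch shows one possible way to turn the LF timing parameters into a waveform; it is an illustration under stated assumptions, not the procedure used in this work. Time is measured from glottal opening, the return-phase segment is taken with the scale factor Ee/(eps ta) so that E(te) = -Ee, eps is found by fixed-point iteration, and the growth factor alpha is found by bisection on the area balance, restricted here to alpha >= 0 and tp < te < 2 tp. All function and key names are hypothetical.

```python
import math


def solve_epsilon(ta, tc, te, iterations=100):
    """Fixed-point iteration for eps*ta = 1 - exp(-eps*(tc - te))."""
    eps = 1.0 / ta                      # small-ta approximation as start
    for _ in range(iterations):
        eps = (1.0 - math.exp(-eps * (tc - te))) / ta
    return eps


def lf_parameters(tp, te, ta, tc, Ee=1.0):
    """Derive wg, eps, alpha and E0 (assumes tp < te < 2*tp)."""
    wg = math.pi / tp                   # wg = 2*pi*Fg, Fg = 1/(2*tp)
    eps = solve_epsilon(ta, tc, te)
    d = tc - te
    s, c = math.sin(wg * te), math.cos(wg * te)
    # Closed-form area of the return phase (negative)
    area_return = -(Ee / (eps * ta)) * (
        (1.0 - math.exp(-eps * d)) / eps - d * math.exp(-eps * d))

    def balance(alpha):
        # Open-phase area with E0 chosen so that E(te) = -Ee
        area_open = -Ee * (alpha * s - wg * c + wg * math.exp(-alpha * te)) \
            / ((alpha ** 2 + wg ** 2) * s)
        return area_open + area_return

    lo, hi = 0.0, 1.0
    if balance(lo) <= 0.0:
        raise ValueError("this sketch only searches alpha >= 0")
    while balance(hi) > 0.0:            # expand until the sign changes
        hi *= 2.0
    for _ in range(200):                # bisection on the area balance
        mid = 0.5 * (lo + hi)
        if balance(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    E0 = -Ee / (math.exp(alpha * te) * s)
    return {"wg": wg, "eps": eps, "alpha": alpha, "E0": E0,
            "Ee": Ee, "te": te, "ta": ta, "tc": tc}


def lf_derivative(t, p):
    """Differentiated glottal flow E(t) for one period starting at t = 0."""
    if 0.0 <= t <= p["te"]:
        return p["E0"] * math.exp(p["alpha"] * t) * math.sin(p["wg"] * t)
    if p["te"] < t <= p["tc"]:
        k = p["Ee"] / (p["eps"] * p["ta"])
        return -k * (math.exp(-p["eps"] * (t - p["te"]))
                     - math.exp(-p["eps"] * (p["tc"] - p["te"])))
    return 0.0
```

For example, `lf_parameters(tp=0.004, te=0.0052, ta=0.0003, tc=0.010)` describes a 100 Hz pulse; sampling `lf_derivative` over the period and summing the samples gives a value close to zero, which is the area-balance requirement.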

Other glottal source models have also been developed.

An evaluation of some source models was given in Fujisaki

and Ljungqvist (1986).



In this chapter, we first describe how the data used

in our experiments are collected. Then, the implementation

of the proposed speech production model is described in

detail. The method to correct the distorted speech signals

is given next. The glottal flow waveforms obtained from

inverse filtering corrected sustained vowel signals are then

presented. Modeled glottal flow waveforms using the

LF-model and the corresponding synthesized signals which do

not include the source-tract interaction ripple effects are

then shown. Synthesized signals which include the ripple

effects by using different vocal tract filters during the

glottal open and closed phases are also to be presented.

Experimental Data Base

The procedure for collecting the data used in our

experiments is given as follows. Each subject was seated

inside an Industrial Acoustics Company (IAC) single-wall

sound room. An Electro-Voice RE-10 dynamic cardioid

microphone was located at a fixed distance of 6 inches from

the speaker's lips. The speech and electroglottographic

(EGG) signals were collected simultaneously. The

electroglottograph was a Synchrovoice Inc. model. The

speech and EGG data were amplified prior to digitization by

a Digital Sound Corporation DSC-240 audio control console.

The two data channels were directly digitized at a sampling

frequency of 10 kHz per channel by a Digital Sound

Corporation DSC-200 system with 16-bit precision. Both

channels were bandlimited to 5 kHz by passive elliptic

filters with a minimum stopband attenuation of -55 dB and a

passband ripple of 0.2 dB.

Some speech data were also collected with a

Bruel & Kjaer (B&K) model 4133 condenser microphone. In

addition, twelve sustained vowels were collected

simultaneously without EGG using both the Electro-Voice and

B&K microphones.

The B&K 4133 condenser microphone has a good

low-frequency response. Its amplitude response is within

1 dB down to 20 Hz, and its phase response is essentially

linear. The -3 dB low-frequency cut-off is around 10 Hz.

This feature is required when the speech signal is used for

glottal source estimation, since the glottal source

waveform, which is to be estimated, has its major energy

components at low frequencies (dc to 1 kHz). However, this

microphone characteristic also makes the B&K 4133 condenser

microphone sensitive to low-frequency breath and ambient

noise, which may cause problems in speech analysis.

Therefore, the Electro-Voice RE-10 microphone was used to

collect most of the speech data. This microphone has a good

frequency response at frequencies above 50 Hz.

Implementation of Proposed Speech Production Model

The proposed speech production model, which includes

source-tract interaction, has been implemented as an

analysis/synthesis system. The analysis program extracts

the model parameters from both the speech and EGG signals;

the synthesis program uses the extracted parameters to

recreate the original speech signal. Since we are only

interested in the voiced portions of the speech signal,

which can be determined from the differentiated EGG signals,

unvoiced portions of the speech signal are not analyzed and

are copied directly from the original speech signal

during synthesis. The analysis of the voiced speech

segments is pitch synchronous. Speech signals are not

pre-emphasized before analysis.

The analysis program stores the differentiated glottal

flow and glottal flow signals, as well as a feature file,

which contains for each frame, the starting point of the

frame, the frame type, and the frame length. For voiced

frames, the following are stored: closed phase covariance

linear prediction coefficients, the instant of glottal

opening, the location of glottal flow peak, and the location

and magnitude of the negative minimum of the differentiated

glottal flow. The synthesis program uses the parameters in

the feature file, as well as the glottal flow, to generate a

modeled glottal flow signal which is then used to synthesize

the original speech signal using the linear prediction

coefficients in the feature file. If the same vocal tract

filter is used in both the open and closed phases, the

resulting synthesized speech will not include the ripple

effects, but ripple effects can be approximated by using a

different vocal tract filter during the open phase which has

a larger first formant bandwidth than the closed phase vocal

tract filter.
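One way to derive an open-phase filter from the closed-phase filter is to move the first-formant pole pair toward the origin while keeping its angle fixed. The sketch below assumes the usual relation between pole radius r and bandwidth B, namely B = -(fs/pi) ln r; the function names and the amount of added bandwidth are hypothetical and are not taken from this work.

```python
import cmath
import math


def bandwidth_hz(z, fs):
    """3-dB bandwidth (Hz) associated with a z-plane pole."""
    return -math.log(abs(z)) * fs / math.pi


def widen_bandwidth(z, fs, extra_bw_hz):
    """Pole with the same frequency and a bandwidth larger by extra_bw_hz."""
    radius = abs(z) * math.exp(-math.pi * extra_bw_hz / fs)
    return cmath.rect(radius, cmath.phase(z))


def pole_pair_coeffs(z):
    """Denominator 1 - 2 Re(z) q^-1 + |z|^2 q^-2 for the pair (z, conj z)."""
    return [1.0, -2.0 * z.real, abs(z) ** 2]
```

Switching between `pole_pair_coeffs(z)` in the closed phase and `pole_pair_coeffs(widen_bandwidth(z, fs, extra))` in the open phase gives the first formant a faster decay while the glottis is open.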

Voiced and unvoiced segments in the speech signal are

determined from the differentiated EGG signal. It is known

that voiced sounds have large negative minima in the

differentiated EGG corresponding to the instants of closure,

so a negative threshold is used to locate these minima.

Voicing is considered to start when two consecutive minima

less than the threshold are found, and the duration between

these minima gives rise to a frequency which is within the

normal pitch frequency range of the talker. A range of 50

Hz to 400 Hz is used. Voicing is considered to stop when

the above condition is no longer met. Since we used pitch

synchronous analysis, the frame size is a pitch period. In

order that the LF-model can be fitted to the differentiated

glottal flow, the pitch period is chosen to start at the

instant of glottal opening, and end at the next instant of

glottal opening. The instant of glottal opening is

determined as the location of the maximum between two minima

in the differentiated EGG. To allow for error in the

location of the instant of glottal opening, the starting

point of a frame is actually a few points in front of the

instant of opening found. The region between a minimum

which corresponds to the instant of closure and the next

maximum of a differentiated EGG is considered as the closed

phase. To allow for possible errors in the identification

of such minima, the closed phase region is shortened by 3

points at the closure and 5 points at the opening.
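The segmentation rules in this paragraph can be sketched in Python as follows. This is an illustrative reading of the description, not the analysis program itself: the function names are hypothetical, the threshold value is left to the caller, and the 10 kHz sampling rate, the 50 to 400 Hz pitch range, and the 3-point and 5-point trims are the values quoted in the text.

```python
def closure_instants(degg, threshold):
    """Indices of local minima of the DEGG below a negative threshold."""
    idx = []
    for n in range(1, len(degg) - 1):
        if (degg[n] < threshold and degg[n] <= degg[n - 1]
                and degg[n] < degg[n + 1]):
            idx.append(n)
    return idx


def voiced_periods(degg, threshold, fs=10000, f_lo=50.0, f_hi=400.0):
    """Consecutive closure pairs whose spacing is a plausible pitch period."""
    minima = closure_instants(degg, threshold)
    periods = []
    for a, b in zip(minima, minima[1:]):
        f0 = fs / (b - a)
        if f_lo <= f0 <= f_hi:
            periods.append((a, b))
    return periods


def closed_phase(degg, closure, next_closure, trim_start=3, trim_end=5):
    """Closed phase from a closure to the next DEGG maximum, trimmed.

    The opening instant is taken as the location of the DEGG maximum
    between the two closures. Returns (first_index, last_index).
    """
    segment = degg[closure:next_closure]
    opening = closure + max(range(len(segment)), key=segment.__getitem__)
    return closure + trim_start, opening - trim_end
```

A frame for pitch-synchronous analysis would then run from one opening instant to the next, with the closed-phase interval above supplying the region for the covariance analysis.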

We have assumed that there will be a closed glottal

interval of sufficient duration to perform a closed phase

covariance LPC analysis over the closed phases, which are

determined from the differentiated EGG. A time lag of 0.9

msec (9 points for 10 kHz sampling rate) for the acoustic

propagation delay from the glottis to the microphone is

applied to the EGG signal when it is compared with the

speech signal. If the closed phase regions are short, the

LPC analysis may not have enough data samples to be

accurate. For very short closed phases, as is often the

case for female voices, the LPC analysis cannot be applied.

If there are no closed phases, as in breathy voices, but

regions of sufficient duration are mistakenly determined as

a closed phase from the differentiated EGG, the glottal flow

waveforms obtained from inverse filtering may not be

accurate. However, such waveforms may still give a general

idea of what the actual glottal flow waveforms may appear


The default linear prediction order is 12, and the

default region of minimization is 28 samples, which combine

to give a window size of 40 samples. These parameters can

be changed interactively. They are verified to determine if

the LPC analysis window fits in the closed phase interval as

determined from the differentiated EGG; if not they are

changed by the program as necessary. The actual location of

the LPC analysis window is determined by minimizing the

total squared error. To find the inverse filter, the

formant frequencies and bandwidths of the poles are

calculated. Real poles at dc are removed, but real poles at

the half-sampling frequency are retained. Also, extraneous

formants may occur, such as those with very low frequencies

or very large bandwidths, and they are removed if they

distort the glottal flow waveforms. In order that the

output is stable during synthesis, poles outside the unit

circle are reflected inside the unit circle, even though

this may distort the glottal flow waveforms. However, for

good vowel data, pole reflection seldom occurs, except

possibly at the onset or offset of voicing. The inverse

filter is reconstructed from the remaining poles.
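The pole handling described above can be sketched as follows, assuming the roots of the prediction polynomial have already been found by some root finder (not shown). Formant frequency and bandwidth are computed from the pole angle and radius with the standard relations F = fs theta / (2 pi) and B = -(fs / pi) ln r. The function names are hypothetical, and the removal of extraneous formants is omitted because the text does not give a fixed criterion for it.

```python
import cmath
import math


def formant(z, fs):
    """Formant frequency and bandwidth (Hz) of a z-plane pole."""
    freq = abs(cmath.phase(z)) * fs / (2.0 * math.pi)
    bw = -math.log(abs(z)) * fs / math.pi
    return freq, bw


def clean_poles(poles, tol=1e-9):
    """Reflect unstable poles inside the unit circle and drop dc poles."""
    kept = []
    for z in poles:
        if abs(z) > 1.0:
            z = 1.0 / z.conjugate()        # same angle, reciprocal radius
        if abs(z.imag) < tol and z.real > 0.0:
            continue                       # real pole at dc: removed
        kept.append(z)                     # poles at fs/2 are retained
    return kept


def inverse_filter_coeffs(poles):
    """Coefficients of prod(1 - z_k q^-1), lowest power of q^-1 first."""
    a = [complex(1.0, 0.0)]
    for z in poles:
        a = [x - z * y for x, y in zip(a + [0], [0] + a)]
    return [c.real for c in a]             # real for conjugate pole sets
```

The coefficient list returned by `inverse_filter_coeffs(clean_poles(poles))` defines the FIR inverse filter, and its reciprocal is the all-pole synthesis filter.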

The speech signal inside a frame is inverse filtered

to obtain the differentiated glottal flow, which is then

integrated to obtain the glottal flow. In order to remove

any low-frequency trend from the glottal flow, any dc level

of the differentiated glottal flow in the frame is removed.
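A minimal sketch of these two steps is given below. It assumes the inverse filter is the FIR polynomial A(q) = 1 + a[1] q^-1 + ... + a[p] q^-p, that integration is a running sum, and that the dc level is the frame mean; the function names are hypothetical.

```python
def inverse_filter(speech, a):
    """Apply e[n] = sum_k a[k] * s[n-k], with a[0] = 1."""
    out = []
    for n in range(len(speech)):
        acc = 0.0
        for k, ak in enumerate(a):
            if n - k >= 0:
                acc += ak * speech[n - k]
        out.append(acc)
    return out


def glottal_flow(dflow):
    """Integrate the differentiated flow after removing its frame mean."""
    mean = sum(dflow) / len(dflow)
    flow, acc = [], 0.0
    for d in dflow:
        acc += d - mean
        flow.append(acc)
    return flow
```

Because the mean is removed before the running sum, the integrated flow returns to its starting level at the end of the frame, which suppresses a linear trend within each pitch period.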

The analysis algorithm is summarized in Figure 4-1. The

differentiated EGG is represented as DEGG in the flow chart.

The analysis assumes a normal EGG waveform.

Figure 4-1. The block diagram for the analysis algorithm.

The synthesis algorithm is given in Figure 4-2. The

LF-model is used because it can match the glottal flow

waveforms accurately. To ensure that the programs run

correctly, synthesized data are used to test the programs,

as described in the Appendix.

Correction of Distorted Speech Signals

The inverse filtering results for a sustained vowel

/a/ recorded from a male speaker (DMH) with the

Electro-Voice and B&K microphones are shown in Figure 4-3

and Figure 4-4, respectively. The data collected by the B&K

microphone contain low-frequency noise and were high-pass

filtered, as were the Electro-Voice data. Since the phase must be

maintained to obtain undistorted glottal flow waveforms from

inverse filtering, a linear phase high pass FIR filter was

designed with a cutoff frequency of 50 Hz. The output from

the filter was delayed to maintain the synchronization of

speech and EGG data.
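A linear-phase high-pass FIR filter of this kind can be sketched with a windowed-sinc design. The filter length and the Hamming window below are assumptions, since the text specifies only the 50 Hz cutoff; the function names are hypothetical. A symmetric filter of odd length N delays the signal by (N - 1)/2 samples, so one way to keep the two channels aligned is to delay the EGG by the same amount.

```python
import math


def highpass_fir(cutoff_hz, fs, num_taps):
    """Odd-length symmetric high-pass FIR obtained by spectral inversion."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    m = (num_taps - 1) // 2
    fc = cutoff_hz / fs                      # cutoff in cycles per sample
    lowpass = []
    for n in range(num_taps):
        k = n - m
        if k == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        lowpass.append(ideal * window)       # Hamming-windowed sinc
    gain = sum(lowpass)
    lowpass = [h / gain for h in lowpass]    # unity gain at dc
    highpass = [-h for h in lowpass]
    highpass[m] += 1.0                       # delta[n - m] - lowpass[n]
    return highpass


def fir_filter(x, h):
    """Direct-form convolution, same length as the input."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def delay(x, samples):
    """Delay a sequence by a whole number of samples (zero padded)."""
    return [0.0] * samples + list(x[:len(x) - samples])
```

For instance, with `h = highpass_fir(50.0, 10000.0, 401)` the speech would be passed through `fir_filter(speech, h)` and the EGG through `delay(egg, 200)`.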

The glottal flow waveforms obtained from inverse

filtering the Electro-Voice speech signals did not appear as

expected. This was caused by the slow decay to zero of the

differentiated glottal flow during the closed phase

interval. The glottal flow waveforms obtained from the B&K

data were better, but the closed phase interval was not

always apparent. This is usually because the values of the

differentiated glottal flow during the closed phase are

positive. If values of the differentiated glottal flow are

not near zero during the closed phase, the glottal flow

obtained from integrating the differentiated glottal flow

will not have a flat portion in the closed phase.

Figure 4-2. The block diagram for the synthesis algorithm.

Figure 4-3. Data for sustained vowel /a/ (DMH), recorded using the Electro-Voice microphone. (a) Speech signal; (b) Differentiated EGG; (c) Differentiated glottal flow; (d) Glottal flow.

Figure 4-4. Data for sustained vowel /a/ (DMH), recorded using B&K condenser microphone. (a) Speech signal; (b) Differentiated EGG; (c) Differentiated glottal flow; (d) Glottal flow.

By comparing the speech data in Figure 4-3(a) and

Figure 4-4(a), we can see that the characteristics for /a/

recorded with the B&K microphone and Electro-Voice

microphone are quite different from each other, although the

data were taken from the same speaker. Hence, we conclude

that the differences in the data are caused by the different

frequency responses of the microphones, particularly the low

frequency response of the audio equipment and the


Since we deduced that the microphone frequency

response is the main source of distortion, we determined

that a calibration of the two types of microphone is

required. Both the magnitude and phase responses are

needed, but the phase response cannot be easily obtained, so

another method was developed.

Calibration of the recording equipment, without the

microphone, with square waves was attempted to determine the

extent that the input signals are distorted. The resulting

digitized output signals for a square wave input of 100 Hz

and 200 Hz are shown in Figure 4-5. It can be seen that the

flat portion of the square wave becomes sloped. This is

thought to be caused by the low frequency distortion of the







Figure 4-5. The distorted digitized output for a square wave input of (a) 100 Hz; (b) 200 Hz.
