Electroglottographic signal analysis applied to laryngeal function assessment


Material Information

Electroglottographic signal analysis applied to laryngeal function assessment
Creator:
Alsaka, Yacoub A., 1953-
Publication Date:

Record Information

Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 21066366
oclc - 17886322
System ID:

Table of Contents
    Title Page
    Table of Contents
    Chapter 1. Introduction
    Chapter 2. The vocal folds
    Chapter 3. Electroglottograph waveform modeling
    Chapter 4. Data collection and measurement system
    Chapter 5. Measurements of vocal fold vibration parameters
    Chapter 6. Feature extraction for laryngeal pathology detection
    Chapter 7. Results and conclusions
    Appendix. Measurement algorithms
    Biographical sketch
Full Text









ACKNOWLEDGEMENTS

I wish to express my deep appreciation and gratitude to my advisor

and committee chairman, Dr. D. G. Childers, for his guidance,

encouragement, and financial support throughout this work.

I also wish to thank Dr. G. Paul Moore for many stimulating

discussions. His enthusiasm and positive outlook will always be an

inspiration to me.

I would like to thank Drs. J. R. Smith, L. W. Couch, II, and I. S.

Fischler for their genuine interest and ready advice. I thank all of

them for serving on my supervisory committee.

One of the enjoyable experiences during this research was my

interaction with my colleagues at the Mind-Machine Interaction Research

Center. The stimulating discussions with Dr. D. Hicks, Dr. Jerry Larar,

Ananth Abludu, Ajit Alawani, C. Q. Ding, Wu Ke, and others made it all

worthwhile.


I would like to thank Ms. Debbie Hagin for being a great friend and

a fine typist. The author appreciates her unceasing help and

cooperation during the typing of this manuscript.

Finally, I would like to thank my family for their love and

support, without which this work would have been impossible.



TABLE OF CONTENTS

ACKNOWLEDGEMENTS

ABSTRACT

1 INTRODUCTION
    1.1 Speech Production and Perception
    1.2 Research Issues
    1.3 Description of Chapters

2 THE VOCAL FOLDS
    2.1 Vocal Folds' Structure
    2.2 Detection Methods for Vocal Fold Disorders
        2.2.1 Aerodynamic Tests
        2.2.2 Examination of the Vocal Folds' Vibration
        2.2.3 Acoustic Analysis
        2.2.4 Psycho-Acoustic Evaluation

3 ELECTROGLOTTOGRAPH WAVEFORM MODELING
    3.1 Introduction
    3.2 Previous Models
    3.3 Simple Unitary Mass Model of the Vocal Folds
    3.4 Triangular Mass Model of the Vocal Folds
    3.5 Simulation Results
        3.5.1 Varying the Opening Angle θ₀
        3.5.2 Varying the Closing Angle θ
        3.5.3 Varying the Phase Difference Between Upper and Lower Vocal Fold Margins
        3.5.4 Effects of Mucus
        3.5.5 Effects of Nodules and Polyps
    3.6 Discussion
    3.7 Conclusions

4 DATA COLLECTION AND MEASUREMENT SYSTEM
    4.1 Data Base
        4.1.1 Normal Subject Data Base
        4.1.2 Patient Data Base
    4.2 High Speed Photography
    4.3 Speech and EGG Signals Recording
    4.4 Data Measurement and Preprocessing
        4.4.1 Speech and EGG Signals
        4.4.2 High Speed Film Data
    4.5 Synchronization of the Measured Data
    4.6 Error Sources
        4.6.1 Film Data
        4.6.2 Digitized Tape Recordings
    4.7 Conclusions

5 MEASUREMENTS OF VOCAL FOLD VIBRATION PARAMETERS
    5.1 Introduction
    5.2 Vocal Fold Vibratory Parameters
        5.2.1 Area Function
        5.2.2 Speed Quotient and Speed Index
        5.2.3 Fundamental Frequency
        5.2.4 Open Quotient
        5.2.5 EGG Closing Time
    5.3 Short Time and Long Time EGG Records
    5.4 Conclusions

6 FEATURE EXTRACTION FOR LARYNGEAL PATHOLOGY DETECTION
    6.1 Introduction
    6.2 Fundamental Frequency
        6.2.1 Pitch Perturbation
        6.2.2 Pitch Perturbation Factor
        6.2.3 Relative Average Perturbation
        6.2.4 Average Percent Jitter
        6.2.5 Measurement Results
    6.3 Probability Mass Function Method
        6.3.1 Computational Aspects of the PMF
        6.3.2 Algorithm PMF and PDF
        6.3.3 Relative Entropy Method
    6.4 Discussion

7 RESULTS AND CONCLUSIONS
    7.1 Summary
    7.2 Future Research

APPENDIX. MEASUREMENT ALGORITHMS
    A.1 Opening and Closing Phase
    A.2 Speed Quotient and Speed Index
    A.3 Jitter, Pitch Perturbation, Pitch Perturbation Factor, Relative Average Perturbation

REFERENCES

BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



ELECTROGLOTTOGRAPHIC SIGNAL ANALYSIS APPLIED TO LARYNGEAL FUNCTION ASSESSMENT

By

Yacoub A. Alsaka

August, 1987

Chairman: D. G. Childers
Major Department: Electrical Engineering

The purpose of this research is threefold. The first goal is to

model the electroglottographic (EGG) signal; the second is to

investigate its potential for measuring parameters of the vibrations of

the vocal folds; and the third is to explore its use in laryngeal

pathology detection.

The EGG signal is believed to be closely related to the lateral

area of contact between the vocal folds. Observations of ultra-high-

speed films of the vibrations of the vocal folds presented a plausible

case for such a conjecture and resulted in a descriptive model of this

signal. In this work we carried this idea further and constructed a

computer or mathematical model that simulates this signal.

The resulting model is very flexible and permits simulation of EGG

signals for vibrations of the vocal folds under various conditions. In

Chapter 3 we provide examples of the model output for many of these

conditions, such as varying the phase between the lower and upper edges

of the folds as well as the opening and closing angles. It is shown

that, at a certain phase lag, the simulated EGG signal is remarkably

similar to the EGG of vocal fry. Another important aspect of this

model is its ability to simulate the effects of a vocal fold nodule on

the EGG signal. Other phenomena, such as mucus, can also be simulated

using this model.

Our second goal, using the EGG signal to measure different

vibration parameters, was also accomplished. In Chapter 5 and the

Appendix we discuss different algorithms that were developed to

accomplish these measurements. A new parameter, called EGG Closing Time

or "ECT" for short, was also proposed. The results are organized in

tables and discussed in Chapter 5.

Finally, we proposed and developed a new method for feature

extraction for laryngeal pathology detection based on the probability

mass function (PMF) measured from the EGG signal. The results show that

the PMFs for subjects with a normal larynx differ from those for subjects with

a laryngeal pathology.


CHAPTER 1

INTRODUCTION

1.1 Speech Production and Perception

Many members of the animal kingdom produce voice, and some of them,

such as whales and porpoises, are thought to communicate using clicks

and pop sounds. However, the ability to code and transmit information

vocally seems to be unique to the human species. Some have gone further and

suggested that humans should be labeled "homo loquens" [1], in recognition

of this remarkable achievement.

Humans use voice not only for communication through speech. They

have also used it in artistic activities including singing and

theatrical performances. However, the basic function of the voice

production system in humans is communication via speech. This is

readily obvious when one realizes that without the ability to speak, one

cannot engage in many endeavors and is excluded from many professions

such as acting, teaching, singing, lecturing, preaching, etc. Actually,

one is hard pressed to find any human activity that doesn't involve, to

some extent, the act of speaking or communication via speech.

Speech is the most efficient method of communication between

humans. This fact revolutionized the field of electronic communications

and was the driving force behind the invention of the telephone

system. Now, terrestrial and satellite communication systems are

commonplace and serve to connect people throughout the globe and people

traveling in outer space.

One aspect of this dissertation is to add to our understanding of

the process of human speech. Figure 1.1 shows the human vocal system.

The lungs act as the power source that drives the entire system. The

lungs are used primarily to extract oxygen from air, but also as a

reservoir of air that drives the vocal system. The increase in lung

pressure below the vocal folds forces them to open, and air rushes

through the vocal cavities. The flow of air through the opening between

the vocal folds (glottis) is different for different sounds. For voiced

sounds, the air flow is periodically interrupted by the vibration of the

vocal folds. According to the aerodynamic-myoelastic theory of

phonation [2], the vibration of the folds is directly connected to the

air flow through the glottis and is the steady-state result of the

interplay of two forces. On one hand, the pressure built up in the

subglottal space causes the air to push the folds apart and flow through

the glottis. On the other hand, the velocity of air passing through the

glottis results in a drop in pressure across the fold opening producing

a suction effect that pulls the folds back together and closes the

glottis. This phenomenon is referred to as the Bernoulli effect.

Tension and stiffness in the folds are, perhaps, an even more important

factor in controlling vocal fold vibration. Other theories [3] have

been postulated to explain the fold vibration; however, these theories

have not been fully accepted. The vibration of the folds generates an

acoustic wave that travels through the vocal and nasal tract cavities.

The shape of the vocal and nasal tract cavities filters the acoustic

pressure wave that is radiated finally as a pressure waveform at the

lips and nostrils. For unvoiced sounds a constriction is formed either

at the glottal level or at some supraglottal location. The

air flow through this constriction becomes turbulent. The resulting

turbulent flow is again filtered by the vocal and nasal tracts as described

earlier. Fant [5,6] provided an excellent mathematical treatment of the

entire process.

Figure 1.1 Schematic diagram of the human
vocal mechanism. Adapted from [4].

The system described above is controlled by higher order processes

in the human brain [7]. Abstract thought is converted into language by

the linguistic or artistic centers in the cerebral cortex. The commands

from these centers are transmitted to the motor cortex. The motor

cortex gives a series of commands to the respiratory, laryngeal, and

articulatory muscles. The extrapyramidal system, which includes some

parts of the cerebral cortex, the cerebellum, and the basal ganglia,

provides additional regulation of the activity of the respiratory,

laryngeal, and articulatory musculature. The activity of these muscles

results in movements of the phonatory organs which produce a series of

sounds known as voice. This voice is a pressure wave that activates the

hearing mechanism of the listener. The hearing mechanism converts this

physical signal into neural messages that are translated into a language

form and then an abstract thought in the listener's brain.

The voice is also transmitted to the speaker's brain where it is

checked against the preplanned sound by the feedback mechanism of the

speaker. The feedback mechanism operates via the deep and superficial

sensory receptors which provide information about the muscular

contraction and movements of the phonatory organs. Figure 1.2

illustrates this process.


Figure 1.2 Schematic representation of voice production and its
control system. Adapted from [7].

A voice disorder may result if any part of this system is

damaged. However, in this research we are most interested in the

laryngeal structure section of the system and more precisely the vocal

folds mechanism. Although this part of the system seems to be the

easiest to study, there is little understanding of the operation of the

vocal folds. One reason is the relative

inaccessibility of the larynx, which houses the vocal folds. According to

Ludlow [8], "Identifying research needs for the assessment of phonatory

functioning and vocal pathologies is not difficult, the gaps in

knowledge are awesome" (p. 3).

1.2 Research Issues

The inaccessibility of the laryngeal structure has prevented us

from fully understanding or observing the vibrations of the folds. The

electroglottographic (EGG) signal represents a potential electrical

probe that can be used to study events inside the larynx.

The EGG signal is believed to be directly related to the folds'

lateral area of contact, and a descriptive model has been proposed based

on this conjecture. However, no adequate mathematical model that could

be implemented on a computer has yet been developed for this signal.

This dissertation develops such a model. We believe that such a model

is essential for understanding many aspects of the folds' vibration as

well as other events at the laryngeal level. We also believe that given

the EGG signal, this model can be further developed to predict the

glottal configuration. Such a result could be very significant in

implementing a successful articulatory-based vocoder.

So far, the vocal folds' vibration parameters have been

measured from high-speed films of the

folds' vibration. This is an expensive, laborious, and extremely

complicated process, requiring a high level of expertise. Also, the

process imposes severe restrictions on subject selection and the

possible uttered sounds that can be filmed. On the other hand, the

electroglottograph is inexpensive, noninvasive, easy to use, and does

not impose any restriction on the uttered sounds or the subject. Given

all these desired features, is it possible to use such a device, and in

particular, its output signal, to measure the vibration parameters?

Finally, since the EGG is related to the contact area between the

folds, is it possible to use the EGG to detect any organic changes in

the folds' tissue caused by a certain pathology? Also, using the EGG,

is it possible to detect these pathologies in their early development?

Can we differentiate between different pathologies? In this research we

attempt to answer some of these questions.

1.3 Description of Chapters

Chapter 2 gives a brief description of the physiology of the vocal

folds. It also discusses some of the methods used to detect vocal

folds' disorders.

Chapter 3 presents a model of the EGG signal based on vocal folds'

contact area. Simulation results of the EGG signal for various

conditions and phenomena are also presented. Chapter 4 details the data

collection and measurement system we used to collect our data base of

synchronized area, speech, and EGG signals. The methods and algorithms


developed for measuring the vibration parameters, along with the

measurement results, are presented in Chapter 5.

Chapter 6 presents our measurement results for parameters believed

to indicate the presence of a pathology. We also present our proposed

new method for feature extraction for laryngeal pathology detection.

Finally, Chapter 7 summarizes our results and proposes further

development of some areas of this research.


CHAPTER 2

THE VOCAL FOLDS

The purpose of this chapter is to provide an introduction to the

vocal folds' structure along with pertinent studies that investigated

different pathologies associated with this structure. Previous methods

that were utilized to detect these pathologies are also presented.

2.1 Vocal Folds' Structure

As mentioned earlier, the vocal folds are the basic vibrator in the

human speech production system. They reside inside the larynx at the

top of the trachea, where they also protect the trachea and lungs from

the intrusion of foreign material.

The vibration of the vocal folds is essential for voiced sound

production in English speech. The structure of the folds is unique and

is intricately controlled by the activity of the laryngeal muscles. The

pair of folds is capable of producing a great variety of fundamental

frequencies, intensities, and tonal qualities, whereas many musical

instruments need multiple strings to do so.

Hirano studied the structure of the folds extensively [7]. Figure

2.1 shows a frontal section through the middle of the membranous portion

of the human vocal fold. From a histological point of view the folds'

structure can be divided into five different layers:


Figure 2.1 A frontal section of a human vocal fold
through the middle of the membranous portion [7]. Labeled
structures: epithelium; superficial, intermediate, and deep
layers of the lamina propria; and muscle.

(1) The thin and stiff capsule that maintains the shape of the

folds, called the epithelium. This layer is made of squamous

type cells.

(2) The lamina propria superficial layer. This layer is somewhat

like a mass of soft gelatin consisting of loose fibrous

components, and is occasionally referred to as Reinke's space.

(3) The lamina propria intermediate layer. This layer resembles

a bundle of soft rubber consisting chiefly of elastic fibers.

(4) The deep layer of the lamina propria. This layer consists of

collagenous fibers resembling a bundle of cotton thread.

(5) The vocalis muscle. This muscle resembles a bundle of rubber

bands that constitutes the main body of the vocal fold.

From a mechanical point of view the vocal folds are divided into

three different sections: the cover, the transition layer, and the

body, which is the vocalis muscle. Each of these sections differs in

mechanical properties from the other sections. The control of these

sections is also different. While the first two sections, which are

commonly referred to as the mucous membrane, are controlled passively,

the body of the vocalis muscle is controlled both actively and

passively.


As in any other mechanical system, lubrication of the folds is

essential to proper and sustained functioning [9]. The ventricular

glands squirt mucus on the vibrating folds. Excessive mucus can

sometimes act as an additional layer of the vocal fold [10].

Krishnamurthy [11] found that the mucus can influence the

electroglottographic (EGG) signal considerably, providing a highly

conductive path for the radio frequency signal used by the EGG device.

According to Hirano [12], most pathologies originate in one layer

of the vocal folds.

2.2 Detection Methods for Vocal Fold Disorders

Voice disorders can be classified into two general categories,

functional and organic. Functional disorders are associated with

incorrect use or misuse of otherwise healthy and nondefective vocal

organs. The incorrect use of the vocal folds is sometimes related to

psychological problems. However, in some cases, misuse is intentional,

to produce a unique quality as in the case of actors or singers.

Organic disorders result from organic changes to the structure of

the folds. It is interesting to note that functional misuse of the

voice, such as excessive shouting at a football game, can result in an

organic disorder through the development of a nodule.

Many methods exist for detecting or evaluating voice disorders.

However, the vast majority have been used only in research. They are

complicated and usually are not readily available in voice clinics. The

need exists for simple acoustic analysis procedures to evaluate or

detect different pathologies.

We now turn our attention to the methods used to either detect or

evaluate voice disorders.

2.2.1 Aerodynamic Tests

Four aerodynamic parameters are usually utilized in this

procedure: subglottal pressure, supraglottal pressure, glottal

impedance, and the volume velocity of the air flow at the glottis.

The subglottal pressure is measured by locating a pressure

transducer below the vocal folds. As expected, this is a difficult

procedure and is rarely used in living human beings. Five methods [7] are

used for locating the pressure transducer below the glottis. In tracheal

puncture, a spinal needle or a modified version is inserted into

the trachea through the cervical skin. A transglottal catheter can also

be inserted into the subglottal space through the mouth; however, this

procedure interferes with the normal vibration of the folds.

Measurement of subglottal pressure is most easily implemented for

patients with tracheostomies, through the opening already present, but

this is a very small patient population.

Koike and Perkins [13] utilized an ultra-miniature solid-state

pressure transducer placed directly in the subglottal space. This

procedure does not interfere with the folds' vibration and the

transducer has a high frequency response, but the frequency response of

the transducer is temperature dependent.

The preceding procedures employ direct measurement techniques of

the subglottal pressure. The esophageal balloon uses an indirect method

of measuring the subglottal pressure. In this procedure a balloon

connected to a tube is inserted into the esophagus through the nose.

The intraesophageal pressure measured is related to the intratracheal

pressure, which is approximately equal to the subglottal pressure.

However, factors other than the intratracheal pressure, such as the

expanding and contracting lung volume, and contraction of esophageal

muscles during swallowing, contribute to the intraesophageal pressure.

This technique, therefore, is valid only in some limited conditions.

Researchers [14-16] reported an increase in subglottal pressure for

patients with carcinoma and recurrent laryngeal nerve paralysis.

The subglottal pressure, along with the supraglottal pressure and

the glottal resistance, can be used to calculate the mean flow rate. The

mean flow rate is usually measured using devices such as the spirometer,

the pneumotachograph, and the hot-wire anemometer.

The glottal resistance cannot be measured directly, but it is

calculated by dividing the subglottal pressure by the mean flow rate.
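
The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name, the units (cm H2O and liters per second), and the sample values are assumptions introduced here and do not come from the measurements reported in this work.

```python
def glottal_resistance(subglottal_pressure, mean_flow_rate):
    """Return glottal resistance as subglottal pressure / mean flow rate.

    Units are assumed for illustration only: pressure in cm H2O and
    flow in liters per second, giving cm H2O / (L/s).
    """
    if mean_flow_rate <= 0:
        raise ValueError("mean flow rate must be positive")
    return subglottal_pressure / mean_flow_rate


# Hypothetical values chosen only to exercise the function.
resistance = glottal_resistance(subglottal_pressure=8.0, mean_flow_rate=0.25)
print(resistance)
```

Because the resistance is a ratio, any error in the subglottal pressure measurement carries over to it directly, which is one reason the difficulty of measuring that pressure matters.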

The mean flow rate is generally greater for pathological subjects

than for normal subjects [17-20], and it can be used to monitor the progress

of treatment.

The instantaneous flow rate, usually referred to as the volume

velocity wave (v-v), can also be measured. Berouti [21] applied linear

prediction inverse-filtering techniques to obtain the volume velocity

waveform. Rothenberg [22] obtained the volume velocity waveform using a

pneumotachograph mask to inverse filter the oral volume velocity. On

the other hand, Sondhi [23] used a reflectionless acoustical tube to

inverse filter the acoustic wave at the lips.

Although Berouti reported peculiar v-v wave shapes for pathological

cases, the v-v waveform has not been used extensively for pathology

detection. However, the residue signal derived from the v-v signal has

been used as a potential source of information regarding disorders of

the vocal folds, as we will see later in the discussion.

Using the aerodynamic measurements mentioned above, phonatory

parameters can be measured to detect pathologies. One parameter is the

maximum phonation time (MPT), where the subject is instructed to sustain

a vowel /a/ as long as possible following a deep inspiration. For

nonsinger patients, the sustained phonation is made at a comfortable

fundamental frequency and intensity. For singers, the two parameters

range and level can be controlled. Studies on pathological subjects

[18,24] indicate a decrease in the MPT from that of normal subjects.

This test can also be used to monitor progress of treatment.

2.2.2 Examination of the Vocal Folds' Vibration

Normal vibration of the vocal folds is a prerequisite for a normal

sounding voice. Moore [25] and Moore et al. [26] have observed that it

is the folds' vibration and not the disease itself that determines the

phonatory quality of the resulting sound. This fact suggests that the

perceptual acoustics of the sound do not map uniquely to the

underlying pathology, if different pathologies result in similar fold

vibrations.


Many methods exist for observing the vibration of the vocal

folds. These techniques include stroboscopy, ultra-high-speed

photography, photo-electric glottography, electroglottography, and

ultrasound glottography.

Stroboscopic examination of the vibrating folds allows the examiner

to freeze the image of the folds at the same position in the vibratory

cycle when the strobe light is synchronized with the folds' vibratory

cycle. If the strobe's flashes are emitted at a frequency slightly lower

than the folds' vibrating frequency, a slow-motion effect of the

vibrating folds is produced. This technique can also be applied using

x-rays instead of light beams. Still photographs can also be produced

using this technique. These photographs allow evaluation of the size,

position, shape, orientation, and color of the laryngeal structure, but

they do not show fine details of the vibratory cycle. This is the basic

difference between stroboscopy and ultra-high-speed photography.

Stroboscopy can be, and is, used in clinical settings.
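
The slow-motion effect can be illustrated with a short Python sketch. It assumes idealized, perfectly periodic vibration and a flash rate close to the vibration frequency, in which case the folds appear to vibrate at the difference between the two frequencies. The function name and the example frequencies are hypothetical.

```python
def apparent_frequency(fold_hz, strobe_hz):
    """Apparent vibration frequency (Hz) seen under stroboscopic light.

    Assumes strictly periodic vibration and a flash rate close to the
    vibration frequency. A result of zero means the image appears frozen.
    """
    if fold_hz <= 0 or strobe_hz <= 0:
        raise ValueError("frequencies must be positive")
    return fold_hz - strobe_hz


# Hypothetical example: folds vibrating at 120 Hz.
print(apparent_frequency(120.0, 120.0))  # flashes synchronized with the folds
print(apparent_frequency(120.0, 118.0))  # flashes slightly slower than the folds
```

Each flash samples a different vibratory cycle, so the slow-motion image is a composite of many cycles. This is why cycle-to-cycle detail is not visible.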

On the other hand, ultra-high-speed photography of the vibrating

folds [27,28] at speeds of up to 5000 frames/sec provides a detailed

description of the folds' vibration. The film produced allows a study

of the vibration of the folds during a glottal cycle. This technique is

used in this study and is discussed in detail in Chapter 4. Using this

method, different parameters of the folds' vibration can be measured as

will be illustrated later in this section. However, this technique can

only be implemented in research laboratories since it is too complex and

expensive for clinical use.
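
To give a sense of the temporal detail involved, the following Python sketch computes how many film frames fall within one glottal cycle at a given frame rate. The 5000 frames/sec figure comes from the text above; the function name and the example fundamental frequencies are assumptions chosen for illustration.

```python
def frames_per_cycle(frame_rate_hz, fundamental_hz):
    """Number of film frames captured during one glottal cycle."""
    if frame_rate_hz <= 0 or fundamental_hz <= 0:
        raise ValueError("rates must be positive")
    return frame_rate_hz / fundamental_hz


# Hypothetical fundamental frequencies, filmed at 5000 frames/sec.
for f0 in (100.0, 125.0, 250.0):
    print(f0, frames_per_cycle(5000.0, f0))
```

The higher the fundamental frequency, the fewer frames describe each cycle, so the time resolution of film-based measurements depends on the phonation being filmed.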

Another technique is photoelectric glottography, or

photoglottography [29,30]. This method uses a light source placed

superior to the glottis and a photomultiplier placed against the neck

just below the larynx. The intensity of the light detected by the

photomultiplier is directly related to the area of the glottis between

the vocal folds. The locations of the light transmitter and receiver can

be reversed. This technique suffers from several shortcomings. The

light density distribution within the vocal folds may not be constant,

and the changing cross-sectional area of the vocal folds in the

anterior-posterior plane may result in uneven illumination of the folds.

Also, slight differences in the location of the detector cause

different waveforms. However, this technique has the advantage of an

unrestricted range of phonation.

Ultrasound glottography [31] is another method of monitoring vocal

fold vibration. The technique is based on the reflection of the sound

wave at the interface between two media with different specific acoustic

impedances. This is a new technique and research is being conducted to

assess its usefulness.

On the other hand, electroglottography [10,32-34] is based on the

transmission of a high-frequency electrical current through the tissues

at the glottal level. The vibrating vocal folds constitute a

varying impedance path that modulates a high-frequency RF current (3.5

MHz) transmitted between two electrodes placed on either side of the

thyroid cartilage. The Mind-Machine Interaction Laboratory has

conducted extensive studies [32-35] using this device in synchrony with

high speed photography and the speech signal. The studies indicate the

versatility of the resulting EGG signal. It was found that the EGG

signal can indicate the fundamental frequency of the folds' vibration,

the closed and open phase regions, the closing time of the folds, and

the duty cycle, along with other interesting features, such as the use

of the EGG signal as the excitation signal for the LPC synthesizer [35].
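
As a rough illustration of how two of these quantities might be read from a sampled EGG waveform, the Python sketch below estimates the fundamental frequency and the duty cycle by thresholding. It is not the measurement algorithm developed in this work: the threshold criterion, the convention that larger values mean more vocal fold contact, the function name, and the synthetic rectangular test signal are all assumptions made for illustration.

```python
def analyze_egg(samples, sample_rate_hz, threshold=0.0):
    """Estimate fundamental frequency (Hz) and duty cycle from an EGG-like signal.

    Assumes larger sample values mean more vocal fold contact, and that
    one upward threshold crossing occurs per glottal cycle.
    """
    # Indices where the signal crosses the threshold going upward.
    rising = [i for i in range(1, len(samples))
              if samples[i - 1] < threshold <= samples[i]]
    if len(rising) < 2:
        raise ValueError("need at least two upward threshold crossings")
    # Mean period between successive upward crossings, in seconds.
    mean_period_s = (rising[-1] - rising[0]) / (len(rising) - 1) / sample_rate_hz
    # Fraction of the analyzed whole cycles spent at or above the threshold.
    span = samples[rising[0]:rising[-1]]
    duty_cycle = sum(1 for s in span if s >= threshold) / len(span)
    return 1.0 / mean_period_s, duty_cycle


# Synthetic stand-in for an EGG: 100 Hz rectangular wave sampled at 10 kHz,
# high for the first 40 samples of every 100-sample cycle.
fs = 10_000
signal = [1.0 if (n % 100) < 40 else -1.0 for n in range(1000)]
f0, duty = analyze_egg(signal, fs)
print(f0, duty)
```

A real EGG record would need filtering and a more careful choice of threshold, since baseline drift and noise can produce spurious crossings.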

Fourcin [10] used the EGG signal for pathology discrimination. He

studied the fundamental frequency distribution from normal and abnormal

subjects. His studies indicate differences in the frequency range and

distribution between normals and subjects with voice pathology. Smith

[36] used a discriminant analysis procedure on the distribution of the

roots of the autocorrelation linear prediction analysis of the EGG

signal. His procedure was successful 74% of the time in discriminating

between normals and pathological subjects. He also noted some time

domain characteristics of the EGG signal associated with the presence of

pathology. These include double periodicity, changes in the rising

slope and the often shorter glottal open phase for patients with

pathology.

Vocal fold vibration parameters

Using the techniques discussed earlier, many vibration-related

parameters can be measured. These parameters can be divided as follows:


Horizontal excursion of the edge of the vocal folds. This

parameter is hard to measure since the edge of the fold is not a fixed

location on the fold; it is the most medially located part of the

vibrating fold. In chest register, a phase difference exists along the

length and depth of the vocal folds during their vibration [11], and it

is not possible to define the horizontal excursion. However, in

falsetto phonation, when the folds are taut and the entire fold moves

medially along its length, horizontal excursion can be measured

easily. High speed filming has a definite advantage when measuring

this parameter due to the high sampling frequency of the vibrating folds

and the two dimensional projection present on the face of the film.

Limited excursion of one of the folds or both can indicate possible

fold paralysis.

Glottal width. This parameter is measured routinely in high speed

filming. Since the width between the two folds is not uniform along

their length, it is measured at five or more specific points along the

fold's length. However, this parameter could be defined to indicate the

distance between the middle of the membranous part of the two folds.

Glottal area. The term glottis refers to the area between the

edges of the two folds. The area function characterizes the time history

of the vibration of the folds. High speed films can give an absolute

area measurement when a grid in the same focal plane as the folds is

superimposed on the laryngeal film. This procedure is described fully

in Chapter 4. Nonclosure in certain types of phonation indicates the

possible presence of pathology.

Fundamental frequency of vibration. The fundamental frequency of

vibration is the inverse of the period of one glottal cycle. The

glottal cycle is defined by an opening phase, during which the glottal

area increases; a closing phase, where the glottal area is decreasing;

and a closed phase, during which the glottis is closed [37]. In the

event that no closure is present, the fundamental frequency is the

inverse of the time span between two similar events in the glottal area.

Excessive variability of the fundamental frequency is an indication

of a possible pathology.

Opening phase, closing phase, and closed phase. As mentioned

earlier, the glottal duty cycle is divided into three regions: opening

phase, closing phase, and closed phase. The open phase is defined as

the time the glottis has non-zero area, and in terms of the definition

it is usually equal to the combined time of opening phase and closing

phase. When there is no closure, the open phase is equal to the entire

cycle time. Again the absence of closure in some types of phonation

indicates the possible presence of pathology.

Open quotient (OQ), speed quotient (SQ), and speed index (SI).

Timcke, von Leden, and Moore [38] defined two parameters to represent

the vibration of the vocal folds, the open quotient, and the speed

quotient. The two parameters are defined as

$$\mathrm{OQ} = \frac{\text{duration of the open phase}}{\text{duration of the entire glottal cycle}}$$

$$\mathrm{SQ} = \frac{\text{duration of the opening phase}}{\text{duration of the closing phase}}$$

The speed index is defined in terms of the speed quotient as

$$\mathrm{SI} = \frac{\mathrm{SQ} - 1}{\mathrm{SQ} + 1}$$

or

$$\mathrm{SI} = \frac{\text{duration of opening phase} - \text{duration of closing phase}}{\text{duration of opening phase} + \text{duration of closing phase}}$$

The speed quotient varies directly with the intensity of the sound

produced while the open quotient varies inversely with the sound

intensity. Hildebrand [39] conducted an intensive study of the effects

of intensity and frequency on the open and speed quotients in normal

subjects. Her results showed that the OQ increased significantly on

transition from low to medium frequency, while it remained constant or

decreased slightly at higher frequencies. Increasing the intensity from

low to medium levels had little influence on the OQ, but increasing

intensity from medium to high level decreased the OQ significantly.

The SQ, on the other hand, varied inversely with frequency. For

frequency change from low to medium the SQ decreased significantly,

while it remained unchanged for frequency transition from medium to high

level. The SQ had a nonsignificant increase with increasing intensity.

The speed index as defined is related to the SQ. However, unlike

the SQ, which varies from 0 to ∞, it has a range of -1 to 1. Also for

two waveforms that have the same basic shape except that one is the

reverse of the other, the speed index will have the same absolute value

while the speed quotient gives two different values that have a product

of 1. It is readily seen that the SI has a simpler waveform than the SQ

and can easily be visualized.
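As a small illustration of these definitions, the sketch below computes OQ, SQ, and SI from the three phase durations of one glottal cycle. The function name and the sample durations are hypothetical and were chosen only to show that exchanging the opening and closing durations flips the sign of SI while the two SQ values multiply to 1.

```python
def glottal_quotients(opening, closing, closed):
    """Return (OQ, SQ, SI) for one glottal cycle.

    All three durations must use the same time unit (for example msec).
    """
    if opening <= 0 or closing <= 0 or closed < 0:
        raise ValueError("opening and closing must be > 0, closed must be >= 0")
    period = opening + closing + closed
    oq = (opening + closing) / period      # open phase / entire cycle
    sq = opening / closing                 # opening phase / closing phase
    si = (sq - 1.0) / (sq + 1.0)           # equivalent to (op - cl) / (op + cl)
    return oq, sq, si


# Hypothetical cycle: 3 msec opening, 2 msec closing, 5 msec closed.
oq, sq, si = glottal_quotients(3.0, 2.0, 5.0)

# Same cycle with the opening and closing durations exchanged.
oq_rev, sq_rev, si_rev = glottal_quotients(2.0, 3.0, 5.0)

print(f"OQ={oq:.2f}  SQ={sq:.2f}  SI={si:+.2f}")
print(f"reversed: SQ={sq_rev:.2f}  SI={si_rev:+.2f}  SQ product={sq * sq_rev:.2f}")
```

Because SI is bounded and antisymmetric, it is convenient to plot on a fixed axis, whereas SQ is unbounded for slow closing phases.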

Contact area. This is the lateral area of contact between the

folds while touching each other. During the closed phase this area

changes. This change cannot be viewed from the high speed

films, but the electroglottographic waveform seems to give a good

indication of this change.

Other parameters such as amplitude, mucosal wave, homogeneity,

regularity or periodicity of successive vibrations, and symmetry of the

folds are important. Moore and Thompson [37] considered the symmetry

and equal amplitude of the vocal fold vibration as one of the necessary

conditions for the production of normal phonations.

2.2.3 Acoustic Analysis

Traditionally, laryngologists, phoniatricians, and speech

pathologists have relied on two basic techniques: listening to the voice

and viewing the larynx with the aid of a mirror or laryngoscope.

Changes in voice quality often result from laryngeal pathology, and

experienced laryngologists are able, to some extent, to detect some

pathologies by administering listening tests. However, this is a

subjective procedure and different laryngologists often give different

diagnoses for the same patient [40].

Acoustic analysis is currently gaining popularity. The noninvasive

nature of this technique is its greatest asset. It lends itself to

screening of a large population for early detection of voice pathology

and for following the effects of voice therapy. It does not require

close cooperation of the subject and can be made offline from tape

recordings. The method tries to provide objective and quantitative

measures of the voice. The following discussion lists the parameters

used for pathology detection and evaluation of certain aspects of the

acoustical signal of the voice.

Fundamental frequency related measures

Lieberman [41,42] showed that the average pitch period perturbation

is larger for pathologic speakers than for normal speakers. He proposed

two measures:

(a) Pitch perturbation (PP)

This parameter is defined as the difference between the

durations of successive pitch periods in the speech signal:

$$\mathrm{PP} = \Delta P = P_i - P_{i+1}$$

(b) Pitch perturbation factor (PPF)

This parameter is the relative frequency of pitch period

perturbations larger than 0.5 msec occurring in a steady vowel:

$$\mathrm{PPF} = \frac{\text{number of } \Delta P > 0.5\ \text{msec}}{\text{total number of } \Delta P}$$

(c) Relative average perturbation

Koike [43] observed that normal subjects phonating a

steady vowel sound exhibit a slow and relatively smooth

change in the pitch period. He defined a measure called the

relative average perturbation (RAP), where

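$$\mathrm{RAP} = \frac{\dfrac{1}{N-2}\displaystyle\sum_{i=2}^{N-1}\left|\dfrac{P(i-1)+P(i)+P(i+1)}{3} - P(i)\right|}{\dfrac{1}{N}\displaystyle\sum_{i=1}^{N} P(i)}$$

and P(i), i = 1, 2, ..., N, denotes the successive pitch periods.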
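The three perturbation measures above can be sketched in a few lines. The function names and the sample pitch periods are hypothetical. One assumption is made explicit: the 0.5 msec threshold is applied to the magnitude of each perturbation, which the text does not state.

```python
def pitch_perturbations(periods):
    """Differences between successive pitch periods, P(i) - P(i+1)."""
    return [periods[i] - periods[i + 1] for i in range(len(periods) - 1)]


def pitch_perturbation_factor(periods, threshold=0.5):
    """Fraction of perturbations whose magnitude exceeds the threshold (msec).

    Using the magnitude is an assumption of this sketch.
    """
    deltas = pitch_perturbations(periods)
    large = sum(1 for d in deltas if abs(d) > threshold)
    return large / len(deltas)


def relative_average_perturbation(periods):
    """Mean deviation from a three-point moving average, over the mean period."""
    n = len(periods)
    if n < 3:
        raise ValueError("at least three pitch periods are required")
    deviation = sum(
        abs((periods[i - 1] + periods[i] + periods[i + 1]) / 3.0 - periods[i])
        for i in range(1, n - 1)
    )
    return (deviation / (n - 2)) / (sum(periods) / n)


# Hypothetical pitch periods in msec from a sustained vowel.
periods = [8.0, 8.1, 7.9, 8.7, 8.0, 8.05]
ppf = pitch_perturbation_factor(periods)
rap = relative_average_perturbation(periods)
rap_steady = relative_average_perturbation([8.0] * 6)

print(f"PPF={ppf:.2f}  RAP={rap:.4f}  RAP(steady)={rap_steady:.4f}")
```

A perfectly steady sequence gives a RAP of zero, and slow smooth drift contributes little because it is absorbed by the three-point average.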

Koike [44] used a contact throat microphone to record a signal

related to the folds' vibration. He proposed another measure called the

correlogram. This measure is based on the quasi-periodic amplitude

modulation observed in the steady vowel sounds of pathological

speakers. He found that the correlograms of pathological and normal

speakers are generally distinguishable from one another.

Spectral measures

Several investigators attempted to relate the spectral content of

speech to certain pathologies in voice production. Yanagihara [45,46]

proposed four types of classifications of the hoarse voice. Type I is

when the regular harmonic components are mixed with the noise components

primarily in the formant region of the vowels. Type II is when the

second formants of /e/ and /i/ are dominated by the noise components with

additional noise in the region above 3000 Hz. Type III is when the

second formants of /e/ and /i/ are totally replaced by the noise

component and the high frequency noise in the region above 3000 Hz is

further intensified. Type IV is when the second formants of /a/, /e/,

and /i/ are replaced by noise components.

Another method is the long-time-average spectra (LTAS) used by

Frokjaer-Jensen and Prytz [47]. This method utilizes a 400 channel,

real time, narrow band analyzer to compute the spectra of speech

averaged over 45 sec. Several parameters of the resulting spectra are

examined. These parameters include the fundamental frequency (F0), the

first formant (F1), Fmin, the frequency of the minimum spectral level

between F0 and F1, Fmax and Lmax, the frequency and level of the maximum peak in the

entire spectrum, the quotient of F0 and F1, the harmonic richness

defined as the energy below 1 kHz divided by the energy above 1 kHz, and

other parameters. Recently, Kitzing [48] conducted experiments using

the LTAS method to investigate qualities of disturbed voice due to

aberrant functioning in organically healthy vocal organs. In these

experiments ten experienced voice therapists each produced four

different recognizable voice qualities. He concluded that the most

important measures were the harmonic richness, the spectral slope

inclination in the first formant region, and the ratio between peak

level of the fundamental and first formant region.

Multidimensional approach

So far the discussion has centered on examining single acoustic

parameters as to their potential for detecting voice disorders. Some

researchers reached the conclusion that a more robust method would be to

use multidimensional analysis using multiple acoustic parameters.

Working from this premise Davis [49,50] constructed a voice profile

using six parameters (PPQ, APQ, EX, PA, SFF, and SFR), where the pitch

perturbation quotient (PPQ) is similar to the RAP measure by Koike,

except the averaging window is taken over five points instead of

three. The amplitude perturbation quotient (APQ) is defined like the

RAP, but the averaging is done using the amplitude of each pitch

period rather than its duration. The EX is the coefficient of excess defined as

$$\mathrm{EX} = \frac{E\left[(X - \bar{X})^4\right]}{\left\{E\left[(X - \bar{X})^2\right]\right\}^2} - 3$$

This excess coefficient is measured from the residue signal after

inverse filtering the speech signal. Although this signal does not have

a physical correlate, it is found that for pathologic subjects the

excess coefficient is higher than for normal subjects.

The pitch amplitude (PA) is the pitch period peak in the residue

signal autocorrelation function. The PA is high for voiced sounds from

normal subjects, since the PA is a measure of voicing. However, for

breathy voiced sounds associated with some pathological speakers, the PA

is low, indicating weak periodicity due to abnormal fold vibration and

hence the presence of audible noise.
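The idea behind the PA can be sketched as a search for the largest autocorrelation peak within a plausible range of pitch lags. The normalization by the zero-lag energy, the lag range, and the two synthetic signals are assumptions of this sketch and are not taken from Davis's procedure.

```python
import random


def pitch_amplitude(residue, min_lag, max_lag):
    """Return (peak value, lag) of the energy-normalized autocorrelation."""
    energy = sum(v * v for v in residue)
    best_value, best_lag = float("-inf"), min_lag
    for lag in range(min_lag, max_lag + 1):
        r = sum(residue[i] * residue[i + lag]
                for i in range(len(residue) - lag)) / energy
        if r > best_value:
            best_value, best_lag = r, lag
    return best_value, best_lag


# Idealized voiced residue: one unit pulse every 80 samples.
pulse_train = [1.0 if i % 80 == 0 else 0.0 for i in range(800)]

# Idealized noise-like residue from a seeded generator (repeatable).
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]

pa_voiced, lag_voiced = pitch_amplitude(pulse_train, 20, 200)
pa_noise, _ = pitch_amplitude(noise, 20, 200)
print(f"voiced: PA={pa_voiced:.2f} at lag {lag_voiced};  noise: PA={pa_noise:.2f}")
```

The periodic signal produces a strong peak at its period, while the noise signal produces only a small peak at an arbitrary lag.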

The spectral flatness of the residue inverse filter (SFF) is

defined as the ratio in decibels (dB) of the geometric mean of the

spectrum to the arithmetic mean of the spectrum. The more noise the

signal contains, the greater is its spectral flatness. So for unvoiced

sounds, where the glottis is open, the spectral flatness is large. For

voiced sounds the spectral flatness is not as large. For voiced sounds

of pathological speakers, the spectral flatness is expected to be

greater than that of normal subjects, indicating improper closure of the glottis.
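The flatness ratio can be sketched for any power spectrum as follows. The direct DFT, the small floor that protects the logarithm, and the two test signals are choices made for this sketch only; the sketch does not perform inverse filtering.

```python
import cmath
import math


def power_spectrum(x):
    """Power at DFT bins 0..N/2, computed directly (adequate for short signals)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectral_flatness_db(power, floor=1e-12):
    """Geometric mean over arithmetic mean of the spectrum, in dB (0 dB = flat)."""
    p = [max(v, floor) for v in power]          # floor avoids log(0)
    log_geometric_mean = sum(math.log(v) for v in p) / len(p)
    arithmetic_mean = sum(p) / len(p)
    return 10.0 * math.log10(math.exp(log_geometric_mean) / arithmetic_mean)


n = 64
impulse = [1.0] + [0.0] * (n - 1)                               # flat spectrum
sine = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]    # single line

flat_impulse = spectral_flatness_db(power_spectrum(impulse))
flat_sine = spectral_flatness_db(power_spectrum(sine))
print(f"impulse: {flat_impulse:.1f} dB   sine: {flat_sine:.1f} dB")
```

A flat spectrum gives 0 dB, the maximum possible value, and a spectrum dominated by a few lines gives a large negative value.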


The spectral flatness of the residue signal (SFR) is similar to the

SFF. The voiced sound from normal subjects is expected to have the F0

harmonics in the spectrum of the residue signal. When the voiced sound

becomes more noiselike, as in the case of pathologic speakers, the SFR

does not have the expected F0 harmonics. The results of analysis of the

pathologic voice agree with this observation. Davis reported a

detection probability of 95.2% in a closed test and 67.4% in an open test.


Other researchers [51,52] derived fourteen acoustic measures from

the glottal sound waveform or volume velocity found by inverse filtering

the speech signal. They related these acoustic parameters to other

phonatory function factors, such as vibratory pattern of the vocal

folds, physical properties of the folds, aerodynamic measures, and

psycho-acoustic parameters of the voice. Their system was successful

70-80% of the time in separating normal subjects from subjects with

pathological disorders.

2.2.4 Psycho-Acoustic Evaluation

As mentioned earlier, speech pathologists and otolaryngologists are

frequently able to determine vocal fold pathologies by listening to the

voice. This is called psycho-acoustic evaluation.

The psycho-acoustic parameters are the voice pitch, loudness,

laryngeal quality, and resonance [53]. A voice is judged abnormal when

any of these parameters deviate from the expected range of persons

having the same age, sex, and cultural background.

The pitch is the perceptual correlate of the frequency of the

folds' vibration, also referred to as the fundamental frequency. It is

judged atypical or defective if it is too high, too low, monotonous, or


Loudness is the sensation related to the amplitude of the molecular

motion in the sound wave. It is judged abnormal when a voice is too

loud or too quiet in relation to a specific environmental situation or

when the loudness variation is inappropriate to the meaning of the


Quality of the voice is not as easily defined. Moore [53] used an

excellent example for explaining this parameter:

A listener can identify the tones of a trombone, a saxophone,
and a violin even when all three are producing sounds at the
same pitch and equal loudness.

The quality of voice is thought to be related to the weighting of the

spectral components of the speech signal. Based on this assumption,

Yanagihara [45] developed his criteria for classifying the hoarse voice

as mentioned in the previous section.

Voice quality disorders encompass a wide range of voice

disorders. A voice with a quality disorder can be described as breathy, rough or

harsh, hoarse, husky, throaty, metallic, hypernasal, or denasal, among

other terms used to describe the quality of the voice. The descriptive

nature of these terms lends itself to different interpretations by

different voice pathologists. Many researchers around the world are

currently working to provide a universally accepted method for defining

these terms, and a standard procedure to measure the extent of these

disorders using a universally accepted scale.

Ishiki [54], working towards this goal, used a semantic

differential technique for factor analysis of hoarseness, using

seventeen polar-opposite adjectives as the scales. Three pairs

represented an evaluation factor, another three pairs represented a

potency factor, an activity factor was represented by three other pairs,

and eight pairs were used to describe hoarseness. Based on the analysis

of sixteen recorded samples of speech, he concluded that hoarse voice

consisted of at least four factors. The first factor, and most dominant

of the four, is related to the roughness (R), rumbling, or rattling

quality. The second factor is related to the breathiness (B) quality.

The third factor is related to the asthenic (A) quality, and the final

factor is related to the degree (D) of hoarseness. Ishiki also

associated a four-point grading scale with these factors. The scale

ranged from "0" for normal, "1" for slight, "2" for fair, and "3" for


The psycho-acoustic evaluation procedure suffers from

nonstandardization of the terms used to describe the pathologic voice. The

classification of the voice as hoarse, breathy, harsh, rough, etc. has

different meanings to different voice pathologists. The definitions of

the descriptive terms used for diagnosis are not standardized. In an

attempt to overcome this problem, the Committee for Phonatory Function

Tests of the Japan Society of Logopedics and Phoniatrics (CPFTJSLP)

proposed a system called "GRBAS." This system is used mainly to

evaluate hoarseness. It consists of five scales: Grade (G), Rough (R),

Breathy (B), Asthenic (A), and Strained (S).

The grade (G) represents the degree of hoarseness or voice

abnormality. Scale "R" represents a psycho-acoustic impression of the

irregularity of vocal fold vibrations. It also corresponds to the

fluctuations in the fundamental frequency and amplitude of the glottal

source sound. Scale "B" is related to the psycho-acoustic impression of

the extent of air leakage through the glottis. Scale "A" is related to

weakness or lack of power in the voice. Scale "S" is related to the

impression of the hyperfunctional state of phonation. It is usually

related to an abnormally high pitch frequency, noise in the high

frequency range, and richness in high frequency harmonics.

Using the "GRBAS" system the hoarse voice can be evaluated by a

four-point grading for each scale: "0" for normal, "1" for voice with

slight hoarseness, "2" for voice with moderate hoarseness, and "3" for

the voice with extreme hoarseness.
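For record keeping, a GRBAS rating can be held as five integers, each restricted to the four-point grading. The class name, field names, and the compact code string below are hypothetical conveniences for this sketch and are not part of the CPFTJSLP proposal.

```python
from dataclasses import dataclass, fields

# Four-point grading described in the text.
SEVERITY = {0: "normal", 1: "slight", 2: "moderate", 3: "extreme"}


@dataclass(frozen=True)
class GrbasRating:
    grade: int      # G: overall degree of hoarseness
    rough: int      # R: irregularity of vocal fold vibration
    breathy: int    # B: air leakage through the glottis
    asthenic: int   # A: weakness or lack of power
    strained: int   # S: hyperfunctional state of phonation

    def __post_init__(self):
        # Reject any score outside the four-point grading.
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in SEVERITY:
                raise ValueError(f"{f.name} must be 0, 1, 2, or 3, got {value!r}")

    def code(self):
        """Compact one-line form of the rating."""
        return (f"G{self.grade} R{self.rough} B{self.breathy} "
                f"A{self.asthenic} S{self.strained}")


rating = GrbasRating(grade=2, rough=1, breathy=2, asthenic=0, strained=1)
summary = rating.code()
print(summary, "- overall:", SEVERITY[rating.grade])
```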

The GRBAS system is still subjective. The CPFTJSLP provides a

standard tape which has typical voice samples of the above quality

defects represented by the GRBAS scale to provide a certain measure of

objectivity among specialists using this system. The users of this

system are expected to possess highly trained ears.


3.1 Introduction

The EGG signal is the output of a device called the

electroglottograph. The device operates in the following fashion. Two

electrodes are placed on either side of the thyroid cartilage. One

electrode acts as a transmitter and the other one as a receiver. A high

frequency RF current (300 kHz to 5 MHz) is applied to one electrode.

This RF current is transmitted across the larynx and is detected by the

other electrode (receiver). The impedance change resulting from

vibrations of the vocal folds modulates the amplitude of the RF

current. The amount of modulation is directly related to the amount of

impedance change across the larynx. On the receiver side, the detected

signal is demodulated to produce the EGG signal.
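The modulation and demodulation chain described above can be imitated numerically. The sketch below amplitude-modulates a carrier with a slow sinusoid standing in for the impedance change, then recovers it with a simple envelope detector (full-wave rectification followed by averaging over one carrier period). All numerical values, the sinusoidal modulation, and the detector design are assumptions of this sketch; the text does not describe the circuitry of the actual device.

```python
import math

FS = 4_000_000   # sampling rate in Hz (illustrative)
FC = 400_000     # RF carrier in Hz, inside the 300 kHz to 5 MHz range
F0 = 200         # vocal fold vibration rate in Hz (illustrative)
DEPTH = 0.05     # fractional amplitude modulation (illustrative)


def received_current(n_samples):
    """Return (modulation, amplitude-modulated carrier) as sample lists."""
    modulation = [DEPTH * math.sin(2 * math.pi * F0 * i / FS)
                  for i in range(n_samples)]
    carrier = [math.sin(2 * math.pi * FC * i / FS) for i in range(n_samples)]
    return modulation, [(1.0 + m) * c for m, c in zip(modulation, carrier)]


def envelope(signal, window):
    """Full-wave rectify, then take a running mean over `window` samples."""
    rectified = [abs(v) for v in signal]
    running = sum(rectified[:window])
    out = [running / window]
    for i in range(window, len(rectified)):
        running += rectified[i] - rectified[i - window]
        out.append(running / window)
    return out


def correlation(a, b):
    """Pearson correlation over the common length of two sequences."""
    n = min(len(a), len(b))
    mean_a = sum(a[:n]) / n
    mean_b = sum(b[:n]) / n
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in range(n))
    den_a = sum((a[i] - mean_a) ** 2 for i in range(n))
    den_b = sum((b[i] - mean_b) ** 2 for i in range(n))
    return num / math.sqrt(den_a * den_b)


modulation, rf = received_current(FS // 100)   # 10 msec of signal
egg_like = envelope(rf, FS // FC)              # window of one carrier period
corr = correlation(egg_like, modulation)
print(f"correlation between recovered envelope and modulation: {corr:.4f}")
```

The recovered envelope tracks the slow modulation closely because the carrier is far above the vibration rate, so averaging over one carrier period removes the carrier without smoothing the modulation.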

The change in impedance across the larynx is primarily due to the

change in the lateral contact area of the vocal folds [10]. Hence, the

EGG is a measure of the amount of vocal fold contact area and not of the

area of the glottis, i.e., the EGG signal is not directly related to the

glottal area function. Figure 3.1 shows a functional block diagram of

the electroglottograph with a typical EGG signal.

Figure 3.1 Functional block diagram of the electroglottograph
and its output signal, the EGG. (Diagram labels: various current paths;
glottis open; glottis closed.)

3.2 Previous Models

The previous section described the electroglottographic (EGG)

signal, or electroglottogram (EGG), and its relationship to vocal fold

vibration. In summary, Fourcin [55] upgraded the original Fabre device

to its current state. Using stroboscopic photography synchronized with

the EGG waveform, he concluded that the EGG amplitude was related to the

amount of contact between the two folds during the phonation of a voiced sound.


Fant et al. [56] combined the EGG with optical glottography and

inverse filtering of the speech waveform. Their results can be

summarized as four points: the opening instant of folds is often

associated with a sudden change in the slope of the opening phase of the

EGG, the "outstanding feature" is the rapid fall of the EGG when the

folds reach contact in the closing phase, the flat top of the EGG

corresponds to the open phase of the glottal cycle, and the ascending

portion of the EGG corresponds to the opening phase of the glottal cycle.


In order for electroglottography to be useful to the clinician or

speech researcher the electroglottographic waveform must be related to a

model of vocal fold vibratory behavior. Fog-Pedersen [57] carried out a

study similar to the one done by Fourcin, where stroboscopic

observations of the fold vibrations are synchronized with the EGG

signal. Based on these observations, he constructed a model for the EGG

during a single cycle, as shown in Figure 3.2.

Figure 3.2 Fog-Pedersen's model for the EGG.
1 Maximum opening phase
2 Maximum closing phase
Points 3 and 4 are changes from the plateau to the glottal slope of the
glottographic curves.

Lecluse [58] also combined electroglottography with synchronized

stroboscopic photography and constructed a model similar to the

Fog-Pedersen model. However, he included details that related specific

points in the EGG waveform to different events in the vibratory cycle

of the vocal folds. His model is presented in Figure 3.3.

Rothenberg [59] correlated the EGG with the glottal volume velocity

waveform derived by analog inverse filtering the speech signal. He used

an idealized model of the EGG signal, shown in Figure 3.4, to describe

the relationship between the vocal fold vibration and different features

in the EGG waveform.

Krishnamurthy [11] used a large data base of synchronized EGG,

glottal area function, and glottal volume velocity to modify the

Rothenberg model. He used the differentiated EGG waveform to pinpoint

the opening and closing instants in the glottal cycle. This modified

model is presented in Figure 3.5.

Childers et al. [32-34,60,61] used their extensive data base of

ultra-high speed laryngeal films synchronized with EGG and speech

waveforms, to further modify the Rothenberg model. The resulting model

is depicted in Figure 3.6.

Other researchers [62-64] carried out various experiments using the

EGG synchronized with stroboscopic photography and/or photoglottography

and the air flow signal. Their results support the notion that the EGG

signal is inversely related to the contact area of the vocal folds.

The above models were based on extensive observations of the EGG,

synchronized with stroboscopic photography, photoglottography, and

ultra-high speed cinematography along with the glottal volume velocity

waveform. These models are descriptive in nature and do not give

quantitative measures of different parameters of vocal fold vibration.





Figure 3.3 Lecluse's model for the EGG.
1 is the moment of initial closure at a single point
2 is the moment at which closure is completed over the whole
length, but not in the vertical plane
3 is the moment at which closure is completed over the whole
vertical plane
4 is the moment at which opening begins
5 is the moment at which the whole length is open

Figure 3.4 Rothenberg's model for the EGG.
1-2 vocal folds maximally closed
3-4 folds separating from lower margins towards upper margins
4-5 upper fold margins separating
7 lower margins close
3-7 folds apart
1 closure reaches upper fold margins

Figure 3.5 Krishnamurthy's qualitative EGG model. Labeled events:
glottis is open, folds are moving away from each other
glottis is open, folds are moving towards each other
folds make initial contact
folds close rapidly
time of maximum negative value in Diff. EGG
closed phase region
folds reach maximal lateral area of contact

Figure 3.6 Modified Rothenberg EGG model.
1-2 Vocal folds maximally closed. Complete closure
may not be obtained. Flat portion idealized.
2-3 Folds parting, usually from lower margins toward
upper margins.
3 When this break point is present, this usually
corresponds to folds opening along upper margin.
3-4 Upper fold margins continue to open.
4-5 Folds apart, no lateral contact. Idealized.
3-5 Open phase.
5 Glottal area zero. Folds in contact along lower
margin. Idealized.
5-6 Folds closing from lower to upper margin.
6-1 Rapid increase in vocal fold contact.

Titze et al. [65] and Titze [66] presented a mathematical model for

computing various glottographic waveforms. This model is still in the

development stage.

In the following section we present a different mathematical

development that generates and models the EGG waveform directly from the

contact area of the vocal folds. The calculated EGG corresponds

accurately with the area function of the vibrating folds. Also, we show

illustrative results for predicting EGG waveforms for models of

vibrating vocal folds that have a nodule or polyp on one fold.

3.3 Simple Unitary Mass Model of the Vocal Folds

Our simple model is an extension of the two-mass vocal fold

articulatory speech synthesis model of Ishizaka and Flanagan [67],

Flanagan and Ishizaka [68-70], and Flanagan, Ishizaka, and Shipley

[71]. The Flanagan and Ishizaka two-mass model provides sufficient

details to calculate the speech waveform, the pressure and air volume-

velocity distributions in the vocal cavities, the motion of the vocal

folds, the glottal area, the vibration of the cavity walls, and other

factors. A failing of this two-mass model is that the projected glottal

area is always rectangular. The folds are either open or closed; there

is no gradation of opening or closure. The lateral contact area is

stair-stepped, being zero (open glottis), next partial contact (lower

masses in contact), and finally full contact (lower and upper masses

both in contact). This approximation does not adequately replicate the

vibratory motion of true vocal folds.

A more realistic simulation of the vocal folds is achieved with a

unitary mass model as depicted in Figure 3.7. Here the mass of each

fold is thought of as a plastic wedge. The horizontal displacements, d2

and d1, of the superior (upper) and inferior (lower) margins (edges),

respectively, of the wedge that simulates the vocal folds are

determined from the two-mass model. The wedge is constructed by using a

straight line interpolation between points d1 and d2. The thickness of

the simulated folds is T and their length is L. The ratio of L to T can

be specified. In the examples to follow we have set L/T = 5. The model

assumes a plastic collision between the right and left vocal folds. The

length of vertical contact, Δx, can then be easily computed and the

contact area may be estimated at a particular time instant as

$$A = L\,\Delta x \tag{3.1}$$

The EGG waveform is proportional to the reciprocal of A. Figure 3.8a

shows the EGG signal, its derivative, and the calculated glottal area,

all estimated with the aid of this model. Compare these results with

Figure 3.8b, which shows a measured EGG and glottal area. The

differentiated EGG (DEGG) will be discussed further in another section.
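The contact area of Equation 3.1 can be sketched under one reading of the wedge geometry. It is assumed here that a margin displacement of zero or less means the margin has reached the mid-sagittal plane, and that the vertical contact length Δx is the part of the thickness T over which the straight line between d1 and d2 lies at or beyond that plane. The function names and numerical values are hypothetical.

```python
def vertical_contact_length(d1, d2, thickness):
    """Part of the fold thickness in contact, by linear interpolation.

    d1, d2: displacements of the lower and upper margins from the midline;
    a value <= 0 is taken to mean the margin touches the opposite fold.
    """
    if d1 > 0.0 and d2 > 0.0:
        return 0.0                      # folds apart everywhere
    if d1 <= 0.0 and d2 <= 0.0:
        return thickness                # full vertical contact
    touching, apart = (d1, d2) if d1 <= 0.0 else (d2, d1)
    # Fraction of the thickness between the touching margin and the
    # point where the interpolated edge crosses the midline.
    return thickness * (-touching) / (apart - touching)


def contact_area(d1, d2, length, thickness):
    """Equation 3.1: A = L * (vertical contact length)."""
    return length * vertical_contact_length(d1, d2, thickness)


L_FOLD = 1.5             # fold length in cm (illustrative)
T_FOLD = L_FOLD / 5.0    # thickness from the ratio L/T = 5 used in the text

area_open = contact_area(0.02, 0.03, L_FOLD, T_FOLD)
area_partial = contact_area(-0.01, 0.03, L_FOLD, T_FOLD)
area_full = contact_area(-0.01, -0.02, L_FOLD, T_FOLD)
print(area_open, area_partial, area_full)
```

Unlike the stair-stepped contact of the two-mass model, this interpolation lets the contact area grow and shrink gradually as the margins approach and leave the midline.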

This simple model of the EGG signal is a good first approximation

relating glottal opening and glottal closure to EGG events, as depicted

earlier in Figure 3.5. However, it does not account for the different

angles of glottal closure from anterior to posterior and of glottal

opening from posterior to anterior. Note also that this model is a

simplified method for calculating the lateral area of contact of the

vocal folds. We do not account for conservation of momentum, nor do we

allow the folds to expand as they compress and thereby possibly increase

the lateral contact area by pushing the vocal fold tissue both

superiorly and inferiorly along the mid-sagittal contact plane. The

model is a simple additional calculation, appended to the articulatory

speech synthesis model, which estimates, to a first approximation, the

lateral contact area. As can be seen from Figure 3.8, this simple model

works reasonably well.

Figure 3.7 Triangular unitary mass vocal fold model, top
and lateral views.

Figure 3.8 (a) EGG, differentiated EGG (DEGG), and the glottal
area calculated from the simple unitary mass vocal fold model.
(b) measured glottal area and EGG waveforms.

3.4 Triangular Unitary Mass Model of the Vocal Folds

High speed films of the vocal folds of males in modal register show

that there exists a phase difference along the length of the vocal folds

during their vibration and therefore during the closing (opening)

phase. During closure, contact between the folds first occurs over a

small portion of their length. Closure continues, zipper-like, along

the length of the folds until the glottis is closed. Similar behavior

occurs during the opening phase. The angles of vocal fold closure and

opening differ from one another.

We can model this behavior as in Figure 3.9 by creating an angle,

θ, between the left and right vocal folds. The contact area is now

proportional to Al. This is similar to the approach taken by Titze


In addition to the longitudinal vocal fold phase differences

described above, a vertical (superior-inferior) phase difference between

the upper and lower margins of the vocal folds has also been observed.



Figure 3.9 Triangular unitary mass vocal fold model,
top and lateral views.

An artistic rendition of the elastic, flexible unitary mass vocal

fold model and its relation to EGG waveform events is depicted in Figure

3.10. Note the vertical phase difference between the edges of each mass

as well as the longitudinal phase difference. The latter phenomenon

resembles the action of a zipper being closed and opened.

The manner in which our model functions is similar to that shown in

Figure 3.10 for the flexible one mass model. This model works as

follows. The upper and lower glottal areas (AG2 and AG1, respectively)

are calculated using the Flanagan-Ishizaka two-mass vocal fold

articulatory synthesis model. The displacements of these parallel

masses of the vocal folds (both upper and lower) from the mid-sagittal

line are calculated for a particular time instant, n, as

$$d_1(n) = \frac{\mathrm{AG1}(n)}{2L} \tag{3.2}$$

$$d_2(n) = \frac{\mathrm{AG2}(n)}{2L} \tag{3.3}$$

where n is a time index and L is the length of the vocal folds (see

Figure 3.7). These displacement values are used to position the upper

and lower margins at the posterior ends of the vocal folds in the model

in Figure 3.9. This modified model has a triangular glottal area with

an angle θ as shown. The phase difference between d1(n) and d2(n) is

maintained along the complete length of the unitary model of the vocal folds.


The vocal fold contact area is calculated using this triangular,

phase shifted configuration. Several conditions may be specified in the

computer program implementation of this model: the folds are not in

contact (no contact area), the lower margins of the folds are in contact

and the upper margins of the folds are not in contact (lateral area of

contact is triangular), lower margins are not in contact and the upper

margins are in contact (lateral area of contact is again triangular),

and both upper and lower vocal fold margins are in contact and possibly

out of phase (lateral area of contact is trapezoidal, the condition

shown in Figure 3.9).

Figure 3.10 The flexible one mass model.

With these conditions the EGG waveform is specified as

$$\mathrm{EGG}(n) = \frac{k}{A(n) + C} \tag{3.4}$$

where n is the time index, A(n) is the lateral contact area, C is a

constant proportional to the shunt impedance specified for the case when

A(n) = 0, and k is a scaling constant. The glottal area is calculated

in the model using the projected triangular glottal area configuration,

not the projected area given by AGI and AG2 of the two-mass vocal fold

articulatory speech synthesis model.

Algorithm EGG Simulation

(1) Using the Flanagan-Ishizaka two-mass vocal fold articulatory
    speech synthesis model, obtain AG1 and AG2 (defined earlier).

(2) Specify opening and closing angles, θo and θc, respectively.

(3) Specify phase shift between AG1 and AG2.

(4) Let N be the number of area samples computed in step 1.

(5) Do for n = 1, N.

    (a) Compute d1(n) and d2(n) using equations (3.2) and (3.3).

    (b) If AG1(n) < AG1(n+1) or AG2(n) < AG2(n+1), set ANG = θo;
        folds are opening. Else set ANG = θc; folds are closing.

    (c) Recompute glottal area based on the current fold angle.

    (d) Compute length of upper and lower contact margins.

    (e) Compute the depth of contact area.

    (f) Compute contact area.

    (g) Compute EGG using equation (3.4). k and C are arbitrarily
        set to 1 and 10, respectively.

    (h) Compute the differential EGG (DEGG) using the filter
        H(z) = 1 - z^{-1}. If done go to step 6, else go to step 5.

(6) Plot EGG, DEGG, glottal area.

(7) End.
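The bookkeeping steps of the algorithm above, (a), (b), (g), and (h), can be sketched in Python as follows. This is a minimal illustration rather than the original program. It assumes equation (3.4) has the inverse-area form k / (A(n) + C), it takes the contact area A(n) as an input because the geometric steps (c) to (f) are not spelled out here, and all function names are hypothetical. The handling of the last sample in step (b) and of the first sample in step (h) is likewise an assumption.

```python
def displacements(ag1, ag2, length):
    """Step (a): d(n) = AG(n) / (2L), equations (3.2) and (3.3)."""
    d1 = [a / (2.0 * length) for a in ag1]
    d2 = [a / (2.0 * length) for a in ag2]
    return d1, d2


def select_angle(ag1, ag2, n, theta_open, theta_close):
    """Step (b): opening angle if either area is growing, else closing angle.

    Assumption: at the last sample there is no n+1, so the folds are
    treated as closing.
    """
    if n + 1 >= len(ag1):
        return theta_close
    if ag1[n] < ag1[n + 1] or ag2[n] < ag2[n + 1]:
        return theta_open
    return theta_close


def egg_from_contact_area(area, k=1.0, c=10.0):
    """Step (g): assumed form EGG(n) = k / (A(n) + C), with k = 1, C = 10."""
    return [k / (a + c) for a in area]


def degg(egg):
    """Step (h): first difference y(n) = x(n) - x(n-1), i.e. H(z) = 1 - z^-1.

    Assumption: x(-1) is taken equal to x(0), so the first output is 0.
    """
    if not egg:
        return []
    return [0.0] + [egg[n] - egg[n - 1] for n in range(1, len(egg))]
```

With C greater than zero the EGG value stays finite when the contact area is zero, which is the role the text assigns to the shunt term.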

The EGG model described above is summarized in Figures 3.11 and 3.12.


3.5 Simulation Results

Figure 3.11 EGG waveform simulation using Flanagan-Ishizaka two mass
vocal fold articulatory speech synthesis model.

Figure 3.12 Flow chart for EGG model calculations.

Figure 3.13 Simulated EGG, DEGG, and glottal area waveforms with
θo = 1.0°, θc = 0.2°, and a 0.7 msec phase lag between upper and lower
vocal fold margins.

To help orient the reader the first example is illustrated in
Figure 3.13 for the following conditions: opening angle, θo = 1.0°;
closing angle, θc = 0.2°; and a lag of 0.8 ms between the upper and
lower vocal fold margins. These values have been found to simulate
features of an actual EGG quite well. In subsequent subsections we
demonstrate the effects on the EGG waveform of varying one parameter at
a time. The lag can be specified in the model as a fraction of the
fundamental period of voicing. Here we represent this lag in units of
ms. The maximum and minimum displacements of both the upper and lower
margins are the same. The collision of the unitary masses is assumed to
be plastic, perhaps like the collision of two wedges of putty. There is
a strong resemblance between this simulated EGG waveform and the
measured EGG waveform in Figure 3.8b, except that the model waveform
does not have the rounded corners of the measured waveform.
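Because the lag may be given either as a fraction of the fundamental period or in ms, converting between the two is a one-line calculation. The sketch below uses hypothetical function and parameter names; the 125 Hz value is chosen only as an example of a low fundamental frequency.

```python
def lag_ms_from_fraction(fraction, f0_hz):
    """Convert a lag given as a fraction of the pitch period into ms."""
    period_ms = 1000.0 / f0_hz
    return fraction * period_ms


def lag_fraction_from_ms(lag_ms, f0_hz):
    """Convert a lag in ms into a fraction of the pitch period."""
    period_ms = 1000.0 / f0_hz
    return lag_ms / period_ms


# Example: at 125 Hz the period is 8 ms, so one tenth of a period
# corresponds to a lag of about 0.8 ms.
example_lag_ms = lag_ms_from_fraction(0.1, 125.0)
```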

We now use this model to show the effects of the following:

varying the opening angle, varying the closing angle, varying the phase

difference between the upper and lower margins of the folds, a mucus

strand bridging the vocal folds during the opening phase, and vocal fold

polyps or nodules on the EGG model waveform.

3.5.1 Varying the Opening Angle, θo

For these calculations the closing angle was fixed at θc = 0.2°,
while d1(n) and d2(n) were calculated using the two-mass articulatory
synthesis model. Figure 3.14a shows the effects of varying θo from 0.2°
to 2.7°. As θo increases, the rising slope of the EGG, which
corresponds to the opening phase of the glottal area waveform,
decreases. A bend in the rising portion of the EGG is visible.

For θo > 3.5° the EGG rises gradually as the model vocal folds open
until the angle of opening reaches the specified θo, at which point the
EGG jumps suddenly, in step-like fashion, to the maximum EGG value.
This discontinuity phenomenon is due to the constant C being specified
for small values of θ and then left unchanged for all succeeding
calculations. This problem might be resolved by letting the constant C
in equation (3.4) be a function of the opening angle.

Figure 3.14 Simulated EGG, DEGG, and glottal area waveforms for various
vocal fold angles: (a) opening, θo, and (b) closing, θc.

Note that the differentiated EGG (DEGG) waveform marks the instants
of glottal opening and closing with its positive and negative peaks,
respectively. However, as the angle of opening increases, the positive
DEGG peak becomes broader, making the decision for the instant of
glottal opening more difficult. For our data, a good EGG model waveform
is best simulated using 0.5° < θo < 2°.
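The peak-based marking described above can be sketched as follows. This is only an illustration of the idea under the convention used in this chapter (the EGG rises during the opening phase). The function name is hypothetical, and the input is assumed to cover a single glottal cycle.

```python
def mark_opening_closing(degg):
    """Return (opening_index, closing_index) for one cycle of a DEGG waveform.

    The opening instant is taken at the largest positive DEGG value and
    the closing instant at the largest negative value.
    """
    if not degg:
        raise ValueError("DEGG sequence is empty")
    opening = max(range(len(degg)), key=lambda n: degg[n])
    closing = min(range(len(degg)), key=lambda n: degg[n])
    return opening, closing
```

A broad positive peak, as reported for large opening angles, leaves many samples close to the maximum, so the opening index returned by such a rule becomes correspondingly less trustworthy.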

3.5.2 Varying the Closing Angle, θc

For these calculations, the opening angle was fixed at θo = 2°,
while d1(n) and d2(n) were calculated as before. Figure 3.14b shows the
effects of varying θc from 0.0001° to 2°. For θc > 5° a stair-step
discontinuity occurs in the falling slope of the EGG, which corresponds
to the closing phase of the glottal area. As θc increases the falling
slope of the EGG decreases.

For small closing angles, the largest negative peak in the DEGG

waveform does not occur at the instant of zero glottal area, but several

instants later. However, when the angle of closure is quite large,

namely θc = 2°, then the large negative peak in the DEGG corresponds

approximately to the instant of zero glottal area. This appears to

agree with actual measured data. The instant of zero glottal area

corresponds to the instant at which the EGG starts its downward

(negative) deflection. This instant generally occurs in the model

waveforms just slightly before the instant of greatest negative slope,

i.e., the instant at which the DEGG has its largest negative value. A

good EGG model waveform is best simulated using 0.0001° < θc < 0.5°.

3.5.3 Varying the Phase Difference Between Upper and Lower Vocal Fold Margins

For these calculations θc = 0.2° and θo = 2°, and the minimum

and maximum upper and lower margin displacements are equal, but the

upper vocal fold margin lags the lower margin. The simulated EGG

waveforms in Figure 3.15a closely resemble the measured EGG for a normal

voice phonation at low frequency, except for the lack of rounding of the

modeled EGG waveform during the glottal closed phase. When the phase

difference or lag between the vocal fold margins is zero, the resulting

simulated EGG is stylized with a steep closing phase and a gradual

opening phase.

For a lag from 0.3 to 0.7 ms the falling slope of the EGG is

affected in a manner analogous to that caused by increasing θc. For a

lag from 0.7 to 1.0 ms, discontinuities (steps) occur in both the rising

and falling slopes of the EGG, as discussed earlier. With a lag from

1.3 to 4.2 ms, the simulated EGG resembles that produced by an

individual phonating in vocal fry (see Figures 3.15b and c).

For large lags note that the DEGG waveform does not correctly flag

the instants of glottal opening and closing as has been commonly

assumed. This can also happen when vocal fold polyps and nodules are

present. Consequently, we need another criterion for marking the

instants of glottal opening and closing when "irregular" or "unusual"

vocal fold vibrations are taking place.

Figure 3.15 Simulated EGG, DEGG, and glottal area waveforms for
various lag (phase) differences between upper and lower vocal fold
margins (a). Upper margin lags lower margin. Simulated vocal fry
(b) and measured vocal fry (c). Compare the measured EGG with the
simulated 2.0 msec lag EGG in part (b).





3.5.4 Effects of Mucus

Mucus strands that bridge the vocal folds during the opening phase

have subtle effects on the EGG waveform. During the opening phase, the

EGG rises relatively slowly as the mucus strand stretches but continues

to bridge the vocal folds. At some point in the opening phase the mucus

strand breaks. This event can be measured from our ultra-high-speed

laryngeal films. Just after this break, there is a rapid rise (almost

an upward step) in the EGG waveform. This phenomenon is seen even more

readily in graphs of glottal lengths.

An example of a measured EGG waveform with a large amount of mucus

present on the vocal folds and with a mucus strand bridging the vocal

folds during the opening phase is shown in Figure 3.16a. The mucus

phenomenon described above is more readily observable but still subtle

in the graphs of the glottal length and glottal area. These graphs have

a "bend" in the rising portion of the waveform. The folds are open at

the beginning of the bend, but a strand of mucus bridges the folds.

When this strand breaks, the length waveform rises sharply to its

characteristic flat top. While the EGG waveform appears normal for this

case, the differentiated EGG does not have its maximum at the instant of

glottal opening; rather, the maximum occurs just as the mucus strand

breaks. This point corresponds to the "knee" in the opening phase of

the glottal area waveform. The EGG waveform reaches its maximum just as

the mucus strand has broken. Thus an excessive amount of mucus,

providing a highly conductive current path, may distort the EGG

measurement. In such cases the EGG may not be an accurate

representation of the lateral area of contact of vocal fold tissue.

Figure 3.16 (a) Measured EGG when a mucus strand is present on the
vocal folds (subject JMHN, 178 Hz, 72 dB). Also shown are the DEGG, the
glottal length (L), and the glottal area (A). (b) Simulated EGG for a
simulated mucus strand. The various EGG curves represent a fractional
decrease in the vocal fold lateral area.

Figure 3.16b shows the effect of simulating various size mucus strands

bridging the vocal folds during the initial part of the vocal fold

opening phase and then breaking. This simulation mimics the measured

event quite well.

Another mucus-related phenomenon is reflected in the EGG waveform

as a difference in the impedance of the vocal folds during the opening

and closing phases of the folds. If only a small amount of mucus is

present on the lateral area of contact of the folds and no tissue

compression occurs when the vocal folds collide, then the impedance of

the vocal folds at the instants of initial glottal opening and initial

glottal closing should be the same. However, when a large amount of

mucus is present the impedance of the folds is decreased (since the

resistance of mucus is presumed to be less than that of the vocal fold

tissue). The EGG waveform will now be slightly lower at the instant of

initial glottal opening, because of the mucus, than at the instant of

initial glottal closing. This can be seen in Figure 3.17.

3.5.5 Effects of Nodules and Polyps

According to Moore [53] polyps and nodules on the vocal folds arise

as a result of trauma to the folds. A laryngeal polyp may occur as a

result of a single brief period of vocal strain, whereas a nodule often

develops over a longer period of time and progresses through several stages.


The shape and consistency of the polyp or nodule affects the shape

of the EGG waveform. We believe that an edematous polyp or nodule will

have a more pronounced effect on the EGG than a fibrous polyp or

nodule. This remains, however, a conjecture to be verified.


Figure 3.17 Measured EGG, glottal length (L), and glottal area (A)
for the case when a large amount of mucus is present on the vocal
folds (subject RKK, 170 Hz, 72 dB). Vertical lines marking the
initiation of glottal opening and closing identify different impedance
levels on the EGG.

A large polyp will cause an earlier contact of the vocal folds

during the closing phase but not a greater area of contact unless it is

very soft. A large protrusion may actually reduce the amount of contact

since there will be no contact adjacent to the nodule. However, since

nodules are in the loose mucosa they slide onto and off the upper

surface and consequently may present a different rate and amount of

contact between opening and closing. Furthermore, the EGG for hard

protrusions should differ from that for soft protrusions, i.e., show

less contact area.

The effect of a large edematous polyp on the EGG is easily

implemented in our simulation model as a percentage increase in the

lateral area of contact as the folds close. We can vary the location,

size, and effective conductivity of the polyp or nodule.
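A minimal sketch of this idea follows, assuming the polyp is represented only by a fractional increase in the lateral contact area on samples where the folds are closing. The function name, the closing flags, and the default fraction are hypothetical; the location and conductivity parameters mentioned above are not modeled in this sketch.

```python
def apply_polyp(contact_area, closing, fraction=0.05):
    """Scale the contact area by (1 + fraction) on closing-phase samples.

    contact_area: lateral contact area A(n) per sample
    closing:      booleans, True where the folds are closing
    fraction:     fractional increase (0.05 means 5%)
    """
    if len(contact_area) != len(closing):
        raise ValueError("contact_area and closing must have equal length")
    return [a * (1.0 + fraction) if is_closing else a
            for a, is_closing in zip(contact_area, closing)]
```

Keeping the modification as a separate pass over A(n) leaves the underlying contact-area computation untouched, so the same routine can be reused for different assumed polyp sizes.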

Figure 3.18 shows an EGG measured from a patient with a nodule

located near the middle of the upper margin of the vocal folds. As the

folds initiate the opening phase, the EGG progresses in a normal

fashion. But as the folds continue to separate beyond the location of

the nodule, the EGG level remains constant. This is where the nodule on

one fold remains in contact with the other fold for a brief period. The

effective lateral area of contact remains constant. For a short

interval the contact area of the nodule compensates for the loss of

contact area due to the folds separating below and beyond the nodule

location. In the same figure we show the EGG waveform for a simulated
nodule.

Figure 3.19 illustrates the effect of varying the lateral area of
contact of the nodule on the opening (rising) portion of the EGG
waveform. The larger the contact area of the nodule, the longer the flat
segment of the rising portion of the EGG waveform. The location
(anterior/posterior) of the nodule can be varied as well in the model,
as seen in Figure 3.20. These examples illustrate the potential of the
EGG to estimate the size and location (anterior/posterior) of a vocal
fold nodule or polyp.

Figure 3.18 Simulated EGG (upper graph) for a simulated nodule present
on one vocal fold, θo = 2.0°, θc = 0.5°. Measured EGG (lower graph) for
a nodule present on one vocal fold.

Figure 3.19 Simulated EGG waveforms for various fractional increases
in lateral contact area, simulating various nodule sizes. The 0.05
curve represents a 5% increase in the lateral contact area.

3.6 Discussion

Our simple model of vocal fold vibrations and EGG waveform

generation does not satisfy all the properties of physics, e.g.,

conservation of momentum. The collision of actual folds results in

their deformation. This may lead to an increase in the lateral contact

area because the mucosal layer is pushed both superiorly and inferiorly

along the surface of contact. This deformation of mass is not accounted

for in our model. These imperfections can and should be overcome with

future improvements to the model. Despite this incompleteness in the

model, we found the model could replicate many aspects of the EGG

waveform via the concept of inverse lateral vocal fold contact area. We

believe the numerous examples illustrated in the previous sections

substantiate this claim. The concept of vocal fold mass compression

seems to be of little importance in explaining the major features of the

EGG waveform.

We stress that our model is presently best suited for a normal male
voice in modal register. This is because the vocal folds are apparently
thicker for males and our observations concerning opening and closing
angles and upper and lower margin lags have been made primarily from
ultra-high-speed laryngeal films for this case. Higher pitched voices
(both males and females) tend to have thinner, longer vocal folds. The
initial point of vocal fold contact in this case may occur at the middle
of the vocal folds. Complete closure may not occur. The folds also
appear to be quite thin, with perhaps only one margin, i.e., no upper
and lower vocal fold margins. These concepts can be modeled, but we
have yet to do so.

Figure 3.20 Simulated EGG when the simulated nodule location is varied
posterior-anterior, θo = 2.0°, θc = 0.2°. The 0.7 curve denotes that the
nodule is located at 0.7L from the posterior boundary of the idealized
glottis. The nodule is simulated with a 5% increase in lateral contact
area.

Several interesting observations can be drawn from our modeled

data. The differentiated EGG (DEGG) has been considered by users of the

EGG to be a reliable indicator of the instants of glottal opening (large

positive peak) and closing (large negative peak). This concept appears

to be valid for a normal male voice in modal register with vocal fold

closure. However, the simulation of vocal fold vibratory motion that

either departs from modal register or is impaired by a vocal fold nodule

predicts that the DEGG is no longer a reliable indicator of the instants

of glottal opening and closing. The model predicts that closure occurs

later than the measured data and the peak used to predict opening is

very broad.

By using a set of model EGG templates for a series of fixed opening

and closing vocal fold angles, one may be able to estimate these angles

from a measured EGG waveform. Similar remarks apply to estimating the

location and size of a vocal fold nodule. But more work is needed

before we can make such predictions reliably and consistently.
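The template idea might be sketched as a nearest-template search. This is a hypothetical illustration, not a method evaluated in this work. It assumes the measured cycle and every template have the same number of samples, normalizes each waveform to the range 0 to 1 (since the absolute EGG magnitude carries little information), and picks the template with the smallest sum of squared differences. All names are ours.

```python
def normalize(wave):
    """Scale a waveform to [0, 1]; a flat waveform maps to all zeros."""
    lo, hi = min(wave), max(wave)
    if hi == lo:
        return [0.0 for _ in wave]
    return [(x - lo) / (hi - lo) for x in wave]


def best_template(measured, templates):
    """Return the (theta_open, theta_close) key of the closest template.

    templates: dict mapping (theta_open, theta_close) -> waveform
    """
    target = normalize(measured)
    best_key, best_err = None, float("inf")
    for key, wave in templates.items():
        if len(wave) != len(target):
            raise ValueError("template length differs from measured cycle")
        cand = normalize(wave)
        err = sum((a - b) ** 2 for a, b in zip(target, cand))
        if err < best_err:
            best_key, best_err = key, err
    return best_key
```

In practice the measured cycle would first have to be resampled and time-aligned to the templates, which this sketch leaves out.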

A factor which plagues the users of the EGG is determining whether or

not the glottis is closed; i.e., can a closed glottis be determined from

the EGG waveform alone? Presently, the answer to this question is no!

The EGG waveform of a breathy voice with an open glottal chink may look

essentially the same as an EGG waveform where complete glottal closure

has occurred. If the determination of complete glottal closure is

important, then the investigator must use other means (such as listening

to the voice) to assess whether complete glottal closure has occurred.

The model is presently inadequate with respect to duplicating the

rounded segments of measured EGG waveforms. These rounded segments are

most prominent in the closed and open phases. During the closed phase,

we feel the vocal fold tissue has compressed and deformed to actually

increase the lateral area of vocal fold contact. Our model does not

duplicate this phenomenon but one can easily conceptualize this effect

and predict that such behavior in the model would lead to a more

realistic simulation of a measured EGG waveform. During the opening

phase, a different phenomenon is occurring, and we do not have as ready

an explanation. Apparently the tissue conductance path changes

gradually as the folds open. This path should not be modeled by a fixed

impedance or fixed area (the parameter C in our model). More work is

needed to understand this phenomenon better.

The absolute magnitude of the EGG waveform in our model has little

importance, since the waveform amplitude can be scaled using the

constant k in our model. This is in agreement with previous

observations by Childers et al. [60] and Baer et al. [63]. But the

relative amplitude does give information regarding the relative

separation of the folds.

The model of the vocal vibratory motion is only a very rough

approximation to that observed for real vocal folds. Some investigators

would depict the vibratory motion of one fold as more analogous to a

vibrating string or large rubber band, anchored at the anterior and

posterior ends. This description seems more adequate for higher pitched

registers. The model we have implemented is more appropriate for modal

register. Despite its obvious faults, the model has replicated many

aspects of the EGG waveform.

The artistic, unitary, one mass vocal fold model depicted in Figure

3.10 can be improved by extending the model to include two masses as

shown in Figure 3.21. This improved model more accurately depicts the

vocal fold vibratory motion by allowing the glottis to close with only

partial (approximately one-half) vocal fold contact. A comparison of

Figures 3.10 and 3.21 allows one to conclude that the vocal fold

vibratory events are more precisely matched with EGG waveform segments

in Figure 3.21 than in Figure 3.10. This addition to the model,

including those described above, should markedly improve our ability to

relate vocal fold vibratory motion events to EGG waveform segments.

3.7 Conclusions

Our triangular unitary mass model of the vocal folds has
incorporated several concepts of vocal fold vibrations, three of which
are the opening angles of the vocal folds, the closing angles of the
vocal folds, and the phase shift between the upper and lower margins of
the folds. We have found that EGG waveforms for a normal male voice in
modal register are simulated when

    0.5^\circ < \theta_o < 2^\circ

    0.0001^\circ < \theta_c < 0.5^\circ

    0.3 < lag < 0.7 (msec)

Figure 3.21 EGG waveform and the flexible, elastic two-mass vocal
fold model. The vocal fold model events are labeled on the EGG
waveform. The upper and lower vocal fold margins on each mass are
out of phase. An artistic license was taken to illustrate vocal
fold motion.

We have focused our work on the normal male voice in modal register

because we know more about the vibratory motion of the vocal folds for

this case. We have simulated a few examples of vocal fry as well as the

effects of mucus and nodules.

In the simulations of vocal nodules (and polyps) we can predict the

approximate size and location of the nodule (being anterior or posterior

on the vocal folds) by observing the EGG waveform segment on which a

departure from "normal" functioning occurs. This is a form of the

inverse problem, i.e., given an EGG waveform we can begin to predict

vocal fold configurations. More work on this "inverse" problem is needed.


One observation about vocal fry is in order. Vocal fry is defined

perceptually to be a low pitched, rough-sounding phonation that

corresponds to aperiodic low frequency vocal fold vibration. The model

can simulate a fry-like EGG waveform at any pitch, however, because the

lag between the upper and lower margins may be specified independent of

the fundamental frequency. In the near future one could use the

articulatory speech model to synthesize aperiodic speech at various

pitch frequencies and record the listener's report of the auditory

perception of such simulations. This should allow us to compare and

contrast the vibratory basis of vocal fry and other high-pitched vocal

roughness. One also could simulate EGG waveforms for high pitched


voices, e.g., female voices, and for other voice registers and voice



Chapter 4. Data Collection and Measurement System

The present study relies on obtaining minute details of the

mechanics of the vibrations of the vocal folds. The data collection

system used for this purpose is shown in Figure 4.1. The major building

blocks of this system are: an ultra-high-speed cinematography camera; a

high intensity lamp for illumination of the vocal folds; an

electroglottograph for obtaining the EGG; a directional hearing

microphone for transducing the acoustic pressure wave into the electrical

speech signal; a timing code generator for synchronizing the recorded and

filmed data; tape recorders for recording speech, EGG, and timing

signals; and a grid projector for absolute measurement of the filmed

vocal folds image.

The system is designed to photograph the vibration of the folds of
a subject in excess of 5000 frames/sec. The EGG and speech signals are
recorded for the phonated vowel; in this study, the vowel is /i/. The
speech signal and the EGG are recorded in synchrony with the film by a
timing signal and a timing code. This synchrony is obtained by
photographing the timing signal of 5 kHz, the timing code (to be
described), and the EGG signal on the edges of the laryngeal film while
simultaneously recording the speech, EGG, and another 10 kHz timing
signal on tape recorders. The 10 kHz timing signal is in phase with the
5 kHz timing signal photographed on the film.

Figure 4.1 Data collection system.

In the following sections we describe the data base collected from

normal and pathologic subjects. We also describe the collection system

functional blocks in conjunction with the overall system operation.

4.1 Data Base

Two classes of subjects were used in this study, subjects with a

normal larynx and subjects with a pathology of the vocal folds.

4.1.1 Normal Subject Data Base

The normal population in this study consisted of four adult males

(HMN, DMK, GPM, AKK). These subjects were checked by a speech

pathologist for evidence of vocal disorders or laryngeal pathology but

none were found. As mentioned earlier, the vowel /i/ was chosen, since

during phonation of this vowel the epiglottis is usually held back out

of the optical pathway of the vocal folds image during filming.

However, due to the presence of a laryngeal mirror, and holding down the

tongue to facilitate high speed photography, the sound produced was

closer to an /a/ than /i/ in most cases. The duration of filming and

recording was approximately three seconds.

The task for each subject consisted of phonating the vowel /i/ at

three different intensities at each of three different fundamental

frequencies. The target fundamental frequencies were 125 Hz, 170 Hz,

and 340 Hz. Each subject tried to produce these fundamental frequencies

by matching a pure tone of the corresponding frequency provided on a set

of headphones. The subjects produced intensities at each of the target

fundamental frequencies at a "comfortable" level, an intensity

approximately 4 dB above and another intensity about 4 dB below this

level. The output intensities were measured using a General Radio Co.

type 1551-C sound level meter. Thus, there were nine tasks for each

normal subject for a total of thirty-six tasks.

4.1.2 Patient Data Base

Photographing subjects, in general, is a difficult task. The

presence of a pathology introduces additional constraints on high speed

filming of these subjects. Many patients are unable to phonate at

different fundamental frequency ranges when a laryngeal mirror is placed

inside their mouth. In many cases, the pathology itself restricts the

frequency range of the patient's vocal folds.

In our study we were able to recruit four patients for high speed

photography. The patient population consisted of one adult female (DJB)

and three adult males (LMB, GTS, MXR). Using traditional methods, each

patient's condition was diagnosed by an otolaryngologist and a speech

pathologist at the ENT clinic at the University of Florida. Table 4.1

lists the tasks the patients performed, and their corresponding


The task performed by each patient was tailored to his or her

individual capabilities. Care was taken so as not to subject the

patients to undue pain or discomfort. Each patient was asked to phonate

the vowel /i/ as best as they could. Again, the produced phonation

sounded more like an /a/ in most cases. It is clear that we did not

have as much control of the tasks performed by patients compared to the

tasks performed by normal subjects.



Table 4.1 Tasks performed by the patients.

Subject  Sex  Task 1                    Task 2            Comments

DJB      F    Modal pitch /i/           High pitch /i/    Unilateral nodule

LMB      M    High pitch intermittent   High pitch        Unilateral polyp with companion
              sequence /i/              sequence /i/      bulge on the other fold results
                                                          in some loss of voice

GTS      M    Exhalation /i/            Exhalation /i/    Bilateral paralysis

MXR      M    Modal /i/                 None              Voice production problem,
                                                          loss of voice

4.2 High Speed Photography

The high speed camera used in this study is a Fastax Model WF14.

The maximum exposure rate of this camera is 8000 frames/sec. The rate

of exposure is primarily dependent on the voltage applied to the drive

motor of the camera. In this study, this voltage was adjusted such that

an exposure rate of 5000 frames/sec was achieved before the last 150-200

frames of the film.

The camera has two lens systems to facilitate the simultaneous

filming of two images on the film frames. This feature was fully

utilized in our study as will be discussed shortly.

The technique of ultra-high-speed photography of the vibrating

vocal folds is described in [72,73]. Photographing the vocal folds is

accomplished by inserting a laryngeal mirror inside the subject's mouth

and placing it at the back of the pharynx. Unfortunately, many subjects

cannot overcome the gag-reflex produced by this setup, and only a

selected group of subjects can be used for laryngeal photography. The

location of the folds inside the larynx allows a very small amount of

light to be present at any time. This problem is solved using a high

intensity incandescent lamp having a color temperature of 3200 K to

illuminate the vocal folds during vibration. The light beam passes

through two condenser lenses as well as a water cell to remove the

infrared and ultraviolet portion of the light spectrum. This serves to

protect the subject's vocal folds and other tissues from excessive heat

or ultraviolet light.

The laryngeal mirror reflects the light at 90° downward onto the

vocal folds. The image of the folds is reflected back by the laryngeal

mirror through one of the two lens systems to the high speed camera (see

Figure 4.1). The folds' image is focused manually by adjusting the

camera lens system. As the subject phonates, the details of the folds'

vibration are captured on the film.

Using the same lens system, absolute measurement of the folds'

vibration (glottal area) is accomplished by positioning a 0.1 square

inch grid in the focal plane of the vocal folds' image, as shown in

Figure 4.1. The grid image is positioned to lie in the corner of the

film frames, and does not interfere with the folds' image.

The second lens system is specifically designed to photograph an

oscilloscope face. The EGG signal and two other timing signals (to be

discussed shortly) are photographed through this lens system. The

oscilloscope trace of the EGG signal is positioned to lie along one edge

of the film. The other traces of the timing signals lie along the other

edge of the film. Again, care is taken such that none of the signal

traces interfere with the vocal fold image. Since the two lens systems

of the camera are at a 90° angle with each other, the three oscilloscope

traces appear on a film frame that is displaced five frames behind the

film frame recording the corresponding vocal folds image.

Two different types of high speed films were used. These were

black and white Kodak 7277 4-X reversal film and the color Kodak

Ektachrome 7250 high speed video news film.

Accurate time synchronization between the speech, EGG, and the

laryngeal film frames is unique and essential to our study. In order to

achieve this synchronization, a special time code generator was designed

[36]. The time code generator provides three timing signals: a 10 kHz

square wave that is recorded on the second channel of both tape

recorders, a 5 kHz square wave derived from and in phase with the 10 kHz

signal, and an 8 bit counter signal. The latter two timing signals are

inscribed on the laryngeal film frames as described earlier.

The high speed camera is an electromechanical device. The camera

motor requires a short amount of time to achieve the desired speed.

Hence, the film exposure rate varies from zero frames/sec at the

beginning of filming to 5000 frames/sec at the latter part of the

100-foot high speed film. The 5 kHz square wave signal is photographed on

the edge of the film. This signal is used to monitor the film exposure

rate (speed), among other things. The constant 5000 frames/sec rate is

achieved when each cycle of this signal is aligned with one frame of the

film. This method was used to accurately mark the constant 5000

frames/sec exposure rate region. We usually obtain from 150-200 frames

of film in this region.

The 8 bit counter signal is used to locate a specific frame in the

100 feet film. The counter is incremented by one every 100 cycles of

the 5 Khz signal. The counter value is output once every 100 cycles of

the 5 Khz signal. The counter bits are shifted out serially at a clock

rate of 40 Khz, such that the 8 bits are time aligned with every 100th

cycle of the 5 Khz signal. The oscilloscope trace of the 8 bit counter

is located directly below the 5 Khz signal trace on the edge of the

film. The proximity of the two traces helps in resolving ambiguities

that frequently arise during measurements of the film frames. Except

during every 100th cycle of the 5 Khz signal, the counter output signal

on the film is set to zero.
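
The counter timing can be checked with a little arithmetic: eight bits shifted out at 40 Khz last 8/40000 s, or 200 microseconds, which is exactly one period of the 5 Khz signal. The Python sketch below models what the counter trace would show during a given 5 Khz cycle. It is only an illustration: the name `counter_trace` is hypothetical, and the most-significant-bit-first order is an assumption, since the bit order is not stated in the text.

```python
BIT_CLOCK_HZ = 40_000      # serial shift clock for the counter bits
FRAME_CLOCK_HZ = 5_000     # 5 Khz signal, one cycle per frame at full speed
CYCLES_PER_CODE = 100      # the counter value is output every 100 cycles
N_BITS = 8

def counter_trace(cycle, msb_first=True):
    """Bits shown on the counter trace during 5 Khz cycle `cycle` (1-based)."""
    if cycle < 1:
        raise ValueError("cycle numbers start at 1")
    if cycle % CYCLES_PER_CODE != 0:
        return [0] * N_BITS                       # trace is held at zero
    # The first code, shifted out after the first 100 cycles, is zero.
    code = (cycle // CYCLES_PER_CODE - 1) % (1 << N_BITS)
    bits = [(code >> i) & 1 for i in range(N_BITS)]   # LSB first
    return bits[::-1] if msb_first else bits      # bit order is an assumption

# Eight bits at 40 Khz occupy one full period of the 5 Khz signal.
burst_seconds = N_BITS / BIT_CLOCK_HZ
```

Note that the first code is all zeros, so on the film it looks the same as the idle trace; this is the reason for the (k + 1) correction discussed in the text.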

Thus, to locate any frame on the film, say N, we can use the

following formula:

N = 100k + j

where

k = floor(N / 100), the number of times the counter bits are shifted out

j = N mod 100, the remainder of N divided by 100

So, one will locate the kth output of the 8 bit counter, whose code

value is k - 1, and count j frames to locate the desired frame. Usually,

the reverse operation is carried out. The frames are counted backward

from the starting frame of the constant region of 5000 frames/sec to the

last 8 bit counter code on the film. Let the number of frames counted be

j, and let the code value be k; then the starting frame is the

[100(k+1) + j]th frame. We use (k + 1) instead of k because the initial

8 bit code, shifted out after the first 100 cycles of the 5 Khz signal,

is zero and not one.
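
The two directions of this bookkeeping can be sketched in a few lines of Python. The function names are hypothetical and the code only restates the arithmetic above; it is not the software used in the study.

```python
CYCLES_PER_CODE = 100  # one counter output per 100 cycles of the 5 Khz signal

def split_frame_number(n):
    """Forward direction: N = 100k + j, returned as (k, j)."""
    return divmod(n, CYCLES_PER_CODE)

def frame_from_code(code_value, frames_past_code):
    """Reverse direction: code value k read off the film plus j counted frames."""
    return CYCLES_PER_CODE * (code_value + 1) + frames_past_code

k, j = split_frame_number(1234)   # 12th counter output, then 34 more frames
n = frame_from_code(k - 1, j)     # the 12th output carries code value 11
```

Here frame 1234 is found 34 frames past the 12th counter output, and reading code value 11 with 34 counted frames leads back to frame 1234.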

The 10 Khz timing signal recorded on both tape recorders is used to

provide the external sampling clock for the A/D converter when

digitizing the speech and EGG signals. The 5 Khz signal photographed on

the film is derived from and in phase with this 10 Khz signal. So the

number of samples of the EGG signal corresponding to the [100(k+1) + j]

cycles of the 5 Khz signal is [2 {100(k+1) + j}]. However, the

corresponding number of samples for the speech signal is [2 {100(k+1) +

j} + d], where d is the number of samples accounting for the propagation

delay of the speech signal from the glottis to the microphone.
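
As a small worked example of these sample counts, the sketch below uses hypothetical function names; the delay of four samples in the usage lines is the value quoted in Section 4.5, not a measured constant.

```python
def egg_sample_count(code_value, j, samples_per_cycle=2):
    """Samples of the 10 Khz clock spanned by 100(k+1) + j cycles at 5 Khz."""
    return samples_per_cycle * (100 * (code_value + 1) + j)

def speech_sample_count(code_value, j, delay_samples, samples_per_cycle=2):
    """Same count, plus d samples of glottis-to-microphone propagation delay."""
    return egg_sample_count(code_value, j, samples_per_cycle) + delay_samples

egg_index = egg_sample_count(11, 34)           # 2 * 1234
speech_index = speech_sample_count(11, 34, 4)  # d = 4 as in Section 4.5
```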

Recently, we modified the time code generator circuit to interface

with the newly acquired Vetter FM tape recorder. The modification

includes a highly stable, crystal-based oscillator to eliminate the

drift in frequency exhibited by the old RC oscillator. Also, the 10 Khz

timing signal was replaced by a 20 Khz square wave signal. Now the

speech and EGG channels can be sampled using the same sampling

frequency. The other timing signals, namely the 5 Khz and the 8 bit

counter signals, were not modified. The overall operation of the time

code generator is essentially the same as discussed earlier.

The film frames analyzed usually lie at the end of the spool of

film. Thus, the timing signals, used to align the recorded speech and

EGG signals with the traced EGG and area function measured off the film,

need not start at the beginning of the film. So a delay circuit was

added to the time code generator such that the timing signals start

shortly after the film has started. This feature greatly

reduced the transient artifact that occurred in the timing signals when

the camera motor was switched on.

4.3 Speech and EGG Signals Recording

The speech signal was recorded using a directional hearing aid

microphone coupled directly to one channel of a stereo tape recorder or

an FM tape recorder. The microphone was attached to the laryngeal

mirror handle and positioned inside the mouth to eliminate the noise

pickup of the camera motor. The distance of the glottis from the

microphone varied from subject to subject, but was approximately 11 cm

in most cases. The audio bandwidth of the microphone has been measured

to be about 6 Khz with a slight peak at 4 Khz. The tape recorders used

were a Revox A77, a Teac A-2060, or a Vetter model B FM instrumentation

recorder.


The EGG signal was obtained using an electroglottograph designed by

D. Teany and manufactured by Synchrovoice Associates. This device was

modified to give a more stable output. Figure 4.2 illustrates the

linear phase filter circuit used for this purpose. The EGG signal was


Figure 4.2 Trend removal filter for the electroglottograph.

connected to one channel of a Sony model TC530 stereo tape recorder or a

Vetter model B FM instrumentation recorder. The rise and fall of the

electroglottograph was adjusted using a square wave calibration circuit


Synchronization of the high speed film, EGG, and the speech signal

requires the presence of a synchronizing timing signal. Originally, the

four channel Vetter FM recorder was not available. The other available

stereo recorders, the Revox A77, Teac A-2060, and the Sony TC530 were

used. Since each one of these recorders has only two input channels, a

10 Khz square wave signal was input simultaneously to one channel on

each recorder, while the other channel was used for recording the speech

or EGG signal. Splitting the signals across two recorders undoubtedly

increased the potential for error in the recording system. The stereo

tape recorders were run at 7.5 ips to obtain a flat frequency response

from 50 Hz to 5 Khz. The two-recorder problem was solved once we

obtained the four channel Vetter FM recorder. Having all signals,

including the timing signal, on one tape recorder greatly reduced the

potential for error at both recording time and processing time:

corresponding signals were easier to locate and variations between

instruments were eliminated. The Vetter recorder was run at 15 ips to

obtain a flat frequency response from 0 Hz to 4500 Hz on the speech and

EGG channels, and a flat frequency response from 200 Hz to 40 Khz on

the timing signal channel. The effects of bandpass filtering on the

timing channel were corrected by the addition of a simple Schmitt

trigger circuit. This additional circuit restored the square waveshape

of the timing signal, which is needed to externally trigger the A/D

converter.
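
The Schmitt trigger used in the study is an analog circuit. Its behavior can be illustrated with a software model: a comparator with two thresholds, so that a band-limited, rounded timing signal is turned back into a clean two-level waveform without chattering near the switching point. The function name and the threshold values below are assumptions made for illustration only.

```python
def schmitt_trigger(samples, low, high, initial=0):
    """Two-threshold comparator with hysteresis.

    The output switches to 1 when the input rises above `high` and back to 0
    only when the input falls below `low`; in between it holds its state.
    """
    state = initial
    out = []
    for x in samples:
        if state == 0 and x > high:
            state = 1
        elif state == 1 and x < low:
            state = 0
        out.append(state)
    return out

# Hypothetical rounded timing signal with small wiggles near mid-level.
rounded = [0.0, 0.4, 0.7, 0.5, 0.6, 0.2, 0.4, 0.8]
squared = schmitt_trigger(rounded, low=0.3, high=0.6)
```

The wiggles at 0.5 and 0.4 do not toggle the output, which is the property that makes the reconstructed clock usable as an external A/D trigger.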

4.4 Data Measurement and Preprocessing

4.4.1 Speech and EGG Signals

The 10 Khz signal recorded on the second channel of both tape

recorders was used as the external sampling clock for the analog to

digital (A/D) converter. The A/D converter board is plugged into the

backplane of a Data General Nova 4 minicomputer. The A/D is capable of

sampling two channels of data at a combined sampling rate of more than

20 Khz. Due to the limited bandwidth of the tape recorders, the 10 Khz

timing signal was passed through a waveshaping circuit to obtain a clean

square waveform. However, small variations in the tape speed and jitter

in the waveshaping circuit are sufficient to introduce errors in

synchronization between the various signals. Using the FM tape recorder

in the latter part of the study eliminated these problems to a great

extent.


Prior to digitization, the speech and EGG signals were passed

through an analog lowpass filter with a cutoff frequency of 5 Khz. Once

the EGG and speech signals are digitized, two further problems have to

be corrected: tape recorder distortion, and power line and high

frequency noise.

Tape recorder distortion results from the capacitor coupling

normally used in tape recorders, which introduces both phase and

magnitude distortion in the low frequency region below 200 Hz. This

distortion is not visible in the speech waveform but manifests itself as

a downward slope in the EGG signal during the glottal open phase.

Berouti [21] proposed a solution to this problem. Briefly, the

recording and playback system is considered to be a linear, time

invariant system. Let the transfer function of this system be H(w).

We can derive H(w) using a reference signal, usually

a square wave. Let D(w) be the transform of the recorded and played

back signal, and U(w) the Fourier transform of the original and

undistorted signal, then:

U(w) = D(w) / H(w)

so we can restore U(w) by multiplying D(w) by the inverse of H(w). In

the case of the EGG signal, the traced EGG off the film is used as the

reference signal. Figure 4.3 illustrates this method and shows examples

of the EGG before and after correction. The use of the FM recorder

eliminated this problem.
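
The idea of this correction can be sketched numerically. The Python example below is not Berouti's implementation; it is a minimal illustration that estimates H(w) = D(w) / U(w) from a reference signal and its recording, then restores a recording by dividing its transform by H(w). The function names, the plain DFT, the small threshold `eps` that guards against division by near-zero bins, and the made-up distortion in the usage lines are all assumptions.

```python
import cmath

def dft(x):
    """Plain O(n^2) discrete Fourier transform; adequate for short records."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def estimate_transfer(reference, recorded, eps=1e-9):
    """H(w) = D(w) / U(w); bins where the reference has no energy are set to 1."""
    u, d = dft(reference), dft(recorded)
    return [dk / uk if abs(uk) > eps else 1.0 for uk, dk in zip(u, d)]

def restore(recorded, h, eps=1e-9):
    """U(w) = D(w) / H(w), followed by the inverse transform."""
    d = dft(recorded)
    return idft([dk / hk if abs(hk) > eps else dk for dk, hk in zip(d, h)])

# Hypothetical check: a square wave passed through a made-up distortion
# y[t] = x[t] - 0.5 x[t-1] (circular) is recovered from its own recording.
reference = [1.0] * 4 + [-1.0] * 4
recorded = [reference[t] - 0.5 * reference[t - 1] for t in range(8)]
h = estimate_transfer(reference, recorded)
restored = restore(recorded, h)
```

A square wave has no energy at its even harmonics, so H(w) cannot be estimated at those frequencies from the reference alone; the sketch simply leaves those bins uncorrected.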

The 60 Hz power line noise and the high frequency noise were

removed from the speech and EGG signals using a bandpass filter. The

filter used is a 351 point, linear phase FIR filter described in [74].

The transfer function of the filter is shown in Figure 4.4. Again, most

of the FM tape recordings did not contain this noise and did not need

filtering.
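
The filter of [74] is not reproduced here. As a rough illustration of what a 351 point linear phase FIR bandpass filter looks like, the Python sketch below designs one by the windowed-sinc method with a Hamming window, which is a substituted technique and not necessarily the one used in [74]. The band edges of 120 Hz and 4500 Hz are placeholders chosen for the example (the text does not state them); the 10 Khz sampling rate follows from the sampling clock described earlier.

```python
import math

def bandpass_fir(num_taps, f_low, f_high, fs):
    """Windowed-sinc bandpass design; symmetric taps give linear phase."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * (f_high - f_low) / fs
        else:
            ideal = (math.sin(2 * math.pi * f_high * t / fs)
                     - math.sin(2 * math.pi * f_low * t / fs)) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))  # Hamming
        taps.append(ideal * window)
    return taps

def gain_at(taps, freq, fs):
    """Magnitude of the filter's frequency response at `freq` Hz."""
    re = sum(h * math.cos(2 * math.pi * freq * n / fs) for n, h in enumerate(taps))
    im = sum(h * math.sin(2 * math.pi * freq * n / fs) for n, h in enumerate(taps))
    return math.hypot(re, im)

# Placeholder band edges; only the tap count and sampling rate come from the text.
taps = bandpass_fir(351, 120.0, 4500.0, 10000.0)
gain_60 = gain_at(taps, 60.0, 10000.0)      # power line component
gain_1k = gain_at(taps, 1000.0, 10000.0)    # mid-band
```

With these placeholder settings the 60 Hz component is strongly attenuated while mid-band content passes with a gain close to one; the symmetric taps delay all frequencies by the same 175 samples.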


4.4.2 High Speed Film Data

In this section we describe the method used to measure different

parameters associated with vibration of the vocal folds. It is

important to note that these measurements are done in a region of high

exposure rate of the film frames. In our study the pitch frequency,

which is directly related to the frequency of vocal fold vibration, never

exceeds 350 Hz. We were able to attain film speeds of up to 5000

frames/sec, which gives about 14 frames per cycle even at 350 Hz, so we

satisfy the sampling theorem and should have little or no aliasing in

the measured data.

Figure 4.3 Correction system for tape recorder phase

Figure 4.4 Magnitude frequency response of FIR linear phase filter.

The first step in the measurement is to locate the region of

constant 5000 frames/sec. Using an Athena 224-ES stop frame projector,

we marked the segment of the film where the number of 5 Khz square wave

cycles between successive 8 bit counter outputs was 100 or close to

it. This represents a film speed of approximately 5000 frames/sec.

Starting with this region, 150 frames were chosen and marked for

measurement.


Next, the glottal area between the vocal folds in each film frame

image was digitized. Over the years a number of semiautomated,

computerized systems for digitizing the glottal area have been developed

and used at the Mind-Machine Interaction Research Laboratory at the

University of Florida [72,73,75,76]. The system consists of a Vidicon

TV camera attached to a Spatial Data Systems EyeCom 108PT image

processing terminal. This terminal is interfaced to a Data General Nova

4 minicomputer and has the capability of displaying video images along

with superimposed graphics. The display screen is divided into 640x480

coordinate locations. A cursor, controlled by the joystick, can be

moved to any desired location on the screen. The cursor coordinates are

then transferred to the computer. Also, the terminal is capable of

digitizing images with a spatial resolution of 640x480 pixels and an

intensity resolution of 256 gray levels.

Measurements of the laryngeal film are carried out using the

following procedure. Using an Athena 224-ES stop frame projector, the

region of constant 5000 frames/sec is first located. Each frame in this

region is projected onto a 45° translucent screen. The glottal image

projected on this screen is scanned by the TV camera and displayed on

the EyeCom display terminal. Using a joystick cursor, the operator

measures the glottal width at five preselected locations and also the

length of the glottis. The 0.1 inch grid also is measured and stored

for scaling. The five marked locations on the image are connected by

straight lines by the software program to approximate the glottal

boundary. The glottal area bounded by these lines is computed by the

program. This approximation, no doubt, introduces noise in the measured

area, length, and glottal width.
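
The area computation can be sketched as the area of a polygon built from the marked points. The Python example below is an assumed reconstruction, not the original Nova program: it takes the two end points of the glottis and the left and right edge points at the five width locations, joins them with straight lines, applies the shoelace formula, and converts pixels to inches using the 0.1 inch grid. All names and the example coordinates are hypothetical.

```python
def polygon_area(vertices):
    """Shoelace formula for a simple polygon given as (x, y) vertices."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

def glottal_area(anterior, posterior, left_edge, right_edge,
                 pixels_per_grid, grid_inch=0.1):
    """Area in square inches of the straight-line glottal outline."""
    # Walk up the left edge, across the posterior end, and back down the right.
    outline = [anterior] + list(left_edge) + [posterior] + list(right_edge)[::-1]
    inch_per_pixel = grid_inch / pixels_per_grid
    return polygon_area(outline) * inch_per_pixel ** 2

# Hypothetical frame: half-widths of 1, 2, 3, 2, 1 pixels at five locations.
left = [(-1, 1), (-2, 2), (-3, 3), (-2, 4), (-1, 5)]
right = [(1, 1), (2, 2), (3, 3), (2, 4), (1, 5)]
area_px = polygon_area([(0, 0)] + left + [(0, 6)] + right[::-1])
area_in2 = glottal_area((0, 0), (0, 6), left, right, pixels_per_grid=10)
```

In this example the outline encloses 18 square pixels; with the 0.1 inch grid spanning 10 pixels, that corresponds to 0.0018 square inches.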

The EGG trace on the film was also digitized using this system.

Two points are measured on the EGG trace for every frame. The EGG

signal obtained from this procedure is referred to as the traced EGG.

Here, also, the traced EGG signal is noisy due to the limited spatial

resolution of the EyeCom terminal.

The limited resolution of the image system and the procedure used

in digitizing the various signals introduce noise in these measured

signals, as mentioned earlier. Consequently, these measurements have to

be suitably smoothed. Linear smoothing techniques alone are not

suitable for preserving important abrupt changes in the signals. In our

case, many of these signals have abrupt or sharp transitions due to the

nature of the vibrations of the vocal folds. Hence, a combination of

nonlinear median smoothing and linear smoothing as described in [77] was

used.
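
The following Python sketch illustrates why a median stage followed by a linear stage suits these signals. It is not the algorithm of [77]; the window lengths (a running median of five followed by a three point 0.25, 0.5, 0.25 weighted average) and the end-point handling are assumptions chosen for the example.

```python
from statistics import median

def median_smooth(x, width=5):
    """Running median; end points are padded by repeating the edge values."""
    half = width // 2
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(x))]

def hanning_smooth(x):
    """Three point linear smoother with weights 0.25, 0.5, 0.25."""
    padded = [x[0]] + list(x) + [x[-1]]
    return [0.25 * padded[i] + 0.5 * padded[i + 1] + 0.25 * padded[i + 2]
            for i in range(len(x))]

def smooth(x):
    """Nonlinear median stage followed by the linear stage."""
    return hanning_smooth(median_smooth(x))

# Hypothetical trace: one measurement outlier, then an abrupt transition.
trace = [0, 0, 9, 0, 0, 0, 1, 1, 1, 1]
after_median = median_smooth(trace)
after_both = smooth(trace)
```

The median stage removes the isolated outlier while leaving the abrupt transition in place; the linear stage then rounds the transition only slightly. A linear smoother alone would have smeared the outlier into its neighbors.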


4.5 Synchronization of the Measured Data

Section 4.2.1 laid out the theoretical procedure for aligning the

different measured signals using the recorded and photographed timing

signals. However, synchronization errors persisted between the measured

signals. These are primarily due to sampling errors during

digitization. The traced EGG obtained from the films, however, is in perfect

alignment with the film data. We can utilize this feature to fine tune

synchronization between the measured data using the following procedure:

(1) The digitized EGG from the tape recorder was shifted to

align with the traced EGG signal. This typically

involved shifts of less than ten samples.

(2) The recorded EGG and speech signals are assumed to be

perfectly aligned. Therefore, the speech signal is

shifted by the same number of samples as the recorded

EGG. Moreover, to account for the acoustic propagation

delay from the glottis to the microphone, the speech

signal is further shifted by four samples.

Perfect alignment between the speech and recorded

EGG is valid in the case of FM recording. However, this

assumption is probably violated when using stereo tape

recorders. Furthermore, the compensation for the

acoustic delay could be in error by as many as three

samples.
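
The two steps above can be sketched as a small search over candidate shifts. The criterion used in the study for judging alignment is not stated, so the mean squared error below is an assumption, as are the function names; the sketch also assumes that both EGG signals have been brought to the same sampling rate and amplitude scale. The search range of ten samples and the acoustic delay of four samples are the values quoted in the procedure.

```python
def best_shift(recorded, traced, max_shift=10):
    """Shift s minimizing the mean squared error of recorded[i + s] vs traced[i]."""
    best, best_err = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        pairs = [(recorded[i + s], traced[i])
                 for i in range(len(traced)) if 0 <= i + s < len(recorded)]
        if len(pairs) < len(traced) // 2:
            continue                      # too little overlap to judge
        err = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
        if err < best_err:
            best, best_err = s, err
    return best

def apply_shift(signal, shift, length):
    """Return `length` samples of `signal` starting at index `shift`."""
    if shift < 0 or shift + length > len(signal):
        raise ValueError("shifted window falls outside the signal")
    return [signal[i + shift] for i in range(length)]

ACOUSTIC_DELAY_SAMPLES = 4  # value quoted in step (2)

# Hypothetical data: the recording is the traced EGG delayed by three samples.
traced = [0, 1, 4, 9, 4, 1, 0, 0, 2, 5, 2, 0]
recorded = [7, 7, 7] + traced + [7, 7]
egg_shift = best_shift(recorded, traced)
aligned_egg = apply_shift(recorded, egg_shift, len(traced))
speech_shift = egg_shift + ACOUSTIC_DELAY_SAMPLES
```

The speech channel is then read starting at `speech_shift`, which is only as accurate as the assumption that the two recorded channels are aligned with each other.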

4.6 Error Sources

4.6.1 Film Data

There are three primary sources of errors associated with the data

measured from the high speed films:

(1) Incomplete or poor exposure of the vocal folds--One

always desires the entire folds and glottal area to be

visible on the film frames. However, in many instances

this is not the case. Usually, the anterior portion of

the folds is overshadowed by the epiglottis. In this

situation, the procedure was to exclude the invisible

portion rather than extrapolate the glottal contour

over it. This procedure was followed

throughout the entire set of high speed films. This, of

course, introduces a systematic error throughout the

measured film data.

(2) Limited resolution of the measurement system--The

relatively poor contrast of some of the films does not

allow taking full advantage of the display terminal

resolution, which is itself somewhat limited.

(3) Operator approximation errors--Digitizing high speed

films is a formidable task. It is estimated that only 5

to 6 minutes (25000-30000 frames) of high speed

laryngeal film have been processed in the entire

world. More than one operator, therefore, is usually

required to perform this task. In our system, the

operator subjectively locates the cursor at the end

points of the opening of the vocal folds. This

introduces variability between film frames measured by

different operators. Other errors are discussed more

fully in [73].

4.6.2 Digitized Tape Recordings

Initially, two stereo tape recorders were used when recording both

the EGG and speech signals. This introduced the following errors:

(1) Synchronization errors due to the sampling process.

(2) Errors due to the tape recorder distortion.

(3) Tape recorder speed variation.

The traced EGG signal was used to correct the recorded EGG signal

and significantly reduced the errors in the EGG signal. On the other

hand, the speech signal cannot be as adequately corrected.

Using the Vetter FM tape recorder eliminated the first two problems

but not the third one. However, the FM recorder has fairly stable

recording speeds.

4.7 Conclusions

This chapter explained our data collection and measurement

system. The collected data base includes data from normal and abnormal

subjects. The normal data base consisted of thirty-six tasks performed

by four subjects, nine tasks per subject. The data base includes the

digitized recorded speech and EGG signals and the traced EGG and area

function measured off the film. The patient data base consists of eight

tasks performed by four patients (refer to Table 4.1).

As mentioned earlier, the data collection and high speed

photography of patient subjects proved to be difficult. Although we

filmed more patients than indicated by Table 4.1, we had to discard

their data because their films were not measurable. The discarded films

were the ones with poor film exposure or obstructions to the folds'

image on the film frames. Also, sources of error in the digitized

recorded signals and signals measured off the film were discussed.