Citation
Towards automatic gender recognition from speech

Material Information

Title:
Towards automatic gender recognition from speech
Creator:
Wu, Ke ( Dissertant )
Childers, D. G. ( Thesis advisor )
Smith, J. R. ( Reviewer )
Arroyo, A. A. ( Reviewer )
Principe, J. C. ( Reviewer )
Rothman, H. B. ( Reviewer )
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
Copyright Date:
1990
Language:
English
Physical Description:
viii, 198 leaves : ill. ; 29 cm.

Subjects

Subjects / Keywords:
Bandwidth ( jstor )
Error rates ( jstor )
Fricative consonants ( jstor )
Gender identity ( jstor )
Glottal consonants ( jstor )
Phonemes ( jstor )
Reflectance ( jstor )
Signals ( jstor )
Spoken communication ( jstor )
Vowels ( jstor )
Automatic speech recognition ( lcsh )
Dissertations, Academic -- UF -- Electrical Engineering
Electrical Engineering thesis Ph. D.
Pattern recognition systems ( lcsh )
Sex differences ( lcsh )
Voiceprints ( lcsh )
City of Gainesville ( local )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )
theses ( marcgt )

Notes

Abstract:
The purpose of this research was to investigate the potential effectiveness of digital speech processing and pattern recognition techniques in the automatic recognition of gender from speech. Some hypotheses concerning acoustic parameters that may influence our ability to distinguish a speaker's gender were researched. The study followed two directions. One direction, coarse analysis, used classical pattern recognition techniques and asynchronous linear prediction coding (LPC) analysis of speech. Acoustic parameters such as autocorrelation, LPC, cepstrum, and reflection coefficients were derived to form test and reference templates. The effects of different distance measures, filter orders, recognition schemes, and phonemes were comparatively assessed. Comparisons of acoustic parameters using the Fisher's discriminant ratio criterion were also conducted. The second direction, fine analysis, used pitch synchronous closed-phase analysis to obtain accurate vowel characteristics for each gender. Detailed formant features, including frequencies, bandwidths, and amplitudes, were extracted by a closed-phase Weighted Recursive Least Squares with Variable Forgetting Factor method. The electroglottograph signal was used to locate the closed-phase portion of the speech signal. A two-way Analysis of Variance statistical analysis was performed to test the difference between the two genders' features, and the relative importance of grouped vowel features was evaluated by a pattern recognition approach. The results showed that most of the LPC derived acoustic parameters worked very well for automatic gender recognition. A within-gender and within-subject averaging technique was important for generating appropriate test and reference templates. The Euclidean distance measure appeared to be the most robust as well as the simplest of the distance measures. A statistical test indicated steeper spectral slopes for female vowels. Results suggested that redundant gender information was imbedded in the fundamental frequency and vocal tract resonance. Features of female voices were observed to have higher within-group variations than those of male voices. In summary, this study demonstrated the feasibility of an efficient gender recognition system. The importance of this system is that it would reduce the search space of speech or speaker recognition in half. The knowledge gained from this research might benefit the generation of synthetic speech with a desired male or female voice quality.
General Note:
Typescript.
General Note:
Vita.
Thesis:
Thesis (Ph. D.)--University of Florida, 1990.
Bibliography:
Includes bibliographical references (leaves 189-197).

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Ke Wu. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
23011922 ( oclc )
001583934 ( alephbibnum )


TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH


By

KE WU












A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY




UNIVERSITY OF FLORIDA


1990

To my parents

and to my wife


ACKNOWLEDGMENTS


The invaluable guidance, encouragement, and support I have received from
my adviser and committee chairman, Dr. D. G. Childers, during the years of my

graduate education are most appreciated. I am sincerely grateful for his direction,

insight, and patience throughout this dissertation research.

I would especially like to thank Dr. J. R. Smith, Dr. A. A. Arroyo, Dr. J. C.
Principe, and Dr. H. B. Rothman for their interest and participation in serving on my
supervisory committee and their productive criticism of my research project.
The partial support by the National Institutes of Health, National Science

Foundation, and University of Florida Center of Excellence Program is gratefully

acknowledged.
Special thanks are also extended to my fellow graduate students and other
members of the Mind-Machine Interaction Research Center for their friendship,
encouragement, and skillful technical help.

Last but not least, I am greatly indebted to my wife, Hong-gen, and my

parents for their love, support, understanding, and patience. My gratitude to them is
beyond description.


TABLE OF CONTENTS


ACKNOWLEDGMENTS

ABSTRACT

CHAPTER

1 INTRODUCTION

    1.1 Automatic Gender Recognition
    1.2 Application Perspective
    1.3 Literature Review
        1.3.1 Basic Gender Features
        1.3.2 Acoustic Cues Responsible for Gender Perception
        1.3.3 Summary of Previous Research
    1.4 Objectives of this Research
    1.5 Description of Chapters

2 APPROACHES TO GENDER RECOGNITION FROM SPEECH

    2.1 Overview of Research Plan
    2.2 Coarse Analysis
    2.3 Fine Analysis

3 DATA COLLECTION AND PROCESSING

    3.1 Database Description
    3.2 Speech and EGG Digitization
    3.3 Synchronization of Data

4 EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS

    4.1 Asynchronous LPC Analysis
        4.1.1 Linear Prediction Concepts
        4.1.2 Analysis Conditions
    4.2 Acoustic Parameters
        4.2.1 Autocorrelation Coefficients
        4.2.2 LPC Coefficients
        4.2.3 Cepstrum Coefficients
        4.2.4 Reflection Coefficients
        4.2.5 Fundamental Frequency and Formant Information
    4.3 Distance Measures
        4.3.1 Euclidean Distance
        4.3.2 LPC Log Likelihood Distance
        4.3.3 Cepstral Distortion
        4.3.4 Weighted Euclidean Distance
        4.3.5 Probability Density Function
    4.4 Template Formation and Recognition Schemes
        4.4.1 Purpose of Design
        4.4.2 Test and Reference Template Formation
        4.4.3 Nearest Neighbor Decision Rule
        4.4.4 Structure of Four Recognition Schemes
    4.5 Resubstitution and Leave-One-Out Procedures
    4.6 Separability of Acoustic Parameters Using Fisher's Discriminant Ratio Criterion
        4.6.1 Fisher's Discriminant and F Ratio
        4.6.2 Divergence and Probability of Error

5 RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS

    5.1 Coarse Analysis Conditions
    5.2 Performance Assessments
        5.2.1 Comparative Study of Recognition Schemes
        5.2.2 Comparative Study of Acoustic Features
            5.2.2.1 LPC Parameter Versus Cepstrum Parameter
            5.2.2.2 Other Acoustic Parameters
        5.2.3 Comparative Study Using Different Phonemes
        5.2.4 Comparative Study of Filter Order Variation
            5.2.4.1 LPC Log Likelihood and Cepstral Distortion Measure Cases
            5.2.4.2 Euclidean Distance Versus Probability Density Function
        5.2.5 Comparative Study of Distance Measures
        5.2.6 Comparative Study Using Different Procedures
        5.2.7 Variability of Female Voices
    5.3 Comparative Study of Acoustic Parameters Using Fisher's Discriminant Ratio Criterion
    5.4 Conclusions

6 EXPERIMENTAL DESIGN BASED ON FINE ANALYSIS

    6.1 Introduction
    6.2 Limitations of Conventional LPC
        6.2.1 Influence of Voice Periodicity
        6.2.2 Source-Tract Interaction
    6.3 Closed Phase WRLS-VFF Analysis
        6.3.1 Algorithm Description
        6.3.2 EGG Assisted Procedures
    6.4 Testing Methods
        6.4.1 Two-way ANOVA Statistical Testing
        6.4.2 Automatic Recognition by Using Grouped Features

7 EVALUATION OF VOWEL CHARACTERISTICS

    7.1 Vowel Characteristics of Gender
        7.1.1 Fundamental Frequency and Formant Features for Each Gender
        7.1.2 Comparison with Peterson and Barney's Results
        7.1.3 Results of Two-way ANOVA Statistical Test
        7.1.4 Results of T Statistical Test
        7.1.5 Discussion
    7.2 Relative Importance of Grouped Vowel Features
        7.2.1 Recognition Results
        7.2.2 Discussion
    7.3 Conclusions

8 CONCLUDING REMARKS

    8.1 Summary
    8.2 Future Research Extensions
        8.2.1 Short Term Extension
        8.2.2 Long Term Extension

APPENDICES

    A RECOGNITION RATES FOR LPC AND CEPSTRUM PARAMETERS

    B RECOGNITION RATES FOR VARIOUS ACOUSTIC PARAMETERS AND DISTANCE MEASURES

REFERENCES

BIOGRAPHICAL SKETCH


Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy


TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH


By

Ke Wu

May, 1990

Chairman: D. G. Childers
Major Department: Electrical Engineering

The purpose of this research was to investigate the potential effectiveness of

digital speech processing and pattern recognition techniques in the automatic

recognition of gender from speech. Some hypotheses concerning acoustic

parameters that may influence our ability to distinguish a speaker's gender were

researched.

The study followed two directions. One direction, coarse analysis, used
classical pattern recognition techniques and asynchronous linear prediction coding

(LPC) analysis of speech. Acoustic parameters such as autocorrelation, LPC,

cepstrum, and reflection coefficients were derived to form test and reference

templates. The effects of different distance measures, filter orders, recognition

schemes, and phonemes were comparatively assessed. Comparisons of acoustic

parameters using the Fisher's discriminant ratio criterion were also conducted.

The second direction, fine analysis, used pitch synchronous closed-phase

analysis to obtain accurate vowel characteristics for each gender. Detailed formant

features, including frequencies, bandwidths, and amplitudes, were extracted by a
closed-phase Weighted Recursive Least Squares with Variable Forgetting Factor

method. The electroglottograph signal was used to locate the closed-phase portion
of the speech signal. A two-way Analysis of Variance statistical analysis was

performed to test the difference between two gender features, and the relative
importance of grouped vowel features was evaluated by a pattern recognition
approach.

The results showed that most of the LPC derived acoustic parameters worked

very well for automatic gender recognition. A within-gender and within-subject

averaging technique was important for generating appropriate test and reference
templates. The Euclidean distance measure appeared to be the most robust as well
as the simplest of the distance measures.

The statistical test indicated steeper spectral slopes for female vowels. Results

suggested that redundant gender information was imbedded in the fundamental
frequency and vocal tract resonance. Features of female voices were observed to

have higher within-group variations than those of male voices.

In summary, this study demonstrated the feasibility of an efficient gender
recognition system. The importance of this system is that it would reduce the search
space of speech or speaker recognition in half. The knowledge gained from this

research might benefit the generation of synthetic speech with a desired male or

female voice quality.


CHAPTER 1
INTRODUCTION



1.1 Automatic Gender Recognition


Human listeners are able to capture and categorize the information of acoustic
speech signals. Categories include those that contribute a linguistic message, those
that identify the speaker, and those that convey clues about the speaker's
personality, emotional state, gender, age, accent, and the status of his/her health.
Automatic speech and speaker recognition systems are far less capable than
human listeners. Computerized speaker recognition can be accomplished but only
under highly constrained conditions. The major difficulty is that the number of
significant parameters is unmanageably large and little is known about the acoustic

speech features, articulation differences, vocal tract differences, phonemic
substitutions or deletions, prosodic variations and other factors that influence our
recognition ability.
Therefore, more insight into, and more systematic study of, intrinsically effective speaker
discrimination features are needed. A series of smaller experiments should be done
so that the experimental results will be mutually supportive and will lead to overall
understanding of the combined effects of all the parameters that are likely to be
present in actual situations (Rosenberg, 1976; Committee on Evaluation of Sound
Spectrograms, 1979).
Unlike automatic speech and speaker recognition, automatic gender
recognition was never proposed as a stand-alone problem. Little attention was paid
to either the theoretical basis or the practical techniques for the realization of a
system for the automatic recognition of gender from speech. Although

contemporary research on speech included investigation of physiological and

acoustic gender features and their correlation with perceived gender differences, no

attempt was made to classify the speaker's gender objectively, using features

automatically extracted by a computer. Childers and Hicks (1984) first proposed

such a study as a separate recognition task and, thus, this research resulted from that

proposal. A possible realization of such a system is shown in Figure 1.1.
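To make the idea concrete, here is a minimal sketch of the decision stage of such a system, assuming the three features named in the Figure 1.1 legend (FO, F1, BW1) have already been extracted; the reference templates are hypothetical placeholders, not values measured in this study.

```python
# Minimal nearest-template gender decision over hypothetical reference
# values; a sketch only, not the recognition schemes developed in Chapter 4.
import numpy as np

# Hypothetical per-gender reference templates: [F0 (Hz), F1 (Hz), BW1 (Hz)].
REFERENCES = {
    "male":   np.array([125.0, 620.0, 60.0]),
    "female": np.array([220.0, 720.0, 90.0]),
}

def classify_gender(features):
    """Pick the gender whose template is nearest in Euclidean distance,
    the simplest of the distance measures compared later in this study."""
    f = np.asarray(features, dtype=float)
    return min(REFERENCES, key=lambda g: np.linalg.norm(f - REFERENCES[g]))

print(classify_gender([118.0, 600.0, 55.0]))   # -> male
```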



1.2 Application Perspective

The significance of the proposed research is as follows:
o Accomplishing this task could facilitate speech recognition and

speaker identification or verification by reducing the required search

space to half. Such a pre-process may occur in the listening process

of human beings. One of the speech perception hypotheses proposed

by O'Kane (1987) stated that human listeners have to determine the
gender of the speaker first in order to determine the identity of the
sounds. Another perception hypothesis is that the identity of the

sounds can be roughly determined without knowledge of the

speaker's gender but final recognition is possible only after the

speaker's gender is known. In both cases, identification of the

speaker's gender is a necessary step before recognition of sounds.

o Accomplishing this task could be useful for speech synthesis. It is
well known that in synthesized speech, the female voice has not been

reproduced with the same level of success as the male voice (Monsen

and Engebretson, 1977). Further study of gender cues would

Where
    FO  -- fundamental frequency
    F1  -- first formant frequency
    BW1 -- first formant bandwidth

Figure 1.1 A possible automatic gender recognition system.

contribute to the solution of this problem since acoustic features for

synthesizing speech for either gender would be provided. Hence, the

voice quality of voice response systems and text-to-speech

synthesizers would be improved.
o Accomplishing this task could provide new guidelines and suggest

methods to identify the acoustic features related to dialect, age,
health conditions, etc.

o Accomplishing this task could be a unique or even the only approach for

some applications (e.g., law enforcement applications). In a

criminal investigation, an attempt is usually made to identify the

speaker on a recording as a specific person. If an individual is able
to deceive the investigator as to his gender, he may well prevent his

detection. It is well known that speakers can disguise their

speech/voice to confound or prevent detection (Hollien and

McGlone, 1976--cited by Carlson, 1981). The female impersonator
is an example of intentional deception of the listener. In such a

case, identification of the speaker's gender is critical.
o Finally, we presumed that the research results could benefit clinical

applications such as correction for a person with a voice disorder or

handicap. Other applications include transsexual voice changes, etc.

(Bralley et al., 1978; Carlson, 1981).


1.3 Literature Review

1.3.1 Basic Gender Features

The differences between male and female voices depend upon many factors.
Generally, there exist three types of parameters--physiological and acoustical

which are objective, and perceptual which is subjective (Figure 1.2).

[Figure 1.2 diagram: physiological parameters (vocal fold length & thickness; vocal tract length, area & shape), acoustic parameters (fundamental frequency; vocal tract features: formant frequency, bandwidth & amplitude; glottal volume velocity waveshape), and prosodic cues (intonation, stress, speaking rate), all contributing to gender discrimination.]

Figure 1.2 Basic gender features.








Many physiological parameters of the male and female vocal apparatus have
been determined and compared. Fant (1976) showed that the ratio of the total

length of the female vocal tract to that of a male is about 0.87, and Hirano et al.

(quoted by Cheng and Guerin, 1987) showed that the ratio of the length of the
female vocal fold to that of the male is about 0.8. Titze (1987 and 1989) reported
that, anatomically, the female larynx also differs from the male larynx in thickness,

angle of the thyroid laminae, resting angle of the glottis, vertical convergence angle

in the glottis, and in other ways. The ratio of the length and the ratio of the area of

pharynx cavity of the female to that of the male are 0.8 and 0.82, respectively.

Similarly, the female-to-male ratio is taken as 0.95 for the length and 1.0 for the

area of the oral cavity. The equal area ratio is due to the fact that the degree of

openness of the oral cavity is comparatively greater for the female than for the male (Ohman

quoted by Fant, 1966). Ohman also suggested that a proportionally larger female

mouth opening is a factor to consider. Figure 1.3 illustrates the human vocal

apparatus.

The differences in physiological parameters can lead to induced differences in
acoustical parameters. When comparing male and female formant patterns, the
average female formant frequencies are roughly related to those of the male by a

simple scaling factor that is inversely proportional to the overall vocal tract length.

On the average, the female formant pattern is said to be scaled upward in frequency

by about 20% compared to the average male formant pattern (Figure 1.4). It is also
well known that the individual size of the vocal cavities, and thus the formant

pattern scale factor, may vary appreciably depending upon the age and gender of the

speaker. Peterson and Barney (1952) measured the first three formant frequencies

present in ten vowels spoken by men, women, and children. They reported that male
formants were the lowest in frequency, women had a higher range, and children had


[Figure 1.3 labels: velum, tongue body, lips, tongue tip, jaw.]
Figure 1.3 A cross section of the human vocal apparatus.

[Figure 1.4 axes: amplitude (db) versus frequency (kHz), with male and female spectra.]
Figure 1.4 An example of male and female formant features.

[Figure 1.5 axes: pitch period (msec, 0.0 to 12.0) versus frame, with male and female traces.]
Figure 1.5 Fundamental frequency changes for two speakers for the utterance "We were away a year ago."

the highest. Carlson (1981) gave a survey of the literature on the vocal tract
resonance characteristics as a gender cue.
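Combining Fant's length ratio with the inverse dependence of formant frequencies on vocal tract length gives a quick consistency check on the quoted 20% figure (a back-of-the-envelope uniform-tube estimate, not a calculation made in the original text):

```latex
F_n^{\,female} \;\approx\; \frac{L_{male}}{L_{female}}\, F_n^{\,male}
  \;=\; \frac{F_n^{\,male}}{0.87} \;\approx\; 1.15\, F_n^{\,male}
```

Simple length scaling thus predicts roughly a 15% upward shift; the somewhat larger observed average (about 20%) is consistent with the nonuniform scaling noted above, since the female pharynx (ratio 0.8) is disproportionately shorter than the oral cavity (ratio 0.95).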
Fant (1966) has pointed out that the male and female vowels are typically

different in three groups:

1) rounded back vowels,

2) very open unrounded vowels, and
3) close front vowels.

The main physiological determinants of the specific deviations are that the ratio of

pharyngeal length to mouth cavity length is greater for males than for females and

the laryngeal cavities are more developed in males.

Schwartz and Rine (1968) also demonstrated that the gender of an individual
can be identified from voiceless fricative phonemes such as /S/, /F/, etc. This again is
induced by the vocal tract size differences between the genders.

The higher fundamental frequency (pitch) range of the female speaker is quite

well known. There is a general agreement that the fundamental frequency is an

important factor in the identification of gender from voice (Curry, 1940--cited by

Carlson 1981; Hollien and Malcik, 1967; Saxman and Burk, 1967; Hollien and Paul,
1969; Hollien and Jackson, 1973; Monsen and Engebretson, 1977; Stoicheff, 1981;

Horri and Ryan, 1981; Linville and Fisher; 1985; Henton, 1987). One often finds the

statement that the pitch level of the female speaking voice is approximately one

octave higher than that of the male speaking voice (Linke, 1973). However, there is

considerable discrepancy among values obtained by different investigators.

According to Hollien and Shipp (1972), the male subjects showed an intersubject
pitch range of 112-146 Hz. Stoicheff's (1981) data showed that the range for the
female subjects was 170-275 Hz. Titze (1989) found that the fundamental

frequency was scaled primarily according to the membranous lengths of the vocal

folds (scale factor 1.6). Figure 1.5 shows fundamental frequency changes for two








speakers for the utterance "We were away a year ago." Figure 1.6 shows the
corresponding speech signals.

The female voice is slightly weaker than the male voice. On the average the
root mean square (rms) intensity of glottal periods produced by female subjects is -6
db relative to comparable samples produced by males. A study by Karlsson (1986)
indicated a strong correlation between weak voice effort and constant air leakage
during closed-phase.

During the last few years, measuring the area of the glottis as well as
estimating the glottal volume-velocity waveform have become research topics of

interest (Holmberg et al., 1987). It is well known that the shape of the glottal
excitation wave is an important factor which can greatly affect speech quality
(Rothenberg, 1971). The wave shape produced by male subjects is typically

asymmetrical and frequently shows a prominent hump in the opening phase of the

wave (due to source-tract interaction). The closing portion of the wave generally
occupies 20%-40% of the total period and there may or may not be an easily
identifiable closed period (Monsen and Engebretson, 1977). Notable differences
between male and female waveforms are that the female waveform tends to be
symmetric. There is seldom a hump during the opening phase, indicating little or no

source-tract interaction, and both the opening and closing parts of the wave occupy

more nearly equal proportions of the period. Holmberg et al. (1987) found
statistically significant differences in male-female glottal waveform parameters. In

normal and loud voices, female waveforms indicated lower vocal fold closing
velocity, lower ac flow, and a proportionally shorter closed-phase of the cycle,

suggesting a steeper spectral slope for females. For softly spoken voices, spectral

slopes are more similar to those of males.
These glottal-source differences between male and female subjects are
understandable in terms of the relative size of male and female vocal folds. It is


Figure 1.6 Speech signals for (a) male and (b) female speakers for the utterance "We were away a year ago."

possible that the asymmetrical, humped appearance of the male glottal wave may be

due to a slightly out-of-phase movement of the upper and lower parts of each vocal

fold. If this is so, then the generally symmetrical appearance of the female glottal

wave may be due to the fact that the shorter female vocal folds come into contact

with each other more nearly as a single mass (Ishizaka and Flanagan, 1972).

The perceptual parameters or strategies used to make decisions concerning

male/female voices are not delineated in the literature even though making this

decision is a discrimination task performed routinely by human listeners. However,

it is hypothesized that a limited number of perceptual cues for classifying voices do

exist in the repertoire of listeners, and these cues may include some sociological

factors such as cultural stereotyping.

Singh and Murry (1978) and Murry and Singh (1980) investigated the

perceptual parameters of normal male and female voices. They found that the

fundamental frequency and formant structure of the speaker appeared to carry

significant information for all judgments. The listeners' judgments that the voices

they heard were female were more dependent on judged qualities of voice and effort.

Effort, pitch, and nasality were the perceptual parameters used to characterize
female voices while male voices were judged on the basis of effort, pitch, and

hoarseness. Their results suggested that listeners may use different perceptual
strategies to classify male voices than they use to classify female ones. Coleman

(1976) also suggested that there was a possibility of a gender-specific listener bias

for one acoustic characteristic or for one gender over the other.

Many researchers also believe melodic (intonation, stress, and/or

coarticulation) cues are speech characteristics associated with female voices.

Furthermore, the female voice is typically more breathy than the male voice. This

can be modeled by a dc shift in the glottal wave or, as suggested by Singh and Murry

(1978), is a result of a large number of pitch shifts. As the subject shifts pitch








direction frequently, complete vocal fold approximation is less probable. A study

of acoustic correlates of breathiness was performed by Klatt (1987) in
which three breathiness parameters (i.e., first harmonic amplitude, turbulence noise

and tracheal coupling) were proposed. A detailed discussion of controlling these

parameters was presented in Klatt's (1987) paper.

A new trend in finding the features responsible for gender identification is to

apply the approach of synthesis. The work done by Yegnanarayana et al. (1984),
Wu (1985), Childers et al. (1985a, 1985b, 1987, 1989), and Pinto et al. (1989)

represented this aspect. In their experiments, the speech of a talker of one gender

was converted to sound like that of a talker of the other gender to examine factors

responsible for distinguishing gender features. They found that the fundamental
frequency, the glottal excitation waveshape and the spectrum, which included

formant locations and bandwidth, overall spectral shape and slope, and energy, are

crucial control parameters.


1.3.2 Acoustic Cues Responsible for Gender Perception

As part of current interest in speaker recognition, investigators have sought to
specify gender-bearing attributes of the human voice. Under normal speaking and

listening circumstances, listeners have little difficulty distinguishing the voices of

adult males and females, suggesting that the acoustic parameters which underlie

gender identity are perceptually prominent. The judgment of adult gender is
strongly influenced by acoustic variables reflecting gender differences in laryngeal

size and mass as well as vocal tract length. However, the issue of which specific
acoustic cues are mostly responsible for gender identification has not been

definitively resolved. Such a controversy partially dominated the previous research.

A series of experiments run by Schwartz (1968) and Ingemann (1968)

employed voiceless fricatives spoken in isolation as auditory stimuli and it was found


that listeners could identify speaker gender accurately from these stimuli, especially
from /H/, /S/, and /SH/ (and could not from /F/ and /TH/). Ingemann reported that
the most identifiable fricative was /H/, with identification of others ranging down to
little better than chance. Since the laryngeal fundamental (FO) was not available to
the listeners, their findings suggest that accurate gender identification is possible
from vocal tract resonance (VTR) information alone and, therefore, that formants
are important cues for speaker gender identification.

Further support for this conclusion came from studies by Schwartz & Rine
(1968) and Coleman (1971). Schwartz and Rine's study revealed that the listeners

were able to identify the speaker's gender from two whispered vowels (/i/ and /a/).
They found 100% correct identification for /a/ and 95% correct identification for /i/,

despite the absence of the laryngeal fundamental. In Coleman's study on male and

female voice quality and its relationship to vowel formant frequencies, /i/, /u/, and a

prose passage were employed to explore listeners' gender identification abilities.
All stimuli were produced at the same FO (85 Hz) by means of an electrolarynx.
Coleman discovered that the judges correctly recognized the speaker gender 88% of

the time (with 98% correct for male voices and 79% for female voices), even when
the FO remained constant for all speakers. He also discovered that the vowel
formant frequency averages were closely associated with the degree of male or
female voice quality.

Coleman (1973a and 1973b) attempted to reduce the influence of possible

differences in rate, juncture, and inflection between male and female speakers by

presenting their voiced productions of prose passage backward to subjects. The

judgments should have, therefore, been based solely on VTR and FO information
which would be unaffected by the backward presentation. By correlation analysis
between measures of VTR, FO, and judgments of degree of male and female voice
quality in the voices of the speakers (with degree of correlation indicative of the








contribution of each of the vocal characteristics to listener judgments), he found that
listeners were basing their judgments of the degree of male or female voice quality

on the frequency of the laryngeal fundamental.

However, in a later study by Coleman (1976), there were inconsistent findings

from a pair of experiments concerned with a comparison of the contribution of two
vocal characteristics to the perception of male and female voice quality. The first
experiment, which utilized natural speech, indicated that the FO was very highly

correlated with the degree of gender perception while the VTR was less highly

correlated. When VTRs that were more characteristic of the opposite gender were

included experimentally in these voices, they did not affect the judges' estimates of

the degree of male or female voice quality. But, in the second experiment, when a

tone produced by a laryngeal vibrator was substituted for the normal glottal tone at
simulated FO representing both male (120 Hz) and female (240 Hz), and male and

female characteristics (i.e. vocal tract formants and laryngeal fundamentals) were

combined in the same voice experimentally, he found that the female FO was a weak

indicator of the female voice quality when it was combined with the male VTR

features although the male FO retained the perceptual prominence seen in the first
experiment. Thus, there was a difference in the manner that FO and VTR interact

for male and female perception.

Lass et al. (1976) conducted a study comparing listeners' gender identification

accuracy from voiced, whispered, and 255 Hz low-pass filtered isolated vowels.
They found that listener accuracy was greatest for the voiced stimuli (96% correct
out of 1800 identifications--20 speakers x 6 vowels x 15 listeners), followed by the

filtered stimuli (91% correct), and least accurate (75% correct) for the voiceless

vowels. Since the low-pass filtered vowels apparently eliminated formant

information, they concluded that the FO was a more important acoustic cue in
speaker gender identification tasks than the VTR characteristics of the speaker.








Lass et al. (1976) also reported that there were large gender differences in

their results. In all experimental conditions females were recognized at a
significantly lower level, which was in agreement with the results of Coleman (1971)
mentioned above. In another study supportive of this point, Brown and Feinstein

(1977) also used an electrolarynx (120 Hz) to control FO so that VTR was the variable.

Identification of male speakers was 84% correct and identification of female

speakers was 67% correct. Brown and Feinstein also found, as in the Coleman

(1971) study, that centralized spectra were more ambiguous to listeners. Again,

VTR appeared to play a determinant role in gender identification in the absence of
FO.

In a later experiment, the effect of temporal speech alterations on speaker
gender and race identification was investigated. Lass and Mertz (1978) found that

gender identification accuracy remained high and unaffected by temporal speech

alterations when the normal temporal features of speech were altered by means of

the backward playing and time compressing of speech samples. They concluded

that temporal cues appeared to play a role in speaker race, but not speaker gender

identification.

In another study concerned with the effect of phonetic complexity on speaker

gender identification, Lass et al. (1979) found that phonetic complexity did not

appear to play a major role for gender judgments. No regular trend was evident
from simple to complex auditory stimuli and listeners' accuracy was as great for

isolated vowels as it was for sentences.

In an attempt to investigate the relative importance of portions of the

broadband frequency speech spectrum in gender identification, Lass et al. (1980)
constructed three recordings representing the three experimental conditions in the

study: unfiltered, 255 Hz low pass filtered, and 255 Hz high pass filtered. The

recordings were played back to a group of 28 judges. The results of their judgments








indicated that gender identification was not significantly affected by such filtering;

listeners' accuracy in gender recognition remained high for all three experimental

conditions, showing that gender identification can be made accurately from acoustic

information available in different portions of the broadband speech spectrum.


1.3.3 Summary of Previous Research
A review of the literature shows that previous research revealed extensive

information about gender identification. However, it is clear that

much work still remains to be done.
What has not been completed

The relative importance of the FO versus VTR characteristics for
perceptual male or female voice quality is still controversial. The belief

that the FO is the strongest cue to gender seems to be substantiated by the

evidence. There is a hypothesis that in situations in which the role of FO

is diminished by deviancy, the effect of VTR characteristics upon gender

judgments increases from a minimal level to take on a large role equal to

and even sometimes greater than that played by FO (Carlson, 1981). But
this hypothesis remains unproven.

It is well known now that not only the vibration frequency of the

glottis (FO) but also the shape of the glottal excitation wave as well are
important factors which greatly affect speech quality (Rothenberg, 1971;

Holmes, 1973). Differences of glottal excitation wave shapes for male
and female were observed and investigated (Monsen and Engebretson,

1977; Karlsson, 1986; Holmberg and Hillman, 1987). But perceptual

justification of these characteristics was still limited (Carrell, 1981) and

the inverse filtering techniques need to be improved and more data

should be analyzed.








What was neglected

First of all, research on automatically classifying male/female

voices by using objective feature measurements was entirely missing.

Almost all previous work was concentrated on subjective testing which is

expensive, time- and labor-consuming, and subject dependent. Objective

gender recognition which is reliable, inexpensive, and consistent has not

been developed in parallel to subjective testing but such work is

necessary as we stated earlier.

Second, the influences of formant bandwidth and amplitude and

overall spectral shape on gender cues were not considered and

investigated. Traditionally, experiments on contribution of vocal tract

characteristics to gender perception were only concerned with formant

frequencies (Coleman, 1976). The bandwidths of the lowest formant

depend upon vocal tract wall loss and source-tract interaction (Rabiner

and Schafer, 1976; Rothenberg, 1981) while bandwidths of the higher

formants depend primarily upon the viscous friction, thermal loss, and

radiation loss (Flanagan, 1972). These factors may be different for each

gender so that the bandwidths and overall spectral shape are different for

each gender. Bladon (1983) pointed out that male vowels appeared to

have narrower formant bandwidths and perhaps also a less steeply

sloping spectrum. All these areas require further investigation.

What was the weakness

The acoustic features were obtained by short-time spectral

analysis which usually used analog spectrographic techniques.

Estimated FO and formant frequencies may be inaccurate due to








1. errors in determining the positions of the harmonic peaks (in

practice, the peaks were "read" by means of inspection by a

person and then the FO and formants were calculated).

2. errors in formant estimation due to the influences of the FO
and source-tract interaction.
3. large instrument errors (e.g., drift).

Lindblom (1962) estimated the accuracy of spectrographic

measurement to be approximately equal to the fundamental frequency

divided by 4. Flanagan (1955) and Nord and Sventelius (1979, quoted by

Monsen and Engebretson, 1983) suggested that a difference of about 50

Hz for the second formant and a difference of about 21 Hz for the first

formant were perceivable. Therefore, formant frequency estimation should

be as accurate as possible in vowel analysis as well as synthesis.

However, the most frequently referenced paper on acoustic phonetics,

which contains the most comprehensive measurements of the vowel

formants of American English (Peterson and Barney, 1952), may involve

measurement errors as pointed out by Monsen and Engebretson (1983),
especially for female and child subjects since the data were obtained by

spectrographic measurement.

The technique frequently employed to examine the ability of VTR

to serve as gender cue was to standardize the FO (and therefore eliminate
it as a variable) by utilizing an artificial larynx (Coleman, 1971, 1976;

Brown and Feinstein, 1977). This allows evaluation of VTR in a sample

that contains an FO that is the same for both male and female subjects.

The electrolarynx itself has an unnatural sound to it that may confuse the

listener and depress the overall accuracy of perception.








The study populations were relatively small for most

investigations. Sometimes the database used consisted of fewer than 10

subjects for each gender (Ingemann, 1968; Schwartz and Rine, 1968;

Brown and Feinstein, 1977), making the interpretation of the results

unreliable.

The results of the listening tests may depend on the gender

distribution of the testing panel because males and females may use

different judging strategies. However, this point usually was not

emphasized so that the conclusions claimed from listening tests may be

biased (Coleman, 1976; Carlson, 1981).

In summary, previous research has measured and investigated the

physiological or anatomical parameters for each gender. Under certain

assumptions, the relationship between anatomical parameters and some of the

acoustic features was established. The major acoustic parameters responsible for

perceptually discriminating a speaker's gender from voice were investigated and

tested. However, no attempt was made to automatically classify male/female voices

by objective feature measurements. The vowel characteristics for each gender were

inaccurate because of the weakness of analog techniques. Various hypotheses and

preliminary results need to be verified on a more comprehensive database. All these

constituted the underlying problems and the impetus for this research.


1.4 Objectives of this Research

This research sought to address these problems through two specific

objectives.

One objective of this study was to explore the possible effectiveness of digital

speech processing and pattern recognition techniques for an automatic gender

recognition system. Emphasis was placed on the investigation of various objective








acoustic parameters and distance measures. The optimal combination of these

parameters and measures was sought. The extracted acoustic features that are
most effective for classifying a speaker's gender objectively were characterized. Efficient
recognition schemes and decision algorithms for this purpose were developed.

The other objective of this study was to validate and clarify hypotheses
concerning some acoustic parameters affecting the ability of algorithms to

distinguish a speaker's gender. Emphasis was placed on extraction of accurate

vowel characteristics including fundamental frequency and formant features such as

formant frequency, bandwidth, and amplitude for each gender. The relative

importance of these characteristics for gender identification was evaluated.



1.5 Description of Chapters

In Chapter 2, an overview of the research plan is given and a brief description

of the coarse and fine analysis is presented. The database and the techniques

associated with data collection and preprocessing are discussed in Chapter 3. The

details of the experimental design based on coarse analysis are described in Chapter

4. Asynchronous LPC analysis is reviewed. Different acoustic parameters, distance

measures, template formation, and recognition schemes are provided. The

recognition decision rule and the resubstitution and leave-one-out procedures are proposed as
well. In addition, the concept of the Fisher's discriminant ratio criterion is reviewed.

The recognition performance based on coarse analysis is assessed in Chapter 5.

Results of comparative studies of various phonemes, acoustic features, distance

measures, recognition schemes, and filter orders are reported. The gender
separability of acoustic features is also analyzed by using the Fisher's discriminant

ratio criterion. Chapter 6 expounds on the detailed experimental design of fine

analysis. In particular, the advantages of pitch synchronous closed phase analysis are








demonstrated. A review of the closed phase WRLS-VFF (Weighted Recursive Least
Squares with Variable Forgetting Factor) analysis and the EGG (electroglottograph)

assisted approaches is also presented. Chapter 6 also introduces testing methods for

fine analysis, which include the two-way ANOVA (Analysis of Variance) statistical

test and the automatic recognition test using grouped features. Chapter 7 analyzes
the vowel characteristics such as fundamental frequencies and formant features for

each gender. Statistical tests and relative importance of grouped vowel features are

also discussed. Finally in Chapter 8, a summary of the results of this dissertation is

offered. Recommendations and suggestions for future research conclude this last

chapter.


CHAPTER 2
APPROACHES TO GENDER RECOGNITION FROM SPEECH



2.1 Overview of Research Plan


The goal of this study was to explore the possible effectiveness of digital

speech processing and pattern recognition techniques for an automatic gender

recognition system from speech. In order to do this, some hypotheses concerning

acoustic parameters that affect our ability to distinguish a speaker's gender
needed to be validated and clarified.

Thus, this study was divided into two directions as illustrated in Figure 2.1.

One direction was called coarse analysis since it applied classical pattern recognition

techniques and asynchronous linear prediction coding (LPC) analysis of speech.

The specific goal of this direction was to develop and test candidate algorithms for
achieving gender recognition rapidly using only a brief speech data record.

The second direction was called fine analysis since pitch synchronous

closed-phase analysis was utilized to obtain accurate vowel characteristics for each

gender. The specific aim of this direction was to compare the relative significance of

vowel characteristics for gender discrimination.


2.2 Coarse Analysis

The tool we used in this direction was asynchronous LPC analysis. The

advantages of using this technique are

Figure 2.1 The overall research flow.


1. The well-known linear prediction coding (LPC) vocoder is an

efficient vocoder which, when used as a model, encompasses the

features of the vocal source (except for the fundamental frequency) as

well as the vocal tract (Rabiner and Schafer, 1978). Since gender

features are believed to be included in both vocal source and tract,

satisfactory results would be expected using LPC derived

parameters.

2. The LPC all-pole model has a smoothed, accurate spectral envelope

matching characteristic, especially for vowels. Formant frequency

measurements obtained by LPC have also been found to compare

favorably to measures obtained by spectrographic analysis (Monsen

and Engebretson, 1983; Linville and Fisher, 1985). Thus it is

expected that features obtained by LPC would represent the spectral

characteristics of both genders more accurately.

3. The LPC model has been successfully applied in speech and speaker

recognition (Makhoul, 1975a; Atal, 1974b, 1976; Rosenberg, 1976;

Markel, 1977; Davis and Mermelstein, 1980; Rabiner and Levinson,

1981). Moreover, many related distortion or distance measurements

have been developed (Gray and Markel, 1976; Gray et al., 1980;

Juang, 1984; Nocerino et al., 1985) which could be conveniently

adopted for the preliminary experiments of gender recognition.

4. Deriving acoustic parameters from the LPC model is

computationally fast and efficient; only short data records are

needed. This is a very important factor in designing an automatic

gender recognition system.








In the coarse analysis, acoustic parameters such as autocorrelation, LPC,

cepstrum, and reflection coefficients were derived to form test and reference

templates. The effects of using different distance measures, filter orders,

recognition schemes, and phonemes were comparatively evaluated. Comparisons of

acoustic parameters using the Fisher's discriminant ratio criterion were also

conducted.
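
For a single acoustic parameter with per-gender means and variances, the Fisher discriminant ratio takes the standard two-class form (the generic textbook definition; Chapter 4 gives the dissertation's exact formulation):

```latex
F \;=\; \frac{\left(\mu_{male} - \mu_{female}\right)^{2}}{\sigma_{male}^{2} + \sigma_{female}^{2}}
```

Larger values of F indicate parameters whose class means are far apart relative to their within-class spread, and hence better gender separability.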

The linear prediction coding concepts and detailed experimental design based

on the coarse analysis will be given in Chapter 4.
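
The following snippet illustrates how these parameter families can be derived from one windowed speech frame using the standard textbook recursions (an illustrative sketch, not the dissertation's own implementation): the autocorrelation sequence, LPC and reflection coefficients via Levinson-Durbin, and the LPC-derived cepstrum.

```python
import numpy as np

def autocorr(frame, order):
    """Autocorrelation r[0..order] of a windowed speech frame."""
    frame = np.asarray(frame, dtype=float)
    return np.array([frame[: len(frame) - k] @ frame[k:] for k in range(order + 1)])

def levinson_durbin(r):
    """Levinson-Durbin recursion. Returns LPC coefficients a[1..p] for
    A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, plus reflection coefficients."""
    p = len(r) - 1
    a = np.zeros(p + 1)
    a[0] = 1.0
    k = np.zeros(p)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k[i - 1] = -acc / err
        a[1:i + 1] = a[1:i + 1] + k[i - 1] * np.concatenate((a[i - 1:0:-1], [1.0]))
        err *= 1.0 - k[i - 1] ** 2
    return a[1:], k

def lpc_cepstrum(a, n_ceps):
    """Cepstrum of 1/A(z) from the LPC coefficients (sign conventions vary
    across references; this follows A(z) = 1 + sum a_k z^-k)."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for m in range(1, n):
            if n - m <= p:
                acc -= (m / n) * c[m - 1] * a[n - m - 1]
        c[n - 1] = acc
    return c

# Example: one 200-sample frame of a synthetic tone, order-12 analysis.
frame = np.hamming(200) * np.sin(2 * np.pi * 0.07 * np.arange(200))
r = autocorr(frame, 12)
lpc, refl = levinson_durbin(r)
ceps = lpc_cepstrum(lpc, 12)
```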


2.3 Fine Analysis


The objective of the fine analysis was to study and compare the relative

significance of vowel characteristics responsible for gender discrimination.

As we know, male/female vowel characteristics are featured by formant

positions, bandwidths, and amplitudes so that accurate formant estimation is

necessary. It is important to pay particular attention to the measurement technique

and to the degree of accuracy which can be achieved through it. Although formant

features have been measured for a variety of different studies, the accuracy of these
measurements is still a matter of conjecture.

Formant estimation is influenced by (Atal, 1974a; Childers et al., 1985a;
Krishnamurthy and Childers, 1986):

o the effect of the periodic vocal fold excitation, especially when the

harmonic is near the formant.

o the effect of the excitation-spectrum envelope.

o the effect of time averaging over several excitation cycles in the

analysis when the vocal folds are repeatedly in open-phase (large








source-tract interaction) and closed-phase (little or no source-tract
interaction) conditions.
Frame based asynchronous LPC analysis cannot reduce the effect of

source-tract interaction because this technique uses windows that average the data

over several excitation epochs. The pitch synchronized closed phase covariance

(CPC) method can reduce the effect of source-tract interaction. However, in certain

situations, the vocal tract filter derived by this method may be unstable because of

the short closed glottal intervals, especially for females and children (Ting et al.,

1988).

Sequential adaptive analysis methods offer an attractive alternate processing
strategy since they overcome some of the drawbacks of frame-based analysis. The
closed-phase WRLS-VFF method that tracks the time-varying parameters of the

vocal tract and updates the parameters during the glottal closed phase interval can

reduce the formant estimation error. Experimental results (Ting et al., 1988; Ting,

1989) show that the formant tracking ability and formant estimation accuracy of the

WRLS-VFF algorithm are superior to those of the LPC based method. Detailed formant

features, including frequencies, bandwidths, and amplitudes in the fine analysis

stage were obtained by using this method. The EGG signals were used to assist in

locating the closed phase portion of the speech signal (Childers and Larar, 1984;
Krishnamurthy and Childers, 1986).
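
The sketch below caricatures the closed-phase tracking idea with a plain recursive least squares update gated by an EGG-derived closed-phase flag; the gating rule and fixed forgetting factor are simplifying assumptions, and the published WRLS-VFF algorithm (Ting et al., 1988) instead adapts the forgetting factor itself.

```python
# Schematic closed-phase RLS tracker (a simplified caricature of the
# WRLS-VFF idea, not Ting et al.'s published algorithm): an AR model of
# the vocal tract is re-estimated sample by sample, but the update is
# applied only while the EGG-derived flag marks the glottis as closed, so
# the estimates are not corrupted by open-phase source-tract interaction.
import numpy as np

def closed_phase_rls(speech, closed, order=10, lam=0.98):
    a = np.zeros(order)              # AR coefficients: s[t] ~ a . past samples
    P = 1e3 * np.eye(order)          # inverse autocorrelation estimate
    track = np.zeros((len(speech), order))
    for t in range(order, len(speech)):
        x = speech[t - order:t][::-1]            # [s[t-1], ..., s[t-order]]
        if closed[t]:                            # update in closed phase only
            e = speech[t] - a @ x                # a priori prediction error
            g = P @ x / (lam + x @ P @ x)        # RLS gain
            a = a + g * e
            P = (P - np.outer(g, x @ P)) / lam
        track[t] = a                             # coefficients held elsewhere
    return track
```

Formant frequencies and bandwidths can then be read off frame by frame from the roots of the tracked polynomial A(z), e.g. with numpy.roots.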

There were two approaches for testing the relative importance of various
vowel features for gender recognition:

Statistical tests. Since formant characteristics such as frequencies,

bandwidths, and amplitudes depend on or are influenced by two
factors (i.e., gender as well as vowels) and each experimental

subject produces more than one vowel, our experiments should be

referred to as two factor experiments having repeated measures on








the same subject (Winer, 1971). Therefore, two-way ANOVAs were
used to perform the statistical tests (a sketch follows this list). The significance of the
difference between each individual feature in terms of male/female

groups was analyzed.
Automatic recognition. First, individual or grouped features,

such as only the fundamental frequency or only the formant

frequencies or bandwidths (but from all formants), were used to

form the reference and test templates. Then automatic recognition

schemes were applied on these templates. Finally, the recognition

error rates for different features were compared.
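
As a sketch of the first approach (the column names and the tiny data table below are invented for illustration; they are not the dissertation's data), a two-way ANOVA on a single feature could be run as:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per (speaker, vowel) measurement of a hypothetical feature (F1, Hz).
df = pd.DataFrame({
    "gender": ["M", "M", "F", "F"] * 3,
    "vowel":  ["IY"] * 4 + ["AE"] * 4 + ["UH"] * 4,
    "f1":     [270, 290, 310, 330, 660, 680, 850, 870, 640, 650, 760, 780],
})

# Gender and vowel main effects plus their interaction. A full analysis of
# the dissertation's repeated-measures design would also model the subject.
model = ols("f1 ~ C(gender) * C(vowel)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```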

In Chapter 6, the detailed background of the closed phase WRLS-VFF method

and the experimental design based on fine analysis will be presented.


CHAPTER 3
DATA COLLECTION AND PROCESSING



3.1 Database Description


The database consists of speech and EGG data collected from 52 normal

subjects (27 males and 25 females), with speakers' ages varying from 20 to 80 years.

The speech and EGG signals were digitized directly and simultaneously.

Each subject read, after some practice, the following SAMPLE PROTOCOL that

includes 27 tasks.


SAMPLE PROTOCOL


Task  1. Count 1-10 with comfortable pitch & loudness.
Task  2. Count 1-5 with progressive increase in loudness.
Task  3. Sustain phonation of the vowel /IY/ in the word BEET.
Task  4. Sustain phonation of the vowel /I/ in the word BIT.
Task  5. Sustain phonation of the diphthong /AI/ in the word BAIT.
Task  6. Sustain phonation of the vowel /E/ in the word BET.
Task  7. Sustain phonation of the vowel /AE/ in the word BAT.
Task  8. Sustain phonation of the vowel /OO/ in the word BOOT.
Task  9. Sustain phonation of the vowel /U/ in the word BOOK.
Task 10. Sustain phonation of the diphthong /OU/ in the word BOAT.
Task 11. Sustain phonation of the vowel /OW/ in the word BOUGHT.
Task 12. Sustain phonation of the vowel /A/ in the word BACH.
Task 13. Sustain phonation of the vowel /UH/ in the word BUT.
Task 14. Sustain phonation of the vowel /ER/ in the word BURT.
Task 15. Sustain phonation of the whisper /H/ in the word HAT.
Task 16. Sustain phonation of the fricative /F/ in the word FIX.
Task 17. Sustain phonation of the fricative /TH/ in the word THICK.
Task 18. Sustain phonation of the fricative /S/ in the word SAT.
Task 19. Sustain phonation of the fricative /SH/ in the word SHIP.
Task 20. Sustain phonation of the fricative /V/ in the word VAN.
Task 21. Sustain phonation of the fricative /TH/ in the word THIS.
Task 22. Sustain phonation of the fricative /Z/ in the word ZOO.
Task 23. Sustain phonation of the fricative /ZH/ in the word AZURE.
Task 24. Produce chromatic scale on "la" (attempt to go up, then down
         as one effort -- pause between top 2 notes).
Task 25. Sentence "We were away a year ago."
Task 26. Sentence "Early one morning a man and a woman ambled along a
         one mile lane."
Task 27. Sentence "Should we chase those cowboys?"


A subset of the above was used in this research. It consisted of

1. Ten sustained vowels: /IY/, /I/, /E/, /AE/, /OO/, /U/, /OW/, /A/, /UH/, and /ER/. There were a total of 520 vowels for 52 subjects: 270 vowels from males and 250 vowels from females.

2. Five sustained unvoiced fricatives (including a whisper): /H/, /F/, /TH/, /S/, and /SH/. There were a total of 260 unvoiced fricatives for all subjects: 135 from males and 125 from females.

3. Four voiced fricatives: /V/, /TH/, /Z/, and /ZH/. There were a total of 208 voiced fricatives for all subjects: 108 from males and 100 from females.


3.2 Speech and EGG Digitization

All of the experimental data were collected with the subjects situated inside an Industrial Acoustics Company single wall sound booth. The speech was picked up with an Electro Voice RE-10 dynamic cardioid microphone, and the EGG signal was monitored by a Synchrovoice device. Amplification of the speech and EGG signals was accomplished with a Digital Sound Corporation DSC-240 Audio Control Console. The two channels were alternately sampled at 20 kHz by a Digital Sound Corporation DSC-200 Digital Audio Converter system with 16 bits of precision.








The low-pass, anti-aliasing and reconstruction filters of the DSC-200 were connected to the analog side of the converter. Both signals were bandlimited to 5 kHz by these passive elliptic filters, which have a specified minimum stopband attenuation of -55 dB and a passband ripple of 0.2 dB. The DSC-240 station provides audio signal interfacing to the DSC-200, which includes input and output buffering as well as level metering and signal path switching.

The utterances were directly digitized since this choice avoids any distortion

that may be introduced in the tape recording process (Berouti et al., 1977; Naik,

1984). An extender attached to the microphone kept the speaker's lips 6 inches

away. With the microphone and EGG electrodes in place, the researcher ran the

data collection program on a terminal inside the sound room. A two channel Tektronix Type 564B Storage Oscilloscope was connected to the DSC-240 so that both speech and EGG signals could be monitored. The program prompted the researcher by

presenting a list of commands on the screen. The researcher initiated digitization by

depressing the "D" key on the keyboard. Immediately after digitization, another
prompt indicated termination of the sampling process. The digitized utterance could

be played back and an option existed to repeat the digitization process if it was
thought that part of the utterance might have been spoken abnormally or the

digitized speech and EGG signals were unsatisfactory. For example, the speakers were instructed to repeat an utterance if the panel of experts sitting in the sound room, or the speaker, felt that it was rushed, mispronounced, too low, etc. The

entire protocol with utterances repeated as necessary took an average of 15-20

minutes to collect. About 150-200 seconds of speech and EGG were automatically

stored on disk. Thus, for each subject, about 12000-16000 blocks (512 bytes per

block) of data were collected.

Since the speech and EGG channels were alternately sampled, the resulting
file of digitized data had the two signals interleaved. The trivial task of








demultiplexing was performed off-line after data collection. Once the data were demultiplexed, the speech and EGG were trimmed to discard the unnecessary data before and after an utterance while keeping the onset and offset portions at each end of the data. After trimming, about 4500-6500 blocks of data were stored on disk for each subject.
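As an illustration of the demultiplexing step, a minimal sketch in Python follows. It assumes the raw file is a flat stream of 16-bit signed samples alternating speech, EGG, speech, EGG; the byte order, file layout, and function name are assumptions for illustration, not a specification of the original DSC-200 format.

    import numpy as np

    def demultiplex(raw_path):
        """Split an interleaved two-channel recording into speech and EGG.

        Each channel ends up at 10 kHz, since the two channels were
        alternately sampled at an aggregate rate of 20 kHz.
        """
        raw = np.fromfile(raw_path, dtype="<i2")  # 16-bit signed, little-endian (assumed)
        raw = raw[: len(raw) // 2 * 2]            # drop a trailing unpaired sample
        speech = raw[0::2]                        # even-indexed samples
        egg = raw[1::2]                           # odd-indexed samples
        return speech, egg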


3.3 Synchronization of Data


When the speech and EGG signals were used during the analysis stage, they
were time aligned to account for the acoustic propagation delay from the larynx to

the microphone. The microphone was kept a fixed 15.24 centimeters (6 inches)

away from the speakers' lips to reduce breath noises and to simplify the alignment
process. Synchronization of the waveforms had to account for the distance from the
vocal folds to the microphone. To do so, average vocal tract lengths of 17 cm for

males and 15 cm for females were assumed. The number of samples to discard

from the beginning of the speech record was then

# samples = Int[(32.24/34442) × 10000 + 0.5]    (3.1)

for males and

# samples = Int[(30.24/34442) × 10000 + 0.5]    (3.2)

for females.

Equations (3.1) and (3.2) show that a 10-sample correction is appropriate for males and a 9-sample correction is appropriate for females. Examination of the data also supported the use of these figures for adult speakers. Examples of aligned speech and EGG signals for a male and a female speaker are shown in Figure 3.1.
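A direct transcription of Equations (3.1) and (3.2), with the speed-of-sound and per-channel sampling-rate constants taken as printed, might look as follows (the helper name is illustrative only):

    def alignment_offset(tract_cm, mic_cm=15.24, c_cm_per_s=34442.0, fs=10000):
        """Number of samples to discard from the start of the speech channel
        so that it lines up with the EGG channel (Equations 3.1 and 3.2).
        tract_cm is the assumed vocal tract length: 17 cm for males,
        15 cm for females."""
        return int((tract_cm + mic_cm) / c_cm_per_s * fs + 0.5)

    male_offset = alignment_offset(17.0)    # Equation (3.1)
    female_offset = alignment_offset(15.0)  # Equation (3.2)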











Figure 3.1 Examples of aligned speech and EGG signals for (a) male and (b) female speakers.














CHAPTER 4
EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS




As stated in Chapter 2, coarse analysis applies classical pattern recognition techniques and asynchronous linear prediction coding (LPC) analysis of speech. The specific goal of this analysis was to develop and test algorithms to achieve rapid gender recognition from only a brief speech data record. Figure 4.1 shows the canonical pattern recognition model used in the gender recognition system. There are four basic steps in the model:

1) acoustic parameter extraction,
2) test and reference pattern or template formation,
3) pattern similarity determination, and
4) decision rule.

The input is the acoustic waveform of the spoken speech signal; the desired output is a "best estimate" of the speaker's gender in the input. Such a model can be a part of a speech or speaker recognition system or a front end processor of the system. The following discussion of the coarse analysis proceeds in the context of Figure 4.1.


4.1 Asynchronous LPC Analysis

4.1.1 Linear Prediction Concepts

Linear prediction, also known as the autoregressive (AR), all-pole model, or
maximum entropy model, is widely used in speech processing. This method has





























































Figure 4.1 A pattern recognition model for gender recognition from speech. (Input: speech signal; output: recognized gender.)








become the predominant technique for estimating the basic speech parameters (e.g., pitch, formants, spectra, vocal tract area functions) and for representing speech for low bit-rate transmission or storage. The method was first applied to speech processing by Atal and Schroeder (1970) and Atal and Hanauer (1971). For speech processing, the term linear prediction refers to a variety of essentially equivalent formulations of the problem of modeling the speech waveform (Markel and Gray, 1976; Makhoul, 1975b). These different formulations usually lead to similar results, but each has provided its own insight into the speech modeling problem, and the choice among them is generally dictated by computational demands.
The particular form of this model that is appropriate for this research is

depicted in Figure 4.2. In this case, the composite spectrum effects of radiation,

vocal tract, and glottal excitation are represented by a time-varying digital filter
whose steady-state system function is of the form


$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}$    (4.1)

This system is excited by an impulse train for voiced speech or a random noise
sequence for unvoiced speech.
The above system function can be written alternatively in the time domain as

$s(n) = \sum_{k=1}^{p} \alpha_k\, s(n-k) + G\, u(n)$    (4.2)

Let us assume we have available the past data samples from (n-p) to (n-1) and we predict the nth sample as a linear combination of these past samples,



















Figure 4.2 A digital model of speech production. (An impulse train generator, controlled by the pitch period, models voiced excitation; a random number generator models unvoiced excitation. The selected excitation, scaled by an amplitude gain, drives a time-varying digital filter whose coefficients are the vocal tract parameters, producing the speech samples.)








$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$    (4.3)

The error between the value of the actual nth sample and its estimate is

$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k)$    (4.4)

and equivalently,

$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + e(n)$    (4.5)

If the linear prediction model of Equation (4.5) conforms to the basic speech production model given by Equation (4.2), then

$e(n) = G\, u(n)$    (4.6)

$a_k = \alpha_k$    (4.7)


Thus the coefficients (a_k) identify the system whose output is s(n). The problem then is to determine the values of the coefficients (a_k) from the actual speech signal.

The criterion used to obtain the coefficients (a_k) is the minimization of the short-time average prediction error E with respect to each coefficient a_i, over some time interval, where

$E = \sum_{n} [e(n)]^2$    (4.8)


This leads to the following set of equations:

$\sum_{k=1}^{p} a_k \sum_{n} s(n-k)\, s(n-i) = \sum_{n} s(n)\, s(n-i), \qquad 1 \le i \le p$    (4.9)


For a short-time analysis, the limits of summation are finite. The particular
choice of these limits has led to two methods of analysis (i.e., the autocorrelation
method (Markel and Gray, 1976) and the covariance method (Atal and Hanauer,
1971)).

The autocorrelation method results in a filter structure that is guaranteed to be stable. However, it operates on a data segment that is windowed using a Hanning, Hamming, or other window, typically 10-20 msec long (two to three pitch periods).

The covariance method, on the other hand, gives a filter with no guaranteed stability, but requires no explicit windowing. Hence it is eminently suitable for pitch synchronous analysis.

One of the important features of the linear prediction model is that the combined contributions of the glottal flow, the vocal tract, and the radiation effect at the lips are represented by a single recursive filter. The difficult problem of
the lips are represented by a single recursive filter. The difficult problem of

separating the contribution of the source function from that of the vocal tract system
is thus completely avoided.
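To make the autocorrelation formulation concrete, the sketch below estimates the predictor coefficients of Equation (4.9) for one windowed frame with the Levinson-Durbin recursion; it is a minimal illustration under the conventions above, not the analysis program used in this study. The reflection (PARCOR) coefficients of Section 4.2.4 fall out of the recursion as a by-product.

    import numpy as np

    def lpc_autocorrelation(frame, order):
        """Autocorrelation-method LPC of one frame via Levinson-Durbin.
        Returns predictor coefficients a_1..a_p (document convention,
        s_hat(n) = sum_k a_k s(n-k)), the reflection coefficients, and
        the minimum total squared prediction error E."""
        w = frame * np.hamming(len(frame))
        r = np.correlate(w, w, mode="full")[len(w) - 1 : len(w) + order]
        a = np.zeros(order + 1)
        a[0] = 1.0                      # error-filter convention A(z) = 1 + ...
        E = r[0]
        k = np.zeros(order)
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
            k[i - 1] = -acc / E
            a_prev = a.copy()
            for j in range(1, i):
                a[j] = a_prev[j] + k[i - 1] * a_prev[i - j]
            a[i] = k[i - 1]
            E *= 1.0 - k[i - 1] ** 2
        return -a[1:], k, E             # flip signs to the predictor convention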


4.1.2 Analysis Conditions

In order to extract acoustic parameters rapidly, a conventional pitch asynchronous autocorrelation LPC method was used, which applied a fixed frame size, frame rate, and number of parameters per frame. These analysis conditions were:
Order of the filter: 8, 12, 16, 20
Analysis frame size: 256 points/frame








Frame overlap: None
Preemphasis Factor: 0.95

Analysis Window: Hamming

Data set for coefficient calculations: six frames total. The first two of these were picked from near the voice onset of an utterance, the next two from the middle of the utterance, and the last two from near the voice offset of the utterance. By averaging the six sets of coefficients obtained from these six frames, a template coefficient set was calculated for each sustained utterance such as a vowel, an unvoiced fricative, or a voiced fricative.



4.2 Acoustic Parameters

One of the key issues in developing a recognition system is to identify

appropriate features and measures which will support good recognition
performance. Several acoustic parameters were considered as feature candidates in

this study.


4.2.1 Autocorrelation Coefficients

They are defined conventionally as (Atal, 1974b)

$R(k) = \sum_{n=0}^{\infty} h(n)\, h(n+|k|)$    (4.10)

where h(n) is the impulse response of the filter. The relationship between the p autocorrelation coefficients and the p LPC coefficients is unique in that they can be obtained from each other (Rabiner and Schafer, 1978).










4.2.2 LPC coefficients

LPC coefficients are defined conventionally as (Rabiner and Schafer, 1978)

$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$    (4.11)

where s(n-k) is the (n-k)th speech sample, $\hat{s}(n)$ is the nth predicted output, and $a_k$ is the kth LPC coefficient. LPC coefficients are determined by minimizing the short-time average prediction error.


4.2.3 Cepstrum Coefficients

Cepstral coefficients can be obtained by the following recursive formula (Rabiner and Schafer, 1978):

$c_0 = 0, \qquad c_k = a_k + \sum_{i=1}^{k-1} \left(\frac{i}{k}\right) c_i\, a_{k-i}, \qquad 1 \le k \le p$    (4.12)

where $a_k$ is the kth LPC coefficient.
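A minimal transcription of the recursion in Equation (4.12), assuming the LPC coefficients are supplied in the same sign convention used above, is:

    import numpy as np

    def lpc_to_cepstrum(a):
        """Cepstral coefficients c_1..c_p from LPC coefficients a_1..a_p
        via the recursion of Equation (4.12); c_0 = 0."""
        p = len(a)
        c = np.zeros(p + 1)
        for k in range(1, p + 1):
            c[k] = a[k - 1] + sum((i / k) * c[i] * a[k - i - 1]
                                  for i in range(1, k))
        return c[1:]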


4.2.4 Reflection Coefficients

If we consider a model for speech production that consists of a concatenation
of N lossless acoustic tubes, then the reflection coefficients are defined as (Rabiner
and Schafer, 1978)

$r(k) = \frac{A(k+1) - A(k)}{A(k+1) + A(k)}$    (4.13)








where A(k) is the area of the kth lossless tube. The reflection coefficient determines the fraction of energy in a traveling wave that is reflected at each section boundary. Further, r(i) is related to the PARCOR coefficient k(i) by (Rabiner and Schafer, 1978)

$r(i) = k(i)$    (4.14)

where k(i) can be obtained from the LPC coefficients by recursion.


4.2.5 Fundamental Frequency and Formant Information
This set of features consists of the frequencies, bandwidths, and amplitudes of the first, second, third, and fourth formants, plus the fundamental frequency (not for fricatives). Formant information was obtained by a peak-picking technique, using an FFT on the LPC coefficients. The fundamental frequency was calculated with a modified cepstral algorithm.



4.3 Distance Measures


Several distance measures were considered.


4.3.1 Euclidean Distance


$D_{EUC} = \left[(X - Y)^t (X - Y)\right]^{1/2}$    (4.15)

where X and Y are the test and reference vectors, respectively, and t denotes the transpose of the vector.








4.3.2 LPC Log Likelihood Distance

It was proposed by Itakura (1975) and defined as

$D_{lpc}(a, \hat{a}) = \log\left[\frac{a R a^t}{\hat{a} R \hat{a}^t}\right]$    (4.16)

where a and $\hat{a}$ are the LPC coefficient vectors of the reference and test speech, and R is the matrix of autocorrelation coefficients of the test speech. An interpretation of this formula is given in Figure 4.3 below, in which the subscript r denotes reference and the subscript t denotes test.

The denominator term can be obtained by passing the test speech signal $s_t(n)$ through the inverse LPC filter of the test speech, $[H_t(z)]^{-1}$, giving the energy $\alpha$ of the error signal. Similarly, the numerator term can be obtained by passing the same test signal $s_t(n)$ through the inverse LPC filter of the reference speech, $[H_r(z)]^{-1}$, giving the energy $\beta$ of the error signal.
Thus we obtain

$D_{lpc}(a, \hat{a}) = \log(\beta / \alpha)$    (4.17)

It can also be shown that this distance measure is related to the spectral dissimilarity between the test and reference speech signals.

For computational efficiency, variables can be changed and Equation (4.16) or (4.17) can be rewritten as

$D_{lpc}(a, \hat{a}) = \log\left[\frac{\sum_{k=0}^{p} r(k)\, r_a(k)}{E}\right]$    (4.18)






Figure 4.3 An interpretation of the LPC log likelihood distance measure. (The test speech passed through the inverse test filter $[H_t(z)]^{-1}$ yields an error signal $e_t(n)$ with energy $\alpha = \hat{a} R \hat{a}^t$; passed through the inverse reference filter $[H_r(z)]^{-1}$ it yields an error signal $e_r(n)$ with energy $\beta = a R a^t$.)









where r(k) is the autocorrelation of the speech segment to be recognized, E is the total squared LPC prediction error associated with the estimates $\hat{a}(k)$ from this segment, and $r_a(k)$ is the autocorrelation of the true (reference) LPC coefficients. The block diagram for this computation is shown in Figure 4.4.

The spectral domain interpretation of Equation (4.18) is (Rabiner and Levinson, 1981)

$D_{lpc}(a, \hat{a}) = \log\left[\int_{-\pi}^{\pi} \left|\frac{H_r(e^{j\omega})}{H_t(e^{j\omega})}\right|^2 \frac{d\omega}{2\pi}\right]$    (4.19)


(i.e., an integrated square of the ratio of LPC spectra between reference and test
speech).
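The computational form of Equation (4.18) can be sketched as follows. The sketch assumes the reference template stores predictor coefficients a_1..a_p in the convention above; it folds the negative-lag terms of the coefficient autocorrelation into lags 1..p and solves the normal equations (4.9) for the test frame with a Toeplitz solver. Names are illustrative.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def coeff_autocorr(b, order):
        # autocorrelation of inverse-filter coefficients, negative lags folded in
        ra = np.array([np.dot(b[: len(b) - k], b[k:]) for k in range(order + 1)])
        ra[1:] *= 2.0
        return ra

    def itakura_distance(test_frame, a_ref, order):
        """LPC log likelihood distance of Equation (4.18)."""
        w = test_frame * np.hamming(len(test_frame))
        r = np.correlate(w, w, mode="full")[len(w) - 1 : len(w) + order]
        a_test = solve_toeplitz(r[:order], r[1 : order + 1])   # Equation (4.9)
        b_test = np.concatenate(([1.0], -a_test))              # test inverse filter
        b_ref = np.concatenate(([1.0], -np.asarray(a_ref)))    # reference inverse filter
        E = np.dot(coeff_autocorr(b_test, order), r)    # denominator: a_hat R a_hat^t
        num = np.dot(coeff_autocorr(b_ref, order), r)   # numerator: a R a^t
        return np.log(num / E)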


4.3.3 Cepstral Distortion

It can be defined as (Nocerino et al., 1985)

$D_{cep}(c, c') = \sum_{k=-n}^{n} (c_k - c'_k)^2$    (4.20)


It can be shown that the power spectrum corresponding to the truncated cepstrum is a smoothed version of the true log spectral density function. The fewer the cepstral coefficients used, the smoother the resultant log spectral density. It can also be shown that this truncated cepstral distortion measure is a good approximation to the L2 norm of the log spectral distortion measure between two time series, x(n) and x'(n),



















Figure 4.4 The block diagram of the LPC log likelihood distance computation. (The test speech segment passes through autocorrelation calculation and LPC analysis; the prediction error E is calculated; the autocorrelation and the reference template LPC coefficients feed the distance measure, which produces the output.)








$D_{L2} = \int_{-\pi}^{\pi} \left[\log|X(\omega)|^2 - \log|X'(\omega)|^2\right]^2 \frac{d\omega}{2\pi}$    (4.21)

where $X(\omega)$ is the Fourier transform of x(n). Gray and Markel (1976) showed that for LPC analysis with a filter order of 10 the correlation coefficient between $D_{L2}$ and $D_{cep}$ is 0.98, while for an order of 20 the correlation coefficient is 0.997.


4.3.4 Weighted Euclidean Distance


$D_{WEUC} = \left[(X - Y)^t\, W^{-1}\, (X - Y)\right]^{1/2}$    (4.22)

where X is the test vector, W is the symmetric covariance matrix obtained using a set of reference vectors (e.g., from a set of templates which represent subjects of the same gender), and Y is the mean vector of this set of reference vectors. The weighting compensates for correlation between features in the overall distance and reduces the intragroup variations. The weighted Euclidean distance is a simplified version of the likelihood distance measure using the probability density function discussed below.


4.3.5 Probability Density Function

$D_{PDF} = \left(\frac{1}{2\pi}\right)^{n/2} \frac{1}{|W|^{1/2}} \exp\left[-\frac{(X - Y)^t\, W^{-1}\, (X - Y)}{2}\right]$    (4.23)

where X, Y, and W are the same as in Equation (4.22) and |W| is the determinant of W. The decision principle of this distance measure minimizes the probability of error.
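For concreteness, the three template-space measures above can be sketched as follows; the reference cluster is represented by a matrix whose rows are same-gender templates, and the names are illustrative assumptions.

    import numpy as np

    def euclidean(x, y):
        """Equation (4.15)."""
        d = x - y
        return np.sqrt(d @ d)

    def weighted_euclidean(x, ref_templates):
        """Equation (4.22): x against the mean of a same-gender reference set."""
        Y = ref_templates.mean(axis=0)
        W = np.cov(ref_templates, rowvar=False)
        d = x - Y
        return np.sqrt(d @ np.linalg.inv(W) @ d)

    def pdf_score(x, ref_templates):
        """Equation (4.23); unlike a distance, the larger score wins."""
        Y = ref_templates.mean(axis=0)
        W = np.cov(ref_templates, rowvar=False)
        d = x - Y
        n = len(x)
        return ((2 * np.pi) ** (-n / 2) / np.sqrt(np.linalg.det(W))
                * np.exp(-0.5 * d @ np.linalg.inv(W) @ d))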








4.4 Template Formation and Recognition Schemes


4.4.1 Purpose of Design

Another important issue in developing a recognition system is the selection of

appropriate template formation and recognition schemes.
During initial exploratory studies of fixed-text recognition using spectral pattern matching techniques in the Pruzansky study (1963), the use of the long-term average technique to form a feature vector was discovered to have potential for free-text speaker recognition. The speaker recognition error rate was found to remain undegraded (at 11 percent) even after spectral amplitudes were averaged over all frames of speech data into a single reference spectral amplitude vector for each talker. Markel et al. (1977) demonstrated that the between-to-within speaker variation ratio was significantly increased by long-term averaging of the parameter sets (thus free-text).

Temporal cues also appeared not to play a role in speaker gender

identification (Lass and Mertz, 1978). They found that gender identification

accuracy remained high and unaffected by temporal speech alterations when the

normal temporal features of speech were altered by means of the backward playing

and time compressing of speech samples.

Therefore, we would reasonably believe that the gender information is

time-invariant. Thus, long-term averaging would also emphasize the speaker's

gender information and increase the between-to-within gender variation ratio. In

practice we would also achieve free-text gender recognition in which gender

identification would be determined before recognition of speech or speaker and

thus, reduce the speech or speaker recognition search space to half.

The purpose of using different test and reference template formation schemes

is to verify the hypothesis above and, if it is correct, to determine how much








averaging has to be performed to obtain the best gender recognition. In the

preliminary attempt, the averaging was first done within three classes of the sounds
(i.e., vowels, unvoiced fricatives, and voiced fricatives).


4.4.2 Test and Reference Template Formation

The averaging procedures used to create test and reference templates for the

present experiment employed a multi-level combination approach as illustrated in

Figure 4.5.

The lower layer templates were feature parameter vectors obtained from each

utterance by an LPC analysis as described in the last section. They can be

autocorrelation, LPC, or cepstrum coefficients, etc. A lower layer template

coefficient set was calculated by averaging six sets of coefficients obtained from six

frames for each sustained utterance such as a vowel, an unvoiced fricative, or a

voiced fricative for every subject.

The next level of combination averaged all templates in the lower layer for

each subject to form a single median layer template to represent this subject.

Templates of all utterances for the same phoneme groups (e.g., vowels or unvoiced

fricatives or voiced fricatives), were averaged.

In the last stage, the single remaining male and female templates were combined in the same manner as above. Each gender was represented by a single token (centroid) obtained by averaging all templates in the median layer.

It is evident that from the lower layer to the upper layer, a higher degree of averaging is achieved.
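The three-layer combination can be summarized in a short sketch; the data layout and names are assumptions for illustration.

    import numpy as np

    def form_templates(lower):
        """Multi-level template formation of Figure 4.5.

        lower: dict mapping (gender, subject_id) -> list of per-utterance
        feature vectors (the lower layer, each already averaged over six
        analysis frames). Returns the median layer (one template per
        subject) and the upper layer (one template per gender)."""
        median = {key: np.mean(np.stack(vecs), axis=0)
                  for key, vecs in lower.items()}
        upper = {}
        for gender in ("male", "female"):
            subj = [t for (g, _), t in median.items() if g == gender]
            upper[gender] = np.mean(np.stack(subj), axis=0)
        return median, upper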

Figure 4.6(a) shows two reflection coefficient templates of vowels for male and female speakers in the upper layer. The filter order was 12, so there were 12 elements in each template (vector). Each template can be considered a "universal token" representing each gender. The data in the figure are shown as means ± standard errors (SE), which were calculated from the median layer templates.













Figure 4.5 Test and reference template formation. (Templates for each utterance of subjects 1 through M form the lower layer; averaging them gives a template for each subject, the median layer; averaging those gives a template for each gender, the upper layer.)





















Figure 4.6 (a) Two reflection coefficient templates of vowels for male and female speakers in the upper layer. (b) The corresponding spectra.








We will see in the next chapter that by applying these two tokens as reference templates with recognition Scheme 3 and the Euclidean distance, a 100% gender recognition rate can be achieved. The result is not surprising if we notice that the within-gender variation for these reflection coefficients, as represented by the SE in the figure, was small compared to the between-gender variation. It is also easily noted that elements 1, 4, 5, 6, 7, 8, 9, and 10 of these reference templates account for most of the between-gender variation. On the other hand, elements 2, 3, 11, and 12 of these reference templates account for little between-gender variation and thus could be discarded to reduce the dimensionality of the vector. Figure 4.6(b) shows the spectra corresponding to the two "universal" reflection coefficient templates.
Similarly, Figures 4.7(a) and (b) show two reflection coefficient templates (with the same filter order of 12) and the corresponding spectra of unvoiced fricatives for male and female speakers in the upper layer. Interestingly, strong peaks are present in the "universal" female spectrum for unvoiced fricatives, but only a few ripples appear in the male spectrum. We will see later that by using these two tokens as reference templates with recognition Scheme 3 and the Euclidean distance, an 80.8% gender recognition rate can be achieved. Finally, Figures 4.8(a) and (b) are two cepstral coefficient templates (with the same filter order of 12) and the corresponding spectra of voiced fricatives for male and female speakers in the upper layer. It is shown later that by using these two tokens as reference templates with recognition Scheme 3 and the Euclidean distance, a 92.3% gender recognition rate can be achieved. The "universal" spectra for the two genders in Figures 4.6(b), 4.7(b), and 4.8(b) possess the basic properties for vowels, unvoiced fricatives, and voiced fricatives.


















Figure 4.7 (a) Two reflection coefficient templates of unvoiced fricatives for male and female speakers in the upper layer. (b) The corresponding spectra.



















Figure 4.8 (a) Two cepstral coefficient templates of voiced fricatives for male and female speakers in the upper layer. (b) The corresponding spectra.








For example, while the energy of the vowels is concentrated in the lower frequency portion of the spectrum, the energy of the unvoiced fricatives is concentrated in the higher frequency portion, and the energy of the voiced fricatives is more or less equally distributed.


4.4.3 Nearest Neighbor Decision Rule

The other major step in the pattern recognition model is the decision rule

which chooses which reference template most closely matches the unknown test

template. Although a variety of approaches are applicable, only two decision rules

have been used in most practical systems, namely, the K-nearest neighbor rule

(KNN rule) and the nearest neighbor rule (NN rule).

The KNN rule is applied when each reference class (e.g., gender) is

represented by two or more reference templates (e.g., as would be used to make the

reference templates independent of the speaker). The KNN rule operates as follows:

Assume we have M reference templates for each of the two genders, and for each template a distance score is obtained. If we denote the distance for the ith reference template of the jth gender as $D_{i,j}$ ($1 \le i \le M$ and $j = 1, 2$), this set of distance scores can be ordered such that

$D_{1,j} \le D_{2,j} \le \cdots \le D_{M,j}$    (4.24)


Then for the KNN rule we compute the average distance (radius) for the jth gender as

$r_j = \frac{1}{K} \sum_{i=1}^{K} D_{i,j}$    (4.25)

and we choose the index j* with the smallest average distance as the "recognized" gender

$j^* = \arg\min_{j=1,2} r_j$    (4.26)


When K is equal to 1, the KNN rule becomes the NN rule (i.e., it chooses the reference template with the smallest distance as the recognized template).

The importance of the KNN rule has been seen in word recognition when the number of reference templates per class is from 6 to 12, in which case it has been shown that a real statistical advantage is obtained using the KNN rule (with K = 2 or 3) over the NN rule (Rabiner et al., 1979). However, since there was no previous knowledge of the decision rule as applied to gender recognition, the NN rule was first used in this preliminary experiment.
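As a sketch, the KNN rule of Equations (4.24)-(4.26) reduces to a few lines; with K = 1 it is the NN rule used here. The argument layout is an assumption for illustration.

    import numpy as np

    def knn_gender(distances_by_gender, K=1):
        """distances_by_gender: dict mapping 'male'/'female' to the list of
        distances from the test template to that gender's M reference
        templates. Returns the recognized gender j* of Equation (4.26)."""
        radii = {g: np.mean(sorted(d)[:K])       # Equations (4.24)-(4.25)
                 for g, d in distances_by_gender.items()}
        return min(radii, key=radii.get)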


4.4.4 Structure of Four Recognition Schemes

To investigate how much averaging should be done for the test and reference

templates to gain the best performance for the gender recognizer, several

recognition schemes were designed. Table 4.1 presents a brief summary of these

schemes.



Table 4.1 Four recognition schemes

            Test template from    Reference template from

SCHEME 1    LOWER LAYER           MEDIAN LAYER
SCHEME 2    LOWER LAYER           UPPER LAYER
SCHEME 3    MEDIAN LAYER          UPPER LAYER
SCHEME 4    MEDIAN LAYER          MEDIAN LAYER




Scheme 1 is illustrated in Figure 4.9(a).











Figure 4.9 Structures of four recognition schemes: (a) Scheme 1, each test subject's lower layer templates are matched against median layer references; (b) Scheme 2, lower layer test templates against upper layer references; (c) Scheme 3, median layer test templates against upper layer references; (d) Scheme 4, median layer test templates against median layer references.








In the training stage, one test template for each test utterance (i.e., the lower layer) and one reference template for each subject (i.e., the median layer) were formed. The set of the entire median layer constituted the reference cluster that includes all median templates. In the testing stage, the distance measure for each lower layer template of all test subjects was calculated with respect to each of the median layer templates, and the minimum distance was found. The speaker gender of the lower layer utterance was then classified as male or female, according to the gender known for the median layer reference template.

Scheme 2 is illustrated in Figure 4.9(b). In the training stage, one test
template for each test utterance (i.e., the lower layer), and one reference template

for each gender (i.e., the upper layer), were formed. The upper layer constituted the

reference cluster that includes only two gender templates. In the testing stage, the
distance measure for each lower layer template of all test subjects was calculated

with respect to each of those upper layer templates and the minimum distance was
found. The speaker gender of the lower layer utterance was then classified as male

or female, according to the gender known for the upper layer reference template.

Figure 4.9(c) shows Scheme 3. In the training stage, one test template for each
test subject (i.e., the median layer), and one reference template for each gender (
i.e., the upper layer), were formed. The set of the entire median layer constituted
the test pool that includes all median templates. In the testing stage, the distance

measure for each median layer template of all test subjects was calculated with

respect to each of those upper layer templates and the minimum distance was found.

The speaker gender of the median layer template was then classified as male or
female, according to the gender known for the upper layer reference template.

Figure 4.9(d) shows Scheme 4. In the training stage, only the median layer templates were formed, and each subject was represented by a single template. The median layer constituted both the test and reference pools. In the testing stage, the Leave-One-Out or exclusive procedure (which is discussed in detail in the next section) was applied. The distance measure for each median layer template was calculated with respect to each of the rest of the median layer templates, and the minimum distance was found. The speaker gender of the test template was then classified as male or female, according to the gender known for the reference template. The above steps were repeated until all subjects were tested.



4.5 Resubstitution and Leave-One-Out Procedures


After the classifier is designed, it is necessary to evaluate its performance

relative to competing approaches. The error rate was considered as the performance

measure.

Four popular empirical approaches that count the number of errors when

testing the classifier with a test data set are (Childers, 1989):

The Resubstitution Estimate (inclusive). In this procedure, the same data set

is used for both designing and testing the classifier. Experimentally and

theoretically this procedure gives a very optimistic estimate, especially when the

data set is small. Note, however, that when a large data set is available, this method

is probably as good as any procedure.

The Holdout Estimate. The data is partitioned into two mutually exclusive

subsets in this procedure. One set is used for designing the classifier and the other

for testing. This procedure makes poor use of the data since a classifier designed on

the entire data set will, on the average, perform better than a classifier designed on

only a portion of the data set. This procedure is known to give a very pessimistic

error estimate.

The Leave-One-Out Estimate (exclusive). This procedure assumes that there

are n data samples available. Remove one sample from the data set. Design the

classifier with the remaining (n-1) data samples and then test it with the removed








data sample. Return the sample removed earlier to the data set. Then repeat the

above steps, removing a different sample each time, for n times, until every sample

has been used for testing. The total number of errors is the leave-one-out error
estimate. Clearly this method uses the data very effectively. This method is also

referred to as the Jack Knife method.

The Rotation Estimate. In this procedure, the data set is partitioned into n/d

disjoint subsets, where d is a divisor of n. Then, remove one subset from the design
set, design the classifier with the remaining data and test it on the removed subset,

not used in the design. Repeat the operation for n/d times until every subset is used

for testing. The rotation estimate is the average frequency of misclassification over

the n/d test sessions. When d = 1 the rotation method reduces to the leave-one-out method. When d = n/2 it reduces to the holdout method with the roles of the design and test sets interchanged. The interchanging of design and test sets is known in statistics as cross-validation in both directions. As we may expect, the properties of the rotation estimate fall somewhere between those of the leave-one-out method and the holdout method: the rotation estimate is less biased than the holdout estimate, and its variance is less than that of the leave-one-out estimate.

In order to use the database effectively, the leave-one-out procedure was
adopted for the experiments. For comparison, the resubstitution procedure was also

used in selected experiments.
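A minimal sketch of the leave-one-out procedure for the NN gender classifier of Scheme 4 follows; the data layout is assumed for illustration.

    def leave_one_out_error(templates, genders, distance):
        """Leave-one-out (exclusive) error rate of an NN gender classifier.

        templates: list of median-layer feature vectors, one per subject
        genders:   parallel list of 'male'/'female' labels
        distance:  callable, e.g. the Euclidean distance of Equation (4.15)"""
        errors = 0
        for i, x in enumerate(templates):
            # design set: every subject except the one under test
            j = min((k for k in range(len(templates)) if k != i),
                    key=lambda k: distance(x, templates[k]))
            errors += genders[j] != genders[i]
        return errors / len(templates)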



4.6 Separability of Acoustic Parameters
Using Fisher's Discriminant Ratio Criterion



4.6.1 Fisher's Discriminant and F ratio

After we chose acoustic feature candidates, we can see how well they separate

different genders by analytically studying the database. There are many measures








of separability which are generalizations of one kind or another of the Fisher's discriminant ratio concept (Childers et al., 1982; Parsons, 1986). The ratio usually serves as a criterion for selecting features for discrimination.

The ability of a feature to separate classes depends on the distance between
classes and the scatter within classes (generally there will be more than two classes).
This separation is estimated by representing each class by its mean and taking the
variance of the means. This variance is then compared to the average width of the

distribution for each class (i.e., the mean of the individual variances). This measure

is commonly called the F ratio:


$F = \frac{\text{Variance of the means (over all classes)}}{\text{Mean of the variances (within classes)}}$    (4.27)


The F ratio is reduced to Fisher's discriminant when it is used for evaluating a
single feature and there are only two classes. For this reason, the F ratio is also
referred to as the generalized Fisher's discriminant.

In the case of pattern recognition, there are vectors of features, f, and
observed values of f for all the classes we are interested in recognizing. Then two

covariance matrices can be calculated, depending on how the data are grouped.

First, the covariance for a single recognition class can be computed by selecting only feature measurements for class i. Let any vector from this class be $f_i$. Then the within-class covariance matrix for class i is

$W_i = \langle (f_i - \mu_i)(f_i - \mu_i)^t \rangle$    (4.28)

where $\langle\cdot\rangle$ represents the expectation or averaging operation and $\mu_i$ represents the mean vector for the ith class: $\mu_i = \langle f_i \rangle$. W stands for "within." Notice that each of








these covariance matrices describes the scatter within a class; hence it corresponds to one term of the average in the denominator of (4.27). If we make the common assumption that the vector $f_i$ is normally distributed, then $W_i$ is the covariance matrix of the corresponding probability density function:

$PDF_i(f) = \left(\frac{1}{2\pi}\right)^{n/2} \frac{1}{|W_i|^{1/2}} \exp\left[-\frac{(f - \mu_i)^t\, W_i^{-1}\, (f - \mu_i)}{2}\right]$

Then the denominator of the F ratio can be associated with the average of $W_i$ over all i; this is called the pooled within-class covariance matrix:

$W = \langle W_i \rangle$    (4.29)


Second, the variation within classes can be ignored and the covariance between classes can be found, representing each class by its centroid. The feature centroid for class i is $\mu_i$; hence the between-class covariance matrix is

$B = \langle (\mu_i - \mu)(\mu_i - \mu)^t \rangle$    (4.30)

where $\mu$ is the mean of $\mu_i$ over all classes. B stands for "between." Here we ignore the detailed distribution within each class and represent all the data for that class by its mean. Hence B describes the scatter from class to class regardless of the scatter within a class, and in that sense it corresponds to the numerator of (4.27).
Then the generalization we seek should involve a ratio in which the numerator
is based on B and the denominator on W, since we are looking for features with








small covariances within classes and large covariances between classes. Fukunaga (1972) lists four such measures, two of which are

$J_1 = \mathrm{trace}(W^{-1} B)$    (4.31)

and

$J_4 = \frac{\mathrm{trace}\, B}{\mathrm{trace}\, W}$    (4.32)

The motivation for these measures is clearer for $J_4$, since we know that the trace of a covariance matrix provides a measure of the total variance of its associated variables (Parsons, 1986). If the value of $J_4$ for one feature set is relatively greater than that for another, then there is apparently more scatter between classes than within classes for that feature set, and it is the better one for discrimination. $J_4$ tests this ratio directly. The motivation for $J_1$ is less obvious and will have to await the presentation of the material below.


4.6.2 Divergence and Probability of Error

The distance between two classes in feature space may also be evaluated by
divergence that is defined as the difference in the expected values of their

log-likelihood ratios (Kullback, 1959; Tou and Gonzales, 1974). This measure has

its roots in information theory (Kullback, 1959) and is a measure of the average

amount of information available for discriminating between class i and class k. It is
shown that for features with multivariate normal densities, the divergence is given

by


$D_{ik} = 0.5\, \mathrm{trace}\left[(W_i - W_k)(W_k^{-1} - W_i^{-1})\right] + 0.5\, \mathrm{trace}\left[(W_i^{-1} + W_k^{-1})(\mu_i - \mu_k)(\mu_i - \mu_k)^t\right]$    (4.33)







This can be related to more familiar material as follows. If the covariance matrices of the two classes are equal, so that $W_i$ and $W_k$ can be replaced by an average covariance matrix W, then the first term vanishes and the divergence reduces to

$D_{ik} = \mathrm{trace}\left[W^{-1}(\mu_i - \mu_k)(\mu_i - \mu_k)^t\right] = (\mu_i - \mu_k)^t\, W^{-1}\, (\mu_i - \mu_k)$

The term $(\mu_i - \mu_k)(\mu_i - \mu_k)^t$ is the between-class covariance matrix B; hence in this case $D_{ik}$ is the separability measure $J_1 = \mathrm{trace}(W^{-1}B)$.

Notice that $D_{ik}$, or $J_1$, is the Mahalanobis distance. This distance is related to the approximation of the expected probability of error (PE) by Lachenbruch (1968), Achariyapaopan and Childers (1983), and Childers (1986). If p is the dimension of the feature vector, $n_1$ and $n_2$ are the sample sizes for classes 1 and 2, and $\Phi(z)$ is the standard normal distribution function defined as


$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp(-0.5\, u^2)\, du$    (4.34)

PE can be written as

$PE = 0.5\, \Phi[-(\alpha - \beta)] + 0.5\, \Phi[-(\alpha + \beta)]$    (4.35)

where

$\alpha = \frac{C\, J_1}{\left[J_1 + p(n_1 + n_2)/(n_1 n_2)\right]^{1/2}}$    (4.36)

$\beta = \frac{C\, p\, (n_2 - n_1)}{\left[n_1 n_2 \left(J_1 n_1 n_2 + p(n_1 + n_2)\right)\right]^{1/2}}$    (4.37)

$C = 0.5 \left[\frac{(n_1 + n_2 - p - 2)(n_1 + n_2 - p - 5)}{(n_1 + n_2 - 3)(n_1 + n_2 - p - 3)}\right]^{1/2}$    (4.38)


For fixed training sample sizes $n_1$ and $n_2$ and vector dimension p, PE decreases as the Mahalanobis distance $J_1$ increases.

In the coarse analysis stage, the estimated $J_1$, $J_4$, and expected probabilities of error of the acoustic features ARC, LPC, FFF, RC, and CC, which were derived from the male and female groups (i.e., classes) in three phoneme categories, were studied. Training sample sizes were 27 ($n_1$) for the male group and 25 ($n_2$) for the female group, since median layer templates (one for each subject) were used to constitute the training sample pools. The feature vector dimension p was equivalent to the filter order selected. For each of the acoustic features ARC, LPC, FFF, RC, and CC in each of the three phoneme categories, the estimated $J_1$, $J_4$, and expected probability of error were computed as follows:
(1) Estimate the within-gender covariance matrix $W_i$ for each gender using Equation (4.28).
(2) Compute the pooled (averaged) within-gender covariance matrix W using Equation (4.29).
(3) Estimate the between-gender covariance matrix B using Equation (4.30).
(4) Obtain the values of $J_4$ and the Mahalanobis distance $J_1$ from matrices W and B using Equations (4.32) and (4.31).
(5) Finally, calculate the value of PE from $J_1$, $n_1$, $n_2$, and p using Equations (4.34) to (4.38), as in the sketch below.
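A sketch of steps (1)-(5) for the two-gender case follows. The class-weighting conventions follow the equations as printed above, the normal distribution function is evaluated with the error function, and all names are illustrative.

    import math
    import numpy as np

    def separability(male_feats, female_feats):
        """J1, J4, and expected probability of error PE for one feature set.
        Rows are subjects' median-layer templates; columns are features."""
        n1, n2 = len(male_feats), len(female_feats)
        p = male_feats.shape[1]
        W1 = np.cov(male_feats, rowvar=False)            # step (1), Eq. (4.28)
        W2 = np.cov(female_feats, rowvar=False)
        W = 0.5 * (W1 + W2)                              # step (2), Eq. (4.29)
        mu1, mu2 = male_feats.mean(0), female_feats.mean(0)
        mu = 0.5 * (mu1 + mu2)
        B = 0.5 * (np.outer(mu1 - mu, mu1 - mu)          # step (3), Eq. (4.30)
                   + np.outer(mu2 - mu, mu2 - mu))
        J1 = np.trace(np.linalg.inv(W) @ B)              # step (4), Eq. (4.31)
        J4 = np.trace(B) / np.trace(W)                   # Eq. (4.32)
        # step (5): Equations (4.36)-(4.38), then PE from Eq. (4.35)
        C = 0.5 * math.sqrt((n1 + n2 - p - 2) * (n1 + n2 - p - 5)
                            / ((n1 + n2 - 3) * (n1 + n2 - p - 3)))
        alpha = C * J1 / math.sqrt(J1 + p * (n1 + n2) / (n1 * n2))
        beta = (C * p * (n2 - n1)
                / math.sqrt(n1 * n2 * (J1 * n1 * n2 + p * (n1 + n2))))
        phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Eq. (4.34)
        PE = 0.5 * phi(-(alpha - beta)) + 0.5 * phi(-(alpha + beta))
        return J1, J4, PE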

The analytical results were also compared to the empirical ones obtained from
experiments using recognition schemes. Section 5.3 in the next Chapter presents a

detailed discussion.














CHAPTER 5
RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS




Our results showed that most of the LPC-derived feature parameters performed well for gender recognition. Among them, the reflection coefficients combined with the Euclidean distance measure were the best choice for sustained vowels (100%). While the cepstral distortion measure worked extremely well for unvoiced fricatives, the LPC log likelihood distortion measure, the reflection coefficients combined with the Euclidean distance, and the cepstral distortion measure were good alternatives for voiced fricatives. Using the Euclidean distance measure achieved better results than using the Probability Density Function. Furthermore, the averaging techniques were very important in designing appropriate test and reference templates, and a filter order of 12 to 16 was sufficient for most designs.


5.1 Coarse Analysis Conditions

Before we discuss in detail the performance assessments based on coarse
analysis, we briefly summarize the experimental conditions as follows:

Database

52 normal subjects: 27 male and 25 female

Phoneme groups used: ten sustained vowels
                     five unvoiced fricatives
                     four voiced fricatives








Analysis Conditions

Method: asynchronous autocorrelation LPC

Filter order: 8, 12, 16, 20

Analysis frame size: 256 points/frame

Frame overlap: None

Preemphasis Factor: 0.95

Analysis Window: Hamming

Data set for coefficient calculations: six frames total. Then

by averaging six sets of coefficients obtained from these six

frames, a template coefficient set was calculated for each

sustained utterance.

Acoustic Parameters

LPC coefficients (LPC)

Cepstrum Coefficients (CC)

Autocorrelation Coefficients (ARC)

Reflection Coefficients (RC)

Fundamental Frequency and Formant Information (FFF)
FFF was obtained by the 12th-order closed-phase WRLS-VFF method discussed in the next chapter.

Distance Measures

Euclidean Distance (EUC)

LPC Log Likelihood Distance (LLD)

Cepstral Distortion (same as EUC)

Weighted Euclidean Distance (WEUC)

Probability Density Function (PDF)

Decision Rule

Nearest Neighbor








Recognition Schemes

Scheme 1

Scheme 2

Scheme 3

Scheme 4

Counting Error Procedures

Inclusive (Resubstitution)

Exclusive (Leave-One-Out)

Parameters Based on Fisher's Discriminant Ratio Criterion

J4, J1, and the expected probability of error



5.2 Performance Assessments


Since the WEUC is a simplified case of the PDF, and the results produced by the WEUC were very similar to those produced by the PDF in our experiments, only the results obtained by the PDF will be discussed.

The complete results of the experiments are tabulated in Appendices A and B. Appendix A presents the recognition rates for the LPC log likelihood and cepstral distortion measures with various phoneme categories, recognition schemes, and filter orders. Inclusive procedures were performed only for the acoustic parameters with a filter order of 16.

Appendix B presents the recognition rates for various acoustic parameters

(ARC, LPC, RC, CC) combined with EUC or PDF distance measures with different
phoneme categories and filter orders. Notice that only recognition Scheme 3 was
used for these experiments. Again, inclusive procedures were only performed for

the acoustic parameters with a filter order of 16. Since calculation of the cepstral








distortion measure was done by using the EUC, the results of the CC combined with

the EUC were directly extracted from Appendix A.


5.2.1 Comparative Study of Recognition Schemes

Tables 5.1 and 5.2 show the condensed results selected from Appendix A,

using the LPC log likelihood and cepstral distortion measures respectively.

Recognition rates for the four exclusive recognition schemes with various filter

orders are included. Figures 5.1 and 5.2 are graphic illustrations of Tables 5.1 and

5.2.

By observing curves of Figures 5.1 and 5.2, it can be immediately seen that

higher recognition rates were achieved using recognition Schemes 3 and 4 for all the

cases, including various filter orders combined with different phoneme categories.

Among them, by applying Schemes 3 and 4 to voiced fricatives, over 90%

recognition rates were accomplished for all filter orders and both distortion

measures. The highest correct recognition rate, 98.1%, was obtained for Scheme 4

by using the LPC log likelihood measure with a filter order of 8. The same rates

were obtained for Scheme 3 by using the LPC log likelihood measure with a filter

order of 20 and using the cepstral distortion measure with filter orders of 12 and 16.

The results indicated the following:

1. Choosing appropriate template forming and recognition schemes was important in achieving high correct recognition rates. Particularly, the use of averaging techniques was critical, since the highest recognition rates were obtained by using Schemes 3 and 4, in both of which the test and reference templates were formed by averaging all the utterances from the same subject or even the same gender (Scheme 3). In contrast, Schemes 1 and 2, in which the test template was formed from a single utterance, performed worse.



















Table 5.1 Results from exclusive recognition schemes with various filter orders and the LPC log likelihood distortion measure

CORRECT RATE %
                        Order=8  Order=12  Order=16  Order=20
Sustained    Scheme 1    63.1     69.6      74.2      74.2
Vowels       Scheme 2    65.2     71.5      76.2      76.5
             Scheme 3    75.0     86.5      86.5      84.6
             Scheme 4    75.0     80.8      86.5      88.5
Unvoiced     Scheme 1    59.2     64.2      67.7      65.0
Fricatives   Scheme 2    61.5     63.9      64.2      64.2
             Scheme 3    67.3     75.0      75.0      78.9
             Scheme 4    76.9     75.0      73.1      69.3
Voiced       Scheme 1    74.5     72.1      73.1      72.6
Fricatives   Scheme 2    77.4     80.3      81.7      80.3
             Scheme 3    90.4     94.2      96.2      98.1
             Scheme 4    98.1     96.2      96.2      94.3


















Table 5.2 Results from exclusive recognition schemes with various filter orders and the cepstral distortion measure

CORRECT RATE %
                        Order=8  Order=12  Order=16  Order=20
Sustained    Scheme 1    61.3     68.3      69.4      70.6
Vowels       Scheme 2    69.4     67.3      70.0      72.1
             Scheme 3    82.7     92.3      90.4      90.4
             Scheme 4    90.4     92.3      92.3      88.5
Unvoiced     Scheme 1    61.2     65.8      63.9      64.6
Fricatives   Scheme 2    58.8     61.5      62.7      64.2
             Scheme 3    71.2     75.0      84.6      82.7
             Scheme 4    78.8     88.5      90.4      88.5
Voiced       Scheme 1    79.3     82.7      81.3      80.8
Fricatives   Scheme 2    75.5     82.2      84.6      85.1
             Scheme 3    94.2     98.1      98.1      96.2
             Scheme 4    92.3     92.3      92.3      90.4


































Figure 5.1 Results from exclusive recognition schemes with various filter orders and the LPC log likelihood distortion measure.

Figure 5.2 Results from exclusive recognition schemes with various filter orders and the cepstral distortion measure.








2. Averaging techniques seemed more crucial than clustering

techniques. Recognition theory states that choosing several

clustering centers for the same reference group should increase the

correct classification rate, because intragroup variations are taken

into account. In Schemes 1 and 4, multi-clustering reference centers

were formed. However, the theory functioned well in Scheme 4 but

inadequately in Scheme 1, although in Scheme 1, a number of

clustering centers were selected for the reference of the same

gender. Furthermore, Scheme 3 was a simplified version of Schemes 2 and 4. Instead of using each test vowel of the

subject as in Scheme 2, only a single test template was employed for

each test subject and a single reference template for each reference

gender. But the results were almost as good as those achieved by

using Scheme 4. In Scheme 2, averaging was performed over the

reference template but not over the test template. The correct

recognition rates were low. Thus, the results suggested the

importance of averaging techniques. To a great extent, averaging on

both test and reference templates eliminated the intrasubject

variation or diversity within different vowels or fricatives of a given

speaker, but on the other hand, emphasized features representing

this speaker's gender.

3. Since the averaging was applied to the acoustic parameters extracted

from different phonemes uttered by a speaker or speakers of the

same gender, and the phonemes were produced at different times,

the averaging is essentially a time-averaging technique.
Therefore, we would reasonably deduce that gender information

is time-invariant, phoneme independent, and speaker independent.








Because of this, averaging emphasized the speaker's gender
information and increased the between-to-within gender variation

ratio. In practice we would achieve free-text gender recognition in

which gender identification would be determined before recognition

of speech or speaker and thus, reduce the speech or speaker

recognition search space to half.
The conclusion is consistent with the findings by Lass and Mertz

(1978) that temporal cues appeared not to play a role in speaker

gender identification. As we cited earlier, in their listening tests they

found that gender identification accuracy remained high and

unaffected by temporal speech alterations when the normal temporal

features of speech were altered by means of the backward playing

and time compressing of speech samples.

We have shown in the previous section that use of the long-term

average technique to form a feature vector was discovered to have

potential for free-text speaker recognition in the Pruzansky study

(1963). Speaker recognition error rates remained undegraded even

after averaging spectral amplitudes over all frames of speech data

into a single reference spectral amplitude vector for each talker.

Markel et al. (1977) also demonstrated that the between-to-within

speaker variation ratio was significantly increased by long-term averaging of the parameter sets (thus text-free). Here we found that this rule also applied to gender recognition.
4. In terms of Scheme 3 versus Scheme 4, neither was obviously

superior. However, from a practical point of view, Scheme 3 would

be easier to realize since only two reference templates are needed.








5. In further experiments, different weighting factors could be applied

to different phoneme feature vectors according to the probabilities of

their appearances in real situations. In this way, time-averaging would be better approximated.


5.2.2 Comparative Study of Acoustic Features

5.2.2.1 LPC Parameter Versus Cepstrum Parameter

Although both LPC log likelihood and cepstral distortion measures were

effective tools in classifying male/female voices, the performance of the latter was

better than the former.

1. By comparing Figures 5.1 and 5.2, it is noted that except in the

category of voiced fricatives, in which the performances were

competitive (both measures were able to achieve recognition rates of

98.1%), cepstrum coefficient features proved to be more sensitive

than LPC coefficients for gender discrimination. By choosing

appropriate schemes and filter orders the recognition rates for the

cepstral distortion measure reached 92.3% for vowels (Scheme 3

with a filter order of 12 and Scheme 4 with filter orders of 12 and

16), 90.4% for unvoiced fricatives (Scheme 4 with a filter order of

16). By using the LPC log likelihood distortion measure, the

corresponding highest recognition rates were 88.5% for vowels

(Scheme 4 with a filter order of 20) and 78.9% for unvoiced

fricatives (Scheme 3 with a filter order of 20).

2. By comparing tables in Appendix A, it is noted that the cepstral

distortion measure operated more evenly between male and female

groups, showing this feature has some "normalization

characteristics". As seen in Table A.1, there existed large








differences between male and female recognition rates for the LPC

recognizer with a filter order of 16. The largest gaps came from

Scheme 1 of the LPC. The differences were about 19% for vowels,

24% for unvoiced fricatives, and 15% for voiced fricatives. On the
other hand, the cepstral distortion measure worked evenly with the
same filter order. Table A.7 shows that the largest gaps were 6.6%
for vowels, 1.8% for unvoiced fricatives, and 2.4% for voiced

fricatives. Similar situations held for the results shown in other
tables and with inclusive schemes.

5.2.2.2 Other Acoustic Parameters

Tables 5.3 and 5.4 demonstrate results from exclusive recognition Scheme 3
with various filter orders and other acoustic parameters, using EUC and PDF

distance measures respectively. Figures 5.3 and 5.4 are graphic illustrations of

Tables 5.3 and 5.4.

1. The overall performance using RC and cepstrum coefficients was

better than that achieved using ARC and LPC coefficients, when the
Euclidean distance measure was adopted. The following
observations were made:

o The RC functioned extremely well with sustained vowels.

The recognition rates remained 100% for filter orders of 12,

16, and 20, showing that RC features captured gender

information from vowels effectively. The results were also

stable and filter order independent, as long as the filter order

was above 8. Table 5.3 shows that a 98.1% recognition rate

was reached by using the FFF, which was obtained using the

closed-phase WRLS-VFF method with a filter order of 12.



















Table 5.3 Results from the exclusive recognition Scheme 3 with various filter orders, acoustic parameters, and the Euclidean distance measure

CORRECT RATE %
                    Order=8  Order=12  Order=16  Order=20
Sustained    ARC     78.8     78.8      78.8      82.7
Vowels       LPC     73.1     78.8      80.8      80.8
             FFF     N/A      98.1      N/A       N/A
             RC      88.5     100.0     100.0     100.0
             CC      82.7     92.3      90.4      90.4
Unvoiced     ARC     75.0     75.0      75.0      75.0
Fricatives   LPC     80.8     69.2      71.2      71.2
             RC      80.8     80.8      80.8      80.8
             CC      71.2     75.0      84.6      82.7
Voiced       ARC     86.5     88.5      86.5      88.5
Fricatives   LPC     92.3     92.3      92.3      90.4
             RC      94.2     96.2      96.2      96.2
             CC      94.2     98.1      98.1      96.2




















Table 5.4 Results from the exclusive recognition Scheme 3 with various filter orders, acoustic parameters, and the PDF (Probability Density Function) distance measure

CORRECT RATE %
                    Order=8  Order=12  Order=16  Order=20
Sustained    ARC     80.8     84.6      88.5      67.3
Vowels       LPC     84.6     98.1      92.3      80.8
             FFF     N/A      96.2      N/A       N/A
             RC      88.5     98.1      92.3      67.3
             CC      78.8     94.2      90.3      75.0
Unvoiced     ARC     69.2     65.4      57.7      N/A
Fricatives   LPC     78.8     86.5      78.8      53.8
             RC      78.8     73.1      67.3      55.8
             CC      80.8     73.1      69.2      57.7
Voiced       ARC     88.5     86.5      82.7      59.6
Fricatives   LPC     92.3     94.2      94.2      71.2
             RC      92.3     90.4      90.4      75.0
             CC      92.3     92.3      80.8      71.2
















Figure 5.3 Results from the exclusive recognition Scheme 3 with various filter orders, acoustic parameters, and the Euclidean distance measure.

Figure 5.4 Results from the exclusive recognition Scheme 3 with various filter orders, acoustic parameters, and the PDF distance measure.








o The CC worked very well with unvoiced fricatives. The highest recognition rate was 84.6%, with a filter order of 16.
o Both RC and CC operated extremely well with voiced fricatives, and the results remained stable across all the filter orders that were tested. The results generated by the CC were slightly better than those of the RC with filter orders of 12 and 16 (98.1% versus 96.2%).

2. When the PDF distance measure was adopted, using LPC coefficients was a good choice. The highest recognition rates, 98.1%, 86.5%, and 94.2%, for the three phoneme categories all came from LPC coefficients with a filter order of 12 (also for voiced fricatives with a filter order of 16). However, it is noted that the results from the PDF distance measure were highly affected by filter order.


5.2.3 Comparative Study Using Different Phonemes

The results of Figures 5.1, 5.2, 5.3, and 5.4 also indicated that vowels, unvoiced fricatives, or voiced fricatives could each be used to objectively classify a speaker's gender. As we have seen, for the vowel category the reflection coefficients worked extremely well: the recognition rates reached 100% with filter orders of 12, 16, and 20. A 98.1% recognition rate could also be accomplished by using LPC coefficients and the FFF. Surprisingly, the cepstral distortion measure could be used to discriminate the speaker's gender from unvoiced fricatives, and a 90.4% recognition rate was obtained. For the voiced fricative category, a 98.1% recognition rate was achieved by using the LPC log likelihood and cepstral distortion measures. Therefore, in terms of the "most effective" phoneme category for gender recognition, the first preference would be vowels, the second voiced fricatives, and the last unvoiced fricatives.








As discussed in Chapter 4, the acoustic parameters used in the coarse analysis were derived from the LPC all-pole model, which has a spectral matching characteristic. It is known that the LPC log likelihood and cepstral distortion measures are directly related to the power spectral differences of the test and reference signals. Thus, the results indicated that spectral characteristics were major factors in distinguishing the speaker's gender. Our results also suggested that significant differences did exist between the spectral characteristics of unvoiced fricatives for the two genders, indicating that the speaker's gender could be distinguished based only on vocal tract characteristics, since no vocal fold information existed in unvoiced fricatives. Moreover, when some of the vocal fold information was combined with vocal tract characteristics, as in the vowel and voiced fricative cases, gender discrimination improved, even though in both of these cases the fundamental frequency information was not included.


5.2.4 Comparative Study of Filter Order Variation

5.2.4.1 LPC Log Likelihood and Cepstral Distortion Measure Cases

1. By observing the resultant curves of Schemes 1 and 2 in Figures 5.1 and 5.2, a general trend is easily noted: the recognition rates generally improved with higher order filters.
2. However, this trend was not observed for Schemes 3 and 4, and individual inspection had to be made for specific cases. The recognition rates for Schemes 3 and 4 together with the LPC log likelihood and cepstral distortion measures were considered first. Notice that there are a total of 12 rates for each of the filter orders.
o Comparing the recognition rates between filter orders of 8 and 12, 9 out of 12 rates increased, 2 decreased (Scheme 4 for voiced and unvoiced fricatives with the LPC log likelihood distance), and 1 tied (Scheme 4 for unvoiced fricatives with the cepstral distortion measure). This demonstrated that performance improved from filter order 8 to 12.

o Comparing the recognition rates between filter orders of 12 and 16, 4 out of 12 rates dropped, 2 increased (Scheme 4 for unvoiced fricatives with the LPC distance measure and Scheme 3 for sustained vowels with the cepstral distortion measure), and the remaining 6 were equal. This indicated that with Scheme 3 or 4 there was no distinct difference between filter orders of 12 and 16.
o Comparing the recognition rates between filter orders of 16 and 20, 8 out of 12 rates dropped, 3 increased, and 1 tied. Performance degraded from filter order 16 to 20.

o If only the cepstral distortion measure was applied (Figure 5.2), the highest recognition rates appeared at filter orders of 12 and 16 for all three phoneme categories. However, if only the LPC log likelihood distortion was used (Figure 5.1), the highest recognition rates were reached with filter orders of 8 and 20; hence there was no manifest trend in performance for the LPC log likelihood distortion measure across filter orders. Since the cepstral distortion measure showed better results than the LPC log likelihood, filter orders of 12 to 16 seemed to be the best options for the overall design.








5.2.4.2 Euclidean Distance Versus Probability Density Function

1. Examining Figure 5.3, it is seen that with the Euclidean distance measure the recognition rates increased slightly from filter order 8 to 12, with the exception of the LPC for unvoiced fricatives. Recognition rates with a filter order of 12 were almost the same as with 16 and 20; no specific trend was observed. Except for the ARC applied to the vowel category, all other performances reached their peaks with a filter order of either 12 or 16. It can be concluded that with the EUC distance measure the best choice of filter order would be in the range of 12 to 16.

2. However, Figure 5.4 shows a different case. Inspecting Figure 5.4, it is immediately concluded that with the PDF, gender recognition rates varied considerably with the filter order. The overall trend for the vowel category is that recognition rates increased from a filter order of 8, reached their peak with a filter order of 12, and then decreased. One exception is that with the RC the recognition rate reached its peak with a filter order of 16. For voiced and unvoiced fricatives, all acoustic parameters except the LPC showed decreasing recognition rates from a filter order of 8 to an order of 20. With LPC coefficients, performance showed some improvement from a filter order of 8 to 12 and 16 and then degraded. Finally, recognition accuracies declined severely from a filter order of 16 to an order of 20 for all three phoneme categories and all acoustic parameters. It can be concluded that with the PDF the best option for filter order would be 8 or 12.








5.2.5 Comparative Study of Distance Measures

It is generally believed that the EUC distance measure is not as effective as the PDF because no normalization of the dimensions is involved in the definition of the EUC; the dimension with the largest values becomes the most significant. In contrast, the PDF approach provides such normalization through the covariance matrix computation. The PDF approach gives unequal weighting to each element of a vector: it may suppress elements with large values and emphasize elements with small values, according to their importance in reducing the intragroup variation.
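To make the contrast concrete, the following minimal Python sketch (ours, not part of the original experiments; the function names and toy data are illustrative) implements the two measures as defined in Chapter 4. Note that the PDF measure returns a likelihood, so the nearest neighbor rule picks the reference set giving the largest value rather than the smallest.

import numpy as np

def euc_distance(x, y):
    # EUC: all elements weighted equally, so the largest-valued
    # dimension dominates the distance.
    d = x - y
    return float(np.sqrt(d @ d))

def pdf_likelihood(x, refs):
    # PDF: the covariance matrix W, estimated from the reference
    # templates, normalizes each dimension and accounts for correlations.
    y = refs.mean(axis=0)                 # mean reference vector Y
    w = np.cov(refs, rowvar=False)        # covariance matrix W
    d = x - y
    mahal = d @ np.linalg.inv(w) @ d      # (X-Y)^t W^-1 (X-Y)
    n = x.size
    return (2 * np.pi) ** (-n / 2) / np.sqrt(np.linalg.det(w)) * np.exp(-0.5 * mahal)

# Toy templates: the first element is ~1000x larger than the second,
# so EUC is governed almost entirely by it; PDF rescales both.
rng = np.random.default_rng(0)
refs = rng.normal([1000.0, 1.0], [50.0, 0.5], size=(26, 2))
test = np.array([1010.0, 2.5])
print(euc_distance(test, refs.mean(axis=0)))
print(pdf_likelihood(test, refs))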

However, the PDF approach did not work well in our experiments. From Tables 5.3 and 5.4 as well as Figures 5.3 and 5.4, it can be seen that the EUC outperformed the PDF.

First, out of the 48 corresponding pairs of EUC and PDF recognition rates in Tables 5.3 and 5.4, 32 EUC recognition rates were higher than those of the PDF, 3 were tied, and only 13 PDF recognition rates were higher than the EUC rates.

Second, as demonstrated in the sections above, performance using the PDF varied considerably with the filter order, and there were severe performance declines from order 16 to order 20 for all three phoneme categories and all acoustic parameters. On the other hand, the results of the EUC were relatively consistent across all the filter orders that were tested, especially filter orders of 12, 16, and 20.

Third, the two highest rates in Tables 5.3 and 5.4 for the three phoneme groups were achieved using the EUC distance measure. The RC with the EUC yielded a 100% recognition rate for sustained vowels with filter orders of 12, 16, and 20. The CC with the EUC yielded a 98.1% recognition rate for voiced fricatives with filter orders of 12 and 16. Even for unvoiced fricatives, the highest recognition rates came from the CC with the EUC using Scheme 4 (Table 5.2): 88.5% with a filter order of 12 and 90.4% with a filter order of 16.
Fourth, the EUC distance measure functioned more evenly on the male and female groups than did the PDF. Examining all tables of exclusive schemes in Appendix B, for the 48 PDF pairs, 43 male recognition rates were higher than those of the females and only 5 female recognition rates were higher than those of the males, and the largest difference between the gender groups was 68.6% (from the ARC for unvoiced fricatives with a filter order of 20). On the other hand, in 29 out of the 48 EUC pairs the male rates were higher than those of the females, and the largest gap between the gender groups was only 21% (from the LPC for vowels with a filter order of 8).

A possible reason for this inferior PDF performance is the small ratio of the available number of subjects per gender to the number of elements (measurements) per feature vector. The assumption when using the PDF distance measure to design a classifier is that the data are normally (Gaussian) distributed. In this case, many factors must be considered (e.g., the size of the training or design set and the number of measurements (observations, samples) in the data record or vector). Foley (1972) and Childers (1986) pointed out that if the ratio of the available number of samples per class (in this study, the number of subjects per gender) to the number of samples per data record (in this study, the number of elements per feature vector) is small, then data classification for both design and test sets may be unreliable. This ratio should be on the order of three or larger (Foley, 1972). In our study, the ratios were 3.25 (26/8), 2.17 (26/12), 1.63 (26/16), and 1.3 (26/20) for filter orders of 8, 12, 16, and 20 respectively. The value of 3.25 satisfied the requirement, but the others were too small. Therefore, with the exception of the results with a filter order of 8, where the performances of the PDF and EUC were comparable, the PDF approach did not function well. The smaller the ratio, the worse the PDF performed.
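The effect of this ratio can be checked numerically. The short sketch below (our illustration with synthetic Gaussian data, not the dissertation's measurements) estimates the covariance matrix W needed by the PDF measure from 26 samples and reports its condition number as the feature dimension grows; once the ratio falls well below three, W becomes ill-conditioned and its inverse, and hence the PDF distance, becomes unreliable.

import numpy as np

rng = np.random.default_rng(1)
n_subjects = 26                        # subjects per gender in this study

for dim in (8, 12, 16, 20):            # the four filter orders
    samples = rng.normal(size=(n_subjects, dim))
    w = np.cov(samples, rowvar=False)  # sample covariance matrix W
    print(f"dim={dim:2d}  ratio={n_subjects / dim:4.2f}  "
          f"cond(W)={np.linalg.cond(w):9.1e}")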


5.2.6 Comparative Study Using Different Procedures

Performance differences between the resubstitution (inclusive) and leave-one-out (exclusive) procedures were also tested with a filter order of 16. Tables A.9 and A.10 in Appendix A present the inclusive recognition results for the LPC log likelihood and cepstral distortion measures with various recognition schemes, respectively. Tables B.9 and B.10 in Appendix B show the recognition results from inclusive recognition Scheme 3 with various acoustic parameters, using the EUC and PDF distance measures respectively.

The results presented in Appendix A indicate that the correct recognition rates of the exclusive recognition procedure (Tables A.3 and A.7) were not greatly degraded compared to those obtained from the inclusive recognition procedure, especially for the cepstral distortion measure. For the cepstral distortion measure with Scheme 3, the rates decreased from 94.2% to 90.4% for vowels, from 86.5% to 84.6% for unvoiced fricatives, and remained constant for voiced fricatives; the maximum decrease was less than 4%. For the LPC log likelihood with Scheme 3, the rates degraded from 92.3% to 86.5% for vowels, from 82.7% to 75% for unvoiced fricatives, and from 100% to 96.2% for voiced fricatives; here the maximum decrease, 7.7%, was observed for unvoiced fricatives. In contrast, the results from the partial database of 21 subjects, which we analyzed before completing the data collection for the entire database, showed a much larger decrease from the inclusive to the exclusive procedure: recognition rates dropped by more than 14% for unvoiced fricatives and by more than 9% for voiced fricatives. This convinced us that the larger the database, the smaller the performance difference between the inclusive and exclusive procedures.








One interesting observation from Tables B.9 and B.10 in Appendix B was that when the PDF distance measure was used with the inclusive procedure, the correct recognition rates were extremely high for all types of phonemes and feature vectors, except for the ARC with unvoiced fricatives (which was still 98.1%). In addition, the LPC, RC, and CC were all able to provide 100% correct gender recognition from unvoiced fricatives! However, when the PDF was used with the exclusive procedure (Table B.7), the correct recognition rates decreased significantly, with drops ranging from a minimum of 5.8% to a maximum of 40.4% (for unvoiced fricatives, from a minimum of 21.2% to a maximum of 40.4%). On the other hand, the EUC distance measure operated more evenly: from the inclusive to the exclusive procedure (Table B.3), recognition rates dropped very little, from a minimum of 0% to a maximum of 3.9%, and the rates for four feature parameters did not decrease at all. Figures 5.5(a) and (b) are graphic illustrations of Tables B.3 and B.9. It can be seen that there was only a minor performance difference between the inclusive and exclusive procedures when the EUC distance measure was used. Our results also suggested that the PDF excelled at capturing information from an individual subject: as long as the data of the subject itself were included in the reference data set, the PDF was able to pick up such subject-specific information easily and then identify the subject's gender accurately. Therefore, the correct recognition rates of the inclusive procedure for the PDF were extremely high. However, the PDF recognition rates of the exclusive procedure were much lower, indicating that the PDF was clearly inferior at capturing gender information from other 'average' subjects of the same gender. On the other hand, the EUC distance measure was good at capturing gender information from the other subjects without including the characteristics of the test subject itself.
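The two procedures differ only in whether the test subject's own template remains in the reference set. A minimal nearest-neighbor sketch (illustrative Python with synthetic features; the dissertation's templates and schemes are more elaborate) makes the optimistic bias of the inclusive procedure explicit:

import numpy as np

def nn_error_rate(features, labels, exclusive):
    # 1-NN error rate. exclusive=True is leave-one-out; with
    # exclusive=False (resubstitution) each sample matches itself,
    # so plain 1-NN is trivially optimistic (0% error).
    errors = 0
    for i, x in enumerate(features):
        dists = np.linalg.norm(features - x, axis=1)
        if exclusive:
            dists[i] = np.inf     # remove the test subject's own data
        errors += labels[np.argmin(dists)] != labels[i]
    return errors / len(features)

rng = np.random.default_rng(2)
feats = np.vstack([rng.normal(0.0, 1.0, (26, 12)),   # class 1 (toy)
                   rng.normal(0.8, 1.0, (26, 12))])  # class 2 (toy)
labs = np.array([0] * 26 + [1] * 26)
print("inclusive:", nn_error_rate(feats, labs, exclusive=False))
print("exclusive:", nn_error_rate(feats, labs, exclusive=True))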















[Figure: horizontal bar charts of correct rate (%) from 55 to 100 for vowels, unvoiced fricatives, and voiced fricatives; bars for ARC (EUC), LPC (EUC), LPC (LLD), RC (EUC), and CC (EUC).]
Figure 5.5 Results of recognition Scheme 3 with the EUC and
a filter order of 16 for (a) exclusive procedure
(b) inclusive procedure.

Full Text
6.3.1 Algorithm Description 113
6.3.2 EGG Assisted Procedures 120
6.4 Testing Methods 122
6.4.1 Two-way ANOVA Statistical Testing 123
6.4.2 Automatic Recognition by Using Grouped Features ... 128
7 EVALUATION OF VOWEL CHARACTERISTICS 130
7.1 Vowel Characteristics of Gender 130
7.1.1 Fundamental Frequency and Formant Features
for Each Gender 130
7.1.2 Comparison with Peterson and Barney's Results 142
7.1.3 Results of Two-way ANOVA Statistical Test 145
7.1.4 Results of T Statistical Test 145
7.1.5 Discussion 145
7.2 Relative Importance of Grouped Vowel Features 151
7.2.1 Recognition Results 152
7.2.2 Discussion 154
7.3 Conclusions 158
8 CONCLUDING REMARKS 162
8.1 Summary 162
8.2 Future Research Extensions 166
8.2.1 Short Term Extension 166
8.2.2 Long Term Extension 168
APPENDICES
A RECOGNITION RATES FOR LPC AND CEPSTRUM
PARAMETERS 169
B RECOGNITION RATES FOR VARIOUS ACOUSTIC
PARAMETERS AND DISTANCE MEASURES 179
REFERENCES 189
BIOGRAPHICAL SKETCH 198


all normalized to 1. Formant information is shown in Table 7.2. Data are also represented as means ± SE.

Table 7.1
Fundamental frequencies of ten sustained vowels (Hz)

        IY      I      E      AE     A      OW     U      OO     UH     ER    Total
M     131.8  130.2  124.3  122.7  120.4  119.6  125.5  129.7  120.4  121.5  124.6
(SE)    4.6    4.2    3.7    3.9    3.8    3.5    3.9    4.3    3.8    3.6    3.95
F     233.1  227.5  219.1  215.5  213.8  216.3  220.2  222.3  215.3  217.3  220.0
(SE)    5.3    5.9    5.7    6.3    5.1    5.4    5.6    5.2    5.1    5.1    5.48

where M = male, F = female


contribution of each of the vocal characteristics to listener judgments), he found that listeners were basing their judgments of the degree of male or female voice quality on the frequency of the laryngeal fundamental.

However, in a later study by Coleman (1976), there were inconsistent findings from a pair of experiments concerned with comparing the contribution of two vocal characteristics to the perception of male and female voice quality. The first experiment, which utilized natural speech, indicated that the F0 was very highly correlated with the degree of gender perception while the VTR was less highly correlated. When VTRs that were more characteristic of the opposite gender were included experimentally in these voices, they did not affect the judges' estimates of the degree of male or female voice quality. But in the second experiment, when a tone produced by a laryngeal vibrator was substituted for the normal glottal tone at simulated F0s representing both male (120 Hz) and female (240 Hz) voices, and male and female characteristics (i.e., vocal tract formants and laryngeal fundamentals) were combined in the same voice experimentally, he found that the female F0 was a weak indicator of female voice quality when combined with male VTR features, although the male F0 retained the perceptual prominence seen in the first experiment. Thus, there was a difference in the manner in which F0 and VTR interact for male and female perception.

Lass et al. (1976) conducted a study comparing listeners' gender identification accuracy from voiced, whispered, and 255 Hz low-pass filtered isolated vowels. They found that listener accuracy was greatest for the voiced stimuli (96% correct out of 1800 identifications: 20 speakers x 6 vowels x 15 listeners), followed by the filtered stimuli (91% correct), and least accurate (75% correct) for the voiceless vowels. Since the low-pass filtering apparently eliminated formant information, they concluded that the F0 was a more important acoustic cue in speaker gender identification tasks than the VTR characteristics of the speaker.


means ± standard errors (SE), which were calculated from the median layer templates. We will see in the next chapter that by applying these two tokens as reference templates with recognition Scheme 3 and the Euclidean distance, a 100% gender recognition rate can be achieved. The result is not surprising if we notice that the within-gender variation for these reflection coefficients, as represented by the SE in the figure, was small compared to the between-gender variation. It is also easily noted that elements 1, 4, 5, 6, 7, 8, 9, and 10 of these reference templates account for most of the between-gender variation. On the other hand, elements 2, 3, 11, and 12 of these reference templates account for little between-gender variation and thus could be discarded to reduce the dimensionality of the vector. Figure 4.6(b) shows the spectra corresponding to the two universal reflection coefficient templates.

Similarly, Figures 4.7(a) and (b) show two reflection coefficient templates (with the same filter order of 12) and the corresponding spectra of unvoiced fricatives for male and female speakers in the upper layer. Interestingly, strong peaks are present in the universal female spectrum for unvoiced fricatives, but only several ripples appear in the male spectrum. We will see later that by using these two tokens as reference templates with recognition Scheme 3 and the Euclidean distance, an 80.8% gender recognition rate can be achieved. Finally, Figures 4.8(a) and (b) are two cepstral coefficient templates (with the same filter order of 12) and the corresponding spectra of voiced fricatives for male and female speakers in the upper layer. It is shown later that by using these two tokens as reference templates with recognition Scheme 3 and the Euclidean distance, a 92.3% gender recognition rate can be achieved. The "universal" spectra for the two genders in Figures 4.6(b), 4.7(b), and 4.8(b) possess the basic properties of vowels, unvoiced fricatives, and voiced fricatives. For example, while the energy of the vowels is concentrated in the lower frequency portion of the spectrum, the energy of the


Table 7.6
T-test result for each individual feature of ten vowels for male and female speakers
[Matrix of significance codes, one per feature x vowel cell. Rows: fundamental frequency, then F1, B1, A1, F2, B2, A2, F3, B3, A3, F4, B4, A4; columns: IY, I, E, AE, A, OW, U, OO, UH, ER.]

where
++        difference is highly significant, M > F
+         difference is significant, M > F
--        difference is highly significant, M < F
-         difference is significant, M < F
(blank)   no significant difference

Table 5.7
Estimated values of J4 and J1, expected probability of errors, and
experimental error rates for various acoustic parameters (filter order = 20)

                       J4     J1     EXPECTED       EXPERIMENTAL
                                     PROBABILITY    ERROR
                                     OF ERROR       RATE*
Sustained    ARC      0.16   11.0    0.12           0.33
Vowels       LPC      0.14   14.7    0.08           0.19
             FFF      N/A    N/A     N/A            N/A
             RC       0.60   15.0    0.08           0.33
             CC       0.25   8.54    0.16           0.25
Unvoiced     ARC      0.11   5.02    0.23           N/A
Fricatives   LPC      0.28   7.33    0.18           0.46
             RC       0.14   5.83    0.21           0.44
             CC       0.17   5.24    0.22           0.42
Voiced       ARC      0.34   10.0    0.13           0.40
Fricatives   LPC      0.36   15.7    0.08           0.29
             RC       0.46   13.9    0.09           0.25
             CC       0.19   14.1    0.09           0.29

* Obtained from the exclusive recognition Scheme 3, using the PDF distance measure.
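J4 and J1 are Fisher-type class separability criteria whose exact definitions are given earlier in the dissertation. As a generic illustration only (this is the classic per-feature Fisher discriminant ratio, not necessarily the J4/J1 of Table 5.7), such a separability measure can be computed as:

import numpy as np

def fisher_ratio(male, female):
    # Per-feature Fisher discriminant ratio:
    # (difference of class means)^2 / (sum of class variances).
    # Larger values indicate better male/female separability.
    num = (male.mean(axis=0) - female.mean(axis=0)) ** 2
    den = male.var(axis=0, ddof=1) + female.var(axis=0, ddof=1)
    return num / den

rng = np.random.default_rng(3)
male = rng.normal(0.0, 1.0, (26, 12))       # 26 subjects x 12 features (toy)
female = rng.normal(0.8, 1.0, (26, 12))
print(fisher_ratio(male, female).round(2))  # ratio for each feature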


Figure 6.5 Synchronized waveforms (from top down):
speech, EGG, DEGG, and glottal area.


\sum_{k=1}^{p} a_k \sum_{n} s(n-k)\,s(n-i) = \sum_{n} s(n)\,s(n-i), \qquad 1 \le i \le p \qquad (4.9)
For a short-time analysis, the limits of summation are finite. The particular
choice of these limits has led to two methods of analysis (i.e., the autocorrelation
method (Markel and Gray, 1976) and the covariance method (Atal and Hanauer,
1971)).
The autocorrelation method results in a filter structure that is guaranteed to be
stable. Meanwhile, it operates on a data segment that is windowed using a Hanning,
or Hamming, or another window, typically 10-20 msec long (two to three pitch
periods).
The covariance method, on the other hand, gives a filter with no guaranteed
stability, but requires no explicit windowing. Hence it is eminently suitable for pitch
synchronous analysis.
One of the important features of the linear prediction model is that the
combined contribution of the glottal flow, the vocal tract and the radiation effect at
the lips are represented by a single recursive filter. The difficult problem of
separating the contribution of the source function from that of the vocal tract system
is thus completely avoided.
4.1.2 Analysis Conditions

In order to extract acoustic parameters rapidly, a conventional pitch asynchronous autocorrelation LPC method was used, which applied a fixed frame size, frame rate, and number of parameters per frame. These analysis conditions were:
Order of the filter: 8, 12, 16, 20
Analysis frame size: 256 points/frame


Table A.7
Results from exclusive recognition schemes, cepstrum distance measure (filter order = 16)

                          CORRECT RATE %
                        MALE    FEMALE   TOTAL
Sustained   Scheme 1    72.6    66.0     69.4
Vowels      Scheme 2    71.1    68.8     70.0
            Scheme 3    88.9    92.0     90.4
            Scheme 4    92.6    92.0     92.3
Unvoiced    Scheme 1    63.0    64.8     63.9
Fricatives  Scheme 2    59.3    66.4     62.7
            Scheme 3    85.2    84.0     84.6
            Scheme 4    88.9    92.0     90.4
Voiced      Scheme 1    82.4    80.0     81.3
Fricatives  Scheme 2    88.9    80.0     84.6
            Scheme 3    100.0   96.0     98.1
            Scheme 4    92.6    92.0     92.3


Carlson, T. E. (1981). Some acoustical and perceptual correlates of speaker gender identification, An outline of Ph.D. dissertation, University of Florida, Gainesville.
Carrell, T. D. (1981). Effects of glottal waveform on the perception of talker sex, J. Acoust. Soc. Am., Suppl., Vol. 70, S97.
Cheng, Y. M., and Guerin, B. (1987). Control parameters in male and female glottal sources, Chapter 17, in Laryngeal Function in Phonation and Respiration, T. Baer, C. Sasaki, and K. Harris, Eds., College Hill Publ., San Diego, 219-238.
Childers, D. G. (1977). Laryngeal pathology detection, CRC Crit. Rev. Bioeng., Vol. 2(4), 375-426.
Childers, D. G. (1986). Single-trial event-related potentials: statistical classification and topography, Chapter 14, in Topographic Mapping of Brain Electrical Activity, F. H. Duffy, Ed., Butterworth Publ., Boston, 255-277.
Childers, D. G. (1989). Biomedical signal processing, Chapter 10, in Selected Topics in Signal Processing, Simon Haykin, Ed., Prentice-Hall, Englewood Cliffs, New Jersey, 194-250.
Childers, D. G., Bloom, P. A., Arroyo, A. A., Roucos, S. E., Fischler, I. S., Achariyapaopan, T., and Perry, N. W. Jr. (1983). Classification of cortical responses using features from single EEG records, IEEE Trans. on Biomed. Eng., Vol. BME-29(6), 423-438.
Childers, D. G., and Hicks, D. M. (1984). Two channel (EGG and speech) analysis-synthesis for voice recognition, NSF Proposal, University of Florida, Gainesville.
Childers, D. G., and Krishnamurthy, A. K. (1985). A critical review of electroglottography, CRC Crit. Rev. Bioeng., Vol. 12(2), 131-164.
Childers, D. G., Krishnamurthy, A. K., Bocchieri, B. L., and Naik, J. M. (1985a). Vocal source and tract models based on speech signal analysis, in Mathematics and Computers in Biomedical Applications, J. Eisenfeld and C. DeLisi, Eds., Elsevier Science Publ. B.V., Amsterdam, Netherlands, 335-349.
Childers, D. G., and Larar, J. N. (1984). Electroglottography for laryngeal function assessment and speech analysis, IEEE Trans. on Biomed. Eng., Vol. BME-31(12), 807-817.
Childers, D. G., Naik, J. M., Larar, J. N., Krishnamurthy, A. K., and Moore, G. P. (1983). Electroglottography, speech, and ultra-high speed cinematography, in Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control, I. Titze and R. Scherer, Eds., The Denver Center for the Performing Arts, Denver, 202-220.


data sample. Return the sample removed earlier to the data set. Then repeat the above steps, removing a different sample each time, n times, until every sample has been used for testing. The total number of errors gives the leave-one-out error estimate. Clearly this method uses the data very effectively. This method is also referred to as the jackknife method.

The Rotation Estimate. In this procedure, the data set is partitioned into n/d disjoint subsets, where d is a divisor of n. Then, remove one subset from the design set, design the classifier with the remaining data, and test it on the removed subset, which was not used in the design. Repeat the operation n/d times until every subset has been used for testing. The rotation estimate is the average frequency of misclassification over the n/d test sessions. When d=1 the rotation method reduces to the leave-one-out method; when d=n/2 it reduces to the holdout method with the roles of the design and test sets interchanged. The interchanging of design and test sets is known in statistics as cross-validation in both directions. As one might expect, the properties of the rotation estimate fall somewhere between those of the leave-one-out method and the holdout method: the rotation estimate is less biased than the holdout method, and its variance is less than that of the leave-one-out method.

In order to use the database effectively, the leave-one-out procedure was adopted for the experiments. For comparison, the resubstitution procedure was also used in selected experiments.
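A compact sketch of the rotation estimate follows (our illustration; nn_classify is a stand-in 1-NN design procedure, and d must divide n):

import numpy as np

def rotation_error(x, y, d, classify):
    # Rotation estimate: partition the n samples into n/d disjoint
    # subsets, test each subset with a classifier designed on the rest,
    # and average the misclassifications. d=1 gives leave-one-out;
    # d=n/2 gives the holdout method with design/test roles swapped.
    n = len(x)
    assert n % d == 0
    errors = 0
    for start in range(0, n, d):
        test_idx = np.arange(start, start + d)
        design_idx = np.setdiff1d(np.arange(n), test_idx)
        pred = classify(x[design_idx], y[design_idx], x[test_idx])
        errors += np.sum(pred != y[test_idx])
    return errors / n

def nn_classify(design_x, design_y, test_x):
    # Nearest neighbor: label each test vector by its closest design vector.
    dists = np.linalg.norm(test_x[:, None, :] - design_x[None, :, :], axis=2)
    return design_y[np.argmin(dists, axis=1)]

rng = np.random.default_rng(4)
x = np.vstack([rng.normal(0.0, 1.0, (26, 8)), rng.normal(0.8, 1.0, (26, 8))])
y = np.array([0] * 26 + [1] * 26)
print(rotation_error(x, y, d=1, classify=nn_classify))  # leave-one-out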
4.6 Separability of Acoustic Parameters Using Fisher's Discriminant Ratio Criterion

4.6.1 Fisher's Discriminant and F Ratio

After choosing the acoustic feature candidates, we can see how well they separate the two genders by analytically studying the database. There are many measures

where
F0 = fundamental frequency
F1 = first formant frequency
BW1 = first formant bandwidth

Figure 1.1 A possible automatic gender recognition system.


between these two results, instead of declaring that formant information was more sensitive for recognizing the speaker's gender, it is more reasonable to infer that they had almost the same sensitivity for gender recognition. It was also found that adding pitch information to the formant features did not help to increase the recognition rate, since using all available formant information together with the fundamental frequency, as in (f), did not increase the recognition rate with either the EUC or the PDF distance measure. This indicated that the formant characteristics contained sufficient gender information.

5. Redundant gender information was discovered. The above discussion demonstrates that considerable redundant information concerning gender appeared to be imbedded in vowel characteristics such as formant and pitch features. For the automatic gender recognition task, individual vowel features could be used, such as the fundamental frequency (98.1%), the fourth formant frequency (96.2%), and the second formant frequency (98.1%). Combined or grouped features could also be used, such as the third formant with the PDF distance measure (100%), the second formant with the EUC distance measure (98.1%), or the frequencies of all formants alone (98.1%) with the EUC distance measure. In practice, it is necessary to consider which vowel characteristics can be easily extracted from speech signals and which algorithms can quickly and accurately pick up these vowel characteristics. In reality, many problems remain concerning accurate formant estimation and fundamental frequency tracking, even for sustained vowels. Therefore, rendering the parametric


Table A.4
Results from exclusive recognition schemes, LPC distance measure (log likelihood), filter order = 20

                          CORRECT RATE %
                        MALE    FEMALE   TOTAL
Sustained   Scheme 1    66.3    82.8     74.2
Vowels      Scheme 2    72.2    81.2     76.5
            Scheme 3    88.9    80.0     84.6
            Scheme 4    88.9    88.0     88.5
Unvoiced    Scheme 1    79.3    49.6     65.0
Fricatives  Scheme 2    64.4    64.0     64.2
            Scheme 3    81.0    76.0     78.9
            Scheme 4    59.3    80.0     69.3
Voiced      Scheme 1    64.8    81.0     72.6
Fricatives  Scheme 2    76.9    84.0     80.3
            Scheme 3    96.3    100.0    98.1
            Scheme 4    92.6    96.0     94.3


[Block diagram: rest area and vocal fold tension together with subglottal pressure drive a vocal fold model, producing the glottal volume velocity Ug(0,t) and the glottal opening area Ag(t), which feed the vocal tract filter.]
Figure 6.2 A model to study the source-tract interaction effect.


Figure 3.1 Examples of aligned speech and EGG signals
for (a) male and (b) female speakers.


an appropriate distance measure was important for gender recognition for a given type of phoneme.
2. Spectral characteristics were vital factors in distinguishing the speaker's gender. Vowels, unvoiced fricatives, or voiced fricatives can each be used to classify the subject's gender objectively. The speaker's gender features could be captured based only on the speaker's vocal tract characteristics, since no vocal fold information was contained in unvoiced fricatives. Moreover, when some of the vocal fold information was combined with vocal tract characteristics, as in the vowel and voiced fricative cases, gender discrimination improved.
3. Choosing appropriate template forming and recognition schemes was crucial in order to achieve high recognition rates. Recognition Schemes 3 and 4 were more sensitive for gender discrimination than Schemes 1 and 2, indicating the importance of averaging techniques. In addition, averaging techniques seemed more critical than clustering techniques. To a great extent, averaging on both test and reference templates eliminated the intrasubject variation across different vowels or fricatives of a given subject and emphasized features representing the subject's gender. The above discussion implies that the gender information is time-invariant, phoneme independent, and speaker independent.
4. The performance of the cepstral distortion measure was better than that of the LPC log likelihood distortion measure. The cepstral distortion measure acted more evenly between the male and female groups, indicating that this feature has some normalization characteristics.


Analysis Conditions
Method: asynchronous autocorrelation LPC
Filter order: 8, 12, 16, 20
Analysis frame size: 256 points/frame
Frame overlap: none
Preemphasis factor: 0.95
Analysis window: Hamming
Data set for coefficient calculations: six frames total. By averaging the six sets of coefficients obtained from these six frames, a template coefficient set was calculated for each sustained utterance.

Acoustic Parameters
LPC coefficients (LPC)
Cepstrum coefficients (CC)
Autocorrelation coefficients (ARC)
Reflection coefficients (RC)
Fundamental frequency and formant information (FFF)*
* FFF was obtained by the 12th-order closed-phase WRLS-VFF method discussed in the next chapter.

Distance Measures
Euclidean distance (EUC)
LPC log likelihood distance (LLD)
Cepstral distortion (same as EUC)
Weighted Euclidean distance (WEUC)
Probability density function (PDF)

Decision Rule
Nearest neighbor
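All four LPC-derived parameter sets above can be computed from a single autocorrelation analysis of a windowed frame. The sketch below uses the standard Levinson-Durbin and LPC-to-cepstrum recursions (textbook forms, with the polynomial convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p; this is an illustration, not the dissertation's code, and the synthetic frame is made up):

import numpy as np

def lpc_parameters(frame, order=12):
    # Autocorrelation coefficients (ARC)
    r = np.array([frame[:frame.size - k] @ frame[k:] for k in range(order + 1)])
    # Levinson-Durbin recursion for the LPC coefficients, collecting
    # the reflection coefficients (RC) along the way.
    a = np.zeros(order + 1)
    a[0] = 1.0
    rc = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        rc[i - 1] = k
        a[1:i + 1] += k * np.concatenate((a[i - 1:0:-1], [1.0]))
        err *= 1.0 - k * k
    # Cepstrum coefficients (CC) from the LPC polynomial.
    cc = np.zeros(order)
    for n in range(1, order + 1):
        cc[n - 1] = -a[n] - sum((1 - m / n) * a[m] * cc[n - m - 1]
                                for m in range(1, n))
    return r, a[1:], rc, cc

# Example: one Hamming-windowed 256-point synthetic frame.
frame = np.hamming(256) * np.sin(2 * np.pi * 0.07 * np.arange(256))
arc, lpc, rc, cc = lpc_parameters(frame, order=12)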


Table A.8
Results from exclusive recognition schemes, cepstrum distance measure (filter order = 20)

                          CORRECT RATE %
                        MALE    FEMALE   TOTAL
Sustained   Scheme 1    74.1    66.8     70.6
Vowels      Scheme 2    73.7    70.4     72.1
            Scheme 3    88.9    92.0     90.4
            Scheme 4    88.9    88.0     88.5
Unvoiced    Scheme 1    65.2    64.0     64.6
Fricatives  Scheme 2    60.7    68.0     64.2
            Scheme 3    85.2    80.0     82.7
            Scheme 4    85.2    92.0     88.5
Voiced      Scheme 1    79.6    82.0     80.8
Fricatives  Scheme 2    87.0    83.0     85.1
            Scheme 3    96.3    96.0     96.2
            Scheme 4    88.9    92.0     90.4


In terms of the relative importance of the fundamental frequency versus the formant characteristics, although using formant information showed a slightly higher recognition rate (98.1% versus 96.2%), it was still reasonable to believe that they had almost the same sensitivity for gender recognition. This result also indicated that the formant characteristics contained sufficient gender information without the presence of the fundamental frequency.
4. Recall from Chapter 5 that the reflection coefficients derived from sustained vowels, when combined with the EUC distance measure, also produced the highest recognition rates (100%) for filter orders of 12, 16, and 20. Therefore, there were two different approaches for vowels in this study that achieved 100% recognition rates for both males and females, namely, the third formant information with the PDF distance measure and the reflection coefficients with the EUC distance measure. Both used recognition Scheme 3.
5. By examining Table 7.3 and the recognition results obtained using individual formant features, it is noted that the features with higher F values in the ANOVA analysis usually had higher recognition rates in the later automatic recognition test (e.g., F0 and F1 to F4), though the feature with the highest F value may not possess the highest recognition rate (the F value for F0 is the highest, but the highest recognition rate was instead achieved using F2). This indicated that the conclusions from the statistical test and the recognition test were quite consistent.
6. The higher variability of the female voices was again noted. In general, most recognition rates for female voices were lower than those for male voices, and the female feature data plots appeared to be more scattered.


Table B.3
Results from the exclusive recognition Scheme 3, Euclidean distance measure (filter order = 16)

                      CORRECT RATE %
                    MALE    FEMALE   TOTAL
Sustained   ARC     81.5    76.0     78.8
Vowels      LPC     74.1    88.0     80.8
            RC      100.0   100.0    100.0
            CC      88.9    92.0     90.4
Unvoiced    ARC     74.1    76.0     75.0
Fricatives  LPC     77.8    64.0     71.2
            RC      81.5    80.0     80.8
            CC      85.2    84.0     84.6
Voiced      ARC     88.9    84.0     86.5
Fricatives  LPC     96.3    88.0     92.3
            RC      96.3    96.0     96.2
            CC      100.0   96.0     98.1


Table A.5
Results from exclusive recognition schemes, cepstrum distance measure (filter order = 8)

                          CORRECT RATE %
                        MALE    FEMALE   TOTAL
Sustained   Scheme 1    59.6    63.2     61.3
Vowels      Scheme 2    64.4    74.8     69.4
            Scheme 3    74.1    92.0     82.7
            Scheme 4    85.2    96.0     90.4
Unvoiced    Scheme 1    63.0    59.2     61.2
Fricatives  Scheme 2    57.0    60.8     58.8
            Scheme 3    70.4    72.0     71.2
            Scheme 4    81.5    76.0     78.8
Voiced      Scheme 1    86.1    72.0     79.3
Fricatives  Scheme 2    72.2    79.0     75.5
            Scheme 3    92.6    96.0     94.2
            Scheme 4    88.9    96.0     92.3


[Plots of the second formant frequency (Hz), bandwidth (Hz), and amplitude (dB) for each of the ten vowels.]
Figure 7.3 The second formant characteristics of ten vowels.

D_{L2} = \left[ \int_{-\pi}^{\pi} \left| \log|X(\omega)|^2 - \log|X'(\omega)|^2 \right|^2 \frac{d\omega}{2\pi} \right]^{1/2} \qquad (4.21)

where X(\omega) is the Fourier transform of x(n). Gray and Markel (1976) showed that for LPC analysis with a filter order of 10 the correlation coefficient between D_{L2} and D_{cep} is 0.98, while for an order of 20 the correlation coefficient is 0.997.

4.3.4 Weighted Euclidean Distance

D_{weuc} = \left[ (X-Y)^t W^{-1} (X-Y) \right]^{1/2} \qquad (4.22)

where X is the test vector, W is the symmetric covariance matrix obtained from a set of reference vectors (e.g., from a set of templates that represent subjects of the same gender), and Y is the mean vector of this set of reference vectors. The weighting compensates for correlation between features in the overall distance, reducing the intragroup variation. The weighted Euclidean distance is a simplified version of the likelihood distance measure using the probability density function discussed below.

4.3.5 Probability Density Function

D_{pdf} = (2\pi)^{-n/2} \, |W|^{-1/2} \exp\left[ -\tfrac{1}{2} (X-Y)^t W^{-1} (X-Y) \right] \qquad (4.23)

where X, Y, and W are the same as in Equation (4.22) and |W| is the determinant of W. The decision principle of this distance measure minimizes the probability of error.
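It may help to note explicitly why (4.22) is the simplified version of (4.23). Taking the negative logarithm of (4.23) gives

-2 \ln D_{pdf} = n \ln(2\pi) + \ln|W| + (X-Y)^t W^{-1} (X-Y) = \mathrm{const} + D_{weuc}^2,

so, for a fixed reference set (fixed W and Y), ranking candidates by largest D_{pdf} is identical to ranking them by smallest D_{weuc}; the two measures can differ only when templates from different reference sets, with different \ln|W| terms, compete.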


the highest. Carlson (1981) gave a survey of the literature on vocal tract resonance characteristics as a gender cue.

Fant (1966) pointed out that male and female vowels typically differ in three groups:
1) rounded back vowels,
2) very open unrounded vowels, and
3) close front vowels.
The main physiological determinants of the specific deviations are that the ratio of pharyngeal length to mouth cavity length is greater for males than for females and that the laryngeal cavities are more developed in males.

Schwartz and Rine (1968) also demonstrated that the gender of an individual can be identified from voiceless fricative phonemes such as /S/, /F/, etc. This again is induced by the vocal tract size differences between the genders.

The higher fundamental frequency (pitch) range of female speakers is quite well known. There is general agreement that the fundamental frequency is an important factor in the identification of gender from voice (Curry, 1940, cited by Carlson, 1981; Hollien and Malcik, 1967; Saxman and Burk, 1967; Hollien and Paul, 1969; Hollien and Jackson, 1973; Monsen and Engebretson, 1977; Stoicheff, 1981; Horii and Ryan, 1981; Linville and Fisher, 1985; Henton, 1987). One often finds the statement that the pitch level of the female speaking voice is approximately one octave higher than that of the male speaking voice (Linke, 1973). However, there is considerable discrepancy among the values obtained by different investigators. According to Hollien and Shipp (1972), male subjects showed an intersubject pitch range of 112-146 Hz, while Stoicheff's (1981) data showed that the range for female subjects was 170-275 Hz. Titze (1989) found that the fundamental frequency was scaled primarily according to the membranous lengths of the vocal folds (scale factor 1.6). Figure 1.5 shows fundamental frequency changes for two


that listeners could identify speaker gender accurately from these stimuli, especially from /H/, /S/, and /SH/ (and could not from /F/ and /TH/). Ingemann reported that the most identifiable fricative was /h/, with identification of the others ranging down to little better than chance. Since the laryngeal fundamental (F0) was not available to the listeners, their findings suggest that accurate gender identification is possible from vocal tract resonance (VTR) information alone and, therefore, that formants are important cues for speaker gender identification.

Further support for this conclusion came from studies by Schwartz and Rine (1968) and Coleman (1971). Schwartz and Rine's study revealed that listeners were able to identify the speaker's gender from two whispered vowels (/i/ and /a/): they found 100% correct identification for /a/ and 95% correct identification for /i/, despite the absence of the laryngeal fundamental. In Coleman's study on male and female voice quality and its relationship to vowel formant frequencies, /i/, /u/, and a prose passage were employed to explore listeners' gender identification abilities. All stimuli were produced at the same F0 (85 Hz) by means of an electrolarynx. Coleman discovered that the judges correctly recognized the speaker's gender 88% of the time (with 98% correct for male voices and 79% for female voices), even though the F0 remained constant for all speakers. He also discovered that the vowel formant frequency averages were closely associated with the degree of male or female voice quality.

Coleman (1973a and 1973b) attempted to reduce the influence of possible differences in rate, juncture, and inflection between male and female speakers by presenting their voiced productions of a prose passage backward to subjects. The judgments should therefore have been based solely on VTR and F0 information, which would be unaffected by the backward presentation. By correlation analysis between measures of VTR, F0, and judgments of the degree of male and female voice quality in the voices of the speakers (with the degree of correlation indicative of the


Table 5.2
Results from exclusive recognition schemes with various filter orders and the cepstral distortion measure

                          CORRECT RATE %
                      Order=8   Order=12   Order=16   Order=20
Sustained   Scheme 1    61.3      68.3       69.4       70.6
Vowels      Scheme 2    69.4      67.3       70.0       72.1
            Scheme 3    82.7      92.3       90.4       90.4
            Scheme 4    90.4      92.3       92.3       88.5
Unvoiced    Scheme 1    61.2      65.8       63.9       64.6
Fricatives  Scheme 2    58.8      61.5       62.7       64.2
            Scheme 3    71.2      75.0       84.6       82.7
            Scheme 4    78.8      88.5       90.4       88.5
Voiced      Scheme 1    79.3      82.7       81.3       80.8
Fricatives  Scheme 2    75.5      82.2       84.6       85.1
            Scheme 3    94.2      98.1       98.1       96.2
            Scheme 4    92.3      92.3       92.3       90.4


s_k = \sum_{i=1}^{p} a_i(k) \, s_{k-i} + e_k \qquad (6.6)

where s_k denotes the kth sample of the speech signal, a_i(k) are time-varying coefficients, and e_k represents the combined effect of model error and pulsed excitation. Using vector notation, s_k can be expressed as

s_k = H_k^t \Phi_k + e_k \qquad (6.7)

where

H_k^t = [\, s_{k-1}, \ldots, s_{k-p} \,], \qquad \Phi_k = [\, a_1(k), \ldots, a_p(k) \,]^t

The estimated speech signal \hat{s}_k at instant k can be written as

\hat{s}_k = H_k^t \hat{\Phi}_k \qquad (6.8)

where \hat{\Phi}_k is the vector of estimated filter coefficients.

WRLS algorithm. A least squares criterion for the estimation error can be defined as follows:

V_k(\Phi) = \sum_{i=1}^{k} \lambda^{k-i} (s_i - \hat{s}_i)^2 \qquad (6.9)
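The standard recursive least squares update that minimizes (6.9) for a fixed forgetting factor lambda is sketched below (our illustration in Python; the WRLS-VFF method additionally adapts lambda sample by sample, which is omitted here, and the initialization constant delta is an assumption):

import numpy as np

def wrls(signal, order=12, lam=0.98, delta=1000.0):
    # Recursive least squares tracking of the time-varying LPC
    # coefficients Phi_k, minimizing sum_i lam**(k-i) * (s_i - H_i^t Phi)^2.
    phi = np.zeros(order)               # estimated coefficients Phi_k
    P = delta * np.eye(order)           # inverse weighted correlation matrix
    history = np.zeros((len(signal), order))
    for k in range(order, len(signal)):
        h = signal[k - order:k][::-1]   # H_k = [s_{k-1}, ..., s_{k-p}]
        err = signal[k] - h @ phi       # a priori prediction error
        g = P @ h / (lam + h @ P @ h)   # gain vector
        phi = phi + g * err             # coefficient update
        P = (P - np.outer(g, h @ P)) / lam
        history[k] = phi
    return history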


[Diagram: for subjects 1 through M, a template is formed for each utterance (lower layer); averaging yields a template for each subject (median layer); further averaging yields a template for each gender (upper layer).]
Figure 4.5 Test and reference template formation.


Table 7.2 (continued)

where
F1 = first formant frequency (Hz)    B1 = first formant bandwidth (Hz)    A1 = first formant amplitude (dB)
F2 = second formant frequency (Hz)   B2 = second formant bandwidth (Hz)   A2 = second formant amplitude (dB)
F3 = third formant frequency (Hz)    B3 = third formant bandwidth (Hz)    A3 = third formant amplitude (dB)
F4 = fourth formant frequency (Hz)   B4 = fourth formant bandwidth (Hz)   A4 = fourth formant amplitude (dB)

Plots of the data were generated to visualize the various characteristics of the different genders as well as the different vowels. Figure 7.1 shows the averaged fundamental frequencies of ten sustained vowels for male and female speakers. Figures 7.2 to 7.5 show the averaged first, second, third, and fourth formant frequencies, bandwidths, and amplitudes of ten sustained vowels for male and female speakers, respectively. All data are represented as means ± SE. Figures 7.6 and 7.7 show scatter plots of the second formant frequency as a function of the first formant frequency for ten sustained vowels by male and female speakers, respectively. The polygons in Figures 7.6 and 7.7 show the approximate range of variation in formant frequencies for each of these vowels. Figure 7.8 shows vowel triangles for both male and female speakers.


vowels. The conclusions from the statistical test and the recognition test were quite consistent.

Redundant or excessive gender information was imbedded in the fundamental frequency and vocal tract resonance features. Moreover, the formant characteristics seemed to contain sufficient gender information without the presence of the fundamental frequency.

Finally, the higher variability of female voices was noted. The performances of the various feature vectors combined with distance measures were generally better for male subjects than for female subjects. The recognition rates for males showed a higher mean, a much greater minimum, and a smaller standard deviation than those for females. Upon plotting the female features we noted a greater scattering of the data when compared to similar plots of the male features. There were larger formant frequency, bandwidth, and amplitude changes across the different vowels for female speakers, and the corresponding standard errors for each vowel were greater than those of the males. These observations suggested that female voices are more variable, which may contribute to the perceived melodic quality of female speech.

In summary, this study demonstrated that it is feasible to design an efficient gender recognition system. Such a system would reduce the search space of speech or speaker recognition by half. Furthermore, the knowledge gained from this research might benefit the generation of synthetic speech with a desired male or female voice quality.

8.2 Future Research Extensions

8.2.1 Short Term Extension

o There is still the possibility of reducing the length of the speech signal used to extract the acoustic parameters in the coarse analysis. In


speakers for the utterance "We were away a year ago." Figure 1.6 shows the corresponding speech signals.

The female voice is slightly weaker than the male voice: on average, the root mean square (rms) intensity of glottal periods produced by female subjects is -6 dB relative to comparable samples produced by males. A study by Karlsson (1986) indicated a strong correlation between weak voice effort and constant air leakage during the closed phase.

During the last few years, measuring the area of the glottis as well as estimating the glottal volume-velocity waveform have become research topics of interest (Holmberg et al., 1987). It is well known that the shape of the glottal excitation wave is an important factor that can greatly affect speech quality (Rothenberg, 1971). The wave shape produced by male subjects is typically asymmetrical and frequently shows a prominent hump in the opening phase of the wave (due to source-tract interaction). The closing portion of the wave generally occupies 20%-40% of the total period, and there may or may not be an easily identifiable closed period (Monsen and Engebretson, 1977). A notable difference is that the female waveform tends to be symmetric: there is seldom a hump during the opening phase, indicating little or no source-tract interaction, and the opening and closing parts of the wave occupy more nearly equal proportions of the period. Holmberg et al. (1987) found statistically significant differences in male-female glottal waveform parameters. In normal and loud voices, female waveforms indicated lower vocal fold closing velocity, lower ac flow, and a proportionally shorter closed phase of the cycle, suggesting a steeper spectral slope for females. For softly spoken voices, spectral slopes are more similar to those of males.

These glottal-source differences between male and female subjects are understandable in terms of the relative size of male and female vocal folds. It is


demonstrated. A review of the closed-phase WRLS-VFF (Weighted Recursive Least Squares with Variable Forgetting Factor) analysis and the EGG (electroglottograph) assisted approaches is also presented. Chapter 6 also introduces the testing methods for the fine analysis, which include the two-way ANOVA (Analysis of Variance) statistical test and the automatic recognition test using grouped features. Chapter 7 analyzes vowel characteristics such as the fundamental frequencies and formant features for each gender. Statistical tests and the relative importance of grouped vowel features are also discussed. Finally, in Chapter 8, a summary of the results of this dissertation is offered. Recommendations and suggestions for future research conclude this last chapter.


distortion measure was done by using the EUC, the results of the CC combined with the EUC were directly extracted from Appendix A.

5.2.1 Comparative Study of Recognition Schemes

Tables 5.1 and 5.2 show the condensed results selected from Appendix A, using the LPC log likelihood and cepstral distortion measures respectively. Recognition rates for the four exclusive recognition schemes with various filter orders are included. Figures 5.1 and 5.2 are graphic illustrations of Tables 5.1 and 5.2.

From the curves of Figures 5.1 and 5.2, it can be immediately seen that higher recognition rates were achieved using recognition Schemes 3 and 4 for all the cases, including the various filter orders combined with the different phoneme categories. In particular, by applying Schemes 3 and 4 to voiced fricatives, recognition rates of over 90% were accomplished for all filter orders and both distortion measures. The highest correct recognition rate, 98.1%, was obtained for Scheme 4 by using the LPC log likelihood measure with a filter order of 8. The same rate was obtained for Scheme 3 by using the LPC log likelihood measure with a filter order of 20 and by using the cepstral distortion measure with filter orders of 12 and 16.

The results indicated the following:
1. Choosing appropriate template forming and recognition schemes was important in achieving high correct recognition rates. In particular, the use of averaging techniques was critical, since the highest recognition rates were obtained with Schemes 3 and 4, in both of which the test and reference templates were formed by averaging all the utterances from the same subject or even the same gender (Scheme 3). In contrast, Schemes 1 and 2, in which the test template was formed from a single utterance, performed worse.


o The KNN decision mechanism could be included so that the performance of the gender recognition could be further improved.

8.2.2 Long Term Extension

The LPC model has no direct relationship to the speech production mechanism, and the acoustic parameters derived from the LPC model are difficult to interpret in terms of the voice quality of the perceived speech signal. The inherent weakness of the all-pole LPC model prevents us from sufficiently modeling the fricatives. The results from the cepstral distortion measure suggested that gender information did exist in unvoiced fricatives. Thus, more detailed spectrum or cepstrum analysis of unvoiced fricatives is needed. Moreover, more sophisticated models, such as pole-zero ARMA or other models that are more closely related to speech production, may be required for further studies.

Eskenazi (1989) found in his research that the spectral flatness of the residue signal of the LPC analysis and the coefficient of excess of the speech signals showed significant differences between males and females for several vowels. Additional study is needed to assess the reliability of these acoustic measures for gender recognition. Furthermore, other time domain characteristics of the speech signal, such as pitch amplitude, harmonics-to-noise ratio, perturbation quotients, and jitter and shimmer, might present differences between the two genders. Detailed research on these time domain parameters could be considered in the future.

Finally, the shape information of the glottal volume velocity waveform, which has been shown to be an important factor for gender classification (Monsen and Engebretson, 1977; Karlsson, 1986; Holmberg and Hillman, 1987), was not investigated in this study. Systematic analysis of this aspect using our database could be performed when inverse filtering techniques are improved.


Table B.9
Results from the inclusive recognition Scheme 3, Euclidean distance measure (filter order = 16)

                      CORRECT RATE %
                    MALE    FEMALE   TOTAL
Sustained   ARC     88.9    76.0     82.7
Vowels      LPC     77.8    88.0     82.7
            RC      100.0   100.0    100.0
            CC      92.6    96.0     94.2
Unvoiced    ARC     77.8    80.0     78.9
Fricatives  LPC     77.8    64.0     71.2
            RC      85.2    80.0     82.7
            CC      88.9    84.0     86.5
Voiced      ARC     92.6    88.0     90.4
Fricatives  LPC     100.0   88.0     94.2
            RC      96.3    96.0     96.2
            CC      100.0   96.0     98.1


Table B.5
Results from the exclusive recognition Scheme 3, PDF (Probability Density Function) distance measure (filter order = 8)

                      CORRECT RATE %
                    MALE    FEMALE   TOTAL
Sustained   ARC     81.5    80.0     80.8
Vowels      LPC     85.2    84.0     84.6
            RC      81.5    96.0     88.5
            CC      81.5    76.0     78.8
Unvoiced    ARC     74.1    64.0     69.2
Fricatives  LPC     81.5    76.0     78.8
            RC      74.1    84.0     78.8
            CC      77.8    84.0     80.8
Voiced      ARC     88.9    88.0     88.5
Fricatives  LPC     88.9    96.0     92.3
            RC      92.6    92.0     92.3
            CC      92.6    92.0     92.3


Figure 2.1 The overall research flow.


concerned with formant frequencies (Coleman, 1976). Bladon (1983) pointed out that male vowels appear to have narrower formant bandwidths and perhaps also a less steeply sloping spectrum. These findings need to be investigated further on a larger database.

Third, the acoustic features obtained by short-time spectrum analysis, which was usually done by analog spectrographic techniques, are far from error-free. The accuracy of the estimated F0 and formant frequencies is subject to reading errors, unavoidable formant estimation errors due to the influences of the F0 and source-tract interaction, and large instrument errors (e.g., drift).

In digital speech processing, although the problem of automatic formant analysis of speech has received considerable attention and a variety of approaches have been taken, the calculation of accurate formant features from the speech signal is still considered an unsolved problem. The accuracy of formant tracking and estimation from speech signals using the frame-based conventional LPC analysis method discussed in the coarse analysis is affected by factors such as
(1) the position of the analysis frame,
(2) the length of the analysis window, and
(3) the time-varying characteristics of the speech signal.
Sequential adaptive analysis methods offer an attractive alternative processing strategy since they overcome some of the drawbacks of frame-based analysis.

In this chapter, the detailed background of the closed-phase WRLS-VFF method and the experimental design for testing the relative importance of the formant characteristics for gender recognition are presented.

6.2 Limitations of Conventional LPC

Conventional linear predictive coding (LPC) techniques attempt to model the vocal tract and can provide a good estimate of the envelope of the speech spectrum.


was then used to form the reference and test templates. Four recognition schemes and various distance measures were involved. The decision rule was the nearest neighbor rule, and the exclusive procedure was applied to the database when the performance assessment was made. In addition, the order of the conventional LPC filter was varied.

The approach in this section followed the above procedure, except that individual or grouped vowel features, such as only the fundamental frequency, or only the formant frequencies or bandwidths (but across all formants), were used to form the reference and test templates. In order to determine in detail the relative importance of the various features and their combinations, the following vowel feature sets formed the reference and test templates:
(1) only the fundamental frequency,
(2) individual features of the formants (i.e., every individual frequency, bandwidth, and amplitude of all formants),
(3) an individual formant, consisting of the frequency, bandwidth, and amplitude of a given formant,
(4) the grouped frequencies, or bandwidths, or amplitudes of the four formants,
(5) the entire formant information, and
(6) all available formant information and pitch information together.
Instead of the four recognition schemes, only Scheme 3 was applied in the recognition test. The EUC and PDF distance measures were utilized. The decision rule was still the nearest neighbor rule, and the exclusive procedure was still applied to the database when the performance assessment was made. However, variation of the filter order for the WRLS-VFF analysis was not performed; formant information was obtained from the filter coefficients with the filter order kept at 12.
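For illustration, the grouped feature sets might be assembled as below (a sketch with a hypothetical data layout and made-up numbers; the dissertation does not specify its storage format):

import numpy as np

# Hypothetical per-vowel measurements: F0 plus frequency (F),
# bandwidth (B), and amplitude (A) for four formants.
vowel = {"F0": 124.6,
         "F": [730.0, 1090.0, 2440.0, 3400.0],
         "B": [60.0, 80.0, 120.0, 175.0],
         "A": [-1.0, -5.0, -28.0, -39.0]}

feature_sets = {
    "pitch_only":     np.array([vowel["F0"]]),                   # set (1)
    "formant2":       np.array([vowel["F"][1], vowel["B"][1],
                                vowel["A"][1]]),                 # set (3)
    "all_freqs":      np.array(vowel["F"]),                      # set (4)
    "all_formants":   np.concatenate([vowel["F"], vowel["B"],
                                      vowel["A"]]),              # set (5)
    "formants_pitch": np.concatenate([[vowel["F0"]], vowel["F"],
                                      vowel["B"], vowel["A"]]),  # set (6)
}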


Recognition Schemes
Scheme 1
Scheme 2
Scheme 3
Scheme 4

Error Counting Procedures
Inclusive (resubstitution)
Exclusive (leave-one-out)

Parameters Based on Fisher's Discriminant Ratio Criterion
J4, J1, and the expected probability of error

5.2 Performance Assessments

Since the WEUC is a simplified case of the PDF, and the results produced by the WEUC were very similar to those produced by the PDF in our experiments, only the results obtained by the PDF will be discussed.

The complete results of the experiments are tabulated in Appendices A and B. Appendix A presents the recognition rates for the LPC log likelihood and cepstral distortion measures with the various phoneme categories, recognition schemes, and filter orders. Inclusive procedures were performed only for the acoustic parameters with a filter order of 16.

Appendix B presents the recognition rates for the various acoustic parameters (ARC, LPC, RC, CC) combined with the EUC or PDF distance measures for the different phoneme categories and filter orders. Notice that only recognition Scheme 3 was used for these experiments. Again, inclusive procedures were performed only for the acoustic parameters with a filter order of 16. Since the calculation of the cepstral

[Figure: (a) the structure of Scheme 1 and (b) the structure of Scheme 2, block diagrams relating the test subject's templates to the lower layer and upper layer reference templates.]
Figure 4.9 Structures of four recognition schemes.


o The CC worked very well with unvoiced fricatives. The
highest recognition rate was 84.6% with a filter order of 16.
o Both RC and CC operated extremely well with voiced
fricatives, and the results remained stable across all the filter
orders that were tested. The results generated by the CC were
slightly better than those of the RC with filter orders of 12 and 16
(98.1% versus 96.2%).
2. When the PDF distance measure was adopted, using LPC
coefficients was a good choice. The highest recognition rates,
98.1%, 86.5%, and 94.2%, for the three phoneme categories all came from
LPC coefficients with a filter order of 12 (also for voiced fricatives
with a filter order of 16). However, it is noted that the results from
the PDF distance measure were highly affected by the filter order.
5.2.3 Comparative Study Using Different Phonemes
The results of Figures 5.1, 5.2, 5.3, and 5.4 also indicated that vowels,
unvoiced fricatives, or voiced fricatives could each be used to objectively
classify a speaker's gender. As we have seen before for the vowel category,
reflection coefficients worked extremely well. The recognition rates reached
100% with filter orders of 12, 16, and 20. A 98.1% recognition rate could also
be accomplished by using LPC coefficients and the FFF. Surprisingly, the
cepstral distortion measure can be used to discriminate a speaker's gender from
unvoiced fricatives, and a 90.4% recognition rate was obtained. For the voiced
fricative category, a 98.1% recognition rate was achieved by using the LPC log
likelihood and cepstral distortion measures. Therefore, in terms of the most
effective phoneme category for gender recognition, the first preference would
be vowels, the second voiced fricatives, and the last unvoiced fricatives.


where A(k) is the area of the kth lossless tube. The reflection coefficient determines
the fraction of energy in a traveling wave that is reflected at each section boundary.
Further, r(i) is related to the PARCOR coefficient k(i) by (Rabiner and Schafer,
1978)
r(i) = k(i)
(4.14)
where k(i) can be obtained from LPC coefficients by recursion.
4.2.5 Fundamental Frequency and Formant Information
This set of features consists of frequencies, bandwidths and amplitudes of the
first, second, third and fourth formants and the fundamental frequencies (not for
fricatives). Formant information was obtained by a peak-picking technique, using
an FFT on the LPC coefficients. Fundamental frequency was calculated based upon
a modified cepstral algorithm.
4.3 Distance Measures
Several distance measures were considered.
4.3.1 Euclidean Distance
D_EUC = [ (X - Y)^t (X - Y) ]^(1/2)    (4.15)

where X and Y are the test and reference vectors, respectively, and t denotes
the transpose of the vector.


BIOGRAPHICAL SKETCH
Ke Wu was born in Guangzhou (Canton), China, and he received the Diploma
(equivalent to the Bachelor of Science degree) in mathematics from Zhongshan
(Sun Yatsen) University, Guangzhou, China, in December 1977. Upon his
graduation, he was employed by the Guangzhou Institute of Electronic Technology,
Academia Sinica, as a Research Assistant in the Computer Laboratory. He was
admitted to the Graduate School, University of Florida, to pursue graduate studies in
August 1984 and completed his master's degree in electrical engineering in
December 1985. Mr. Wu has been a Graduate Research Assistant since 1984 in Dr.
D. G. Childers' Mind-Machine Interaction Research Center and worked for IFAS
(Institute of Food and Agricultural Sciences, University of Florida) Computer
Network from May 1988 to May 1989. His current area of interest is digital signal
processing with applications to speech analysis, synthesis, and recognition
techniques by computer. Mr. Wu is scheduled to complete his Ph.D. degree in May,
1990.


section) was applied. The distance measure for each median layer template was
calculated with respect to each of the rest of the median layer templates, and the
minimum distance was found. The speaker gender of the test template was then
classified as male or female, according to the gender known for the reference
template. The above steps were repeated until all subjects were tested.
4.5 Resubstitution and Leave-One-Out Procedures
After the classifier is designed, it is necessary to evaluate its performance
relative to competing approaches. The error rate was considered as the performance
measure.
Four popular empirical approaches that count the number of errors when
testing the classifier with a test data set are (Childers, 1989):
The Resubstitution Estimate (Inclusive). In this procedure, the same data set
is used for both designing and testing the classifier. Experimentally and
theoretically this procedure gives a very optimistic estimate, especially when the
data set is small. Note, however, that when a large data set is available, this method
is probably as good as any procedure.
The Holdout Estimate. The data is partitioned into two mutually exclusive
subsets in this procedure. One set is used for designing the classifier and the other
for testing. This procedure makes poor use of the data since a classifier designed on
the entire data set will, on the average, perform better than a classifier designed on
only a portion of the data set. This procedure is known to give a very pessimistic
error estimate.
The Leave-One-Out Estimate (Exclusive). This procedure assumes that there
are n data samples available. Remove one sample from the data set. Design the
classifier with the remaining (n-1) data samples and then test it with the removed


Figure 1.3 A cross section of human vocal apparatus.


CHAPTER 5
RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS
Our results showed that most of the LPC-derived feature parameters
performed well for gender recognition. Among them, the reflection coefficient
combined with the Euclidean distance measure was the best choice for sustained
vowels (100%). While the cepstral distortion measure worked extremely well for
unvoiced fricatives, the LPC log likelihood distortion measure, the reflection
coefficient combined with the Euclidean distance, and the cepstral distortion
measure were good alternatives for voiced fricatives. Using the Euclidean
distance measure achieved better results than using the Probability Density
Function. Furthermore, the averaging techniques were very important in designing
appropriate test and reference templates and a filter order of 12 to 16 was sufficient
for most designs.
5.1 Coarse Analysis Conditions
Before we discuss in detail the performance assessments based on coarse
analysis, we briefly summarize the experimental conditions as follows:
Database
52 normal subjects: 27 male and 25 female
Phoneme group used: ten sustained vowels
five unvoiced fricatives
four voiced fricatives


Table 7.4
Two-way ANOVA statistical results

                    Main Effect of A     Main Effect of B     Effect of A X B
                        (Gender)             (Vowel)           (Interaction)
                    F-value  P-value     F-value  P-value     F-value  P-value
Fundamental
Frequency            226.3    <0.01        23.8    <0.01        1.43    >0.05
Formant
  F1                 111.6    <0.01       459.8    <0.01       13.4     <0.01
  B1                  41.8    <0.01        17.3    <0.01       10.3     <0.01
  A1                   0.48   >0.05        70.4    <0.01        8.01    <0.01
  F2                 195.9    <0.01       561.9    <0.01        4.92    <0.01
  B2                  46.2    <0.01         2.4    <0.05        2.17    <0.05
  A2                  34.0    <0.01        37.5    <0.01        2.57    <0.01
  F3                 203.8    <0.01       158.6    <0.01        2.89    <0.01
  B3                   3.2    >0.05         5.4    <0.01        2.05    <0.05
  A3                  58.1    <0.01        82.3    <0.01        2.63    <0.01
  F4                 196.2    <0.01        14.2    <0.01        1.89    >0.05
  B4                  17.2    <0.01         4.3    <0.01        2.33    <0.05
  A4                  52.3    <0.01        30.9    <0.01        1.68    >0.05

F-value: computed F ratio.  P-value: level of significance.


A variance, in the terminology of analysis of variance, is more frequently
called a mean square (MS). By definition
MS = variation (sum of squares) / degrees of freedom = SS / df    (6.16)
In other words, a mean square is the average variation per degree of freedom; this is
also a basic definition for variance.
The term degree of freedom (df) originates from the geometric representation
of problems associated with the determination of sampling distribution for statistics.
In this context, the term refers to the dimension of the geometric space appropriate
in the solution of the problem. More accurately,

df = (# of independent observations) - (# of linear restraints)    (6.17)
On the assumption that the groups or samples making up a total series of
measures are random samples from a common normal population, the two
estimates of the population variance may be expected to differ only within the limits
of random sampling. This null hypothesis may be tested by dividing the larger
variance by the smaller one to get the variance ratio F. If the value of F
equals or exceeds a certain value (usually tabled), then the null hypothesis
that the samples have been drawn from the same common normal population is
considered invalid. Therefore, the populations from which the samples have
been drawn may differ in
terms of either means or variances or both. If the variances are approximately the
same, it is the means that differ. This, basically, is the analysis of variance in its
simplest form (Edwards, 1964).
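The variance-ratio logic just described can be illustrated numerically; the sketch below (ours, on invented samples) computes the two mean squares, their ratio F, and the "tabled" critical value, which scipy supplies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 20)          # two random samples assumed to come
b = rng.normal(0.0, 1.0, 25)          # from a common normal population

ms_a = a.var(ddof=1)                  # each MS = SS / df, per Equation (6.16)
ms_b = b.var(ddof=1)
f = max(ms_a, ms_b) / min(ms_a, ms_b)             # variance ratio F
df1 = (len(a) if ms_a >= ms_b else len(b)) - 1    # df of the larger MS
df2 = (len(b) if ms_a >= ms_b else len(a)) - 1
f_crit = stats.f.ppf(0.95, df1, df2)  # the "tabled" value at the 0.05 level

print(f"F = {f:.2f}, critical F(0.05; {df1}, {df2}) = {f_crit:.2f}")
print("null hypothesis rejected" if f >= f_crit else "null hypothesis retained")
```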


Table B.4
Results from the exclusive recognition scheme 3
Euclidean distance measure
Filter order = 20

                        CORRECT RATE %
                     MALE    FEMALE    TOTAL
Sustained    ARC     85.2     80.0      82.7
Vowels       LPC     74.1     88.0      80.8
             RC     100.0    100.0     100.0
             CC      88.9     92.0      90.4
Unvoiced     ARC     74.1     76.0      75.0
Fricatives   LPC     74.1     68.0      71.2
             RC      81.5     80.0      80.8
             CC      85.2     80.0      82.7
Voiced       ARC     92.6     84.0      88.5
Fricatives   LPC     92.6     88.0      90.4
             RC      96.3     96.0      96.2
             CC      96.3     96.0      96.2


CHAPTER 1
INTRODUCTION
1.1 Automatic Gender Recognition
Human listeners are able to capture and categorize the information of acoustic
speech signals. Categories include those that contribute a linguistic message, those
that identify the speaker, and those that convey clues about the speaker's
personality, emotional state, gender, age, accent, and the status of his/her health.
Automatic speech and speaker recognition systems are far less capable than
human listeners. Computerized speaker recognition can be accomplished but only
under highly constrained conditions. The major difficulty is that the number of
significant parameters is unmanageably large and little is known about the acoustic
speech features, articulation differences, vocal tract differences, phonemic
substitutions or deletions, prosodic variations and other factors that influence our
recognition ability.
Therefore, more insight and systematic study of intrinsically effective speaker
discrimination features are needed. A series of smaller experiments should be done
so that the experimental results will be mutually supportive and will lead to overall
understanding of the combined effects of all the parameters that are likely to be
present in actual situations (Rosenberg, 1976; Committee on Evaluation of Sound
Spectrograms, 1979).
Unlike automatic speech and speaker recognition, automatic gender
recognition was never proposed as a stand-alone problem. Little attention was paid


Table 7.2 Formant characteristics of ten vowels for male/female speakers
(M = male mean, F = female mean; the row beneath each row of means lists
the corresponding standard errors)

            IY      I       E       AE      A       OW      U       OO      UH      ER
F1  M     302.9   438.5   541.6   645.0   673.0   614.6   486.7   341.5   590.9   477.4
   s.e.     8.2     7.8     8.0    11.9    11.2     9.1     9.7     6.3    10.5     9.4
    F     378.5   512.6   661.0   841.7   837.8   745.4   522.0   409.5   723.6   558.1
   s.e.     7.1    11.3    13.3    18.7    19.8    19.5    11.3     7.9    12.1     9.7
B1  M     133.8   135.8   132.6   144.7   154.2   147.1   141.2   134.3   138.4   133.1
   s.e.     3.8     4.1     3.2     4.8     7.0     5.6    11.0     4.3     3.9     2.9
    F     144.1   150.5   177.0   221.4   272.1   249.1   163.0   132.0   221.4   175.6
   s.e.     4.4     5.9    13.8    19.2    21.2    21.2     5.8     3.1    17.5     9.4
A1  M      15.4    15.7    16.5    16.0    21.6    22.4    22.1    26.1    21.1    24.6
   s.e.    0.76    0.78    0.53    0.60    0.61    0.61    0.79    1.03    0.71    0.85
    F      18.7    17.5    16.1    14.8    19.5    20.3    21.2    28.5    18.1    22.5
   s.e.    0.78    0.95    0.68    0.86    0.71    0.70    0.74    0.81    0.74    0.73
F2  M    2172.0  1837.0  1690.2  1621.9  1097.9   990.4  1168.1  1067.1  1194.2  1276.0
   s.e.    22.3    21.3    23.1    21.2    12.0    18.6    24.6    35.9    20.2    18.6
    F    2588.2  2196.8  2013.2  1932.7  1245.5  1190.2  1386.2  1361.2  1445.3  1503.7
   s.e.    48.7    42.6    29.3    25.7    24.6    33.9    24.6    49.1    18.4    16.1
B2  M     156.5   142.6   144.9   155.8   154.0   152.8   138.1   143.5   145.1   147.6
   s.e.     6.9     4.2     5.9     5.0     6.5     5.8     2.4     3.6     3.7     4.0
    F     198.6   199.4   188.4   182.9   226.9   209.1   196.5   191.1   233.7   166.3
   s.e.    15.2    11.9    10.4     9.0    10.0    23.7    17.7    13.1    20.8     8.0
A2  M      15.6    17.8    17.8    17.9    21.4    21.1    19.1    19.0    19.0    24.0
   s.e.    0.77    0.73    0.64    0.48    0.78    0.57    0.54    0.52    0.58    0.83
    F      10.9    12.3    13.8    15.3    18.7    20.6    16.5    15.4    15.5    23.2
   s.e.    1.24    0.80    0.88    0.71    0.70    1.06    0.82    0.84    0.71    0.72
F3  M    2851.3  2482.4  2456.1  2357.3  2457.4  2465.4  2307.2  2219.0  2401.1  1707.4
   s.e.    36.9    33.9    31.3    26.5    36.8    33.9    25.8    20.3    36.2    34.4
    F    3286.1  2995.9  2955.7  2981.6  2945.1  2853.0  2791.5  2729.8  2862.7  2024.1
   s.e.    36.0    35.3    38.0    50.9    63.1    53.6     3.0     7.6    47.0    59.0
B3  M     281.3   228.1   245.1   253.6   241.9   199.8   208.5   191.0   239.3   145.9
   s.e.    16.4    14.3    21.2    20.6    26.0    14.1    15.1    19.4    24.3     3.4
    F     218.8   242.3   274.5   281.5   237.3   226.7   240.5   273.9   280.7   181.2
   s.e.     8.6    17.4    29.3    21.1     8.1    19.2    20.5    26.2    25.2     9.9
A3  M      12.8    11.9    10.1    10.7     7.2     8.0     9.0     9.2     7.7    21.5
   s.e.    0.68    0.48    0.66    0.56    0.86    0.84    0.73    0.96    0.82    0.91
    F      11.7     7.9     5.9     4.7     0.1     4.8     4.2     2.5     3.5    17.4
   s.e.    0.54    0.55    0.58    0.68    0.77    0.73    0.70    0.76    0.64    1.08
F4  M    3572.7  3533.8  3511.4  3463.8  3463.6  3408.2  3359.4  3342.2  3423.4  3201.3
   s.e.    49.9    35.3    38.2    43.6    40.9    38.5    40.1    51.8    41.4    45.3
    F    4127.1  4265.6  4219.7  4146.7  3957.0  3922.5  3976.9  3976.5  4052.9  3888.4
   s.e.    69.1     4.7    55.1    52.6    45.4    55.3    56.8    55.1    81.7    83.3
B4  M     226.8   191.5   224.0   237.5   238.2   170.9   212.0   188.3   204.0   176.7
   s.e.    35.9    14.2    17.8    16.3    21.3     7.8    15.5    14.5    12.0     9.5
    F     282.2   287.5   347.3   343.3   210.8   237.2   253.4   232.2   284.9   253.0
   s.e.    24.0    25.6    40.0    24.6    12.3    26.7    23.7    15.7    25.9    28.2
A4  M      15.0     9.9     7.8     4.9     6.7     9.7     6.8     9.0     8.8     5.8
   s.e.    0.88    0.80    1.01    0.91    1.07    0.87    0.79    1.06    0.91    0.78
    F       9.2     2.7     0.4    -1.2     3.9     4.3     2.3     3.0     2.2     0.2
   s.e.    0.92    0.68    0.75    0.66    0.72    0.91    0.74    0.81    0.96    0.83


Table 5.1
Results from exclusive recognition schemes
with various filter orders and
the LPC log likelihood distortion measure

                           CORRECT RATE %
                    Order=8  Order=12  Order=16  Order=20
Sustained  Scheme 1   63.1     69.6      74.2      74.2
Vowels     Scheme 2   65.2     71.5      76.2      76.5
           Scheme 3   75.0     86.5      86.5      84.6
           Scheme 4   75.0     80.8      86.5      88.5
Unvoiced   Scheme 1   59.2     64.2      67.7      65.0
Fricatives Scheme 2   61.5     63.9      64.2      64.2
           Scheme 3   67.3     75.0      75.0      78.9
           Scheme 4   76.9     75.0      73.1      69.3
Voiced     Scheme 1   74.5     72.1      73.1      72.6
Fricatives Scheme 2   77.4     80.3      81.7      80.3
           Scheme 3   90.4     94.2      96.2      98.1
           Scheme 4   98.1     96.2      96.2      94.3


5.2.5 Comparative Study of Distance Measures
It is generally believed that the use of the EUC distance measure is not as
effective as the use of the PDF because there is no normalization of the
dimensions involved in the definition of the EUC: the dimension with the
largest values becomes the most significant. In contrast, the PDF approach has
such a normalization function through the covariance matrix computation. The
PDF approach gives unequal weighting to each element of a vector. It may
suppress the elements with large values but emphasize the elements with small
values, according to their importance in reducing the intragroup variation.
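The contrast can be sketched as follows. The covariance-normalized distance below stands in for the normalization that the PDF measure embodies; it is our illustration on invented reference data, not the dissertation's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
ref = rng.normal(0, [10.0, 0.1], size=(100, 2))   # wildly unequal scales
mu = ref.mean(axis=0)
w_inv = np.linalg.inv(np.cov(ref, rowvar=False))  # inverse covariance

x = np.array([5.0, 0.3])
d_euc = np.linalg.norm(x - mu)                    # large dimension dominates
d_pdf = float(np.sqrt((x - mu) @ w_inv @ (x - mu)))  # each dimension weighted

print(f"EUC = {d_euc:.2f}, covariance-normalized = {d_pdf:.2f}")
```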
However, the PDF approach did not work well in our experiments. By
observing Tables 5.3 and 5.4 as well as Figures 5.3 and 5.4, it can be seen
that the EUC outperformed the PDF.
First, out of 48 corresponding pairs of EUC and PDF recognition rates from
Tables 5.3 and 5.4, 32 EUC recognition rates were higher than those of the PDF.
Three of them were tied. Only 13 PDF recognition rates were higher than EUC
rates.
Second, as we have demonstrated in the sections above, performance using
PDF varied considerably with the filter order and there were severe performance
declines from the order of 16 to order of 20 for all three phoneme categories and all
acoustic parameters. On the other hand, the results of the EUC were relatively
consistent across all the filter orders that were tested, especially with filter orders of
12, 16, and 20.
Third, the two highest rates from Tables 5.3 and 5.4 for the three phoneme
groups were achieved using the EUC distance measure. The RC with the EUC yielded a 100%
recognition rate for sustained vowels with filter orders of 12, 16, and 20. The CC
with the EUC yielded a 98.1% recognition rate for voiced fricatives with filter orders
of 12 and 16. Even for unvoiced fricatives, the highest recognition rates for


to either the theoretical basis or the practical techniques for the realization of a
system for the automatic recognition of gender from speech. Although
contemporary research on speech included investigation of physiological and
acoustic gender features and their correlation with perceived gender differences, no
attempt was made to classify the speaker's gender objectively, using features
automatically extracted by a computer. Childers and Hicks (1984) first proposed
such a study as a separate recognition task and, thus, this research resulted from that
proposal. A possible realization of such a system is shown in Figure 1.1.
1.2 Application Perspective
The significance of the proposed research is as follows:
o Accomplishing this task could facilitate speech recognition and
speaker identification or verification by reducing the required search
space to half. Such a pre-process may occur in the listening process
of human beings. One of the speech perception hypotheses proposed
by O'Kane (1987) stated that human listeners have to determine the
gender of the speaker first in order to determine the identity of the
sounds. Another perception hypothesis is that the identity of the
sounds can be roughly determined without knowledge of the
speaker's gender, but final recognition is possible only after the
speaker's gender is known. In both cases, identification of the
speaker's gender is a necessary step before recognition of sounds.
o Accomplishing this task could be useful for speech synthesis. It is
well known that in synthesized speech, the female voice has not been
reproduced with the same level of success as the male voice (Monsen
and Engebreston, 1977). Further study of gender cues would


5.2.7 Variability of Female Voices
Our results also showed that the performance of various feature vectors
combined with various distance measures was generally better for male subjects
than for female subjects. This can be seen by investigating those recognition rate pairs
for male/female subjects with the recognition Scheme 3 and the exclusive procedure
from all tables in Appendix A and B. There are 109 such pairs in total.
Statistics were obtained based on this data. Figure 5.6 illustrates the results.
It can easily be seen that the recognition rates for males showed a higher mean, a much
greater minimum, and a smaller standard deviation than those for females. This
suggested that female features appeared to have higher variations than male
features. It was also noticed that, out of 109 male versus female pairs, only 25
recognition rates for females were higher than those for males. Three pairs were
equal (100% recognition rates were achieved by both male and female subjects). All
of them were obtained using reflection coefficients with sustained vowels.
To further confirm the above, the Wilcoxon signed-ranks test and the paired
samples t-test (Ott, 1984) were also performed and the results are presented in
Tables 5.5(a) and 5.5(b), respectively. Both tests indicated that there do exist
statistically significant differences between the male and female recognition rates.
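Such paired comparisons can be run with standard routines, as in the sketch below; the rate pairs shown are placeholders rather than the 109 pairs from Appendices A and B:

```python
import numpy as np
from scipy import stats

# Placeholder male/female recognition-rate pairs (percent correct).
male_rates = np.array([96.3, 88.9, 92.6, 81.5, 92.6, 85.2])
female_rates = np.array([92.0, 84.0, 88.0, 68.0, 80.0, 80.0])

t_stat, t_p = stats.ttest_rel(male_rates, female_rates)     # paired t-test
w_stat, w_p = stats.wilcoxon(male_rates, female_rates)      # signed-ranks

print(f"paired t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Wilcoxon signed-ranks: W = {w_stat:.1f}, p = {w_p:.3f}")
```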
5.3 Comparative Study of Acoustic Parameters
Using Fisher's Discriminant Ratio Criterion
As reviewed in Section 4.6 of Chapter 4, J4 calculates the ratio of the variance
within classes and the variance between classes directly, and J1, the Mahalanobis
distance, measures the divergence between two classes in feature space. In addition,
the expected probability of misclassification (PE) can be computed from J1 in a


[Design diagram: factor B (vowel) crossed with Subjects, which are nested in factor A (gender).]
where
* indicates a crossing relationship,
o indicates a nesting relationship.
In this kind of experiment, comparisons between different levels of factor A
involve differences between groups as well as differences associated with factor A.
On the other hand, comparisons between different levels of factor B at the same
level of A do not involve differences between groups. Since measurements included
in the latter comparisons are based upon the same elements, main effects associated
with such elements tend to cancel. For the latter comparisons, each element serves
as its own control with respect to such main effects.
Table 6.1 shows the partition of the total variation for this type of experiment.
Appropriate denominators for F ratios to be used in making statistical tests are
indicated by the expected values of the mean squares. To test the hypothesis that
there is no significant variation between different levels of Factor A, the appropriate
F ratio is
F = MS_A / MS_subj w. groups    (6.18)
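As a purely numeric sketch of Equation (6.18) (ours; the data matrix is invented, with 27 "male" and 25 "female" subjects by ten vowels), the two mean squares and their ratio can be formed as follows:

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented data: rows are subjects, columns the 10 vowels; the first 27
# subjects form one gender group, the remaining 25 the other.
data = np.vstack([rng.normal(500.0, 40.0, (27, 10)),
                  rng.normal(620.0, 55.0, (25, 10))])
groups = np.array([0] * 27 + [1] * 25)
n_b = data.shape[1]
grand = data.mean()

# SS_A: deviation of each gender mean from the grand mean, weighted by the
# number of observations carried by that gender; df = 2 - 1.
ss_a = sum(n_b * (groups == g).sum() * (data[groups == g].mean() - grand) ** 2
           for g in (0, 1))
ms_a = ss_a / (2 - 1)

# SS for subjects within groups: subject means around their gender mean;
# df = (number of subjects) - (number of groups).
ss_sw = sum(n_b * (data[i].mean() - data[groups == groups[i]].mean()) ** 2
            for i in range(len(data)))
ms_sw = ss_sw / (len(data) - 2)

print(f"F = MS_A / MS_subj.w.groups = {ms_a / ms_sw:.1f}")   # Equation (6.18)
```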
The mean square in the denominator of the above F ratio is sometimes designated
MS_error (between). To test the hypothesis that there is no significant variation between
different levels of Factor B, the appropriate F ratio is


I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
Donald G. Childers, Chairman
Professor of Electrical Engineering
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
Jack R. Smith
Professor of Electrical Engineering
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
A. Antonio Arroyo
Associate Professor of Electrical Engineering
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
J. C. Principe
Professor of Electrical Engineering
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
Howard B. Rothman
Professor of Speech


[Figure: correct recognition rate (%) versus filter order (8, 12, 16, 20), with separate curves for the ARC, LPC, RC, and CC parameters.]
Figure 5.3 Results from the exclusive recognition scheme 3 with various
filter orders, acoustic parameters, and the Euclidean distance measure.


CHAPTER 2
APPROACHES TO GENDER RECOGNITION FROM SPEECH
2.1 Overview of Research Plan
The goal of this study was to explore the possible effectiveness of digital
speech processing and pattern recognition techniques for an automatic gender
recognition system from speech. In order to do this, some hypotheses concerning
acoustic parameters that affect our ability to distinguish a speaker's gender
needed to be validated and clarified.
Thus, this study was divided into two directions as illustrated in Figure 2.1.
One direction was called coarse analysis since it applied classical pattern recognition
techniques and asynchronous linear prediction coding (LPC) analysis of speech.
The specific goal of this direction was to develop and test candidate algorithms for
achieving gender recognition rapidly using only a brief speech data record.
The second research direction covered fine analysis since pitch synchronous
closed-phase analysis was utilized to obtain accurate vowel characteristics for each
gender. The specific aim of this direction was to compare the relative significance of
vowel characteristics for gender discrimination.
2.2 Coarse Analysis
The tool we used in this direction was asynchronous LPC analysis. The
advantages of using this technique are


Rosenberg, A. E. (1976). Automatic speaker verification: a review, Proc. IEEE,
Vol. 64, 475-487.
Rothenberg, M. (1981). Acoustic interaction between the glottal source and the
vocal tract, in Vocal Fold Physiology, K. N. Stevens and M. Hirano, Eds.,
University of Tokyo Press, Tokyo, 305-328.
Saxman, J., and Burk, K. (1967). Speaking fundamental frequency characteristics
of middle-aged females, Folia Phoniatrica, Vol. 19, 167-172.
Schwartz, M. F. (1968). Identification of speaker sex from isolated, voiceless
fricatives, J. Acoust. Soc. of Am., Vol. 43, 1178-1179.
Schwartz, M. F., and Rie, H. E. (1968). Identification of speaker sex from
isolated, whispered vowels, J. Acoust. Soc. Am., Vol. 44, 1736- 1737.
Shipp, F. T., and Hollien, H. (1969). Perception of aging male voice, Journal of
Speech and Hearing Research, Vol. 12, 703-710.
Singh, S., and Murry, T. (1978). Multidimensional classification of normal voice
qualities, J. Acoust. Soc. Am., Vol. 64(1), 81-87.
Stoicheff, M. (1981). Speaking fundamental frequency characteristics of
non-smoking female adults, Journal of Speech Hearing Research, Vol. 24,
437-441.
Ting, Y. T. (1989). Adaptive estimation of time-varying signal parameters with
applications to speech, Ph.D. dissertation, University of Florida, Gainesville.
Ting, Y. T., Childers, D. G., and Principe, J. C. (1988). Tracking spectral
resonances, Fourth Annual ASSP Workshop on Spectrum Estimation and
Modeling, Minneapolis, 49-54.
Titze, I. R. (1987). Physiology of the female larynx, 114th Meeting of Acoustical
Society of America, J. Acoust. Soc. Am. Sup. 1, Vol. 82, S90.
Titze, I. R. (1989). Physiologic and acoustic differences between male and female
voices, J. Acoust. Soc. Am., Vol 85(4), 1699-1707.
Tou, J. T., and Gonzalez, R. C. (1974). Pattern Recognition Principles,
Addison-Wesley, Reading, MA.
Winer, B. J. (1971). Statistical Principles in Experimental Design, McGraw-Hill,
New York.
Wu, K. (1985). A flexible speech analysis-synthesis system for voice conversion,
Masters Thesis, University of Florida, Gainesville.


This can be related to more familiar material as follows. If the covariance
matrices of the two classes are equal, so that W_i and W_k can be replaced by an
average covariance matrix W, then the first term vanishes and the divergence
reduces to

D_ik = trace [ W^(-1) (u_i - u_k)(u_i - u_k)^t ] = (u_i - u_k)^t W^(-1) (u_i - u_k)

The term (u_i - u_k)(u_i - u_k)^t is the between-class covariance matrix B;
hence, in this case, D_ik is the separability measure J1 = trace (W^(-1) B).

Notice that D_ik or J1 is the Mahalanobis distance. This distance is related to
the approximation of the expected probability of error (PE) by Lachenbruch (1968),
Achariyapaopan and Childers (1983), and Childers (1986). If p is the dimension of
the feature vector, n1 and n2 are the sample sizes for classes 1 and 2, and Phi(z)
is the standard normal distribution function defined as

Phi(z) = (2 pi)^(-1/2) * integral from -infinity to z of exp(-t^2 / 2) dt    (4.34)

PE can be written as

PE = 0.5 Phi[-(a - b)] + 0.5 Phi[-(a + b)]    (4.35)

where

a = C J1 / { 2 [ J1 + p (n1 + n2)/(n1 n2) ]^(1/2) }    (4.36)


Table B.6
Results from the exclusive recognition scheme 3
PDF (Probability Density Function) distance measure
Filter order = 12

                        CORRECT RATE %
                     MALE    FEMALE    TOTAL
Sustained    ARC     92.6     76.0      84.6
Vowels       LPC    100.0     96.0      98.1
             FFF     96.3     96.0      96.2
             RC     100.0     96.0      98.1
             CC     100.0     88.0      94.2
Unvoiced     ARC     81.5     48.0      65.4
Fricatives   LPC     96.3     76.0      86.5
             RC      66.7     80.0      73.1
             CC      77.8     68.0      73.1
Voiced       ARC     92.6     80.0      86.5
Fricatives   LPC    100.0     88.0      94.2
             RC      96.3     84.0      90.4
             CC      92.6     92.0      92.3


REFERENCES
Achariyapaopan, T., and Childers, D. G. (1983). On the optimum number of
features in the classification of multivariate normal distribution data, Report,
University of Florida, Gainesville.
Ananthapadmanabha, T.V., and Fant, G. (1982). Calculation of true glottal flow
and its components, Speech Transmission Lab., Rep., STL/QPSR, 1/1982, Royal
Institute of Technology, Stockholm, Sweden, 1-30.
Atal, B. S. (1974a). Linear prediction of speech: recent advances with
applications to speech analysis, in Speech Recognition, Invited Papers Presented at
the 1974 IEEE Symposium, D. R. Reddy, Ed., Academic Press, New York, 221-227.
Atal, B. S. (1974b). Effectiveness of linear prediction characteristics of the speech
wave for automatic speaker identification and verification, J. Acoust. Soc. Am.,
Vol. 55(6), 1304-1312.
Atal, B. S. (1976). Automatic recognition of speakers from their voices, Proc.
IEEE, Vol. 64, 460-475.
Atal, B. S., and Hanauer S. L. (1971). Speech analysis and synthesis by linear
prediction of the speech wave, J. Acoust. Soc. Am., Vol. 50(2), 637-655.
Atal, B. S., and Schroeder, M. R. (1970). Adaptive predictive coding of speech
signals, Bell Syst. Tech. J., Vol. 49(6), 1973-1986.
Berouti, M. G., Childers, D. G., and Paige, A. (1977). A correction of tape recorder
distortion, Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing, 397-400.
Bladon, A. (1983). Acoustic phonetics, auditory phonetics, speaker sex and speech
recognition: a thread, Chapter 2, in Computer Speech Processing, F. Fallside and
A. Woods, Eds., Prentice-Hall, Englewood Cliffs, New Jersey, 29-38.
Bralley, R., Bull, G., Gore, C., and Edgerton, M. (1978). Evaluation of vocal pitch
in male transsexuals, Journal of Communication Disorders, Vol. 11, 443-449.
Brown, W. S., and Feinstein, S. H. (1977). Speaker sex identification utilizing a
constant source, Folia Phoniatrica, Vol. 29, 248-249.


To my parents
and to my wife


features, including frequencies, bandwidths, and amplitudes, were extracted by a
closed-phase Weighted Recursive Least Squares with Variable Forgetting Factor
method. The electroglottograph signal was used to locate the closed-phase portion
of the speech signal. A two-way Analysis of Variance statistical analysis was
performed to test the difference between two gender features, and the relative
importance of grouped vowel features was evaluated by a pattern recognition
approach.
The results showed that most of the LPC derived acoustic parameters worked
very well for automatic gender recognition. A within-gender and within-subject
averaging technique was important for generating appropriate test and reference
templates. The Euclidean distance measure appeared to be the most robust as well
as the simplest of the distance measures.
The statistical test indicated steeper spectral slopes for female vowels. Results
suggested that redundant gender information was imbedded in the fundamental
frequency and vocal tract resonance. Features of female voices were observed to
have higher within-group variations than those of male voices.
In summary, this study demonstrated the feasibility of an efficient gender
recognition system. The importance of this system is that it would reduce the search
space of speech or speaker recognition in half. The knowledge gained from this
research might benefit the generation of synthetic speech with a desired male or
female voice quality.
viii


Table A.2
Results from exclusive recognition schemes
LPC distance measure (Log likelihood)
Filter order = 12

                          CORRECT RATE %
                       MALE    FEMALE    TOTAL
Sustained   Scheme 1   65.2     74.4      69.6
Vowels      Scheme 2   69.3     74.0      71.5
            Scheme 3   88.9     84.0      86.5
            Scheme 4   77.8     84.0      80.8
Unvoiced    Scheme 1   80.7     46.4      64.2
Fricatives  Scheme 2   68.9     58.4      63.9
            Scheme 3   77.8     72.0      75.0
            Scheme 4   66.7     84.0      75.0
Voiced      Scheme 1   64.8     80.0      72.1
Fricatives  Scheme 2   75.9     85.0      80.3
            Scheme 3   92.6     96.0      94.2
            Scheme 4   92.6    100.0      96.2


[Figure: correct recognition rate (%) versus filter order (8, 12, 16, 20), with separate curves for the RC, LPC, CC, and ARC parameters.]
Figure 5.4 Results from the exclusive recognition scheme 3 with various
filter orders, acoustic parameters, and the PDF distance measure.


[Figure: vowel triangles plotted as second formant frequency (Hz) versus first formant frequency (Hz), with one triangle for male and one for female speakers.]
Figure 7.8 Vowel triangles for male and female speakers.


[Figure: (a) two reflection coefficient templates plotted by vector element (1-12); (b) the corresponding spectra.]
Figure 4.6 (a) Two reflection coefficient templates of vowels
for male and female speakers in the upper layer.
(b) The corresponding spectra.


cutting the search space in half. The synthesis of high quality speech would benefit
since acoustic features for synthesizing speech for either gender would be identified.
We presumed that the research results would provide new guidelines for future
research to develop qualitative measures of speech quality and to develop new
methods to identify acoustic features related to dialect and speaking style. Finally,
the research also has potential clinical and law enforcement applications.
The proposed study followed two directions. One direction was called coarse
analysis since it used classical pattern recognition techniques and asynchronous
linear prediction coding (LPC) analysis of speech. Acoustic parameters such as
autocorrelation, LPC, cepstrum, and reflection coefficients were derived to form test
and reference templates. The effects of using different distance measures, filter
orders, recognition schemes, and phonemes were comparatively assessed.
Comparisons of acoustic parameters using Fisher's discriminant ratio criterion
were also conducted.
The second research direction covered fine analysis since pitch synchronous
closed-phase analysis was utilized to obtain accurate vowel characteristics for each
gender. Detailed formant features, including frequencies, bandwidths and
amplitudes, were extracted by a closed-phase WRLS-VFF (Weighted Recursive
Least Squares with Variable Forgetting Factor) method. The electroglottograph
(EGG) signal was used to locate the closed-phase portion of the speech signal. A
two-way ANOVA (Analysis of Variance) statistical analysis was performed to test
the difference between two gender features, and the relative importance of grouped
vowel features was evaluated by means of a pattern recognition approach.
The database consisted of 52 normal subjects, 27 males and 25 females. For
each subject, ten sustained vowels, five unvoiced fricatives, and four voiced
fricatives were processed during the experiments. From each utterance,
approximately 150 ms of the speech signal was used for the experiments.


FREQUENCY OF SECOND FORMRNT (Hz)
140
FORMfiNT POSITIONS OF FEMOLE SPEAKERS
Figure 7.7 The scatter plot of the first versus second formant
frequencies for ten vowels of female speakers.


[Figure: block diagram interpreting the measure, with the test speech as input.]
Figure 4.3 An interpretation of the LPC log likelihood
distance measure.


Childers, D. G., Wu, K., and Hicks, D. M. (1987). Factors in voice quality: acoustic
features related to gender, Proc. IEEE International Conference on Acoustics,
Speech, and Signal Processing, Vol. 1, 293-296.
Childers, D. G., Wu, K., Hicks, D. M., and Yegnanarayana, B. (1989). Voice
conversion, Speech Communication Vol. 8, 147-158.
Childers, D. G., Yegnanarayana, B., and Wu, K. (1985b). Voice conversion:
factors responsible for quality, Proc. IEEE International Conference on Acoustics,
Speech, and Signal Processing, Vol. 2, 748-751.
Coleman, R. O. (1971). Male and female voice quality and its relationship to vowel
formant frequencies, Journal of Speech and Hearing Research, Vol. 14, 566-577.
Coleman, R. O. (1973a). A comparison of the contributions of two vocal
characteristics to the perception of the maleness and femaleness in the voice, Paper
presented at the Annual Convention of the American Speech and Hearing
Association, Detroit.
Coleman, R. O. (1973b). Speaker identification in the absence of inter-subject
differences in glottal source characteristics, J. Acoust. Soc. Am., Vol. 53,
1741-1743.
Coleman, R. O. (1976). A comparison of the contributions of two voice quality
characteristics to the perception of maleness and femaleness in the voice, Journal
of Speech and Hearing Research, Vol. 19, 168-180.
Committee on Evaluation of Sound Spectrograms. (1979). On the theory and
practice of voice identification, National Academy of Sciences Report,
Washington, DC.
Cowan, C. F. N., and Grant, M. (1985). Adaptive Filters, Chapter 5, Prentice-Hall,
Englewood Cliffs, New Jersey.
Davis, S., and Mermelstein, P. (1980). Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences, IEEE Trans.
Acoust., Speech, and Signal Processing, Vol. 28(4), 357-366.
Edwards, L. A. (1964). Statistical Methods for the Behavioral Sciences, Holt,
Rinehart and Winston, New York.
Eskenazi, L. (1989). Acoustic correlates of voice quality and distortion measures
for speech processing, Ph.D. dissertation, University of Florida, Gainesville.
Fant, G. (1966). A note on vocal tract size factors and non-uniform F-pattern
scaling, Speech Transmission Lab., Rep., STL/QPSR, 1/1966, Royal Institute of
Technology, Stockholm, Sweden, 22-30.


sustained vowels, the lowest values of J4 (0.14) and J1 (2.81) and the
highest expected probability of error (0.27) appeared in the category
of unvoiced fricatives. The values of J4 and J1 and the expected
probabilities of error of the features in the category of voiced
fricatives were between those in the categories of vowels and unvoiced
fricatives. The experimental error rates also showed the same
pattern across the three categories, so that the expected performance of
the gender features matched the empirical performance of the
features in terms of the different phoneme categories. This conclusion
was also true for the results with a filter order of 20 (Table 5.7).
3. It was also observed from Table 5.6 that all acoustic features for
vowels demonstrated much lower experimental error rates than the
corresponding expected probabilities of error. All acoustic features
for voiced fricatives also showed experimental error rates below the
corresponding expected probabilities of error, but by a smaller
margin. On the other hand, most acoustic features for unvoiced
fricatives possessed higher experimental error rates than the
corresponding expected probabilities of error. For example, the
differences between the expected probabilities of error and the
corresponding experimental error rates for the CC of the three
phoneme categories were 0.19 - 0.06 = 0.13, 0.16 - 0.08 = 0.08, and
0.24 - 0.27 = -0.03, respectively. Thus, the noisier the speech
signals, the smaller the differences between the experimental error
rates and the corresponding expected probabilities of error.
One explanation for these differences is that we used different
(though similar) models for computing the expected probabilities of
error and the experimental error rates. To compute an expected


2. Averaging techniques seemed more crucial than clustering
techniques. Recognition theory states that choosing several
clustering centers for the same reference group should increase the
correct classification rate, because intragroup variations are taken
into account. In Schemes 1 and 4, multi-clustering reference centers
were formed. However, the theory functioned well in Scheme 4 but
inadequately in Scheme 1, although in Scheme 1, a number of
clustering centers were selected for the reference of the same
gender. Furthermore, Scheme 3 was a further simplified version of
Scheme 2 and Scheme 4. Instead of using each test vowel of the
c
subject as in Scheme 2, only a single test template was employed for
each test subject and a single reference template for each reference
gender. But the results were almost as good as those achieved by
using Scheme 4. In Scheme 2, averaging was performed over the
reference template but not over the test template. The correct
recognition rates were low. Thus, the results suggested the
importance of averaging techniques. To a great extent, averaging on
both test and reference templates eliminated the intrasubject
variation or diversity within different vowels or fricatives of a given
speaker, but on the other hand, emphasized features representing
this speakers gender.
3. Since the averaging was applied to the acoustic parameters extracted
from different phonemes uttered by a speaker or by speakers of the
same gender, and the phonemes were produced at different times,
the averaging is essentially a time-averaging technique.
Therefore, we may reasonably deduce that gender information
is time-invariant, phoneme independent, and speaker independent.


closed form. J4, J1, and PE can all be used to describe or indicate, using slightly
different criteria, the separability of a given feature.
Table 5.6 shows the estimated values of J4 and J1, the expected probabilities of
error, and the experimental error (misclassification) rates for various acoustic
features derived in the coarse analysis. The training pools consisted of median layer
templates of 27 subjects for the male group and 25 for the female group. Each of the
features ARC, LPC, FFF, RC, and CC in three phoneme categories was investigated.
The vector dimension p was equal to the filter order of 12. The experimental error
rates were obtained by simply subtracting correct recognition rates from unity,
where the correct recognition rates are listed in Table 5.4, in which the recognition
Scheme 3 and the PDF distance measure were employed. Table 5.7 presents the
results for the same parameters as in Table 5.6 but with a filter order of 20.
1. By observing Table 5.6, it was noted that while the LPC of vowels
produced the highest value of J1 (10.9), the FFF of vowels reached
the highest value of J4 (3.35) and the second highest value of J1
(10.4). Besides, the RC of vowels gave the second highest value of J4
(0.63) and the third highest value of J1 (9.29). They had the lowest
expected probabilities of error, which were 0.09, 0.09, and 0.11,
respectively. The experimental error rates using these features were
also the lowest (0.02, 0.04, and 0.02), indicating that J4, J1, and PE,
generally speaking, provided appropriate measures to predict the
performance of a feature for gender recognition. Of course, one
exception is that the LPC of vowels only yielded a value of 0.16 for
J4, which was far smaller than those of the FFF and RC. Thus J4 of
the LPC did not provide a good prediction.
2. While the highest values of J4 (3.35) and J1 (10.9) and the lowest
expected probability of error (0.09) appeared in the category of


b = C p (n2 - n1) / [ n1 n2 ( J1 n1 n2 + p (n1 + n2) ) ]^(1/2)    (4.37)

C = { (n1 + n2 - p - 2)(n1 + n2 - p - 5) / [ (n1 + n2 - 3)(n1 + n2 - p - 3) ] }^(1/2)    (4.38)
For fixed training sample sizes n1 and n2 and vector dimension p, PE decreases
as the Mahalanobis distance J1 increases.
In the coarse analysis stage, the estimated J1, J4, and expected probabilities of
error of the acoustic features ARC, LPC, FFF, RC, and CC, which were derived
from the male and female groups (i.e., classes) in three phoneme categories, were
studied. Training sample sizes were 27 (n1) for the male group and 25 (n2) for the
female group, since median layer templates (one for each subject) were used to
constitute the training sample pools. The feature vector dimension p was equivalent
to the filter order selected. For each of the acoustic features ARC, LPC, FFF, RC,
and CC in each of the three phoneme categories, the estimated J1, J4, and expected
probability of error were computed as follows (see the sketch after this list):
(1) Estimate the within-gender covariance matrix W_i for each gender
using Equation (4.28).
(2) Compute the pooled (averaged) within-gender covariance matrix
W using Equation (4.29).
(3) Estimate the between-gender covariance matrix B using Equation
(4.30).
(4) Obtain the values of J4 and the Mahalanobis distance J1 from
matrices W and B using Equations (4.32) and (4.31).
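The sketch below (ours, with random stand-in feature data) runs through steps (1)-(4) and then Equations (4.35)-(4.38). Since Equations (4.28)-(4.32) themselves are not reproduced on this page, J4 is taken here as the trace ratio trace(B)/trace(W), one common form of the Fisher criterion:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n1, n2, p = 27, 25, 12
male = rng.normal(0.0, 1.0, (n1, p))     # stand-in median layer templates
female = rng.normal(0.6, 1.0, (n2, p))

w = 0.5 * (np.cov(male, rowvar=False)          # steps (1)-(2): within-gender
           + np.cov(female, rowvar=False))     # covariances pooled into W
diff = male.mean(axis=0) - female.mean(axis=0)
b = np.outer(diff, diff)                       # step (3): between-gender B

j1 = float(diff @ np.linalg.inv(w) @ diff)     # step (4): Mahalanobis J1
j4 = float(np.trace(b) / np.trace(w))          # assumed form of J4

# Equations (4.36)-(4.38), then PE from Equation (4.35).
c = np.sqrt((n1 + n2 - p - 2) * (n1 + n2 - p - 5)
            / ((n1 + n2 - 3) * (n1 + n2 - p - 3)))
alpha = c * j1 / (2 * np.sqrt(j1 + p * (n1 + n2) / (n1 * n2)))
beta = (c * p * (n2 - n1)
        / np.sqrt(n1 * n2 * (j1 * n1 * n2 + p * (n1 + n2))))
pe = 0.5 * norm.cdf(-(alpha - beta)) + 0.5 * norm.cdf(-(alpha + beta))
print(f"J1 = {j1:.2f}, J4 = {j4:.2f}, expected error PE = {pe:.3f}")
```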


representation of speech such as LPC, cepstrum and reflection
coefficients in coarse analysis might be a more appropriate approach
for automatic gender recognition.
6. The results also showed that most female recognition rates were
lower than those for males. The female recognition rates were
higher than the male rates in only 9 sets out of a total of 31 sets for the above
experimental recognition rates. This observation suggested again
that the features of female voices appeared to have higher
within-group variations than those for males.
7. From our work we found that different strategies could be used to
accomplish the highest recognition rates for each gender. For
example, we found that a 100% recognition rate for males could be
achieved by selecting the fundamental frequency, or the second
formant bandwidth, or the third formant position with the EUC
distance measure. On the other hand, a 100% recognition rate for
females could be obtained by selecting the second formant position
with the EUC. Since each of the above cases uses only a single
feature and the EUC was the simplest distance measure, these
combinations were the best strategies for male and female
recognition. Other combinations can also be used to achieve 100%
recognition rates but they would require more than one features and
in some cases using the PDF distance measure.
7.3 Conclusion
Conclusions from the fine analysis can be summarized as follows:
1. Both fundamental frequency and formant characteristics were
reliable indicators in gender discrimination. The fundamental


4.2.2 LPC coefficients
LPC coefficients are defined conventionally as (Rabiner and Schafer, 1978)

s~(n) = sum over k = 1, ..., p of a_k s(n - k)    (4.11)

where s(n - k) is the (n - k)th speech sample, s~(n) is the nth predicted output,
and a_k is the kth LPC coefficient. LPC coefficients are determined by minimizing
the short-time average prediction error.
4.2.3 Cepstrum Coefficients
Cepstral coefficients can be obtained by the following recursive formula
(Rabiner and Schafer, 1978):

c_0 = 0
c_n = a_n + sum over k = 1, ..., n-1 of (k/n) c_k a_{n-k},  1 <= n <= p    (4.12)

where a_k is the kth LPC coefficient.
4.2.4 Reflection Coefficients
If we consider a model for speech production that consists of a concatenation
of N lossless acoustic tubes, then the reflection coefficients are defined as (Rabiner
and Schafer, 1978)
r(k) = [ A(k+1) - A(k) ] / [ A(k+1) + A(k) ]    (4.13)
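The two conversions just defined can be sketched as follows (ours, illustrative only). The step-down recursion below is the standard one alluded to under Equation (4.14) for recovering PARCOR coefficients from LPC coefficients, and the example coefficient set is arbitrary:

```python
import numpy as np

def lpc_to_cepstrum(a, n_cep):
    """Equation (4.12): c_0 = 0; c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(n_cep + 1)
    for n in range(1, n_cep + 1):
        a_n = a[n - 1] if n <= p else 0.0
        c[n] = a_n + sum((k / n) * c[k] * a[n - k - 1]
                         for k in range(max(1, n - p), n))
    return c[1:]

def lpc_to_reflection(a):
    """Step-down recursion: k_i = a_i^(i);
       a_j^(i-1) = (a_j^(i) + k_i a_{i-j}^(i)) / (1 - k_i^2)."""
    a = list(a)
    ks = []
    for i in range(len(a), 0, -1):
        k = a[-1]
        ks.append(k)
        a = [(a[j] + k * a[i - 2 - j]) / (1.0 - k * k) for j in range(i - 1)]
    return list(reversed(ks))

a = [1.3, -0.65, 0.2]        # arbitrary stable LPC coefficient set
print(lpc_to_cepstrum(a, 4))
print(lpc_to_reflection(a))  # PARCOR coefficients k(1), k(2), k(3)
```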


where lambda is a constant forgetting (weighting) factor (FF) which progressively
decreases the weight of the past estimation errors.
The estimated coefficient vector x_k that minimizes the error criterion can be
obtained by the well-known WRLS equations (Morikawa and Fujisaki, 1982),
namely,

x_k = x_{k-1} + K_k e_k
K_k = P_{k-1} H_k (lambda + H_k^t P_{k-1} H_k)^(-1)
P_k = lambda^(-1) P_{k-1} - lambda^(-1) K_k H_k^t P_{k-1}
e_k = s_k - H_k^t x_{k-1}

WRLS-VFF algorithm. When dealing with time-varying signals (e.g.,
speech), using a variable forgetting factor enables the parameter estimates to follow
sudden changes in the signal. For a locally stationary speech production model, the
a posteriori error at each time k indicates the state of the estimator. If the error
signal is small, then the FF should be close to unity; thus, the algorithm uses most of
the previous information in the signal. If, on the other hand, the error is large then a
small FF will decrease the error. This decrease in weighting of the error signal
shortens the effective memory length of the estimation process until the parameters
are readjusted and the error becomes small. A procedure to achieve the proper error
weighting by choosing the optimal FF is discussed here. The error information of
the filter can be defined as the weighted sum of the squares of a posteriori errors;
this can be expressed recursively as (Fortescue et al., 1981)
S_k = lambda_k S_{k-1} + e_k^2 / (1 + H_k^t P_{k-1} H_k)    (6.11)
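A minimal sketch of the recursion follows, assuming a scalar AR(p) speech model. The particular variable-forgetting-factor rule below (shorten the memory when the error-information term of Equation (6.11) grows, in the spirit of Fortescue et al., 1981) is a common simplification, not necessarily the dissertation's exact rule:

```python
import numpy as np

def wrls_vff(s, p=4, sigma0=1e-2, lam_min=0.9):
    """Track AR(p) coefficients of signal s with a variable forgetting factor."""
    x = np.zeros(p)                      # estimated coefficient vector
    P = 1e3 * np.eye(p)                  # inverse-correlation matrix P_k
    coeffs = np.zeros((len(s), p))
    for k in range(p, len(s)):
        h = s[k - p:k][::-1]             # regressor H_k of the p past samples
        e = s[k] - h @ x                 # a priori error e_k
        info = e * e / (1.0 + h @ P @ h) # error-information term of Eq. (6.11)
        # Assumed VFF rule: large errors shrink lambda, small keep it near 1.
        lam = np.clip(1.0 - info / sigma0, lam_min, 1.0)
        K = P @ h / (lam + h @ P @ h)    # gain K_k
        x = x + K * e                    # coefficient update
        P = (P - np.outer(K, h @ P)) / lam
        coeffs[k] = x
    return coeffs

# Example: a synthetic signal whose resonance jumps mid-stream.
t = np.arange(400)
sig = np.sin(2 * np.pi * np.where(t < 200, 0.05, 0.12) * t)
print(wrls_vff(sig)[[150, 350]])         # estimates before/after the jump
```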


Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH
By
Ke Wu
May, 1990
Chairman: D. G. Childers
Major Department: Electrical Engineering
The purpose of this research was to investigate the potential effectiveness of
digital speech processing and pattern recognition techniques in the automatic
recognition of gender from speech. Some hypotheses concerning acoustic
parameters that may influence our ability to distinguish a speaker's gender were
researched.
The study followed two directions. One direction, coarse analysis, used
classical pattern recognition techniques and asynchronous linear prediction coding
(LPC) analysis of speech. Acoustic parameters such as autocorrelation, LPC,
cepstrum, and reflection coefficients were derived to form test and reference
templates. The effects of different distance measures, filter orders, recognition
schemes, and phonemes were comparatively assessed. Comparisons of acoustic
parameters using Fisher's discriminant ratio criterion were also conducted.
The second direction, fine analysis, used pitch synchronous closed-phase
analysis to obtain accurate vowel characteristics for each gender. Detailed formant
vii


[Figure: scatter plot of second formant frequency (Hz) versus first formant frequency (Hz), labeled "Formant positions of male speakers".]
Figure 7.6 The scatter plot of the first versus second formant
frequencies for ten vowels of male speakers.


averaging has to be performed to obtain the best gender recognition. In the
preliminary attempt, the averaging was first done within three classes of the sounds
(i.e., vowels, unvoiced fricatives, and voiced fricatives).
4.4.2 Test and Reference Template Formation
The averaging procedures used to create test and reference templates for the
present experiment employed a multi-level combination approach as illustrated in
Figure 4.5.
The lower layer templates were feature parameter vectors obtained from each
utterance by an LPC analysis as described in the last section. They can be
autocorrelation, LPC, or cepstrum coefficients, etc. A lower layer template
coefficient set was calculated by averaging six sets of coefficients obtained from six
frames for each sustained utterance such as a vowel, an unvoiced fricative, or a
voiced fricative for every subject.
The next level of combination averaged all templates in the lower layer for
each subject to form a single median layer template to represent this subject.
Templates of all utterances for the same phoneme group (e.g., vowels,
unvoiced fricatives, or voiced fricatives) were averaged.
In the last stage, the single remaining male and female templates were
combined in the same manner as above. Each gender was represented by a single
token (centroid) obtained by averaging all templates in the median layer.
It is evident that from the lower layer to the upper layer, a higher degree of
averaging is achieved.
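The three layers amount to successive averages over array axes; the sketch below (ours) uses invented data with the shapes described above (6 analysis frames per utterance, 10 vowels per subject, 12 coefficients per frame, 27 male followed by 25 female subjects):

```python
import numpy as np

rng = np.random.default_rng(5)
# frames[subject, vowel, frame, coefficient] -- invented placeholder data.
frames = rng.normal(size=(52, 10, 6, 12))

lower = frames.mean(axis=2)      # per-utterance template (average of 6 frames)
median = lower.mean(axis=1)      # per-subject template (average of 10 vowels)
male_upper = median[:27].mean(axis=0)     # one centroid (token) per gender
female_upper = median[27:].mean(axis=0)
print(lower.shape, median.shape, male_upper.shape)
```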
Figure 4.6(a) shows two reflection coefficient templates of vowels for male
and female speakers in the upper layer. The filter order was 12 so that there were 12
elements in each template (vector). Each template can be considered a universal
token representing each gender. The data in the figure are shown as


[Figure: correct recognition rate (%) versus filter order (8, 12, 16, 20), with one curve per recognition scheme (Schemes 1-4).]
Figure 5.1 Results from exclusive recognition schemes with various
filter orders and the LPC log likelihood distortion measure.


5. Filter orders of 12 to 16 were the most appropriate for the majority of
design options.
6. Using the EUC distance measure was more effective than using the
PDF. Recognition rates of the EUC were higher than those of the
PDF and were quite constant across all the filter orders. The EUC
distance measure operated more uniformly on male and female
groups than did the PDF. A possible reason for this inferior PDF
performance is the small ratio of the available number of
subjects per gender to the number of elements per feature vector.
7. Recognition rates of the leave-one-out or exclusive procedure were
only slightly degraded compared to those produced from the
resubstitution or inclusive procedure. The larger the database, the
less the performance differences between these two procedures.
8. The greater variation of female features was noted. The
performance of various feature vectors combined with distance
measures was generally better for male subjects than for female
subjects. The recognition rates for males showed a higher mean, a
much greater minimum, and a smaller standard deviation than those
for females. This indicates that female features have higher
variability than male features.
9. For the filter order of 12, the analytical inferences from the values of
Ji and J4 and the expected probabilities of error using various
acoustic features proved comparable to the empirical results of the
experiments with the PDF distance measure for gender recognition.
Furthermore, J1, the Mahalanobis distance, appeared to be more
reliable for predicting the performance of a gender classifier than J4,


APPENDIX A
RECOGNITION RATES FOR LPC AND CEPSTRUM PARAMETERS
Table A.1
Results from exclusive recognition schemes
LPC distance measure (Log likelihood)
Filter order = 8

                          CORRECT RATE %
                       MALE    FEMALE    TOTAL
Sustained   Scheme 1   50.0     77.2      63.1
Vowels      Scheme 2   60.7     70.0      65.2
            Scheme 3   81.5     68.0      75.0
            Scheme 4   74.1     76.0      75.0
Unvoiced    Scheme 1   77.8     39.2      59.2
Fricatives  Scheme 2   67.4     55.2      61.5
            Scheme 3   70.4     64.0      67.3
            Scheme 4   66.7     88.0      76.9
Voiced      Scheme 1   71.3     78.0      74.5
Fricatives  Scheme 2   73.1     82.0      77.4
            Scheme 3   81.5    100.0      90.4
            Scheme 4   96.3    100.0      98.1


As discussed in Chapter 4, the acoustic parameters used in coarse analysis
were derived from the LPC all-pole model that has a spectral matching
characteristic. It is known that LPC log likelihood and cepstral distortion measures
are directly related to the power spectral differences of the test and reference
signals. Thus, the results indicated that the spectral characteristics were major
factors in distinguishing the speakers gender. Also, our results suggested that there
did exist significant differences between spectral characteristics of unvoiced
fricatives for the two genders, indicating that the speakers gender could be
distinguished based only on speakers vocal tract characteristics since no vocal fold
information existed in unvoiced fricatives. Moreover, if some of the vocal fold
information was combined with vocal tract characteristics as in vowel and voiced
fricative cases, the gender distinguishing was improved, even though in both of the
above cases the fundamental frequency information was not included.
5.2.4 Comparative Study of Filter Order Variation
5.2.4.1 LPC Log Likelihood and Cepstral Distortion Measure Cases
1. By observing the resultant curves of Schemes 1 and 2 from Figures
5.1 and 5.2, a general trend is easily noted. The recognition rates
generally improved by using higher order filters.
2. However, this trend was not observed for Schemes 3 and 4.
Individual inspection had to be made for specific cases. The
recognition rates for Schemes 3 and 4 together with LPC log
likelihood and cepstral distortion measures were first considered.
Notice that there are a total of 12 rates for each of the filter orders.
o Comparing the recognition rates between filter orders of 8
and 12, 9 out of 12 rates increased, 2 of them decreased
(Scheme 4 for voiced and unvoiced fricatives with the LPC


(c) Using individual formants (including frequency, bandwidth,
and amplitude for the formant concerned)

EUC                       MALE     FEMALE     TOTAL
First (F1,B1,A1)         100.0%     92.0%     96.2%
Second (F2,B2,A2)         96.3%    100.0%     98.1%
Third (F3,B3,A3)         100.0%     88.0%     94.2%
Fourth (F4,B4,A4)         96.3%     96.0%     96.2%

PDF                       MALE     FEMALE     TOTAL
First (F1,B1,A1)          92.6%     92.0%     92.3%
Second (F2,B2,A2)         96.3%    100.0%     98.1%
Third (F3,B3,A3)         100.0%    100.0%    100.0%
Fourth (F4,B4,A4)         96.3%     96.0%     96.2%

(d) Using grouped frequencies, bandwidths, and amplitudes of formants
(each feature set contains only the frequencies, or bandwidths, or
amplitudes from all formants)

                              MALE     FEMALE     TOTAL
Positions (F1,F2,F3,F4)       96.3%    100.0%     98.1%
Bandwidths (B1,B2,B3,B4)      88.9%     74.1%     84.6%
Amplitudes (A1,A2,A3,A4)      96.3%     96.0%     96.2%


comparable, the PDF approach did not function well. The smaller the ratio, the
worse the PDF performed.
5.2.6 Comparative Study Using Different Procedures
Performance differences between the resubstitution (inclusive) and
Leave-One-Out (exclusive) procedures were also tested with a filter order of 16.
Tables A.9 and A.10 in Appendix A present the inclusive recognition results for the
LPC log likelihood and cepstral distortion measures with various recognition
schemes, respectively. Tables B.9 and B.10 in Appendix B show the recognition
results from inclusive recognition Scheme 3 with various acoustic parameters, using
the EUC and PDF distance measures, respectively.
The results presented in Appendix A indicate that the correct recognition rates
of the exclusive recognition procedure (Tables A.3 and A.7) were not greatly degraded
compared to those obtained from the inclusive recognition procedure, especially for
the cepstral distortion measure. For the cepstral distortion measure with Scheme 3,
the rates decreased from 94.2% to 90.4% for vowels and from 86.5% to 84.6% for
unvoiced fricatives, and remained constant for voiced fricatives. The maximum
decrease was less than 4%. For the LPC log likelihood with Scheme 3,
the rates degraded from 92.3% to 86.5% for vowels, from 82.7% to 75% for unvoiced
fricatives, and from 100% to 96.2% for voiced fricatives. Here the maximum decrease,
7.7%, was observed for unvoiced fricatives. In contrast, the results from the partial
database of 21 subjects, which we analyzed before we completed the data collection
for the entire database, showed a much larger decrease from the inclusive to the
exclusive procedure: the recognition rate dropped more than 14% for unvoiced
fricatives and more than 9% for voiced fricatives. This convinced us that the larger
the database, the smaller the performance difference between the inclusive and
exclusive procedures.
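For readers who want the two procedures spelled out, here is a minimal sketch of the inclusive/exclusive comparison for a Scheme-3-style classifier (per-subject feature templates classified against per-gender mean templates). The averaging-based template formation and the Euclidean decision rule are simplifications of mine, not a transcription of the system.

```python
import numpy as np

def gender_means(feats, labels):
    """Upper-layer references: one mean template per gender."""
    return {g: feats[labels == g].mean(axis=0) for g in np.unique(labels)}

def classify(x, refs):
    """Nearest-neighbor decision against the gender templates (Euclidean)."""
    return min(refs, key=lambda g: np.linalg.norm(x - refs[g]))

def resubstitution_rate(feats, labels):
    """Inclusive procedure: the tested subject also helps form the references."""
    refs = gender_means(feats, labels)
    return float(np.mean([classify(x, refs) == y for x, y in zip(feats, labels)]))

def leave_one_out_rate(feats, labels):
    """Exclusive procedure: references are rebuilt without the tested subject."""
    hits = []
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i   # withhold subject i
        refs = gender_means(feats[keep], labels[keep])
        hits.append(classify(feats[i], refs) == labels[i])
    return float(np.mean(hits))
```

With `feats` a subjects-by-features array and `labels` an array of gender tags, the gap between the two rates shrinks as the database grows, which is the behavior reported above.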


vowels than for females, and there is less standard error associated
with each vowel for males than for females. This is more
evident for the bandwidths. By comparing Figures 7.6 and 7.7, the
scatter plots of the first and second formant frequencies for both
genders, it is seen that the female data points appear more
spread out. The above observations indicate that there was greater
variability for female voices than for male voices. Many researchers
believe melodic (intonation, stress, and/or coarticulation) cues are
speech characteristics associated with female voices. Larger
formant frequency, bandwidth, and amplitude variations for female
speakers may also contribute to these perceptual cues.
7.2 Relative Importance of Grouped Vowel Features
In this section, the relative importance of individual or grouped vowel features,
such as fundamental frequency, formant frequencies, or bandwidths, for gender
discrimination was determined by using these individual or grouped features
to form the test and reference templates and then processing the data for automatic
gender recognition. The results below were obtained by using the exclusive
recognition Scheme 3, which is the same as described in Chapter 4. The
database used consisted of all 27 male and 25 female subjects, with all 10 sustained
vowels for each subject. The recognition rates are presented as correct recognition
percentages.
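As an illustration of how grouped templates can be formed before running the recognizer, the sketch below slices a per-subject feature matrix into the groups evaluated in the tables that follow; the column layout and names are hypothetical.

```python
import numpy as np

# Hypothetical layout: one row per subject (vowel-averaged features),
# columns ordered as fundamental frequency, then formant frequencies,
# bandwidths, and amplitudes.
COLS = ["F0", "F1", "F2", "F3", "F4",
        "B1", "B2", "B3", "B4",
        "A1", "A2", "A3", "A4"]

GROUPS = {
    "F0 only":    ["F0"],
    "positions":  ["F1", "F2", "F3", "F4"],
    "bandwidths": ["B1", "B2", "B3", "B4"],
    "amplitudes": ["A1", "A2", "A3", "A4"],
}

def grouped_features(table, group):
    """Keep only one feature group, forming the test/reference templates."""
    idx = [COLS.index(c) for c in GROUPS[group]]
    return np.asarray(table)[:, idx]
```

Each grouped matrix would then be passed through the exclusive Scheme 3 recognizer, for instance a leave-one-out loop like the procedure-comparison sketch shown earlier.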


In the coarse analysis, acoustic parameters such as autocorrelation, LPC,
cepstrum, and reflection coefficients were derived to form test and reference
templates. The effects of using different distance measures, filter orders,
recognition schemes, and phonemes were comparatively evaluated. Comparisons of
acoustic parameters using Fisher's discriminant ratio criterion were also
conducted.
The linear prediction coding concepts and detailed experimental design based
on the coarse analysis will be given in Chapter 4.
2.3 Fine Analysis
The objective of the fine analysis was to study and compare the relative
significance of vowel characteristics responsible for gender discrimination.
As we know, male and female vowel characteristics are characterized by formant
positions, bandwidths, and amplitudes, so accurate formant estimation is
necessary. It is important to pay particular attention to the measurement technique
and to the degree of accuracy that can be achieved with it. Although formant
features have been measured for a variety of different studies, the accuracy of these
measurements is still a matter of conjecture.
Formant estimation is influenced by (Atal, 1974a; Childers et al., 1985a;
Krishnamurthy and Childers, 1986):
o the effect of the periodic vocal fold excitation, especially when the
harmonic is near the formant,
o the effect of the excitation-spectrum envelope,
o the effect of time averaging over several excitation cycles in the
analysis when the vocal folds are repeatedly in open-phase (large


Figure 6.3 Speech signal (upper), the corresponding DEGG
signal (middle), and the variable forgetting factor λk (lower).


7.2.1 Recognition Results
(a) Using only fundamental frequency (EUC)

        MALE     FEMALE   TOTAL
       100.0%    92.0%    96.2%

(b) Using the individual frequency, bandwidth, or amplitude
of the four formants (EUC)

        MALE     FEMALE   TOTAL
F1       96.3%    84.0%   90.4%
B1       92.6%    68.0%   80.8%
A1       51.9%    40.0%   46.2%
F2       96.3%   100.0%   98.1%
B2      100.0%    84.0%   92.3%
A2       81.5%    72.0%   76.9%
F3      100.0%    88.0%   94.2%
B3       66.7%    56.0%   61.5%
A3       81.5%    80.0%   80.8%
F4       96.3%    96.0%   96.2%
B4       77.8%    56.0%   67.3%
A4       81.5%    84.0%   82.7%




for each subject (i.e., the median layer), were formed. The set of the entire median
layer constituted the reference cluster that includes all median templates. In the
testing stage, the distance measure for each lower layer template of all test subjects
was calculated with respect to each of the median layer templates, and the minimum
distance was found. The speaker gender of the lower layer utterance was then
classified as male or female, according to the gender known for the median layer
reference template.
Scheme 2 is illustrated in Figure 4.9(b). In the training stage, one test
template for each test utterance (i.e., the lower layer), and one reference template
for each gender (i.e., the upper layer), were formed. The upper layer constituted the
reference cluster that includes only two gender templates. In the testing stage, the
distance measure for each lower layer template of all test subjects was calculated
with respect to each of those upper layer templates and the minimum distance was
found. The speaker gender of the lower layer utterance was then classified as male
or female, according to the gender known for the upper layer reference template.
Figure 4.9(c) shows Scheme 3. In the training stage, one test template for each
test subject (i.e., the median layer), and one reference template for each gender
(i.e., the upper layer), were formed. The set of the entire median layer constituted
the test pool that includes all median templates. In the testing stage, the distance
measure for each median layer template of all test subjects was calculated with
respect to each of those upper layer templates and the minimum distance was found.
The speaker gender of the median layer template was then classified as male or
female, according to the gender known for the upper layer reference template.
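The three layers recur in all four schemes, so a compact sketch may help; it assumes, purely for illustration, that each template is a simple average of the parameter vectors one level below (the exact template formation is specified in Section 4.4.2).

```python
import numpy as np

def lower_layer(frames):
    """Lower layer: one template per utterance (average of its frame vectors)."""
    return np.mean(frames, axis=0)

def median_layer(subject_utterances):
    """Median layer: one template per subject, the within-subject average
    of that subject's lower-layer templates."""
    return np.mean([lower_layer(u) for u in subject_utterances], axis=0)

def upper_layer(median_templates):
    """Upper layer: one template per gender, the within-gender average
    of the median-layer templates of that gender's subjects."""
    return np.mean(median_templates, axis=0)
```

Scheme 3, for example, tests each median-layer template against the two upper-layer gender templates.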
Figure 4.9(d) shows Scheme 4. In the training stage, only the median layer
was formed, and each subject was represented by a single template. The median
layer constituted both test and reference pools. In the testing stage, the
Leave-One-Out or exclusive procedure (which is discussed in detail in the next


Lass, N. J., Almerino, C. A., Jordan, L. F., and Walsh, J. M. (1980). The effect of
filtered speech on speaker race and sex identifications, Journal of Phonetics, Vol.
8, 101-112.
Lass, N. J., Hughes, K. R., Bowyer, M. D., Waters, L. T., and Bourne, V. T. (1976).
Speaker sex identification from voiced, whispered, and filtered isolated vowels, J.
Acoust. Soc. of Am., Vol. 59, 675-678.
Lass, N. J., and Mertz, P. J. (1978). The effect of temporal speech alterations on
speaker race and sex identifications, Language and Speech, Vol. 21, 279-290.
Lass, N. J., Tecca, J. E., Mancuso, R. A., and Black, W. I. (1979). The effect of
phonetic complexity on speaker race and sex identifications, Journal of Phonetics,
Vol. 7, 105-118.
Lindblom, B. (1962). Accuracy and limitation of sona-graph measurements,
Proceedings of the 4th International Congress of Phonetic Sciences, Helsinki, 1961.
The Hague, 1962.
Linke, C. E. (1973). A study of pitch characteristics of female voices and their
relationship to vocal effectiveness, Folia Phoniatrica, Vol. 25, 173-185.
Linville, S. E., and Fisher, H. B. (1985). Acoustic characteristics of perceived
versus actual vocal age in controlled phonation by adult females, J. Acoust. Soc.
Am., Vol. 78(1), 40-48.
Markel, J. D., and Gray, A. H. (1976). Linear Prediction of Speech,
Springer-Verlag, Berlin.
Markel, J. D., Oshika, B., and Gray, A. H. Jr. (1977). Long-term feature averaging
for speaker recognition, IEEE Trans. on Acoust., Speech, and Signal Processing,
Vol. ASSP-25(4), 330-337.
Makhoul, J. (1975a). Linear prediction in automatic speech recognition, in
Speech Recognition, Invited Papers Presented at the 1974 IEEE Symposium, D. R.
Reddy, Ed., Academic Press, New York, 183-220.
Makhoul, J. (1975b). Linear prediction: a tutorial review, Proc. IEEE, Vol. 63(4),
561-580.
Monsen, R. B., and Engebretson, A. M. (1977). Study of variations in the male and
female glottal wave, J. Acoust. Soc. Am., Vol 62(4), 981-993.
Monsen, R. B., and Engebretson, A. M. (1983). The accuracy of formant frequency
measurements: a comparison of spectrographic analysis and linear prediction,
Journal of Speech and Hearing Research, Vol. 26, 89-97.


[Figure: paired bar charts for the ARC (EUC), LPC (EUC), LPC (LLD),
RC (EUC), and CC parameters]
Figure 5.5 Results of recognition Scheme 3 with the EUC and
a filter order of 16 for (a) the exclusive procedure and
(b) the inclusive procedure.


The estimate of the nth sample is

$$\hat{s}(n) = -\sum_{k=1}^{p} a_k\, s(n-k) \qquad (4.3)$$

The error between the value of the actual nth sample and its estimate is

$$e(n) = s(n) - \hat{s}(n) = s(n) + \sum_{k=1}^{p} a_k\, s(n-k) \qquad (4.4)$$

and equivalently,

$$s(n) = -\sum_{k=1}^{p} a_k\, s(n-k) + e(n) \qquad (4.5)$$

If the linear prediction model of Equation (2.5) conforms to the basic speech
production model given by (2.2), then

$$e(n) = G\,u(n) \qquad (4.6)$$

$$a_k = \alpha_k \qquad (4.7)$$

Thus the coefficients $\{a_k\}$ identify the system whose output is $s(n)$. The problem
then is to determine the values of the coefficients $\{a_k\}$ from the actual speech signal.
The criterion used to obtain the coefficients $\{a_k\}$ is the minimization of the
short-time average prediction error $E$ with respect to each coefficient $a_i$, over some
time interval, where

$$E = \sum_{n} \left[e(n)\right]^2 \qquad (4.8)$$

This leads to the following set of equations
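The resulting normal equations are lost to a page break here; in the autocorrelation method they are Toeplitz and are classically solved by the Levinson-Durbin recursion. The sketch below is my own illustration, matching the sign convention of Equation (4.4); it also yields the reflection coefficients, one of the parameter sets compared in this study.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the autocorrelation normal equations for a1..ap, given the
    autocorrelations r[0..p]. Returns a = [1, a1..ap], the reflection
    coefficients, and the final prediction error energy."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = r[0]                                      # initial error energy
    refl = np.zeros(p)
    for i in range(1, p + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / E   # reflection coefficient k_i
        refl[i - 1] = k
        a[1:i] = a[1:i] + k * a[1:i][::-1]        # update a1..a(i-1)
        a[i] = k
        E *= (1.0 - k * k)                        # shrink the error energy
    return a, refl, E
```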


The low-pass, anti-aliasing and reconstruction filters of the DSC-200 were
connected to the analog side of the converter. Both signals were bandlimited to 5
KHz by these passive elliptic filters, with a specified minimum stopband
attenuation of -55 db and passband ripple of 0.2 db. The DSC-240 station
provides audio signal interfacing to the DSC-200, which includes input and output
buffering as well as level metering and signal path switching.
The utterances were directly digitized, since this choice avoids any distortion
that may be introduced in the tape recording process (Berouti et al., 1977; Naik,
1984). An extender attached to the microphone kept the speaker's lips 6 inches
away. With the microphone and EGG electrodes in place, the researcher ran the
data collection program on a terminal inside the sound room. A two-channel
Tektronix Type 564B Storage Oscilloscope was connected to the DSC-240 so that both
speech and EGG signals could be monitored. The program prompted the researcher by
presenting a list of commands on the screen. The researcher initiated digitization by
depressing the D key on the keyboard. Immediately after digitization, another
prompt indicated termination of the sampling process. The digitized utterance could
be played back, and an option existed to repeat the digitization process if it was
thought that part of the utterance might have been spoken abnormally or the
digitized speech and EGG signals were unsatisfactory. For example, the speakers
were instructed to repeat an utterance if the panel of experts sitting in the
sound room, or the speaker, felt that it was rushed, mispronounced, too low, etc. The
entire protocol, with utterances repeated as necessary, took an average of 15-20
minutes to collect. About 150-200 seconds of speech and EGG were automatically
stored on disk. Thus, for each subject, about 12000-16000 blocks (512 bytes per
block) of data were collected.
Since the speech and EGG channels were alternately sampled, the resulting
file of digitized data had the two signals interleaved. The trivial task of


Table 5.8
Most effective feature and distance measure combinations
with the exclusive recognition Scheme 3 (except as noted).

CORRECT RATE % ACROSS FILTER ORDERS (8, 12, 16, 20)

Sustained Vowels:     LPC (PDF) 98.1;  RC (PDF) 98.1;  FFF (EUC) 98.1;
                      RC (EUC) 100.0, 100.0, 100.0
Unvoiced Fricatives:  LPC (PDF) 86.5;  CC (EUC)* 88.5, 90.4, 88.5
Voiced Fricatives:    LPC (LLD) 96.2, 98.1;  LPC (LLD)* 98.1, 96.2, 96.2;
                      CC (EUC) 98.1, 98.1, 96.2;  RC (EUC) 96.2, 96.2, 96.2

* By using recognition Scheme 4.


where Na is the number of filter coefficients in Equation (6.6). On the other
hand, since ΣV0 is related to the sum of the squares of the error, it can be
calculated from the prediction error over one or two analysis frames using LPC
analysis before executing the WRLS-VFF algorithm.
The estimation error is usually large when there is a large glottal open phase in
the speech signal. In such cases, a small λk is used to decrease the contributions to
errors in the estimation process. Therefore, small values of λk are related to the
glottal closing point, where the prediction error is maximum. Figure 6.3 shows that
the regions of small λk correspond to the negative peaks of the differentiated
electroglottograph (DEGG) signal, which are reliable estimates of the closing instant
of the glottis. The closed phase WRLS-VFF algorithm for speech analysis extracts the
vocal tract parameters (formants) only from the glottal closed interval. This
algorithm can be implemented as follows (Figure 6.4):
(1) Initialize the values of P0, θ0, λmin, and ΣV0, and give the filter
order. Experience shows that the algorithm is insensitive to the values
of P0 and ΣV0 as long as they are large enough (e.g., P0 = 100,
ΣV0 = 1000000).
(2) Compute the filter gain Kk, error covariance Pk, and prediction error
Vk using Equation (6.9).
(3) Compute λk using Equation (6.13); if λk < λmin, set λk = λmin.
(4) Predict the new filter coefficient vector θk.
(5) Check for the glottal closed phase using λk or the EGG signal; if
there is a closed glottal interval, then extract the formants and
bandwidths of the vocal tract by applying a peak-picking technique
on the spectrum, which is obtained by using an FFT on the
polynomial coefficients from θk.


voice, 114th Meeting of Acoustical Society of America, J. Acoust. Soc. Am. Sup. 1,
Vol. 82, S90.
Holmes, J. N. (1973). The influence of glottal waveform on the naturalness of
speech from a parallel formant synthesizer, IEEE Trans. Audio and Electroacoust.,
Vol. 21, 298-305.
Horri, Y., and Ryan, W. J. (1981). Fundamental frequency characteristics and
perceived age of adult male speakers, Folia Phoniatrica, Vol. 33, 227-233.
Hughes, G. F. (1968). On the mean accuracy of statistical pattern recognizers,
IEEE Trans. Inform. Theor., Vol. IT-14: 55-63.
Ingemann, F. (1968). Identification of the speakers sex from voiceless fricatives,
J. Acoust. Soc. Am., Vol. 44, 1142-1144.
Ishizaka, K., and Flanagan, J. L. (1972). Synthesis of voiced sounds from a
two-mass model of the vocal cords, Bell Syst. Tech. J., Vol. 50, 1233-1268.
Itakura, F. (1975). Minimum prediction residual principle applied to speech
recognition, IEEE Trans. Acoust., Speech, and Signal Processing, Vol. ASSP-23,
67-72.
Juang, B. H. (1984). On using the Itakura-Saito measures for speech coder
performance evaluation, AT&T Bell Lab. Tech. J., Vol. 63(8), 1477-1498.
Karlsson, I. (1986). Glottal wave forms for normal female speakers, Journal of
Phonetics, Vol. 14, 415-419.
Klatt, D. H. (1987). Acoustic correlates of breathiness: First harmonic amplitude,
turbulence noise, and tracheal coupling, 114th Meeting of Acoustical Society of
America, J. Acoust. Soc. Am. Sup. 1, Vol. 82, S91.
Klatt, D. H., and Klatt, L. C. (1987). Voice quality variations within and across
female and male talkers: implications for speech analysis, synthesis and
perception, submitted to J. Acoust. Soc. Am.
Krishnamurthy, A. K. (1983). Study of vocal fold vibration and the glottal sound
source using synchronized speech electroglottography and ultra-high speed
laryngeal films, Ph.D. dissertation, University of Florida, Gainesville.
Krishnamurthy, A. K., and Childers, D. G. (1986). Two-channel speech analysis,
IEEE Trans. on Acoust., Speech, and Signal Processing, Vol. ASSP-34(4), 730-743.
Kullback, S. (1959). Information Theory and Statistics, Wiley, New York.
Lachenbruch, P. A. (1968). On expected probabilities of misclassification in
discriminant analysis, necessary sample size, and a relation with the multiple
correlation coefficient, Biometrics, Vol. 24, 823-834.




Problems such as these are significantly reduced when the pitch synchronous
analysis method is applied, in which the analysis window is restricted to one pitch
period in length. Problems are further reduced when sequential adaptive analysis
methods are used.
6.2.2 Source-Tract Interaction
The linear source-filter model of conventional LPC assumes that the source
and the vocal tract (VT) filter are separable and do not interact. However, this is not
strictly correct, in that it is valid only if the source is properly defined. The
aerodynamic forces that are responsible for the self-oscillation of the vocal folds are
in turn affected by the supraglottal vocal tract. Thus, in theory, the VT shape can
affect vocal fold vibration. Extensive simulations by Krishnamurthy and Childers
(1986) using the Ishizaka-Flanagan (1972) two-mass model for the vocal folds
show, however, that the vocal fold vibration (and in turn the glottal area) is not
significantly affected by the VT. On the other hand, the glottal volume velocity is
affected. The interaction causes the volume-velocity signal to be skewed to the right
with respect to the glottal area. A ripple in the volume-velocity waveform may
appear, and is thought to be caused primarily by the VT first-formant frequency.
The simplest way to understand the ripple phenomenon is to consider the
single-formant model shown in Figure 6.2. This model has been used extensively to
study source-tract interaction effects (Ananthapadmanabha and Fant, 1982;
Rothenberg, 1981). The VT is represented to a first approximation by an RLC
resonant circuit for the first formant. Rg(t) and Lg(t) are the nonlinear and
time-varying glottal resistance and inductance, respectively, and are controlled by
the glottal area function Ag(t) as well as the current flowing through them, Ug(t). If
the impedance due to Rg(t) and Lg(t) is much larger than the VT input impedance
Zin for all t, then the glottal volume velocity Ug(t) will be essentially independent of


demultiplexing was performed off-line after data collection. Once the data were
demultiplexed, the speech and EGG were trimmed to discard the unnecessary data
before and after an utterance, while keeping the onset and offset portions at each end
of the data. After trimming, about 4500-6500 blocks of data were stored on disk for
each subject.
3.3 Synchronization of Data
When the speech and EGG signals were used during the analysis stage, they
were time aligned to account for the acoustic propagation delay from the larynx to
the microphone. The microphone was kept a fixed 15.24 centimeters (6 inches)
away from the speaker's lips to reduce breath noises and to simplify the alignment
process. Synchronization of the waveforms had to account for the distance from the
vocal folds to the microphone. To do so, average vocal tract lengths of 17 cm for
males and 15 cm for females were assumed. The number of samples to discard
from the beginning of the speech record was then

$$\#\,\text{samples} = \text{Int}\!\left[\left(\frac{32.24}{34442}\right) \times 10000 + 0.5\right] \qquad (3.1)$$

for males and

$$\#\,\text{samples} = \text{Int}\!\left[\left(\frac{30.24}{34442}\right) \times 10000 + 0.5\right] \qquad (3.2)$$

for females.
Equations (3.1) and (3.2) show that a 10 sample correction is appropriate for
males and a 9 sample correction is appropriate for females. Examination of the data
also supported the use of these figures for adult speakers. Examples of aligned speech
and EGG signals for a male and a female speaker are shown in Figure 3.1.


(6) Go to (2) until the end of the data.
In summary, this algorithm uses a variable forgetting factor, which can be
obtained recursively during the adaptation process, to control the adaptation gain
and to determine the effective memory length. Experimental results show that the
formant tracking ability and formant estimation accuracy of the WRLS-VFF
algorithm are superior to those of the other adaptive algorithms that were considered
and to the LPC based algorithm.
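To make the recursion concrete, here is a minimal WRLS-VFF sketch for an AR(p) speech model. The gain, covariance, and error updates follow the standard recursive least squares form, and the forgetting-factor rule shown is the common Fortescue-style choice, lambda_k = 1 - (1 - phi'K)Vk^2/SigmaV0 clamped at lambda_min; that rule is consistent with the initialization described above but is only my assumption for the dissertation's Equation (6.13).

```python
import numpy as np

def wrls_vff(s, p=10, lam_min=0.5, P0=100.0, sigma_v0=1e6):
    """Recursive AR(p) estimation of speech s with a variable forgetting
    factor. Returns the coefficient track and the lambda_k track."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    theta = np.zeros(p)                  # filter coefficient vector theta_k
    P = P0 * np.eye(p)                   # error covariance P_k
    thetas, lams = np.zeros((n, p)), np.ones(n)
    for k in range(p, n):
        phi = s[k - p:k][::-1]           # p most recent past samples
        err = s[k] - theta @ phi         # a priori prediction error V_k
        K = P @ phi / (1.0 + phi @ P @ phi)   # filter gain K_k
        theta = theta + K * err
        # Small lambda_k where the error is large (near glottal closure),
        # long memory (lambda_k near 1) where the model fits well.
        lam = max(1.0 - (1.0 - phi @ K) * err ** 2 / sigma_v0, lam_min)
        P = (P - np.outer(K, phi) @ P) / lam
        thetas[k], lams[k] = theta, lam
    return thetas, lams
```

During frames flagged as glottal closed phase, the formants and bandwidths can then be read from the spectrum (or roots) of the polynomial formed from theta_k, as in step (5) above.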
6.3.2 EGG Assisted Procedures
Electroglottography is essentially an electrical impedance measurement
technique used to monitor vocal fold activity (Childers, 1977; Childers and
Krishnamurthy, 1985). The EGG instrument measures the electrical impedance
variations of the larynx using a pair of plate electrodes held in contact with the skin on
both sides of the thyroid cartilage.
The percentage amplitude change in an EGG signal reflects the percentage change
in the tissue impedance. The EGG waveform represents the opening of the glottis as
an upward deflection and the closing of the glottis as a downward deflection. The
hypothesis is that the EGG signal is a function of the lateral area of contact between
the vocal folds.
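Since the analysis keys on the negative peaks of the differentiated EGG (DEGG) as glottal closure estimates, a small detection sketch is included here; the peak selection and the 2 ms minimum spacing are assumptions of mine, not the dissertation's procedure.

```python
import numpy as np

def closure_instants(egg, fs=10000, min_period_s=0.002):
    """Estimate glottal closure instants as the negative peaks of the
    DEGG, enforcing a minimum spacing between detections."""
    degg = np.diff(np.asarray(egg, dtype=float))
    min_gap = int(min_period_s * fs)
    chosen = []
    for i in np.argsort(degg):           # most negative DEGG samples first
        if degg[i] >= 0:                 # only downward (closing) deflections
            break
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(int(i))
    return np.array(sorted(chosen))
```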
An example of the synchronized acoustic speech, EGG, DEGG, and
glottal area (the opening between the vocal folds) measured from an ultra-high
speed film is given in Figure 6.5. Note that the instant of glottal closure is signaled
by a rapid decrease in the EGG and coincides with the minimum in the DEGG for
that period; also, the maximum in the DEGG in each period occurs very close to the
instant of glottal opening. The difference between the instant of glottal closing (as
determined from the glottal area curve) and the minimum in the DEGG for that
period is about two data sample points (at a 10 KHz sampling rate), plus or minus


CHAPTER 4
EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS
As stated in Chapter 2, the coarse analysis applies classical pattern recognition
techniques and asynchronous linear prediction coding (LPC) analysis of speech.
The specific goal of this analysis was to develop and test algorithms that achieve
rapid gender recognition from only a brief speech data record. Figure 4.1 shows the
canonical pattern recognition model used in the gender recognition system. There are
four basic steps in the model:
1) acoustic parameter extraction,
2) test and reference pattern or template formation,
3) pattern similarity determination, and
4) decision rule.
The input is the acoustic waveform of the spoken speech signal; the desired
output is a best estimate of the gender of the speaker in the input. Such a model can
be a part of a speech or speaker recognition system or a front-end processor for such
a system. The following discussion of the coarse analysis proceeds in the context of
Figure 4.1; a schematic sketch of the four-step model is given below.
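As an orientation only, the four steps can be read as the following driver skeleton; every name and signature here is illustrative (mine), not an interface from the dissertation.

```python
def recognize_gender(waveform, extract, make_template, distance, references):
    """Skeleton of the four-step model of Figure 4.1.

    extract        step 1: acoustic parameter extraction (waveform -> frame vectors)
    make_template  step 2: test template formation (frame vectors -> template)
    distance       step 3: pattern similarity determination (template pair -> score)
    references     gender-labeled reference templates, e.g. {"male": ..., "female": ...}
    """
    params = extract(waveform)                # 1) acoustic parameters
    test = make_template(params)              # 2) test template
    scores = {g: distance(test, ref)          # 3) similarity to each reference
              for g, ref in references.items()}
    return min(scores, key=scores.get)        # 4) nearest-neighbor decision rule
```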
4.1 Asynchronous LPC Analysis
4.1.1 Linear Prediction Concepts
Linear prediction, also known as the autoregressive (AR), all-pole, or
maximum entropy model, is widely used in speech processing. This method has


APPENDIX B
RECOGNITION RATES FOR VARIOUS ACOUSTIC PARAMETERS
AND DISTANCE MEASURES
Table B.1
Results from the exclusive recognition Scheme 3
Euclidean distance measure, filter order = 8

                        CORRECT RATE %
                     MALE   FEMALE   TOTAL
Sustained    ARC      85.2    72.0    78.8
Vowels       LPC      63.0    84.0    73.1
             RC       81.5    96.0    88.5
             CC       74.1    92.0    82.7
Unvoiced     ARC      70.4    80.0    75.0
Fricatives   LPC      81.5    80.0    80.8
             RC       81.5    80.0    80.8
             CC       70.4    72.0    71.2
Voiced       ARC      88.9    84.0    86.5
Fricatives   LPC      92.6    92.0    92.3
             RC       96.3    92.0    94.2
             CC       92.6    96.0    94.2


[Figure: total correct recognition rate (%) versus filter order (8, 12, 16, 20),
one curve per scheme; Schemes 3 and 4 plot above Schemes 2 and 1]
Figure 5.2 Results from exclusive recognition schemes with various
filter orders and the cepstral distortion measure.


Factorial experiments in which the same experimental unit (generally a
subject) is observed under more than one condition require special attention.
Experiments of this kind are referred to as those in which there are repeated
measures. A two-factor experiment in which there are repeated measures on factor
B (i.e., each experimental unit is observed under all levels of factor B) may be
represented schematically as follows:

        b1   b2   b3   ...   bm
  a1    G1   G1   G1   ...   G1
  a2    G2   G2   G2   ...   G2

The symbol G1 represents a group of n subjects. The symbol G2 represents a second
group of n subjects. The subjects in G1 are observed under treatment combinations
a1b1, a1b2, a1b3, ..., and a1bm. Thus the subjects in G1 are observed under all levels
of factor B in the experiment, but only under one level of factor A. The subjects in
G2 are observed under treatment combinations a2b1, a2b2, a2b3, ..., and a2bm. Thus
each subject in G2 is observed under all levels of factor B in the experiment, but only
under one level of factor A, namely, a2.
In this experiment, the subjects may be considered to define a third factor
having n levels. As such, the subject factor is said to be crossed with factor B but
nested under factor A (Winer, 1971). Schematically,


Table A.9
Results from inclusive recognition schemes
LPC distance measure (log likelihood), filter order = 16

                          CORRECT RATE %
                       MALE   FEMALE   TOTAL
Sustained    Scheme 1   75.6    89.6    82.3
Vowels       Scheme 2   75.2    82.0    78.5
             Scheme 3   96.3    88.0    92.3
Unvoiced     Scheme 1   85.2    59.2    72.6
Fricatives   Scheme 2   73.3    68.0    70.8
             Scheme 3   85.2    80.0    82.7
Voiced       Scheme 1   75.0    87.0    80.8
Fricatives   Scheme 2   80.6    88.0    84.1
             Scheme 3  100.0   100.0   100.0


[Figure: three panels across the ten vowels, showing the averaged frequency (Hz),
averaged bandwidth (Hz), and averaged amplitude (db) of the third formant]
Figure 7.4 The third formant characteristics of ten vowels.


TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH
By
KE WU
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1990

To my parents
and to my wife

ACKNOWLEDGMENTS
The invaluable guidance, encouragement, and support I have received from
my adviser and committee chairman, Dr. D. G. Childers, during the years of my
graduate education are most appreciated. I am sincerely grateful for his direction,
insight, and patience throughout this dissertation research.
I would especially like to thank Dr. J. R. Smith, Dr. A. A. Arroyo, Dr. J. C.
Principe, and Dr. H. B. Rothman for their interest and participation in serving on my
supervisory committee and their productive criticism of my research project.
The partial support by the National Institutes of Health, National Science
Foundation, and University of Florida Center of Excellence Program is gratefully
acknowledged.
Special thanks are also extended to my fellow graduate students and other
members of the Mind-Machine Interaction Research Center for their friendship,
encouragement, and skillful technical help.
Last but not least, I am greatly indebted to my wife, Hong-gen, and my
parents for their love, support, understanding, and patience. My gratitude to them is
beyond description.

TABLE OF CONTENTS
Page
ACKNOWLEDGMENTS iii
ABSTRACT vii
CHAPTER
1 INTRODUCTION 1
1.1 Automatic Gender Recognition 1
1.2 Application Perspective 2
1.3 Literature Review 4
1.3.1 Basic Gender Features 4
1.3.2 Acoustic Cues Responsible for Gender Perception .... 13
1.3.3 Summary of Previous Research 17
1.4 Objectives of this Research 20
1.5 Description of Chapters 21
2 APPROACHES TO GENDER RECOGNITION FROM SPEECH .... 23
2.1 Overview of Research Plan 23
2.2 Coarse Analysis 23
2.3 Fine Analysis 26
3 DATA COLLECTION AND PROCESSING 29
3.1 Database Description 29
3.2 Speech and EGG Digitization 30
3.3 Synchronization of Data 32
4 EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS 34
4.1 Asynchronous LPC Analysis 34
4.1.1 Linear Prediction Concepts 34
4.1.2 Analysis Conditions 39
4.2 Acoustic Parameters 40
4.2.1 Autocorrelation Coefficients 40

4.2.2 LPC coefficients 41
4.2.3 Cepstrum Coefficients 41
4.2.4 Reflection Coefficients 41
4.2.5 Fundamental Frequency and Formant Information .... 42
4.3 Distance Measures 42
4.3.1 Euclidean Distance 42
4.3.2 LPC log Likelihood Distance 43
4.3.3 Cepstral Distortion 45
4.3.4 Weighted Euclidean Distance 47
4.3.5 Probability Density Function 47
4.4 Template Formation and Recognition Schemes 48
4.4.1 Purpose of Design 48
4.4.2 Test and Reference Template Formation 49
4.4.3 Nearest Neighbor Decision Rule 55
4.4.4 Structure of Four Recognition Schemes 56
4.5 Resubstitution and Leave-One-Out Procedures 60
4.6 Separability of Acoustic Parameters
Using Fisher's Discriminant Ratio Criterion 61
4.6.1 Fisher's Discriminant and F Ratio 61
4.6.2 Divergence and Probability of Error 64
5 RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS .. 68
5.1 Coarse Analysis Conditions 68
5.2 Performance Assessments 70
5.2.1 Comparative Study of Recognition Schemes 71
5.2.2 Comparative Study of Acoustic Features 78
5.2.2.1 LPC Parameter Versus Cepstrum Parameter .. 78
5.2.2.2 Other Acoustic Parameters 79
5.2.3 Comparative Study Using Different Phonemes 84
5.2.4 Comparative Study of Filter Order Variation 85
5.2.4.1 LPC Log Likelihood and Cepstral
Distortion Measure Cases 85
5.2.4.2 Euclidean Distance Versus
Probability Density Function 87
5.2.5 Comparative Study of Distance Measures 88
5.2.6 Comparative Study Using Different Procedures 90
5.2.7 Variability of Female Voices 93
5.3 Comparative Study of Acoustic Parameters
Using Fisher's Discriminant Ratio Criterion 93
5.4 Conclusions 102
6 EXPERIMENTAL DESIGN BASED ON FINE ANALYSIS 106
6.1 Introduction 106
6.2 Limitations of Conventional LPC 107
6.2.1 Influence of Voice Periodicity 108
6.2.2 Source-Tract Interaction 111
6.3 Closed Phase WRLS-VFF Analysis 113

6.3.1 Algorithm Description 113
6.3.2 EGG Assisted Procedures 120
6.4 Testing Methods 122
6.4.1 Two-way ANOVA Statistical Testing 123
6.4.2 Automatic Recognition by Using Grouped Features ... 128
7 EVALUATION OF VOWEL CHARACTERISTICS 130
7.1 Vowel Characteristics of Gender 130
7.1.1 Fundamental Frequency and Formant Features
for Each Gender 130
7.1.2 Comparison with Peterson and Barney's Results 142
7.1.3 Results of Two-way ANOVA Statistical Test 145
7.1.4 Results of T Statistical Test 145
7.1.5 Discussion 145
7.2 Relative Importance of Grouped Vowel Features 151
7.2.1 Recognition Results 152
7.2.2 Discussion 154
7.3 Conclusions 158
8 CONCLUDING REMARKS 162
8.1 Summary 162
8.2 Future Research Extensions 166
8.2.1 Short Term Extension 166
8.2.2 Long Term Extension 168
APPENDICES
A RECOGNITION RATES FOR LPC AND CEPSTRUM
PARAMETERS 169
B RECOGNITION RATES FOR VARIOUS ACOUSTIC
PARAMETERS AND DISTANCE MEASURES 179
REFERENCES 189
BIOGRAPHICAL SKETCH 198

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH
By
Ke Wu
May, 1990
Chairman: D. G. Childers
Major Department: Electrical Engineering
The purpose of this research was to investigate the potential effectiveness of
digital speech processing and pattern recognition techniques in the automatic
recognition of gender from speech. Some hypotheses concerning acoustic
parameters that may influence our ability to distinguish a speaker's gender were
researched.
The study followed two directions. One direction, coarse analysis, used
classical pattern recognition techniques and asynchronous linear prediction coding
(LPC) analysis of speech. Acoustic parameters such as autocorrelation, LPC,
cepstrum, and reflection coefficients were derived to form test and reference
templates. The effects of different distance measures, filter orders, recognition
schemes, and phonemes were comparatively assessed. Comparisons of acoustic
parameters using Fisher's discriminant ratio criterion were also conducted.
The second direction, fine analysis, used pitch synchronous closed-phase
analysis to obtain accurate vowel characteristics for each gender. Detailed formant

features, including frequencies, bandwidths, and amplitudes, were extracted by a
closed-phase Weighted Recursive Least Squares with Variable Forgetting Factor
method. The electroglottograph signal was used to locate the closed-phase portion
of the speech signal. A two-way Analysis of Variance statistical analysis was
performed to test the difference between two gender features, and the relative
importance of grouped vowel features was evaluated by a pattern recognition
approach.
The results showed that most of the LPC derived acoustic parameters worked
very well for automatic gender recognition. A within-gender and within-subject
averaging technique was important for generating appropriate test and reference
templates. The Euclidean distance measure appeared to be the most robust as well
as the simplest of the distance measures.
The statistical test indicated steeper spectral slopes for female vowels. Results
suggested that redundant gender information was imbedded in the fundamental
frequency and vocal tract resonance. Features of female voices were observed to
have higher within-group variations than those of male voices.
In summary, this study demonstrated the feasibility of an efficient gender
recognition system. The importance of this system is that it would reduce the search
space of speech or speaker recognition in half. The knowledge gained from this
research might benefit the generation of synthetic speech with a desired male or
female voice quality.

CHAPTER 1
INTRODUCTION
1.1 Automatic Gender Recognition
Human listeners are able to capture and categorize the information of acoustic
speech signals. Categories include those that contribute a linguistic message, those
that identify the speaker, and those that convey clues about the speaker's
personality, emotional state, gender, age, accent, and the status of his/her health.
Automatic speech and speaker recognition systems are far less capable than
human listeners. Computerized speaker recognition can be accomplished but only
under highly constrained conditions. The major difficulty is that the number of
significant parameters is unmanageably large and little is known about the acoustic
speech features, articulation differences, vocal tract differences, phonemic
substitutions or deletions, prosodic variations and other factors that influence our
recognition ability.
Therefore, more insight and systematic study of intrinsically effective speaker
discrimination features are needed. A series of smaller experiments should be done
so that the experimental results will be mutually supportive and will lead to overall
understanding of the combined effects of all the parameters that are likely to be
present in actual situations (Rosenberg, 1976; Committee on Evaluation of Sound
Spectrograms, 1979).
Unlike automatic speech and speaker recognition, automatic gender
recognition was never proposed as a stand-alone problem. Little attention was paid
to either the theoretical basis or the practical techniques for the realization of a
system for the automatic recognition of gender from speech. Although
contemporary research on speech included investigation of physiological and
acoustic gender features and their correlation with perceived gender differences, no
attempt was made to classify the speaker's gender objectively, using features
automatically extracted by a computer. Childers and Hicks (1984) first proposed
such a study as a separate recognition task and, thus, this research resulted from that
proposal. A possible realization of such a system is shown in Figure 1.1.
1.2 Application Perspective
The significance of the proposed research is as follows:
o Accomplishing this task could facilitate speech recognition and
speaker identification or verification by reducing the required search
space to half. Such a pre-process may occur in the listening process
of human being. One of the speech perception hypotheses proposed
by OKane (1987) stated that human listeners have to determine the
gender of the speaker first in order to determine the identity of the
sounds. Another perception hypothesis is that the identity of the
sounds can be roughly determined without knowledge of the
speakers gender but final recognition is possible only after the
speakers gender is known. In both cases, identification of the
speakers gender is a necessary step before recognition of sounds,
o Accomplishing this task could be useful for speech synthesis. It is
well known that in synthesized speech, the female voice has not been
reproduced with the same level of success as the male voice (Monsen
and Engebretson, 1977). Further study of gender cues would

where
F0 = fundamental frequency
F1 = first formant frequency
BW1 = first formant bandwidth
Figure 1.1 A possible automatic gender recognition system.

4
contribute to the solution of this problem since acoustic features for
synthesizing speech for either gender would be provided. Hence, the
voice quality of voice response systems and text-to-speech
synthesizers would be improved.
o Accomplishing this task could provide new guidelines and suggest
methods to identify the acoustic features related to dialect, age,
health conditions, etc.
o Accomplishing this task could be a unique or an only approach for
some applications (e.g., law enforcement applications). In a
criminal investigation, an attempt is usually made to identify the
speaker on a recording as a specific person. If an individual is able
to deceive the investigator as to his gender, he may well prevent his
detection. It is well known that speakers can disguise their
speech/voice to confound or prevent detection (Hollien and
McGlone, 1976, cited by Carlson, 1981). The female impersonator
is an example of intentional deception of the listener. In such a
case, identification of the speaker's gender is critical.
o Finally, we presumed that the research results could benefit clinical
applications such as correction for a person with a voice disorder or
handicap. Other applications include transsexual changes, etc.
(Bralley et al., 1978; Carlson, 1981).
1.3 Literature Review
1.3.1 Basic Gender Features
The differences between male and female voices depend upon many factors.
Generally, there exist three types of parameters: physiological and acoustical,
which are objective, and perceptual, which is subjective (Figure 1.2).

[Figure: three parameter types. Physiological: vocal fold length and thickness;
vocal tract length, area, and shape. Acoustic: fundamental frequency; vocal tract
features (formant frequency, bandwidth, and amplitude); glottal volume-velocity
waveshape.]
Figure 1.2 Basic gender features.

Many physiological parameters of the male and female vocal apparatus have
been determined and compared. Fant (1976) showed that the ratio of the total
length of the female vocal tract to that of a male is about 0.87, and Hirano et al.
(quoted by Cheng and Guerin, 1987) showed that the ratio of the length of the
female vocal fold to that of the male is about 0.8. Titze (1987 and 1989) reported
that, anatomically, the female larynx also differs from the male larynx in thickness,
angle of the thyroid laminae, resting angle of the glottis, vertical convergence angle
in the glottis, and in other ways. The ratio of the length and the ratio of the area of
pharynx cavity of the female to that of the male are 0.8 and 0.82, respectively.
Similarly, we take 0.95 as the ratio of the lengths and 1.0 as the ratio of
the areas of the oral cavity of the female to those of the male. The extra ratio for the
area of the oral cavity is due to the fact that the degree of openness of the oral cavity
is comparatively greater in the case of the female than in the case of the male
(Ohman, quoted by Fant, 1966). Ohman also suggested that a proportionally larger
female mouth opening is a factor to consider. Figure 1.3 illustrates the human vocal
apparatus.
The differences in physiological parameters can lead to induced differences in
acoustical parameters. When comparing male and female formant patterns, the
average female formant frequencies are roughly related to those of the male by a
simple scaling factor that is inversely proportional to the overall vocal tract length.
On the average, the female formant pattern is said to be scaled upward in frequency
by about 20% compared to the average male formant pattern (Figure 1.4). It is also
well known that the individual size of the vocal cavities and thus of the formant
pattern scale factor may vary appreciably depending upon the age and gender of the
speaker. Peterson and Barney (1952) measured the first three formant frequencies
present in ten vowels spoken by men, women, and children. They reported that male
formants were the lowest in frequency, women had a higher range, and children had

Figure 1.3 A cross section of the human vocal apparatus.
Figure 1.4 An example of male and female formant features.
Figure 1.5 Fundamental frequency changes for two speakers
for the utterance "We were away a year ago."
the highest. Carlson (1981) gave a survey of the literature on the vocal tract
resonance characteristics as a gender cue.
Fant (1966) has pointed out that the male and female vowels are typically
different in three groups:
1) rounded back vowels,
2) very open unrounded vowels, and
3) close front vowels.
The main physiological determinants of the specific deviations are that the ratio of
pharyngeal length to mouth cavity length is greater for males than for females and
the laryngeal cavities are more developed in males.
Schwartz and Rine (1968) also demonstrated that the gender of an individual
can be identified from voiceless fricative phonemes such as /S/, /F/, etc. This again is
induced by the vocal tract size differences between the genders.
The higher fundamental frequency (pitch) range of the female speaker is quite
well known. There is a general agreement that the fundamental frequency is an
important factor in the identification of gender from voice (Curry, 1940, cited by
Carlson, 1981; Hollien and Malcik, 1967; Saxman and Burk, 1967; Hollien and Paul,
1969; Hollien and Jackson, 1973; Monsen and Engebretson, 1977; Stoicheff, 1981;
Horri and Ryan, 1981; Linville and Fisher, 1985; Henton, 1987). One often finds the
statement that the pitch level of the female speaking voice is approximately one
octave higher than that of the male speaking voice (Linke, 1973). However, there is
considerable discrepancy among values obtained by different investigators.
According to Hollien and Shipp (1972), the male subjects showed an intersubject
pitch range of 112-146 Hz. Stoicheff's (1981) data showed that the range for the
female subjects was 170-275 Hz. Titze (1989) found that the fundamental
frequency was scaled primarily according to the membranous lengths of the vocal
folds (scale factor 1.6). Figure 1.5 shows fundamental frequency changes for two

speakers for the utterance "We were away a year ago." Figure 1.6 shows the
corresponding speech signals.
The female voice is slightly weaker than the male voice. On the average the
root mean square (rms) intensity of glottal periods produced by female subjects is -6
db relative to comparable samples produced by males. A study by Karlsson (1986)
indicated a strong correlation between weak voice effort and constant air leakage
during closed-phase.
During the last few years, measuring the area of the glottis as well as
estimating the glottal volume-velocity waveform have become research topics of
interest (Holmberg et al., 1987). It is well known that the shape of the glottal
excitation wave is an important factor which can greatly affect speech quality
(Rothenberg, 1971). The wave shape produced by male subjects is typically
asymmetrical and frequently shows a prominent hump in the opening phase of the
wave (due to source-tract interaction). The closing portion of the wave generally
occupies 20%-40% of the total period and there may or may not be an easily
identifiable closed period (Monsen and Engebretson, 1977). Notable differences
between male and female waveforms are that the female waveform tends to be
symmetric. There is seldom a hump during the opening-phase indicating less or no
source-tract interaction, and both the opening and closing parts of the wave occupy
more nearly equal proportions of the period. Holmberg et al. (1987) found
statistically significant differences in male-female glottal waveform parameters. In
normal and loud voices, female waveforms indicated lower vocal fold closing
velocity, lower ac flow, and a proportionally shorter closed-phase of the cycle,
suggesting a steeper spectral slope for females. For softly spoken voices, spectral
slopes are more similar to those of males.
These glottal-source differences between male and female subjects are
understandable in terms of the relative size of male and female vocal folds. It is

[Figure: two waveform panels, (a) and (b)]
Figure 1.6 Speech signals for (a) male and (b) female speakers
for the utterance "We were away a year ago."

possible that the asymmetrical, humped appearance of the male glottal wave may be
due to a slightly out-of-phase movement of the upper and lower parts of each vocal
fold. If this is so, then the generally symmetrical appearance of the female glottal
wave may be due to the fact that the shorter female vocal folds come into contact
with each other more nearly as a single mass (Ishizaka and Flanagan, 1972).
The perceptual parameters or strategies used to make decisions concerning
male/female voices are not delineated in the literature even though making this
decision is a discrimination task performed routinely by human listeners. However,
it is hypothesized that a limited number of perceptual cues for classifying voices do
exist in the repertoire of listeners, and these cues may include some sociological
factors such as cultural stereotyping.
Singh and Murry (1978) and Murry and Singh (1980) investigated the
perceptual parameters of normal male and female voices. They found that the
fundamental frequency and formant structure of the speaker appeared to carry
significant information for all judgments. The listeners judgments that the voices
they heard were female were more dependent on judged qualities of voice and effort.
Effort, pitch, and nasality were the perceptual parameters used to characterize
female voices while male voices were judged on the basis of effort, pitch, and
hoarseness. Their results suggested that listeners may use different perceptual
strategies to classify male voices than they use to classify female ones. Coleman
(1976) also suggested that there was a possibility of a gender-specific listener bias
for one acoustic characteristic or for one gender over the other.
Many researchers also believe melodic (intonation, stress, and/or
coarticulation) cues are speech characteristics associated with female voices.
Furthermore, the female voice is typically more breathy than the male voice. This
can be modeled by a dc shift in the glottal wave or, as suggested by Singh and Murry
(1978), is a result of a large number of pitch shifts. As the subject shifts pitch

direction frequently, complete vocal fold approximation is less probable. Research
on the acoustic correlates of breathiness was performed by Klatt (1987), in which
three breathiness parameters (i.e., first harmonic amplitude, turbulence noise,
and tracheal coupling) were proposed. A detailed discussion of controlling these
parameters was presented in Klatt's paper (1987).
A new trend to find the features responsible for gender identification is to
apply the approach of synthesis. The work done by Yegnanarayana et al. (1984),
Wu (1985), Childers et al.(1985a, 1985b, 1987, 1989), and Pinto et al. (1989)
represented this aspect. In their experiments, the speech of a talker of one gender
was converted to sound like that of a talker of the other gender to examine factors
responsible for distinguishing gender features. They found that the fundamental
frequency, the glottal excitation waveshape and the spectrum, which included
formant locations and bandwidth, overall spectral shape and slope, and energy, are
crucial control parameters.
1.3.2 Acoustic Cues Responsible for Gender Perception
As part of current interest in speaker recognition, investigators have sought to
specify gender-bearing attributes of the human voice. Under normal speaking and
listening circumstances, listeners have little difficulty distinguishing the voices of
adult males and females, suggesting that the acoustic parameters which underlie
gender identity are perceptually prominent. The judgment of adult gender is
strongly influenced by acoustic variables reflecting gender differences in laryngeal
size and mass as well as vocal tract length. However, the issue of which specific
acoustic cues are mostly responsible for gender identification has not been
definitively resolved. Such a controversy partially dominated the previous research.
A series of experiments run by Schwartz (1968) and Ingemann (1968)
employed voiceless fricatives spoken in isolation as auditory stimuli and it was found

that listeners could identify speaker gender accurately from these stimuli, especially
from /H/, /S/, and /SH/ (and could not from /F/ and /TH/). Ingemann reported that
the most identifiable fricative was /h/, with identification of others ranging down to
little better than chance. Since the laryngeal fundamental (F0) was not available to
the listeners, their findings suggest that accurate gender identification is possible
from vocal tract resonance (VTR) information alone and, therefore, that formants
are important cues for speaker gender identification.
Further support for this conclusion came from studies by Schwartz and Rine
(1968) and Coleman (1971). Schwartz and Rine's study revealed that the listeners
were able to identify the speaker's gender from two whispered vowels (/i/ and /a/).
They found 100% correct identification for /a/ and 95% correct identification for /i/,
despite the absence of the laryngeal fundamental. In Coleman's study on male and
female voice quality and its relationship to vowel formant frequencies, /i/, /u/, and a
prose passage were employed to explore listeners' gender identification abilities.
All stimuli were produced at the same F0 (85 Hz) by means of an electrolarynx.
Coleman discovered that the judges correctly recognized the speaker's gender 88% of
the time (with 98% correct for male voices and 79% for female voices), even when
the F0 remained constant for all speakers. He also discovered that the vowel
formant frequency averages were closely associated with the degree of male or
female voice quality.
Coleman (1973a and 1973b) attempted to reduce the influence of possible
differences in rate, juncture, and inflection between male and female speakers by
presenting their voiced productions of prose passage backward to subjects. The
judgments should have, therefore, been based solely on VTR and F0 information
which would be unaffected by the backward presentation. By correlation analysis
between measures of VTR, FO, and judgments of degree of male and female voice
quality in the voices of the speakers (with degree of correlation indicative of the

contribution of each of the vocal characteristics to listener judgments), he found that
listeners were basing their judgments of the degree of male or female voice quality
on the frequency of the laryngeal fundamental.
However, in a later study by Coleman (1976), there were inconsistent findings
from a pair of experiments concerned with a comparison of the contribution of two
vocal characteristics to the perception of male and female voice quality. The first
experiment, which utilized natural speech, indicated that the F0 was very highly
correlated with the degree of gender perception while the VTR was less highly
correlated. When VTRs that were more characteristic of the opposite gender were
included experimentally in these voices, they did not affect the judges' estimates of
the degree of male or female voice quality. But, in the second experiment, when a
tone produced by a laryngeal vibrator was substituted for the normal glottal tone at
simulated F0 representing both male (120 Hz) and female (240 Hz), and male and
female characteristics (i.e., vocal tract formants and laryngeal fundamentals) were
combined in the same voice experimentally, he found that the female F0 was a weak
indicator of the female voice quality when it was combined with the male VTR
features, although the male F0 retained the perceptual prominence seen in the first
experiment. Thus, there was a difference in the manner that F0 and VTR interact
for male and female perception.
Lass et al. (1976) conducted a study comparing listeners' gender identification
accuracy from voiced, whispered, and 255 Hz low-pass filtered isolated vowels.
They found that listener accuracy was greatest for the voiced stimuli (96% correct
out of 1800 identifications: 20 speakers x 6 vowels x 15 listeners), followed by the
filtered stimuli (91% correct), and least accurate (75% correct) for the voiceless
vowels. Since the low-pass filtered vowels apparently eliminated formant
information, they concluded that the F0 was a more important acoustic cue in
speaker gender identification tasks than the VTR characteristics of the speaker.

Lass et al. (1976) also reported that there were large gender differences in
their results. In all experimental conditions females were recognized at a
significantly lower level, which was in agreement with the results of Coleman (1971)
mentioned above. In another study supportive of this point, Brown and Feinstein
(1977) also used electrolarynx (120 Hz) to control F0 so that VTR was the variable.
Identification of male speakers was 84% correct and identification of female
speakers was 67% correct. Brown and Feinstein also found, as in the Coleman
(1971) study, that centralized spectra were more ambiguous to listeners. Again,
VTR appeared to play a determinant role in gender identification in the absence of
F0.
In a later experiment, the effect of temporal speech alterations on speaker
gender and race identification was investigated. Lass and Mertz (1978) found that
gender identification accuracy remained high and unaffected by temporal speech
alterations when the normal temporal features of speech were altered by means of
the backward playing and time compressing of speech samples. They concluded
that temporal cues appeared to play a role in speaker race, but not speaker gender
identification.
In another study concerned with the effect of phonetic complexity on speaker
gender identification, Lass et al. (1979) found that phonetic complexity did not
appear to play a major role for gender judgments. No regular trend was evident
from simple to complex auditory stimuli, and listeners' accuracy was as great for
isolated vowels as it was for sentences.
In an attempt to investigate the relative importance of portions of the
broadband frequency speech spectrum in gender identification, Lass et al. (1980)
constructed three recordings representing the three experimental conditions in the
study: unfiltered, 255 Hz low pass filtered, and 255 Hz high pass filtered. The
recordings were played back to a group of 28 judges. The results of their judgments

indicated that gender identification was not significantly affected by such filtering;
listeners' accuracy in gender recognition remained high for all three experimental
conditions, showing that gender identification can be made accurately from acoustic
information available in different portions of the broadband speech spectrum.
1.3.3 Summary of Previous Research
A review of the literature shows that previous research revealed extensive
information about gender identification. However, it is clear that
much work still remains to be done.
What has not been completed
The relative importance of the F0 versus VTR characteristics for
perceptual male or female voice quality is still controversial. The belief
that the F0 is the strongest cue to gender seems to be substantiated by the
evidence. There is a hypothesis that in situations in which the role of F0
is diminished by deviancy, the effect of VTR characteristics upon gender
judgments increases from a minimal level to take on a large role equal to
and sometimes even greater than that played by F0 (Carlson, 1981). But
this hypothesis remains unproven.
It is well known now that not only the vibration frequency of the
glottis (F0) but also the shape of the glottal excitation wave are
important factors which greatly affect speech quality (Rothenberg, 1971;
Holmes, 1973). Differences in glottal excitation wave shapes for males
and females were observed and investigated (Monsen and Engebretson,
1977; Karlsson, 1986; Holmberg and Hillman, 1987). But perceptual
justification of these characteristics was still limited (Carrell, 1981) and
the inverse filtering techniques need to be improved and more data
should be analyzed.

What was neglected
First of all, research on automatically classifying male/female
voices by using objective feature measurements was entirely missing.
Almost all previous work was concentrated on subjective testing which is
expensive, time and labor consuming, and subject dependent. Objective
gender recognition which is reliable, inexpensive, and consistent has not
been developed in parallel to subjective testing, but such work is
necessary as we stated earlier.
Second, the influences of formant bandwidth and amplitude and
overall spectral shape on gender cues were not considered or
investigated. Traditionally, experiments on the contribution of vocal tract
characteristics to gender perception were only concerned with formant
frequencies (Coleman, 1976). The bandwidths of the lowest formant
depend upon vocal tract wall loss and source-tract interaction (Rabiner
and Schafer, 1976; Rothenberg, 1981) while bandwidths of the higher
formants depend primarily upon the viscous friction, thermal loss, and
radiation loss (Flanagan, 1972). These factors may be different for each
gender so that the bandwidths and overall spectral shape are different for
each gender. Bladon (1983) pointed out that male vowels appeared to
have narrower formant bandwidths and perhaps also a less steeply
sloping spectrum. All these areas require further investigation.
What was the weakness
The acoustic features were obtained by short-time spectral
analysis which usually used analog spectrographic techniques.
Estimated F0 and formant frequencies may be inaccurate due to

1. errors in determining the positions of the harmonic peaks (in
practice, the peaks were read by means of inspection by a
person and then the FO and formants were calculated).
2. errors in formant estimation due to the influences of the FO
and source-tract interaction.
3. large instrument errors (e.g., drift).
Lindblom (1962) estimated the accuracy of spectrographic
measurement to be approximately equal to the fundamental frequency
divided by 4. Flanagan (1955) and Nord and Sventelius (1979, quoted by
Monsen and Engebretson, 1983) suggested that a difference of about 50
Hz for the second formant and a difference of about 21 Hz for the first
formant was perceptible. Therefore, formant frequency estimation should
be as accurate as possible in vowel analysis as well as synthesis.
However, the most frequently referenced paper on acoustic phonetics,
which contains the most comprehensive measurements of the vowel
formants of American English (Peterson and Barney, 1952), may involve
measurement errors as pointed out by Monsen and Engebretson (1983),
especially for female and child subjects since the data were obtained by
spectrographic measurement.
The technique frequently employed to examine the ability of VTR
to serve as a gender cue was to standardize the F0 (and therefore eliminate
it as a variable) by utilizing an artificial larynx (Coleman, 1971, 1976;
Brown and Feinstein, 1977). This allows evaluation of VTR in a sample
that contains an F0 that is the same for both male and female subjects.
The electrolarynx itself has an unnatural sound to it that may confuse the
listener and depress the overall accuracy of perception.

The study populations were relatively small for most
investigations. Sometimes the database used consisted of less than 10
subjects for each gender (Ingemann, 1968; Schwartz and Rine, 1968;
Brown and Feinstein, 1977), making the interpretation of the results
unreliable.
The results of the listening tests may depend on the gender
distribution of the testing panel because males and females may use
different judging strategies. However, this point was usually not
emphasized, so the conclusions claimed from listening tests may be
biased (Coleman, 1976; Carlson, 1981).
In summary, previous research has measured and investigated the
physiological or anatomical parameters for each gender. Under certain
assumptions, the relationship between anatomical parameters and some of the
acoustic features was established. The major acoustic parameters responsible for
perceptually discriminating a speaker's gender from voice were investigated and
tested. However, no attempt was made to automatically classify male/female voices
by objective feature measurements. The vowel characteristics for each gender were
inaccurate because of the weakness of analog techniques. Various hypotheses and
preliminary results need to be verified on a more comprehensive database. All these
constituted the underlying problems and impetuses for this research.
1.4 Objectives of this Research
This research sought to address these problems through two specific
objectives.
One objective of this study was to explore the possible effectiveness of digital
speech processing and pattern recognition techniques for an automatic gender
recognition system. Emphasis was placed on the investigation of various objective

acoustic parameters and distance measures. The optimal combination of these
parameters and measures was sought. The extracted acoustic features that are
most effective for classifying a speaker's gender objectively were characterized. Efficient
recognition schemes and decision algorithms for this purpose were developed.
The other objective of this study was to validate and clarify hypotheses
concerning some acoustic parameters affecting the ability of algorithms to
distinguish a speaker's gender. Emphasis was placed on extraction of accurate
vowel characteristics including fundamental frequency and formant features such as
formant frequency, bandwidth and amplitude for each gender. The relative
importance of these characteristics for gender identification was evaluated.
1.5 Description of Chapters
In Chapter 2, an overview of the research plan is given and a brief description
of the coarse and fine analysis is presented. The database and the techniques
associated with data collection and preprocessing are discussed in Chapter 3. The
details of the experimental design based on coarse analysis are described in Chapter
4. Asynchronous LPC analysis is reviewed. Different acoustic parameters, distance
measures, template formation, and recognition schemes are provided. The
recognition decision rule and resubstitution or exclusive procedure are proposed as
well. In addition, the concept of Fisher's discriminant ratio criterion is reviewed.
The recognition performance based on coarse analysis is assessed in Chapter 5.
Results of comparative studies of various phonemes, acoustic features, distance
measures, recognition schemes, and filter orders are reported. The gender
separability of acoustic features is also analyzed by using Fisher's discriminant
ratio criterion. Chapter 6 expounds on the detailed experimental design of fine
analysis. In particular, the advantages of pitch synchronous closed phase analysis are

demonstrated. A review of the closed phase WRLS-VFF (Weighted Recursive Least
Squares with Variable Forgetting Factor) analysis and the EGG (electroglottograph)
assisted approaches is also presented. Chapter 6 also introduces testing methods for
fine analysis, which include the two-way ANOVA (Analysis of Variance) statistical
test and the automatic recognition test using grouped features. Chapter 7 analyzes
the vowel characteristics such as fundamental frequencies and formant features for
each gender. Statistical tests and relative importance of grouped vowel features are
also discussed. Finally in Chapter 8, a summary of the results of this dissertation is
offered. Recommendations and suggestions for future research conclude this last
chapter.

CHAPTER 2
APPROACHES TO GENDER RECOGNITION FROM SPEECH
2.1 Overview of Research Plan
The goal of this study was to explore the possible effectiveness of digital
speech processing and pattern recognition techniques for an automatic gender
recognition system from speech. In order to do this, some hypotheses concerning
acoustic parameters that act to affect our ability to distinguish a speaker's gender
needed to be validated and clarified.
Thus, this study was divided into two directions as illustrated in Figure 2.1.
One direction was called coarse analysis since it applied classical pattern recognition
techniques and asynchronous linear prediction coding (LPC) analysis of speech.
The specific goal of this direction was to develop and test candidate algorithms for
achieving gender recognition rapidly using only a brief speech data record.
The second research direction covered fine analysis since pitch synchronous
closed-phase analysis was utilized to obtain accurate vowel characteristics for each
gender. The specific aim of this direction was to compare the relative significance of
vowel characteristics for gender discrimination.
2.2 Coarse Analysis
The tool we used in this direction was asynchronous LPC analysis. The
advantages of using this technique are

Figure 2.1 The overall research flow.

1. The well-known linear prediction coding (LPC) vocoder is an
efficient vocoder which, when used as a model, encompasses the
features of the vocal source (except for the fundamental frequency) as
well as the vocal tract (Rabiner and Schafer, 1978). Since gender
features are believed to be included in both vocal source and tract,
satisfactory results would be expected using LPC derived
parameters.
2. The LPC all-pole model has a smoothed, accurate spectral envelope
matching characteristic, especially for vowels. Formant frequency
measurements obtained by LPC have also been found to compare
favorably to measures obtained by spectrographic analysis (Monsen
and Engebretson, 1983; Linville and Fisher, 1985). Thus it is
expected that features obtained by LPC would represent the spectral
characteristics of both genders more accurately.
3. The LPC model has been successfully applied in speech and speaker
recognition (Makhoul, 1975a; Atal, 1974b, 1976; Rosenberg, 1976;
Markel, 1977; Davis and Mermelstein, 1980; Rabiner and Levinson,
1981). Moreover, many related distortion or distance measurements
have been developed (Gray and Markel, 1976; Gray et al., 1980;
Juang, 1984; Nocerino et al., 1985) which could be conveniently
adopted for the preliminary experiments of gender recognition.
4. Deriving acoustic parameters from the LPC model is
computationally fast and efficient, and only short data records are
needed. This is a very important factor in designing an automatic
gender recognition system.

In the coarse analysis, acoustic parameters such as autocorrelation, LPC,
cepstrum, and reflection coefficients were derived to form test and reference
templates. The effects of using different distance measures, filter orders,
recognition schemes, and phonemes were comparatively evaluated. Comparisons of
acoustic parameters using Fisher's discriminant ratio criterion were also
conducted.
The linear prediction coding concepts and detailed experimental design based
on the coarse analysis will be given in Chapter 4.
2.3 Fine Analysis
The objective of the fine analysis was to study and compare the relative
significance of vowel characteristics responsible for gender discrimination.
As we know, male/female vowel characteristics are distinguished by formant
positions, bandwidths, and amplitudes, so accurate formant estimation is
necessary. It is important to pay particular attention to the measurement technique
and to the degree of accuracy which can be achieved through it. Although formant
features have been measured for a variety of different studies, the accuracy of these
measurements is still a matter of conjecture.
Formant estimation is influenced by (Atal, 1974a; Childers et al., 1985a;
Krishnamurthy and Childers, 1986):
o the effect of the periodic vocal fold excitation, especially when the
harmonic is near the formant,
o the effect of the excitation-spectrum envelope,
o the effect of time averaging over several excitation cycles in the
analysis when the vocal folds are repeatedly in open-phase (large

source-tract interaction) and closed-phase (little or no source-tract
interaction) conditions.
Frame based asynchronous LPC analysis cannot reduce the effect of
source-tract interaction because this technique uses windows that average the data
over several excitation epochs. The pitch synchronized closed phase covariance
(CPC) method can reduce the effect of source-tract interaction. However, in certain
situations, the vocal tract filter derived by this method may be unstable because of
the short closed glottal intervals, especially for females and children (Ting et al.,
1988).
Sequential adaptive analysis methods offer an attractive alternative processing
strategy since they overcome some of the drawbacks of frame-based analysis. The
closed-phase WRLS-VFF method that tracks the time-varying parameters of the
vocal tract and updates the parameters during the glottal closed phase interval can
reduce the formant estimation error. Experimental results (Ting et al., 1988; Ting,
1989) show that the formant tracking ability and formant estimation accuracy of the
WRLS-VFF algorithm are superior to those of the LPC based method. Detailed formant
features, including frequencies, bandwidths, and amplitudes in the fine analysis
stage were obtained by using this method. The EGG signals were used to assist in
locating the closed phase portion of the speech signal (Childers and Larar, 1984;
Krishnamurthy and Childers, 1986).
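The details of the closed-phase WRLS-VFF method are deferred to Chapter 6. As a rough illustration of its recursive core only, the following is a minimal Python sketch of one weighted recursive least squares update with a forgetting factor; the variable forgetting factor schedule and the EGG-based closed-phase gating are omitted, and all names are our own assumptions.

import numpy as np

def rls_step(theta, P, phi, s_n, lam):
    # theta: current vocal tract predictor estimate, shape (p,)
    # P: inverse correlation matrix estimate, shape (p, p)
    # phi: regressor of past samples [s(n-1), ..., s(n-p)]
    # s_n: new speech sample; lam: forgetting factor (near 1)
    P_phi = P @ phi
    k = P_phi / (lam + phi @ P_phi)       # gain vector
    e = s_n - phi @ theta                 # a priori prediction error
    theta = theta + k * e                 # parameter update
    P = (P - np.outer(k, P_phi)) / lam    # covariance update
    return theta, P, e

A forgetting factor near 1 weights the past heavily; in a closed-phase WRLS-VFF scheme of the kind described above, the factor would be varied so that the estimates are updated principally within the closed glottal interval located with the EGG signal.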
There were two approaches for testing the relative importance of various
vowel features for gender recognition:
Statistical tests. Since formant characteristics such as frequencies,
bandwidths, and amplitudes depend on or are influenced by two
factors (i.e., gender as well as vowels) and each experimental
subject produces more than one vowel, our experiments should be
referred to as two factor experiments having repeated measures on

the same subject (Winer, 1971). Therefore, a two-way ANOVA was
used to perform the statistical test (a brief sketch follows this list). The significance of the
difference between each individual feature in terms of male/female
groups was analyzed.
Automatic recognition. First, the individual or grouped feature(s),
such as only the fundamental frequency or only the formant
frequencies or bandwidths (but from all formants), were used to
form the reference and test templates. Then automatic recognition
schemes were applied on these templates. Finally, the recognition
error rates for different features were compared.
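As a hedged sketch of the statistical testing approach above, a two-way ANOVA on one feature might be run with statsmodels as follows; the long-format table and its column names are our own assumptions, and this simple fixed-effects form ignores the repeated-measures structure noted above.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def gender_vowel_anova(df: pd.DataFrame, feature: str):
    # df: one row per (subject, vowel) with columns gender, vowel, and the feature
    model = smf.ols(f"{feature} ~ C(gender) * C(vowel)", data=df).fit()
    return anova_lm(model, typ=2)   # F tests for gender, vowel, and interaction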
In Chapter 6, the detailed background of the closed phase WRLS-VFF method
and the experimental design based on fine analysis will be presented.

CHAPTER 3
DATA COLLECTION AND PROCESSING
3.1 Database Description
The database consists of speech and EGG data collected from 52 normal
subjects (27 males and 25 females) with speakers' ages varying from 20 to 80 years.
The speech and EGG signals were digitized directly and simultaneously.
Each subject read, after some practice, the following SAMPLE PROTOCOL that
includes 27 tasks.
SAMPLE PROTOCOL
Task 1. Count 1-10 with comfortable pitch & loudness.
2. Count 1-5 with progressive increase in loudness.
3. Sustain phonation of the vowel /IY/ in the word BEET.
4. Sustain phonation of the vowel /I/ in the word BIT.
5. Sustain phonation of the diphthong /AI/ in the word BAIT.
6. Sustain phonation of the vowel /E/ in the word BET.
7. Sustain phonation of the vowel /AE/ in the word BAT.
8. Sustain phonation of the vowel /OO/ in the word BOOT.
9. Sustain phonation of the vowel /U/ in the word BOOK.
10. Sustain phonation of the diphthong /OU/ in the word BOAT.
11. Sustain phonation of the vowel /OW/ in the word BOUGHT.
12. Sustain phonation of the vowel /A/ in the word BACH.
13. Sustain phonation of the vowel /UH/ in the word BUT.
14. Sustain phonation of the vowel /ER/ in the word BURT.
15. Sustain phonation of the whisper /H/ in the word HAT.
16. Sustain phonation of the fricative /F/ in the word FIX.
17. Sustain phonation of the fricative /TH/ in the word THICK.
18. Sustain phonation of the fricative /S/ in the word SAT.
19. Sustain phonation of the fricative /SH/ in the word SHIP.
20. Sustain phonation of the fricative /V/ in the word VAN.
21. Sustain phonation of the fricative /TH/ in the word THIS.
22. Sustain phonation of the fricative /Z/ in the word ZOO.
23. Sustain phonation of the fricative /ZH/ in the word AZURE.
24. Produce chromatic scale on "la" (attempt to go up, then down,
as one effort; pause between top 2 notes).
25. Sentence: "We were away a year ago."
26. Sentence: "Early one morning a man and a woman ambled along a
one mile lane."
27. Sentence: "Should we chase those cowboys?"
A subset of the above was used in this research. It consisted of
1. Ten sustained vowels: /IY/, /I/, /E/, /AE/, /OO/, /U/, /OW/, /A/, /UH/,
and /ER/. There were a total of 520 vowels for 52 subjects: 270
vowels from males and 250 vowels from females.
2. Five sustained unvoiced fricatives (including a whisper): /H/, /F/,
/TH/, /S/, and /SH/. There were a total of 260 unvoiced fricatives for
all subjects: 135 from males and 125 from females.
3. Four voiced fricatives: /V/, /TH/, /Z/, and /ZH/. There were a total
of 208 voiced fricatives for all subjects: 108 from males and 100
from females.
3.2 Speech and EGG Digitization
All of the experimental data were collected with the subjects situated inside an
Industrial Acoustics Company single wall sound booth. The speech was picked up
with an Electro Voice RE-10 dynamic cardioid microphone and the EGG signal was
monitored by a Synchrovoice device. Amplification of the speech and EGG signals
was accomplished with a Digital Sound Corporation DSC-240 Audio Control
Console. The two channels were alternately sampled at 20 kHz by a Digital Sound
Corporation DSC-200 Digital Audio Converter system with 16 bits of precision.

31
The low-pass, anti-aliasing and reconstruction filters of the DSC-200 were
connected to the analog side of the converter. Both signals were bandlimited to 5
kHz by these passive elliptic filters with a specification of minimum stopband
attenuation of -55 dB and passband ripple of 0.2 dB. The DSC-240 station
provides audio signal interfacing to the DSC-200, which includes input and output
buffering as well as level metering and signal path switching.
The utterances were directly digitized since this choice avoids any distortion
that may be introduced in the tape recording process (Berouti et al., 1977; Naik,
1984). An extender attached to the microphone kept the speaker's lips 6 inches
away. With the microphone and EGG electrodes in place, the researcher ran the
data collection program on a terminal inside the sound room. A two channel
Tektronix Type 564B Storage Oscilloscope was connected to DSC-240 so both
speech and EGG signals were monitored. The program prompted the researcher by
presenting a list of commands on the screen. The researcher initiated digitization by
depressing the D key on the keyboard. Immediately after digitization, another
prompt indicated termination of the sampling process. The digitized utterance could
be played back and an option existed to repeat the digitization process if it was
thought that part of the utterance might have been spoken abnormally or the
digitized speech and EGG signals were unsatisfactory. For example, the speakers
were instructed to repeat an utterance if the panel of experts sitting in the
sound room or the speaker felt that it was rushed, mispronounced, too low, etc. The
entire protocol with utterances repeated as necessary took an average of 15-20
minutes to collect. About 150-200 seconds of speech and EGG were automatically
stored on disk. Thus, for each subject, about 12000-16000 blocks (512 bytes per
block) of data were collected.
Since the speech and EGG channels were alternately sampled, the resulting
file of digitized data had the two signals interleaved. The trivial task of

demultiplexing was performed off-line after data collection. Once the data were
demultiplexed, the speech and EGG were trimmed to discard the unnecessary data
before and after an utterance while keeping the onset and offset portions at each end
of the data. After trimming, about 4500-6500 blocks of data were stored on disk for
each subject.
3.3 Synchronization of Data
When the speech and EGG signals were used during the analysis stage, they
were time aligned to account for the acoustic propagation delay from the larynx to
the microphone. The microphone was kept a fixed 15.24 centimeters (6 inches)
away from the speaker's lips to reduce breath noises and to simplify the alignment
process. Synchronization of the waveforms had to account for the distance from the
vocal folds to the microphone. To do so, average vocal tract lengths of 17 cm for
males and 15 cm for females were assumed. The number of samples to discard
from the beginning of the speech record was then
N = \mathrm{Int}[(32.24/34442) \times 10000 + 0.5]    (3.1)

for males and

N = \mathrm{Int}[(30.24/34442) \times 10000 + 0.5]    (3.2)
for females.
Equations (3.1) and (3.2) show that a 10 sample correction is appropriate for
males and a 9 sample correction is appropriate for females. Examination of the data
also supported use of these figures for adult speakers. Examples of aligned speech
and EGG signals for a male and female speaker are shown in Figure 3.1.

Figure 3.1 Examples of aligned speech and EGG signals
for (a) male and (b) female speakers.

CHAPTER 4
EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS
As stated in Chapter 2, coarse analysis applies classical pattern recognition
techniques and asynchronous linear prediction coding (LPC) analysis of speech.
The specific goal of this analysis was to develop and test the algorithms to achieve
rapid gender recognition from only a brief speech data record. Figure 4.1 shows the
canonic pattern recognition model used in the gender recognition system. There are
four basic steps in the model:
1) acoustic parameter extraction,
2) test and reference pattern or template formation,
3) pattern similarity determination, and
4) decision rule.
The input is the acoustic waveform of the spoken speech signal; the desired
output is a best estimate of the speaker's gender in the input. Such a model can be
a part of a speech or speaker recognition system or a front end processor of the
system. The following discussion of the coarse analysis proceeds in the context of
Figure 4.1.
4.1 Asynchronous LPC Analysis
4.1.1 Linear Prediction Concepts
Linear prediction, also known as the autoregressive (AR), all-pole model, or
maximum entropy model, is widely used in speech processing. This method has

Figure 4.1 A pattern recognition model
for gender recognition from speech.

become the predominant technique for estimating the basic speech parameters (e.
g., pitch, formants, spectra, vocal tract area functions) and for representing speech
for low bit-rate transmission or storage. The method was first applied to speech
processing by Atal and Schroeder (1970) and Atal and Hanauer (1971). For speech
processing, the term linear prediction refers to a variety of essentially equivalent
formulations of the problem of modeling the speech waveform (Markel and Gray,
1976; Makhoul, 1975b). These different models usually lead to similar results but
each formulation has provided an insight into the speech modeling problem and is
generally dictated by its computational demands.
The particular form of this model that is appropriate for this research is
depicted in Figure 4.2. In this case, the composite spectrum effects of radiation,
vocal tract, and glottal excitation are represented by a time-varying digital filter
whose steady-state system function is of the form

H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}    (4.1)
This system is excited by an impulse train for voiced speech or a random noise
sequence for unvoiced speech.
The above system function can be written alternatively in the time domain as
s(n) = \sum_{k=1}^{p} \alpha_k s(n-k) + G u(n)    (4.2)
Let us assume we have available past data samples from (n-p) to (n-1) and we are
predicting a new nth sample of the data

Figure 4.2 A digital model of speech production.

38
\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (4.3)

The error between the value of the actual nth sample and its estimate is

e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)    (4.4)

and equivalently,

s(n) = \sum_{k=1}^{p} a_k s(n-k) + e(n)    (4.5)

If the linear prediction model of Equation (4.5) conforms to the basic speech
production model given by (4.2), then

e(n) = G u(n)    (4.6)

a_k = \alpha_k    (4.7)

Thus the coefficients (a_k) identify the system whose output is s(n). The problem
then is to determine the values of the coefficients (a_k) from the actual speech signal.
The criterion used to obtain the coefficients (a_k) is the minimization of the
short-time average prediction error E with respect to each coefficient a_i, over some
time interval, where

E = \sum_{n} [e(n)]^2    (4.8)
This leads to the following set of equations

\sum_{k=1}^{p} a_k \sum_{n} s(n-k) s(n-i) = \sum_{n} s(n) s(n-i),    1 \le i \le p    (4.9)
For a short-time analysis, the limits of summation are finite. The particular
choice of these limits has led to two methods of analysis (i.e., the autocorrelation
method (Markel and Gray, 1976) and the covariance method (Atal and Hanauer,
1971)).
The autocorrelation method results in a filter structure that is guaranteed to be
stable. It operates on a data segment that is windowed using a Hanning, Hamming,
or another window, typically 10-20 msec long (two to three pitch
periods).
The covariance method, on the other hand, gives a filter with no guaranteed
stability, but requires no explicit windowing. Hence it is eminently suitable for pitch
synchronous analysis.
One of the important features of the linear prediction model is that the
combined contribution of the glottal flow, the vocal tract and the radiation effect at
the lips are represented by a single recursive filter. The difficult problem of
separating the contribution of the source function from that of the vocal tract system
is thus completely avoided.
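To make the autocorrelation method concrete, the following is a minimal sketch of the Levinson-Durbin recursion on a single pre-windowed frame, under the prediction convention of Equations (4.3)-(4.9); it is an illustration, not the dissertation's actual code, and it also returns the reflection (PARCOR) coefficients used in Section 4.2.4.

import numpy as np

def lpc_autocorrelation(frame, order):
    n = len(frame)
    # short-time autocorrelation r(0) ... r(p)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)      # a[0] unused; predictor a_1 ... a_p
    k = np.zeros(order)          # reflection (PARCOR) coefficients
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - a[1:i] @ r[i - 1:0:-1]
        k[i - 1] = acc / err
        a_new = a.copy()
        a_new[i] = k[i - 1]
        a_new[1:i] = a[1:i] - k[i - 1] * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k[i - 1] ** 2   # prediction error shrinks at each order
    return a[1:], k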
4.1.2 Analysis Conditions
In order to extract acoustic parameters rapidly, a conventional pitch
asynchronous autocorrelation LPC method was used, which applied a fixed frame
size, frame rate, and number of parameters per frame. These analysis conditions
were:
Order of the filter: 8, 12, 16, 20
Analysis frame size: 256 points/frame

Frame overlap: None
Preemphasis Factor: 0.95
Analysis Window: Hamming
Data set for coefficient calculations: six frames total. The first two of these
were picked up from near the voice onset of an utterance, the next two from
the middle of the utterance, and the last two from near the voice offset of the
utterance. By averaging six sets of coefficients obtained from these six frames, a
template coefficient set was calculated for each sustained utterance such as a vowel,
an unvoiced fricative, or a voiced fricative.
4.2 Acoustic Parameters
One of the key issues in developing a recognition system is to identify
appropriate features and measures which will support good recognition
performance. Several acoustic parameters were considered as feature candidates in
this study.
4.2.1 Autocorrelation Coefficients
They are defined conventionally as (Atal, 1974b)
R(k) = \sum_{n=0}^{\infty} h(n) h(n+|k|)    (4.10)
where h(n) is the impulse response of the filter. The relationship between the p
autocorrelation function coefficients and p LPC coefficients is unique in that they
can be obtained from each other (Rabiner and Schafer, 1978).

41
4.2.2 LPC coefficients
LPC coefficients are defined conventionally as (Rabiner and Schafer, 1978)
\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (4.11)

where s(n-k) is the (n-k)th speech sample, \hat{s}(n) is the nth predicted output, and a_k is
the kth LPC coefficient. LPC coefficients are determined by minimizing the
short-time average prediction error.
4.2.3 Cepstrum Coefficients
Cepstral coefficients can be obtained by the following recursive formula
(Rabiner and Schafer, 1978)

c_0 = 0,    c_n = a_n + \sum_{k=1}^{n-1} (k/n)\, c_k a_{n-k},    1 \le n \le p    (4.12)

where a_k is the kth LPC coefficient.
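A minimal sketch of the recursion in Equation (4.12), with the LPC coefficients stored from a_1 upward; names are ours.

import numpy as np

def lpc_to_cepstrum(a, n_cep):
    p = len(a)
    c = np.zeros(n_cep + 1)                  # c[0] = 0 per Equation (4.12)
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= p else 0.0    # a_n, zero beyond the filter order
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]                             # c_1 ... c_{n_cep}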
4.2.4 Reflection Coefficients
If we consider a model for speech production that consists of a concatenation
of N lossless acoustic tubes, then the reflection coefficients are defined as (Rabiner
and Schafer, 1978)
r(k) = \frac{A(k+1) - A(k)}{A(k+1) + A(k)}    (4.13)

where A(k) is the area of the kth lossless tube. The reflection coefficient determines
the fraction of energy in a traveling wave that is reflected at each section boundary.
Further, r(i) is related to the PARCOR coefficient k(i) by (Rabiner and Schafer,
1978)
r(i) = k(i)    (4.14)
where k(i) can be obtained from LPC coefficients by recursion.
4.2.5 Fundamental Frequency and Formant Information
This set of features consists of frequencies, bandwidths and amplitudes of the
first, second, third and fourth formants and the fundamental frequencies (not for
fricatives). Formant information was obtained by a peak-picking technique, using
an FFT on the LPC coefficients. Fundamental frequency was calculated based upon
a modified cepstral algorithm.
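As an illustration of the peak-picking idea (the actual thresholds and the modified cepstral pitch algorithm are not reproduced here), candidate formant frequencies might be read from the LPC spectral envelope as follows; the 10 kHz per-channel sampling rate follows Chapter 3.

import numpy as np

def lpc_formant_candidates(a, fs=10000.0, nfft=1024):
    # LPC envelope |H| = 1/|A| evaluated on an FFT grid
    inv = np.concatenate(([1.0], -np.asarray(a)))   # inverse filter A(z)
    env = 1.0 / np.abs(np.fft.rfft(inv, nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    peaks = [i for i in range(1, len(env) - 1)
             if env[i - 1] < env[i] > env[i + 1]]   # local maxima
    return [freqs[i] for i in peaks]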
4.3 Distance Measures
Several distance measures were considered.
4.3.1 Euclidean Distance
D_{euc} = [\, (X-Y)^t (X-Y)\, ]^{1/2}    (4.15)
where X and Y are the test and reference vectors respectively and t denotes the
transpose of the vector.

4.3.2 LPC Log Likelihood Distance
It was proposed by Itakura (1975) and defined as
D_{lpc}(a, \tilde{a}) = \log \left[ \frac{a^t R\, a}{\tilde{a}^t R\, \tilde{a}} \right]    (4.16)

where a and \tilde{a} are the LPC coefficient vectors of the reference and test speeches,
and R is the matrix of autocorrelation coefficients of the test speech. An
interpretation of this formula is given in Figure 4.3 below in which the subscript r
denotes reference, and the subscript t denotes test.
The denominator term can be obtained by passing the test speech signal
S_t(n) through the inverse LPC system of the test, H_t(z), giving the energy \alpha of the
error signal. Similarly, the numerator term can be obtained by passing the same test
signal S_t(n) through the inverse LPC system of the reference, H_r(z), giving the energy \beta
of the error signal.
Thus we obtain
D_{lpc}(a, \tilde{a}) = \log(\beta/\alpha)    (4.17)
It can also be shown that this distance measure is related to the spectra
dissimilarity between the test and reference speech signals.
For computational efficiency, variables can be changed and Equation (4.16)
or (4.17) can be rewritten as
D_{lpc}(a, \tilde{a}) = \log \left[ \sum_{k=0}^{p} \frac{r_a(k)\, r(k)}{E} \right]    (4.18)

Figure 4.3 An interpretation of the LPC log likelihood distance measure.

where r(k) is the autocorrelation of the speech segment to be recognized, E is the
total squared LPC prediction error associated with the estimates a(k) from this
segment, and ra(k) is the autocorrelation of the true (reference) LPC coefficients.
The block diagram for this subroutine is shown in Figure 4.4.
The spectral domain interpretation of Equation (4.18) is (Rabiner and
Levinson, 1981)
D_{lpc}(a, \tilde{a}) = \log \left[ \int_{-\pi}^{\pi} \left| \frac{H_r(e^{j\omega})}{H_t(e^{j\omega})} \right|^2 \frac{d\omega}{2\pi} \right]    (4.19)
(i.e., an integrated square of the ratio of LPC spectra between reference and test
speech).
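A minimal sketch of the computation of Equations (4.16)-(4.18); the inverse-filter layout [1, -a_1, ..., -a_p] for the coefficient vectors is our assumption.

import numpy as np

def lpc_log_likelihood(a_ref, a_test, frame, order):
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    def quadratic(a):
        # a' R a via the autocorrelation of the coefficients, as in Eq. (4.18)
        ra = np.array([a[:order + 1 - k] @ a[k:] for k in range(order + 1)])
        return ra[0] * r[0] + 2.0 * (ra[1:] @ r[1:])
    # numerator: test frame through the reference inverse filter (beta);
    # denominator: test frame through its own inverse filter (alpha = E)
    return float(np.log(quadratic(a_ref) / quadratic(a_test)))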
4.3.3 Cepstral Distortion
It can be defined as (Nocerino et al., 1985)
D_{cep}(c, \tilde{c}) = \sum_{k=-n}^{n} (c_k - \tilde{c}_k)^2    (4.20)
It can be shown that the power spectrum which corresponds to cepstrum is a
smoothed version of the true log spectral density function. The fewer the cepstral
coefficients used, the smoother the resultant log spectral density. It can also be
shown that this truncated cepstral distortion measure is a good approximation to the
L2 norm of the log spectral distortion measure between two time series, x(n) and
\tilde{x}(n),

Figure 4.4 The block diagram of the LPC log likelihood distance computation.

D_{L2} = \int_{-\pi}^{\pi} \left| \log|X(\omega)|^2 - \log|\tilde{X}(\omega)|^2 \right|^2 \frac{d\omega}{2\pi}    (4.21)
where X(\omega) is the Fourier transform of x(n). Gray and Markel (1976) showed that
for LPC analysis with a filter order of 10 the correlation coefficient between D_{L2}
and D_{cep} is 0.98, while for an order of 20 the correlation coefficient is 0.997.
4.3.4 Weighted Euclidean Distance
D_{weuc} = [\, (X-Y)^t W^{-1} (X-Y)\, ]^{1/2}    (4.22)
where X is the test vector, W is the symmetrical covariance matrix obtained using a
set of reference vectors (e.g., from a set of templates which represent subjects of the
same gender), and Y is the mean vector of this set of reference vectors. The
weighting compensates for correlation between features in the overall distance and
reduces the intragroup variations. The weighted Euclidean distance is a simplified
version of the likelihood distance measure using the probability density function
discussed below.
4.3.5 Probability Density Function
D_{pdf} = (2\pi)^{-n/2}\, |W|^{-1/2} \exp \left[ -\frac{(X-Y)^t W^{-1} (X-Y)}{2} \right]    (4.23)

where X, Y, and W are the same as in Equation (4.22) and |W| is the determinant of W.
The decision principle of this measure is to minimize the probability of
error.
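The three vector measures might be sketched as follows (names ours); note that with Equation (4.23) the gender giving the larger value is chosen, since a density, not a distance, is being maximized.

import numpy as np

def euclidean(x, y):
    return float(np.linalg.norm(x - y))                     # Eq. (4.15)

def weighted_euclidean(x, mean, cov):
    d = x - mean
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))      # Eq. (4.22)

def gaussian_density(x, mean, cov):
    n, d = len(x), x - mean
    quad = d @ np.linalg.solve(cov, d)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)                # Eq. (4.23)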

4.4 Template Formation and Recognition Schemes
4.4.1 Purpose of Design
Another important issue in developing a recognition system is the selection of
appropriate template formation and recognition schemes.
During initial exploratory studies of fixed-text recognition using spectral
pattern matching techniques in the Pruzansky study (1963), the use of the long-term
average technique to form a feature vector was discovered to have potential for
free-text speaker recognition. The speaker recognition error rate was found to
remain undegraded (at 11 percent) even after spectral amplitudes were averaged
over all frames of speech data into a single reference spectral amplitude vector for
each talker. Markel et al. (1977) demonstrated that the between-to-within speaker
variation ratio was significantly increased by long-term averaging of the parameter
sets (thus free-text).
Temporal cues also appeared not to play a role in speaker gender
identification (Lass and Mertz, 1978). They found that gender identification
accuracy remained high and unaffected by temporal speech alterations when the
normal temporal features of speech were altered by means of the backward playing
and time compressing of speech samples.
Therefore, we could reasonably believe that gender information is
time-invariant. Thus, long-term averaging would also emphasize the speaker's
gender information and increase the between-to-within gender variation ratio. In
practice we would also achieve free-text gender recognition, in which gender
identification would be determined before recognition of speech or speaker and
thus reduce the speech or speaker recognition search space by half.
The purpose of using different test and reference template formation schemes
is to verify the hypothesis above and, if it is correct, to determine how much

averaging has to be performed to obtain the best gender recognition. In the
preliminary attempt, the averaging was first done within three classes of the sounds
(i.e., vowels, unvoiced fricatives, and voiced fricatives).
4.4.2 Test and Reference Template Formation
The averaging procedures used to create test and reference templates for the
present experiment employed a multi-level combination approach as illustrated in
Figure 4.5.
The lower layer templates were feature parameter vectors obtained from each
utterance by an LPC analysis as described in the last section. They can be
autocorrelation, LPC, or cepstrum coefficients, etc. A lower layer template
coefficient set was calculated by averaging six sets of coefficients obtained from six
frames for each sustained utterance such as a vowel, an unvoiced fricative, or a
voiced fricative for every subject.
The next level of combination averaged all templates in the lower layer for
each subject to form a single median layer template to represent this subject.
Templates of all utterances for the same phoneme group (e.g., vowels, unvoiced
fricatives, or voiced fricatives) were averaged.
In the last stage, the single male and female templates were formed
in the same manner as above. Each gender was represented by a single
token (centroid) obtained by averaging all templates in the median layer.
It is evident that from the lower layer to the upper layer, a higher degree of
averaging is achieved.
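A sketch of the three-layer averaging of Figure 4.5; the nested data layout is our own assumption.

import numpy as np

def form_templates(data):
    # data: {subject: (gender "M" or "F", [lower layer template vectors])}
    median = {s: np.mean(np.stack(vecs), axis=0)             # one per subject
              for s, (_, vecs) in data.items()}
    upper = {}
    for g in ("M", "F"):                                     # one per gender
        group = [median[s] for s, (gg, _) in data.items() if gg == g]
        upper[g] = np.mean(np.stack(group), axis=0)          # gender centroid
    return median, upper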
Figure 4.6(a) shows two reflection coefficient templates of vowels for male
and female speakers in the upper layer. The filter order was 12 so that there were 12
elements in each template (vector). Each template can be considered a universal
token representing each gender. The data in the figure are shown as

Figure 4.5 Test and reference template formation.

Figure 4.6 (a) Two reflection coefficient templates of vowels
for male and female speakers in the upper layer.
(b) The corresponding spectra.

means ± standard errors (SE), which were calculated from the median layer
templates. We will see in the next Chapter that by applying these two tokens as
reference templates with the recognition Scheme 3 and the Euclidean distance, a
100% gender recognition rate can be achieved. The result is not surprising if we
notice that the within-gender variation for these reflection coefficients, as
represented by SE in the figure, was small, compared to the between-gender
variation. It is also easily noted that elements 1, 4, 5, 6, 7, 8, 9, and 10 of these
reference templates account for the most between-gender variation. On the other
hand, elements 2, 3, 11, and 12 of these reference templates account for little
between-gender variation and thus could be discarded to reduce the dimensionality
of the vector. Figure 4.6(b) shows the spectra corresponding to the two universal
reflection coefficient templates.
Similarly, Figure 4.7(a) and (b) show two reflection coefficient templates
(with the same filter order of 12) and the corresponding spectra of unvoiced
fricatives for male and female speakers in the upper layer. Interestingly, strong
peaks are present in the universal female spectrum, but only a few ripples appear
in the male spectrum. We will see later that by using these
two tokens as reference templates with the recognition Scheme 3 and the Euclidean
distance, an 80.8% gender recognition rate can be achieved. Finally, Figure 4.8(a)
and (b) are two cepstral coefficient templates (with the same filter order of 12) and
the corresponding spectra of voiced fricatives for male and female speakers in the
upper layer. It is shown later that by using these two tokens as reference templates
with the recognition Scheme 3 and the Euclidean distance, a 92.3% gender
recognition rate can be achieved. The "universal" spectra for the two genders in
Figure 4.6(b), 4.7(b), and 4.8(b) possess the basic properties for vowels, unvoiced
fricatives, and voiced fricatives. For example, while the energy of the vowels is
concentrated in the lower frequency portion of the spectrum, the energy of the

Figure 4.7 (a) Two reflection coefficient templates of unvoiced fricatives
for male and female speakers in the upper layer.
(b) The corresponding spectra.

Figure 4.8 (a) Two cepstral coefficient templates of voiced fricatives
for male and female speakers in the upper layer.
(b) The corresponding spectra.

unvoiced fricatives is concentrated in the higher frequency portion of the spectrum,
and the energy of the voiced fricatives is more or less equally distributed.
4.4.3 Nearest Neighbor Decision Rule
The other major step in the pattern recognition model is the decision rule
which chooses which reference template most closely matches the unknown test
template. Although a variety of approaches are applicable, only two decision rules
have been used in most practical systems, namely, the K-nearest neighbor rule
(KNN rule) and the nearest neighbor rule (NN rule).
The KNN rule is applied when each reference class (e.g., gender) is
represented by two or more reference templates (e.g., as would be used to make the
reference templates independent of the speaker). The KNN rule operates as follows:
Assume we have M reference templates for each of two genders, and for each
template a distance score is obtained. If we denote the distance for the ith reference
template of the jth gender as D_{i,j} (1 \le i \le M and j = 1, 2), this set of distance scores
can be ordered such that

D_{1,j} \le D_{2,j} \le \cdots \le D_{M,j}    (4.24)
Then for the KNN rule we compute the average distance (radius) for the jth gender
as
\bar{D}_j = \frac{1}{K} \sum_{i=1}^{K} D_{i,j}    (4.25)
and we choose the index j* with the smallest average distance as the recognized
gender

j^* = \arg\min_{j=1,2} \bar{D}_j    (4.26)
When K is equal to 1, the KNN rule becomes the NN rule (i.e., it chooses the
reference template with the smallest distance as the recognized template).
The importance of the KNN rule is seen for word recognition when M is from 6
to 12, in which case it has been shown that a real statistical advantage is obtained
using the KNN rule (with K = 2 or 3) over the NN rule (Rabiner et al., 1979).
However, since there was no previous knowledge of the decision rule as applied to
gender recognition, the NN rule was first used in this preliminary experiment.
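A minimal sketch of the KNN decision of Equations (4.25) and (4.26) with the Euclidean distance; the (template, gender) list layout is ours, and K = 1 gives the NN rule used in this preliminary experiment.

import numpy as np

def knn_gender(test, references, k=1):
    scores = {}
    for g in ("M", "F"):
        d = sorted(np.linalg.norm(test - t) for t, gg in references if gg == g)
        scores[g] = sum(d[:k]) / k        # average of the K smallest, Eq. (4.25)
    return min(scores, key=scores.get)    # smallest average distance wins, Eq. (4.26)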
4.4.4 Structure of Four Recognition Schemes
To investigate how much averaging should be done for the test and reference
templates to gain the best performance for the gender recognizer, several
recognition schemes were designed. Table 4.1 presents a brief summary of these
schemes.
Table 4.1 Four recognition schemes

              Test template from    Reference template from
SCHEME 1      LOWER LAYER           MEDIAN LAYER
SCHEME 2      LOWER LAYER           UPPER LAYER
SCHEME 3      MEDIAN LAYER          UPPER LAYER
SCHEME 4      MEDIAN LAYER          MEDIAN LAYER
Scheme 1 is illustrated in Figure 4.9(a). In the training stage, one test
template for each test utterance (i.e., the lower layer), and one reference template

Figure 4.9 Structures of four recognition schemes.

for each subject (i.e., the median layer), were formed. The set of the entire median
layer constituted the reference cluster that includes all median templates. In the
testing stage, the distance measure for each lower layer template of all test subjects
was calculated with respect to each of the median layer templates, and the minimum
distance was found. The speaker gender of the lower layer utterance was then
classified as male or female, according to the gender known for the median layer
reference template.
Scheme 2 is illustrated in Figure 4.9(b). In the training stage, one test
template for each test utterance (i.e., the lower layer), and one reference template
for each gender (i.e., the upper layer), were formed. The upper layer constituted the
reference cluster that includes only two gender templates. In the testing stage, the
distance measure for each lower layer template of all test subjects was calculated
with respect to each of those upper layer templates and the minimum distance was
found. The speaker gender of the lower layer utterance was then classified as male
or female, according to the gender known for the upper layer reference template.
Figure 4.9(c) shows Scheme 3. In the training stage, one test template for each
test subject (i.e., the median layer), and one reference template for each gender (
i.e., the upper layer), were formed. The set of the entire median layer constituted
the test pool that includes all median templates. In the testing stage, the distance
measure for each median layer template of all test subjects was calculated with
respect to each of those upper layer templates and the minimum distance was found.
The speaker gender of the median layer template was then classified as male or
female, according to the gender known for the upper layer reference template.
Figure 4.9(d) shows Scheme 4. In the training stage, only the median layer
was formed and each subject was represented by a single template. The median
layer constituted both the test and reference pools. In the testing stage, the
Leave-One-Out or exclusive procedure (which is discussed in detail in the next

section) was applied. The distance measure for each median layer template was
calculated with respect to each of the rest of the median layer templates, and the
minimum distance was found. The speaker gender of the test template was then
classified as male or female, according to the gender known for the reference
template. The above steps were repeated until all subjects were tested.
4.5 Resubstitution and Leave-One-Out Procedures
After the classifier is designed, it is necessary to evaluate its performance
relative to competing approaches. The error rate was considered as the performance
measure.
Four popular empirical approaches that count the number of errors when
testing the classifier with a test data set are (Childers, 1989):
The Resubstitution Estimate (Inclusive). In this procedure, the same data set
is used for both designing and testing the classifier. Experimentally and
theoretically this procedure gives a very optimistic estimate, especially when the
data set is small. Note, however, that when a large data set is available, this method
is probably as good as any procedure.
The Holdout Estimate. The data is partitioned into two mutually exclusive
subsets in this procedure. One set is used for designing the classifier and the other
for testing. This procedure makes poor use of the data since a classifier designed on
the entire data set will, on the average, perform better than a classifier designed on
only a portion of the data set. This procedure is known to give a very pessimistic
error estimate.
The Leave-One-Out Estimate (exclusive). This procedure assumes that there
are n data samples available. Remove one sample from the data set. Design the
classifier with the remaining (n-1) data samples and then test it with the removed

data sample. Return the sample removed earlier to the data set. Then repeat the
above steps, removing a different sample each time, for n times, until every sample
has been used for testing. The total number of errors is the leave-one-out error
estimate. Clearly this method uses the data very effectively. This method is also
referred to as the jackknife method.
The Rotation Estimate. In this procedure, the data set is partitioned into n/d
disjoint subsets, where d is a divisor of n. Then, remove one subset from the design
set, design the classifier with the remaining data and test it on the removed subset,
not used in the design. Repeat the operation for n/d times until every subset is used
for testing. The rotation estimate is the average frequency of misclassification over
the n/d test sessions. When d=l the rotation method reduces to the leave-one-out
methods. When d=n/2 it reduces to the holdout method where the roles of the design
and test sets are interchanged. The interchanging of design and test sets is known in
statistics as cross-validation in both directions. As we may expect, the properties of
the rotation estimate will fall somewhere between the leave-one-out method and
holdout method. The rotation estimate will be less biased than in the holdout
method and the variance is less than in the leave-one-out method.
In order to use the database effectively, the leave-one-out procedure was
adopted for the experiments. For comparison, the resubstitution procedure was also
used in selected experiments.
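A sketch of the leave-one-out error count described above; the classify callback stands in for any of the recognition schemes.

def leave_one_out_error_rate(samples, classify):
    # samples: list of (feature_vector, gender) pairs
    errors = sum(classify(v, samples[:i] + samples[i + 1:]) != g
                 for i, (v, g) in enumerate(samples))
    return errors / len(samples)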
4.6 Separability of Acoustic Parameters
Using Fisher's Discriminant Ratio Criterion
4.6.1 Fisher's Discriminant and F Ratio
After choosing the acoustic feature candidates, we can see how well they separate
different genders by analytically studying the database. There are many measures

62
of separability which are generalizations of one kind or another of the Fishers
discriminant ratio (Childers et al., 1982; Parsons, 1986) concept. The ratio usually
serves as a criterion for selecting features for discrimination.
The ability of a feature to separate classes depends on the distance between
classes and the scatter within classes (generally there will be more than two classes).
This separation is estimated by representing each class by its mean and taking the
variance of the means. This variance is then compared to the average width of the
distribution for each class (i.e., the mean of the individual variances). This measure
is commonly called the F ratio:
F = \frac{\text{Variance of the means (over all classes)}}{\text{Mean of the variances (within classes)}}    (4.27)
The F ratio reduces to Fisher's discriminant when it is used for evaluating a
single feature and there are only two classes. For this reason, the F ratio is also
referred to as the generalized Fisher's discriminant.
In the case of pattern recognition, there are vectors of features, f, and
observed values of f for all the classes we are interested in recognizing. Then two
covariance matrices can be calculated, depending on how the data are grouped.
First, the covariance for a single recognition class can be computed by
selecting only feature measurements for class i. Let any vector from this class be f_i.
Then the within-class covariance matrix for class i is

W_i = \langle (f_i - \mu_i)(f_i - \mu_i)^t \rangle    (4.28)

where \langle \cdot \rangle represents the expectation or averaging operation and \mu_i represents the
mean vector for the ith class: \mu_i = \langle f_i \rangle. W stands for within. Notice that each of

these covariance matrices describes the scatter within a class; hence it corresponds
to one term of the average in the denominator of (4.27). If we make the common
assumption that the vector f_i is normally distributed, then W_i is the covariance matrix
of the corresponding probability density function:

PDF_i(f) = (2\pi)^{-n/2}\, |W_i|^{-1/2} \exp \left[ -\frac{(f - \mu_i)^t W_i^{-1} (f - \mu_i)}{2} \right]
Then the denominator of the F ratio can be associated with the average of W_i
over all i; this is called the pooled within-class covariance matrix:

W = \langle W_i \rangle    (4.29)
Second, the variation within classes can be ignored and the covariance
between classes can be found, representing each class by its centroid. The feature
centroid for class i is \mu_i; hence the between-class covariance matrix is

B = \langle (\mu_i - \mu)(\mu_i - \mu)^t \rangle    (4.30)
where \mu is the mean of \mu_i over all classes. B stands for between. Here we ignore
the detailed distribution within each class and represent all the data for that class by
its mean. Hence B describes the scatter from class to class regardless of the scatter
within a class and in that sense corresponds to the numerator of (4.27).
Then the generalization we seek should involve a ratio in which the numerator
is based on B and the denominator on W, since we are looking for features with

small covariances within classes and large covariances between classes. Fukunaga
(1972) lists four such measures, two of which are
J_1 = \text{trace}(W^{-1} B)    (4.31)

and

J_4 = \frac{\text{trace}\, B}{\text{trace}\, W}    (4.32)
The motivation for these measures is clearer for J_4 since we know that the trace of a
covariance matrix provides a measure of the total variance of its associated
variables (Parsons, 1986). If the value of J_4 for a feature is relatively greater than
that for the other feature, then there is apparently more scatter between classes than
within classes for this feature, and this feature set is a better one than the other for
discrimination. J_4 tests this ratio directly. The motivation for J_1 is less obvious and
will have to await the presentation of the material below.
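A sketch of the sample estimates behind Equations (4.28)-(4.32); the dict layout is our assumption.

import numpy as np

def separability(groups):
    # groups: {class label: (n_i, p) array of feature vectors}
    means = {c: x.mean(axis=0) for c, x in groups.items()}
    W = np.mean([np.cov(x, rowvar=False) for x in groups.values()], axis=0)
    mu = np.mean(list(means.values()), axis=0)
    B = np.mean([np.outer(m - mu, m - mu) for m in means.values()], axis=0)
    J1 = float(np.trace(np.linalg.solve(W, B)))   # Eq. (4.31)
    J4 = float(np.trace(B) / np.trace(W))         # Eq. (4.32)
    return J1, J4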
4.6.2 Divergence and Probability of Error
The distance between two classes in feature space may also be evaluated by
divergence that is defined as the difference in the expected values of their
log-likelihood ratios (Kullback, 1959; Tou and Gonzales, 1974). This measure has
its roots in information theory (Kullback, 1959) and is a measure of the average
amount of information available for discriminating between class i and class k. It is
shown that for features with multivariate normal densities, the divergence is given
by
D_{ik} = 0.5\, \text{trace}[(W_i - W_k)(W_k^{-1} - W_i^{-1})]
       + 0.5\, \text{trace}[(W_i^{-1} + W_k^{-1})(\mu_i - \mu_k)(\mu_i - \mu_k)^t]    (4.33)

This can be related to more familiar material as follows. If the covariance
matrices of the two classes are equal, so that W_i and W_k can be replaced by an average
covariance matrix W, then the first term vanishes and the divergence reduces to

D_{ik} = \text{trace}[W^{-1}(\mu_i - \mu_k)(\mu_i - \mu_k)^t]
       = (\mu_i - \mu_k)^t W^{-1} (\mu_i - \mu_k)

The term (\mu_i - \mu_k)(\mu_i - \mu_k)^t is the between-class covariance matrix B; hence in
this case D_{ik} is the separability measure J_1 = \text{trace}(W^{-1}B).
Notice that D_{ik} or J_1 is the Mahalanobis distance. This distance is related to
the approximation of the expected probability of error (PE) by Lachenbruch (1968),
Achariyapaopan and Childers (1983), and Childers (1986). If p is the dimension of
the feature vector, n_1 and n_2 are the sample sizes for classes 1 and 2, and \Phi(z) is the
standard normal distribution function defined as

\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\, dt    (4.34)

then PE can be written as

PE = 0.5\, \Phi[-(\alpha - \beta)] + 0.5\, \Phi[-(\alpha + \beta)]    (4.35)
where

\alpha = C\, [\, J_1 + p\,(n_1 + n_2)/(n_1 n_2)\, ]^{1/2}    (4.36)

\beta = C\, \frac{p\,(n_2 - n_1)}{[\, n_1 n_2\, ( J_1 n_1 n_2 + p\,(n_1 + n_2))\, ]^{1/2}}    (4.37)

C = 0.5\, \left( \frac{(n_1 + n_2 - p - 2)(n_1 + n_2 - p - 5)}{(n_1 + n_2 - 3)(n_1 + n_2 - p - 3)} \right)    (4.38)
For fixed training sample sizes n_1 and n_2 and vector dimension p, PE decreases
as the Mahalanobis distance J_1 increases.
In the coarse analysis stage, the estimated J_1, J_4, and expected probabilities of
error of the acoustic features ARC, LPC, FFF, RC, and CC, which were derived
from male and female groups (i.e., classes) in three phoneme categories, were
studied. Training sample sizes were 27 (n_1) for the male group and 25 (n_2) for the
female group since median layer templates (one for each subject) were used to
constitute the training sample pools. The feature vector dimension p was equivalent
to the filter order selected. For each of the acoustic features ARC, LPC, FFF, RC,
and CC in each of the three phoneme categories, the estimated J_1, J_4, and expected
probability of error were computed as follows:
(1) Estimate the within-gender covariance matrix W_i for each gender
using Equation (4.28).
(2) Compute the pooled (averaged) within-gender covariance matrix
W using Equation (4.29).
(3) Estimate the between-gender covariance matrix B using Equation
(4.30).
(4) Obtain the values of J_4 and the Mahalanobis distance J_1 from
matrices W and B using Equations (4.32) and (4.31).

(5) Finally, calculate the value of PE from n_1, n_2, and p using
Equations (4.34) to (4.38).
The analytical results were also compared to the empirical ones obtained from
experiments using recognition schemes. Section 5.3 in the next Chapter presents a
detailed discussion.
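As a concrete illustration of this five-step procedure, the sketch below (Python with numpy and scipy; the function name, the (n_i − 1)-weighted pooling standing in for Equations (4.28)-(4.29), which are not reproduced in this section, and the reconstructed forms of Equations (4.35)-(4.38) above are our assumptions, not the original software) computes J_1, J_4, and PE from two pools of templates.

    import numpy as np
    from scipy.stats import norm

    def separability(male, female):
        """male: (n1, p) array of templates, female: (n2, p) array.
        Returns J_1, J_4, and the expected probability of error PE,
        following steps (1)-(5) above (a sketch, not the original code)."""
        n1, p = male.shape
        n2, _ = female.shape
        # Steps (1)-(2): within-gender covariances, pooled (weighting assumed)
        W = ((n1 - 1) * np.cov(male, rowvar=False)
             + (n2 - 1) * np.cov(female, rowvar=False)) / (n1 + n2 - 2)
        # Step (3): between-gender covariance from the gender mean difference
        d = male.mean(axis=0) - female.mean(axis=0)
        B = np.outer(d, d)
        # Step (4): J_4 = trace(B)/trace(W); J_1 = trace(W^-1 B) = d' W^-1 d
        J4 = np.trace(B) / np.trace(W)
        J1 = float(d @ np.linalg.solve(W, d))
        # Step (5): PE from Equations (4.34)-(4.38) as reconstructed above
        N = n1 + n2
        C = 0.5 * np.sqrt((N - p - 2) * (N - p - 5)
                          / ((N - 3) * (N - p - 3)))
        alpha = C * np.sqrt(J1 + p * N / (n1 * n2))
        beta = C * p * (n2 - n1) / np.sqrt(n1 * n2 * (J1 * n1 * n2 + p * N))
        PE = 0.5 * norm.cdf(-(alpha - beta)) + 0.5 * norm.cdf(-(alpha + beta))
        return J1, J4, PE

With n_1 = 27, n_2 = 25, and p equal to the filter order, this computation yields the kind of values tabulated in Section 5.3.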

CHAPTER 5
RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS
Our results showed that most of the LPC-derived feature parameters
performed well for gender recognition. Among them, the reflection coefficient
combined with the Euclidean distance measure was the best choice for sustained
vowels (100%). While the cepstral distortion measure worked extremely well for
unvoiced fricatives, the LPC log likelihood distortion measure, the reflection
coefficient combined with the Euclidean distance, and the cepstral distortion
measure were good alternatives for voiced fricatives. Using the Euclidean
distance measure achieved better results than using the Probability Density
Function. Furthermore, the averaging techniques were very important in designing
appropriate test and reference templates and a filter order of 12 to 16 was sufficient
for most designs.
5.1 Coarse Analysis Conditions
Before we discuss in detail the performance assessments based on coarse
analysis, we briefly summarize the experimental conditions as follows:
Database
52 normal subjects: 27 male and 25 female
Phoneme group used: ten sustained vowels
five unvoiced fricatives
four voiced fricatives

Analysis Conditions
Method: asynchronous autocorrelation LPC
Filter order: 8, 12, 16, 20
Analysis frame size: 256 points/frame
Frame overlap: None
Preemphasis Factor: 0.95
Analysis Window: Hamming
Data set for coefficient calculations: six frames total. Then
by averaging six sets of coefficients obtained from these six
frames, a template coefficient set was calculated for each
sustained utterance.
Acoustic Parameters
LPC coefficients (LPC)
Cepstrum Coefficients (CC)
Autocorrelation Coefficients (ARC)
Reflection Coefficients (RC)
Fundamental Frequency and Formant Information (FFF)
* FFF was obtained by the 12th-order closed-phase WRLS-VFF method
discussed in the next Chapter.
Distance Measures
Euclidean Distance (EUC)
LPC Log Likelihood Distance (LLD)
Cepstral Distortion (same as EUC)
Weighted Euclidean Distance (WEUC)
Probability Density Function (PDF)
Decision Rule
Nearest Neighbor

Recognition Schemes
Scheme 1
Scheme 2
Scheme 3
Scheme 4
Counting Error Procedures
Inclusive (Resubstitution)
Exclusive (Leave-One-Out)
Parameters Based on Fisher's Discriminant Ratio Criterion
J_4, J_1, and the expected probability of error
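Before turning to the performance assessments, the analysis chain summarized above can be made concrete with a short sketch (Python with numpy; the function names and the choice of averaging the raw coefficient vectors are illustrative assumptions rather than the original software): preemphasis, Hamming windowing, autocorrelation LPC by the Levinson-Durbin recursion (which yields the reflection coefficients as a by-product), the LPC cepstrum, and six-frame averaging into a template.

    import numpy as np

    def levinson(r, order):
        """Levinson-Durbin recursion on autocorrelations r[0..order].
        Returns the LPC polynomial a (a[0] = 1, A(z) = 1 + sum a_i z^-i)
        and the reflection coefficients k."""
        a = np.zeros(order + 1)
        a[0] = 1.0
        k = np.zeros(order)
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k[i - 1] = -acc / err
            a[1:i + 1] += k[i - 1] * np.concatenate((a[i - 1:0:-1], [1.0]))
            err *= 1.0 - k[i - 1] ** 2
        return a, k

    def lpc_cepstrum(a, ncep):
        """Cepstrum of the all-pole model 1/A(z) by the standard recursion
        (the gain term c_0 is ignored here); requires ncep <= order."""
        c = np.zeros(ncep)
        for n in range(1, ncep + 1):
            s = sum(m * c[m - 1] * a[n - m] for m in range(1, n))
            c[n - 1] = -a[n] - s / n
        return c

    def utterance_template(frames, order=12, pre=0.95):
        """Average the LPC, reflection, and cepstrum coefficients of six
        256-point frames into one template per sustained utterance."""
        feats = []
        for x in frames:
            x = np.append(x[0], x[1:] - pre * x[:-1])   # preemphasis, 0.95
            x = x * np.hamming(len(x))                  # Hamming window
            full = np.correlate(x, x, mode='full')
            r = full[len(x) - 1:len(x) + order]         # lags 0..order
            a, k = levinson(r, order)
            feats.append(np.concatenate((a[1:], k, lpc_cepstrum(a, order))))
        return np.mean(feats, axis=0)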
5.2 Performance Assessments
Since the WEUC is the simplified case of the PDF and the results produced by
the WEUC were very similar to those produced by the PDF in our experiments, only
the results obtained by the PDF will be discussed.
The complete results of the experiments are tabulated in Appendices A and B.
Appendix A presents the recognition rates for LPC log likelihood and cepstral
distortion measures with various phoneme categories, recognition schemes, and
filter orders. Inclusive procedures were only performed for the acoustic parameters
with a filter order of 16.
Appendix B presents the recognition rates for various acoustic parameters
(ARC, LPC, RC, CC) combined with EUC or PDF distance measures with different
phoneme categories and filter orders. Notice that only recognition Scheme 3 was
used for these experiments. Again, inclusive procedures were only performed for
the acoustic parameters with a filter order of 16. Since calculation of the cepstral

71
distortion measure was done by using the EUC, the results of the CC combined with
the EUC were directly extracted from Appendix A.
5.2.1 Comparative Study of Recognition Schemes
Tables 5.1 and 5.2 show the condensed results selected from Appendix A,
using the LPC log likelihood and cepstral distortion measures respectively.
Recognition rates for the four exclusive recognition schemes with various filter
orders are included. Figures 5.1 and 5.2 are graphic illustrations of Tables 5.1 and
5.2.
By observing curves of Figures 5.1 and 5.2, it can be immediately seen that
higher recognition rates were achieved using recognition Schemes 3 and 4 for all the
cases, including various filter orders combined with different phoneme categories.
Among them, by applying Schemes 3 and 4 to voiced fricatives, over 90%
recognition rates were accomplished for all filter orders and both distortion
measures. The highest correct recognition rate, 98.1%, was obtained for Scheme 4
by using the LPC log likelihood measure with a filter order of 8. The same rates
were obtained for Scheme 3 by using the LPC log likelihood measure with a filter
order of 20 and using the cepstral distortion measure with filter orders of 12 and 16.
The results indicated the following:
1. Choosing appropriate template forming and recognition schemes
was important in achieving high correct recognition rates.
Particularly, the use of averaging techniques was critical since the
highest recognition rates were obtained by using Schemes 3 and 4, in
both of which the test and reference template were formed by
averaging all the utterances from the same subjects or even the same
gender (Scheme 3). In contrast, Schemes 1 and 2, in which the test
template was formed from a single utterance, performed worse.

Table 5.1
Results from exclusive recognition schemes with various filter orders
and the LPC log likelihood distortion measure

                          CORRECT RATE %
                Order=8   Order=12  Order=16  Order=20
Sustained Vowels
  Scheme 1       63.1      69.6      74.2      74.2
  Scheme 2       65.2      71.5      76.2      76.5
  Scheme 3       75.0      86.5      86.5      84.6
  Scheme 4       75.0      80.8      86.5      88.5
Unvoiced Fricatives
  Scheme 1       59.2      64.2      67.7      65.0
  Scheme 2       61.5      63.9      64.2      64.2
  Scheme 3       67.3      75.0      75.0      78.9
  Scheme 4       76.9      75.0      73.1      69.3
Voiced Fricatives
  Scheme 1       74.5      72.1      73.1      72.6
  Scheme 2       77.4      80.3      81.7      80.3
  Scheme 3       90.4      94.2      96.2      98.1
  Scheme 4       98.1      96.2      96.2      94.3

Table 5.2
Results from exclusive recognition schemes with various filter orders
and the cepstral distortion measure

                          CORRECT RATE %
                Order=8   Order=12  Order=16  Order=20
Sustained Vowels
  Scheme 1       61.3      68.3      69.4      70.6
  Scheme 2       69.4      67.3      70.0      72.1
  Scheme 3       82.7      92.3      90.4      90.4
  Scheme 4       90.4      92.3      92.3      88.5
Unvoiced Fricatives
  Scheme 1       61.2      65.8      63.9      64.6
  Scheme 2       58.8      61.5      62.7      64.2
  Scheme 3       71.2      75.0      84.6      82.7
  Scheme 4       78.8      88.5      90.4      88.5
Voiced Fricatives
  Scheme 1       79.3      82.7      81.3      80.8
  Scheme 2       75.5      82.2      84.6      85.1
  Scheme 3       94.2      98.1      98.1      96.2
  Scheme 4       92.3      92.3      92.3      90.4

[Figure 5.1: correct recognition rate (%) plotted against filter order (8, 12, 16, 20) for Schemes 1 through 4.]
Figure 5.1 Results from exclusive recognition schemes with various
filter orders and the LPC log likelihood distortion measure.

[Figure 5.2: correct recognition rate (%) plotted against filter order (8, 12, 16, 20) for Schemes 1 through 4.]
Figure 5.2 Results from exclusive recognition schemes with various
filter orders and the cepstral distortion measure.

2. Averaging techniques seemed more crucial than clustering
techniques. Recognition theory states that choosing several
clustering centers for the same reference group should increase the
correct classification rate, because intragroup variations are taken
into account. In Schemes 1 and 4, multi-clustering reference centers
were formed. However, the theory functioned well in Scheme 4 but
inadequately in Scheme 1, although in Scheme 1, a number of
clustering centers were selected for the reference of the same
gender. Furthermore, Scheme 3 was a further simplified version of
Scheme 2 and Scheme 4. Instead of using each test vowel of the
subject as in Scheme 2, only a single test template was employed for
each test subject and a single reference template for each reference
gender. But the results were almost as good as those achieved by
using Scheme 4. In Scheme 2, averaging was performed over the
reference template but not over the test template. The correct
recognition rates were low. Thus, the results suggested the
importance of averaging techniques. To a great extent, averaging on
both test and reference templates eliminated the intrasubject
variation or diversity within different vowels or fricatives of a given
speaker, but on the other hand, emphasized features representing
this speaker's gender.
3. Since the averaging was applied to the acoustic parameters extracted
from different phonemes uttered by a speaker or speakers of the
same gender, and the phonemes were produced at different times,
the averaging is essentially a time-averaging technique.
Therefore, we would reasonably deduce that gender information
is time-invariant, phoneme independent, and speaker independent.

Because of this, averaging emphasized the speaker's gender
information and increased the between-to-within gender variation
ratio. In practice, we could achieve free-text gender recognition in
which gender identification would be determined before speech or
speaker recognition, thus reducing the speech or speaker
recognition search space by half.
The conclusion is consistent with the findings by Lass and Mertz
(1978) that temporal cues appeared not to play a role in speaker
gender identification. As we cited earlier, in their listening tests they
found that gender identification accuracy remained high when the
normal temporal features of speech were altered by means of
backward playing and time compression of the speech samples.
We have shown in the previous section that use of the long-term
average technique to form a feature vector was discovered to have
potential for free-text speaker recognition in the Pruzansky study
(1963). Speaker recognition error rates remained undegraded even
after averaging spectral amplitudes over all frames of speech data
into a single reference spectral amplitude vector for each talker.
Markel et al. (1977) also demonstrated that the between-to-within
speaker variation ratio was significantly increased by using
long-term averaged parameter sets (thus text-free). Here we found
that this rule also applied to gender recognition (see the sketch
following this list).
4. In terms of Scheme 3 versus Scheme 4, neither was obviously
superior. However, from a practical point of view, Scheme 3 would
be easier to realize since only two reference templates are needed.

78
5. In further experiments, different weighting factors could be applied
to different phoneme feature vectors according to the probabilities of
their appearance in real situations. In this way, time-averaging
would be better approximated.
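The template averaging at the heart of Scheme 3 can be summarized in a few lines (a minimal Python sketch with hypothetical names; the actual feature extraction and scheme bookkeeping were of course more involved):

    import numpy as np

    def scheme3_classify(test_utterances, male_templates, female_templates):
        """Scheme 3: the test template averages all of one subject's
        utterance feature vectors; each reference template averages the
        subject templates of one gender; nearest-neighbor decision with
        the Euclidean distance."""
        test = np.mean(test_utterances, axis=0)
        ref_male = np.mean(male_templates, axis=0)
        ref_female = np.mean(female_templates, axis=0)
        d_male = np.linalg.norm(test - ref_male)
        d_female = np.linalg.norm(test - ref_female)
        return 'male' if d_male < d_female else 'female'

In the exclusive experiments the test subject's own template is, of course, left out of the corresponding reference average.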
5.2.2 Comparative Study of Acoustic Features
5.2.2.1 LPC Parameter Versus Cepstrum Parameter
Although both LPC log likelihood and cepstral distortion measures were
effective tools in classifying male/female voices, the performance of the latter was
better than the former.
1. By comparing Figures 5.1 and 5.2, it is noted that except in the
category of voiced fricatives, in which the performances were
competitive (both measures were able to achieve recognition rates of
98.1%), cepstrum coefficient features proved to be more sensitive
than LPC coefficients for gender discrimination. By choosing
appropriate schemes and filter orders the recognition rates for the
cepstral distortion measure reached 92.3% for vowels (Scheme 3
with a filter order of 12 and Scheme 4 with filter orders of 12 and
16), 90.4% for unvoiced fricatives (Scheme 4 with a filter order of
16). By using the LPC log likelihood distortion measure, the
corresponding highest recognition rates were 88.5% for vowels
(Scheme 4 with a filter order of 20) and 78.9% for unvoiced
fricatives (Scheme 3 with a filter order of 20).
2. By comparing tables in Appendix A, it is noted that the cepstral
distortion measure operated more evenly between male and female
groups, showing that this feature has some normalization
characteristics. As seen in Table A.1, there existed large
differences between male and female recognition rates for the LPC
recognizer with a filter order of 16. The largest gaps came from
Scheme 1 of the LPC. The differences were about 19% for vowels,
24% for unvoiced fricatives, and 15% for voiced fricatives. On the
other hand, the cepstral distortion measure worked evenly with the
same filter order. Table A.7 shows that the largest gaps were 6.6%
for vowels, 1.8% for unvoiced fricatives, and 2.4% for voiced
fricatives. Similar situations held for the results shown in other
tables and with inclusive schemes.
5.2.2.2 Other Acoustic Parameters
Tables 5.3 and 5.4 demonstrate results from exclusive recognition Scheme 3
with various filter orders and other acoustic parameters, using EUC and PDF
distance measures respectively. Figures 5.3 and 5.4 are graphic illustrations of
Tables 5.3 and 5.4.
1. The overall performance using RC and cepstrum coefficients was
better than that achieved using ARC and LPC coefficients, when the
Euclidean distance measure was adopted. The following
observations were made:
o The RC functioned extremely well with sustained vowels.
The recognition rates remained 100% for filter orders of 12,
16, and 20, showing that RC features captured gender
information from vowels effectively. The results were also
stable and filter order independent, as long as the filter order
was above 8. Table 5.3 shows that a 98.1% recognition rate
was reached by using the FFF, which was obtained using the
closed-phase WRLS-VFF method with a filter order of 12.

Table 5.3
Results from the exclusive recognition Scheme 3 with various filter orders,
acoustic parameters, and the Euclidean distance measure

                          CORRECT RATE %
                Order=8   Order=12  Order=16  Order=20
Sustained Vowels
  ARC            78.8      78.8      78.8      82.7
  LPC            73.1      78.8      80.8      80.8
  FFF            N/A       98.1      N/A       N/A
  RC             88.5     100.0     100.0     100.0
  CC             82.7      92.3      90.4      90.4
Unvoiced Fricatives
  ARC            75.0      75.0      75.0      75.0
  LPC            80.8      69.2      71.2      71.2
  RC             80.8      80.8      80.8      80.8
  CC             71.2      75.0      84.6      82.7
Voiced Fricatives
  ARC            86.5      88.5      86.5      88.5
  LPC            92.3      92.3      92.3      90.4
  RC             94.2      96.2      96.2      96.2
  CC             94.2      98.1      98.1      96.2

Table 5.4
Results from the exclusive recognition Scheme 3 with various filter orders,
acoustic parameters, and the PDF (Probability Density Function) distance measure

                          CORRECT RATE %
                Order=8   Order=12  Order=16  Order=20
Sustained Vowels
  ARC            80.8      84.6      88.5      67.3
  LPC            84.6      98.1      92.3      80.8
  FFF            N/A       96.2      N/A       N/A
  RC             88.5      98.1      92.3      67.3
  CC             78.8      94.2      90.3      75.0
Unvoiced Fricatives
  ARC            69.2      65.4      57.7      N/A
  LPC            78.8      86.5      78.8      53.8
  RC             78.8      73.1      67.3      55.8
  CC             80.8      73.1      69.2      57.7
Voiced Fricatives
  ARC            88.5      86.5      82.7      59.6
  LPC            92.3      94.2      94.2      71.2
  RC             92.3      90.4      90.4      75.0
  CC             92.3      92.3      80.8      71.2

[Figure 5.3: correct recognition rate (%) plotted against filter order (8, 12, 16, 20) for the ARC, LPC, RC, and CC parameters.]
Figure 5.3 Results from the exclusive recognition Scheme 3 with various
filter orders, acoustic parameters, and the Euclidean distance measure.

[Figure 5.4: correct recognition rate (%) plotted against filter order (8, 12, 16, 20) for the ARC, LPC, RC, and CC parameters.]
Figure 5.4 Results from the exclusive recognition Scheme 3 with various
filter orders, acoustic parameters, and the PDF distance measure.

o The CC worked very well with unvoiced fricatives. The
highest recognition rate was 84.6% with a filter order of 16.
o Both RC and CC operated extremely well with voiced
fricatives, and the results remained stable across all the filter
orders that were tested. The results generated by the CC were
slightly better than those of the RC with filter orders of 12 and 16
(98.1% versus 96.2%).
2. When the PDF distance measure was adopted, using LPC
coefficients was a good choice. The highest recognition rates,
98.1%, 86.5%, and 94.2%, for the three phoneme categories all came
from LPC coefficients with a filter order of 12 (also, for voiced
fricatives, with a filter order of 16). However, it is noted that the
results from the PDF distance measure were highly affected by the
filter order.
5.2.3 Comparative Study Using Different Phonemes
The results of Figures 5.1, 5.2, 5.3, and 5.4 also indicated that either vowels or
unvoiced fricatives or voiced fricatives could be used to objectively classify a
speaker's gender. As we have seen before for the vowel category, reflection coefficients
worked extremely well. The recognition rates reached 100% with filter orders of 12,
16, and 20. A 98.1% recognition rate could also be accomplished by using LPC
coefficients and the FFF. Surprisingly, the cepstral distortion measure can be used
to discriminate a speaker's gender from unvoiced fricatives, and a 90.4% recognition
rate was obtained. For the voiced fricative category, a 98.1% recognition rate was
achieved by using LPC log likelihood and cepstral distortion measures. Therefore,
in terms of the most effective phoneme category for gender recognition, the first
preference would be vowels. The second one would be voiced fricatives and the last
one unvoiced fricatives.

As discussed in Chapter 4, the acoustic parameters used in coarse analysis
were derived from the LPC all-pole model that has a spectral matching
characteristic. It is known that LPC log likelihood and cepstral distortion measures
are directly related to the power spectral differences of the test and reference
signals. Thus, the results indicated that the spectral characteristics were major
factors in distinguishing the speakers gender. Also, our results suggested that there
did exist significant differences between spectral characteristics of unvoiced
fricatives for the two genders, indicating that the speakers gender could be
distinguished based only on speakers vocal tract characteristics since no vocal fold
information existed in unvoiced fricatives. Moreover, if some of the vocal fold
information was combined with vocal tract characteristics as in vowel and voiced
fricative cases, the gender distinguishing was improved, even though in both of the
above cases the fundamental frequency information was not included.
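This connection can be stated precisely. For minimum-phase all-pole models the cepstral coefficients are the Fourier series coefficients of the log power spectrum, so that, by Parseval's relation (a standard identity added here for completeness, not taken from the original),

    (1/2π) ∫_{−π}^{π} [log S(ω) − log S'(ω)]² dω = Σ_{n=−∞}^{∞} (c_n − c'_n)²
                                                 ≈ (c_0 − c'_0)² + 2 Σ_{n=1}^{p} (c_n − c'_n)²

The truncated cepstral (Euclidean) distance used here is therefore an approximate root-mean-square log-spectral distance.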
5.2.4 Comparative Study of Filter Order Variation
5.2.4.1 LPC Log Likelihood and Cepstral Distortion Measure Cases
1. By observing the resultant curves of Schemes 1 and 2 from Figures
5.1 and 5.2, a general trend is easily noted. The recognition rates
generally improved by using higher order filters.
2. However, this trend was not observed for Schemes 3 and 4.
Individual inspection had to be made for specific cases. The
recognition rates for Schemes 3 and 4 together with LPC log
likelihood and cepstral distortion measures were first considered.
Notice that there are a total of 12 rates for each of the filter orders.
o Comparing the recognition rates between filter orders of 8
and 12, 9 out of 12 rates increased, 2 of them decreased
(Scheme 4 for voiced and unvoiced fricatives with the LPC
log likelihood distance), and 1 of them tied (Scheme 4 for
unvoiced fricatives with the cepstral distortion measure).
This demonstrated that performance improved from filter
orders of 8 to 12.
o Comparing the recognition rates between filter orders of 12
and 16, out of 12 rates, 4 of them dropped. Two of them
increased (unvoiced fricatives Scheme 4 with the LPC
distance measure and sustained vowels Scheme 3 with the
cepstral distortion measure). The remaining 6 were equal.
This indicated that by using Scheme 3 or 4, there was not a
distinct difference between filter orders of 12 and 16.
o Comparing the recognition rates between filter orders of 16
and 20, out of 12 rates, 8 of them dropped. Three of them
increased and one was tied. Performance degraded from
filter orders of 16 to 20.
o If only the cepstral distortion measure was applied (Figure
5.2), the highest recognition rates appeared at filter orders of
12 and 16 for all three phoneme categories. However, if only
the LPC log likelihood distortion was used (Figure 5.1), the
highest recognition rates were reached with filter orders of 8
and 20. Therefore, there was no manifest trend of
performance difference for the LPC log likelihood distortion
measure across filter orders. Since the cepstral distortion
measure showed better results than the LPC log likelihood,
filter orders of 12 to 16 seemed to be the best options for the
overall design.

5.2.4.2 Euclidean Distance Versus Probability Density Function
1. By examining Figure 5.3, it is seen that using the Euclidean distance
measure increased the recognition rates slightly from filter orders of
8 to 12, with the exception of the LPC for unvoiced fricatives.
Recognition rates with a filter order of 12 were almost the same as
with 16 and 20. No specific trend was observed. Except for the ARC
applied to the vowel category, all other performances reached their
peaks with either filter order of 12 or 16. It can be concluded that by
using the EUC distance measure, the best choice of filter order
would be around the range from 12 to 16.
2. However, Figure 5.4 shows us a different case. By inspecting Figure
5.4, it is immediately concluded that by using the PDF, gender
recognition rates varied considerably with the filter order. The
overall trend for the vowel category is that recognition rates
increased from a filter order of 8, reached their peak with a filter order
of 12 and then decreased. One exception is that by using the RC, the
recognition rate reached its peak with a filter order of 16. All
acoustic parameters, except the LPC, for voiced and unvoiced
fricatives showed decreasing recognition rates from a filter order of
8 to an order of 20. By using LPC coefficients, performance showed
some improvement from a filter order of 8 to 12 and 16 and then
degraded. Finally, recognition accuracies severely declined from a
filter order of 16 to an order of 20 for all three phoneme categories
and all acoustic parameters. It can be concluded that using the PDF,
the best option for filter order would be 8 or 12.

5.2.5 Comparative Study of Distance Measures
It is generally believed that the use of the EUC distance measure is not as
effective as the use of the PDF, because no normalization of the dimensions is
involved in the definition of the EUC: the dimension with the largest values becomes
the most significant. In contrast, the PDF approach has such a normalization function
through the covariance matrix computation. The PDF approach gives unequal
weighting to each element of a vector; it may suppress the elements with large
values but emphasize the elements with small values, according to their importance
in reducing the intragroup variation.
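A minimal sketch of the two measures (Python; the exact covariance handling of the experiments is assumed rather than reproduced) makes the contrast plain:

    import numpy as np

    def euclidean_distance(x, ref):
        """EUC: every dimension is weighted equally."""
        return float(np.sum((x - ref) ** 2))

    def pdf_distance(x, mean, cov):
        """PDF-style distance: a Gaussian log-likelihood measure in which
        the covariance matrix normalizes each dimension (a sketch of the
        measure described here, not necessarily its exact original form)."""
        d = x - mean
        sign, logdet = np.linalg.slogdet(cov)
        return float(d @ np.linalg.solve(cov, d)) + logdet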
However, the PDF approach did not work well in our experiments. By
observing Tables 5.3 and 5.4 as well as Figures 5.3 and 5.4, it can be seen that the
EUC outperformed the PDF.
First, out of 48 corresponding pairs of EUC and PDF recognition rates from
Tables 5.3 and 5.4, 32 EUC recognition rates were higher than those of the PDF,
three were tied, and only 13 PDF recognition rates were higher than the EUC rates.
Second, as we have demonstrated in the sections above, performance using
PDF varied considerably with the filter order and there were severe performance
declines from the order of 16 to order of 20 for all three phoneme categories and all
acoustic parameters. On the other hand, the results of the EUC were relatively
consistent across all the filter orders that were tested, especially with filter orders of
12, 16, and 20.
Third, the two highest rates from Tables 5.3 and 5.4 for the three phoneme groups
were achieved using the EUC distance measure. The RC with the EUC yielded a 100%
recognition rate for sustained vowels with filter orders of 12, 16, and 20. The CC
with the EUC yielded a 98.1% recognition rate for voiced fricatives with filter orders
of 12 and 16. Even for unvoiced fricatives, the highest recognition rates for

unvoiced fricatives came from the CC with the EUC using Scheme 4 (Table 5.2).
They were 88.5% with a filter order of 12 and 90.4% with a filter order of 16.
Fourth, the EUC distance measure functioned more evenly on male and
female groups than did the PDF. By examination of all tables of exclusive schemes
in Appendix B, 43 male recognition rates were higher than those of females out of
a total of 48 PDF pairs; only 5 female recognition rates were higher than those of
males, and the largest difference between the gender groups was 68.6% for the PDF
(from the ARC for unvoiced fricatives with a filter order of 20). On the other hand,
in 29 out of 49 EUC pairs, male rates were higher than those of females, and the
largest gap between the gender groups was only 21% (from the LPC for vowels with
a filter order of 8).
A possible reason for this inferior PDF performance is the small ratio of
the available number of subjects per gender to the number of elements
(measurements) per feature vector. The assumption when using the PDF distance
measure to design a classifier is that the data are normally (Gaussian) distributed.
In this case, many factors are considered (e.g., the size of the training or design set
and the number of measurements (observations, samples) in the data record (or
vector)). Foley (1972) and Childers (1986) pointed out that if the ratio of the
available number of samples per class (in this study, number of subjects per gender)
to the number of samples per data record (in this study, number of elements per
feature vector) is small, then data classification for both design and test sets may be
unreliable. This ratio should be on the order of three or larger (Foley, 1972). In our
study, the ratios were 3.25 (26/8), 2.17 (26/12), 1.63 (26/16), and 1.3 (26/20) for
filter orders of 8, 12, 16, and 20 respectively. The value of 3.25 satisfied the
requirement but the others were too small. Therefore, with the exception of the
results with a filter order of 8, where the performances of the PDF and EUC were

comparable, the PDF approach did not function well. The smaller the ratio, the
worse the PDF performed.
5.2.6 Comparative Study Using Different Procedures
Performance differences between resubstitution (inclusive) and
Leave-One-Out (exclusive) procedures were also tested with a filter order of 16.
Tables A.9 and A.10 in Appendix A present the inclusive recognition results for LPC
log likelihood and cepstral distortion measures with various recognition schemes
respectively. Tables B.9 and B.10 in Appendix B show the recognition results from
inclusive recognition Scheme 3 with various acoustic parameters, using EUC and
PDF distance measures respectively.
The results presented in Appendix A indicate that the correct recognition rates
of exclusive recognition procedure (Tables A.3 and A.7) were not greatly degraded
compared to those obtained from the inclusive recognition procedure, especially for
the cepstral distortion measure. For the cepstral distortion measure with Scheme 3,
the rates decreased from 94.2% to 90.4% for vowels, from 86.5% to 84.6% for
unvoiced fricatives, and remained constant for voiced fricatives. The maximum
decrease of the rates was less than 4%. For the LPC log likelihood with Scheme 3,
the rates degraded from 92.3% to 86.5% for vowels, from 82.7% to 75% for
unvoiced fricatives, and from 100% to 96.2% for voiced fricatives. Here the maximum
rate decrease, 7.7%, was observed for unvoiced fricatives. In contrast, the results
from the partial database of 21 subjects, which we analyzed before we completed
our data collection of the entire database, showed a much larger decrease from
inclusive to exclusive procedures. Recognition rates dropped more than 14% for
unvoiced fricatives and more than 9% for voiced fricatives. This convinced us that
the larger the database, the smaller the performance differences between inclusive
and exclusive procedures.

One interesting observation from Tables B.9 and B.10 in Appendix B was that
when using the PDF distance measure with the inclusive procedure, the correct
recognition rates were extremely high for all types of phonemes and feature vectors,
except for the ARC with unvoiced fricatives (it was still 98.1%). In addition, the
LPC, RC and CC were all able to provide 100% correct gender recognition from
unvoiced fricatives! However, when using the PDF with exclusive procedure (Table
B.7), the correct recognition rate decreased significantly, with drops ranging from a
minimum of 5.8% to a maximum of 40.4% (for unvoiced fricatives, drops ranging
from a minimum of 21.2% to a maximum of 40.4%). On the other hand, the EUC
distance measure operated more evenly. From inclusive to exclusive procedures
(Table B.3), recognition rates dropped very little, ranging from a minimum of 0% to
a maximum of 3.9%. The rates for four feature parameters did not decrease at all.
Figures 5.5(a) and (b) are graphic illustrations of Tables B.3 and B.9. It can be seen
that there was only minor performance difference between inclusive and exclusive
procedures when the EUC distance measure was used. Our results also suggested
that the PDF excelled at capturing the information from an individual subject. As
long as the data of the subject itself were included in the reference data set, the PDF
was able to pick up such subject-specific information easily and then identify the
subject's gender accurately. Therefore, the correct recognition rates of the inclusive
procedure for the PDF were extremely high. However, the PDF recognition rates of
the exclusive procedure were much lower, indicating that the PDF was clearly
inferior at capturing gender information from the other subjects of the same gender.
On the other hand, the EUC distance measure was good at capturing gender
information from the other subjects without including the characteristics of the test
subject itself.
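The two counting procedures differ only in whether the test subject's own template is allowed into the reference pool; a minimal sketch of the exclusive loop (Python, with a caller-supplied classify function; the names are illustrative) is:

    def leave_one_out_rate(templates, genders, classify):
        """Exclusive (leave-one-out) procedure: each subject is classified
        with his or her own template removed from the reference pool.  The
        inclusive (resubstitution) procedure would skip the removal."""
        correct = 0
        for i in range(len(templates)):
            refs = [t for j, t in enumerate(templates) if j != i]
            labels = [g for j, g in enumerate(genders) if j != i]
            if classify(templates[i], refs, labels) == genders[i]:
                correct += 1
        return 100.0 * correct / len(templates)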

[Figure 5.5: bar charts of recognition rates for ARC (EUC), LPC (EUC), LPC (LLD), RC (EUC), and CC.]
Figure 5.5 Results of recognition Scheme 3 with the EUC and
a filter order of 16 for (a) the exclusive procedure and
(b) the inclusive procedure.

5.2.7 Variability of Female Voices
Our results also showed that the performance of various feature vectors
combined with various distance measures was generally better for male subjects
than for female subjects. This can be seen by investigating the recognition rate pairs
for male/female subjects with recognition Scheme 3 and the exclusive procedure
from all tables in Appendices A and B. There are 109 such pairs in total.
Statistics were obtained based on these data. Figure 5.6 illustrates the results.
It can easily be seen that the recognition rates for males showed a higher mean, a
much greater minimum, and a smaller standard deviation than those for females.
This suggested that female features appeared to have higher variations than male
features. It was also noticed that out of 109 male versus female pairs, only 25
recognition rates for females were higher than those for males. Three pairs were
equal (100% recognition rates were achieved by both male and female subjects); all
of them were obtained using reflection coefficients with sustained vowels.
To further confirm the above, the Wilcoxon signed-ranks test and the paired
samples t-test (Ott, 1984) were also performed and the results are presented in
Tables 5.5(a) and 5.5(b), respectively. Both tests indicated that there do exist
statistically significant differences between the male and female recognition rates.
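For reference, the two tests of Table 5.5 have direct modern equivalents; a sketch using scipy (the rate arrays themselves are not reproduced here) is:

    from scipy.stats import ttest_rel, wilcoxon

    def paired_tests(male_rates, female_rates):
        """Paired-samples t-test and Wilcoxon signed-ranks test on the 109
        male/female recognition rate pairs (a sketch; scipy's wilcoxon
        drops zero-differences by default, as was done in Table 5.5)."""
        t_stat, t_p = ttest_rel(male_rates, female_rates)
        w_stat, w_p = wilcoxon(male_rates, female_rates)
        return (t_stat, t_p), (w_stat, w_p)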
5.3 Comparative Study of Acoustic Parameters
Using Fisher's Discriminant Ratio Criterion
As reviewed in Section 4.6 of Chapter 4, J_4 directly calculates the ratio of the
variance between classes to the variance within classes, and J_1, the Mahalanobis
distance, measures the divergence between two classes in feature space. In addition,
the expected probability of misclassification (PE) can be computed from J_1 in a

Figure 5.6 Statistical results of 109 male and female
recognition rate pairs.

Table 5.5 Results of the Wilcoxon signed-ranks test
and the paired-samples t-test

Result of Wilcoxon Signed-Ranks Test
  Total data pairs (N):  109
  Zero-differences:      3
  Data pairs used:       106
  + Ranks:  Sum = 4313,  Count = 81
  - Ranks:  Sum = 1357,  Count = 25
  Wilcoxon T = 1357
  z = 4.658493
  One-sided significance level < .01
  Two-sided significance level < .01

Result of Paired-Samples t-Test
  Mean difference    =  5.968807
  Standard deviation = 13.42507
  Standard error     =  1.285889
  Degrees of freedom = 108
  t-statistic        =  4.641776
  One-sided significance level < .01
  Two-sided significance level < .01

closed form. J_4, J_1, and PE can all be used to describe or indicate, using slightly
different criteria, the separability of a given feature.
Table 5.6 shows the estimated values of J_4 and J_1, the expected probabilities of
error, and the experimental error (misclassification) rates for various acoustic
features derived in the coarse analysis. The training pools consisted of median layer
templates of 27 subjects for the male group and 25 for the female group. Each of the
features ARC, LPC, FFF, RC, and CC in three phoneme categories was investigated.
The vector dimension p was equal to the filter order of 12. The experimental error
rates were obtained by simply subtracting correct recognition rates from unity,
where the correct recognition rates are listed in Table 5.4, in which the recognition
Scheme 3 and the PDF distance measure were employed. Table 5.7 presents the
results for the same parameters as in Table 5.6 but with a filter order of 20.
1. By observing Table 5.6, it was noted that while the LPC of vowels
produced the highest value of J_1 (10.9), the FFF of vowels reached
the highest value of J_4 (3.35) and the second highest value of J_1
(10.4). In addition, the RC of vowels gave the second highest value of J_4
(0.63) and the third highest value of J_1 (9.29). These features had the lowest
expected probabilities of error, which were 0.09, 0.09, and 0.11
respectively. The experimental error rates using these features were
also the lowest (0.02, 0.04, and 0.02), indicating that J_4, J_1, and PE,
generally speaking, provided appropriate measures to predict the
performance of a feature for gender recognition. Of course, one
exception is that the LPC of vowels yielded a value of only 0.16 for
J_4, which was far smaller than those of the FFF and RC. Thus J_4 of
the LPC did not provide a good prediction.
2. While the highest values of J_4 (3.35) and J_1 (10.9) and the lowest
expected probability of error (0.09) appeared in the category of

Table 5.6
Estimated values of J_4 and J_1, expected probabilities of error, and
experimental error rates for various acoustic parameters
(filter order = 12)

                      J_4     J_1    EXPECTED      EXPERIMENTAL
                                     PROBABILITY   ERROR
                                     OF ERROR      RATE*
Sustained Vowels
  ARC                 0.10    4.34   0.21          0.15
  LPC                 0.16   10.9    0.09          0.02
  FFF                 3.35   10.4    0.09          0.04
  RC                  0.63    9.29   0.11          0.02
  CC                  0.27    5.11   0.19          0.06
Unvoiced Fricatives
  ARC                 0.14    2.81   0.27          0.35
  LPC                 0.23    4.46   0.20          0.13
  RC                  0.14    3.38   0.24          0.27
  CC                  0.16    3.46   0.24          0.27
Voiced Fricatives
  ARC                 0.37    5.92   0.17          0.13
  LPC                 0.36    8.17   0.12          0.06
  RC                  0.47    5.84   0.17          0.10
  CC                  0.20    6.04   0.16          0.08

*Obtained from the exclusive recognition Scheme 3
using the PDF distance measure.

Table 5.7
Estimated values of J_4 and J_1, expected probabilities of error, and
experimental error rates for various acoustic parameters
(filter order = 20)

                      J_4     J_1    EXPECTED      EXPERIMENTAL
                                     PROBABILITY   ERROR
                                     OF ERROR      RATE*
Sustained Vowels
  ARC                 0.16   11.0    0.12          0.33
  LPC                 0.14   14.7    0.08          0.19
  FFF                 N/A    N/A     N/A           N/A
  RC                  0.60   15.0    0.08          0.33
  CC                  0.25    8.54   0.16          0.25
Unvoiced Fricatives
  ARC                 0.11    5.02   0.23          N/A
  LPC                 0.28    7.33   0.18          0.46
  RC                  0.14    5.83   0.21          0.44
  CC                  0.17    5.24   0.22          0.42
Voiced Fricatives
  ARC                 0.34   10.0    0.13          0.40
  LPC                 0.36   15.7    0.08          0.29
  RC                  0.46   13.9    0.09          0.25
  CC                  0.19   14.1    0.09          0.29

*Obtained from the exclusive recognition Scheme 3,
using the PDF distance measure.

sustained vowels, the lowest values of J_4 (0.14) and J_1 (2.81) and the
highest expected probability of error (0.27) appeared in the category
of unvoiced fricatives. The values of J_4 and J_1 and the expected
probabilities of error of the features in the category of voiced
fricatives fell between those in the categories of vowels and unvoiced
fricatives. The experimental error rates showed the same
pattern across the three categories, so the expected performance of
the gender features matched the empirical performance of the
features in terms of the different phoneme categories. This conclusion
was also true for the results with a filter order of 20 (Table 5.7).
3. It was also observed from Table 5.6 that all acoustic features for
vowels demonstrated much lower experimental error rates than the
corresponding expected probabilities of error. All acoustic features
for voiced fricatives showed somewhat lower experimental error rates than
the corresponding expected probabilities of error. On the other
hand, most acoustic features for unvoiced fricatives possessed
higher experimental error rates than the corresponding expected
probabilities of error. For example, the differences between the
expected probabilities of error and the experimental error rates
for the CC of the three phoneme categories were
0.19 − 0.06 = 0.13, 0.16 − 0.08 = 0.08, and 0.24 − 0.27 = −0.03
respectively. Thus, the noisier the speech signals, the smaller
the differences between the expected probabilities of error and the
experimental error rates.
One explanation for these differences is that we used different
(though similar) models for computing the expected probabilities of
error and the experimental error rates. To compute an expected

probability of error, templates of all subjects for each gender were
used and a pooled or averaged within-gender covariance matrix was
formed to calculate J_1. On the other hand, to compute an
experimental error rate the exclusive (leave-one-out) recognition
scheme was used and separate within-gender covariance matrices
were formed for the male and female classes to calculate the PDF
distances. The model differences probably caused the differences
between the experimental error rates and the corresponding
expected probabilities of error; why the differences were smaller for
noisier speech signals, however, remains an open question.
4. It appeared more reliable to use J_1 than J_4 to predict the performance
of a gender classifier. It can be noted from Table 5.6 that the FFF
had an extremely large value of J_4 (3.35) for vowels compared to
the other features (the RC had 0.63 and the LPC had 0.16). However,
the FFF generated an experimental error rate of 0.04, which was
not even smaller than the LPC or RC experimental error rates (0.02);
the FFF did not achieve extremely high performance in practice.
On the other hand, the LPC, FFF, and RC yielded values of J_1 of
10.9, 10.4, and 9.29 respectively. There was little difference among
them, and the performance predicted by J_1 and the real performance
of the classifiers were essentially consistent. The other example
was the LPC feature. The LPC of vowels and voiced fricatives
produced values of J_4 of 0.16 and 0.36 respectively, indicating
that the LPC classifier using voiced fricatives should have had the higher
separability for gender recognition. However, the real performance of
the LPC classifier was the other way around: the experimental
error rates were 0.06 for voiced fricatives and 0.02 for vowels. On
the other hand, the LPC of vowels and voiced fricatives generated
values of J_1 of 10.9 and 8.17 respectively, predicting that the
LPC classifier using voiced fricatives had the lower separability for
gender recognition, which was true in the experiments.
5. By comparison of Tables 5.6 and 5.7, we see that with a filter
order of 20 the values of J_1 greatly increased over those with a filter
order of 12, while the values of J_4 changed only slightly (most of them
decreased slightly). As the values of J_1 increased, the expected
probabilities of error calculated from J_1 decreased. The contradictory
phenomenon, however, is that the experimental error rates
showed a considerable increase over those obtained with a filter order of 12.
This does not agree with the behavior predicted by J_4 or J_1, or by the expected
probabilities of error. Therefore, J_1 and J_4 were unreliable in
predicting the performance of a gender classifier in this case.
One cause of this problem is, again, probably due to the small
ratio of the available number of subjects per gender to the number of
elements per feature vector. As we discussed in the previous
section, if this ratio is small, data classification for both design and
test sets may be unreliable (Foley, 1972; Childers, 1986). This ratio
should be on the order of three or larger (Foley, 1972). In this study,
the ratios were 2.17 and 1.3 for filter orders of 12 and 20
respectively. While the value of 2.17 might still be considered
marginal for designing a classifier, the value of 1.3 was too small.
This may explain why J_1 and J_4 failed to predict the performance of a
gender classifier using a filter order of 20. Another cause for this
failure might be the peaking phenomenon (Hughes, 1968). If we use
too few features, the classifier performance suffers because there is
not sufficient data to properly classify the test samples; however, if
we use too many features, the added features may be unreliable or
noisy and thus decrease the performance of the classifier.
6. In summary, for the filter order of 12, the analytical inferences from
the values of J_1 and J_4 and the expected probabilities of error using
various acoustic features proved comparable to the empirical results
of the experiments with the PDF distance measure for gender
recognition. Furthermore, J_1, the Mahalanobis distance, appeared to
be more reliable than J_4 for predicting the performance of a gender
classifier.
5.4 Conclusions
Considering that only approximately 150 ms of the speech signal were used for
the experiments, the performance of automatic gender recognition by computer is very
encouraging. The conclusions of the above discussion can be summarized as follows:
1. Most of the LPC-derived feature parameters functioned well for
gender recognition. Among them, the reflection coefficients
combined with the EUC distance measure was most robust for
sustained vowels (100%) and the results were quite consistent and
filter order independent. While the cepstral distortion measure
worked extremely well for unvoiced fricatives (90.4%), the LPC log
likelihood distortion measure, the reflection coefficients combined
with the EUC distance measure, and the cepstral distortion measure
were the best options for voiced fricatives (98.1%) (Table 5.8).
Hence carefully selecting the acoustic feature vector combined with

Table 5.8
Most effective feature and distance measure combinations
with the exclusive recognition Scheme 3 (except as noted)

             CORRECT RATE % WITH DIFFERENT FILTER ORDERS
                Order=8   Order=12  Order=16  Order=20
Sustained Vowels
  LPC (PDF)       -        98.1       -         -
  RC (PDF)        -        98.1       -         -
  FFF (EUC)       -        98.1       -         -
  RC (EUC)        -       100.0     100.0     100.0
Unvoiced Fricatives
  LPC (PDF)       -        86.5       -         -
  CC (EUC)*       -        88.5      90.4      88.5
Voiced Fricatives
  LPC (LLD)       -         -        96.2      98.1
  LPC (LLD)*     98.1      96.2      96.2       -
  CC (EUC)        -        98.1      98.1      96.2
  RC (EUC)        -        96.2      96.2      96.2

* Obtained using recognition Scheme 4.

an appropriate distance measure was important for gender
recognition for a given type of phoneme.
2. Spectral characteristics were vital factors in distinguishing the
speaker's gender. Either vowels, unvoiced fricatives, or voiced
fricatives can be used to classify a subject's gender objectively.
The speaker's gender features could be captured based only on the
speaker's vocal tract characteristics, since no vocal fold information
was contained in unvoiced fricatives. Moreover, when some of the
vocal fold information was combined with vocal tract characteristics,
as in the vowel and voiced fricative cases, the gender discrimination
was improved.
3. Choosing appropriate template forming and recognition schemes
was crucial in order to achieve high recognition rates. Recognition
Schemes 3 and 4 were more sensitive for gender discrimination than
Schemes 1 and 2, indicating the importance of averaging techniques.
In addition, averaging techniques seemed more critical than
clustering techniques. To a great extent, averaging on both test and
reference templates eliminated the intrasubject variation within
different vowels or fricatives of a given subject and emphasized
features representing this subject's gender. The above discussion
implies that the gender information is time-invariant, phoneme
independent, and speaker independent.
4. The performance of the cepstral distortion measure was better than
that of the LPC log likelihood distortion measure. The cepstral
distortion measure acted more evenly between male and female
groups, indicating that this feature has some normalization
characteristics.

5. Filter orders of 12 to 16 were the most appropriate for the majority of
design options.
6. Using the EUC distance measure was more effective than using the
PDF. Recognition rates of the EUC were higher than those of the
PDF and were quite constant across all the filter orders. The EUC
distance measure operated more uniformly on male and female
groups than did the PDF. A possible reason for this inferior PDF
performance is the small ratio of the available number of
subjects per gender to the number of elements per feature vector.
7. Recognition rates of the leave-one-out or exclusive procedure were
only slightly degraded compared to those produced from the
resubstitution or inclusive procedure. The larger the database, the
less the performance differences between these two procedures.
8. The greater variation of female features was noted. The
performance of various feature vectors combined with distance
measures for male subjects were generally better than for female
subjects. The recognition rates for males showed a higher mean, a
much greater minimum, and a smaller standard deviation than those
for females. This indicates that female features have higher
variability than male features.
9. For the filter order of 12, the analytical inferences from the values of
J_1 and J_4 and the expected probabilities of error using various
acoustic features proved comparable to the empirical results of the
experiments with the PDF distance measure for gender recognition.
Furthermore, J_1, the Mahalanobis distance, appeared to be more
reliable for predicting the performance of a gender classifier than J_4.

CHAPTER 6
EXPERIMENTAL DESIGN BASED ON FINE ANALYSIS
6.1 Introduction
The purpose of fine analysis is to perform a detailed study of the vowel
characteristics responsible for distinguishing a speaker's gender. Perceptually,
vowel characteristics are considered to convey the most important gender features
and provide most of the vocal fold and tract properties (Carlson, 1981).
Male/female vowel characteristics are mainly distinguished by the fundamental
frequencies of the vowels and resonance (formant) information of the vowels.
Therefore, thorough evaluation of these features for gender recognition is necessary,
and previous research on vowel characteristics is far from complete.
First, the relative importance of the fundamental frequency (F0) versus vocal
tract resonance (VTR) characteristics for perceptual male or female voice quality is
still controversial. The belief that F0 is the strongest cue to gender is
substantiated by the evidence of previous research. There is a hypothesis that in
situations in which F0 is masked or deviant, the effect of VTR
characteristics upon gender judgements increases from a minimal level to a role
that is equal to, and sometimes greater than, that played by F0 (Carlson, 1981).
This hypothesis needs further verifying.
Second, the influence of bandwidth and amplitude of formants and overall
spectral shape on gender cues was not thoroughly considered and investigated.
Mostly, the contribution of vocal tract characteristics to gender perception was only

concerned with formant frequencies (Coleman, 1976). Bladon (1983) pointed out
that male vowels appeared to have narrower formant bandwidths and perhaps also a
less steeply sloping spectrum. These need to be further investigated on a larger
database.
Third, the acoustic features obtained by short-time spectrum analysis, which
was usually done by analog spectrographic techniques, are far from being error-free.
The accuracy of estimated F0 and formant frequencies is subject to reading errors,
unavoidable formant estimation errors due to the influences of the F0 and
source-tract interaction, and large instrument errors (e.g., drift).
In digital speech processing, although the problem of automatic formant
analysis of speech has received considerable attention and a variety of approaches
have been taken, the calculation of accurate formant features from the speech signal
is still considered an unsolved problem. The accuracy of formant tracking and
estimation from speech signals using the frame-based conventional LPC analysis
method as we discussed in coarse analysis is affected by factors such as
(1) the position of the analysis frame,
(2) the length of the analysis window, and
(3) the time-varying characteristics of the speech signal.
Sequential adaptive analysis methods offer an attractive alternate processing
strategy since they overcome some of the drawbacks of frame-based analysis.
In this chapter, the detailed background of closed phase WRLS-VFF method
and the experimental design for testing the relative importance of the formant
characteristics for gender recognition are presented.
6.2 Limitations of Conventional LPC
Conventional linear predictive coding (LPC) techniques attempt to model the
vocal tract and can provide a good estimate of the envelope of the speech spectrum.

However, the assumptions of the linear source-filter model, of a single impulse or
white excitation as the input to the model, and of an all-pole (no zeros) model for the
spectrum (sufficient only for vowels) remain only assumptions. Several factors must
be considered when estimating vocal tract resonance (formant) frequencies, bandwidths,
and amplitudes from the spectrum of the speech signal:
(1) the periodicity of the vocal fold excitation,
(2) the analysis frame during the glottal open phase interval (the effect
of source-tract interaction),
(3) the influence of the fundamental frequency of phonation when it
is near the first formant, and
(4) rapid formant variations that may occur in consonant-vowel
transitions or diphthongs.
6.2.1 Influence of Voice Periodicity
Figure 6.1 shows the simplified speech production model represented by a
time varying linear system with filter parameters slowly changing (lip radiation has
been already included in vocal tract filter). The impulse train generator produces a
sequence of unit impulses which are spaced by the fundamental period of N
samples. If we denote p(n) as the input impulse train, s(n) as the output speech
signal, and g(n), v(n), and r(n) as impulse responses of glottal, vocal tract, and lip
radiation filters respectively, then
s(n) = p(n)*g(n)*v(n)*r(n)   (6.1)

Since p(n) = Σ_m δ(n − mN), we have

[Figure 6.1: block diagram of the model: an impulse train generator driving the glottal filter g(n), the vocal tract filter v(n), and the lip radiation filter r(n) in cascade.]
Figure 6.1 A simplified speech production model.

s(n) = Σ_m h(n − mN)   (6.2)

where

h(n) = g(n)*v(n)*r(n)   (6.3)

Thus, in the frequency domain,

H(jω) = G(jω) V(jω) R(jω)   (6.4)

and

S(jω) = Σ_m H(jω) e^{−jωmN}   (6.5)
As we see from Equations (6.4) and (6.5), formant estimation from the speech
spectrum is influenced by the effects of
o the periodic vocal fold excitation,
o the glottal filter spectrum envelope,
o the vocal tract filter spectrum, and
o the lip radiation filter spectrum.
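A toy synthesis makes Equations (6.1) to (6.5) concrete (Python with numpy/scipy; the filter shapes, formant values, and pitch period are arbitrary illustrations, not measured data):

    import numpy as np
    from scipy.signal import lfilter

    fs, N, L = 10000, 100, 2000        # sampling rate, pitch period, length
    p = np.zeros(L)
    p[::N] = 1.0                       # impulse train p(n), period N samples
    g = np.hanning(60)                 # g(n): a smooth glottal pulse shape
    # v(n): all-pole vocal tract with two illustrative formants (Hz, Hz-bandwidth)
    a = np.array([1.0])
    for f, bw in [(500.0, 60.0), (1500.0, 90.0)]:
        r = np.exp(-np.pi * bw / fs)
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(2.0 * np.pi * f / fs), r * r])
    u = np.convolve(p, g)[:L]          # p(n) * g(n)
    s = lfilter([1.0, -1.0], a, u)     # apply v(n) and r(n) (first difference)

The spectrum of s consists of harmonics of the fundamental sampled under the envelope H(jω), which is exactly why the formant envelope cannot be read off the spectrum directly.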
There are two conditions under which the frame-based conventional LPC provides
a correct solution for periodic excitation:
1. the prediction error is minimized over an interval in which either
the excitation is absent or the signal is exactly predictable by linear
prediction, and
2. the impulse response of the all-pole filter dies to zero at least p (the
order of the LPC filter) samples before the start of the next period.
Usually, it is difficult to meet these conditions. For instance, a periodic
excitation of the all-pole filter introduces errors in the conventional linear prediction
analysis (Atal, 1974).

Problems such as these are significantly reduced when the pitch synchronous
analysis method is applied, in which the analysis window is restricted to one pitch
period in length. Problems are further reduced when sequential adaptive analysis
methods are used.
6.2.2 Source-Tract Interaction
The linear source-filter model of conventional LPC assumes that the source
and the vocal tract (VT) filter are separable and do not interact. However, this is not
strictly correct in that it is valid only if the source is properly defined. The
aerodynamic forces that are responsible for the self-oscillation of the vocal folds are
in turn affected by the supraglottal vocal tract. Thus, in theory, the VT shape can
affect vocal fold vibration. Extensive simulations by Krishnamurthy and Childers
(1986) using the Ishizaka-Flanagan (1972) two-mass model for the vocal folds
show, however, that the vocal fold vibration (and in turn the glottal area) is not
significantly affected by the VT. On the other hand, the glottal volume velocity is
affected. The interaction causes the volume-velocity signal to be skewed to the right
with respect to the glottal area. A ripple in the volume-velocity waveform may
appear, and is thought to be caused primarily by the VT first-formant frequency.
The simplest way to understand the ripple phenomenon is to consider the
single-formant model shown in Figure 6.2. This model has been used extensively to
study source-tract interaction effects (Ananthapadmanabha and Fant, 1982;
Rothenberg, 1981). The VT is represented to a first approximation by the RLC
resonant circuit for the first formant. Rg(t) and Lg(t) are the nonlinear and
time-varying glottal resistance and inductance, respectively, and are controlled by
the glottal area function Ag(t) as well as the current flowing through them, Ug(t). If
the impedance due to Rg(t) and Lg(t) is much larger than the VT input impedance
Zin for all t, then the glottal volume velocity Ug(t) will be essentially independent of

[Figure 6.2: single-formant interaction model: a vocal fold model, driven by subglottal pressure and controlled by the rest area and fold tension, produces the glottal volume velocity Ug(0,t) through the glottal opening area Ag(t), which feeds the vocal tract filter.]
Figure 6.2 A model to study the source-tract interaction effect.

Zin and there will be no source-tract interaction. This is true, however, only when
Ag(t) is very small or zero. During the open phase, the glottal impedance is
comparable to Zin, and effectively increases the damping of the first formant.
The pitch-synchronized closed-phase covariance (CPC) method can reduce the
effect of source-tract interaction, and thus the spectral estimation error due to the
interaction, since the analysis window is confined to the closed glottal region in each
pitch period. However, in certain situations the VT filter derived by the CPC
method is not guaranteed to be stable, and formant variations cannot be tracked
accurately (e.g., for the speech of women and children, for fast transitions between
certain vowels and consonants, and for short closed glottal intervals).
6.3 Closed Phase WRLS-VFF Analysis
Since accurate estimation of vocal tract resonance frequencies (formants) and
their bandwidths and amplitudes is essential before their significance for gender
recognition can be assessed, analysis methods with better performance must be
considered.
Sequential adaptive approaches that track the time-varying parameters of the
vocal tract and update the parameters during the glottal closed phase interval can
reduce the formant estimation error because they reduce the influences of pitch
periodicity and source-tract interaction and adjust rapidly to fast changes in the
speech signal. Results show (Ting, 1989) that the WRLS-VFF algorithm offers a
more accurate formant estimation than frame-based LPC analysis.
6.3.1 Algorithm Description
It is generally assumed that the speech signal is generated by an all-pole
model (sufficient for vowels) of order p represented by the following equation

s_k = Σ_{i=1}^{p} a_i(k) s_{k−i} + e_k   (6.6)

where s_k denotes the kth sample of the speech signal, a_i(k) are the time-varying coefficients,
and e_k represents the combined effect of model error and pulsed excitation.
Using vector notation, s_k can be expressed as

s_k = H_k' Φ_k + e_k   (6.7)

where

H_k' = [ s_{k−1}, ..., s_{k−p} ]
Φ_k = [ a_1(k), ..., a_p(k) ]'

The estimated speech signal ŝ_k at instant k can be written as

ŝ_k = H_k' Φ̂_k   (6.8)

where Φ̂_k is the vector of estimated filter coefficients.
WRLS algorithm. A least-squares criterion for the estimation error can be
defined as follows

V_k(Φ) = Σ_{i=1}^{k} λ^{k-i} (s_i - ŝ_i)²                                   (6.9)

where λ is a constant forgetting (weighting) factor (FF) which progressively
decreases the weight of the past estimation error.
The estimated coefficient vector Φ̂_k that minimizes the error criterion can be
obtained by the well-known WRLS equations (Morikawa and Fujisaki, 1982),
namely,

Φ̂_k = Φ̂_{k-1} + K_k e_k
K_k = P_{k-1} H_k (λ + H_k' P_{k-1} H_k)^{-1}                               (6.10)
P_k = λ^{-1} (P_{k-1} - K_k H_k' P_{k-1})
e_k = s_k - H_k' Φ̂_{k-1}

WRLS-VFF algorithm. When dealing with time-varying signals (e.g.,
speech), using a variable forgetting factor enables the parameter estimates to follow
sudden changes in the signal. For a locally stationary speech production model, the
a posteriori error at each time k indicates the state of the estimator. If the error
signal is small, then the FF should be close to unity; thus, the algorithm uses most of
the previous information in the signal. If, on the other hand, the error is large then a
small FF will decrease the error. This decrease in weighting of the error signal
shortens the effective memory length of the estimation process until the parameters
are readjusted and the error becomes small. A procedure to achieve the proper error
weighting by choosing the optimal FF is discussed here. The error information of
the filter can be defined as the weighted sum of the squares of a posteriori errors;
this can be expressed recursively as (Fortescue et al., 1981)
ΣV_k = λ_k ΣV_{k-1} + e_k² / (1 + H_k' P_{k-1} H_k)                         (6.11)

A strategy for choosing the forgetting factor λ_k may now be defined by
requiring ΣV_k to be such that

ΣV_k = ΣV_{k-1} = ... = ΣV_0                                                (6.12)
In other words, the forgetting factor will compensate at each step for the new
error information in the latest measurement, thereby insuring that the estimation is
always based on the same error information. Thus from (6.11)
λ_k = 1 - e_k² (1 + H_k' P_{k-1} H_k)^{-1} / ΣV_0                           (6.13)
Therefore the WRLS-VFF algorithm can be expressed with the same
equations used for the WRLS algorithm while the constant weighting factor λ is
replaced by λ_k as shown in Equation (6.13). The effective memory of the algorithm
can be defined as (Cowan and Grant, 1985)

M = 1 / (1 - λ_k)                                                           (6.14)
Furthermore, if λ_k becomes extremely small, then the memory also becomes
small. In practice, some applications require a certain memory size. In these cases,
we recommend that a minimal λ_k be defined as

λ_min = 1 - 1/N_a;  if λ_k < λ_min, then λ_k = λ_min                        (6.15)

where N_a is the number of the filter coefficients in Equation (6.6). On the other
hand, since ΣV_0 is related to the sum of the squares of the error, it can be
calculated from the prediction error over one or two analysis frames using LPC
analysis before executing the WRLS-VFF algorithm.
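
As an illustrative check of Equations (6.14) and (6.15): a forgetting factor of
λ_k = 0.98 corresponds to an effective memory of M = 1/(1 - 0.98) = 50 samples,
while for a 12th-order filter (N_a = 12) the bound of (6.15) is λ_min = 1 - 1/12,
approximately 0.917, so the effective memory is never allowed to fall below
N_a = 12 samples.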
The estimation error is usually large when there is a large glottal open phase in
the speech signal. In such cases, a small λ_k is used to decrease the contributions to
errors in the estimation process. Therefore, small values of λ_k are related to the
glottal closing point where the prediction error is maximum. Figure 6.3 shows that
the regions of small λ_k correspond to the negative peaks of the differentiated
electroglottograph (DEGG), which are reliable estimates of the closing instant of the
glottis. The closed phase WRLS-VFF algorithm for speech analysis extracts the
vocal tract parameters (formants) only from the glottal closed interval. This
algorithm can be implemented as follows (Figure 6.4):
(1) Initialize the values of P_0, Φ̂_0, λ_min, and ΣV_0, and give the filter
order. Experience shows that the algorithm is insensitive to the
values of P_0 and ΣV_0 as long as they are large enough (e.g.,
P_0 = 100, ΣV_0 = 1000000).
(2) Compute the filter gain K_k, error covariance P_k, and prediction
error e_k using Equation (6.10).
(3) Compute λ_k using Equation (6.13); if λ_k < λ_min, then set
λ_k = λ_min.
(4) Update the filter coefficient vector Φ̂_k.
(5) Check for the glottal closed phase using λ_k or the EGG signal; if
there is a closed glottal interval, then extract the formants and
bandwidths of the vocal tract by applying a peak-picking technique
on the spectrum, which is obtained by using an FFT on the
polynomial coefficients from Φ̂_k.

Figure 6.3 Speech signal (upper), the corresponding DEGG
signal (middle), and the variable forgetting factor λ_k (lower).

Figure 6.4 The algorithm flow of the WRLS-VFF.

(6) Go to (2) until end of data.
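
To make the recursions concrete, the following is a minimal Python sketch of one
WRLS-VFF update following Equations (6.10), (6.13), and (6.15), together with a
polynomial-rooting alternative to the FFT peak-picking of step (5). The function
and variable names are ours, and the sketch omits the EGG-based closed-phase
gating; it illustrates the equations rather than reproducing the original
implementation.

    import numpy as np

    def wrls_vff_step(s_k, H_k, phi, P, sigma_v0, lam_min):
        # One WRLS-VFF update (Equations (6.10), (6.13), and (6.15)).
        # s_k: current sample; H_k: previous p samples [s_{k-1}, ..., s_{k-p}];
        # phi: coefficient estimate; P: error covariance matrix;
        # sigma_v0: target error information (Sigma V_0); lam_min: 1 - 1/N_a.
        e_k = s_k - H_k @ phi                      # a priori prediction error
        denom = 1.0 + H_k @ P @ H_k                # 1 + H_k' P_{k-1} H_k
        lam = max(1.0 - e_k ** 2 / (denom * sigma_v0), lam_min)  # (6.13), (6.15)
        K = (P @ H_k) / (lam + H_k @ P @ H_k)      # adaptation gain, Eq. (6.10)
        phi = phi + K * e_k                        # coefficient update
        P = (P - np.outer(K, H_k) @ P) / lam       # covariance update
        return phi, P, lam, e_k

    def formants_from_coefficients(phi, fs):
        # Formant frequencies and bandwidths from the roots of the AR
        # polynomial A(z) = 1 - sum_i a_i z^(-i), an alternative to FFT
        # peak-picking on the smoothed model spectrum.
        r = np.roots(np.concatenate(([1.0], -phi)))
        r = r[np.imag(r) > 0]                      # one root per conjugate pair
        freqs = np.angle(r) * fs / (2.0 * np.pi)   # pole angle -> frequency (Hz)
        bws = -np.log(np.abs(r)) * fs / np.pi      # pole radius -> bandwidth (Hz)
        order = np.argsort(freqs)
        return freqs[order], bws[order]

In use, Φ̂ and P would be initialized as in step (1), the update applied sample by
sample, and the formant extraction invoked only when the EGG indicates a closed
glottal interval.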
In summary, this algorithm uses a variable forgetting factor, which can be
obtained recursively during the adaptation process, to control the adaptation gain
and to determine the effective memory length. Experimental results show that the
formant tracking ability and formant estimation accuracy of the WRLS-VFF
algorithm are superior to the other adaptive algorithms that were considered and to
the LPC based algorithm.
6.3.2 EGG Assisted Procedures
Electroglottography is essentially an electrical impedance measurement
technique used to monitor vocal fold activity (Childers, 1977; Childers and
Krishnamurthy, 1985). The EGG instrument measures the electrical impedance
variations of the larynx using a pair of plate electrodes held in contact with the skin on
both sides of the thyroid cartilage.
The amplitude of the EGG signal reflects the percentage change
in the tissue impedance. The EGG waveform represents the opening of the glottis as
an upward deflection and the closing of the glottis as a downward deflection. The
hypothesis is that the EGG signal is a function of the lateral area of contact between
the vocal folds.
An example of the synchronized acoustic speech, the EGG, the DEGG, and
glottal area (the opening between the vocal folds) measured from an ultra-high
speed film is given in Figure 6.5. Note that the instant of glottal closure is signaled
by a rapid decrease in the EGG and coincides with the minimum in the DEGG for
that period; also, the maximum in the DEGG in each period occurs very close to the
instant of glottal opening. The difference between the instant of glottal closing (as
determined from the glottal area curve) and the minimum in the DEGG for that
period is about two data sample points (at a 10 kHz sampling rate), plus or minus

Figure 6.5 Synchronized waveforms (from top down):
speech, EGG, DEGG, and glottal area.

one sample point (Krishnamurthy, 1983). In general, the instant of glottal opening as
measured by the EGG is more variable than the measurement of the instant of
glottal closure (Childers et al., 1983; Krishnamurthy, 1983). This variability is
certainly tolerable for isolating closed (or open) phase glottal segments since the
speech data selected for analysis can be shortened at both ends by two or three
samples to assure the segment is truly representative of a closed (or open) glottal
interval.
The fact that the EGG is a reliable source of glottal vibratory information
makes it quite useful for the present study. The EGG signal was used to check for
the glottal closed phase and, when it was found, the WRLS-VFF algorithm updated
the filter coefficients at this optimal position.
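
As a minimal illustration of this EGG-assisted procedure, the sketch below locates
glottal closure instants as the negative peaks of the DEGG and builds a
closed-phase sample mask shortened by a small guard, as suggested above. The
function names and the peak threshold are our own choices, not the study's exact
implementation.

    import numpy as np
    from scipy.signal import find_peaks

    def closure_instants(egg, fs, max_f0=500.0):
        # Glottal closure instants: sharp negative peaks of the DEGG,
        # at most one per pitch period at the highest plausible F0.
        degg = np.diff(egg)
        min_spacing = int(fs / max_f0)
        peaks, _ = find_peaks(-degg, distance=min_spacing,
                              height=0.3 * np.max(-degg))
        return peaks

    def closed_phase_mask(gci, opening, n, guard=3):
        # Mark samples between each closure instant and the next opening
        # instant, trimmed by two or three samples at both ends so that
        # the retained segment is truly within the closed glottal interval.
        mask = np.zeros(n, dtype=bool)
        for c in gci:
            later = opening[opening > c]
            if later.size:
                mask[c + guard : later[0] - guard] = True
        return mask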
6.4 Testing Methods
In order to select the best vowel features most responsible for distinguishing
gender, each individual or group of features was tested for their relative importance
after the features (i.e., fundamental frequencies and formant features including
frequencies, bandwidths, and amplitudes) were available for both genders. There
are two possible approaches:
o Statistical tests. Since we are interested in making the descriptions
of gender features more precise and in determining whether there
are any significant differences between the two gender group features
and how significant these differences are, statistical methods can
be applied to assist in evaluating and validating the reliability of
observed differences, and in determining the degree of confidence
we may place in certain generalizations about the observations.
The experiments in this study are referred to as two factor
experiments having repeated measures on the same subject

(Winer, 1971). Therefore, a two-way ANOVA was used to
perform statistical tests. The difference between each individual
feature or grouped features in terms of male/female groups was
then compared.
o Automatic pattern recognition. Basically, this approach is similar to
the pattern recognition approach of the coarse analysis except that
the individual or grouped vowel feature(s) were used to form the
reference and test templates. Examples are using only the
fundamental frequency, or only the formant frequencies or
bandwidths (but across all formants) to form the templates. The
automatic recognition scheme 3 was then applied on these
templates. The EUC or PDF distance measures were utilized.
Finally, the recognition error rates for different features were then
compared.
6.4.1 Two-way ANOVA Statistical Testing
The analysis of variance, as the name indicates, deals with variances rather
than with standard deviations and standard errors. The rationale of the analysis of
variance is that the total sum of squares of a set of measurements composed of
several groups can be analyzed or broken down into specific parts, each part
identifiable with a given source of variation. This is called the partition of the total
variation. In the simplest case, the total sum of squares is analyzed in two parts: a
sum of squares based upon variation within the several groups, and a sum of squares
based upon the variation between the group means. Then, from these two sums of
squares, independent estimates of the population variance are computed.

A variance, in the terminology of analysis of variance, is more frequently
called a mean square (MS). By definition

MS = variation (sum of squares) / degrees of freedom = SS / df              (6.16)
In other words, a mean square is the average variation per degree of freedom; this is
also a basic definition for variance.
The term degree of freedom (df) originates from the geometric representation
of problems associated with the determination of sampling distribution for statistics.
In this context, the term refers to the dimension of the geometric space appropriate
in the solution of the problem. More accurately
df = (# of independent observations) - (# of linear restraints)             (6.17)
On the assumption that the groups or samples making up a total series of
measures are random samples from a common normal population, the two
estimates of the population variance may be expected to differ only within the limits
of random sampling. This null hypothesis may be tested by dividing the larger
variance by the smaller one to get the variance ratio F. If the value of F equals or
exceeds a certain value (usually tabled), then the null hypothesis that the samples
have been drawn from the same common normal population is considered invalid.
Therefore, the populations from which the samples have been drawn may differ in
terms of either means or variances or both. If the variances are approximately the
same, it is the means that differ. This, basically, is the analysis of variance in its
simplest form (Edwards, 1964).
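
As a brief aside on how such a tabled comparison can be reproduced numerically,
the right-tail probability of an F ratio is available in standard libraries; the
snippet below (our illustration, not part of the original analysis, which used
tabled critical values) evaluates the gender main effect for fundamental frequency
reported later in Table 7.4.

    from scipy.stats import f

    # F ratio with df1 = 1 (two gender levels) and df2 = 50
    # (52 subjects minus 2 groups); sf() is the right-tail probability.
    F_ratio, df1, df2 = 226.3, 1, 50
    p_value = f.sf(F_ratio, df1, df2)
    print(f"F({df1},{df2}) = {F_ratio}: p = {p_value:.2g}")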

Factorial experiments in which the same experimental unit (generally a
subject) is observed under more than one condition require special attention.
Experiments of this kind are referred to as those in which there are repeated
measures. A two-factor experiment in which there are repeated measures on factor
B (i.e., each experimental unit is observed under all levels of factor B) may be
represented schematically as follows:
             b1    b2    b3    ...   bm
      a1     G1    G1    G1    ...   G1
      a2     G2    G2    G2    ...   G2

The symbol G1 represents a group of n subjects. The symbol G2 represents a second
group of n subjects. The subjects in G1 are observed under treatment combinations
a1b1, a1b2, a1b3, ..., and a1bm. Thus the subjects in G1 are observed under all levels
of factor B in the experiment, but only under one level of factor A. The subjects in
G2 are observed under treatment combinations a2b1, a2b2, a2b3, ..., and a2bm. Thus
each subject in G2 is observed under all levels of factor B in the experiment, but only
under one level of factor A, namely, a2.
In this experiment, the subjects may be considered to define a third factor
having n levels. As such, the subject factor is said to be crossed with factor B but
nested under factor A (Winer, 1971). Schematically,

      A
        o
      Subjects
        *
      B

where
      * indicates a crossing relationship,
      o indicates a nesting relationship.
In this kind of experiment, comparisons between different levels of factor A
involve differences between groups as well as differences associated with factor A.
On the other hand, comparisons between different levels of factor B at the same
level of A do not involve differences between groups. Since measurements included
in the latter comparisons are based upon the same elements, main effects associated
with such elements tend to cancel. For the latter comparisons, each element serves
as its own control with respect to such main effects.
Table 6.1 shows the partition of the total variation for this type of experiment.
Appropriate denominators for F ratios to be used in making statistical tests are
indicated by the expected values of the mean squares. To test the hypothesis that
there is no significant variation between different levels of Factor A, the appropriate
F ratio is

F = MS_A / MS_subj w.groups                                                 (6.18)
The mean square in the denominator of the above F ratio is sometimes designated
MS_error(between). To test the hypothesis that there is no significant variation between
different levels of Factor B, the appropriate F ratio is

Table 6.1 Summary of Analysis of Variance

Source of Variation                          Corresponding MS
Between subjects
  * Factor A                                 MS_A
  * Subjects within groups                   MS_subj w.groups
Within subjects
  * Factor B                                 MS_B
  * A x B effect                             MS_AB
  * B x subjects within groups               MS_B x subj w.groups

F = MS_B / MS_B x subj w.groups                                             (6.19)
To test the hypothesis that there is no significant interaction effect of Factors A
and B, the appropriate F ratio is

F = MS_AB / MS_B x subj w.groups                                            (6.20)
The mean square in the denominator of the last two F ratios is sometimes called
MS_error(within) since it forms the denominator of F ratios used in testing effects which
can be classified as part of the within-subject variation. Detailed computational
procedures for partitioning the relevant sums of squares with equal or unequal group
sizes are given in Winer's book (1971).
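
The partition of Table 6.1 and the F ratios of Equations (6.18) through (6.20) can
be computed directly. The following Python sketch handles the balanced case
(equal n per group); the unequal-n adjustments actually used in this study follow
Winer (1971). The function name and array layout are our own conventions.

    import numpy as np

    def mixed_anova(groups):
        # Two-way ANOVA with repeated measures on factor B.
        # groups: list of arrays, one per level of the between factor A,
        # each of shape (n_subjects, m_levels_of_B); balanced design.
        g = len(groups)
        n, m = groups[0].shape
        X = np.stack(groups)                      # shape (g, n, m)
        gm = X.mean()
        subj_means = X.mean(axis=2)               # per-subject means (g, n)
        a_means = X.mean(axis=(1, 2))             # factor A means (g,)
        b_means = X.mean(axis=(0, 1))             # factor B means (m,)
        ab_means = X.mean(axis=1)                 # cell means (g, m)

        ss_between = m * ((subj_means - gm) ** 2).sum()
        ss_a = n * m * ((a_means - gm) ** 2).sum()
        ss_subj_wg = ss_between - ss_a            # subjects within groups

        ss_total = ((X - gm) ** 2).sum()
        ss_within = ss_total - ss_between
        ss_b = g * n * ((b_means - gm) ** 2).sum()
        ss_ab = n * ((ab_means - a_means[:, None]
                      - b_means[None, :] + gm) ** 2).sum()
        ss_b_subj = ss_within - ss_b - ss_ab      # B x subjects within groups

        ms_a = ss_a / (g - 1)
        ms_swg = ss_subj_wg / (g * (n - 1))
        ms_b = ss_b / (m - 1)
        ms_ab = ss_ab / ((g - 1) * (m - 1))
        ms_bsw = ss_b_subj / (g * (n - 1) * (m - 1))
        # F ratios of Equations (6.18), (6.19), and (6.20)
        return ms_a / ms_swg, ms_b / ms_bsw, ms_ab / ms_bsw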
Formant characteristics such as frequencies, bandwidths, and amplitudes
depend on (or are influenced by) two factors. These are gender, which may be
denoted as Factor A, and vowels, which may be denoted as Factor B. Each
experimental subject is observed under more than one vowel, and thus our
experiments should be referred to as two factor experiments having repeated
measures on the same subjects. Hence, the type of two-way ANOVA discussed above
with unequal group size of 27 versus 25, two levels (two gender) in Factor A, and ten
levels (ten vowels) in Factor B was used to perform the statistical test.
6.4.2 Automatic Recognition by Using Grouped Features
As we mentioned above, this approach is similar to the pattern recognition
model of the coarse analysis. In the coarse analysis, LPC, autocorrelation,
cepstrum, and reflection coefficients were extracted from the speech signal. Averaging

was then used to form the reference and test templates. Four recognition schemes
and various distance measures were involved. The decision rule was the Nearest
Neighbor rule and the exclusive procedure was applied on the database when the
performance assessment was made. In addition, the order of the conventional LPC
filter was varied.
The outline of the approach in this section followed the above procedure
except the individual or grouped vowel feature(s) such as only the fundamental
frequency, or only the formant frequencies or bandwidths (but across all formants)
were used to form the reference and test templates. In order to determine in detail
the relative importance of various features and their combinations, the following
vowel feature sets formed the reference and test templates:
(1) only fundamental frequency,
(2) individual features of formants (i.e., every individual frequency,
bandwidth, and amplitude for all formants),
(3) individual formant, which consists of frequency, bandwidth, and
amplitude for a given formant,
(4) grouped frequencies, or bandwidths, or amplitudes from four
formants,
(5) entire formant information, and
(6) all available formant information and pitch information together.
Instead of four recognition schemes, only Scheme 3 was applied on the
recognition test. EUC and PDF distance measures were utilized. The decision rule
was still the Nearest Neighbor rule and the exclusive procedure was still applied on
the database when performance assessment was made. However, variation of the
filter order for WRLS-VFF analysis was not performed. Formant information was
obtained from the filter coefficients with the filter order kept at 12.
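
As a schematic of this template approach (our sketch; the detailed Scheme 3
procedure is the one described in Chapter 4), within-subject averaged feature
vectors are pooled into one reference template per gender, and a test vector is
assigned by the Nearest Neighbor rule under the EUC measure; a Gaussian score is
shown as one plausible reading of the PDF measure.

    import numpy as np

    def reference_templates(male_feats, female_feats):
        # Within-gender averaging of subject feature vectors (one row
        # per subject) into a single reference template per gender.
        return male_feats.mean(axis=0), female_feats.mean(axis=0)

    def classify_euc(x, ref_male, ref_female):
        # Nearest Neighbor decision under the Euclidean (EUC) distance.
        d_m = np.linalg.norm(x - ref_male)
        d_f = np.linalg.norm(x - ref_female)
        return "male" if d_m < d_f else "female"

    def classify_pdf(x, male_feats, female_feats):
        # Gaussian (PDF-style) decision: per-gender Mahalanobis distance
        # plus a log-determinant term; the smaller score wins.
        def score(feats):
            mu = feats.mean(axis=0)
            cov = np.cov(feats, rowvar=False)
            d = x - mu
            return d @ np.linalg.solve(cov, d) + np.linalg.slogdet(cov)[1]
        return "male" if score(male_feats) < score(female_feats) else "female"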

CHAPTER 7
EVALUATION OF VOWEL CHARACTERISTICS
As discussed in Chapter 6, sequential adaptive approaches that track the
time-varying parameters of the vocal tract and update the parameters during the
glottal closed phase interval can reduce the formant estimation error. Formant
information in the fine analysis was obtained by a closed-phase WRLS-VFF
method, which is one of these approaches. The database consisted of speech and
EGG data collected from 52 normal subjects (27 males and 25 females), for each of
whom ten sustained vowels were included.
7.1 Vowel Characteristics of Gender
7.1.1 Fundamental Frequency and Formant Features for Each Gender
The estimated average fundamental frequencies for ten sustained vowels of all
subjects were calculated based upon a modified cepstral algorithm. Results are
shown in Table 7.1. Data are represented as means ± standard error (SE). Formant
information on the same database was obtained by first using the closed-phase
WRLS-VFF method to track the parameters of a time-varying all-pole model of the
vocal tract. Then, a smooth spectrum was acquired by using an FFT on the 12
coefficients computed by WRLS-VFF. A peak-picking technique was finally
applied to obtain the formant positions, bandwidths, and amplitudes. The formant
amplitudes are all referred to the zero dB line of the spectrum with the gain factors
all normalized to 1. Formant information is shown in Table 7.2. Data are also
represented as means ± SE.
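
The modified cepstral algorithm itself is not detailed here; the following is a
basic cepstral pitch estimator for one frame of a sustained vowel, assuming the
10 kHz sampling rate used in this study (the function name and search range are
our own choices).

    import numpy as np

    def cepstral_f0(frame, fs=10000, f0_lo=60.0, f0_hi=400.0):
        # Basic cepstral estimate: F0 appears as a peak in the real
        # cepstrum at quefrency 1/F0 (in samples).
        windowed = frame * np.hamming(len(frame))
        log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
        cepstrum = np.fft.irfft(log_mag)
        q_lo, q_hi = int(fs / f0_hi), int(fs / f0_lo)
        q_peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])
        return fs / q_peak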
Table 7.1 Fundamental frequencies of ten sustained vowels (Hz), mean ± SE

      IY          I           E           AE          A           OW          U           OO          UH          ER          Total
M  131.8±4.6   130.2±4.2   124.3±3.7   122.7±3.9   120.4±3.8   119.6±3.5   125.5±3.9   129.7±4.3   120.4±3.8   121.5±3.6   124.6±3.95
F  233.1±5.3   227.5±5.9   219.1±5.7   215.5±6.3   213.8±5.1   216.3±5.4   220.2±5.6   222.3±5.2   215.3±5.1   217.3±5.1   220.0±5.48

where M = male, F = female

Table 7.2 Formant characteristics of ten vowels for male/female (mean ± SE)

         IY            I             E             AE            A             OW            U             OO            UH            ER
F1  M  302.9±8.2     438.5±7.8     541.6±8.0     645.0±11.9    673.0±11.2    614.6±9.1     486.7±9.7     341.5±6.3     590.9±10.5    477.4±9.4
    F  378.5±7.1     512.6±11.3    661.0±13.3    841.7±18.7    837.8±19.8    745.4±19.5    522.0±11.3    409.5±7.9     723.6±12.1    558.1±9.7
B1  M  133.8±3.8     135.8±4.1     132.6±3.2     144.7±4.8     154.2±7.0     147.1±5.6     141.2±11.0    134.3±4.3     138.4±3.9     133.1±2.9
    F  144.1±4.4     150.5±5.9     177.0±13.8    221.4±19.2    272.1±21.2    249.1±21.2    163.0±5.8     132.0±3.1     221.4±17.5    175.6±9.4
A1  M  15.4±.76      15.7±.78      16.5±.53      16.0±.60      21.6±.61      22.4±.61      22.1±.79      26.1±1.03     21.1±.71      24.6±.85
    F  18.7±.78      17.5±.95      16.1±.68      14.8±.86      19.5±.71      20.3±.70      21.2±.74      28.5±.81      18.1±.74      22.5±.73
F2  M  2172.0±22.3   1837.0±21.3   1690.2±23.1   1621.9±21.2   1097.9±12.0   990.4±18.6    1168.1±24.6   1067.1±35.9   1194.2±20.2   1276.0±18.6
    F  2588.2±48.7   2196.8±42.6   2013.2±29.3   1932.7±25.7   1245.5±24.6   1190.2±33.9   1386.2±24.6   1361.2±49.1   1445.3±18.4   1503.7±16.1
B2  M  156.5±6.9     142.6±4.2     144.9±5.9     155.8±5.0     154.0±6.5     152.8±5.8     138.1±2.4     143.5±3.6     145.1±3.7     147.6±4.0
    F  198.6±15.2    199.4±11.9    188.4±10.4    182.9±9.0     226.9±10.0    209.1±23.7    196.5±17.7    191.1±13.1    233.7±20.8    166.3±8.0
A2  M  15.6±.77      17.8±.73      17.8±.64      17.9±.48      21.4±.78      21.1±.57      19.1±.54      19.0±.52      19.0±.58      24.0±.83
    F  10.9±1.24     12.3±.80      13.8±.88      15.3±.71      18.7±.70      20.6±1.06     16.5±.82      15.4±.84      15.5±.71      23.2±.72
F3  M  2851.3±36.9   2482.4±33.9   2456.1±31.3   2357.3±26.5   2457.4±36.8   2465.4±33.9   2307.2±25.8   2219.0±20.3   2401.1±36.2   1707.4±34.4
    F  3286.1±36.0   2995.9±35.3   2955.7±38.0   2981.6±50.9   2945.1±63.1   2853.0±53.6   2791.5±3.0    2729.8±7.6    2862.7±47.0   2024.1±59.0
B3  M  281.3±16.4    228.1±14.3    245.1±21.2    253.6±20.6    241.9±26.0    199.8±14.1    208.5±15.1    191.0±19.4    239.3±24.3    145.9±3.4
    F  218.8±8.6     242.3±17.4    274.5±29.3    281.5±21.1    237.3±8.1     226.7±19.2    240.5±20.5    273.9±26.2    280.7±25.2    181.2±9.9
A3  M  12.8±.68      11.9±.48      10.1±.66      10.7±.56      7.2±.86       8.0±.84       9.0±.73       9.2±.96       7.7±.82       21.5±.91
    F  11.7±.54      7.9±.55       5.9±.58       4.7±.68       .1±.77        4.8±.73       4.2±.70       2.5±.76       3.5±.64       17.4±1.08
F4  M  3572.7±49.9   3533.8±35.3   3511.4±38.2   3463.8±43.6   3463.6±40.9   3408.2±38.5   3359.4±40.1   3342.2±51.8   3423.4±41.4   3201.3±45.3
    F  4127.1±69.1   4265.6±4.7    4219.7±55.1   4146.7±52.6   3957.0±45.4   3922.5±55.3   3976.9±56.8   3976.5±55.1   4052.9±81.7   3888.4±83.3
B4  M  226.8±35.9    191.5±14.2    224.0±17.8    237.5±16.3    238.2±21.3    170.9±7.8     212.0±15.5    188.3±14.5    204.0±12.0    176.7±9.5
    F  282.2±24.0    287.5±25.6    347.3±40.0    343.3±24.6    210.8±12.3    237.2±26.7    253.4±23.7    232.2±15.7    284.9±25.9    253.0±28.2
A4  M  15.0±.88      9.9±.80       7.8±1.01      4.9±.91       6.7±1.07      9.7±.87       6.8±.79       9.0±1.06      8.8±.91       5.8±.78
    F  9.2±.92       2.7±.68       0.4±.75       -1.2±.66      3.9±.72       4.3±.91       2.3±.74       3.0±.81       2.2±.96      0.2±.83

where
  F1, F2, F3, F4 -- first through fourth formant frequency (Hz)
  B1, B2, B3, B4 -- first through fourth formant bandwidth (Hz)
  A1, A2, A3, A4 -- first through fourth formant amplitude (dB)
Plots of the data were generated to visualize various characteristics of
different genders as well as different vowels. Figure 7.1 shows averaged
fundamental frequencies of ten sustained vowels for male/female speakers. Figures
7.2 to 7.5 show averaged first, second, third, and fourth formant frequencies,
bandwidths, and amplitudes of ten sustained vowels for male/female speakers
respectively. All data are represented as means ± SE. Figures 7.6 and 7.7 show
scatter plots of the second formant frequency as a function of first formant
frequency for ten sustained vowels by male and female speakers respectively. The
polygons in Figures 7.6 and 7.7 show the approximate range of variation in formant
frequencies for each of these vowels. Figure 7.8 shows vowel triangles for both male
and female speakers.

FREQUENCY (Hz)
134
AUERAGED UOUEL FUNDAMENTAL FREQUENCY
260
240
220
200
180
160
140
120
100
O O MALE
FEMALE
T
UOLIEL
Figure 7.1 Fundamental frequencies of ten vowels.

AMPLITUDE (db) FREQUENCY (Hz) FREQUENCY (Hz)
135
320
300
280
260
240
220
200
180
160
140
120
AUERAGED BANDWIDTH
AUERAGED AMPLITUDE
32 t
30
12 1 1 1 1 1 1 1 1 1 1 H
IY I E AE A OW U 00 UH ER
UOUEL
Figure 7.2 The first formant characteristics of ten vowels.

AMPLITUDE (db> FREQUENCY (Hz) FREQUENCY (Hz)
136
UOUEL
Figure 7.3 The second formant characteristics of ten vowels.

AMPLITUDE (rib) FREQUENCY (Hz) FREQUENCY (Hz)
137
3600
3400
3200
3000
2800
2600
2400
2200
2000
1800
1600
AUERAQED BANDWIDTH
320 t
120 "T
25
AUERAQED AMPLITUDE
UOWEL
Figure 7.4 The third formant characteristics of ten vowels.

AMPLITUDE 138
UOUIEL
Figure 7.5 The fourth formant characteristics of ten vowels.

FREQUENCY OF SECOND FORMPNT (Hz)
139
2600-,
2402-
2203-
2000-
1800-
1600-
1400-
1203-
1000-
800 -
600
230
FORMPNT POSITIONS OF MPLS SPEAKERS
O
! , f
330 400 500 600 700
FREQUENCY OF FIRST FORMPNT (Hz)
( 1
000 900
Figure 7.6 The scatter plot of the first versus second formant
frequencies for ten vowels of male speakers.

FREQUENCY OF SECOND FORMRNT (Hz)
140
FORMfiNT POSITIONS OF FEMOLE SPEAKERS
Figure 7.7 The scatter plot of the first versus second formant
frequencies for ten vowels of female speakers.

FREQUENCY OF SECOND FORMfiNT (Hz)
141
2800-,
2603-
2400-
2200-
2000-
1800-
1600-
1403-
1200 -
1000 -
800
203
VOWEL TRIANGLES OF MALE/FEMALE SPEAKERS
4 IV
\
' \
' V
I \
4 OW

-A
4 AE
- -4
A
T" 1 1 1 1 1 r~
300. 400 S00 600 700 800 900
FREQUENCY OF FIRST FORMANT (Hz)
MALE
FEMALE
1
1000
Figure 7.8 Vowel triangles for male and female speakers.

7.1.2 Comparison with Peterson and Barney's Results
Averages of fundamental and formant frequencies and amplitudes of ten
vowels by 76 speakers were obtained by Peterson and Barney (1952), using
recorders and a sound spectrograph. Their database consisted of 33 males, 28
females, and 15 children. The results are listed in Table 7.3. Instead of sustained
vowels, the vowels they used were picked from CVC (consonant vowel consonant)
words. The formant amplitudes were all referred to the amplitude of the first
formant in /OW/. Bandwidth information and fourth formant information (F4, B4,
A4) were not provided in their paper. The standard error for each measurement of
each vowel was also not provided. For comparison, the male/female vowel triangles
using their data and the data in this experiment are overlaid in Figure 7.9.
There are some differences between our results and their results. For
example, the vowel triangles for both males and females from our experiment
showed less scattering than those in their experiments. The front-high vowels (e.g.,
/IY/) of our experiment had higher first formants and lower second formants than
theirs. While the middle-low vowels (e.g., /A/) of our experiment demonstrated
lower first formants and higher second formants than theirs, our back-high vowels
(e.g., /OO/) showed both higher first and second formants than theirs. These
differences may come from the many factors involved. The databases were
different (27 male and 25 female for ours and 33 male and 28 female for theirs).
The means of producing vowel utterances were different (sustained for ours and
CVC for theirs). The data collection methods were different (direct digitization for
ours and tape recording for theirs). The measurement techniques were different
(WRLS-VFF for ours and sound spectrography for theirs). Although variations
existed between estimated vowel formants, the data in this study provide more

Table 7.3 Vowel characteristics obtained by Peterson and Barney
(after Peterson and Barney (1952))

                          IY     I      E      AE     A      OW     U      OO     UH     ER
Fundamental        M      138    135    130    127    124    129    137    141    130    133
Frequency          F      235    232    223    210    212    216    232    231    221    218
(Hz)               Ch     272    269    260    251    256    263    276    274    261    261

Formant            M      270    390    530    660    730    570    440    300    640    490
Frequency    F1    F      310    430    610    860    850    590    470    370    760    500
(Hz)               Ch     370    530    690    1010   1030   680    560    430    850    560

             F2    M      2290   1990   1840   1720   1090   840    1020   870    1190   1350
                   F      2790   2480   2330   2050   1220   920    1160   950    1400   1640
                   Ch     3200   2730   2610   2320   1370   1060   1410   1170   1590   1820

             F3    M      3010   2550   2480   2410   2440   2410   2240   2240   2390   1690
                   F      3310   3070   2990   2850   2810   2710   2680   2670   2780   1960
                   Ch     3730   3600   3570   3320   3170   3180   3310   3260   3360   2160

Formant      A1           -4     -3     -2     -1     -1     0      -1     -3     -1     -5
Amplitude    A2           -24    -23    -17    -12    -5     -7     -12    -19    -10    -15
(dB)         A3           -28    -27    -24    -22    -28    -34    -34    -43    -27    -20

Figure 7.9 Vowel triangles using our data and
Peterson and Barney's data.
comprehensive information on the vowels. The data in Tables 7.1 and 7.2 serve as a
useful reference characterization of the vowels.
7.1.3 Results of Two-way ANOVA Statistical Test
Since formant characteristics such as F1, F2, B1, B2, etc. were influenced
by gender as well as vowels and each subject was observed under more than one
vowel, our experiments were referred to as two factor experiments having repeated
measures on the same subject (Winer, 1971). For the gender factor, which was
denoted as A, there were two different levels (i.e., male and female). For the vowel
factor, which was denoted as B, there were 10 different levels (i.e., ten different
vowels). Each subject was observed under all levels of B. Therefore, the two-way
ANOVA with repeated measure was used to perform the statistical test. The
algorithm of Winer (1971) was adopted. Table 7.4 shows the results.
Based on these analysis data, the general test results on factor A (gender) can
be summarized in Table 7.5.
7.1.4 Results of t Statistical Test
Based on the data in Tables 7.1 and 7.2, a t-test (Ott, 1984) for each individual
feature of ten vowels for male and female speakers was also performed. The results
are summarized in Table 7.6. Tables 7.1, 7.2, and 7.6 can serve as control strategies
for synthesizing vowels with desired male or female voice quality.
7.1.5 Discussion
1. Fundamental frequencies (F0) of all vowels for male subjects were
lower than those of female subjects. From Table 7.1, an averaged F0
for all male subjects and all vowels (total 270 vowels) was 124.6 Hz
and 224.9 Hz (total 250 vowels) for females. Table 7.5 shows that

Table 7.4 Two-way ANOVA statistical results

                      Main Effect of A       Main Effect of B        A x B Effect
                         (Gender)               (Vowel)
                      F-value  P-value      F-value  P-value      F-value  P-value
Fundamental
Frequency              226.3    <0.01         23.8    <0.01         1.43    >0.05
Formant
  F1                   111.6    <0.01        459.8    <0.01        13.4     <0.01
  B1                    41.8    <0.01         17.3    <0.01        10.3     <0.01
  A1                     0.48   >0.05         70.4    <0.01         8.01    <0.01
  F2                   195.9    <0.01        561.9    <0.01         4.92    <0.01
  B2                    46.2    <0.01          2.4    <0.05         2.17    <0.05
  A2                    34.0    <0.01         37.5    <0.01         2.57    <0.01
  F3                   203.8    <0.01        158.6    <0.01         2.89    <0.01
  B3                     3.2    >0.05          5.4    <0.01         2.05    <0.05
  A3                    58.1    <0.01         82.3    <0.01         2.63    <0.01
  F4                   196.2    <0.01         14.2    <0.01         1.89    >0.05
  B4                    17.2    <0.01          4.3    <0.01         2.33    <0.05
  A4                    52.3    <0.01         30.9    <0.01         1.68    >0.05

F-value: computed F ratio
P-value: level of significance

Table 7.5 Significance of differences between male and female
sustained vowels

                              DIFFERENCES BETWEEN                 CONCLUSION
                              MALE/FEMALE SUSTAINED VOWELS
Fundamental frequency F0      highly significant                  M < F
Formant
  F1                          highly significant                  M < F
  B1                          highly significant                  M < F
  A1                          not significant
  F2                          highly significant                  M < F
  B2                          highly significant                  M < F
  A2                          highly significant                  M > F
  F3                          highly significant                  M < F
  B3                          not significant
  A3                          highly significant                  M > F
  F4                          highly significant                  M < F
  B4                          highly significant                  M < F
  A4                          highly significant                  M > F

Table 7.6 t-test result for each individual feature
of ten vowels for male and female speakers

[Table of markings for the fundamental frequency and for F1-F4,
B1-B4, and A1-A4 across the ten vowels IY, I, E, AE, A, OW, U,
OO, UH, and ER]

where
  ++        difference is highly significant, M > F
  +         difference is significant, M > F
  --        difference is highly significant, M < F
  -         difference is significant, M < F
  (blank)   no significant difference

the statistical differences between vowel F0 of male and female
speakers were highly significant. Comparing the curve in Figure
7.1 to those in Figures 7.2, 7.3, 7.4, and 7.5, which show curves of
formant frequencies, bandwidths, and amplitudes, the curves of
fundamental frequencies across different vowels for both male and
female speakers were relatively flatter than those of formants,
indicating there were more formant than fundamental frequency
variations across different vowels. This suggests that for both male
and female speakers, glottal vibration patterns were relatively less
variable than vocal tract shapes when different vowels were
pronounced.
2. Formant frequencies (F1, F2, F3, and F4) from male vowels were
lower than those from female vowels. The statistical differences of
FI, F2, F3 and F4 between male and female speakers were highly
significant. As previous research has shown, formant frequencies
are inversely proportional to vocal tract lengths (Rabiner and
Schafer, 1978) and the vocal tract lengths for males are longer than
those for females (Fant, 1976). Thus the results support the previous
conclusions.
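
As a worked instance of that approximation, an ideal lossless uniform tube of
length L closed at the glottis has resonances F_n = (2n - 1)c / 4L. With c = 350
m/s and an illustrative male vocal tract length of L = 17 cm, F1 = 350 / (4 x 0.17),
about 515 Hz; shortening the tube to an illustrative female length of 14.5 cm
raises this to about 603 Hz, and scales every higher resonance upward by the same
factor of roughly 1.17.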
3. With regard to the formant bandwidths of male and female vowels,
the bandwidths of the formants for male vowels were narrower than
those for female vowels with the exception of B3. The statistical
differences of B1, B2 and B4 between male and female speakers
were highly significant. Even though there was no statistically
significant difference between the male and female B3, B3 of most
vowels (except /IY/ and /A/) of males were still narrower than those
of females.

150
4. On the other hand, the amplitudes of the formants from male vowels
were higher than those from female vowels with exception of Al.
The statistical differences of A2, A3, and A4 between male and
female speakers were also highly significant. Even though there was
no overall statistically significant difference for A1, the A1 values of most
vowels (except /IY/, /U/, and /OO/) of males were still higher than
those of females.
5. The combination of analyses 3 and 4 implies a steeper spectral slope
for females. Two explanations for this exist. First, in normal and
loud voices, female glottal waveforms indicated lower vocal fold
closing velocity, lower ac flow, and a proportionally shorter
closed-phase of the cycle (Holmberg and Hillman, 1987). As a
result, the speech spectrum for the female voice appeared with a
steeper slope, and thus the amplitude of formants were lower and the
bandwidths of the formants broader for the female voice. However,
considering that in the analysis stage the algorithm of closed-phase
WRLS-VFF was supposed to eliminate the influence of the vocal
fold vibration factor, another explanation for the steeper slope for
female voice exists. It was indicated first by Rabiner and Schafer
(1978) that while the first formant bandwidths are primarily
determined by wall loss, the higher formant bandwidths are
primarily determined by radiation loss. In between, the second and
third formant bandwidths are determined by a combination of
these two loss mechanisms. The results of our experiments imply
that these two losses were larger for female subjects.
6. In Figures 7.2 to 7.5, it is noted that the curves of frequency,
bandwidth and amplitude for males are flatter across different

151
vowels than for females and there is less standard error associated
with each vowel for males than for females. This is most evident
for the bandwidths. Comparing the scatter plots of first and second
formant frequencies for both genders in Figures 7.6 and 7.7, it is
seen that the female data plots appear to be more
spread out. The above observations indicate that there was greater
variability for female voices than for male voices. Many researchers
believe melodic (intonation, stress and/or coarticulation) cues are
speech characteristics associated with female voices. Larger
formant frequency, bandwidth and amplitude variation for female
speakers may also contribute to these perceptual cues.
7.2 Relative Importance of Grouped Vowel Features
In this section, the relative importance of individual or grouped vowel features
such as fundamental frequency, formant frequency, or bandwidth for gender
discrimination was determined by using these individual or grouped features
to form the test and reference templates and then processing the data for automatic
gender recognition. The results below were obtained by using the exclusive
recognition Scheme 3, which is the same as we described in Chapter 4. The
database used consisted of all 27 male and 25 female subjects with all 10 sustained
vowels for each subject. The recognition rates are presented as correct recognition
percentage.
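
The exclusive procedure can be summarized by a short sketch (ours; the full
Scheme 3 details are those of Chapter 4): each test subject is withheld from the
within-gender reference averaging before being classified with the EUC measure.

    import numpy as np

    def exclusive_recognition(male_feats, female_feats):
        # Leave-one-out ("exclusive") recognition with within-gender
        # averaged reference templates and the Euclidean distance.
        def hits(own, other):
            correct = 0
            for i in range(len(own)):
                ref_own = np.delete(own, i, axis=0).mean(axis=0)  # exclude test subject
                ref_other = other.mean(axis=0)
                d_own = np.linalg.norm(own[i] - ref_own)
                d_other = np.linalg.norm(own[i] - ref_other)
                correct += d_own < d_other
            return correct
        m, f = hits(male_feats, female_feats), hits(female_feats, male_feats)
        return (100.0 * m / len(male_feats),
                100.0 * f / len(female_feats),
                100.0 * (m + f) / (len(male_feats) + len(female_feats)))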

7.2.1 Recognition Results

(a) Using only fundamental frequency (EUC)

         MALE      FEMALE    TOTAL
         100.0%    92.0%     96.2%

(b) Using individual frequency, bandwidth, or amplitude
of four formants (EUC)

         MALE      FEMALE    TOTAL
  F1     96.3%     84.0%     90.4%
  B1     92.6%     68.0%     80.8%
  A1     51.9%     40.0%     46.2%
  F2     96.3%     100.0%    98.1%
  B2     100.0%    84.0%     92.3%
  A2     81.5%     72.0%     76.9%
  F3     100.0%    88.0%     94.2%
  B3     66.7%     56.0%     61.5%
  A3     81.5%     80.0%     80.8%
  F4     96.3%     96.0%     96.2%
  B4     77.8%     56.0%     67.3%
  A4     81.5%     84.0%     82.7%

(c) Using individual formant (including frequency, bandwidth,
and amplitude for the formant concerned)

  EUC                    MALE      FEMALE    TOTAL
  First  (F1,B1,A1)      100.0%    92.0%     96.2%
  Second (F2,B2,A2)      96.3%     100.0%    98.1%
  Third  (F3,B3,A3)      100.0%    88.0%     94.2%
  Fourth (F4,B4,A4)      96.3%     96.0%     96.2%

  PDF                    MALE      FEMALE    TOTAL
  First  (F1,B1,A1)      92.6%     92.0%     92.3%
  Second (F2,B2,A2)      96.3%     100.0%    98.1%
  Third  (F3,B3,A3)      100.0%    100.0%    100.0%
  Fourth (F4,B4,A4)      96.3%     96.0%     96.2%
(d) Using grouped frequencies, bandwidths, and amplitudes of formants
(each feature set contains only frequencies, or bandwidths, or
amplitudes from all formants)

  EUC                         MALE      FEMALE    TOTAL
  Positions  (F1,F2,F3,F4)    96.3%     100.0%    98.1%
  Bandwidths (B1,B2,B3,B4)    88.9%     74.1%     84.6%
  Amplitudes (A1,A2,A3,A4)    96.3%     96.0%     96.2%

  PDF                         MALE      FEMALE    TOTAL
  Positions  (F1,F2,F3,F4)    96.3%     100.0%    98.1%
  Bandwidths (B1,B2,B3,B4)    85.2%     80.0%     82.7%
  Amplitudes (A1,A2,A3,A4)    92.6%     96.0%     94.0%
(e) Using entire formant information (F1, B1, ..., through A4)

         MALE      FEMALE    TOTAL
  EUC    96.3%     100.0%    98.1%
  PDF    96.3%     96.0%     96.2%

(f) Using entire formant information and pitch information

         MALE      FEMALE    TOTAL
  EUC    96.3%     100.0%    98.1%
  PDF    96.3%     96.0%     96.2%
7.2.2 Discussion
1. With regard to the relative importance of each individual formant
feature for objectively distinguishing the speaker's gender, it was

noted from (b) that by using the second formant frequency, the
highest correct recognition rate (98.1%) was produced. The
recognition rate of the fourth formant frequency was 96.2%, with
94.2% for the third formant frequency and 90.4% for the first
formant frequency. According to ideal lossless tube theory
(Rabiner and Schafer, 1978), formant frequencies are inversely
proportional to the length of the tube, so all formants of a vowel should
be equally shifted. If the implication of equal shifting of formants
holds, the recognition rate should be the same for any individual
formant frequency. However, the results demonstrated that
different formants had different recognition rates so that the equal
shifting of formants only holds under ideal conditions. Importantly,
it was noted that by using the second formant bandwidth, a
recognition rate of 92.3% was achieved, which was higher than
achieved using the first formant frequency. It can be also observed
from Figure 7.3(b) that it is easy to draw a straight line (boundary)
between male/female curves in the second formant bandwidths. In
conclusion, the second formant bandwidth might also be a sensitive
gender discrimination indicator.
2. In terms of the relative importance of individual formant
characteristics for objectively distinguishing the speaker's gender, it
was found from (c) that by using the second formant information
associated with EUC distance measure, the highest recognition rate
(98.1%) was reached. Remembering the discussion above that the
second formant frequency had the most distinct gender information
and its bandwidth was also a good gender indicator, the results in (c)
were reasonable. However, it is surprising that all individual

formants showed high recognition rates (over 90%), yet the
corresponding individual bandwidth and amplitude did not achieve
high recognition rates in (b). This suggests that a combination of
formant features improved the performance. It was also noted that
when using EUC, the third formant had the lowest recognition rate
but when using PDF distance measure, the third formant had the
highest recognition rate (100%).
3. With regard to the relative importance of frequency, bandwidth, or
amplitude features of formants for objectively distinguishing the
speaker's gender, results in (d) disclosed that by using formant
frequencies, the highest recognition rate was obtained (98.1%). The
second highest was obtained using amplitudes (96.2%) and the
lowest was obtained using bandwidths (84.6%). Notice from the
results in (b), where by using the individual A1, A2, A3, and A4, only
46.2%, 76.9%, 80.8%, and 82.7% recognition rates were obtained
respectively. It was realized that the combined effect of amplitudes
was more sensitive than the individual effect of each amplitude for
gender recognition. However, the same did not hold for bandwidths.
For bandwidths, the combined effect (84.6%) was not more sensitive
than the individual effect of each bandwidth (80.8%, 92.3%,
61.5%, 67.3% for B1, B2, B3, and B4 respectively).
4. In terms of the relative importance of fundamental frequency versus
formant characteristics for objectively distinguishing the speaker's
gender, results in (a) indicated that when using only fundamental
frequency, a 96.2% recognition rate was achieved. When using only
formant information in (e), a slight improvement (98.1%) was
obtained. Considering that there was only one subject difference

between these two results, instead of declaring that formant
information was more sensitive for recognizing the speaker's gender,
it is more reasonable to infer that they had almost the same
sensitivity for gender recognition. It was also found that adding
additional pitch information to formant features did not help to
increase the recognition rate since by using all available formant
information and fundamental frequency in (f), the recognition rate
was not increased, either with EUC or PDF distance measure. This
indicated that the formant characteristics contained sufficient
gender information.
5. Redundant gender information was discovered. The above
discussion demonstrates that considerable redundant information
concerning gender appeared to be imbedded in vowel characteristics
such as formant and pitch features. For the automatic gender
recognition task, individual vowel features could be used such as the
fundamental frequency (96.2%), the fourth formant frequency
(96.2%), and the second formant frequency (98.1%). Combined or
grouped features could also be used such as the third formant with
the PDF distance measure (100%), the second formant with the EUC
distance measure (98.1%), or only frequencies of all formants
(98.1%) with the EUC distance measure. In practice, it is necessary
to consider which vowel characteristics could be easily extracted
from speech signals and which algorithms could be used to quickly
and accurately pick up these vowel characteristics. In reality, there
still remain many problems concerning the accurate formant
estimation and the fundamental frequency tracking even for
sustained vowels. Therefore, using a parametric
representation of speech such as LPC, cepstrum, and reflection
coefficients in coarse analysis might be a more appropriate approach
for automatic gender recognition.
6. The results also showed that most female recognition rates were
lower than those for males. The female recognition rates were
higher than males in only 9 sets out of a total of 31 sets for the above
experimental recognition rates. This observation suggested again
that the features of female voices appeared to have higher
within-group variations than those for males.
7. From our work we found that different strategies could be used to
accomplish the highest recognition rates for each gender. For
example, we found that a 100% recognition rate for males could be
achieved by selecting the fundamental frequency, or the second
formant bandwidth, or the third formant position with the EUC
distance measure. On the other hand, a 100% recognition rate for
females could be obtained by selecting the second formant position
with the EUC. Since each of the above cases uses only a single
feature and the EUC was the simplest distance measure, these
combinations were the best strategies for male and female
recognition. Other combinations can also be used to achieve 100%
recognition rates but they would require more than one feature and
in some cases using the PDF distance measure.
7.3 Conclusion
Conclusions from the fine analysis can be summarized as follows:
1. Both fundamental frequency and formant characteristics were
reliable indicators in gender discrimination. The fundamental

frequencies of male subjects were lower than those of female
subjects for all vowels and statistical differences of fundamental
frequencies between male and female speakers were highly
significant. For both male and female speakers, glottal vibration
patterns were relatively less variable than vocal tract shapes when
different vowels were pronounced.
2. Formant frequencies from male vowels were lower than those from
female vowels (F1, F2, F3, F4). Generally speaking, bandwidths of
the formants from male vowels were narrower than those from
female vowels and the amplitudes of the formants from male vowels
were higher than those from female vowels. Statistical differences
of F1, F2, F3 and F4, B1, B2 and B4, and A2, A3 and A4 between
male and female speakers were highly significant. This suggested a
steeper spectral slope for females.
3. Considerable redundant information concerning gender appeared to
be imbedded in the formant and pitch features. However, in terms
of the relative importance of formant frequency, bandwidth, or
amplitude for objectively distinguishing male/female voices, the
highest recognition rate (98.1%) was achieved when using the
formant frequency. And in terms of the first, second, third, or fourth
formant characteristics, using the second formant information
showed the highest recognition rate (98.1%) with the EUC distance
measure. The second formant frequency and bandwidth also
individually showed the highest recognition rate (98.1% and 92.3%)
in the 4-frequency or 4-bandwidth groups respectively. With the
PDF distance measure, the third formant including frequency,
bandwidth, and amplitude appeared to be the best feature (100%).

In terms of relative importance of fundamental frequency versus
formant characteristics, although using formant information showed
a slightly higher recognition rate (98.1% versus 96.2%), it was still
reasonable to believe that they had almost the same sensitivity for
gender recognition. This result also indicated that the formant
characteristics contained sufficient gender information without the
presence of the fundamental frequency.
4. Recall from Chapter 5 that the reflection coefficients derived from
sustained vowels, when combined with the EUC distance measure,
also produced the highest recognition rates (100%) for filter orders
of 12, 16, and 20. Therefore, there were two different approaches
for vowels in this study that achieved 100% recognition rates for both
males and females, namely, the third formant information with the
PDF distance measure and the reflection coefficients with the EUC
distance measure. Both used the recognition Scheme 3.
5. By examining Table 7.4 and the recognition results obtained using
individual formant features, it is noted that the features with higher F
values in the ANOVA analysis usually had higher recognition rates
later in the automatic recognition test (e.g., F0, F1 to F4), though the
feature with the highest F value may not possess the highest
recognition rate (the F value for F0 is the highest but, instead, using F2
achieved the highest recognition rate). This indicated that the
conclusions from the statistical test and the recognition test were
quite consistent.
6. Higher variability of the female voices was again noted. In general,
most recognition rates for female voices were lower than for male
voices. Female feature data plots appeared to be more scattered.

There existed larger formant frequency, bandwidth, and amplitude
changes across different vowels for female speakers and the
corresponding standard errors for each vowel were greater than
those for male speakers. The above observations suggested that
female voices were more variable and this may contribute to the
perceptual melodic feature of female voices.

CHAPTER 8
CONCLUDING REMARKS
8.1 Summary
Unlike automatic speech and speaker recognition, automatic gender
recognition was never proposed as a stand-alone problem. Although contemporary
research on speech has included the investigation of physiological and acoustic
gender features and their correlations with perceived gender differences, no attempt
was made before this study to classify the speaker's gender objectively by using
features automatically extracted by a computer.
The main purpose of this research was to investigate the possible effectiveness
of digital speech processing and pattern recognition techniques for an automatic
gender recognition system. Emphasis was placed on the analysis of various
objective acoustic parameters, distance measures, optimal combinations of the
parameters and measures, and most efficient recognition schemes. In addition,
some hypotheses concerning acoustic parameters that influence our ability to
distinguish a speaker's gender were clarified. Emphasis was placed on extraction of
accurate vowel characteristics including fundamental frequency and formant
features such as frequencies, bandwidths and amplitudes for each gender.
The significance of the proposed research could include the following. First,
speech recognition and speaker identification or verification would be assisted if we
could automatically recognize a speaker's gender. This would allow different
speech analysis algorithms for each gender, facilitating speech recognition by

cutting the search space in half. The synthesis of high quality speech would benefit
since acoustic features for synthesizing speech for either gender would be identified.
We presumed that the research results would provide new guidelines for future
research to develop qualitative measures of speech quality and to develop new
methods to identify acoustic features related to dialect and speaking style. Finally,
the research also has potential clinical and law enforcement applications.
The proposed study followed two directions. One direction was called coarse
analysis since it used classical pattern recognition techniques and asynchronous
linear prediction coding (LPC) analysis of speech. Acoustic parameters such as
autocorrelation, LPC, cepstrum, and reflection coefficients were derived to form test
and reference templates. The effects of using different distance measures, filter
orders, recognition schemes, and phonemes were comparatively assessed.
Comparisons of acoustic parameters using the Fisher's discriminant ratio criterion
were also conducted.
The second research direction covered fine analysis since pitch synchronous
closed-phase analysis was utilized to obtain accurate vowel characteristics for each
gender. Detailed formant features, including frequencies, bandwidths and
amplitudes, were extracted by a closed-phase WRLS-VFF (Weighted Recursive
Least Squares with Variable Forgetting Factor) method. The electroglottograph
(EGG) signal was used to locate the closed-phase portion of the speech signal. A
two-way ANOVA (Analysis of Variance) statistical analysis was performed to test
the difference between two gender features, and the relative importance of grouped
vowel features was evaluated by means of a pattern recognition approach.
The database consisted of 52 normal subjects, 27 males and 25 females. For
each subject, ten sustained vowels, five unvoiced fricatives, and four voiced fricatives
were processed during the experiments. From each utterance approximately 150 ms of
speech signal was used for the experiments.

Results showed that most of the LPC derived acoustic parameters worked very
well for automatic gender recognition. Among them, the reflection coefficients
combined with the Euclidean measure was most robust for sustained vowels (100%)
and the results were quite consistent and filter order independent. While the
cepstral distortion measure worked extremely well for unvoiced fricatives (90.4%),
the LPC Log likelihood distortion measure, reflection coefficient combined with
Euclidean distance, and cepstral distortion measure were the best options for voiced
fricatives (98.1% for all three). Hence carefully selecting the acoustic feature vector
and combining it with an appropriate distance measure was important for gender
recognition for a given type of phoneme. The use of unvoiced fricatives achieved a
high recognition rate, indicating that the speaker's gender could also be captured
only from the speaker's vocal tract characteristics.
A within-gender and within-subject averaging technique was critical for
generating appropriate test and reference templates. To a great extent, averaging on
both test and reference templates eliminated the intra-subject variation within
different vowels or fricatives of a given subject and emphasized features
representing this subject's gender. This might imply that the gender information is
time-invariant, phoneme independent, and speaker independent.
The Euclidean distance (EUC) measure appeared to be the most robust as well
as the simplest of the distance measures. Using the Euclidean distance measure was
more effective than using the Probability Density Function (PDF). Most recognition
rates of using the EUC were higher than those using the PDF and were notably
constant across all filter orders that were tested. The EUC distance measure
operated more uniformly on male and female groups than did the PDF. A possible
reason for this inferior PDF performance is due to the small ratio of the available
number of subjects per gender to the number of elements per feature vector.

Meanwhile, a filter order of 12 to 16 seemed to be the appropriate choice for most
designs.
There were two different approaches for vowels in this study that achieved
100% recognition rates for both males and females, namely, the third formant
information (a filter order of 12) with the PDF distance measure and the reflection
coefficients (filter orders of 12, 16, and 20) with the EUC distance measure. Both
used the recognition Scheme 3.
A comparative study of acoustic parameters using Fishers discriminant ratio
criterion indicated that for the filter order of 12, the analytical inferences from the
values of J1 and J4 and the expected probabilities of error using various acoustic
features proved comparable to the empirical results of the experiments with the PDF
distance measure for gender recognition. Furthermore, J1, the Mahalanobis distance,
appeared to be more reliable for predicting the performance of a gender classifier
than J4.
The statistical tests and recognition results in fine analysis suggested that both
fundamental frequency and formant characteristics were reliable indicators for
gender discrimination. For both male and female speakers, glottal vibration
patterns were relatively less variable than the vocal tract shapes when different
vowels were pronounced.
Formant frequencies from male vowels were lower than those from female
vowels. In general, bandwidths of the formants from male vowels were narrower
than those from female vowels and the amplitudes of the formants from male vowels
were higher than those from female vowels. Statistical differences of the first,
second, third, and fourth formant frequencies; first, second, and fourth formant
bandwidths; and second, third, fourth formant amplitudes between male and female
speakers were highly significant. This suggested steeper spectral slopes for female

vowels. The conclusions from the statistical test and the recognition test were quite
consistent.
Redundant or excessive gender information was imbedded in the fundamental
frequency and vocal tract resonance features. Moreover, the formant characteristics
seemed to contain sufficient gender information without the presence of the
fundamental frequency.
Finally, the higher variability of female voices was noted. Performances of
various feature vectors combined with distance measures for male subjects were
generally better than the performances for female subjects. The recognition rates
for males showed higher mean, much greater minimum, and smaller standard
deviation than those for females. Upon plotting the female features we noted a
greater scattering of the data when compared to similar plots for male features.
There existed larger formant frequency, bandwidth and amplitude changes across
different vowels for female speakers and the corresponding standard errors for each
vowel were greater than those of males. These above observations suggested that
female voices were more variable and this may contribute to the perceptual melodic
feature for females.
In summary, this study demonstrated that it is feasible to design an efficient
gender recognition system. Such a system would reduce the search space of speech
or speaker recognition in half. Furthermore, the knowledge gained from this
research might benefit the generation of synthetic speech with a desired male or
female voice quality.
8.2 Future Research Extensions
8.2.1 Short Term Extension
o The length of the speech signal used to extract the acoustic
parameters in the coarse analysis could probably be reduced further.
In our experiments, six frames were selected from the same sustained
utterance and averaged to form a template for that utterance.
Because all the sustained speech signals were quite stable,
considerable redundant information might exist in these six frames.
Thus, fewer frames (e.g., three) could be used, reducing the
redundancy in the speech signals, and the same level of recognition
rate might still be reached.
o Reduction of the feature vector dimensionality could be
considered. This can be done by various manipulations of the
feature covariance matrix (e.g., principal component analysis)
(Childers et al., 1982). However, a preliminary reduction could be
made simply by observing the feature data plots and eliminating
those elements that account for little between-class variation, as in
the sketch following this list. For example, Figure 4.6(a) shows two
"universal" reflection coefficient templates (upper layer) for male
and female speakers. It is easily noted that elements 2, 3, 11, and 12
of these reference templates accounted for little between-gender
variation. Consequently, these elements could be discarded to
reduce the dimensionality of the vector.
o Instead of using equal weighting factors in the averaging operation,
different weighting factors could be applied to different phoneme
feature vectors according to the probability of each phoneme's
appearance in real speech. Time-averaging would then be better
approximated than it was in this study. Eventually, an integrated
automatic gender recognition system using words and sentences
could be built.

o The KNN decision mechanism could be included to further improve
the performance of the gender recognition system.
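The element-pruning idea in the second item above can be sketched as follows (Python with NumPy; the difference-based ranking rule and the keep_fraction parameter are illustrative assumptions, simpler than the covariance-matrix methods of Childers et al., 1982):

import numpy as np

def prune_low_separation_elements(male_ref, female_ref, keep_fraction=2/3):
    # Rank template elements by the absolute male/female difference
    # and keep only the most discriminative fraction of them.
    separation = np.abs(male_ref - female_ref)
    n_keep = max(1, int(round(keep_fraction * male_ref.size)))
    return np.sort(np.argsort(separation)[::-1][:n_keep])

For 12-element reflection coefficient templates, elements whose male and female values nearly coincide (such as elements 2, 3, 11, and 12 in Figure 4.6(a)) would rank at the bottom and be discarded.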
8.2.2 Long Term Extension
The LPC model has no direct relationship to the speech production
mechanism, and the acoustic parameters derived from it are difficult to interpret in
terms of the voice quality of the perceived speech signal. The inherent weakness of
the all-pole LPC model prevents us from sufficiently modeling the fricatives. The
results from the cepstral distortion measure suggested that gender information did
exist in unvoiced fricatives. Thus, more detailed spectrum or cepstrum analysis of
unvoiced fricatives is needed. Moreover, more sophisticated models that are more
closely related to speech production, such as pole-zero ARMA models, may be
required for further studies.
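As a starting point for such an analysis, the sketch below (Python with NumPy; the frame length and FFT size are illustrative) computes the FFT-based real cepstrum of a windowed fricative frame. Unlike an LPC-derived cepstrum, this form makes no all-pole assumption, which is precisely the concern for fricatives:

import numpy as np

def real_cepstrum(frame, n_fft=512):
    # Real cepstrum: inverse FFT of the log magnitude spectrum.
    # The low-quefrency coefficients summarize the spectral envelope.
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    return np.fft.irfft(log_mag, n_fft)

The first dozen or so cepstral coefficients of an unvoiced fricative frame could then serve as a feature vector for the same recognition schemes used in the coarse analysis.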
Eskenazi (1989) found that the Spectral Flatness of the LPC residual signal
and the Coefficient of Excess of the speech signal showed significant differences
between males and females for several vowels. Additional study is needed to assess
the reliability of these acoustic measures for gender recognition. Furthermore,
other time-domain characteristics of the speech signal, such as Pitch Amplitude,
Harmonics-to-Noise Ratio, Perturbation Quotients, and Jitter and Shimmer, might
present differences between the two genders. Detailed research on these
time-domain parameters could be considered in the future.
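The sketch below (Python with NumPy) computes two such measures under common textbook definitions; Eskenazi (1989) may define them differently, so these formulas are assumptions rather than a restatement of that work:

import numpy as np

def spectral_flatness(signal, n_fft=1024):
    # Geometric mean over arithmetic mean of the power spectrum:
    # near 1.0 for noise-like signals, near 0 for peaked spectra.
    power = np.abs(np.fft.rfft(signal * np.hanning(len(signal)), n_fft)) ** 2
    power = power + 1e-12  # numerical floor
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

def jitter_shimmer(periods, peak_amplitudes):
    # Mean cycle-to-cycle perturbation of the pitch periods (jitter)
    # and of the per-cycle peak amplitudes (shimmer), each relative
    # to its mean; inputs are arrays of per-cycle measurements.
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(peak_amplitudes))) / np.mean(peak_amplitudes)
    return float(jitter), float(shimmer)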
Finally, the shape of the glottal volume velocity waveform, which has
been shown to be an important factor for gender classification (Monsen and
Engebretson, 1977; Karlsson, 1986; Holmberg and Hillman, 1987), was not
investigated in this study. Systematic analysis of this aspect using our database
could be performed when the inverse filtering techniques are improved.

APPENDIX A
RECOGNITION RATES FOR LPC AND CEPSTRUM PARAMETERS
Table A.1
Results from exclusive recognition schemes
LPC distance measure (Log likelihood)
Filter order = 8

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   50.0    77.2    63.1
Vowels       Scheme 2   60.7    70.0    65.2
             Scheme 3   81.5    68.0    75.0
             Scheme 4   74.1    76.0    75.0
Unvoiced     Scheme 1   77.8    39.2    59.2
Fricatives   Scheme 2   67.4    55.2    61.5
             Scheme 3   70.4    64.0    67.3
             Scheme 4   66.7    88.0    76.9
Voiced       Scheme 1   71.3    78.0    74.5
Fricatives   Scheme 2   73.1    82.0    77.4
             Scheme 3   81.5   100.0    90.4
             Scheme 4   96.3   100.0    98.1

Table A.2
Results from exclusive recognition schemes
LPC distance measure (Log likelihood)
Filter order = 12

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   65.2    74.4    69.6
Vowels       Scheme 2   69.3    74.0    71.5
             Scheme 3   88.9    84.0    86.5
             Scheme 4   77.8    84.0    80.8
Unvoiced     Scheme 1   80.7    46.4    64.2
Fricatives   Scheme 2   68.9    58.4    63.9
             Scheme 3   77.8    72.0    75.0
             Scheme 4   66.7    84.0    75.0
Voiced       Scheme 1   64.8    80.0    72.1
Fricatives   Scheme 2   75.9    85.0    80.3
             Scheme 3   92.6    96.0    94.2
             Scheme 4   92.6   100.0    96.2

Table A.3
Results from exclusive recognition schemes
LPC distance measure (Log likelihood)
Filter order = 16

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   65.2    84.0    74.2
Vowels       Scheme 2   72.2    80.4    76.2
             Scheme 3   92.6    80.0    86.5
             Scheme 4   85.2    88.0    86.5
Unvoiced     Scheme 1   79.3    55.2    67.7
Fricatives   Scheme 2   65.2    63.2    64.2
             Scheme 3   77.8    72.0    75.0
             Scheme 4   66.7    80.0    73.1
Voiced       Scheme 1   65.7    81.0    73.1
Fricatives   Scheme 2   77.8    86.0    81.7
             Scheme 3   96.3    96.0    96.2
             Scheme 4   92.6   100.0    96.2

Table A.4
Results from exclusive recognition schemes
LPC distance measure (Log likelihood)
Filter order = 20

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   66.3    82.8    74.2
Vowels       Scheme 2   72.2    81.2    76.5
             Scheme 3   88.9    80.0    84.6
             Scheme 4   88.9    88.0    88.5
Unvoiced     Scheme 1   79.3    49.6    65.0
Fricatives   Scheme 2   64.4    64.0    64.2
             Scheme 3   81.0    76.0    78.9
             Scheme 4   59.3    80.0    69.3
Voiced       Scheme 1   64.8    81.0    72.6
Fricatives   Scheme 2   76.9    84.0    80.3
             Scheme 3   96.3   100.0    98.1
             Scheme 4   92.6    96.0    94.3

Table A.5
Results from exclusive recognition schemes
Cepstrum distance measure
Filter order = 8

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   59.6    63.2    61.3
Vowels       Scheme 2   64.4    74.8    69.4
             Scheme 3   74.1    92.0    82.7
             Scheme 4   85.2    96.0    90.4
Unvoiced     Scheme 1   63.0    59.2    61.2
Fricatives   Scheme 2   57.0    60.8    58.8
             Scheme 3   70.4    72.0    71.2
             Scheme 4   81.5    76.0    78.8
Voiced       Scheme 1   86.1    72.0    79.3
Fricatives   Scheme 2   72.2    79.0    75.5
             Scheme 3   92.6    96.0    94.2
             Scheme 4   88.9    96.0    92.3

Table A.6
Results from exclusive recognition schemes
Cepstrum distance measure
Filter order = 12

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   68.9    67.6    68.3
Vowels       Scheme 2   68.2    66.4    67.3
             Scheme 3   88.9    96.0    92.3
             Scheme 4   88.9    96.0    92.3
Unvoiced     Scheme 1   65.2    66.4    65.8
Fricatives   Scheme 2   57.8    65.6    61.5
             Scheme 3   74.1    76.0    75.0
             Scheme 4   85.2    92.0    88.5
Voiced       Scheme 1   83.3    82.0    82.7
Fricatives   Scheme 2   85.2    79.0    82.2
             Scheme 3  100.0    96.0    98.1
             Scheme 4   92.6    92.0    92.3

Table A.7
Results from exclusive recognition schemes
Cepstrum distance measure
Filter order = 16

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   72.6    66.0    69.4
Vowels       Scheme 2   71.1    68.8    70.0
             Scheme 3   88.9    92.0    90.4
             Scheme 4   92.6    92.0    92.3
Unvoiced     Scheme 1   63.0    64.8    63.9
Fricatives   Scheme 2   59.3    66.4    62.7
             Scheme 3   85.2    84.0    84.6
             Scheme 4   88.9    92.0    90.4
Voiced       Scheme 1   82.4    80.0    81.3
Fricatives   Scheme 2   88.9    80.0    84.6
             Scheme 3  100.0    96.0    98.1
             Scheme 4   92.6    92.0    92.3

Table A.8
Results from exclusive recognition schemes
Cepstrum distance measure
Filter order = 20

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   74.1    66.8    70.6
Vowels       Scheme 2   73.7    70.4    72.1
             Scheme 3   88.9    92.0    90.4
             Scheme 4   88.9    88.0    88.5
Unvoiced     Scheme 1   65.2    64.0    64.6
Fricatives   Scheme 2   60.7    68.0    64.2
             Scheme 3   85.2    80.0    82.7
             Scheme 4   85.2    92.0    88.5
Voiced       Scheme 1   79.6    82.0    80.8
Fricatives   Scheme 2   87.0    83.0    85.1
             Scheme 3   96.3    96.0    96.2
             Scheme 4   88.9    92.0    90.4

Table A.9
Results from inclusive recognition schemes
LPC distance measure (Log likelihood)
Filter order = 16

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   75.6    89.6    82.3
Vowels       Scheme 2   75.2    82.0    78.5
             Scheme 3   96.3    88.0    92.3
Unvoiced     Scheme 1   85.2    59.2    72.6
Fricatives   Scheme 2   73.3    68.0    70.8
             Scheme 3   85.2    80.0    82.7
Voiced       Scheme 1   75.0    87.0    80.8
Fricatives   Scheme 2   80.6    88.0    84.1
             Scheme 3  100.0   100.0   100.0

Table A.10
Results from inclusive recognition schemes
Cepstrum distance measure
Filter order = 16

                           CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   78.2    72.0    75.2
Vowels       Scheme 2   74.8    70.0    72.5
             Scheme 3   92.6    96.0    94.2
Unvoiced     Scheme 1   68.9    72.0    70.4
Fricatives   Scheme 2   60.0    66.4    63.1
             Scheme 3   88.9    84.0    86.5
Voiced       Scheme 1   84.3    84.0    84.1
Fricatives   Scheme 2   90.7    83.0    87.0
             Scheme 3  100.0    96.0    98.1

APPENDIX B
RECOGNITION RATES FOR VARIOUS ACOUSTIC PARAMETERS
AND DISTANCE MEASURES
Table B.1
Results from the exclusive recognition Scheme 3
Euclidean distance measure
Filter order = 8

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   85.2    72.0    78.8
Vowels       LPC   63.0    84.0    73.1
             RC    81.5    96.0    88.5
             CC    74.1    92.0    82.7
Unvoiced     ARC   70.4    80.0    75.0
Fricatives   LPC   81.5    80.0    80.8
             RC    81.5    80.0    80.8
             CC    70.4    72.0    71.2
Voiced       ARC   88.9    84.0    86.5
Fricatives   LPC   92.6    92.0    92.3
             RC    96.3    92.0    94.2
             CC    92.6    96.0    94.2

Table B.2
Results from the exclusive recognition Scheme 3
Euclidean distance measure
Filter order = 12

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   85.2    72.0    78.8
Vowels       LPC   74.1    84.0    78.8
             FFF   96.3   100.0    98.1
             RC   100.0   100.0   100.0
             CC    88.9    96.0    92.3
Unvoiced     ARC   74.1    76.0    75.0
Fricatives   LPC   70.4    68.0    69.2
             RC    81.5    80.0    80.8
             CC    74.1    76.0    75.0
Voiced       ARC   88.9    88.0    88.5
Fricatives   LPC   96.3    88.0    92.3
             RC    96.3    96.0    96.2
             CC   100.0    96.0    98.1

Table B.3
Results from the exclusive recognition Scheme 3
Euclidean distance measure
Filter order = 16

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   81.5    76.0    78.8
Vowels       LPC   74.1    88.0    80.8
             RC   100.0   100.0   100.0
             CC    88.9    92.0    90.4
Unvoiced     ARC   74.1    76.0    75.0
Fricatives   LPC   77.8    64.0    71.2
             RC    81.5    80.0    80.8
             CC    85.2    84.0    84.6
Voiced       ARC   88.9    84.0    86.5
Fricatives   LPC   96.3    88.0    92.3
             RC    96.3    96.0    96.2
             CC   100.0    96.0    98.1

Table B.4
Results from the exclusive recognition Scheme 3
Euclidean distance measure
Filter order = 20

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   85.2    80.0    82.7
Vowels       LPC   74.1    88.0    80.8
             RC   100.0   100.0   100.0
             CC    88.9    92.0    90.4
Unvoiced     ARC   74.1    76.0    75.0
Fricatives   LPC   74.1    68.0    71.2
             RC    81.5    80.0    80.8
             CC    85.2    80.0    82.7
Voiced       ARC   92.6    84.0    88.5
Fricatives   LPC   92.6    88.0    90.4
             RC    96.3    96.0    96.2
             CC    96.3    96.0    96.2

Table B.5
Results from the exclusive recognition Scheme 3
PDF (Probability Density Function) distance measure
Filter order = 8

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   81.5    80.0    80.8
Vowels       LPC   85.2    84.0    84.6
             RC    81.5    96.0    88.5
             CC    81.5    76.0    78.8
Unvoiced     ARC   74.1    64.0    69.2
Fricatives   LPC   81.5    76.0    78.8
             RC    74.1    84.0    78.8
             CC    77.8    84.0    80.8
Voiced       ARC   88.9    88.0    88.5
Fricatives   LPC   88.9    96.0    92.3
             RC    92.6    92.0    92.3
             CC    92.6    92.0    92.3

Table B.6
Results from the exclusive recognition Scheme 3
PDF (Probability Density Function) distance measure
Filter order = 12

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   92.6    76.0    84.6
Vowels       LPC  100.0    96.0    98.1
             FFF   96.3    96.0    96.2
             RC   100.0    96.0    98.1
             CC   100.0    88.0    94.2
Unvoiced     ARC   81.5    48.0    65.4
Fricatives   LPC   96.3    76.0    86.5
             RC    66.7    80.0    73.1
             CC    77.8    68.0    73.1
Voiced       ARC   92.6    80.0    86.5
Fricatives   LPC  100.0    88.0    94.2
             RC    96.3    84.0    90.4
             CC    92.6    92.0    92.3

Table B.7
Results from the exclusive recognition Scheme 3
PDF (Probability Density Function) distance measure
Filter order = 16

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   92.6    84.0    88.5
Vowels       LPC   92.6    92.0    92.3
             RC    96.3    88.0    92.3
             CC    96.3    84.0    90.3
Unvoiced     ARC   66.0    48.0    57.7
Fricatives   LPC   92.6    64.0    78.8
             RC    70.4    64.0    67.3
             CC    74.1    64.0    69.2
Voiced       ARC   92.6    72.0    82.7
Fricatives   LPC  100.0    88.0    94.2
             RC   100.0    80.0    90.4
             CC    88.9    72.0    80.8

Table B.8
Results from the exclusive recognition Scheme 3
PDF (Probability Density Function) distance measure
Filter order = 20

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   78.0    56.0    67.3
Vowels       LPC   88.9    72.0    80.8
             RC    85.2    60.0    67.3
             CC    88.9    60.0    75.0
Unvoiced     ARC    N/A     N/A     N/A
Fricatives   LPC   66.7    40.0    53.8
             RC    63.0    48.0    55.8
             CC    77.8    36.0    57.7
Voiced       ARC   92.6    24.0    59.6
Fricatives   LPC   92.6    48.0    71.2
             RC    96.3    52.0    75.0
             CC    74.1    68.0    71.2

Table B.9
Results from the inclusive recognition Scheme 3
Euclidean distance measure
Filter order = 16

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC   88.9    76.0    82.7
Vowels       LPC   77.8    88.0    82.7
             RC   100.0   100.0   100.0
             CC    92.6    96.0    94.2
Unvoiced     ARC   77.8    80.0    78.9
Fricatives   LPC   77.8    64.0    71.2
             RC    85.2    80.0    82.7
             CC    88.9    84.0    86.5
Voiced       ARC   92.6    88.0    90.4
Fricatives   LPC  100.0    88.0    94.2
             RC    96.3    96.0    96.2
             CC   100.0    96.0    98.1

Table B.10
Results from the inclusive recognition Scheme 3
PDF (Probability Density Function) distance measure
Filter order = 16

                      CORRECT RATE %
                   MALE   FEMALE   TOTAL
Sustained    ARC  100.0   100.0   100.0
Vowels       LPC  100.0   100.0   100.0
             RC   100.0   100.0   100.0
             CC   100.0   100.0   100.0
Unvoiced     ARC   96.3   100.0    98.1
Fricatives   LPC  100.0   100.0   100.0
             RC   100.0   100.0   100.0
             CC   100.0   100.0   100.0
Voiced       ARC  100.0   100.0   100.0
Fricatives   LPC  100.0   100.0   100.0
             RC   100.0   100.0   100.0
             CC   100.0   100.0   100.0

REFERENCES
Achariyapaopan, T., and Childers, D. G. (1983). On the optimum number of
features in the classification of multivariate normal distribution data, Report,
University of Florida, Gainesville.
Ananthapadmanabha, T.V., and Fant, G. (1982). Calculation of true glottal flow
and its components, Speech Transmission Lab., Rep., STL/QPSR, 1/1982, Royal
Institute of Technology, Stockholm, Sweden, 1-30.
Atal, B. S. (1974a). Linear prediction of speech: recent advances with
applications to speech analysis, in Speech Recognition, Invited Papers Presented at
the 1974 IEEE Symposium, D. R. Reddy, Ed., Academic Press, New York, 221-227.
Atal, B. S. (1974b). Effectiveness of linear prediction characteristics of the speech
wave for automatic speaker identification and verification, J. Acoust. Soc. Am.,
Vol. 55(6), 1304-1312.
Atal, B. S. (1976). Automatic recognition of speakers from their voices, Proc.
IEEE, Vol. 64, 460-475.
Atal, B. S., and Hanauer S. L. (1971). Speech analysis and synthesis by linear
prediction of the speech wave, J. Acoust. Soc. Am., Vol. 50(2), 637-655.
Atal, B. S., and Schroeder, M. R. (1970). Adaptive predictive coding of speech
signals, Bell Syst. Tech. J., Vol. 49(6), 1973-1986.
Berouti, M. G., Childers, D. G., and Paige, A. (1977). A correction of tape recorder
distortion, Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing, 397-400.
Bladon, A. (1983). Acoustic phonetics, auditory phonetics, speaker sex and speech
recognition: a thread, Chapter 2, in Computer Speech Processing, F. Fallside and
A. Woods, Eds., Prentice-Hall, Englewood Cliffs, New Jersey, 29-38.
Bralley, R., Bull, G., Gore, C., and Edgerton, M. (1978). Evaluation of vocal pitch
in male transsexuals, Journal of Communication Disorders, Vol. 11, 443-449.
Brown, W. S., and Feinstein, S. H. (1977). Speaker sex identification utilizing a
constant source, Folia Phoniatrica, Vol. 29, 248-249.
189

190
Carlson, T. E. (1981). Some acoustical and perceptual correlates of speaker gender
identification, An outline of Ph.D. dissertation, University of Florida, Gainesville.
Carrell, T. D. (1981). Effects of glottal waveform on the perception of talker sex,
J. Acoust. Soc. Am., Supl., Vol. 70, S97.
Cheng, Y. M., and Guerin, B. (1987). Control parameters in male and female
glottal sources, Chapter 17, in Laryngeal function in phonation and respiration, T.
Bear, C. Susaki, and K. Harris, Eds., College Hill Publ., San Diego, 219-238.
Childers, D. G. (1977). Laryngeal pathology detection, CRC Crit. Rev. Bioeng.,
Vol. 2(4), 375-426.
Childers, D. G. (1986). Single-trial event-related potentials: statistical
classification and topography, Chapter 14, in Topographic Mapping of Brain
Electrical Activity, F. H. Duffy, Ed., Butterworth Publ., Boston, 255-277.
Childers, D. G. (1989). Biomedical Signal Processing, Chapter 10, in Selected
Topics in Signal Processing, Simon Haykin, Ed., Prentice-Hall, Englewood Cliffs,
New Jersey, 194-250.
Childers, D. G., Bloom, P. A., Arroyo, A. A., Roucos, S. E., Fischler, I. S.,
Achariyapaopan, T., and Perry, N. W. Jr. (1983). Classification of cortical
responses using features from single EEG records, IEEE Trans. on Biomed. Eng.,
Vol. BME-29(6), 423-438.
Childers, D. G., and Hicks, D. M. (1984). Two channel (EGG and speech)
analysis-synthesis for voice recognition, NSF Proposal, University of Florida,
Gainesville.
Childers, D. G., and Krishnamurthy, A. K. (1985). A critical review of
electroglottography. CRC Crit. Rev. Bioeng., Vol. 12(2), 131-164.
Childers, D. G., Krishnamurthy, A. K., Bocchieri, B. L., and Naik, J. M. (1985a ).
Vocal source and tract models based on speech signal analysis, in Mathematics
and Computers in Biomedical Applications, J. Eisenfeld and C. Delisi, Eds.,
Elsevier Science Publ. B. V., Amsterdam, Netherlands, 335-349.
Childers, D. G., and Larar, J. N. (1984). Electroglottography for laryngeal function
assessment and speech analysis, IEEE Trans. on Biomed. Eng., Vol. BME-31(12),
807-817.
Childers, D. G., Naik, J. M., Larar, J. N., Krishnamurthy, A. K., and Moore, G. P.
(1983). Electroglottography, speech, and ultra-high speed cinematography, in
Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control, I. Titze
and R. Scherer, Eds., The Denver Center for the Performing Arts, Denver, 202-220.

191
Childers, D. G., Wu, K., and Hicks, D. M. (1987). Factors in voice quality: acoustic
features related to gender, Proc. IEEE International Conference on Acoustics,
Speech, and Signal Processing, Vol. 1, 293-296.
Childers, D. G., Wu, K., Hicks, D. M., and Yegnanarayana, B. (1989). Voice
conversion, Speech Communication, Vol. 8, 147-158.
Childers, D. G., Yegnanarayana, B., and Wu, K. (1985b). Voice conversion:
factors responsible for quality, Proc. IEEE International Conference on Acoustics,
Speech, and Signal Processing, Vol. 2, 748-751.
Coleman, R. O. (1971). Male and female voice quality and its relationship to vowel
formant frequencies, Journal of Speech and Hearing Research, Vol. 14, 566-577.
Coleman, R. O. (1973a). A comparison of the contributions of two vocal
characteristics to the perception of the maleness and femaleness in the voice, Paper
presented at the Annual Convention of the American Speech and Hearing
Association, Detroit.
Coleman, R. O. (1973b). Speaker identification in the absence of inter-subject
differences in glottal source characteristics, J. Acoust. Soc. Am., Vol. 53,
1741-1743.
Coleman, R. O. (1976). A comparison of the contributions of two voice quality
characteristics to the perception of maleness and femaleness in the voice, Journal
of Speech and Hearing Research, Vol. 19, 168-180.
Committee on Evaluation of Sound Spectrograms. (1979). On the theory and
practice of voice identification, National Academy of Sciences Report,
Washington, DC.
Cowan, C. F. N., and Grant, M. (1985). Adaptive Filters, Chapter 5, Prentice-Hall,
Englewood Cliffs, New Jersey.
Davis, S., and Mermelstein, P. (1980). Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences, IEEE Trans.
Acoust., Speech, and Signal Processing, Vol. 28(4), 357-366.
Edwards, L. A. (1964). Statistical Methods for the Behavioral Sciences, Holt,
Rinehart and Winston, New York.
Eskenazi, L. (1989). Acoustic correlates of voice quality and distortion measures
for speech processing, Ph.D. dissertation, University of Florida, Gainesville.
Fant, G. (1966). A note on vocal tract size factors and non-uniform F-pattern
scaling, Speech Transmission Lab., Rep., STL/QPSR, 1/1966, Royal Institute of
Technology, Stockholm, Sweden, 22-30.

192
Fant, G. (1976). Vocal tract energy functions and non-uniform scaling, J. Acoust.
Soc. Japan, Vol. 11, 1-18.
Fant, G., Gobi, C., Karlsson, I., and Lin, Q. (1987). The female voice: Experiments
and overview. 114th Meeting of Acoustical Society of America, J. Acoust. Soc.
Am. Sup. 1, Vol. 82, S90.
Flanagan, J. L. (1955). A difference limen for vowel formant frequency. J. Acoust.
Soc. Am., Vol. 27, 613-617.
Flanagan, J. L. (1972). Speech Analysis, Synthesis and Perception, 2nd Ed.,
Springer-Verlag, New York.
Foley, D. H. (1972). Considerations of sample and feature size, IEEE Trans.
Inform. Theor., Vol. IT-18, 618-626.
Fortescue, T. R., Kershenbaum, L. S., and Ydstie, B. E. (1981). Implementation of
self-tuning regulators with variable forgetting factors, Automatica, Vol. 17, 831-835.
Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition, Academic
Press, New York.
Gray, A. H., and Markel, J. D. (1976). Distance measures for speech processing,
IEEE Trans. Acoust., Speech, and Signal Processing, Vol. 24, 380-391.
Gray, R. M., Buzo, A., Gray, A. H. Jr., and Matsuyama, Y. (1980). Distortion
measures for speech processing, IEEE Trans. Acoust., Speech, and Signal
Processing, Vol. 28(4), 367-376.
Henton, C. G. (1987). Fact and fiction in the description of female and male
pitch. 114th Meeting of Acoustical Society of America, J. Acoust. Soc. Am. Sup. 1,
Vol. 82, S91.
Hollien, H., and Jackson, B. (1973). Normative data on the speaking fundamental
frequency characteristics of young adult males, Journal of Phonetics, Vol. 1,
117-120.
Hollien, H., and Malcik, E. (1967). Evaluation of cross-sectional studies of
adolescent voice changes in males, Speech Monographs, Vol. 34, 80-84.
Hollien, H., and Paul, P. (1969). A second evaluation of the speaking fundamental
frequency characteristics of post-adolescent girls, Language and Speech, Vol. 12,
119-124.
Hollien, H., and Shipp, T. (1972). Speaking fundamental frequency and
chronologic age in males, Journal of Speech and Hearing Research, Vol. 15, 155-159.
Holmberg, E. B., Hillman, R. E. and Perkell, J. S. (1987). Glottal airflow and
pressure measurements for female and male speakers in soft, normal, and loud

193
voice, 114th Meeting of Acoustical Society of America, J. Acoust. Soc. Am. Sup. 1,
Vol. 82, S90.
Holmes, J. N. (1973). The influence of glottal waveform on the naturalness of
speech from a parallel formant synthesizer, IEEE Trans. Audio and Electroacoust.,
Vol. 21, 298-305.
Horii, Y., and Ryan, W. J. (1981). Fundamental frequency characteristics and
perceived age of adult male speakers, Folia Phoniatrica, Vol. 33, 227-233.
Hughes, G. F. (1968). On the mean accuracy of statistical pattern recognizers,
IEEE Trans. Inform. Theor., Vol. IT-14, 55-63.
Ingemann, F. (1968). Identification of the speaker's sex from voiceless fricatives,
J. Acoust. Soc. Am., Vol. 44, 1142-1144.
Ishizaka, K., and Flanagan, J. L. (1972). Synthesis of voiced sounds from a
two-mass model of the vocal cords, Bell Syst. Tech. J., Vol. 51, 1233-1268.
Itakura, F. (1975). Minimum prediction residual principle applied to speech
recognition, IEEE Trans. Acoust., Speech, and Signal Processing, Vol. ASSP-23,
67-72.
Juang, B. H. (1984). On using the Itakura-Saito measures for speech coder
performance evaluation, AT&T Bell Lab. Tech. J., Vol. 63(8), 1477-1498.
Karlsson, I. (1986). Glottal wave forms for normal female speakers, Journal of
Phonetics, Vol. 14, 415-419.
Klatt, D. H. (1987). Acoustic correlates of breathiness: First harmonic amplitude,
turbulence noise, and tracheal coupling, 114th Meeting of Acoustical Society of
America, J. Acoust. Soc. Am. Sup. 1, Vol. 82, S91.
Klatt, D. H., and Klatt, L. C. (1987). Voice quality variations within and across
female and male talkers: implications for speech analysis, synthesis and
perception, submitted to J. Acoust. Soc. Am.
Krishnamurthy, A. K. (1983). Study of vocal fold vibration and the glottal sound
source using synchronized speech electroglottography and ultra-high speed
laryngeal films, Ph.D. dissertation, University of Florida, Gainesville.
Krishnamurthy, A.K., and Childers, D.G. (1986). Two-channel speech analysis,
IEEE Trans. on Acoust., Speech, and Signal Processing, Vol. ASSP-34(4), 730-743.
Kullback, S. (1959). Information Theory and Statistics, Wiley, New York.
Lachenbruch, P. A. (1968). On expected probabilities of misclassification in
discriminant analysis, necessary sample size, and a relation with the multiple
correlation coefficient, Biometrics, Vol. 24, 823-834.

194
Lass, N. J., Almerino, C. A., Jordan, L. F., and Walsh, J. M. (1980). The effect of
filtered speech on speaker race and sex identifications, Journal of Phonetics, Vol.
8, 101-112.
Lass, N. J., Hughes, K. R., Bowyer, M. D., Waters, L. T., and Bourne, V. T. (1976).
Speaker sex identification from voiced, whispered, and filtered isolated vowels, J.
Acoust. Soc. of Am., Vol. 59, 675-678.
Lass, N. J., and Mertz, P. J. (1978). The effect of temporal speech alterations on
speaker race and sex identifications, Language and Speech, Vol. 21, 279-290.
Lass, N. J., Tecca, J. E., Mancuso, R. A., and Black, W. I. (1979). The effect of
phonetic complexity on speaker race and sex identifications, Journal of Phonetics,
Vol. 7, 105-118.
Lindblom, B. (1962). Accuracy and limitation of sona-graph measurements,
Proceedings of the 4th International Congress of Phonetic Sciences, Helsinki, 1961.
The Hague, 1962.
Linke, C. E. (1973). A study of pitch characteristics of female voices and their
relationship to vocal effectiveness, Folia Phoniatrica, Vol. 25, 173-185.
Linville, S. E., and Fisher, H. B. (1985). Acoustic characteristics of perceived
versus actual vocal age in controlled phonation by adult females, J. Acoust. Soc.
Am., Vol. 78(1), 40-48.
Markel, J. D., and Gray, A. H. (1976). Linear Prediction of Speech,
Springer-Verlag, Berlin.
Markel, J. D., Oshika, B., and Gray, A. H. Jr. (1977). Long-term feature averaging
for speaker recognition, IEEE Trans. on Acoust., Speech, and Signal Processing,
Vol. ASSP-25(4), 330-337.
Makhoul, J. (1975a). Linear prediction in automatic speech recognition, in
Speech Recognition, Invited Papers Presented at the 1974 IEEE Symposium, D. R.
Reddy, Ed., Academic Press, New York, 183-220.
Makhoul, J. (1975b). Linear prediction: a tutorial review, Proc. IEEE, Vol. 63(4),
561-580.
Monsen, R. B., and Engebretson, A. M. (1977). Study of variations in the male and
female glottal wave, J. Acoust. Soc. Am., Vol. 62(4), 981-993.
Monsen, R. B., and Engebretson, A. M. (1983). The accuracy of formant frequency
measurements: a comparison of spectrographic analysis and linear prediction,
Journal of Speech Hearing Research, Vol. 26, 89-97.

195
Morikawa, H., and Fujisaki, H. (1982). Adaptive analysis of speech based on a
pole-zero representation, IEEE Trans, on Acoust., Speech, and Signal Processing,
Vol. ASSP-30(1), 77-88.
Murry, T., and Singh, S. (1980). Multidimensional analysis of male and female
voices, J. Acoust. Soc. Am., Vol. 68(5), 1294-1300.
Naik, J. M. (1983). Synthesis and evaluation of natural sounding speech using the
linear predictive analysis-synthesis scheme, Ph.D. Dissertation, University of
Florida, Gainesville.
Nocerino, N., Soong, F. K., Rabiner, L. R. and Klatt, D. H. (1985). Comparative
study of several distortion measures for speech recognition, Speech
Communication, Vol. 4, 317-331.
Nord, L., and Sventelius, E. (1979). Analysis and prediction of difference limen
data for formant frequencies, Speech Transmission Lab., Rep., STL/QPSR,
3-4/1979, Royal Institute of Technology, Stockholm, Sweden, 60-72.
O'Kane, M. (1987). Recognition of speech and recognition of speaker sex: parallel
or concurrent processes? 114th Meeting of Acoustical Society of America, J.
Acoust. Soc. Am. Sup. 1, Vol. 82, S84.
Ott, L. (1984). An Introduction to Statistical Methods and Data Analysis, Duxbury
Press, Boston.
Parsons, T. W. (1986). Voice and Speech Processing, McGraw-Hill, New York.
Peterson, G. E., and Barney, H. L. (1952). Control methods used in a study of the
vowels, J. Acoust. Soc. Am., Vol. 24, 175-184.
Pinto, N. B., Childers, D. G., and Lalwani, A. (1989). Formant speech synthesis:
improving production quality, accepted by IEEE Trans. on Acoust., Speech, and
Signal Processing.
Pruzansky, S. (1963). Pattern matching procedure for automatic talker
recognition, J. Acoust. Soc. Am., Vol. 35, 354-358.
Rabiner, L. R., and Levinson, S. E. (1981). Isolated and connected word
recognition: theory and selected applications, IEEE Trans. Communications,
Vol. COM-29(5), 621-659.
Rabiner, L. R., Levinson, S. E., Rosenberg, A. E., and Wilpon, J. G. (1979).
Speaker-independent recognition of isolated words using clustering techniques,
IEEE Trans. on Acoust., Speech, and Signal Processing, Vol. ASSP-27(1), 336-349.
Rabiner, L. R., and Schafer, R. W. (1978). Digital Processing of Speech Signals,
Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

196
Rosenberg, A. E. (1976). Automatic speaker verification: a review, Proc. IEEE,
Vol. 64, 475-487.
Rothenberg, M. (1981). Acoustic interaction between the glottal source and the
vocal tract, in Vocal Fold Physiology, K. N. Stevens and M. Hirano, Eds.,
University of Tokyo Press, Tokyo, 305-328.
Saxman, J., and Burk, K. (1967). Speaking fundamental frequency characteristics
of middle-aged females, Folia Phoniatrica, Vol. 19, 167-172.
Schwartz, M. F. (1968). Identification of speaker sex from isolated, voiceless
fricatives, J. Acoust. Soc. of Am., Vol. 43, 1178-1179.
Schwartz, M. F., and Rine, H. E. (1968). Identification of speaker sex from
isolated, whispered vowels, J. Acoust. Soc. Am., Vol. 44, 1736-1737.
Shipp, F. T., and Hollien, H. (1969). Perception of aging male voice, Journal of
Speech and Hearing Research, Vol. 12, 703-710.
Singh, S., and Murry, T. (1978). Multidimensional classification of normal voice
qualities, J. Acoust. Soc. Am., Vol. 64(1), 81-87.
Stoicheff, M. (1981). Speaking fundamental frequency characteristics of
non-smoking female adults, Journal of Speech Hearing Research, Vol. 24,
437-441.
Ting, Y. T. (1989). Adaptive estimation of time-varying signal parameters with
applications to speech, Ph.D. dissertation, University of Florida, Gainesville.
Ting, Y. T., Childers, D. G., and Principe, J. C. (1988). Tracking spectral
resonances, Fourth Annual ASSP Workshop on Spectrum Estimation and
Modeling, Minneapolis, 49-54.
Titze, I. R. (1987). Physiology of the female larynx, 114th Meeting of Acoustical
Society of America, J. Acoust. Soc. Am. Sup. 1, Vol. 82, S90.
Titze, I. R. (1989). Physiologic and acoustic differences between male and female
voices, J. Acoust. Soc. Am., Vol. 85(4), 1699-1707.
Tou, J. T., and Gonzalez, R. C. (1974). Pattern Recognition Principles,
Addison-Wesley, Reading, MA.
Winer, B. J. (1971). Statistical Principles in Experimental Design, McGraw-Hill,
New York.
Wu, K. (1985). A flexible speech analysis-synthesis system for voice conversion,
Masters Thesis, University of Florida, Gainesville.

197
Yegnanarayana, B., Naik, J. M. and Childers, D. G. (1984). Voice simulation:
factors affecting the quality and naturalness, 10th International Conference on
Computational Linguistics, 22nd Annual Meeting of the Association for
Computational Linguistics, Proceedings of Coling84, Stanford University, Stanford,
530-533.

BIOGRAPHICAL SKETCH
Ke Wu was born in Guangzhou (Canton), China, and received the Diploma
(equivalent to the Bachelor of Science degree) in mathematics from Zhongshan
(Sun Yat-sen) University, Guangzhou, China, in December 1977. Upon his
graduation, he was employed by the Guangzhou Institute of Electronic Technology,
Academia Sinica, as a Research Assistant in the Computer Laboratory. He was
admitted to the Graduate School, University of Florida, to pursue graduate studies in
August 1984 and completed his master's degree in electrical engineering in
December 1985. Mr. Wu has been a Graduate Research Assistant since 1984 in Dr.
D. G. Childers' Mind-Machine Interaction Research Center and worked for the IFAS
(Institute of Food and Agricultural Sciences, University of Florida) Computer
Network from May 1988 to May 1989. His current area of interest is digital signal
processing with applications to computer-based speech analysis, synthesis, and
recognition. Mr. Wu is scheduled to complete his Ph.D. degree in May 1990.

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
Donald G. Childers, Chairman
Professor of Electrical Engineering
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
Jack R. Smith
Professor of Electrical Engineering
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
A. Antonio Arroyo
Associate Professor of Electrical Engineering
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
Jose C. Principe
Professor of Electrical Engineering
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
Howard B. Rothman
Professor of Speech




165
Meanwhile, a filter order of 12 to 16 seemed to be the appropriate choice for most
designs.
There were two different approaches for vowels in this study that achieved
100% recognition rates for both males and females, namely, the third formant
information (a filter order of 12) with the PDF distance measure and the reflection
coefficients (filter orders of 12, 16, and 20) with the EUC distance measure. Both
used the recognition Scheme 3.
A comparative study of acoustic parameters using Fishers discriminant ratio
criterion indicated that for the filter order of 12, the analytical inferences from the
values of J] and J4 and the expected probabilities of error using various acoustic
features proved comparable to the empirical results of the experiments with the PDF
distance measure for gender recognition. Furthermore, Jj the Mahalanobis distance
appeared to be more reliable for predicting the performance of a gender classifier
than J4.
The statistical tests and recognition results in fine analysis suggested that both
fundamental frequency and formant characteristics were reliable indicators for
gender discrimination. For both male and female speakers, glottal vibration
patterns were relatively less variable than the vocal tract shapes when different
vowels were pronounced.
Formant frequencies from male vowels were lower than those from female
vowels. In general, bandwidths of the formants from male vowels were narrower
than those from female vowels and the amplitudes of the formants from male vowels
were higher than those from female vowels. Statistical differences of the first,
second, third, and fourth formant frequencies; first, second, and fourth formant
bandwidths; and second, third, fourth formant amplitudes between male and female
speakers were highly significant. This suggested steeper spectral slopes for female


54
(a)
FREQUENCY (Hi)
(b)
Figure 4.8 (a) Two cepstral coefficient templates of voiced fricatives
for male and female speakers in the upper layer.
(b) The corresponding spectra.


127
Table 6.1 Summary of Analysis of Variance
Source of Variation
Corresponding MS
Between subjects
* Factor A
* Subjects within groups
MS
MS
a
subj w.groups
Within subjects
* Factor B
* A x B effect
* B x subjects within groups
MSb
MSab
MSb x subj w.groups


xml version 1.0 encoding UTF-8
REPORT xmlns http:www.fcla.edudlsmddaitss xmlns:xsi http:www.w3.org2001XMLSchema-instance xsi:schemaLocation http:www.fcla.edudlsmddaitssdaitssReport.xsd
INGEST IEID EJQB0CGLR_BY8BJT INGEST_TIME 2012-02-22T19:33:05Z PACKAGE UF00082197_00001
AGREEMENT_INFO ACCOUNT UF PROJECT UFDC
FILES


Internet Distribution Consent Agreement
In reference to the following dissertation:
AUTHOR:
TITLE:
Wu, Ke
Towards Automatic Gender Recognition fromSpeech (record number:
1583934)
PUBLICATION DATE: 1990
i, UF IMU
as copyright holder for the
aforementioned dissertation, hereby grant specific and limited archive and distribution rights to
the Board of Trustees of the University of Florida and its agents. I authorize the University of
Florida to digitize and distribute the dissertation described above for nonprofit, educational
purposes via the Internet or successive technologies.
This is a non-exclusive grant of permissions for specific off-line and on-line uses for an
indefinite term. Off-line uses shall be limited to those specifically allowed by "Fair Use" as
prescribed by the terms of United States copyright legislation (cf, Title 17, U.S. Code) as well as
to the maintenance and preservation of a digital archive copy. Digitization allows the University
of Florida to generate image- and text-based versions as appropriate and to provide and enhance
access using search software.
This grant of permissions prohibits use of the digitized versions for commercial use or profit.
Signature of Copyright Holder
NU
Printed or Typed Name of Copyright Holder/Licensee
Personal information blurred
Date of Signature
Please print, sign and return to:
Cathleen Martyniak
UF Dissertation Project
Preservation Department
University of Florida Libraries
P.O. Box 117007
Gainesville, FL 32611-7007
5/28/2008


67
(5) Finally, calculate the value of PE from ni, n2, and p using
Equations (4.34) to (4.38).
The analytical results were also compared to the empirical ones obtained from
experiments using recognition schemes. Section 5.3 in the next Chapter presents a
detailed discussion.


147
Table 7.5
Significance of differences between male and female
sustained vowels
DIFFERENCES BETWEEN CONCLUSION
MALE/FEMALE SUSTAINED VOWELS
Fundamental frequencies FO highly significant M Formant
FI
highly significant M B1
highly significant M A1
not significant
F2
highly significant M B2
highly significant M A2
highly significant M>F
F3
highly significant M B3
not significant
A3
highly significant M>F
F4
highly significant M B4
highly significant M A4
highly significant M>F


109
g(n)
v(n)
r(n)
Figure 6.1 A simplified speech production model.


20
The study populations were relatively small for most
investigations. Sometimes the database used consisted of less than 10
subjects for each gender (Ingemann, 1968; Schwartz and Rine, 1968;
Brown and Feinstein, 1977), making the interpretation of the results
unreliable.
The results of the listening tests may depend on the gender
distribution of the testing panel because males and females may use
different judging strategies. However, this point usually was not
emphasized so that the conclusions claimed from listening tests may be
biased (Coleman, 1976; Carlson, 1981).
In summary, previous research has measured and investigated the
physiological or anatomical parameters for each gender. Under certain
assumptions, the relationship between anatomical parameters and some of the
acoustic features was established. The major acoustic parameters responsible for
perceptually discriminating a speakers gender from voice were investigated and
tested. However, no attempt was made to automatically classify male/female voices
by objective feature measurements. The vowel characteristics for each gender were
inaccurate because of the weakness of analog techniques. Various hypotheses and
preliminary results need to be verified on a more comprehensive database. All these
constituted the underlying problems and impetuses for this research.
1.4 Objectives of this Research
This research sought to address these problems through two specific
objectives.
One objective of this study was to explore the possible effectiveness of digital
speech processing and pattern recognition techniques for an automatic gender
recognition system. Emphasis was placed on the investigation of various objective


186
Table B.8
Results from the exclusive recognition scheme 3
PDF (Probability Density Function) distance measure
Filter order = 20
CORRECT RATE %
MALE
FEMALE
TOTAL
ARC
78.0
56.0
67.3
Sustained
LPC
88.9
72.0
80.8
Vowels
RC
85.2
60.0
67.3
CC
88.9
60.0
75.0
ARC
N/A
N/A
N/A
Unvoiced
LPC
66.7
40.0
53.8
Fricatives
RC
63.0
48.0
55.8
CC
77.8
36.0
57.7
ARC
92.6
24.0
59.6
Voiced
LPC
92.6
48.0
71.2
Fricatives
RC
96.3
52.0
75.0
CC
74.1
68.0
71.2


17
indicated that gender identification was not significantly affected by such filtering;
listeners accuracy in gender recognition remained high for all three experimental
conditions, showing that gender identification can be made accurately from acoustic
information available in different portions of the broadband speech spectrum.
1.3.3 Summary of Previous Research
By reviewing the literature it can be concluded that the revealed information of
gender identification from previous research was extensive. However, it is clear that
much work still remains to be done.
What has not been completed
The relative importance of the FO versus VTR characteristics for
perceptual male or female voice quality is still controversial. The belief
that the FO is the strongest cue to gender seems to be substantiated by the
evidence. There is a hypothesis that in situations in which the role of FO
is diminished by deviancy, the effect of VTR characteristics upon gender
judgments increases from a minimal level to take on a large role equal to
and even sometimes greater than that played by FO (Carlson, 1981). But
this hypothesis remains unproven.
It is well known now that not only the vibration frequency of the
glottis (FO) but also the shape of the glottal excitation wave as well are
important factors which greatly affect speech quality (Rothenberg, 1971;
Holmes, 1973). Differences of glottal excitation wave shapes for male
and female were observed and investigated (Monsen and Engebretson,
1977; Karlsson, 1986; Holmberg and Hillman, 1987). But perceptive
justification of these characteristics was still limited (Carrell, 1981) and
the inverse filtering techniques need to be improved and more data
should be analyzed.


185
Table B.7
Results from the exclusive recognition scheme 3
PDF (Probability Density Function) distance measure
Filter order =16
CORRECT RATE %
MALE
FEMALE
TOTAL
ARC
92.6
84.0
88.5
Sustained
LPC
92.6
92.0
92.3
Vowels
RC
96.3
88.0
92.3
CC
96.3
84.0
90.3
ARC
66.0
48.0
57.7
Unvoiced
LPC
92.6
64.0
78.8
Fricatives
RC
70.4
64.0
67.3
CC
74.1
64.0
69.2
Voiced
ARC
92.6
72.0
82.7
LPC
100.0
88.0
94.2
RC
100.0
80.0
90.4
CC
88.9
72.0
80.8
Fricatives


188
Table B.10
Results from the inclusive recognition scheme 3
PDF (Probability Density Function) distance measure
Filter order =16
CORRECT RATE %
MALE
FEMALE
TOTAL
ARC
100.0
100.0
100.0
Sustained
LPC
100.0
100.0
100.0
Vowels
RC
100.0
100.0
100.0
CC
100.0
100.0
100.0
ARC
96.3
100.0
98.1
Unvoiced
LPC
100.0
100.0
100.0
Fricatives
RC
100.0
100.0
100.0
CC
100.0
100.0
100.0
ARC
100.0
100.0
100.0
LPC
100.0
100.0
100.0
RC
100.0
100.0
100.0
CC
100.0
100.0
100.0
Fricatives


110
s(n) =2 h(n-mN)
(6.2)
m
where
h(n) = g(n)*v(n)*r(n)
(6.3)
Thus in frequency domain,
H(jw) = G(jw) V(jw) R(jw)
(6.4)
and
S(jto) = Z H(jw) e"jmN
(6.5)
m
As we see from Equations (6.4) and (6.5), formant estimation from speech spectrum is
influenced by the effects of
o the periodic vocal fold excitation,
o the glottal filter spectrum envelope,
o the vocal tract filter spectrum, and
c the lip radiation filter spectrum.
There two conditions under which the frame-based conventional LPC provides
a correct solution for periodic excitation:
1. the prediction error is minimized over an interval in which either
the excitation is absent or the signal is exactly predictable by linear
prediction and
2. the impulse response of the all pole filter dies to zero at least p (the
order of the LPC filter) samples before the start of next period.
Usually, it is difficult to meet these conditions. For instance, a periodic
excitation of the all-poll filter introduces errors in the conventional linear prediction
analysis (Atal, 1974).


46
OUTPUT
Figure 4.4 The block diagram of LPC log likelihood
distance computation.


164
Results showed that most of the LPC derived acoustic parameters worked very
well for automatic gender recognition. Among them, the reflection coefficients
combined with the Euclidean measure was most robust for sustained vowels (100%)
and the results were quite consistent and filter order independent. While the
cepstral distortion measure worked extremely well for unvoiced fricatives (90.4%),
the LPC Log likelihood distortion measure, reflection coefficient combined with
Euclidean distance, and cepstral distortion measure were the best options for voiced
fricatives (98.1% for all three). Hence carefully selecting the acoustic feature vector
and combining it with an appropriate distance measure was important for gender
recognition for a given type of phoneme. The use of unvoiced fricatives achieved a
high recognition rate indicating that the speakers gender could also be captured
only from the speakers vocal tract characteristics.
A within-gender and within-subject averaging technique was critical for
generating appropriate test and reference templates. To a great extent, averaging on
both test and reference templates eliminated the intra-subject variation within
different vowels or fricatives of a given subject and emphasized features
representing this subjects gender. This might imply that the gender information is
time-invariant, phoneme independent, and speaker independent.
The Euclidean distance (EUC) measure appeared to be the most robust as well
as the simplest of the distance measures. Using the Euclidean distance measure was
more effective than using the Probability Density Function (PDF). Most recognition
rates of using the EUC were higher than those using the PDF and were notably
constant across all filter orders that were tested. The EUC distance measure
operated more uniformly on male and female groups than did the PDF. A possible
reason for this inferior PDF performance is due to the small ratio of the available
number of subjects per gender to the number of elements per feature vector.


13
direction frequently, complete vocal fold approximation is less probable. A
research on acoustic correlates of breathiness was performed by Klatt (1987) in
which three breathiness parameters (i.e., first harmonic amplitude, turbulence noise
and tracheal coupling) were proposed. A detailed discussion of controlling these
parameters was presented in Klatts paper (1987).
A new trend to find the features responsible for gender identification is to
apply the approach of synthesis. The work done by Yegnanarayana et al. (1984),
Wu (1985), Childers et al.(1985a, 1985b, 1987, 1989), and Pinto et al. (1989)
represented this aspect. In their experiments, the speech of a talker of one gender
was converted to sound like that of a talker of the other gender to exam factors
c
responsible for distinguishing gender features. They found that the fundamental
frequency, the glottal excitation waveshape and the spectrum, which included
formant locations and bandwidth, overall spectral shape and slope, and energy, are
crucial control parameters.
1.3.2 Acoustic Cues Responsible for Gender Perception
As part of current interest in speaker recognition, investigators have sought to
specify gender-bearing attributes of the human voice. Under normal speaking and
listening circumstances, listeners have little difficulty distinguishing the voices of
adult males and females, suggesting that the acoustic parameters which underlie
gender identity are perceptually prominent. The judgment of adult gender is
strongly influenced by acoustic variables reflecting gender differences in laryngeal
size and mass as well as vocal tract length. However, the issue of which specific
acoustic cues are mostly responsible for gender identification has not been
definitively resolved. Such a controversy partially dominated the previous research.
A series of experiments run by Schwartz (1968) and Ingemann (1968)
employed voiceless fricatives spoken in isolation as auditory stimuli and it was found


11
Z>48
-2M8
at- 't'r:
5 = 1.732 re
. (b)
Figure 1.6 Speech signals for (a) male and (b) female speakers
for the utterance We were away a year ago.


8
AMPLITUDE (db)
Figure 1.4 An example of male and female formant features.
PITCH PERIOD
(msec. )
Figure 1.5 Fundamental frequency changes for two speakers
for the utterance We were away a year ago.


154
PDF
MALE
FEMALE
TOTAL
Positions (F1,F2,F3,F4)
96.3%
100.0%
98.1%
Bandwidths (B1,B2,B3,B4)
85.2%
80.0%
82.7%
Amplitudes (A1,A2,A3,A4)
92.6%
96.0%
94.0%
(e) Using entire formant information (F1,B1,..., through A4)
o
MALE
FEMALE
TOTAL
EUC
96.3%
100.0%
98.1%
PDF
96.3%
96.0%
96.2%
(f) Using entire formant information and pitch information
MALE FEMALE TOTAL
EUC
PDF
96.3% 100.0% 98.1%
96.3% 96.0% 96.2%
7.2.2 Discussion
1. With regard to the relative importance of each individual formant
feature for objectively distinguishing the speakers gender, it was


86
log likelihood distance), and 1 of them tied (Scheme 4 for
unvoiced fricatives with the cepstral distortion measure).
This demonstrated that performance improved from filter
orders of 8 to 12.
o Comparing the recognition rates between filter orders of 12
and 16, out of 12 rates, 4 of them dropped. Two of them
increased (unvoiced fricatives Scheme 4 with the LPC
distance measure and sustained vowels Scheme 3 with the
cepstral distortion measure). The remaining 6 were equal.
This indicated that by using Scheme 3 or 4, there was not a
a
distinct difference between filter orders of 12 and 16.
o Comparing the recognition rates between filter orders of 16
and 20, out of 12 rates, 8 of them dropped. Three of them
increased and one was tied. Performance degraded from
filter orders of 16 to 20.
o If only the cepstral distortion measure was applied (Figure
5.2), the highest recognition rates appeared at filter orders of
12 and 16 for all three phoneme categories. However, if only
the LPC log likelihood distortion was used (Figure 5.1), the
highest recognition rates were reached with filter orders of 8
and 20. Therefore, there was no manifest trend of
performance difference for the LPC log likelihood distortion
measure across filter orders. Since the cepstral distortion
measure showed better results than the LPC log likelihood,
the filter orders of 12 to 16 seemed to be best options for the
overall design.


Figure 5.6 Statistical results of 109 male and female
recognition rate pairs.


91
One interesting observation from Tables B.9 and B.10 in Appendix B was that
when using the PDF distance measure with the inclusive procedure, the correct
recognition rates were extremely high for all types of phonemes and feature vectors,
except for the ARC with unvoiced fricatives (it was still 98.1%). In addition, the
LPC, RC and CC were all able to provide 100% correct gender recognition from
unvoiced fricatives! However, when using the PDF with exclusive procedure (Table
B.7), the correct recognition rate decreased significantly, with drops ranging from a
minimum of 5.8% to a maximum of 40.4% (for unvoiced fricatives, drops ranging
from a minimum of 21.2% to a maximum of 40.4%). On the other hand, the EUC
distance measure operated more evenly. From inclusive to exclusive procedures
(Table B.3), recognition rates dropped very little, ranging from a minimum of 0%to
a maximum of 3.9%. The rates for four feature parameters did not decrease at all.
Figures 5.5(a) and (b) are graphic illustrations of Tables B.3 and B.9. It can be seen
that there was only minor performance difference between inclusive and exclusive
procedures when the EUC distance measure was used. Our results also suggested
that the PDF excelled at capturing the information from an individual subject. As
long as the data of the subject itself was included in the reference data set, the PDF
was able to pick up such specific information easily and then identify the subjects
gender accurately. Therefore, the correct recognition rates of the inclusive
procedure for the PDF were extremely high. However, the PDF recognition rates of
the exclusive procedure were much lower, indicating that the PDF was clearly
inferior at capturing gender information from the other average subject with the
same gender. On the other hand, the EUC distance measure was good at capturing
gender information from the other subjects without including the characteristic of
the test subject itself.


CHAPTER 7
EVALUATION OF VOWEL CHARACTERISTICS
As discussed in Chapter 6, sequential adaptive approaches that track the
time-varying parameters of the vocal tract and update the parameters during the
glottal closed phase interval can reduce the formant estimation error. Formant
information in the fine analysis was obtained by a closed-phase WRLS-VFF
method, which is one of these approaches. The database consisted of speech and
EGG data collected from 52 normal subjects (27 males and 25 females), for each of
w'hom ten sustained vowels were included.
7.1 Vowel Characteristics of Gender
7.1.1 Fundamental Frequency and Formant Features for Each Gender
The estimated average fundamental frequency for ten sustained vowels of all
subjects were calculated based upon a modified cepstral algorithm. Results are
shown in Table 7.1. Data are represented as means standard error (SE). Formant
information on the same database was obtained by first using the closed-phase
WRLS-VFF method to track the parameters of a time-varying all-pole model of the
vocal tract. Then, a smooth spectrum was acquired by using a FFT on 12
coefficients computed by WRLS-VFF. A peak-picking technique was finally
applied to obtain the formant positions, bandwidths, and amplitudes. The formant
amplitudes are all referred to the zero db line of the spectrum with the gain factors
130


123
(Winer, 1971). Therefore, a two-way ANOVA was used to
perform statistical tests. The difference between each individual
feature or grouped features in terms of male/female groups was
then compared.
o Automatic pattern recognition. Basically, this approach is similar to
the pattern recognition approach of the coarse analysis except that
the individual or grouped vowel feature(s) were used to form the
reference and test templates. Examples are using only the
fundamental frequency, or only the formant frequencies or
bandwidths (but across all formants) to form the templates. The
automatic recognition scheme 3 was then applied on these
templates. The EUC or PDF distance measures were utilized.
Finally, the recognition error rates for different features were then
compared.
6.4.1 Two-wav ANOVA Statistical Testing
The analysis of variance, as the name indicates, deals with variances rather
than with standard deviations and standard errors. The rationale of the analysis of
variance is that the total sum of squares of a set of measurements composed of
several groups can be analyzed or broken down into specific parts, each part
identifiable with a given source of variation. This is called the partition of the total
variation. In the simplest case, the total sum of squares is analyzed in two parts: a
sum of squares based upon variation within the several groups, and a sum of squares
based upon the variation between the group means. Then, from these two sums of
squares, independent estimates of the population variance are computed.


There existed larger formant frequency, bandwidth, and amplitude
changes across the different vowels for female speakers, and the
corresponding standard errors for each vowel were greater than
those for male speakers. These observations suggested that
female voices were more variable, which may contribute to the
perceptually melodic quality of female voices.


5. In further experiments, different weighting factors could be applied
to different phoneme feature vectors according to the probabilities of
their appearances in real situations. By this way, time-averaging
would be better approximated.
5.2.2 Comparative Study of Acoustic Features
5.2.2.1 LPC Parameters Versus Cepstrum Parameters
Although both LPC log likelihood and cepstral distortion measures were
effective tools in classifying male/female voices, the performance of the latter was
better than the former.
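For reference, cepstrum coefficients of the all-pole model can be obtained directly from the LPC coefficients by the standard recursion sketched below in Python; this is illustrative only and is not claimed to be the exact conversion code used in this study.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of an all-pole model H(z) = G / (1 - sum_k a[k] z^-k).

    Uses the standard recursion c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k],
    with a[] indexed from 1 and a[n] = 0 for n beyond the filter order.
    """
    p = len(a)                       # filter order (e.g., 12)
    c = [0.0] * (n_ceps + 1)         # c[0] (log gain) left at 0 here
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

# Example: 12 cepstrum coefficients from hypothetical order-12 LPC coefficients.
ceps = lpc_to_cepstrum([0.9, -0.4, 0.1] + [0.0] * 9, 12)
```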
1. By comparing Figures 5.1 and 5.2, it is noted that except in the
category of voiced fricatives, in which the performances were
competitive (both measures were able to achieve recognition rates of
98.1%), cepstrum coefficient features proved to be more sensitive
than LPC coefficients for gender discrimination. By choosing
appropriate schemes and filter orders, the recognition rates for the
cepstral distortion measure reached 92.3% for vowels (Scheme 3
with a filter order of 12 and Scheme 4 with filter orders of 12 and
16) and 90.4% for unvoiced fricatives (Scheme 4 with a filter order of
16). By using the LPC log likelihood distortion measure, the
corresponding highest recognition rates were 88.5% for vowels
(Scheme 4 with a filter order of 20) and 78.9% for unvoiced
fricatives (Scheme 3 with a filter order of 20).
2. By comparing tables in Appendix A, it is noted that the cepstral
distortion measure operated more evenly between male and female
groups, showing this feature has some normalization
characteristics. As seen in Table A.1, there existed large


unvoiced fricatives is concentrated in the higher frequency portion of the spectrum,
while the energy of voiced fricatives is more or less equally distributed.
4.4.3 Nearest Neighbor Decision Rule
The other major step in the pattern recognition model is the decision rule,
which chooses the reference template that most closely matches the unknown test
template. Although a variety of approaches are applicable, only two decision rules
have been used in most practical systems, namely, the K-nearest neighbor rule
(KNN rule) and the nearest neighbor rule (NN rule).
The KNN rule is applied when each reference class (e.g., gender) is
represented by two or more reference templates (e.g., as would be used to make the
reference templates independent of the speaker). The KNN rule operates as follows:
Assume we have M reference templates for each of two genders, and for each
template a distance score is obtained. If we denote the distance for the ith reference
template of the jth gender as D_{i,j} (1 \le i \le M and j = 1, 2), this set of distance scores,
D_{i,j}, can be ordered such that

D_{(1),j} \le D_{(2),j} \le \cdots \le D_{(M),j}    (4.24)

Then for the KNN rule we compute the average distance (radius) for the jth gender
as

r_j = \frac{1}{K} \sum_{i=1}^{K} D_{(i),j}    (4.25)

and we choose the index j* with the smallest average distance as the recognized
gender


4. On the other hand, the amplitudes of the formants from male vowels
were higher than those from female vowels, with the exception of A1.
The statistical differences of A2, A3, and A4 between male and
female speakers were also highly significant. Even though there was
no overall statistically significant difference for A1, the A1 of most
vowels (except /IY/, /U/, and /OO/) of males was still higher than
that of females.
5. The combination of analyses 3 and 4 implies a steeper spectral slope
for females. Two explanations for this exist. First, in normal and
loud voices, female glottal waveforms indicated lower vocal fold
closing velocity, lower ac flow, and a proportionally shorter
closed phase of the cycle (Holmberg and Hillman, 1987). As a
result, the speech spectrum for the female voice appeared with a
steeper slope, and thus the amplitudes of the formants were lower and the
bandwidths of the formants broader for the female voice. However,
considering that in the analysis stage the closed-phase
WRLS-VFF algorithm was supposed to eliminate the influence of the vocal
fold vibration factor, another explanation for the steeper slope of the
female voice exists. It was indicated first by Rabiner and Schafer
(1978) that while the first formant bandwidths are primarily
determined by wall loss, the higher formant bandwidths are
primarily determined by radiation loss. In between, the second and
third formant bandwidths are determined by a combination of
these two loss mechanisms. The results of our experiments imply
that these two losses were larger for female subjects.
6. In Figures 7.2 to 7.5, it is noted that the curves of frequency,
bandwidth and amplitude for males are flatter across different


7.1.2 Comparison with Peterson and Barney's Results
Averages of fundamental and formant frequencies and amplitudes of ten
vowels by 76 speakers were obtained by Peterson and Barney (1952), using
recorders and a sound spectrograph. Their database consisted of 33 males, 28
females, and 15 children. The results are listed in Table 7.3. Instead of sustained
vowels, the vowels they used were picked from CVC (consonant-vowel-consonant)
words. The formant amplitudes were all referred to the amplitude of the first
formant in /OW/. Bandwidth information and fourth formant information (F4, B4,
A4) were not provided in their paper. The standard error for each measurement of
each vowel was also not provided. For comparison, the male/female vowel triangles
using their data and the data in this experiment are overlapped in Figure 7.9.
There are some differences between our results and theirs. For
example, the vowel triangles for both males and females from our experiment
showed less scattering than those in their experiments. The front-high vowels (e.g.,
/IY/) of our experiment had higher first formants and lower second formants than
theirs. While the middle-low vowels (e.g., /A/) of our experiment demonstrated
lower first formants and higher second formants than theirs, our back-high vowels
(e.g., /OO/) showed both higher first and second formants than theirs. These
differences may come from the many factors that were involved. The databases were
different (27 males and 25 females for ours versus 33 males and 28 females for theirs).
The means of producing vowel utterances were different (sustained for ours and
CVC for theirs). The data collection methods were different (direct digitization for
ours and tape recording for theirs). The measurement techniques were different
(WRLS-VFF for ours and sound spectrography for theirs). Although variations
existed between estimated vowel formants, the data in this study provide more


acoustic parameters and distance measures. The optimal combination of these
parameters and measures was sought. The extracted acoustic features that are
most effective for classifying a speaker's gender objectively were characterized. Efficient
recognition schemes and decision algorithms for this purpose were developed.
The other objective of this study was to validate and clarify hypotheses
concerning some acoustic parameters affecting the ability of algorithms to
distinguish a speaker's gender. Emphasis was placed on the extraction of accurate
vowel characteristics, including fundamental frequency and formant features such as
formant frequency, bandwidth, and amplitude for each gender. The relative
importance of these characteristics for gender identification was evaluated.
1.5 Description of Chapters
In Chapter 2, an overview of the research plan is given and a brief description
of the coarse and fine analyses is presented. The database and the techniques
associated with data collection and preprocessing are discussed in Chapter 3. The
details of the experimental design based on coarse analysis are described in Chapter
4. Asynchronous LPC analysis is reviewed. Different acoustic parameters, distance
measures, template formations, and recognition schemes are provided. The
recognition decision rule and the resubstitution (inclusive) and exclusive procedures
are proposed as well. In addition, the concept of the Fisher's discriminant ratio criterion is reviewed.
The recognition performance based on coarse analysis is assessed in Chapter 5.
Results of comparative studies of various phonemes, acoustic features, distance
measures, recognition schemes, and filter orders are reported. The gender
separability of acoustic features is also analyzed by using the Fisher's discriminant
ratio criterion. Chapter 6 expounds on the detailed experimental design of the fine
analysis. In particular, the advantages of pitch synchronous closed-phase analysis are




F = \frac{MS_{B}}{MS_{B \times \text{subj w. groups}}}    (6.19)

To test the hypothesis that there is no significant interaction effect of Factors A
and B, the appropriate F ratio is

F = \frac{MS_{AB}}{MS_{B \times \text{subj w. groups}}}    (6.20)

The mean square in the denominator of the last two F ratios is sometimes called
MS_{error(within)} since it forms the denominator of F ratios used in testing effects which
can be classified as part of the within-subject variation. Detailed computational
procedures for partitioning the relevant sums of squares with equal or unequal group
sizes are given in Winer's book (1971).
Formant characteristics such as frequencies, bandwidths, and amplitudes
depend on (or are influenced by) two factors: gender, which may be
denoted as Factor A, and vowels, which may be denoted as Factor B. Each
experimental subject is observed under more than one vowel, and thus our
experiments should be referred to as two-factor experiments having repeated
measures on the same subjects. Hence, the type of two-way ANOVA discussed above,
with unequal group sizes of 27 versus 25, two levels (two genders) in Factor A, and ten
levels (ten vowels) in Factor B, was used to perform the statistical test.
6.4.2 Automatic Recognition by Using Grouped Features
As we mentioned above, this approach is similar to the pattern recognition
model of the coarse analysis. In the coarse analysis, LPC, autocorrelation,
cepstrum, and reflection coefficients were extracted from the speech signal. Averaging


Lass et al. (1976) also reported that there were large gender differences in
their results. In all experimental conditions females were recognized at a
significantly lower level, which was in agreement with the results of Coleman (1971)
mentioned above. In another study supportive of this point, Brown and Feinstein
(1977) also used an electrolarynx (120 Hz) to control F0 so that VTR was the variable.
Identification of male speakers was 84% correct and identification of female
speakers was 67% correct. Brown and Feinstein also found, as in the Coleman
(1971) study, that centralized spectra were more ambiguous to listeners. Again,
VTR appeared to play a determinant role in gender identification in the absence of
F0.
In a later experiment, the effect of temporal speech alterations on speaker
gender and race identification was investigated. Lass and Mertz (1978) found that
gender identification accuracy remained high and unaffected by temporal speech
alterations when the normal temporal features of speech were altered by means of
the backward playing and time compressing of speech samples. They concluded
that temporal cues appeared to play a role in speaker race, but not speaker gender
identification.
In another study concerned with the effect of phonetic complexity on speaker
gender identification, Lass et al. (1979) found that phonetic complexity did not
appear to play a major role in gender judgments. No regular trend was evident
from simple to complex auditory stimuli, and listeners' accuracy was as great for
isolated vowels as it was for sentences.
In an attempt to investigate the relative importance of portions of the
broadband frequency speech spectrum in gender identification, Lass et al. (1980)
constructed three recordings representing the three experimental conditions in the
study: unfiltered, 255 Hz low-pass filtered, and 255 Hz high-pass filtered. The
recordings were played back to a group of 28 judges. The results of their judgments


of separability, which are generalizations of one kind or another of the Fisher's
discriminant ratio concept (Childers et al., 1982; Parsons, 1986). The ratio usually
serves as a criterion for selecting features for discrimination.
The ability of a feature to separate classes depends on the distance between
classes and the scatter within classes (generally there will be more than two classes).
This separation is estimated by representing each class by its mean and taking the
variance of the means. This variance is then compared to the average width of the
distribution for each class (i.e., the mean of the individual variances). This measure
is commonly called the F ratio:
F = \frac{\text{Variance of the means (over all classes)}}{\text{Mean of the variances (within classes)}}    (4.27)

The F ratio reduces to Fisher's discriminant when it is used for evaluating a
single feature and there are only two classes. For this reason, the F ratio is also
referred to as the generalized Fisher's discriminant.
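A minimal Python sketch of this criterion (illustrative only, with hypothetical per-class feature samples) is:

```python
import numpy as np

def f_ratio(classes):
    """Generalized Fisher discriminant: variance of the class means over
    the mean of the within-class variances, for one scalar feature."""
    means = np.array([c.mean() for c in classes])
    variances = np.array([c.var(ddof=1) for c in classes])
    return means.var(ddof=1) / variances.mean()

# Hypothetical F0 samples (Hz) for the two gender classes.
male_f0 = np.array([110.0, 118.0, 125.0, 132.0])
female_f0 = np.array([200.0, 212.0, 220.0, 231.0])
print(f_ratio([male_f0, female_f0]))  # large value -> good separability
```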
In the case of pattern recognition, there are vectors of features, f, and
observed values of f for all the classes we are interested in recognizing. Then two
covariance matrices can be calculated, depending on how the data are grouped.
First, the covariance for a single recognition class can be computed by
selecting only the feature measurements for class i. Let any vector from this class be f_i.
Then the within-class covariance matrix for class i is

W_i = \langle (f_i - \mu_i)(f_i - \mu_i)^T \rangle    (4.28)

where \langle \cdot \rangle represents the expectation or averaging operation and \mu_i represents the
mean vector for the ith class: \mu_i = \langle f_i \rangle. W stands for "within." Notice that each of


CHAPTER 8
CONCLUDING REMARKS
8.1 Summary
Unlike automatic speech and speaker recognition, automatic gender
recognition had never been proposed as a stand-alone problem. Although contemporary
research on speech has included the investigation of physiological and acoustic
gender features and their correlations with perceived gender differences, no attempt
was made before this study to classify a speaker's gender objectively by using
features automatically extracted by a computer.
The main purpose of this research was to investigate the possible effectiveness
of digital speech processing and pattern recognition techniques for an automatic
gender recognition system. Emphasis was placed on the analysis of various
objective acoustic parameters, distance measures, optimal combinations of the
parameters and measures, and most efficient recognition schemes. In addition,
some hypotheses concerning acoustic parameters that influence our ability to
distinguish a speaker's gender were clarified. Emphasis was placed on the extraction of
accurate vowel characteristics including fundamental frequency and formant
features such as frequencies, bandwidths and amplitudes for each gender.
The significance of the proposed research could include the following. First,
speech recognition and speaker identification or verification would be assisted if we
could automatically recognize a speaker's gender. This would allow different
speech analysis algorithms for each gender, facilitating speech recognition by


Figure 4.2 A digital model of speech production.


Figure 6.4 The algorithm flow of the WRLS-VFF.


Table A.3 Results from exclusive recognition schemes
LPC distance measure (log likelihood), filter order = 16

                          CORRECT RATE %
                        MALE   FEMALE   TOTAL
Sustained    Scheme 1   65.2     84.0    74.2
Vowels       Scheme 2   72.2     80.4    76.2
             Scheme 3   92.6     80.0    86.5
             Scheme 4   85.2     88.0    86.5
Unvoiced     Scheme 1   79.3     55.2    67.7
Fricatives   Scheme 2   65.2     63.2    64.2
             Scheme 3   77.8     72.0    75.0
             Scheme 4   66.7     80.0    73.1
Voiced       Scheme 1   65.7     81.0    73.1
Fricatives   Scheme 2   77.8     86.0    81.7
             Scheme 3   96.3     96.0    96.2
             Scheme 4   92.6    100.0    96.2


[Plot residue removed; x-axis: ELEMENT OF THE VECTOR (TEMPLATE); curves: MALE, FEMALE]
Figure 4.7 (a) Two reflection coefficient templates of unvoiced fricatives
for male and female speakers in the upper layer.
(b) The corresponding spectra.


1. errors in determining the positions of the harmonic peaks (in
practice, the peaks were read by visual inspection by a
person and then the F0 and formants were calculated).
2. errors in formant estimation due to the influences of the F0
and source-tract interaction.
3. large instrument errors (e.g., drift).
Lindblom (1962) estimated the accuracy of spectrographic
measurement to be approximately equal to the fundamental frequency
divided by 4. Flanagan (1955) and Nord and Sventelius (1979, quoted by
Monsen and Engebretson, 1983) suggested that a difference of about 50
Hz for the second formant and a difference of about 21 Hz for the first
formant was perceptible. Therefore, formant frequency estimation should
be as accurate as possible in vowel analysis as well as synthesis.
However, the most frequently referenced paper on acoustic phonetics,
which contains the most comprehensive measurements of the vowel
formants of American English (Peterson and Barney, 1952), may involve
measurement errors as pointed out by Monsen and Engebretson (1983),
especially for female and child subjects since the data were obtained by
spectrographic measurement.
The technique frequently employed to examine the ability of VTR
to serve as a gender cue was to standardize the F0 (and therefore eliminate
it as a variable) by utilizing an artificial larynx (Coleman, 1971, 1976;
Brown and Feinstein, 1977). This allows evaluation of VTR in a sample
that contains an F0 that is the same for both male and female subjects.
However, the electrolarynx itself has an unnatural sound that may confuse the
listener and depress the overall accuracy of perception.


our experiments, six frames were selected from the same sustained
utterance to form (by averaging) a template for this utterance.
Considerable redundant information might exist in these six
frames since all the sustained speech signals were quite stable.
Thus fewer frames (e.g., three frames) could be used and
redundant information in the speech signals would be reduced.
The same level of recognition rates would still be reached.
o Reduction of the feature vector dimensionality could be
considered (see the sketch after this list). This can be done by various methods of feature
covariance matrix manipulation (e.g., principal component
analysis) (Childers et al., 1982). However, by simply
observing the feature data plots and eliminating those elements that
account for little between-class variation, a preliminary reduction
could be made. For example, Figure 4.6(a) shows two "universal"
reflection coefficient templates (upper layer) for male and female
speakers. It is easily noted that elements 2, 3, 11, and 12 of these
reference templates accounted for little between-gender variation.
Consequently, these elements could be discarded to reduce the
dimensionality of the vector.
o Instead of using equal weighting factors in the averaging operation,
different weighting factors could be applied to different phoneme
feature vectors according to the probability of the phoneme's
appearance in a real speech situation. Thus, time-averaging would
be better approximated than what was done in this study.
Eventually, an integrated automatic gender recognition system
using words and sentences could be built.
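As a rough illustration of the covariance-based reduction mentioned above, the following Python sketch (hypothetical template data; not code from this study) projects 12-element templates onto their first few principal components.

```python
import numpy as np

# Hypothetical stack of 12-element reflection-coefficient templates,
# one row per utterance.
rng = np.random.default_rng(1)
templates = rng.normal(size=(52, 12))

# Principal component analysis via the eigendecomposition of the
# feature covariance matrix.
centered = templates - templates.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # largest variance first
components = eigvecs[:, order[:4]]          # keep 4 components

reduced = centered @ components             # 52 x 4 feature vectors
```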


What was neglected
First of all, research on automatically classifying male/female
voices by using objective feature measurements was entirely missing.
Almost all previous work was concentrated on subjective testing, which is
expensive, time- and labor-consuming, and subject dependent. Objective
gender recognition, which is reliable, inexpensive, and consistent, has not
been developed in parallel with subjective testing, but such work is
necessary, as we stated earlier.
Second, the influences of formant bandwidth and amplitude and
overall spectral shape on gender cues were not considered and
investigated. Traditionally, experiments on the contribution of vocal tract
characteristics to gender perception were concerned only with formant
frequencies (Coleman, 1976). The bandwidths of the lowest formant
depend upon vocal tract wall loss and source-tract interaction (Rabiner
and Schafer, 1976; Rothenberg, 1981) while bandwidths of the higher
formants depend primarily upon the viscous friction, thermal loss, and
radiation loss (Flanagan, 1972). These factors may be different for each
gender so that the bandwidths and overall spectral shape are different for
each gender. Bladon (1983) pointed out that male vowels appeared to
have narrower formant bandwidths and perhaps also a less steeply
sloping spectrum. All these areas require further investigation.
What was the weakness
The acoustic features were obtained by short-time spectral
analysis, which usually used analog spectrographic techniques.
Estimated F0 and formant frequencies may be inaccurate due to


Many physiological parameters of the male and female vocal apparatus have
been determined and compared. Fant (1976) showed that the ratio of the total
length of the female vocal tract to that of a male is about 0.87, and Hirano et al.
(quoted by Cheng and Guerin, 1987) showed that the ratio of the length of the
female vocal fold to that of the male is about 0.8. Titze (1987 and 1989) reported
that, anatomically, the female larynx also differs from the male larynx in thickness,
angle of the thyroid laminae, resting angle of the glottis, vertical convergence angle
in the glottis, and in other ways. The ratio of the length and the ratio of the area of
pharynx cavity of the female to that of the male are 0.8 and 0.82, respectively.
Similarly, we take 0.95 as the ratio of the length and 1.0 as the ratio of the area of
the oral cavity of the female to that of the male, respectively. The larger ratio for the area
of the oral cavity is due to the fact that the degree of openness of the oral cavity is
comparatively greater in the case of the female than in the case of the male (Ohman,
quoted by Fant, 1966). Ohman also suggested that a proportionally larger female
mouth opening is a factor to consider. Figure 1.3 illustrates the human vocal
apparatus.
The differences in physiological parameters can in turn induce differences in
acoustical parameters. When comparing male and female formant patterns, the
average female formant frequencies are roughly related to those of the male by a
simple scaling factor that is inversely proportional to the overall vocal tract length.
On the average, the female formant pattern is said to be scaled upward in frequency
by about 20% compared to the average male formant pattern (Figure 1.4). It is also
well known that the individual size of the vocal cavities and thus of the formant
pattern scale factor may vary appreciably depending upon the age and gender of the
speaker. Peterson and Barney (1952) measured the first three formant frequencies
present in ten vowels spoken by men, women, and children. They reported that male
formants were the lowest in frequency, women had a higher range, and children had


Figure 4.1 A pattern recognition model
for gender recognition from speech.


j^* = \arg\min_{j=1,2} r_j    (4.26)

When K is equal to 1, the KNN rule becomes the NN rule (i.e., it chooses the
reference template with the smallest distance as the recognized template).
The importance of the KNN rule is seen in word recognition when the number of
reference templates per class is from 6 to 12, in which case it has been shown that a
real statistical advantage is obtained using the KNN rule (with K = 2 or 3) over the
NN rule (Rabiner et al., 1979). However, since there was no previous knowledge of
the decision rule as applied to gender recognition, the NN rule was first used in this
preliminary experiment.
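A compact Python sketch of this decision rule (illustrative only; the template and distance names are hypothetical) is given below; with k = 1 it degenerates to the NN rule.

```python
import numpy as np

def knn_decision(test, refs_by_gender, k=1):
    """KNN rule: average the k smallest distances to each gender's
    reference templates and pick the gender with the smallest radius."""
    radii = {}
    for gender, refs in refs_by_gender.items():
        d = np.sort([np.linalg.norm(test - r) for r in refs])  # eq. (4.24)
        radii[gender] = d[:k].mean()                           # eq. (4.25)
    return min(radii, key=radii.get)                           # eq. (4.26)

# Hypothetical templates: M reference templates per gender.
rng = np.random.default_rng(2)
refs = {"male": rng.normal(0.0, 1.0, (6, 12)),
        "female": rng.normal(0.5, 1.0, (6, 12))}
print(knn_decision(rng.normal(0.5, 1.0, 12), refs, k=3))
```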
4.4.4 Structure of Four Recognition Schemes
To investigate how much averaging should be done for the test and reference
templates to gain the best performance for the gender recognizer, several
recognition schemes were designed. Table 4.1 presents a brief summary of these
schemes.
Table 4.1 Four recognition schemes

           Test template from   Reference template from
SCHEME 1   LOWER LAYER          MEDIAN LAYER
SCHEME 2   LOWER LAYER          UPPER LAYER
SCHEME 3   MEDIAN LAYER         UPPER LAYER
SCHEME 4   MEDIAN LAYER         MEDIAN LAYER
Scheme 1 is illustrated in Figure 4.9(a). In the training stage, one test
template for each test utterance (i.e., the lower layer), and one reference template


differences between male and female recognition rates for the LPC
recognizer with a filter order of 16. The largest gaps came from
Scheme 1 of the LPC. The differences were about 19% for vowels,
24% for unvoiced fricatives, and 15% for voiced fricatives. On the
other hand, the cepstral distortion measure worked evenly with the
same filter order. Table A.7 shows that the largest gaps were 6.6%
for vowels, 1.8% for unvoiced fricatives, and 2.4% for voiced
fricatives. Similar situations held for the results shown in other
tables and with inclusive schemes.
5.2.2.2 Other Acoustic Parameters
Tables 5.3 and 5.4 demonstrate results from exclusive recognition Scheme 3
with various filter orders and other acoustic parameters, using EUC and PDF
distance measures respectively. Figures 5.3 and 5.4 are graphic illustrations of
Tables 5.3 and 5.4.
1. The overall performance using RC and cepstrum coefficients was
better than that achieved using ARC and LPC coefficients, when the
Euclidean distance measure was adopted. The following
observations were made:
o The RC functioned extremely well with sustained vowels.
The recognition rates remained 100% for filter orders of 12,
16, and 20, showing that RC features captured gender
information from vowels effectively. The results were also
stable and filter order independent, as long as the filter order
was above 8. Table 5.3 shows that a 98.1% recognition rate
was reached by using the FFF, which was obtained using the
closed-phase WRLS-VFF method with a filter order of 12.


A strategy for choosing the forgetting factor \lambda_k may now be defined by
requiring \lambda_k to be such that

\Sigma_k = \Sigma_{k-1} = \cdots = \Sigma_0    (6.12)

In other words, the forgetting factor will compensate at each step for the new
error information in the latest measurement, thereby insuring that the estimation is
always based on the same error information. Thus, from (6.11),

\lambda_k = 1 - e_k^2 (1 + H_k^T P_{k-1} H_k)^{-1} / \Sigma_0    (6.13)

Therefore the WRLS-VFF algorithm can be expressed with the same
equations used for the WRLS algorithm, with the constant weighting factor \lambda
replaced by \lambda_k as shown in Equation (6.13). The effective memory of the algorithm
can be defined as (Cowan and Grant, 1985)

M = 1 / (1 - \lambda_k)    (6.14)

Furthermore, if \lambda_k becomes extremely small, then the memory also becomes
small. In practice, some applications require a certain memory size. In these cases,
we recommend that a minimal \lambda_k be defined as

\lambda_{min} = 1 - 1/N_a; \quad \text{if } \lambda_k < \lambda_{min} \text{ then } \lambda_k = \lambda_{min}    (6.15)
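A minimal Python sketch of an RLS update with this variable forgetting factor follows (illustrative only; the signal, model order, and \Sigma_0 value are hypothetical, and the exact WRLS-VFF implementation of this study is not reproduced).

```python
import numpy as np

def wrls_vff(x, p=12, sigma0=1.0, n_a=100.0):
    """Recursive least squares with a variable forgetting factor:
    lambda_k = 1 - e_k^2 / ((1 + H^T P H) * sigma0), floored at 1 - 1/N_a."""
    theta = np.zeros(p)              # all-pole model coefficients
    P = 1e3 * np.eye(p)              # inverse correlation matrix
    lam_min = 1.0 - 1.0 / n_a        # eq. (6.15)
    for k in range(p, len(x)):
        H = x[k - p:k][::-1]         # regressor of the p past samples
        e = x[k] - H @ theta         # a priori prediction error
        denom = 1.0 + H @ P @ H
        lam = max(1.0 - e * e / (denom * sigma0), lam_min)  # eqs. (6.13), (6.15)
        K = P @ H / (lam + H @ P @ H)                       # gain with forgetting
        theta = theta + K * e
        P = (P - np.outer(K, H @ P)) / lam
    return theta

# Example: fit an order-2 model to a synthetic AR(2) signal.
rng = np.random.default_rng(3)
sig = np.zeros(500)
for k in range(2, 500):
    sig[k] = 1.5 * sig[k - 1] - 0.7 * sig[k - 2] + 0.01 * rng.normal()
print(wrls_vff(sig, p=2))  # should approach [1.5, -0.7]
```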


00162.tif
F164 15d907be5bf95088f38b6f98125a6684 8407115
00163.tif
F165 f2e4d7b1dd64f034866cb39fec11e533
00164.tif
F166 89ef5476424b8dc8c5058959fcdf72c5
00165.tif
F167 bb30e7f6f46fc1c877be4495525905a0 8434065
00166.tif
F168 0bdf813ec6a5cebc085df6604de74cd2
00167.tif
F169 94960036833b205f27cede0dfc30c77f
00168.tif
F170 c4f9ebf13cb1074f88e4571c6acf2eae
00169.tif
F171 9ac772c3dd4089ef7486da8c35064d88
00170.tif
F172 bd6904e6efa6a329520480fa206c0c43
00171.tif
F173 03b5c5b08929daf9d4fa2027c82da95c
00172.tif
F174 86261eac13cc91b91a440cb35f5fc594
00173.tif
F175 8429b1ce436393f338f02296d721320b
00174.tif
F176 02237f17035f315d24f4493de52635a9
00175.tif
F177 235ec2729e2af9ccdce7a04386e64727
00176.tif
F178 ce638703860ccbc3b0692ad188387e28
00177.tif
F179 8515e15f71088bfd970fb5e3ae288052
00178.tif
F180 307befd1e3b0371b563ad851500f9ced 8471219
00179.tif
F181 22f8ea58bfb401731eced86bbbdf192d
00180.tif
F182 95933cc9bba93584c095b31729ce0356
00181.tif
F183 e0cfb98026704cc3e56a971f7dda3910
00182.tif
F184 88cc0bf37b03eab2dd7bff975d51fd1a
00183.tif
F185 e9b0ea04a478cd5dbf15233f4fe97151
00184.tif
F186 8e0775690b666ba51ea3e5e2d278e4f1
00185.tif
F187 f139c69e1484dd15502a8be15ae4a011
00186.tif
F188 20ab9b46cee98dc96a3d2b0b533fa9c0
00187.tif
F189 4f848213252c004f314b592b959a0bd6
00188.tif
F190 f8e9768175f9386cbbcbec75f06580a2
00189.tif
F191 6da11ec1a1c1ac0ee3b709592a5ff0b5
00190.tif
F192 1727613de742221bd59c1f537b5cdd44
00191.tif
F193 67d3028840ccf2398565aa5ec358cbcc
00192.tif
F194 5bd0d9ccb77886f7b13b48f123295745
00193.tif
F195 0e30bec7f213ed14ea59035a7a57ab6d
00194.tif
F196 db40296bb8690c140b21aa3097127ac9
00195.tif
F197 70df8e3aa2957c62174862c3aa15628b
00196.tif
F198 7636ff62c265a4d9eb9a5b76bbd191b4
00197.tif
F199 2d2d2efdc0b735ff77c4a6cfbd93d159
00198.tif
F200 93978ea7a025cd5176d90a7d1238b8ef
00199.tif
F201 d2fa2f5b4b45b2235ac6c695e1c4d1ec
00200.tif
F202 ee7ed3b8f99139cb450f72697944a140
00201.tif
F203 c9268b574338669f99ff239b759ffb54
00202.tif
F204 cc1565a67eb0ad20d63578d7b814432a
00203.tif
F205 aa70007694149835aa20723bfe543496
00204.tif
F206 06e65b1a7cc6064de6a1564aa2021f76
00205.tif
F207 73070fcb861bd2b473354694079dfbe0
00206.tif
F208 7db39ac1b64bb9bac4c6069de9f6d90f
00207.tif
R1 textx-pro a2fce33b0caf0d288a9e6cf0ef1d58c2 41678
Copyright.pro
R2 70e7dfdec8ba2eeba40e2b57409978c3 6905
00001.pro
R3 2d861966c3ecd9074508f2065947bf43 1013
00002.pro
R4 065eacc6781388834a6df33e678ad429 29147
00003.pro
R5 3afdd89eb08441aac77b10edc78e7b98 54033
00004.pro
R6 353b2016d9abc47bfdf7556a90594fb8 80806
00005.pro
R7 c68a5c0e22788535446c9dd3c4600080 55303
00006.pro
R8 53c4d0f14d424b9980e22f307a927a06 35731
00007.pro
R9 7a7f3421ace6b308c57c67a90168123c 39129
00008.pro
R10 6a4d92ec45bda6124ad4c5e489e6739a 39078
00009.pro
R11 0679fe3dbe11885c74ff6ae22f06ea0f 45976
00010.pro
R12 48a39cfa6bda2f8e86d733afd08d39fc 4669
00011.pro
R13 a7b32fbbad5292f4b60650279af14b0d 41485
00012.pro
R14 9901bd2b0767f898e96ab9fad62e9561 9142
00013.pro
R15 dead0e664e966842db1a7fd8fecb7556 57540
00014.pro
R16 6070b649a3839e77917f06462c7769dd 3343
00015.pro
R17 8c52d64459b59e1a33008701e2fae63e 9315
00016.pro
R18 eb70be5692d7c2284d28ff00ea4b1418 50151
00017.pro
R19 260bd6b38a230d4fcf41e92b21b58192 52994
00018.pro
R20 34728a725a5417b5bfe82330afb86ac1 6461
00019.pro
R21 79488b8828f9453aac51a4ace2fd2735 56344
00020.pro
R22 0218485dd42d9fe850f2d1338ce4ce70 52869
00021.pro
R23 4f9fc649e47bc790e92f15763f36a162 56348
00022.pro
R24 d392027e47dadcb4ff67b2d399b4a1cb 56360
00023.pro
R25 c88c0c27d20eab73a6e84260151215d0 52748
00024.pro
R26 ed703d37de6780f7d1449a3641f5494e 45987
00025.pro
R27 3ae2b71bb290f425d9a203210a3d33df 41600
00026.pro
R28 6d5fe8794aa977315a4a017def193f3c 43991
00027.pro
R29 c03b5a9190b8a3066fbf8098775fbadd 45485
00028.pro
R30 f62cf0c8d26043d1e71b4d3646466010 51992
00029.pro
R31 7b8a6b71202e9705841c07a7233980f6 19854
00030.pro
R32 b35aee8d6f27d6f5675e1eeba2ae474b 33965
00031.pro
R33 d1e3ef5014e556b285248f7c537948b5 1310
00032.pro
R34 08e67d3d3e041c5c25cd0448faddeb5f 40192
00033.pro
R35 bec61a5d5ea798290843cd82c88679f8 40909
00034.pro
R36 8d0c817169e759a67431bd97d36c118f 49906
00035.pro
R37 9596d0076ae5a8ce03998859fd45fead 19352
00036.pro
R38 43e21fad1a7418161dec5b85cdec3e08 43194
00037.pro
R39 16acccb4560529d84fd522d92995bd32 46500
00038.pro
R40 e12a141be7851f66e8aac56606d24447 55294
00039.pro
R41 723f71d7ab063d15027bd8655e8f911b 37835
00040.pro
R42 e14077ba0d0ba085645a6d7c341edc59 6689
00041.pro
R43 6f0f23e26baaf62509c1765cda4b336d 32126
00042.pro
R44 973f32bba6394d36b17e7001bd8d9130 3199
00043.pro
R45 f53287269631d5c553c5890854334cad 38361
00044.pro
R46 487ae82b9ba423163fe8a234367ec2b0 8273
00045.pro
R47 542a8f37bc6a83144366610376b2768a 24034
00046.pro
R48 34e586a16af575336e7c6d7ebb2f7d9f 38654
00047.pro
R49 cd21db790f0b285fcef121ac751d29d0 30934
00048.pro
R50 7b5e5040c89438049939e9d0980ebc94 23357
00049.pro
R51 eaae5bb28b4b5852a3bcefb70c9fdefe 27222
00050.pro
R52 e7771226a585a9af8b4764dbe3c71f83 31080
00051.pro
R53 989f61091ad97d4f0197d0e3d7d2fefd 5634
00052.pro
R54 1ad3262ca64e6c718f4212e9a79d4898 30150
00053.pro
R55 69693d789bd3832c53e646db3a38e366 6726
00054.pro
R56 aad98e8de9720a9cb5a86c53592b98e1 33874
00055.pro
R57 9e61ca4713bd24e2b4fb30978048ad5f 48909
00056.pro
R58 f4170c855c8942d0d3d03ea6546d013e 47847
00057.pro
R59 892f77e0c2636537acc9306a4da75dd8 6385
00058.pro
R60 a44b10ab1e75bc8469c00530867b07e7 15561
00059.pro
R61 1e91366db96312a23b55302274a2153b 56868
00060.pro
R62 8d5f1895163bc4f23c2070ae2d176670 12839
00061.pro
R63 552c0e3c38ccb1121c90aaa66fc142f2 10360
00062.pro
R64 993fb7bc54cedfd8fff57e699578c499 35997
00063.pro
R65 13082dd083572806d9f7ad8c7680162c 35047
00064.pro
R66 cf5d186e396e6ee7b23c776e37201efe 4498
00065.pro
R67 fd16d45ca441c3851e8b0312e4f2cd6a 3787
00066.pro
R68 951a71b1e57713187bc2ab3b33b2ccbf 56904
00067.pro
R69 b848683248d756e941c954a5d5754522 47390
00068.pro
R70 3a9c9cd4d8303c7d3fc5c69cbb1a8782 48375
00069.pro
R71 8c93e3b23fed783d00eb956dba028725 44831
00070.pro
R72 2af230358fd01aa1eb65a92dbe077996 35979
00071.pro
R73 44c926dd4a6e8108f752ea2d154a1405 38264
00072.pro
R74 a76a532cf5e4bd0167f509d1098eac26 27649
00073.pro
R75 96fe945f25bd1557cc43a59b4eab8dbd 37186
00074.pro
R76 25a234d1340bf98acc238f00565096fa 7618
00075.pro
R77 3d0bb4b2adbc644f1a4963299665a3bb 31270
00076.pro
R78 8faa60c1c1e0f497866e64ea008ba121 24046
00077.pro
R79 d7bc7a9fa3fa59912141877068c4520b 32837
00078.pro
R80 317d6c9cc830f3af59af0ba9cf3f2180
00079.pro
R81 3f2bd66e6837428b2687169ce0054fc5 19437
00080.pro
R82 894169b79eff00d3b9e38255eb6e7291 19165
00081.pro
R83 927c496a6b6b698279f1855c514be4a3 15716
00082.pro
R84 6832ed8a1df025695e3c055447b8bd68 13650
00083.pro
R85 c718047115df1584d413c781420a366a 45661
00084.pro
R86 e061417b9f43e52d54f22fa4e741f3f3 43381
00085.pro
R87 76f379b413a2efeede8b7ecb4b67a305 42467
00086.pro
R88 92edf150763993916047647dd43a29b6 40871
00087.pro
R89 10ce8f3d2f8f32421540b7d55611aca2 19329
00088.pro
R90 0ee61b31b4883daf27c950456d34a461 19363
00089.pro
R91 b362b4ae18df1fa2a3cbb4e93bbb193d 12382
00090.pro
R92 91874d13cbdd00af9cb69ea062910ff8 12016
00091.pro
R93 31c0977663d6635559077f3a5dd90dbb 46805
00092.pro
R94 462fa7acb1920e526c3feb77dac5b3eb 48933
00093.pro
R95 ba764330663755199f34cb0e0765f730 38015
00094.pro
R96 b539691cd7108a9f367a1d95cf2f01bb 43394
00095.pro
R97 03716fed51aecb7c4fd0a54c641ca8fe 50346
00096.pro
R98 9b374b80ca3ffd8d1e62eb8322433581 54025
00097.pro
R99 cce39eecfc88c9900426e343a0be1c48 50142
00098.pro
R100 dd7b3ec9893efe9b3ac7f5054bfcdfe8 52148
00099.pro
R101 08faee5a57d9a680de7bd05190177d25 14189
00100.pro
R102 645bcf6a922e79dc50d182257f48fef7 48045
00101.pro
R103 7eac72f0c54f1c4237a13c57362beeb9 8672
00102.pro
R104 716ee36a808b12d8eca90a957fd75891 15877
00103.pro
R105 d927ece4ac17762b9a38351cc14bb182 51965
00104.pro
R106 c3e333705ccda679fdb88c6cb800252d 21071
00105.pro
R107 6cfec05afc77b4a061c7dc4b506e22ed 21013
00106.pro
R108 426033dcee6e4699fa61acf3eb3c3405 46838
00107.pro
R109 d34aef69bdaefa518438aa6014eb9dd7 47960
00108.pro
R110 4c16637c961731435304bea66766302c 48035
00109.pro
R111 c7a1f107f8a001311fef1d218122d2b8 41255
00110.pro
R112 677b6de2b818537e11c13263e6881ca2 15857
00111.pro
R113 17a11487104c7882ee9b00ef89056b5b 43320
00112.pro
R114 fab4650e683700dc860eff0e59de273a 43581
00113.pro
R115 e29ed313d0d3fcf289cf22fa540656a2 41171
00114.pro
R116 e424ab4e6f931ce41f3d778483143a6d 48077
00115.pro
R117 be7b53d2236eaf9480a7690c740c320f 36562
00116.pro
R118 3eaf0433da0e130a64ebcdf70739fc41 2718
00117.pro
R119 4837d47b3a6ed0dbd1232d921f0fca81 30072
00118.pro
R120 de8c84d802cffaf3bd271d906ca5bad3 53666
00119.pro
R121 9ce411e45b8d5248a6144adb2578a298 5453
00120.pro
R122 fdc09fc934494154248f1e624a6dd6d9 46158
00121.pro
R123 5c5e7f5976b564504c2d159e79a39b97 17170
00122.pro
R124 05e62657cce3af6841c9ca961cc080a0 38593
00123.pro
R125 86cf44c33452ddc9be8bbd31cb72437b 27124
00124.pro
R126 47b7d2cc3488b65b419028b9c4fa0598 46289
00125.pro
R127 3bac63a430f8eb367e2f0263de91535a 4631
00126.pro
R128 0c2ff89a8f0a007cc59e1d1bcf8a80a4 4184
00127.pro
R129 60a6c88695ea348a8a01c62a80e029d9 48982
00128.pro
R130 1a3e04f666d02d387531289dff6d92d0 3825
00129.pro
R131 a7bf44d6eb09409e3e28afe1705b9c30 47105
00130.pro
R132 890cb874ef9299e1cd08149bdd5826e1 41431
00131.pro
R133 2e56c39ec7da9c669c4d3abcf156a6e0 41137
00132.pro
R134 f7e93272d4eced040d60c3f4e905be42 34180
00133.pro
R135 9c45b0028a89054d9763b2b0429f3951 34499
00134.pro
R136 3885330459155f72d8efaacd9d4a494d 7545
00135.pro
R137 a1f33b9099b162e3a7a273678c7bc5c7 41003
00136.pro
R138 b6ed310700ecc28431a208ba1400c6e8 45107
00137.pro
R139 af7555fee407f665315786126c0b10a4 36626
00138.pro
R140 6a286c0d99bf193ba9b8fa129ebeccce 14486
00139.pro
R141 71ca02cf3b3a628f4b670dc1f8bd8c55 85417
00140.pro
R142 71d97ed28c1b5ce32c983f1276745047 33793
00141.pro
R143 75bf4165245cb3034feb62a312bcc123 7017
00142.pro
R144 fac6ad9158ecf3bb7f317a23b9514989 16836
00143.pro
R145 f2856738a8c90daaad926d2455f52232 22369
00144.pro
R146 8183a5c7959a2efb5ddbf085ebe39069 17059
00145.pro
R147 6ea3bd7173bd77a080bcb9b0ac638697 12408
00146.pro
R148 ec1eef283df5df3380e061f1edf1d2b9 10121
00147.pro
R149 b1dc991575b047fad0ecaf3d6b6e61f8 9982
00148.pro
R150 5b4e01253c5db02b52e84192bc0a355d 10391
00149.pro
R151 fa99a975e6c4d30837ea3fcd3b6cbb56 51576
00150.pro
R152 8d1b1cd53d3eea46057b102eb320c31e 26947
00151.pro
R153 1e5de7b3cb41f2621eee255b2688a917 12639
00152.pro
R154 4a9f6ec2859272e9fd766b32748c9c60 43931
00153.pro
R155 3d0e185c64fd653e8d198648affd702c 19898
00154.pro
R156 3cf535469bdfee3f82d59ec8b7a10b71 13444
00155.pro
R157 c8a0fb50f456eeb0e6d7175f82752056 11943
00156.pro
R158 fc73b2c81428ff75640d69c7dfa27b64 43020
00157.pro
R159 cf856699e79fcd683cf4c249f8257a8e 46060
00158.pro
R160 d5f1ed9f70ce249f209959b0c7b841f3 37386
00159.pro
R161 02170c61b7cb54a2cc4f2f9a7ae3e05f 12777
00160.pro
R162 5ea44b00a5a6b18f6001519e38128d58 22415
00161.pro
R163 9ef55b5f691f4c5d3806c1eb087ae074 17563
00162.pro
R164 3b5847b4c320e907333a033d1e44ceb0 46526
00163.pro
R165 c9360a60c57b1abff1f4eed8f00777a0 46286
00164.pro
R166 76f2ae1e65dbd99dc51aed77af9ac872 45178
00165.pro
R167 57e2692b64d7da5405627eb502888fb8 41687
00166.pro
R168 8f44a71a1cf88d46cec1870c63a20805 44814
00167.pro
R169 86249ff3c2886a2a18582242bf8cc766 46026
00168.pro
R170 f3fd21a595e21d43c5d631ee99d46a90 9789
00169.pro
R171 ab2aa40bf48edb6c57a95d1778b75e4d 40346
00170.pro
R172 792dcff1fb651410124e3c599527ccd9 54714
00171.pro
R173 2f6681997037b9742f3fbb8bf0a38e2b 54241
00172.pro
R174 d8e9c5d772937ac612bcd81ffb00eb07 49375
00173.pro
R175 0b949ff59c69b2b2652f2ff40cc14ce4 45465
00174.pro
R176 cdbbe7013f006be33ade83a779e692d0 42461
00175.pro
R177 35abc81028ad3ae6d07f01dbe686e18b 49228
00176.pro
R178 5544bfa25cf4de40fe23031df80f5aa0 17800
00177.pro
R179 900b1947dfec98f12b72f9629ef1b8e7 15931
00178.pro
R180 1a1c24b90438fdf7ce9e391fe0428dab 15913
00179.pro
R181 2142b60603032ba97105df938aaa707b 15788
00180.pro
R182 41544219098265039084ced4e01ae5e2 15210
00181.pro
R183 cae91847ac39f27cb935f48f60869428 15629
00182.pro
R184 9bed1799c42e4dee0ec91622657cf14b 15529
00183.pro
R185 35ec1b28038b298ff62f4ebeee776033 15681
00184.pro
R186 238dbe2e6a1bc5837a6b0d1bbdbd7ba1 13956
00185.pro
R187 e372d6ac6ad93c689e2d0c5a938a2c29 13532
00186.pro
R188 9f35dc7f5ca1e1bea1368dbd58cb31e4 17086
00187.pro
R189 4270ae75a4fb92418baf7bcf916349c4 15505
00188.pro
R190 ae0421f9c7ac97c4ffcbb992bd8f44c4 14672
00189.pro
R191 de93bc7ff3969da2cfbc0b91410d8ed5 14313
00190.pro
R192 d08e2ed1461e6eb0445edd44434a9e99 14877
00191.pro
R193 a28eddf9436270d5835a1711c1446d35 15689
00192.pro
R194 0f2537396987f8a2fe7d068821b85c58 15349
00193.pro
R195 c5d5f9cce3f04a994a61e57af7844016 15262
00194.pro
R196 76249930344d6104f085992984184811 14267
00195.pro
R197 b51169b814b0aaafdc88419e44031167 15549
00196.pro
R198 aa4fcd8b164c5c5f0875065faa6d9e52 52708
00197.pro
R199 25982bf859bafc8b23a808f7ecf7a871 64316
00198.pro
R200 1a95cd934c6b7a6238ccefd56c5f9893 61886
00199.pro
R201 77fcb6e5dc919e4a9fd322196d3cafe2 58611
00200.pro
R202 a4e8c94b4c28a24ed70a078c6c8a73d6 63156
00201.pro
R203 b31567eed8d0c6af49f210429cc26b6f 58725
00202.pro
R204 fe3c8b30c71dba264f30394e6e429d9b 60289
00203.pro
R205 f298ea9ed3f47c68028db70c0c8058f9 55898
00204.pro
R206 f0960075179ac3ac16d4648827fd57f1 8760
00205.pro
R207 a6120cc84f92d8594f6e472772784d32 27376
00206.pro
R208 0ca6753a5b7d2f797c1fc2901ca79b13 37238
00207.pro
T1 textplain b9864cb52689a6b3b0bb5008367af34c 1966
Copyright.txt
T2 3b1e31785713cbf395636929258d5dca 396
00001.txt
T3 cf4438a4b689953907585701ab727194 84
00002.txt
T4 447847474630a5372dc8714e29b93565 1188
00003.txt
T5 cb482dc56dbe481ca4cf4a629e161bf5 2329
00004.txt
T6 18c6814719aa4ef381d3db9b5fb58929 3450
00005.txt
T7 2db0455c1fe194f4eda8e865024c47b8 2402
00006.txt
T8 a9ae28f38dbc4e94d2803daa92c6a8f3 1581
00007.txt
T9 0ee93557124ea5072e4e5d130fa25ca4 1557
00008.txt
T10 6b33659218ba66e7e02655047178e6c7 1629
00009.txt
T11 49ae9c55c99918fad2f6dd571f3b38b2 1946
00010.txt
T12 7e8211deca75d86496cf64e729236333 287
00011.txt
T13 0306c0103c5a3874295d0eff989b8c0b 1824
00012.txt
T14 93c3c152caffba237182e48fe8834b83 438
00013.txt
T15 513efdaa8ef13799aad4a574ad5b2c4c 2247
00014.txt
T16 f2049e60864e438a82ccc25eca79c0c0 191
00015.txt
T17 dd9b5e2197c53c7a18eaa9c172e3e66f 487
00016.txt
T18 257756559b0254ed562dd36daeac0ba2 1999
00017.txt
T19 b361d00716ac738bce50887a5f335265 2064
00018.txt
T20 1b464c8f80b2bc9ee8070f00c4ac3d5d 300
00019.txt
T21 6ab8ad11464971773226f24d3d83c42d 2221
00020.txt
T22 37166b912573790c7e8a50e6f9f57cd0 2076
00021.txt
T23 43e642bfac31cf54d99999c3ffb27d6c 2184
00022.txt
T24 8228bebb8aaade2628c7771996282a94 2200
00023.txt
T25 2cdec87eb681484df987943c077d83e2 2077
00024.txt
T26 a766c9c896cbfa106409a423e73bd9b6 1909
00025.txt
T27 06b6131c53cea632ba4d59a576c22f2f 1670
00026.txt
T28 6656c5e211bc27ff9d171e252f013547 1759
00027.txt
T29 3ecd4d11df041e2cf1d7fc473f15e992 1898
00028.txt
T30 63a7115b8447a5b9691b27c5d3c72f87 2066
00029.txt
T31 3d11d46a8f16a65faa125ec559972da0 786
00030.txt
T32 ef94c489ae444663c5fad58235f57964 1477
00031.txt
T33 e82c893947ccc7202ef00966b0d0e285 142
00032.txt
T34 1c6ca21e0f868788966f3e3d4ef8499b 1688
00033.txt
T35 14e5e2283a28aa5fe3e78c8bf5e43385 1711
00034.txt
T36 f1c47c0d04b75a9b76bab826c22f0517 2028
00035.txt
T37 efb3d0f5bb0d0fcf829ef081cc4fb1ff 854
00036.txt
T38 40dbbf8d18fd04275732f5f0dee307d7 1933
00037.txt
T39 95b89375d11dc794cee195496ef05d65 1883
00038.txt
T40 aa8f7e35b686c768911821348938a5a2 2170
00039.txt
T41 5178f9b305c5e01c10d7a89d79fa8025 1542
00040.txt
T42 3110c2354010f0f32bd6b088c77f4a30 345
00041.txt
T43 e26da2fff9918e845bfbf0be99cf68c2 1377
00042.txt
T44 3a0eab4c5d303637bfde9ce475448cfb 244
00043.txt
T45 9e7ee85787287db52e29a83922f53629 1722
00044.txt
T46 393151064a3f22b1d8a1de532cf6e85e 777
00045.txt
T47 1ea2c2c3ee9d1c3d1a671d9137c07315 1343
00046.txt
T48 d4dd935fccc2ccc4743b4d4f7eeeddfe 1596
00047.txt
T49 a32739dac7fa1106804b11ea03f75d95 1360
00048.txt
T50 7adfe3e4e1c55990da264718d576d6ef 1168
00049.txt
T51 bea05363cd56abaa6fc48f100df7142e 1184
00050.txt
T52 7b0a11fbc4b97bd4abbc3189a017dbfa 1462
00051.txt
T53 f6ffad7207d5328464d58bb0f5816b0e 437
00052.txt
T54 b38f9c6a5d9709bdde2fc143744aed3b 1337
00053.txt
T55 08f42c83dde3b408e02f57d00d3ad08a 600
00054.txt
T56 aa60cf7e2a8fcaa851035df59d8e0678 1531
00055.txt
T57 30c1fecb00341cd1f86c578e86fde515 1957
00056.txt
T58 6dfbba5e702b7c5ea122bcf2b2d8d76a 1916
00057.txt
T59 597d79825fb2f36c6651b0bbd8f077d6 306
00058.txt
T60 9e3b58315b6bd1b921c3fd1f04e0f126 623
00059.txt
T61 61843e4c10fcd9397d0858cfbf0a7843 2192
00060.txt
T62 fe54aeb4cc2be8f6655bb6e2a16f1875 728
00061.txt
T63 d5a6d41569f985353aae83ff3f563194 509
00062.txt
T64 ba87c6a8ea3966cf0c0c3dcc57912621 1555
00063.txt
T65 78fbcdd6997ec0158a2058ab80fa3961 1569
00064.txt
T66 8202cd979e52354c56b20c9fa8eddd87 259
00065.txt
T67 d1dab6c372efd3ebe4e4425102ee0928
00066.txt
T68 1fa7726941057a35079fb476c2dc3e1f 2216
00067.txt
T69 2573c1d0aa5b34e87b1c85a811cdcae0 1905
00068.txt
T70 6f13b21532db801f7ceef2a30e414645 1959
00069.txt
T71 d6dea5ab6f47e0ad05aa55e093d1867e 1885
00070.txt
T72 760d33ba2e30881dbef90b46006644ea 1572
00071.txt
T73 5eab7b2560162b47b79bc77af8ff5584 1656
00072.txt
T74 91c3c2f47fa821986105de0a884a2d09 1269
00073.txt
T75 4b6259c1d58b24d0dde2982e601291d5 1668
00074.txt
T76 bba7621924dc49895612c8b25c712f55 360
00075.txt
T77 07499733c46398c5e3af183aba552f3d 1406
00076.txt
T78 11411dd92523c971936faadd61b247bf 1031
00077.txt
T79 4a3eb3973bdbedbbfde16bfdb12322cc 1410
00078.txt
T80 ff6ea1f781a5aa9ada08b1e9ca0e117b 1947
00079.txt
T81 560ee38bb164def283419a2ddc61db41 1130
00080.txt
T82 62e4d490afa59c8bc424f7b86833c706 1124
00081.txt
T83 327a185de94b6de150f9d8cb5fa498ab 898
00082.txt
T84 68567ebde3409675527e39773f409f03 787
00083.txt
T85 8a1688f147fe5ecf3dfda6e7f0ee47ba 1907
00084.txt
T86 07adac6c98a3dc66cd4c7116151e22f7 1809
00085.txt
T87 f448f2df66cb8cc99f4a80edfe617ee4 1847
00086.txt
T88 1c877c59af1c0da38ab97b4143ae3513 1797
00087.txt
T89 0b50940d0dfa77218e0631f69c7501de 1231
00088.txt
T90 893b12e9512a8be7aba44b2e87c5f393 1262
00089.txt
T91 9b68daf67c3217bf93ce9d62de70d29d 670
00090.txt
T92 137f130241c3e77f18b3ba6904df48c9 841
00091.txt
T93 bf3d2fcc6c5b97d9fec488d63e2942ae 1942
00092.txt
T94 4ffe126de68c37f2cd94ff571cff1ff8 2003
00093.txt
T95 2374b4fb941430ba581b30618247831d 1594
00094.txt
T96 cc9c41e8e33281af69a15a3262b242af 1899
00095.txt
T97 6ee054af93b86f89bb2845be677fc1ba 1992
00096.txt
T98 6d9fe97f8837bed1b75f11937c04dd94 2096
00097.txt
T99 fac343f5cfe680d442c5204dddf423d0 1979
00098.txt
T100 626b4bdfbf44fbfbf0ae14ed6177e46f 2036
00099.txt
T101 a064feb470f27cc72cb4ec4bcedaae85 776
00100.txt
T102 864042cd08b3a9413c5aca2bc23ea789
00101.txt
T103 fda32694b0cc3d375e664ef0717aa3a2 471
00102.txt
T104 78cb35d7fdcdc7ff475f56289bdaaf8f 702
00103.txt
T105 9f9e842fa37c0d20158a93ba0f2968f4 2135
00104.txt
T106 2fac9ea0ede4199baf3d7ab8c2080476 1209
00105.txt
T107 02629d0697d1cf14e3f57e6321e6122b
00106.txt
T108 e5e98efb097ac074fb8bc92dbe5beb4a 1941
00107.txt
T109 b274532628240782347e8394f9eacc50 1982
00108.txt
T110 58feb4ee0bf84a4211a5ce0537e9a80c
00109.txt
T111 787ac6b584a17c680a63568d10d4f260 1804
00110.txt
T112 193ce3e42f366c41ee3c4dd4a0d78960 931
00111.txt
T113 d1287feb0b6de8df91e3c4709952e05d 1810
00112.txt
T114 520c3135534b316657805e99f8eb5d56 1795
00113.txt
T115 434cbebd36c0ed3c6f502fb48b6fcbe2 1716
00114.txt
T116 0bf17b9e928d0633b1fca82b14720807 1915
00115.txt
T117 53fe30007aaeede4d4452658fbf1d02a 1558
00116.txt
T118 0bc0aab8ab6f6a7640006bafb2c45259 117
00117.txt
T119 c78e1e5f7b835e7c2e3875bd83e5fed4 1455
00118.txt
T120 ef22a7bb04fc7e3651dfdef4967613b6 2115
00119.txt
T121 5d3050cdd9ec1c0512214e8e9abe807d 233
00120.txt
T122 d6d28df70ebfce1f860179dc502095d6 1864
00121.txt
T123 92a7af2edf86fae9ea87d06de04af76a 793
00122.txt
T124 c1bfced426118c9789123c0d0b052d12 1586
00123.txt
T125 5e8f8c9b8415da5dbb556e488909d9af 1206
00124.txt
T126 813c5f78e4a5eba83f685d765ddc812a
00125.txt
T127 4e24dfaa3b5a03805823a6b8d54eacfc 183
00126.txt
T128 936399aa2a2796349feeb656523435c2 230
00127.txt
T129 f237b446834ea77a0c3d1f15249b418f 1940
00128.txt
T130 caece98611b13009a1524c5b6ebbf793 262
00129.txt
T131 1ee6cf948536d6f8dac17339ca80d37d 1952
00130.txt
T132 02932114c9d8f96ee4192b2b1ea9f6d3 1758
00131.txt
T133 403d650b061462797fe0a5a23df77dd1 1749
00132.txt
T134 3900115727b42026b6ef55d70263063c 1464
00133.txt
T135 a08cbfbe9c5841cc0a0004e60335eff2 1539
00134.txt
T136 b837df7518e697ed12be3b7cb64173df 335
00135.txt
T137 bcb52db914f7a914d5169a985762b573 1789
00136.txt
T138 2b42ba491d1f8bdd0833deb01ea2516c 1825
00137.txt
T139 1d4abf8069084e099aa549604f015d5e 1526
00138.txt
T140 92ea1dd5b874bc46584b1347bb7a0790 928
00139.txt
T141 bfb0aa374694d459e5021f9e76fe2b39 4873
00140.txt
T142 49a57e50245925f253a02e0a63ee4c97 1497
00141.txt
T143 1c2d26435e3c332884b08c938ae37d7b 512
00142.txt
T144 aaa5a5def0d37b7ec577066591018b10 709
00143.txt
T145 7cd2588aa9132a378ca49f665d1a5f00 831
00144.txt
T146 9efc4fa54ea6fc643da82203ed9994c3 677
00145.txt
T147 0b33ebac9ba297615f55ffb55315d170 1018
00146.txt
T148 04f7cd8f228c92d35b9c615a085f81dd 584
00147.txt
T149 3e6a74b10eb87947a5008cbfc0b31209 557
00148.txt
T150 a8eabd93083eafc95593e778de087d7a 613
00149.txt
T151 61633cb5a2d130dc68374653b483a814 2009
00150.txt
T152 c2d84eab044d42d407f733653b88208f 1359
00151.txt
T153 d239a5f5d6d73cb7513f97743fa8c5c7 823
00152.txt
T154 1c6cb70e1908b04f4b03806e03b81455 1764
00153.txt
T155 dfc3700c93e441fc02e6c8750d75c014 992
00154.txt
T156 d547ac8f7b71263640dee7d39a27959e
00155.txt
T157 7855ec3c16c589a470796d96c0960778 618
00156.txt
T158 388796fe4349f8a4692af97067d7afa2 1793
00157.txt
T159 615394908cb5f7f2730415e37f0ef3c3 1897
00158.txt
T160 7b92ffc4bcb8e83ffde538ed8b4ffb8e 1587
00159.txt
T161 de5980108c4156a58fbf4227fe987724 585
00160.txt
T162 6333c4c42a4162e507315c4c99e9306f 995
00161.txt
T163 7be5a2f7b03d85714c7c5806be6871d9 820
00162.txt
T164 1b0e0640e6f03f520ab4c325e887e6f0 1906
00163.txt
T165 0b209aa9dac95da1b158bb0b3faa9a6b
00164.txt
T166 0c9f894904697946f3baf85875d03bcc 1862
00165.txt
T167 49fa455d4e2c562fe0ee2087af4c3043 1753
00166.txt
T168 e88358b5f074bb7a895d17a196ce2dfb 1855
00167.txt
T169 3dd7dd8f8838ca3b14d2eb9d27f712c2 1900
00168.txt
T170 38f652974faa4d3e464e64c8d8ca5113 419
00169.txt
T171 ac62c2cf7e8e3bdea2b73d9a10001d36 1697
00170.txt
T172 5755c52b511301f1f45f0efc77ec1a39 2149
00171.txt
T173 efb9642456dc4ad0c68b519760906490 2143
00172.txt
T174 93e8170233edf9bb3ee5a5d3b83dbd15 1963
00173.txt
T175 a7c54688e63c6a2dfd7a05296a59a7ef
00174.txt
T176 36f21400f59b0362c74b4a27138443d4 1747
00175.txt
T177 a36e252ce02522b8f78151fff40067ee 1956
00176.txt
T178 f44d68e5dc86f6adaa2750b2fb86bd4a 1038
00177.txt
T179 f05d7dccba19534bd8c063e266853a46 895
00178.txt
T180 3629d89ebfb8acce753085bbd8f96b23 943
00179.txt
T181 87c884d38f197ed541c7f81502eb125c 908
00180.txt
T182 e593d4d58077a2bfdf9c2c16485d83d4 920
00181.txt
T183 fbf2d02e721ddaf672f31d50c2292cbb 921
00182.txt
T184 f68bb70d12911e486910241329f6a2b6 902
00183.txt
T185 39f5d1160dc3529782cc0d3d296d7215 886
00184.txt
T186 f2166a6c9e9fdd2612561fc7c5c594cd 832
00185.txt
T187 5e183c7a023de9224631d61349ecc985 818
00186.txt
T188 0ff777221b038be8ae5b34871fc1d6b1 1047
00187.txt
T189 05414d252ca69f735ac9907ecec53652 1002
00188.txt
T190 1634444b8ba9e5b393b385d4e3e9271b 904
00189.txt
T191 a956b2ec97a6e67a0a7b82b80be7a19f 862
00190.txt
T192 253e0f160dd9d4855a18f1faa616efe1 926
00191.txt
T193 540f8eaac68f0d1306a9e427b3772f26 1003
00192.txt
T194 47fac919c383ce75828e6f47ae73cd2f 914
00193.txt
T195 2ec3f847dc7875747d1dff23953ed25b 950
00194.txt
T196 2a3357d6e467a49c08d1f3368adf2308
00195.txt
T197 e583b2c8e5b4bc04adf734bd4e97d401 906
00196.txt
T198 a0db19375aa2bdd452bb12414da21444 2041
00197.txt
T199 81c190d0d710835ab6e82f7613e4018b 2481
00198.txt
T200 1225dab9f0427cccfa10ae9c2aa40e69 2386
00199.txt
T201 de1b68c1adcf18d77bb2b6e8d82d3b01 2260
00200.txt
T202 efeb46f1ebb2c5127355ed4d470a23eb 2435
00201.txt
T203 9e1ac18f6864af0eeb748b0185d800b4 2270
00202.txt
T204 df47ea5321d7cccccf0833d1ea1ed8d0 2334
00203.txt
T205 3812bea27635ef44c888880f1eb72cb1 2169
00204.txt
T206 b6738c00c1c53385520fa97e6b56e510 379
00205.txt
T207 36e11bacde7df4fb1e5326102dd4e966 1096
00206.txt
T208 ce57d78c255d71c422a0cb4743e44387 1840
00207.txt
UR1 8f6e4efc5081f2a0cfa73d31eb2cad84 3146
00001thm.jpg
AR1 33b57c00a6d4d616ea0de6181d226364 11634
00001.QC.jpg
AR2 0016616bd601798ffeaa8986aeb48c43 7747
00002.QC.jpg
AR3 280a735bea070f2a1ce1c4dc4a681545 2124
00002thm.jpg
AR4 aea1d40bf6608d6035c4c7c1c86a0a8a 19452
00003.QC.jpg
AR5 8a40a924daa26477489706296a09efef 4707
00003thm.jpg
AR6 34bc444221160c63db0b56f796b20eb9 19826
00004.QC.jpg
AR7 3912b6015c5ff184c1326f1be7e5f3a4 4930
00004thm.jpg
AR8 cfe43d897dfc52c39b9c9510e3f70e89 27494
00005.QC.jpg
AR9 19e393ac6b0922c5d1f883be893524e6 6308
00005thm.jpg
AR10 0f1a0664e9638efe1a933e6d9bd740cb 21466
00006.QC.jpg
AR11 9ad4f794b70a653ecb58110b065753ca 5289
00006thm.jpg
AR12 3d538478db7b415105b7f5dd3b6e19f1 21561
00007.QC.jpg
AR13 dcd355f4611b8403de8de9e5990b233f 5470
00007thm.jpg
AR14 86ee2327c0269bf15a68bda163ceacc7 23907
00008.QC.jpg
AR15 3ea6ff28a41b2b228f9e44d445513b4d 5842
00008thm.jpg
AR16 5736c4fb7bdb7017eb23523a0797226e 22256
00009.QC.jpg
AR17 5510d6b56286d769c09a2ef9664b1195 5715
00009thm.jpg
AR18 3a25cca25b48d47cd72fcabe77d8df6e 24530
00010.QC.jpg
AR19 636471d7cce2f8e721e324550612f1c2 5907
00010thm.jpg
AR20 ac107ef8ca3eccc63e6ffb3688480206 14621
00011.QC.jpg
AR21 897abe0fb70ec338be41392ca4eb3a03 4112
00011thm.jpg
AR22 69e953cc0a037df448ca76d1873c0bfe 23202
00012.QC.jpg
AR23 9c1a2fc0c27569451c0c6fe3f26d14db 5692
00012thm.jpg
AR24 918d36762d605c6b466a0d4a6eb969ab 13731
00013.QC.jpg
AR25 6bdb54b765521b8c5d1d924832c76194 3891
00013thm.jpg
AR26 cde7333903395663951f0760003d3a93 30470
00014.QC.jpg
AR27 f1f3ce3c93ba65652b108e3790468ce6 6911
00014thm.jpg
AR28 9ced1e23570787b43854059895c06caf 12234
00015.QC.jpg
AR29 77d3704bacb1bf937c19c0969f033393 3732
00015thm.jpg
AR30 eac7df42fec44284ff812f3103fca987 13051
00016.QC.jpg
AR31 9e0c579615b6bd003671975258eed2c2 3561
00016thm.jpg
AR32 560e6459a5ca712c61362afec5cc8f05 27604
00017.QC.jpg
AR33 ce7feae27d769e89bb3acdae47b5364e 6439
00017thm.jpg
AR34 e11313ea2c61a5d50ab19218f182a63a 29341
00018.QC.jpg
AR35 f2d560ee07a5d03fc0c498d88f061d9d 6861
00018thm.jpg
AR36 c902dbe6bf0138df24b3fc71552ecffe 15026
00019.QC.jpg
AR37 5095b60229f6c68bb332afe5a0dc6a7e 4191
00019thm.jpg
AR38 a7e5e8aba195461ce3b56e76968fda68 29278
00020.QC.jpg
AR39 c58197f0b709cce77850bb7844c0c5d0 6718
00020thm.jpg
AR40 43ad5bddec298f3bf6a4f81913e9ca05 27887
00021.QC.jpg
AR41 5bcd8d7f1ff5c68687f58ee5fd4e07a3 6544
00021thm.jpg
AR42 1c7943339febf4c6d26206c9250451f3 29730
00022.QC.jpg
AR43 cd726e6a5b0c5d2bdd4cf8742a9459f9 6765
00022thm.jpg
AR44 5b97e5eb4df360e5ac8db060cfdd45bb 29154
00023.QC.jpg
AR45 dbf69ee5150976d0a067547b3ce78626 6799
00023thm.jpg
AR46 481cf78a5a1e7c50ac64ac340768f72f 28091
00024.QC.jpg
AR47 849d67c886419b5ea7809ac03a84e20a 6688
00024thm.jpg
AR48 eeed5268b46dfb5a2e0b636952d04855 25543
00025.QC.jpg
AR49 0f8d8fca6dd0ca7466f16d9177339dc3 6242
00025thm.jpg
AR50 cb7412418cf0f29c6b0f043543fabdc4 23960
00026.QC.jpg
AR51 99590b2f48ae5abfe693b2c7bcc16515 5787
00026thm.jpg
AR52 bc65205db777dc24344f8bb9a52008cb 24220
00027.QC.jpg
AR53 c9215ec957e1f7bd8a8db2c055756e6b 5837
00027thm.jpg
AR54 24eefd94b35b11b1f4a8eb62ca1caeb3 25192
00028.QC.jpg
AR55 e4553631e6d4c1ebf22f41f54a9b0f84 6089
00028thm.jpg
AR56 dc25a7b91d89172b8f52ca025c67f68e 27958
00029.QC.jpg
AR57 dcc5ce2f661d9c742b5ee52dd43878dd 6533
00029thm.jpg
AR58 3cc9a259228b89c55f2650f89a867fd4 15748
00030.QC.jpg
AR59 88a5078dc93207820b6688797e4b160d 3752
00030thm.jpg
AR60 503d571d820931715dc568fd78e4b318 20736
00031.QC.jpg
AR61 ea47a0f244690470e02880f986942a91 5278
00031thm.jpg
AR62 9eee590e47cd0bfce51c24b5b18e4dcf 15498
00032.QC.jpg
AR63 827b418376fecd45e1d6e0886aa2fcd6 4481
00032thm.jpg
AR64 d8540ba62e45f391aa7fc27982950aae 23726
00033.QC.jpg
AR65 9ec4e46cad935579f8e2836624b392ca 5774
00033thm.jpg
AR66 bf3107f89d661f82632243d084cb66fe 24029
00034.QC.jpg
AR67 8ce0f09d7f7098c4799d548811abd721 5785
00034thm.jpg
AR68 68680524308d8b053e0bbdeb15bec274 28471
00035.QC.jpg
AR69 f6e6e1e453ddeb66123ab3adf1b6db3c 6613
00035thm.jpg
AR70 0582c58e6849b7a8e76636a5512118e2 15402
00036.QC.jpg
AR71 40f1f0278eaba4bacb64d7522dbe0cca 3695
00036thm.jpg
AR72 4f3e854c1211b0f0718c3a6ae7da14f3 20363
00037.QC.jpg
AR73 e75c8bb1fefd072f0e172bf31f83586a 4806
00037thm.jpg
AR74 60b355e6f816b0b20fd8e2f93c663d49 24755
00038.QC.jpg
AR75 ae7e58dbd685d27b40495f9291f1b34b 6053
00038thm.jpg
AR76 1e38074e646b89d25b3eac56f215a3fa 29632
00039.QC.jpg
AR77 758498c655f3bc37aba375e105119794 6856
00039thm.jpg
AR78 e5c7e9710476e49cb71e624ede164c56 22968
00040.QC.jpg
AR79 c05b77515c8b3673ed44eb6a1882d9e2
00040thm.jpg
AR80 07574b42c377bae0ab39270c816666e0 16840
00041.QC.jpg
AR81 ccae51c65fc5d16562621b0bedc11ad8 4540
00041thm.jpg
AR82 79ba38f138e2e01d83dfd3d94df0432b 20266
00042.QC.jpg
AR83 f8856c06e210b0a0affd776339d4b7dd 5003
00042thm.jpg
AR84 364bacebcf6e735d2d02d0ba3f8cf33e 11111
00043.QC.jpg
AR85 ddb52e84719203325fb172638d870b31 3179
00043thm.jpg
AR86 1adc3d8207a554f7f38f0ade59398ef9 21995
00044.QC.jpg
AR87 e2d5547daf943d2718bdf6f1b79e183e 5683
00044thm.jpg
AR88 b150c7a13d30e17de82ce442d3d83ed0 11067
00045.QC.jpg
AR89 0b0f70246011dfa572437646823adc55 3087
00045thm.jpg
AR90 8c530256379148c191861bb0119a48e4 15753
00046.QC.jpg
AR91 455c014dd023518b85ed26e34649da5c 4217
00046thm.jpg
AR92 555c299a4e38a94e3b8a636cba4a9ccb 22909
00047.QC.jpg
AR93 b2c1c8ccd96c8ae7de9e0790ab507411 6019
00047thm.jpg
AR94 3512637366742380dc4ca888d7862f0d 20508
00048.QC.jpg
AR95 b3b4c94318246e0e09f0a62f25616113 5018
00048thm.jpg
AR96 31bb47dbc9002639e44749c7cce72e3b 16882
00049.QC.jpg
AR97 f8ea95de6c23db51f9c01e32479b5aad 4571
00049thm.jpg
AR98 9c430324c82b191992af1205fe934766 17978
00050.QC.jpg
AR99 edbe4b554a9d91bd588415b1181e69d7 4653
00050thm.jpg
AR100 03ad0085925f6824885276671b919b1f 19541
00051.QC.jpg
AR101 968cf5344251fce1963c610ab7a338e0 5143
00051thm.jpg
AR102 42fe3dd2017af5b726263511b7898cbd 10489
00052.QC.jpg
AR103 2d2889b29088ab5e107beca1ba7c0713 3024
00052thm.jpg
AR104 372c9cffb2750d0f684bde6999bcf676 19755
00053.QC.jpg
AR105 cbb53e4a420af775032c3d2867cfab45 5435
00053thm.jpg
AR106 f8e60e6137f806387ac24a9b8dfc644b 11326
00054.QC.jpg
AR107 a9e1b2556f573f0b1428f7c99d9fc2a9 3035
00054thm.jpg
AR108 cfd794b14e42560b00d6803b2a63fcbd 20814
00055.QC.jpg
AR109 465df3ad0fae1636d2810bd90eb6bc9f 5322
00055thm.jpg
AR110 8cbbdb5c3f07812babd2accc0843dd40 26896
00056.QC.jpg
AR111 143b8f964d2beb4d42d9b5164c93f9a9 6719
00056thm.jpg
AR112 fc6b3511af61e56e38fae9aa64f7bc6d 25821
00057.QC.jpg
AR113 334548ff4381084bbcb7988bea2b3149 6206
00057thm.jpg
AR114 8d58ac7435b9c743ece670faab369469 13136
00058.QC.jpg
AR115 13c474618512bafb20d8ddb0fd904ad1 3870
00058thm.jpg
AR116 5f3e4a0dca972b26a69b117cc6a0cae4 13122
00059.QC.jpg
AR117 62733a7a33c6a932e5ef9c56469a7759 3867
00059thm.jpg
AR118 b74b76eb5c3a3b89ca494f51189a8be3 30474
00060.QC.jpg
AR119 85d7ad5d5b43723bd5ea06bc75b33911 6851
00060thm.jpg
AR120 d1bf4af4caac2951de86ab3a7a43f5d8 12071
00061.QC.jpg
AR121 88e3ef10a777db8431e2d0f0e349d8b4
00061thm.jpg
AR122 89bcef83d9230ebf49aa3da4401691cc 13156
00062.QC.jpg
AR123 4bdd8267ba7b13214a82d68b42310ee3 3793
00062thm.jpg
AR124 85528e2fa2efd18e3dae2b6ce17f2701 20835
00063.QC.jpg
AR125 f766c87f213f0d5f144cc9ac4977f3cb 5469
00063thm.jpg
AR126 45eb655f579f469bf06c7831c60c7040 21228
00064.QC.jpg
AR127 22882ce851bd2d21328231dd332ff1fe 5250
00064thm.jpg
AR128 dcb552076cb0b78843a7e9d5736268dd 12732
00065.QC.jpg
AR129 24977af535f283b21c78d21abccdb13c 3500
00065thm.jpg
AR130 2b82deb1fe33f1a4b01dd9911a99c46f 12658
00066.QC.jpg
AR131 1b6d9b203bf652bce32833a07100af9f 3583
00066thm.jpg
AR132 5ef23173fd03dea2e58de6987497bb71 29255
00067.QC.jpg
AR133 731b3a7e091c03d001905aee985fe542 6728
00067thm.jpg
AR134 50479cba4f33c0028ff0a1a22252e9da 26344
00068.QC.jpg
AR135 c05c0c382356c45952337687aba23d41
00068thm.jpg
AR136 99cd613a6bb6dd5e92f36fc41a294801 25629
00069.QC.jpg
AR137 a75b96b4e46f2bbb10f7142e78941e46 5958
00069thm.jpg
AR138 d060bf3747a5cea33c09a62593017fdd 24321
00070.QC.jpg
AR139 c372e96cc25d7efc507681c09b9f43e8 5876
00070thm.jpg
AR140 afef440453a405889f41586549d4fceb 19656
00071.QC.jpg
AR141 4c1ed50aae4c73063051a68cd9392b5a 5072
00071thm.jpg
AR142 966b06f25fb99c4d37162e572203b22b 22016
00072.QC.jpg
AR143 45c07bcdedbea49a20e8da7117398c7c 5319
00072thm.jpg
AR144 0f854c1b1ffebe7b4819c15d2f5cdb30 17759
00073.QC.jpg
AR145 b1f70b09885e276a9eac4838be788d71 4563
00073thm.jpg
AR146 94385622d0df7ce7fc5e43c8a4c95d53 21666
00074.QC.jpg
AR147 a863904181351d108bdccfba32897a90 5497
00074thm.jpg
AR148 a99377ed2f4ce8d4b356832759c84c48 10292
00075.QC.jpg
AR149 1f7d17a43ea7a431b01339093e379a94 2670
00075thm.jpg
AR150 722fd6ede02f6d4b567277b7140358a3 20587
00076.QC.jpg
AR151 443e9bcdd4dcb4108c77295c722993c5 4945
00076thm.jpg
AR152 6276de2e3a83854ffc28e272de5519f0 16889
00077.QC.jpg
AR153 f3624a4f1d11ba2d292afdb4494911ab 4241
00077thm.jpg
AR154 2cdc9eaba0bf1168e29eb18ceaf3c5fc 20901
00078.QC.jpg
AR155 6e2e7fdbdbeeee2eac48e03d6951de0f 5120
00078thm.jpg
AR156 a4e95ba112ba3052d50d360904a82465 25515
00079.QC.jpg
AR157 29285fa167777b1e7d5c422938ac7a81 6094
00079thm.jpg
AR158 8e9578c9441d02e1a6921cfb8c54dfa3 17202
00080.QC.jpg
AR159 c5f70d0afd57c54d600a8ef81de1a4c6 4477
00080thm.jpg
AR160 22db2f0ffa1f134e14c4525243c52f4a 16490
00081.QC.jpg
AR161 3173bc15d1cedf9282a15941e16efe15 4180
00081thm.jpg
AR162 34aa1bb15262c5c00004ea0527be8166 14852
00082.QC.jpg
AR163 00ccb71118d35f48909024d5ad4004a4 4102
00082thm.jpg
AR164 101f5b9650d3c10d213988c4f2d8af01 14940
00083.QC.jpg
AR165 b31f0131101faff81177006825bb98f1 4159
00083thm.jpg
AR166 f623ed8621d910ef999faf6971b4dfbf 25728
00084.QC.jpg
AR167 8db230a8ba2a1f6d0d724a0afcf712d0 6261
00084thm.jpg
AR168 a3464777ea39ffeff61c04dbf04936bb 24623
00085.QC.jpg
AR169 f30fa91294624ad613e2bede7e52e674 5744
00085thm.jpg
AR170 973934f39d0618b1ac8b1bc078fabc20 24440
00086.QC.jpg
AR171 13d37e80a34a527d6e2ca95161c7775f 5869
00086thm.jpg
AR172 7c19a4eaf1add371eaa2d10527372b26
00087.QC.jpg
AR173 2c704a11a50ce6710ab50f05a4fe222b 5871
00087thm.jpg
AR174 0002b524ead0be77d92172303810d3f7 17474
00088.QC.jpg
AR175 28f10663015aa505fcc4f8d398fa9509 4429
00088thm.jpg
AR176 d31542aa2aa394703dfc75b92e0b0439 17635
00089.QC.jpg
AR177 f3fae98e0d2b0db3e3b8466588cbe70f 4635
00089thm.jpg
AR178 fbfa18c5f2652825f4a0a7fd04bc2db4 13995
00090.QC.jpg
AR179 64c1e8fde71e8cf31bad9fe03636f640 3926
00090thm.jpg
AR180 28e161438a5e254f60a3fa082bb44da7 13416
00091.QC.jpg
AR181 9a4774b3383db7afc6d42b358a587be8 3768
00091thm.jpg
AR182 4c8f3578f67534d3e6458242c8365fbc 25799
00092.QC.jpg
AR183 6e1c69f1779b490842b198626efaf2fe 6175
00092thm.jpg
AR184 03b1b294f122a5651957393f762088f8 26131
00093.QC.jpg
AR185 fc25a24a25817ae545a52096b1645407 6205
00093thm.jpg
AR186 16c706c032e65dddaca3b7c76c90512a 22531
00094.QC.jpg
AR187 03f2fa9b32d6543dd483fd04af2b3b70 5316
00094thm.jpg
AR188 934f21ff9f19f15fe016ed87dcf39d0d 24279
00095.QC.jpg
AR189 27cacc3fb9809b5539881a924603c40a 5830
00095thm.jpg
AR190 344a911016fb04df40d6b5e308645ec1 27664
00096.QC.jpg
AR191 1287b3ca3095c13e9e5085172eb1a53d 6549
00096thm.jpg
AR192 b8dd10cdc01acf12a1c947327afe98be 28237
00097.QC.jpg
AR193 38240ff9b839bf40a80fa75cc12c1e3d 6568
00097thm.jpg
AR194 81fa3417c390d99debb06a65b31885a0 26812
00098.QC.jpg
AR195 e4472a5bf5d86185a86769d7e55f4faa 6266
00098thm.jpg
AR196 096d4844582c01efbc27c69c8b179e58 28375
00099.QC.jpg
AR197 2476da68bc75269801b212c202c0e659 6599
00099thm.jpg
AR198 460011044aebaf39f7e52319b8b1e875 16773
00100.QC.jpg
AR199 3248a992e884b7419f7683027e6db429 4318
00100thm.jpg
AR200 1508b9e424f3d3e9adb0f0381568fae0 26316
00101.QC.jpg
AR201 51ec384585828a8082e04e667b6a737b 6446
00101thm.jpg
AR202 09f68cb90d6e0462b1505cfb6b740745 12185
00102.QC.jpg
AR203 fdbfe9a8e3da57c7289c3737202beb66 3400
00102thm.jpg
AR204 673024f72eb6ac5b58f42199ed840d55 13228
00103.QC.jpg
AR205 129eda70320026f8a5eb750737481e1d 3704
00103thm.jpg
AR206 fdf303b88bc24cd88022a0ab71426ce5 27799
00104.QC.jpg
AR207 e23319b8c7b022d6aa4199216710bbf5 6506
00104thm.jpg
AR208 2a39c6569604c959e70314d619dcc58c 14967
00105.QC.jpg
AR209 da617fcce5ba0d5afaf44e28641ca083 4006
00105thm.jpg
AR210 25ea4edc0478f9b42afb03e05f26737f 14618
00106.QC.jpg
AR211 d370a45b0c1df61fec0881e6fbf20eac 3954
00106thm.jpg
AR212 544a8d0009bc984db134bdce9d396ad2 25838
00107.QC.jpg
AR213 dd90e5820cc6fc9c5dbe8822b25b22d9 6002
00107thm.jpg
AR214 fbf5583167dcbf47e03b1c636b3cf63e 26106
00108.QC.jpg
AR215 6707e796d9043c964ce60bda133f8af9 6077
00108thm.jpg
AR216 96d3064501f33e47654ba27603ad76a6 26213
00109.QC.jpg
AR217 0dc2bbb8f79de1a15fe0c89db2a9fa15 6104
00109thm.jpg
AR218 18cb45f5d7eed307098450853a302329 22841
00110.QC.jpg
AR219 78950b448b8a56c9cf54d5395d4dd074 5506
00110thm.jpg
AR220 bd3b024b46556ef2d6f9b922fc5c4cc5 15714
00111.QC.jpg
AR221 4587c293df8b3e8f5ef3590ffb9b5217 4066
00111thm.jpg
AR222 9ba3c37921f8e7ce2f2fd172ab988fd0 24012
00112.QC.jpg
AR223 355347a4b1a9898da6c0c869aae3adf1 5607
00112thm.jpg
AR224 8838a9528a5f660ccbace5129c9e506e 25328
00113.QC.jpg
AR225 fc001adf32523e5d4e581bfa5a6e6a84 5901
00113thm.jpg
AR226 7d0a95e7fb5afeae29801799d22bc3b9 24224
00114.QC.jpg
AR227 204be35f1245012a6ed9f981a4e16750 5852
00114thm.jpg
AR228 4cee90cd659ad731d64bd5965868e1f5 27010
00115.QC.jpg
AR229 a0f7ecadd883cc03f097d7b28a0a0991 6470
00115thm.jpg
AR230 7045e8c6a281efd016c39024e2c8b654 21497
00116.QC.jpg
AR231 21ba3b517c11b3dbe50779c0d4f4ac2d 5267
00116thm.jpg
AR232 d9bf54fd35529fffb9cae2fbed18d40c 9799
00117.QC.jpg
AR233 aa623232c4ba8500161b983e5ced5d77 2879
00117thm.jpg
AR234 5e376924ba5ae5f8e549441ef7b812ef 18348
00118.QC.jpg
AR235 18e79319b766e5dc4598127177f0010e 4767
00118thm.jpg
AR236 805b8a15c6c0c30ec7ac1782947866f2 28701
00119.QC.jpg
AR237 3b371f45286fcfb044ecdd57c716fdd9 6668
00119thm.jpg
AR238 90685930172a9b212c363696c1473a6a 12490
00120.QC.jpg
AR239 3cc348f126d9ade86b96f7b7d0a535b4 3535
00120thm.jpg
AR240 289c5bf9a1af0b252ee559a07924d1f4 25627
00121.QC.jpg
AR241 99e8a8cee2d693044aabdb8f4f05e720 6161
00121thm.jpg
AR242 72416974134edd3d8f8656351d7a81d6 13991
00122.QC.jpg
AR243 3186e82f2301a651bd2cb6dc21ad0c8c 3795
00122thm.jpg
AR244 b6f3f56bb3ecc03ae2fee602b28b5abe 22025
00123.QC.jpg
AR245 24aabbfcf7a26203dca570b5365264d9 5717
00123thm.jpg
AR246 2311b6c7440268a85c84384946714e20 17458
00124.QC.jpg
AR247 7759768dbbf197657b3b2f3a54e44a08 4666
00124thm.jpg
AR248 d1dbe99e14afdfde0ac54feef5d1efb2 25938
00125.QC.jpg
AR249 a8a14d1aeab765c31c0d72f6b6f7f089 6238
00125thm.jpg
AR250 fa896a259718683562296c94701a685a 12852
00126.QC.jpg
AR251 01ff39d26df2a8ab38372706916c54e8 3689
00126thm.jpg
AR252 02fd73fa9a5f72014a61a0827c65efb6 13750
00127.QC.jpg
AR253 8db9ff899bf19fd99712974db1634075 3774
00127thm.jpg
AR254 a971254a09581d70770d0629f532009a 26388
00128.QC.jpg
AR255 87e84dc64ed53a74c711ff111bd717c9 6351
00128thm.jpg
AR256 5edf940624cc014bf7d27702f2e468b8 13782
00129.QC.jpg
AR257 afde875c5d775184809709835a6525b2 3921
00129thm.jpg
AR258 7e60833df2fbc937559d903041209c6e 26359
00130.QC.jpg
AR259 d4133f3835f0d6dec6b52d4195f09eb1 6250
00130thm.jpg
AR260 541e76b5e9962b496709b744e5215777 24158
00131.QC.jpg
AR261 29146e7a7013132d247c2c3a3b567e59 5824
00131thm.jpg
AR262 33bf39da05df4b7e7a3554cb08d67794 23570
00132.QC.jpg
AR263 7878fd43f68dfb54b763a75ce6050874 5847
00132thm.jpg
AR264 b7379eb79ac050a522949e00c27a2ee0 20370
00133.QC.jpg
AR265 2115165b4eab9400303470a370fd17ac 5166
00133thm.jpg
AR266 7e1cbe73534d9ac1006b290e6e6f2384 21052
00134.QC.jpg
AR267 463ff0f667e16adda685e8b7497dbc25 5234
00134thm.jpg
AR268 aed38c3df34f006b691cc200f050b439 10258
00135.QC.jpg
AR269 6e639ff350a6d4f09086ec1f741c42f9 2927
00135thm.jpg
AR270 06b896509af14e797587576cf78cf139 22216
00136.QC.jpg
AR271 b7d110c3f6dcb763ca0ce8ad3b538dc9 5882
00136thm.jpg
AR272 75cda4db07f8989e6a19cf1a404f91bf 25634
00137.QC.jpg
AR273 e8dac94fc2ca9707178204cea9af54c9 6063
00137thm.jpg
AR274 4adf5aa7fb907813b0c147d2eec09df8 21955
00138.QC.jpg
AR275 7ed41087f7b445045d3b4233d5f3d0fe 5359
00138thm.jpg
AR276 6bba98999c78e9c6c5fc11cb8cedb30f 12409
00139.QC.jpg
AR277 144d453b0fa59ca958fe73b7d1d68edd 3266
00139thm.jpg
AR278 d4ad9e03f77b45565c4967a292be41ba 26104
00140.QC.jpg
AR279 7705480f12218406f493200ccd09f280 6304
00140thm.jpg
AR280 51c49af99a6b032de15d52610dffed39 18065
00141.QC.jpg
AR281 4cf6f69a1ed2b19c3a6233686f2077c3 4435
00141thm.jpg
AR282 9fc33ce83e4dd3c3f40f5c8820704ea4 10954
00142.QC.jpg
AR283 88495d6e72a701f362be7782665f990b 3100
00142thm.jpg
AR284 ef538e07a7752b2cef4c70cc6a7c167d 14919
00143.QC.jpg
AR285 af3feaf5ea5c74f4d6c2596acf6a0d35 4179
00143thm.jpg
AR286 b2cff40eaaf787ce3d1078d13cbf4adb 15493
00144.QC.jpg
AR287 3b3580bebccbf588ffa217f284a9de6b 4283
00144thm.jpg
AR288 af9c71a25aad9ff6727c03a749c24125
00145.QC.jpg
AR289 d994c081808e0005df90a1d78f3744e9 4400
00145thm.jpg
AR290 da95a3fd30c7478fd9645e5c75adfe0d 14932
00146.QC.jpg
AR291 44d837b4a784bfa735affeb7c9125be7 4413
00146thm.jpg
AR292 22c4207d422ff9cea5d0bcb1802bd06f 14374
00147.QC.jpg
AR293 b83b180fdd2a8d0c4bf997b96b09a185 4033
00147thm.jpg
AR294 cb52db95903210a3f54bbe5e825f0b40 15917
00148.QC.jpg
AR295 e066ef91ab1bd2449dc7f89bfb38619f 4334
00148thm.jpg
AR296 60527826106a1eae79937c9773b439a8 12237
00149.QC.jpg
AR297 b11ed3380ce5c8ec109bc8813453cb5d 3466
00149thm.jpg
AR298 bfa33d0bc4dc21366eefb2802852457b 28671
00150.QC.jpg
AR299 1cdf527f832e5cddc1e25647872540d8 6684
00150thm.jpg
AR300 d79f0543fe630e275f4e9f56f849a546 14467
00151.QC.jpg
AR301 edf758c9fbe12d467e3da6049e1caef0
00151thm.jpg
AR302 5ee55ecb7dc519a1fe3e3b9d4bd5f72d 14846
00152.QC.jpg
AR303 0784c04770d6b92657ef8ebc84b6a868 4132
00152thm.jpg
AR304 bf32a1cddb7d1937b5ab8eaaf3c41e45 24740
00153.QC.jpg
AR305 b48ec09e5082b6e84e3a804d147ce518 6189
00153thm.jpg
AR306 39855a930451333f5fea009b54cc1b1b 12775
00154.QC.jpg
AR307 1470e6669bedc71b7063d3f20d619a22 3559
00154thm.jpg
AR308 f64485379ecbbe13a3c33f7d5f4b7d7c 12642
00155.QC.jpg
AR309 d08ee1292303ae3756073d267c9ef809 3397
00155thm.jpg
AR310 ec24f34ad07f0f1aa795b56c80ff4f98 14491
00156.QC.jpg
AR311 b9b7d3c04d513447da0318477708e60b 4026
00156thm.jpg
AR312 98067652c9ce8779bc75b47b0edc887f 24513
00157.QC.jpg
AR313 6b4af4b956f95f44c77fecdd97c09a9c 5781
00157thm.jpg
AR314 4f65b95b2150fb56e25ae3afd2607d2f 25262
00158.QC.jpg
AR315 3e481791b29d48f77b07e2d6ef72a232 6000
00158thm.jpg
AR316 3c3a4673f8f86323110a1737b0451416 22031
00159.QC.jpg
AR317 df50b83a6d491f7ab81721a3d51b40dc 5351
00159thm.jpg
AR318 eec38e168f8d96d034a70a5359dab7a3 12657
00160.QC.jpg
AR319 cc990ec3b2870a8d713dfe78a01361e3
00160thm.jpg
AR320 fd3e8c07a558ed408af806ced43d23ed 15972
00161.QC.jpg
AR321 06ad3b60525603dddbaeb7b459a56c8b 4530
00161thm.jpg
AR322 ba73f8f358b5e1f46152a084216f3cbc 13843
00162.QC.jpg
AR323 ab9641a357c00ce5a41028bde21205d5 3723
00162thm.jpg
AR324 2fd643961ffb6caf77c53112cf33c2d2 25802
00163.QC.jpg
AR325 feb0a4930e1bb1e2ad8f93122bf329e6 6166
00163thm.jpg
AR326 748ee6a1d32be70f02a89dc9d6c89790 24879
00164.QC.jpg
AR327 15c80c320ae735b5336c0ac40d8bf699 5834
00164thm.jpg
AR328 3962b337b0290315c7f5fc260c642e0b 25294
00165.QC.jpg
AR329 fde9c7bf0d6d5d5516f334f18985b9ca
00165thm.jpg
AR330 d0a098f999a8a21b0f74891197d0e74e 23465
00166.QC.jpg
AR331 3545641eaf0e7676f18da34e2fa0ab91 5684
00166thm.jpg
AR332 4214c60efd98a13b06d58dbb970abcc0 25624
00167.QC.jpg
AR333 a221067c2cfcea93aab0890d384246fd
00167thm.jpg
AR334 8de26a6f2df2942d20c842fcd120adea 24967
00168.QC.jpg
AR335 05be06ad236ff6fd3cdebcf3230e494a 5799
00168thm.jpg
AR336 e5251905e3e90ada0a498ab2447a9505 11721
00169.QC.jpg
AR337 b9f077607191a34a25e36283a487aa73 3002
00169thm.jpg
AR338 1d3658119f516753a85ef57474e10c02 23106
00170.QC.jpg
AR339 d0e2f1093b7f248a3bf7753968c82dd0
00170thm.jpg
AR340 b0cabb228062a6d976e3e18dd436f760 29056
00171.QC.jpg
AR341 def082d1a2ab8c202ff0004f3c324d25 6829
00171thm.jpg
AR342 e4e2262f3b4c3d97f216f959ba32212d 28853
00172.QC.jpg
AR343 35fcdbd5f254bdcc7fc9aa4a35ade087 6400
00172thm.jpg
AR344 5a59f56613fe9e49c6cf25243989e096 27246
00173.QC.jpg
AR345 361f1e684cde656510983df33770381c 6651
00173thm.jpg
AR346 41f7d93e3df37d5ada885a6bc64ba511 25679
00174.QC.jpg
AR347 d536f148284d3fdb19e8153b473ea351 6490
00174thm.jpg
AR348 af6e47a99f364f473888f0247f5eae02 24186
00175.QC.jpg
AR349 26b98ba8c11cd97080ba029e85a87900 5720
00175thm.jpg
AR350 dec13b8b63565efee41e50a74f85edeb 27052
00176.QC.jpg
AR351 bf4efd9c93dc6f5e6222066e26e23048 6187
00176thm.jpg
AR352 7757bf83436d11ae17cb94bd15bfc6bb 13261
00177.QC.jpg
AR353 ab0c3b180cd09e888b8bd04038132f0e 3757
00177thm.jpg
AR354 5cba798985f0cdacd25c56e6f96fdd6f 13141
00178.QC.jpg
AR355 7b34a9f36550bc829980b2dab7633707 3503
00178thm.jpg
AR356 0f4d255e7e2bf1a82a8bc6d89b51a531 12970
00179.QC.jpg
AR357 be3e95b66d934b2ce3bebe28fa581b02 3660
00179thm.jpg
AR358 1efb1ad9ee3cded8f52779da8ae7d42a 13508
00180.QC.jpg
AR359 6de1bce1437ebd1c119f8aecd865972a 3934
00180thm.jpg
AR360 1a8b9025d78dafbd8da20fa53d9e384f 12792
00181.QC.jpg
AR361 36b7276d50b6b5ada382f3aa7250da05 3709
00181thm.jpg
AR362 ae48327ed03250b4cb171e19b9058818 12861
00182.QC.jpg
AR363 3e32d351725fb0d156fa2944d8c21ce1 3628
00182thm.jpg
AR364 cd1cb923a10fa590e30e8016de13da36 12764
00183.QC.jpg
AR365 207284605a8312cae7ab20be40baa8c4 3645
00183thm.jpg
AR366 a0e5d42b5f7338275f2a8b85fb40af88 12868
00184.QC.jpg
AR367 ae8dc3a37d56778a82d629619c2741e3 3668
00184thm.jpg
AR368 0d90637829d039b506595d55472f5fa1 12067
00185.QC.jpg
AR369 94186cd93c76e85f7ce50b37335fd510 3345
00185thm.jpg
AR370 8c8b0fc163e8e95077ab18d877524b73 12725
00186.QC.jpg
AR371 d2433266d0482ab6c07cb9a43732722e 3441
00186thm.jpg
AR372 ad374376e9e5736b8c81cbe837606a47 13722
00187.QC.jpg
AR373 d905f587e600423f2bae978a05292e14 3785
00187thm.jpg
AR374 d6d2e05160cc19794792208bcbc28c25 12849
00188.QC.jpg
AR375 2582fadb55ec9b57bf7c794b5c3dedb9
00188thm.jpg
AR376 9878dd3295ba4ba621c3b5a1dd1fe3cd 12558
00189.QC.jpg
AR377 54344e2edb745de980da91e57a134f2a 3667
00189thm.jpg
AR378 024e05ac1d054e56454b50dc1afcad2e 12411
00190.QC.jpg
AR379 0270a0f68fb24d7f9236bc1a224cbfbc 3619
00190thm.jpg
AR380 2c3d974ea67602288db5be9f7f78d30b 12666
00191.QC.jpg
AR381 e366c937d2d53699738fe5fb730236d2 3556
00191thm.jpg
AR382 1f8dcce33a6b6ff15e46def6a007aed7 13065
00192.QC.jpg
AR383 80d7cfc1159f5f613faa32b0cce50752 3721
00192thm.jpg
AR384 52a1f169e149f1ee66af26d550cc1e4a 13041
00193.QC.jpg
AR385 511bde6104a650e375ba8fe916bb083e 3747
00193thm.jpg
AR386 d7acfde4ca95785005448a3194b6d8bc 12512
00194.QC.jpg
AR387 a08403a796bd0c7b4e15fa702a9f2f27 3576
00194thm.jpg
AR388 4f9f145af5bdff3149649a3428f4c6a9 12370
00195.QC.jpg
AR389 8e0cc86e412e4c94fb77bf826809568e 3548
00195thm.jpg
AR390 2fa3d831b8193552907f0aa682bb8093 12955
00196.QC.jpg
AR391 9b788e06a5261d28b68e4f1b2aeaab8b 3634
00196thm.jpg
AR392 ba3d51ada881deac91d3bb38d4ba68f6 24308
00197.QC.jpg
AR393 b72ccc3c08ff09c48a8f0b2db1b83a58 6020
00197thm.jpg
AR394 2af4d6be4901c0bf372b8475c37444ad 30056
00198.QC.jpg
AR395 ab56346e693df6d92925394143778870 7355
00198thm.jpg
AR396 ddeab6f55b1ed580cb3d056da6d221de 27985
00199.QC.jpg
AR397 8b5e922be15d67649c81d76e08936fc0 7109
00199thm.jpg
AR398 355bff3a3364b912e7654e904a2ad7ab 28384
00200.QC.jpg
AR399 a0b7e539733319a9c9d103a2f9dc8371 7246
00200thm.jpg
AR400 25fd304f8d47c78f9fd88f87079c81d5 28849
00201.QC.jpg
AR401 251f144046a80ae13bfc1840236d9d81 7065
00201thm.jpg
AR402 8fb6dc0c24592d5431b84edf5cb67336 26965
00202.QC.jpg
AR403 d9f67553dc722e5b474896c23fc379d8 6687
00202thm.jpg
AR404 19b6df7c4efdd3cc0ab0e3b64d8d10ee 28052
00203.QC.jpg
AR405 533ce627c6d8cf0f0496c7bbffd5949d 7025
00203thm.jpg
AR406 94ac934a518dd6a8c822f46cde1a97c1 27364
00204.QC.jpg
AR407 b5ee7000d6a7d5363ae5e61d225a0da4 6857
00204thm.jpg
AR408 f5af931e15f3c6e1cb026d4917fefcd5 10613
00205.QC.jpg
AR409 5a78b3dd1ea2dbd85bdd1127eec6e26d 2582
00205thm.jpg
AR410 d91c881976cfa494e8c40b81100addf9 19224
00206.QC.jpg
AR411 a59a96dc0a524fe08267615c47acbfb5 4768
00206thm.jpg
AR412 fb22bc7bfe580cc4e6717e47c57e54c6 21744
00207.QC.jpg
AR413 73e0432a8c5aaa9c7bfbcddbaeabe50a 5712
00207thm.jpg
AR414 e0de2485610740e899aeb21ac1a513ee 20491
Copyright.QC.jpg
AR415 ef202288e4e70647f35433a61a517958 5236
Copyrightthm.jpg
AR416 589bc58067902a205de16f7ee520a8e1 47294
Copyright_Archive.pro
AR417 335a86b579d37113c91f93f9991ea185
Copyright_Archive.tif
AR418 ed02aaf50625d08253dd1412d38f6bc1 2183
Copyright_Archive.txt
AR419 85211a6ff265c12139e873cc60e20738 306212
UF00082197_00001.mets
Document structure (from the METS structural map):
Title Page; Dedication; Acknowledgement; Table of Contents; Abstract.
Chapters: Introduction; Approaches to ...; Data Collection and Processing;
Experimental Design Based on Coarse Analysis; Results of Recognition Based
on Coarse Analysis; Experimental Design Based on Fine Analysis; Evaluation
of Vowel Characteristics; Concluding Remarks.
Appendices: Recognition Rates for LPC and Cepstrum Parameters; Various
Acoustic Distance Measures.
Back matter: References; Biographical Sketch; Copyright.


Fant, G. (1976). Vocal tract energy functions and non-uniform scaling, J. Acoust.
Soc. Japan, Vol. 11, 1-18.
Fant, G., Gobl, C., Karlsson, I., and Lin, Q. (1987). The female voice: Experiments
and overview. 114th Meeting of the Acoustical Society of America, J. Acoust. Soc.
Am. Sup. 1, Vol. 82, S90.
Flanagan, J. L. (1955). A difference limen for vowel formant frequency. J. Acoust.
Soc. Am., Vol. 27, 613-617.
Flanagan, J. L. (1972). Speech Analysis, Synthesis and Perception, 2nd Ed.,
Springer-Verlag, New York.
Foley, D. H. (1972). Considerations of sample and feature size, IEEE Trans.
Inform. Theor., Vol. IT-18, 618-626.
Fortescue, T. R., Kershenbaum, L. S., and Ydstie, B. E. (1981). Implementation of
self-tuning regulators with variable forgetting factors, Automatica, Vol. 17, 831-835.
Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition, Academic
Press, New York.
Gray, A. H., and Markel, J. D. (1976). Distance measures for speech processing,
IEEE Trans. Acoust., Speech, and Signal Processing, Vol. 24, 380-391.
Gray, R. M., Buzo, A., Gray, A. H., and Matsuyama, Y. (1980). Distortion measures
for speech processing, IEEE Trans. Acoust., Speech, and Signal Processing, Vol. 28, 367-376.
Henton, C. G. (1987). Fact and fiction in the description of female and male
pitch. 114th Meeting of the Acoustical Society of America, J. Acoust. Soc. Am. Sup. 1,
Vol. 82, S91.
Hollien, H., and Jackson, B. (1973). Normative data on the speaking fundamental
frequency characteristics of young adult males, Journal of Phonetics, Vol. 1,
117-120.
Hollien, H., and Malcik, E. (1967). Evaluation of cross-sectional studies of
adolescent voice changes in males, Speech Monographs, Vol. 34, 80-84.
Hollien, H., and Paul, P. (1969). A second evaluation of the speaking fundamental
frequency characteristics of post-adolescent girls, Language and Speech, Vol. 12,
119-124.
Hollien, H., and Shipp, T. (1972). Speaking fundamental frequency and
chronologic age in males, Journal of Speech and Hearing Research, Vol. 15, 155-159.
Holmberg, E. B., Hillman, R. E. and Perkell, J. S. (1987). Glottal airflow and
pressure measurements for female and male speakers in soft, normal, and loud


Table 5.4
Results from the exclusive recognition scheme 3
with various filter orders, acoustic parameters, and
the PDF (Probability Density Function) distance measure

                            CORRECT RATE %
                  Order=8   Order=12   Order=16   Order=20
Sustained   ARC     80.8      84.6       88.5       67.3
Vowels      LPC     84.6      98.1       92.3       80.8
            FFF     N/A       96.2       N/A        N/A
            RC      88.5      98.1       92.3       67.3
            CC      78.8      94.2       90.3       75.0
Unvoiced    ARC     69.2      65.4       57.7       N/A
Fricatives  LPC     78.8      86.5       78.8       53.8
            RC      78.8      73.1       67.3       55.8
            CC      80.8      73.1       69.2       57.7
Voiced      ARC     88.5      86.5       82.7       59.6
Fricatives  LPC     92.3      94.2       94.2       71.2
            RC      92.3      90.4       90.4       75.0
            CC      92.3      92.3       80.8       71.2
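The excerpt itself does not define the PDF distance used for Table 5.4. As a
rough sketch only, assuming it behaves like a negative Gaussian log-likelihood
of a test parameter vector against per-gender templates estimated from
training frames (every function name below is hypothetical, not the
dissertation's code):

```python
import numpy as np

def make_template(frames):
    """Estimate a (mean, covariance) template from a stack of
    LPC-derived parameter vectors, one row per analysis frame."""
    v = np.asarray(frames, dtype=float)
    return v.mean(axis=0), np.cov(v, rowvar=False)

def pdf_distance(x, mean, cov):
    """Negative Gaussian log-likelihood (up to a constant) of x
    under a template; smaller values mean a closer match."""
    d = np.asarray(x, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (logdet + d @ np.linalg.solve(cov, d))

def classify_gender(x, male_tpl, female_tpl):
    """Nearest-template rule under the assumed PDF distance."""
    return ("male" if pdf_distance(x, *male_tpl)
            < pdf_distance(x, *female_tpl) else "female")
```

Under this reading, the filter-order dependence in Table 5.4 reflects the
usual bias-variance trade-off: higher orders enlarge the covariance matrices
to be estimated, so rates fall off at Order=20.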


However, the assumptions of the linear source-filter model, of a single
impulse or white-noise excitation as the input to the model, and of an
all-pole (no-zero) spectrum (sufficient only for vowels) remain only
assumptions. Several factors must be considered when estimating vocal tract
resonance (formant) frequencies, bandwidths, and amplitudes from the
spectrum of the speech signal:
(1) the periodicity of the vocal fold excitation,
(2) the analysis frame during the glottal open phase interval (the effect
of source-tract interaction),
(3) the influence of the the fundamental frequency of phonation when it
is near the first formant, and
(4) rapid formant variations that may occur in consonant-vowel
transitions or diphthongs.
6.2.1 Influence of Voice Periodicity
Figure 6.1 shows the simplified speech production model represented by a
time varying linear system with filter parameters slowly changing (lip radiation has
been already included in vocal tract filter). The impulse train generator produces a
sequence of unit impulses which are spaced by the fundamental period of N
samples. If we denote p(n) as the input impulse train, s(n) as the output speech
signal, and g(n), v(n), and r(n) as impulse responses of glottal, vocal tract, and lip
radiation filters respectively, then
s(n) = p(n)*g(n)*v(n)*r(n)
Since p(n) = 2 8(n-mN), we have
m
(6.1)
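Equation (6.1) is easy to exercise numerically. The following minimal sketch in Python synthesizes a vowel-like signal by convolving an impulse train with glottal, vocal tract, and radiation responses; the filter shapes and formant values are illustrative assumptions, not the dissertation's data.

import numpy as np
from scipy.signal import lfilter

fs = 10000                    # sampling rate (Hz), assumed
N = 100                       # fundamental period in samples (F0 = fs/N = 100 Hz)
p = np.zeros(fs // 10)        # 100 ms of signal
p[::N] = 1.0                  # impulse train p(n) = sum over m of delta(n - mN)

# g(n): crude two-pole low-pass glottal pulse (an illustrative assumption)
g = lfilter([1.0], np.poly([0.98, 0.98]), np.r_[1.0, np.zeros(N - 1)])

# v(n): vocal tract as a cascade of resonators at nominal formants (assumed)
v_den = np.array([1.0])
for f, bw in [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]:
    r = np.exp(-np.pi * bw / fs)
    v_den = np.convolve(v_den, [1.0, -2.0 * r * np.cos(2 * np.pi * f / fs), r * r])

# r(n): lip radiation approximated by a first difference; convolution is
# commutative, so s = p*g*v*r is computed as nested filtering
excitation = np.convolve(p, g)[: len(p)]
s = lfilter([1.0], v_den, lfilter([1.0, -1.0], [1.0], excitation))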


4.2.2 LPC Coefficients 41
4.2.3 Cepstrum Coefficients 41
4.2.4 Reflection Coefficients 41
4.2.5 Fundamental Frequency and Formant Information 42
4.3 Distance Measures 42
4.3.1 Euclidean Distance 42
4.3.2 LPC Log Likelihood Distance 43
4.3.3 Cepstral Distortion 45
4.3.4 Weighted Euclidean Distance 47
4.3.5 Probability Density Function 47
4.4 Template Formation and Recognition Schemes 48
4.4.1 Purpose of Design 48
4.4.2 Test and Reference Template Formation 49
4.4.3 Nearest Neighbor Decision Rule 55
4.4.4 Structure of Four Recognition Schemes 56
4.5 Resubstitution and Leave-One-Out Procedures 60
4.6 Separability of Acoustic Parameters Using Fisher's Discriminant Ratio Criterion 61
4.6.1 Fisher's Discriminant and F Ratio 61
4.6.2 Divergence and Probability of Error 64
5 RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS 68
5.1 Coarse Analysis Conditions 68
5.2 Performance Assessments 70
5.2.1 Comparative Study of Recognition Schemes 71
5.2.2 Comparative Study of Acoustic Features 78
5.2.2.1 LPC Parameter Versus Cepstrum Parameter 78
5.2.2.2 Other Acoustic Parameters 79
5.2.3 Comparative Study Using Different Phonemes 84
5.2.4 Comparative Study of Filter Order Variation 85
5.2.4.1 LPC Log Likelihood and Cepstral Distortion Measure Cases 85
5.2.4.2 Euclidean Distance Versus Probability Density Function 87
5.2.5 Comparative Study of Distance Measures 88
5.2.6 Comparative Study Using Different Procedures 90
5.2.7 Variability of Female Voices 93
5.3 Comparative Study of Acoustic Parameters Using Fisher's Discriminant Ratio Criterion 93
5.4 Conclusions 102
6 EXPERIMENTAL DESIGN BASED ON FINE ANALYSIS 106
6.1 Introduction 106
6.2 Limitations of Conventional LPC 107
6.2.1 Influence of Voice Periodicity 108
6.2.2 Source-Tract Interaction 111
6.3 Closed Phase WRLS-VFF Analysis 113


Table 7.3
Vowel characteristics obtained by Peterson and Barney
(after Peterson and Barney (1952))

                         IY     I     E    AE     A    OW     U    OO    UH    ER
Fundamental      M      138   135   130   127   124   129   137   141   130   133
Frequency        F      235   232   223   210   212   216   232   231   221   218
(Hz)             Ch     272   269   260   251   256   263   276   274   261   261
Formant     F1   M      270   390   530   660   730   570   440   300   640   490
Frequency        F      310   430   610   860   850   590   470   370   760   500
(Hz)             Ch     370   530   690  1010  1030   680   560   430   850   560
            F2   M     2290  1990  1840  1720  1090   840  1020   870  1190  1350
                 F     2790  2480  2330  2050  1220   920  1160   950  1400  1640
                 Ch    3200  2730  2610  2320  1370  1060  1410  1170  1590  1820
            F3   M     3010  2550  2480  2410  2440  2410  2240  2240  2390  1690
                 F     3310  3070  2990  2850  2810  2710  2680  2670  2780  1960
                 Ch    3730  3600  3570  3320  3170  3180  3310  3260  3360  2160
Formant     A1          -4    -3    -2    -1    -1     0    -1    -3    -1    -5
Amplitude   A2         -24   -23   -17   -12    -5    -7   -12   -19   -10   -15
(dB)        A3         -28   -27   -24   -22   -28   -34   -34   -43   -27   -20


Zin and there will be no source-tract interaction. This is true, however, only when Ag(t) is very small or zero. During the open phase, the glottal impedance is comparable to Zin, and effectively increases the damping of the first formant.

The pitch-synchronized closed phase covariance (CPC) method can reduce the effect of source-tract interaction, and thus the spectral estimation error due to the interaction, since the analysis window is confined to the closed glottal region of each pitch period. However, in certain situations (e.g., the speech of women and children, fast transitions between certain vowels and consonants, and short closed glottal intervals), the vocal tract filter derived by the CPC method is not guaranteed to be stable and the formant variations cannot be tracked accurately.

6.3 Closed Phase WRLS-VFF Analysis
Since accurate estimation of vocal tract resonance frequencies (formants) and their bandwidths and amplitudes is essential before their significance for gender recognition can be assessed, analysis methods with better performance must be considered.

Sequential adaptive approaches that track the time-varying parameters of the vocal tract and update the parameters during the glottal closed phase interval can reduce the formant estimation error because they reduce the influences of pitch periodicity and source-tract interaction and adjust rapidly to fast changes in the speech signal. Results show (Ting, 1989) that the WRLS-VFF algorithm offers a more accurate formant estimation than frame-based LPC analysis.

6.3.1 Algorithm Description
It is generally assumed that the speech signal is generated by an all-pole model (sufficient for vowels) of order p represented by the following equation

s(n) = Σ_{k=1}^{p} a_k s(n-k) + e(n)

where e(n) is the excitation (prediction error).
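A minimal sketch of the recursive estimator is given below in Python, assuming the variable forgetting factor rule of Fortescue et al. (1981); the constants (sigma0, lam_min, the initial P) are illustrative assumptions rather than the values used in this work.

import numpy as np

def wrls_vff(s, p=10, lam_min=0.90, sigma0=1e-2):
    """Track a_k in s(n) = sum_k a_k s(n-k) + e(n) by recursive least
    squares with a variable forgetting factor."""
    s = np.asarray(s, float)
    theta = np.zeros(p)                 # current a_k estimates
    P = 1e4 * np.eye(p)                 # inverse correlation matrix
    track = np.zeros((len(s), p))       # coefficient trajectory
    for n in range(p, len(s)):
        phi = s[n - p:n][::-1]          # regressor [s(n-1), ..., s(n-p)]
        e = s[n] - phi @ theta          # a priori prediction error
        k = P @ phi / (1.0 + phi @ P @ phi)
        theta = theta + k * e
        # forgetting factor chosen to hold the information content constant
        lam = 1.0 - (1.0 - phi @ k) * e * e / sigma0
        lam = min(1.0, max(lam_min, lam))
        P = (P - np.outer(k, phi @ P)) / lam
        track[n] = theta
    return track

Formant frequencies and bandwidths would then be read from the roots of 1 - Σ a_k z^{-k} at the update instants, with the EGG-based closed-phase gating described later deciding when an update is trusted.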


comprehensive information on the vowels. The data in Tables 7.1 and 7.2 serve as a useful reference characterization of the vowels.

7.1.3 Results of Two-way ANOVA Statistical Test
Since formant characteristics such as F1, F2, B1, B2, etc. were influenced by gender as well as by vowel, and each subject was observed under more than one vowel, our experiments were referred to as two-factor experiments having repeated measures on the same subject (Winer, 1971). For the gender factor, denoted as A, there were two levels (i.e., male and female). For the vowel factor, denoted as B, there were 10 levels (i.e., ten different vowels). Each subject was observed under all levels of B. Therefore, the two-way ANOVA with repeated measures was used to perform the statistical test. The algorithm of Winer (1971) was adopted. Table 7.4 shows the results.

Based on these analysis data, the general test results on factor A (gender) can be summarized in Table 7.5.
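For readers reproducing this design with current tools, a minimal sketch follows in Python. The pingouin library's mixed_anova handles one between-subjects factor (gender) and one within-subjects factor (vowel), standing in for Winer's hand algorithm; the file and column names are assumptions for illustration.

import pandas as pd
import pingouin as pg

# long-format table: one row per subject-vowel observation (assumed layout)
df = pd.read_csv("vowel_features.csv")   # columns: subject, gender, vowel, F1

aov = pg.mixed_anova(data=df, dv="F1", within="vowel",
                     subject="subject", between="gender")
print(aov.round(4))    # F and p values for gender, vowel, and their interaction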
7.1.4 Results of T Statistical Test
Based on the data in Tables 7.1 and 7.2, a t-test (Ott, 1984) for each individual feature of the ten vowels for male and female speakers was also performed. The results are summarized in Table 7.6. Tables 7.1, 7.2, and 7.6 can serve as control strategies for synthesizing vowels with desired male or female voice quality.
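A per-feature test of this kind is a one-line computation today; the sketch below uses SciPy on synthetic F0 samples (the means, spreads, and counts are merely illustrative, loosely echoing the 270 male and 250 female vowels).

import numpy as np
from scipy import stats

male_f0 = np.random.default_rng(0).normal(125.0, 15.0, 270)    # synthetic data
female_f0 = np.random.default_rng(1).normal(225.0, 25.0, 250)  # synthetic data
t, p_val = stats.ttest_ind(male_f0, female_f0, equal_var=False)  # Welch t-test
print(f"t = {t:.2f}, p = {p_val:.3g}")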
7.1.5 Discussion
1. Fundamental frequencies (F0) of all vowels for male subjects were lower than those of female subjects. From Table 7.1, the average F0 over all male subjects and all vowels (270 vowels in total) was 124.6 Hz, versus 224.9 Hz for the female subjects (250 vowels in total). Table 7.5 shows that


Figure 7.2 The first formant characteristics of ten vowels. [Panels: averaged frequency (Hz), averaged bandwidth (Hz), and averaged amplitude (dB) of the first formant plotted against the vowels /IY/ /I/ /E/ /AE/ /A/ /OW/ /U/ /OO/ /UH/ /ER/.]


4.3.2 LPC Log Likelihood Distance
This distance was proposed by Itakura (1975) and is defined as

D_LPC(a, â) = log [ (a' R a) / (â' R â) ]                         (4.16)

where a and â are the LPC coefficient vectors of the reference and test speech, and R is the matrix of autocorrelation coefficients of the test speech. An interpretation of this formula is given in Figure 4.3 below, in which the subscript r denotes reference and the subscript t denotes test.

The denominator term can be obtained by passing the test speech signal S_t(n) through the inverse LPC system of the test, H_t(z), giving the energy α of the error signal. Similarly, the numerator term can be obtained by passing the same test signal S_t(n) through the inverse LPC system of the reference, H_r(z), giving the energy β of the error signal.

Thus we obtain

D_LPC(a, â) = log (β/α)                                           (4.17)

It can also be shown that this distance measure is related to the spectral dissimilarity between the test and reference speech signals.

For computational efficiency, variables can be changed and Equation (4.16) or (4.17) can be rewritten as

D_LPC(a, â) = log [ (Σ_{k=0}^{p} r_a(k) r(k)) / E ]               (4.18)
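The measure is straightforward to implement. A sketch in Python (numpy/scipy) follows, computing Equation (4.16) directly from the test segment's autocorrelation; the helper autocorr is a hypothetical convenience, and each coefficient vector is assumed to include the leading 1 of the inverse filter.

import numpy as np
from scipy.linalg import toeplitz

def autocorr(x, p):
    """Biased autocorrelation r(0..p) of a speech segment."""
    x = np.asarray(x, float)
    return np.array([x[: len(x) - k] @ x[k:] for k in range(p + 1)])

def itakura_distance(a_ref, a_test, r_test):
    """D = log[(a_ref' R a_ref) / (a_test' R a_test)], with R built as a
    Toeplitz matrix from the test segment's autocorrelation sequence."""
    R = toeplitz(r_test[: len(a_ref)])
    beta = a_ref @ R @ a_ref      # residual energy through the reference filter
    alpha = a_test @ R @ a_test   # residual energy through the test filter
    return np.log(beta / alpha)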


1. The well-known linear prediction coding (LPC) vocoder is an efficient vocoder which, when used as a model, encompasses the features of the vocal source (except for the fundamental frequency) as well as the vocal tract (Rabiner and Schafer, 1978). Since gender features are believed to be included in both the vocal source and tract, satisfactory results would be expected using LPC-derived parameters.
2. The LPC all-pole model has a smoothed, accurate spectral envelope matching characteristic, especially for vowels. Formant frequency measurements obtained by LPC have also been found to compare favorably to measures obtained by spectrographic analysis (Monsen and Engebretson, 1983; Linville and Fisher, 1985). Thus it is expected that features obtained by LPC would represent the spectral characteristics of both genders more accurately.
3. The LPC model has been successfully applied in speech and speaker recognition (Makhoul, 1975a; Atal, 1974b, 1976; Rosenberg, 1976; Markel, 1977; Davis and Mermelstein, 1980; Rabiner and Levinson, 1981). Moreover, many related distortion or distance measures have been developed (Gray and Markel, 1976; Gray et al., 1980; Juang, 1984; Nocerino et al., 1985) which could be conveniently adopted for the preliminary experiments of gender recognition.
4. Deriving acoustic parameters from the LPC model is computationally fast and efficient, and only short data records are needed. This is a very important factor in designing an automatic gender recognition system.


5
PHYSIOLOGICAL:
VOCAL FOLD LENGTH
& THICKNESS
VOCAL TRACT
LENGTH, AREA
& SHAPE
ACOUSTIC:
FUNDAMENTAL
FREQUENCY
VOCAL TRACT FEATURES:
FORMANT FREQUENCY
BANDWIDTH &
AMPLITUDE
GLOTTAL VOLUME
.VELOCITY WAVESHAPE,
Figure 1.2 Basic gender features.


frequencies of male subjects were lower than those of female subjects for all vowels, and the statistical differences of fundamental frequencies between male and female speakers were highly significant. For both male and female speakers, glottal vibration patterns were relatively less variable than vocal tract shapes when different vowels were pronounced.
2. Formant frequencies (F1, F2, F3, F4) from male vowels were lower than those from female vowels. Generally speaking, the bandwidths of the formants from male vowels were narrower than those from female vowels, and the amplitudes of the formants from male vowels were higher than those from female vowels. Statistical differences of F1, F2, F3, and F4, B1, B2, and B4, and A2, A3, and A4 between male and female speakers were highly significant. This suggested a steeper spectral slope for females.
3. Considerable redundant information concerning gender appeared to be imbedded in the formant and pitch features. However, in terms of the relative importance of formant frequency, bandwidth, or amplitude for objectively distinguishing male/female voices, the highest recognition rate (98.1%) was achieved when using the formant frequency. In terms of the first, second, third, or fourth formant characteristics, using the second formant information showed the highest recognition rate (98.1%) with the EUC distance measure. The second formant frequency and bandwidth also individually showed the highest recognition rates (98.1% and 92.3%) in the 4-frequency and 4-bandwidth groups respectively. With the PDF distance measure, the third formant, including frequency, bandwidth, and amplitude, appeared to be the best feature (100%).


become the predominant technique for estimating the basic speech parameters (e.g., pitch, formants, spectra, vocal tract area functions) and for representing speech for low bit-rate transmission or storage. The method was first applied to speech processing by Atal and Schroeder (1970) and Atal and Hanauer (1971). For speech processing, the term linear prediction refers to a variety of essentially equivalent formulations of the problem of modeling the speech waveform (Markel and Gray, 1976; Makhoul, 1975b). These different models usually lead to similar results, but each formulation has provided an insight into the speech modeling problem, and the choice among them is generally dictated by computational demands.

The particular form of this model that is appropriate for this research is depicted in Figure 4.2. In this case, the composite spectrum effects of radiation, vocal tract, and glottal excitation are represented by a time-varying digital filter whose steady-state system function is of the form

H(z) = S(z)/U(z) = G / (1 - Σ_{k=1}^{p} a_k z^{-k})               (4.1)

This system is excited by an impulse train for voiced speech or a random noise sequence for unvoiced speech.

The above system function can be written alternatively in the time domain as

s(n) = Σ_{k=1}^{p} a_k s(n-k) + G u(n)                            (4.2)

Let us assume we have available past data samples from (n-p) to (n-1) and we are predicting the nth sample


Figure 7.5 The fourth formant characteristics of ten vowels. [Figure: fourth formant measures, including amplitude (dB), plotted against vowel.]

these covariance matrices describes the scatter within a class; hence it corresponds to one term of the average in the denominator of (4.40). If we make the common assumption that the vector f_i is normally distributed, then W_i is the covariance matrix of the corresponding probability density function:

PDF_i(f) = (2π)^{-n/2} |W_i|^{-1/2} exp[ -(f - μ_i)' W_i^{-1} (f - μ_i) / 2 ]

Then the denominator of the F ratio can be associated with the average of W_i over all i; this is called the pooled within-class covariance matrix:

W = ⟨ W_i ⟩                                                       (4.29)

Second, the variation within classes can be ignored and the covariance between classes can be found, representing each class by its centroid. The feature centroid for class i is μ_i; hence the between-class covariance matrix is

B = ⟨ (μ_i - μ)(μ_i - μ)' ⟩                                       (4.30)

where μ is the mean of μ_i over all classes. B stands for between. Here we ignore the detailed distribution within each class and represent all the data for that class by its mean. Hence B describes the scatter from class to class regardless of the scatter within a class and in that sense corresponds to the numerator of (4.40).

Then the generalization we seek should involve a ratio in which the numerator is based on B and the denominator on W, since we are looking for features with


the other hand, the LPC of vowels and voiced fricatives generated J1 values of 10.9 and 8.17 respectively, predicting that the LPC classifier using voiced fricatives had lower separability for gender recognition, which was true in the experiments.
5. By comparing Tables 5.6 and 5.7, we see that with a filter order of 20 the values of J1 greatly increased from those with a filter order of 12, while the values of J4 changed only slightly (most of them slightly decreased). As the values of J1 increased, the expected probabilities of error calculated from J1 decreased. However, the most contradictory phenomenon is that the experimental error rates showed a considerable increase from those using a filter order of 12. This does not agree with the predictions of J4 or J1, or with the expected probabilities of error. Therefore, J1 and J4 were unreliable in predicting the performance of a gender classifier in this case.

One cause of this problem is, again, probably the small ratio of the available number of subjects per gender to the number of elements per feature vector. As we discussed in the previous section, if this ratio is small, data classification for both design and test sets may be unreliable (Foley, 1972; Childers, 1986). This ratio should be on the order of three or larger (Foley, 1972). In this study, the ratios were 2.17 and 1.3 for filter orders of 12 and 20 respectively. While the value of 2.17 might still be considered marginal for designing a classifier, the value of 1.3 was too small. This may explain why J1 and J4 failed to predict the performance of a gender classifier using a filter order of 20. Another cause for this failure might be the peaking phenomenon (Hughes, 1968). This can occur because, if we use too few features, the classifier performance


ACKNOWLEDGMENTS
The invaluable guidance, encouragement, and support I have received from
my adviser and committee chairman, Dr. D. G. Childers, during the years of my
graduate education are most appreciated. I am sincerely grateful for his direction,
insight, and patience throughout this dissertation research.
I would especially like to thank Dr. J. R. Smith, Dr. A. A. Arroyo, Dr. J. C.
Principe, and Dr. H. B. Rothman for their interest and participation in serving on my
supervisory committee and their productive criticism of my research project.
The partial support by the National Institutes of Health, National Science
Foundation, and University of Florida Center of Excellence Program is gratefully
acknowledged.
Special thanks are also extended to my fellow graduate students and other
members of the Mind-Machine Interaction Research Center for their friendship,
encouragement, and skillful technical help.
Last but not least, I am greatly indebted to my wife, Hong-gen, and my parents for their love, support, understanding, and patience. My gratitude to them is beyond description.


small covariances within classes and large covariances between classes. Fukunaga (1972) lists four such measures, two of which are

J1 = trace (W^{-1} B)                                             (4.31)

and

J4 = trace B / trace W                                            (4.32)

The motivation for these measures is clearer for J4, since we know that the trace of a covariance matrix provides a measure of the total variance of its associated variables (Parsons, 1986). If the value of J4 for one feature set is relatively greater than that for another, then there is apparently more scatter between classes than within classes for that feature set, and it is the better one for discrimination. J4 tests this ratio directly. The motivation for J1 is less obvious and will have to await the presentation of the material below.

4.6.2 Divergence and Probability of Error
The distance between two classes in feature space may also be evaluated by the divergence, which is defined as the difference in the expected values of their log-likelihood ratios (Kullback, 1959; Tou and Gonzalez, 1974). This measure has its roots in information theory (Kullback, 1959) and is a measure of the average amount of information available for discriminating between class i and class k. It can be shown that for features with multivariate normal densities, the divergence is given by

D_ik = 0.5 trace [(W_i - W_k)(W_k^{-1} - W_i^{-1})]
     + 0.5 trace [(W_i^{-1} + W_k^{-1})(μ_i - μ_k)(μ_i - μ_k)']   (4.33)
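These quantities are simple to compute from per-gender feature matrices. A minimal sketch in Python (numpy), assuming rows are subjects and columns are feature elements, implements Equations (4.29)-(4.33) for the two-class case:

import numpy as np

def separability(X_male, X_female):
    W_m, W_f = np.cov(X_male.T), np.cov(X_female.T)
    mu_m, mu_f = X_male.mean(axis=0), X_female.mean(axis=0)
    W = 0.5 * (W_m + W_f)                        # pooled within-class, Eq (4.29)
    mu = 0.5 * (mu_m + mu_f)
    B = 0.5 * (np.outer(mu_m - mu, mu_m - mu) +
               np.outer(mu_f - mu, mu_f - mu))   # between-class, Eq (4.30)
    J1 = np.trace(np.linalg.solve(W, B))         # Eq (4.31)
    J4 = np.trace(B) / np.trace(W)               # Eq (4.32)
    d = mu_m - mu_f
    Wm_inv, Wf_inv = np.linalg.inv(W_m), np.linalg.inv(W_f)
    D = (0.5 * np.trace((W_m - W_f) @ (Wf_inv - Wm_inv)) +
         0.5 * np.trace((Wm_inv + Wf_inv) @ np.outer(d, d)))  # Eq (4.33)
    return J1, J4, D

The equal class weighting here is an assumption; with 27 males and 25 females the averages could instead be weighted by class size.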


Table 5.6
Estimated values of J4 and J1, expected probability of errors, and
experimental error rates for various acoustic parameters
Filter order = 12

                         J4      J1     EXPECTED       EXPERIMENTAL
                                        PROBABILITY    ERROR
                                        OF ERROR       RATE*
Sustained    ARC        0.10    4.34      0.21           0.15
Vowels       LPC        0.16   10.9       0.09           0.02
             FFF        3.35   10.4       0.09           0.04
             RC         0.63    9.29      0.11           0.02
             CC         0.27    5.11      0.19           0.06
Unvoiced     ARC        0.14    2.81      0.27           0.35
Fricatives   LPC        0.23    4.46      0.20           0.13
             RC         0.14    3.38      0.24           0.27
             CC         0.16    3.46      0.24           0.27
Voiced       ARC        0.37    5.92      0.17           0.13
Fricatives   LPC        0.36    8.17      0.12           0.06
             RC         0.47    5.84      0.17           0.10
             CC         0.20    6.04      0.16           0.08

*Obtained from the exclusive recognition Scheme 3 using the PDF distance measure.


CHAPTER 3
DATA COLLECTION AND PROCESSING

3.1 Database Description
The database consists of speech and EGG data collected from 52 normal subjects (27 males and 25 females), with the speakers' ages varying from 20 to 80 years. The speech and EGG signals were digitized simultaneously and synchronously. Each subject read, after some practice, the following SAMPLE PROTOCOL, which includes 27 tasks.

SAMPLE PROTOCOL
Task 1. Count 1-10 with comfortable pitch & loudness.
2. Count 1-5 with progressive increase in loudness.
3. Sustain phonation of the vowel /IY/ in the word BEET.
4. Sustain phonation of the vowel /I/ in the word BIT.
5. Sustain phonation of the diphthong /AI/ in the word BAIT.
6. Sustain phonation of the vowel /E/ in the word BET.
7. Sustain phonation of the vowel /AE/ in the word BAT.
8. Sustain phonation of the vowel /OO/ in the word BOOT.
9. Sustain phonation of the vowel /U/ in the word BOOK.
10. Sustain phonation of the diphthong /OU/ in the word BOAT.
11. Sustain phonation of the vowel /OW/ in the word BOUGHT.
12. Sustain phonation of the vowel /A/ in the word BACH.
13. Sustain phonation of the vowel /UH/ in the word BUT.
14. Sustain phonation of the vowel /ER/ in the word BURT.
15. Sustain phonation of the whisper /H/ in the word HAT.
16. Sustain phonation of the fricative /F/ in the word FIX.
17. Sustain phonation of the fricative /TH/ in the word THICK.
18. Sustain phonation of the fricative /S/ in the word SAT.


5.2.4.2 Euclidean Distance Versus Probability Density Function
1. By examining Figure 5.3, it is seen that using the Euclidean distance measure, the recognition rates increased slightly from a filter order of 8 to 12, with the exception of the LPC for unvoiced fricatives. Recognition rates with a filter order of 12 were almost the same as those with 16 and 20. No specific trend was observed. Except for the ARC applied to the vowel category, all other performances reached their peaks with a filter order of either 12 or 16. It can be concluded that, using the EUC distance measure, the best choice of filter order would be in the range of 12 to 16.
2. However, Figure 5.4 shows a different picture. By inspecting Figure 5.4, it is immediately apparent that, using the PDF, gender recognition rates varied considerably with the filter order. The overall trend for the vowel category is that recognition rates increased from a filter order of 8, reached their peak at a filter order of 12, and then decreased. One exception is that with the RC, the recognition rate reached its peak at a filter order of 16. All acoustic parameters, except the LPC, for voiced and unvoiced fricatives showed decreasing recognition rates from a filter order of 8 to an order of 20. With the LPC coefficients, performance showed some improvement from a filter order of 8 to 12 and 16 and then degraded. Finally, recognition accuracies declined severely from a filter order of 16 to an order of 20 for all three phoneme categories and all acoustic parameters. It can be concluded that, using the PDF, the best option for the filter order would be 8 or 12.


the same subject (Winer, 1971). Therefore, a two-way ANOVA was used to perform the statistical test. The significance of the difference between each individual feature in terms of the male/female groups was analyzed.

Automatic recognition. First, individual or grouped features, such as only the fundamental frequency or only the formant frequencies or bandwidths (but from all formants), were used to form the reference and test templates. Then the automatic recognition schemes were applied to these templates. Finally, the recognition error rates for the different features were compared.

In Chapter 6, the detailed background of the closed phase WRLS-VFF method and the experimental design based on fine analysis will be presented.


Figure 7.1 Fundamental frequencies of ten vowels. [Figure: averaged vowel fundamental frequency (Hz), approximately 100-260 Hz, plotted against vowel, with separate curves for male and female speakers.]


source-tract interaction) and closed-phase (little or no source-tract interaction) conditions.

Frame-based asynchronous LPC analysis cannot reduce the effect of source-tract interaction because this technique uses windows that average the data over several excitation epochs. The pitch-synchronized closed phase covariance (CPC) method can reduce the effect of source-tract interaction. However, in certain situations, the vocal tract filter derived by this method may be unstable because of the short closed glottal intervals, especially for females and children (Ting et al., 1988).

Sequential adaptive analysis methods offer an attractive alternative processing strategy since they overcome some of the drawbacks of frame-based analysis. The closed-phase WRLS-VFF method, which tracks the time-varying parameters of the vocal tract and updates the parameters during the glottal closed phase interval, can reduce the formant estimation error. Experimental results (Ting et al., 1988; Ting, 1989) show that the formant tracking ability and formant estimation accuracy of the WRLS-VFF algorithm are superior to those of the LPC-based method. Detailed formant features, including frequencies, bandwidths, and amplitudes, were obtained in the fine analysis stage by using this method. The EGG signals were used to assist in locating the closed phase portion of the speech signal (Childers and Larar, 1984; Krishnamurthy and Childers, 1986).

There were two approaches for testing the relative importance of various vowel features for gender recognition:

Statistical tests. Since formant characteristics such as frequencies, bandwidths, and amplitudes depend on or are influenced by two factors (i.e., gender as well as vowel) and each experimental subject produces more than one vowel, our experiments should be referred to as two-factor experiments having repeated measures on


4.4 Template Formation and Recognition Schemes
4.4.1 Purpose of Design
Another important issue in developing a recognition system is the selection of appropriate template formation and recognition schemes.

During initial exploratory studies of fixed-text recognition using spectral pattern matching techniques in the Pruzansky study (1963), the use of the long-term average technique to form a feature vector was discovered to have potential for free-text speaker recognition. The speaker recognition error rate was found to remain undegraded (at 11 percent) even after spectral amplitudes were averaged over all frames of speech data into a single reference spectral amplitude vector for each talker. Markel et al. (1977) demonstrated that the between-to-within speaker variation ratio was significantly increased under long-term averaging of the parameter sets (thus free-text).

Temporal cues also appeared to play no role in speaker gender identification (Lass and Mertz, 1978): gender identification accuracy remained high and unaffected by temporal speech alterations when the normal temporal features of speech were altered by means of backward playing and time compression of the speech samples.

Therefore, we may reasonably believe that gender information is time-invariant. Thus, long-term averaging should also emphasize the speaker's gender information and increase the between-to-within gender variation ratio. In practice we would also achieve free-text gender recognition, in which gender would be determined before recognition of speech or speaker and thus reduce the speech or speaker recognition search space to half.
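The template scheme itself is a few lines of arithmetic. A minimal sketch follows in Python, where the per-frame feature extraction is left abstract; the function names and the Euclidean decision rule shown here are illustrative, and the dissertation compares several distance measures.

import numpy as np

def make_template(frames):
    """Long-term average: collapse per-frame feature vectors (rows)
    into a single reference or test template."""
    return np.mean(np.asarray(frames, float), axis=0)

def classify(test_frames, male_template, female_template):
    """Nearest neighbor gender decision using the Euclidean distance."""
    t = make_template(test_frames)
    d_male = np.linalg.norm(t - male_template)
    d_female = np.linalg.norm(t - female_template)
    return "male" if d_male < d_female else "female"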
The purpose of using different test and reference template formation schemes
is to verify the hypothesis above and, if it is correct, to determine how much


Table A.6
Results from exclusive recognition schemes
Cepstrum distance measure
Filter order = 12

CORRECT RATE (%)
                          MALE    FEMALE    TOTAL
Sustained    Scheme 1     68.9     67.6      68.3
Vowels       Scheme 2     68.2     66.4      67.3
             Scheme 3     88.9     96.0      92.3
             Scheme 4     88.9     96.0      92.3
Unvoiced     Scheme 1     65.2     66.4      65.8
Fricatives   Scheme 2     57.8     65.6      61.5
             Scheme 3     74.1     76.0      75.0
             Scheme 4     85.2     92.0      88.5
Voiced       Scheme 1     83.3     82.0      82.7
Fricatives   Scheme 2     85.2     79.0      82.2
             Scheme 3    100.0     96.0      98.1
             Scheme 4     92.6     92.0      92.3


probability of error, templates of all subjects for each gender were used and a pooled or averaged within-gender covariance matrix was formed to calculate J1. On the other hand, to compute an experimental error rate, the exclusive (leave-one-out) recognition scheme was used and separate within-gender covariance matrices were formed for the male and female classes to calculate the PDF distances. The model differences probably caused the differences between the experimental error rates and the corresponding expected probabilities of error, yet why the differences became smaller as the speech signals became noisier remains an open question.
4. It appeared more reliable to use J1 than J4 to predict the performance of a gender classifier. It can be noted from Table 5.6 that the FFF had an extremely large value of J4 (3.35) for vowels compared to the other features (the RC had 0.63 and the LPC had 0.16). However, the FFF generated an experimental error rate of 0.04, which was not even smaller than the LPC or RC experimental error rates (0.02); the FFF did not achieve extremely high performance in practice. On the other hand, the LPC, FFF, and RC yielded J1 values of 10.9, 10.4, and 9.29 respectively, with little difference among them. The performance predicted from J1 and the actual performance of the classifiers were essentially consistent. Another example was the LPC feature. The LPC of vowels and voiced fricatives produced J4 values of 0.16 and 0.36 respectively, indicating that the LPC classifier using voiced fricatives should have had higher separability for gender recognition. However, the actual performance of the LPC classifier was the other way around: the experimental error rates were 0.06 for voiced fricatives and 0.02 for vowels. On


Table B.2
Results from the exclusive recognition scheme 3
Euclidean distance measure
Filter order = 12

CORRECT RATE (%)
                        MALE    FEMALE    TOTAL
Sustained    ARC        85.2     72.0      78.8
Vowels       LPC        74.1     84.0      78.8
             FFF        96.3    100.0      98.1
             RC        100.0    100.0     100.0
             CC         88.9     96.0      92.3
Unvoiced     ARC        74.1     76.0      75.0
Fricatives   LPC        70.4     68.0      69.2
             RC         81.5     80.0      80.8
             CC         74.1     76.0      75.0
Voiced       ARC        88.9     88.0      88.5
Fricatives   LPC        96.3     88.0      92.3
             RC         96.3     96.0      96.2
             CC        100.0     96.0      98.1


one sample point (Krishnamurthy, 1983). In general, the instant of glottal opening as measured by the EGG is more variable than the measured instant of glottal closure (Childers et al., 1983; Krishnamurthy, 1983). This variability is certainly tolerable for isolating closed (or open) phase glottal segments, since the speech data selected for analysis can be shortened at both ends by two or three samples to assure that the segment is truly representative of a closed (or open) glottal interval.

The fact that the EGG is a reliable source of glottal vibratory information makes it quite useful for the present study. The EGG signal was used to check for the glottal closed phase and, when it was found, the WRLS-VFF algorithm updated the filter coefficients at this optimal position.
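A minimal sketch of this gating step is given below in Python. Closure and opening instants are taken as opposite-signed extrema of the differentiated EGG; the polarity convention, the threshold, and the two-sample trim are assumptions for illustration (the trim mirrors the two-to-three sample shortening described above).

import numpy as np

def closed_phase_intervals(egg, thresh=0.3, trim=2):
    """Return (start, end) sample pairs lying inside glottal closed phases."""
    degg = np.diff(np.asarray(egg, float))
    lim = thresh * np.max(np.abs(degg))
    mid = degg[1:-1]
    # local extrema of the differentiated EGG above/below the threshold
    closures = np.flatnonzero((mid > lim) & (mid >= degg[:-2]) & (mid >= degg[2:])) + 1
    openings = np.flatnonzero((mid < -lim) & (mid <= degg[:-2]) & (mid <= degg[2:])) + 1
    intervals = []
    for c in closures:
        later = openings[openings > c]     # pair each closure with the next opening
        if later.size and later[0] - c > 2 * trim:
            intervals.append((c + trim, later[0] - trim))   # shrink both ends
    return intervals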
6.4 Testing Methods
In order to select the vowel features most responsible for distinguishing gender, each individual feature or group of features was tested for its relative importance after the features (i.e., fundamental frequencies and formant features including frequencies, bandwidths, and amplitudes) were available for both genders. There are two possible approaches:

Statistical tests. Since we are interested in making the descriptions of gender features more precise and in determining whether there are any significant differences between the two gender groups' features and how significant these differences are, statistical methods can be applied to assist in evaluating and validating the reliability of observed differences, and in determining the degree of confidence we may place in certain generalizations about the observations. The experiments in this study are referred to as two-factor experiments having repeated measures on the same subject


Because of this, averaging emphasized the speaker's gender information and increased the between-to-within gender variation ratio. In practice we would achieve free-text gender recognition, in which gender identification would be determined before recognition of speech or speaker and thus reduce the speech or speaker recognition search space to half.

This conclusion is consistent with the findings of Lass and Mertz (1978) that temporal cues appeared to play no role in speaker gender identification. As we cited earlier, in their listening tests they found that gender identification accuracy remained high and unaffected by temporal speech alterations when the normal temporal features of speech were altered by means of backward playing and time compression of the speech samples.

We have shown in the previous section that the use of the long-term average technique to form a feature vector was discovered to have potential for free-text speaker recognition in the Pruzansky study (1963). Speaker recognition error rates remained undegraded even after averaging spectral amplitudes over all frames of speech data into a single reference spectral amplitude vector for each talker. Markel et al. (1977) also demonstrated that the between-to-within speaker variation ratio was significantly increased by long-term averaging of the parameter sets (thus text-free). Here we found that this rule also applied to gender recognition.
4. In terms of Scheme 3 versus Scheme 4, neither was obviously superior. However, from a practical point of view, Scheme 3 would be easier to realize since only two reference templates are needed.


CHAPTER 6
EXPERIMENTAL DESIGN BASED ON FINE ANALYSIS

6.1 Introduction
The purpose of fine analysis is to perform a detailed study of the vowel characteristics responsible for distinguishing a speaker's gender. Perceptually, vowel characteristics are considered to convey the most important gender features and to provide most of the vocal fold and vocal tract properties (Carlson, 1981). Male/female vowel characteristics are mainly distinguished by the fundamental frequencies of the vowels and the resonance (formant) information of the vowels. Therefore, a thorough evaluation of these features for gender recognition is necessary, and previous research on vowel characteristics is far from complete.

First, the relative importance of fundamental frequency (F0) versus vocal tract resonance (VTR) characteristics for perceptual male or female voice quality is still controversial. The belief that F0 is the strongest cue to gender is substantiated by the evidence of previous research. There is a hypothesis that in situations in which F0 is masked due to deviation, the effect of VTR characteristics upon gender judgments increases from a minimal level to a larger role which is equal to, and sometimes greater than, that played by F0 (Carlson, 1981). This hypothesis needs further verification.

Second, the influence of the bandwidth and amplitude of formants and of the overall spectral shape on gender cues has not been thoroughly considered and investigated. Mostly, the contribution of vocal tract characteristics to gender perception was only


suffers because there is not sufficient data to properly classify the test samples. However, if we use too many features, the added features may be unreliable or noisy and thus decrease the performance of the classifier.
6. In summary, for the filter order of 12, the analytical inferences from the values of J1 and J4 and the expected probabilities of error using various acoustic features proved comparable to the empirical results of the experiments with the PDF distance measure for gender recognition. Furthermore, J1, the Mahalanobis distance, appeared to be more reliable than J4 for predicting the performance of a gender classifier.

5.4 Conclusions
Considering that only approximately 150 ms of the speech signal were used for the experiments, the performance of automatic gender recognition by computer is very encouraging. The conclusions of the above discussion can be summarized as follows:
1. Most of the LPC-derived feature parameters functioned well for gender recognition. Among them, the reflection coefficients combined with the EUC distance measure were the most robust for sustained vowels (100%), and the results were quite consistent and filter order independent. While the cepstral distortion measure worked extremely well for unvoiced fricatives (90.4%), the LPC log likelihood distortion measure, the reflection coefficients combined with the EUC distance measure, and the cepstral distortion measure were the best options for voiced fricatives (98.1%) (Table 5.8). Hence carefully selecting the acoustic feature vector combined with


Yegnanarayana, B., Naik, J. M. and Childers, D. G. (1984). Voice simulation:
factors affecting the quality and naturalness, 10th International Conference on
Computational Linguistics, 22nd Annual Meeting of the Association for
Computational Linguistics, Proceedings of Coling84, Stanford University, Stanford,
530-533.


unvoiced fricatives came from the CC with the EUC using Scheme 4 (Table 5.2). They were 88.5% with a filter order of 12 and 90.4% with a filter order of 16.

Fourth, the EUC distance measure functioned more evenly on the male and female groups than did the PDF. By examination of all tables of exclusive schemes in Appendix B, 43 male recognition rates were higher than those of females out of a total of 48 PDF pairs; in only 5 pairs were the female recognition rates higher than those of males. The largest difference between gender groups was 68.6% for the PDF (from the ARC for unvoiced fricatives with a filter order of 20). On the other hand, in 29 out of 49 EUC pairs, male rates were higher than those of females, and the largest gap between gender groups was only 21% (from the LPC for vowels with a filter order of 8).

A possible reason for this inferior PDF performance is the small ratio of the available number of subjects per gender to the number of elements (measurements) per feature vector. The assumption when using the PDF distance measure to design a classifier is that the data are normally (Gaussian) distributed. In this case, many factors are considered (e.g., the size of the training or design set and the number of measurements (observations, samples) in the data record (or vector)). Foley (1972) and Childers (1986) pointed out that if the ratio of the available number of samples per class (in this study, the number of subjects per gender) to the number of samples per data record (in this study, the number of elements per feature vector) is small, then data classification for both design and test sets may be unreliable. This ratio should be on the order of three or larger (Foley, 1972). In our study, the ratios were 3.25 (26/8), 2.17 (26/12), 1.63 (26/16), and 1.3 (26/20) for filter orders of 8, 12, 16, and 20 respectively. The value of 3.25 satisfied the requirement but the others were too small. Therefore, with the exception of the results with a filter order of 8, where the performances of the PDF and EUC were


formants showed high recognition rates (over 90%), yet the corresponding individual bandwidths and amplitudes did not achieve high recognition rates in (b). This suggests that a combination of formant features improved the performance. It was also noted that when using EUC, the third formant had the lowest recognition rate, but when using the PDF distance measure, the third formant had the highest recognition rate (100%).
3. With regard to the relative importance of the frequency, bandwidth, or amplitude features of formants for objectively distinguishing the speaker's gender, the results in (d) disclosed that the highest recognition rate was obtained by using formant frequencies (98.1%). The second highest was obtained using amplitudes (96.2%) and the lowest using bandwidths (84.6%). Notice from the results in (b) that by using the individual A1, A2, A3, and A4, only 46.2%, 76.9%, 80.8%, and 82.7% recognition rates were obtained respectively. It was realized that the combined effect of the amplitudes was more sensitive for gender recognition than the individual effect of each amplitude. However, the same did not hold for bandwidths: the combined effect (84.6%) was not more sensitive than the individual effect of each bandwidth (80.8%, 92.3%, 61.5%, and 67.3% for B1, B2, B3, and B4 respectively).
4. In terms of the relative importance of fundamental frequency versus formant characteristics for objectively distinguishing the speaker's gender, the results in (a) indicated that when using only the fundamental frequency, a 96.2% recognition rate was achieved. When using only formant information in (e), a slight improvement (98.1%) was obtained. Considering that there was only one subject difference


Figure 4.9 (continued) [Diagrams: (c) structure of recognition Scheme 3 with test subject; (d) structure of recognition Scheme 4 with test subject and median layers.]


TABLE OF CONTENTS
Page
ACKNOWLEDGMENTS iii
ABSTRACT vii
CHAPTER
1 INTRODUCTION 1
1.1 Automatic Gender Recognition 1
1.2 Application Perspective 2
1.3 Literature Review 4
1.3.1 Basic Gender Features 4
1.3.2 Acoustic Cues Responsible for Gender Perception 13
1.3.3 Summary of Previous Research 17
1.4 Objectives of this Research 20
1.5 Description of Chapters 21
2 APPROACHES TO GENDER RECOGNITION FROM SPEECH 23
2.1 Overview of Research Plan 23
2.2 Coarse Analysis 23
2.3 Fine Analysis 26
3 DATA COLLECTION AND PROCESSING 29
3.1 Database Description 29
3.2 Speech and EGG Digitization 30
3.3 Synchronization of Data 32
4 EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS 34
4.1 Asynchronous LPC Analysis 34
4.1.1 Linear Prediction Concepts 34
4.1.2 Analysis Conditions 39
4.2 Acoustic Parameters 40
4.2.1 Autocorrelation Coefficients 40


where r(k) is the autocorrelation of the speech segment to be recognized, E is the total squared LPC prediction error associated with the estimates â(k) from this segment, and r_a(k) is the autocorrelation of the true (reference) LPC coefficients. The block diagram for this subroutine is shown in Figure 4.4.

The spectral domain interpretation of Equation (4.18) is (Rabiner and Levinson, 1981)

D_LPC(a, â) = log [ ∫_{-π}^{π} |H_r(e^{jω}) / H_t(e^{jω})|^2 dω/(2π) ]    (4.19)

(i.e., an integrated square of the ratio of the LPC spectra of the reference and test speech).

4.3.3 Cepstral Distortion
It can be defined as (Nocerino et al., 1985)

D_cep(c, ĉ) = Σ_{k=-n}^{n} (c_k - ĉ_k)^2                          (4.20)

It can be shown that the power spectrum corresponding to the cepstrum is a smoothed version of the true log spectral density function. The fewer the cepstral coefficients used, the smoother the resultant log spectral density. It can also be shown that this truncated cepstral distortion measure is a good approximation to the L2 norm of the log spectral distortion measure between two time series, x(n) and x̂(n),


Table 5.3
Results from the exclusive recognition scheme 3 with various filter orders,
acoustic parameters, and the Euclidean distance measure

CORRECT RATE (%)
                        Order=8   Order=12   Order=16   Order=20
Sustained    ARC         78.8      78.8       78.8       82.7
Vowels       LPC         73.1      78.8       80.8       80.8
             FFF         N/A       98.1       N/A        N/A
             RC          88.5     100.0      100.0      100.0
             CC          82.7      92.3       90.4       90.4
Unvoiced     ARC         75.0      75.0       75.0       75.0
Fricatives   LPC         80.8      69.2       71.2       71.2
             RC          80.8      80.8       80.8       80.8
             CC          71.2      75.0       84.6       82.7
Voiced       ARC         86.5      88.5       86.5       88.5
Fricatives   LPC         92.3      92.3       92.3       90.4
             RC          94.2      96.2       96.2       96.2
             CC          94.2      98.1       98.1       96.2


noted from (b) that the second formant frequency produced the highest correct recognition rate (98.1%). The recognition rate of the fourth formant frequency was 96.2%, with 94.2% for the third formant frequency and 90.4% for the first formant frequency. According to the theory of the ideal lossless tube (Rabiner and Schafer, 1978), formant frequencies are inversely proportional to the length of the tube, so all formants of a vowel should be shifted equally (see the sketch below). If the implication of equal shifting of formants held, the recognition rate would be the same for any individual formant frequency. However, the results demonstrated that different formants had different recognition rates, so the equal shifting of formants holds only under ideal conditions. Importantly, it was noted that by using the second formant bandwidth, a recognition rate of 92.3% was achieved, which was higher than that achieved using the first formant frequency. It can also be observed from Figure 7.3(b) that it is easy to draw a straight line (boundary) between the male and female curves of the second formant bandwidths. In conclusion, the second formant bandwidth might also be a sensitive gender discrimination indicator.
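The ideal tube prediction is easy to check numerically. In this minimal Python sketch (the 17 cm and 14.5 cm tract lengths are nominal textbook-style assumptions, not measurements from this study), every formant scales by the same ratio of tube lengths, which is why equal shifting would make all formant frequencies equally informative:

c = 34400.0                                           # speed of sound (cm/s)
for L, label in [(17.0, "male"), (14.5, "female")]:   # assumed tract lengths
    formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
    print(label, [round(f) for f in formants])        # F_n = (2n-1)c/(4L)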
2. In terms of the relative importance of individual formant characteristics for objectively distinguishing the speaker's gender, it was found from (c) that the highest recognition rate (98.1%) was reached by using the second formant information with the EUC distance measure. Recalling from the discussion above that the second formant frequency carried the most distinct gender information and that its bandwidth was also a good gender indicator, the results in (c) were reasonable. However, it is surprising that all individual


Table A.10
Results from inclusive recognition schemes
Cepstrum distance measure
Filter order = 16

CORRECT RATE (%)
                          MALE    FEMALE    TOTAL
Sustained    Scheme 1     78.2     72.0      75.2
Vowels       Scheme 2     74.8     70.0      72.5
             Scheme 3     92.6     96.0      94.2
Unvoiced     Scheme 1     68.9     72.0      70.4
Fricatives   Scheme 2     60.0     66.4      63.1
             Scheme 3     88.9     84.0      86.5
Voiced       Scheme 1     84.3     84.0      84.1
Fricatives   Scheme 2     90.7     83.0      87.0
             Scheme 3    100.0     96.0      98.1


possible that the asymmetrical, humped appearance of the male glottal wave may be due to a slightly out-of-phase movement of the upper and lower parts of each vocal fold. If this is so, then the generally symmetrical appearance of the female glottal wave may be due to the fact that the shorter female vocal folds come into contact with each other more nearly as a single mass (Ishizaka and Flanagan, 1972).

The perceptual parameters or strategies used to make decisions concerning male/female voices are not delineated in the literature, even though making this decision is a discrimination task performed routinely by human listeners. However, it is hypothesized that a limited number of perceptual cues for classifying voices do exist in the repertoire of listeners, and these cues may include some sociological factors such as cultural stereotyping.

Singh and Murry (1978) and Murry and Singh (1980) investigated the perceptual parameters of normal male and female voices. They found that the fundamental frequency and formant structure of the speaker appeared to carry significant information for all judgments. The listeners' judgments that the voices they heard were female were more dependent on judged qualities of voice and effort. Effort, pitch, and nasality were the perceptual parameters used to characterize female voices, while male voices were judged on the basis of effort, pitch, and hoarseness. Their results suggested that listeners may use different perceptual strategies to classify male voices than they use to classify female ones. Coleman (1976) also suggested the possibility of a gender-specific listener bias for one acoustic characteristic or for one gender over the other.

Many researchers also believe that melodic (intonation, stress, and/or coarticulation) cues are speech characteristics associated with female voices. Furthermore, the female voice is typically more breathy than the male voice. This can be modeled by a dc shift in the glottal wave or, as suggested by Singh and Murry (1978), is a result of a large number of pitch shifts. As the subject shifts pitch


Figure 7.9 Vowel triangles using our data and Peterson and Barney's data. [Plot: first formant frequency versus frequency of second formant (Hz).]


Morikawa, H., and Fujisaki, H. (1982). Adaptive analysis of speech based on a pole-zero representation, IEEE Trans. Acoust., Speech, and Signal Processing, Vol. ASSP-30(1), 77-88.
Murry, T., and Singh, S. (1980). Multidimensional analysis of male and female voices, J. Acoust. Soc. Am., Vol. 68(5), 1294-1300.
Naik, J. M. (1983). Synthesis and evaluation of natural sounding speech using the linear predictive analysis-synthesis scheme, Ph.D. Dissertation, University of Florida, Gainesville.
Nocerino, N., Soong, F. K., Rabiner, L. R., and Klatt, D. H. (1985). Comparative study of several distortion measures for speech recognition, Speech Communication, Vol. 4, 317-331.
Nord, L., and Sventelius, E. (1979). Analysis and prediction of difference limen data for formant frequencies, Speech Transmission Lab. Rep., STL/QPSR, 3-4/1979, Royal Institute of Technology, Stockholm, Sweden, 60-72.
O'Kane, M. (1987). Recognition of speech and recognition of speaker sex: parallel or concurrent processes? 114th Meeting of Acoustical Society of America, J. Acoust. Soc. Am. Sup. 1, Vol. 82, S84.
Ott, L. (1984). An Introduction to Statistical Methods and Data Analysis, Duxbury Press, Boston.
Parsons, T. W. (1986). Voice and Speech Processing, McGraw-Hill, New York.
Peterson, G. E., and Barney, H. L. (1952). Control methods used in a study of the vowels, J. Acoust. Soc. Am., Vol. 24, 175-184.
Pinto, N. B., Childers, D. G., and Lalwani, A. (1989). Formant speech synthesis: Improving production quality, accepted by IEEE Trans. Acoust., Speech, and Signal Processing.
Pruzansky, S. (1963). Pattern matching procedure for automatic talker recognition, J. Acoust. Soc. Am., Vol. 35, 354-358.
Rabiner, L. R., and Levinson, S. E. (1981). Isolated and connected word recognition: theory and selected applications, IEEE Trans. Communications, Vol. COM-29(5), 621-659.
Rabiner, L. R., Levinson, S. E., Rosenberg, A. E., and Wilpon, J. G. (1979). Speaker-independent recognition of isolated words using clustering techniques, IEEE Trans. Acoust., Speech, and Signal Processing, Vol. ASSP-27(1), 336-349.
Rabiner, L. R., and Schafer, R. W. (1978). Digital Processing of Speech Signals, Prentice-Hall, Inc., Englewood Cliffs, New Jersey.


Table 5.5 Results of Wilcoxon signed-ranks test and paired sample t-test

Result of Wilcoxon Signed-Ranks Test
Total Data Pairs (N): 109
Zero-differences: 3
Data Pairs used: 106
+ Ranks: Sum = 4313, Count = 81
- Ranks: Sum = 1357, Count = 25
Wilcoxon T = 1357
z = 4.658493
One-sided Significance Level < .01
Two-sided Significance Level < .01

Result of Paired Samples t-Test
Mean Difference    = 5.968807
Standard Deviation = 13.42507
Standard Error     = 1.285889
Degrees of Freedom = 108
t-statistic        = 4.641776
One-sided Significance Level < .01
Two-sided Significance Level < .01


19. Sustain phonation of the fricative /SH/ in the word SHIP.
20. Sustain phonation of the fricative /V/ in the word VAN.
21. Sustain phonation of the fricative /TH/ in the word THIS.
22. Sustain phonation of the fricative /Z/ in the word ZOO.
23. Sustain phonation of the fricative /ZH/ in the word AZURE.
24. Produce a chromatic scale on "la" (attempt to go up, then down, as one effort; pause between the top two notes).
25. Sentence: "We were away a year ago."
26. Sentence: "Early one morning a man and a woman ambled along a one mile lane."
27. Sentence: "Should we chase those cowboys?"

A subset of the above was used in this research. It consisted of
1. Ten sustained vowels: /IY/, /I/, /E/, /AE/, /OO/, /U/, /OW/, /A/, /UH/, and /ER/. There were a total of 520 vowels for the 52 subjects: 270 vowels from males and 250 vowels from females.
2. Five sustained unvoiced fricatives (including a whisper): /H/, /F/, /TH/, /S/, and /SH/. There were a total of 260 unvoiced fricatives for all subjects: 135 from males and 125 from females.
3. Four voiced fricatives: /V/, /TH/, /Z/, and /ZH/. There were a total of 208 voiced fricatives for all subjects: 108 from males and 100 from females.

3.2 Speech and EGG Digitization
All of the experimental data were collected with the subjects situated inside an Industrial Acoustics Company single wall sound booth. The speech was picked up with an Electro Voice RE-10 dynamic cardioid microphone, and the EGG signal was monitored by a Synchrovoice device. Amplification of the speech and EGG signals was accomplished with a Digital Sound Corporation DSC-240 Audio Control Console. The two channels were alternately sampled at 20 kHz by a Digital Sound Corporation DSC-200 Digital Audio Converter system with 16 bits of precision.


the statistical differences between the vowel F0 of male and female speakers were highly significant. Comparing the curves in Figure 7.1 to those in Figures 7.2, 7.3, 7.4, and 7.5, which show the curves of formant frequencies, bandwidths, and amplitudes, the curves of fundamental frequency across different vowels for both male and female speakers were relatively flatter than those of the formants, indicating that there was more formant than fundamental frequency variation across different vowels. This suggests that for both male and female speakers, glottal vibration patterns were relatively less variable than vocal tract shapes when different vowels were pronounced.
2. Formant frequencies (F1, F2, F3, and F4) from male vowels were lower than those from female vowels. The statistical differences of F1, F2, F3, and F4 between male and female speakers were highly significant. As previous research has shown, formant frequencies are inversely proportional to vocal tract lengths (Rabiner and Schafer, 1978), and the vocal tract lengths for males are longer than those for females (Fant, 1976). Thus the results support the previous conclusions.
3. With regard to the formant bandwidths of male and female vowels, the bandwidths of the formants for male vowels were narrower than those for female vowels, with the exception of B3. The statistical differences of B1, B2, and B4 between male and female speakers were highly significant. Even though there was no statistically significant difference between the male and female B3, the B3 of most vowels (except /IY/ and /A/) of males was still narrower than that of females.


Frame overlap: None
Preemphasis Factor: 0.95
Analysis Window: Hamming

Data set for coefficient calculations: six frames total. The first two of these were picked up from near the voice onset of an utterance, the next two from the middle of the utterance, and the last two from near the voice offset of the utterance. By averaging the six sets of coefficients obtained from these six frames, a template coefficient set was calculated for each sustained utterance, such as a vowel, an unvoiced fricative, or a voiced fricative.

4.2 Acoustic Parameters
One of the key issues in developing a recognition system is to identify appropriate features and measures which will support good recognition performance. Several acoustic parameters were considered as feature candidates in this study.

4.2.1 Autocorrelation Coefficients
They are defined conventionally as (Atal, 1974b)

R(k) = Σ_{n=0}^{∞} h(n) h(n+|k|)                                  (4.10)

where h(n) is the impulse response of the filter. The relationship between the p autocorrelation function coefficients and the p LPC coefficients is unique in that they can be obtained from each other (Rabiner and Schafer, 1978).
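The conversion from autocorrelation coefficients to LPC coefficients is the Levinson-Durbin recursion, sketched below in Python. The windowing choice mirrors the Hamming analysis window listed above, but the code is illustrative rather than the dissertation's implementation; the reflection coefficients k_i fall out as a by-product.

import numpy as np

def levinson_durbin(r, p):
    """Solve the normal equations for the inverse filter
    A(z) = 1 + a_1 z^-1 + ... + a_p z^-p given r(0..p)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = r[0]                                # prediction error energy
    refl = np.zeros(p)                      # reflection coefficients
    for i in range(1, p + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / E
        refl[i - 1] = k
        prev = a.copy()
        for j in range(1, i):               # Levinson order update
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        E *= 1.0 - k * k
    return a, refl, E

def lpc(frame, p=12):
    """Autocorrelation-method LPC of one analysis frame."""
    x = np.asarray(frame, float) * np.hamming(len(frame))
    r = np.array([x[: len(x) - k] @ x[k:] for k in range(p + 1)])
    a, refl, err = levinson_durbin(r, p)
    return -a[1:], refl                     # predictor coefficients a_k of Eq (4.2)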