Permanent Link: http://ufdc.ufl.edu/UF00082197/00001
 Material Information
Title: Towards automatic gender recognition from speech
Physical Description: viii, 198 leaves : ill. ; 29 cm.
Language: English
Creator: Wu, Ke ( Dissertant )
Childers, D. G. ( Thesis advisor )
Smith, J. R. ( Reviewer )
Arroyo, A. A. ( Reviewer )
Principe, J. C. ( Reviewer )
Rothman, H. B. ( Reviewer )
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 1990
Copyright Date: 1990
 Subjects
Subjects / Keywords: Automatic speech recognition (lcsh)
Electrical Engineering thesis Ph. D.
Pattern recognition systems (lcsh)
Sex differences (lcsh)
Voiceprints (lcsh)
Dissertations, Academic -- UF -- Electrical Engineering
Genre: bibliography (marcgt)
non-fiction (marcgt)
theses (marcgt)
 Notes
General Note: Typescript.
General Note: Vita.
Thesis: Thesis (Ph. D.)--University of Florida, 1990.
Bibliography: Includes bibliographical references (leaves 189-197).
 Record Information
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: oclc - 23011922
alephbibnum - 001583934
System ID: UF00082197:00001


TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH


By

KE WU












A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY




UNIVERSITY OF FLORIDA


1990

























To my parents

and to my wife














ACKNOWLEDGMENTS


The invaluable guidance, encouragement, and support I have received from my
adviser and committee chairman, Dr. D. G. Childers, during the years of my
graduate education are most appreciated. I am sincerely grateful for his direction,
insight, and patience throughout this dissertation research.

I would especially like to thank Dr. J. R. Smith, Dr. A. A. Arroyo, Dr. J. C.
Principe, and Dr. H. B. Rothman for their interest and participation in serving on my
supervisory committee and their productive criticism of my research project.

The partial support by the National Institutes of Health, the National Science
Foundation, and the University of Florida Center of Excellence Program is gratefully
acknowledged.

Special thanks are also extended to my fellow graduate students and other
members of the Mind-Machine Interaction Research Center for their friendship,
encouragement, and skillful technical help.

Last but not least, I am greatly indebted to my wife, Hong-gen, and my
parents for their love, support, understanding, and patience. My gratitude to them is
beyond description.














TABLE OF CONTENTS


                                                                      Page

ACKNOWLEDGMENTS ..................................................... iii

ABSTRACT ............................................................ vii

CHAPTER

1 INTRODUCTION ...................................................... 1

    1.1 Automatic Gender Recognition ................................ 1
    1.2 Application Perspective ..................................... 2
    1.3 Literature Review ........................................... 4
        1.3.1 Basic Gender Features ................................. 4
        1.3.2 Acoustic Cues Responsible for Gender Perception ....... 13
        1.3.3 Summary of Previous Research .......................... 17
    1.4 Objectives of this Research ................................. 20
    1.5 Description of Chapters ..................................... 21

2 APPROACHES TO GENDER RECOGNITION FROM SPEECH ...................... 23

    2.1 Overview of Research Plan ................................... 23
    2.2 Coarse Analysis ............................................. 23
    2.3 Fine Analysis ............................................... 26

3 DATA COLLECTION AND PROCESSING .................................... 29

    3.1 Database Description ........................................ 29
    3.2 Speech and EGG Digitization ................................. 30
    3.3 Synchronization of Data ..................................... 32

4 EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS ...................... 34

    4.1 Asynchronous LPC Analysis ................................... 34
        4.1.1 Linear Prediction Concepts ............................ 34
        4.1.2 Analysis Conditions ................................... 39
    4.2 Acoustic Parameters ......................................... 40
        4.2.1 Autocorrelation Coefficients .......................... 40
        4.2.2 LPC Coefficients ...................................... 41
        4.2.3 Cepstrum Coefficients ................................. 41
        4.2.4 Reflection Coefficients ............................... 41
        4.2.5 Fundamental Frequency and Formant Information ......... 42
    4.3 Distance Measures ........................................... 42
        4.3.1 Euclidean Distance .................................... 42
        4.3.2 LPC Log Likelihood Distance ........................... 43
        4.3.3 Cepstral Distortion ................................... 45
        4.3.4 Weighted Euclidean Distance ........................... 47
        4.3.5 Probability Density Function .......................... 47
    4.4 Template Formation and Recognition Schemes .................. 48
        4.4.1 Purpose of Design ..................................... 48
        4.4.2 Test and Reference Template Formation ................. 49
        4.4.3 Nearest Neighbor Decision Rule ........................ 55
        4.4.4 Structure of Four Recognition Schemes ................. 56
    4.5 Resubstitution and Leave-One-Out Procedures ................. 60
    4.6 Separability of Acoustic Parameters
        Using Fisher's Discriminant Ratio Criterion ................. 61
        4.6.1 Fisher's Discriminant and F Ratio ..................... 61
        4.6.2 Divergence and Probability of Error ................... 64

5 RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS ................... 68

    5.1 Coarse Analysis Conditions .................................. 68
    5.2 Performance Assessments ..................................... 70
        5.2.1 Comparative Study of Recognition Schemes .............. 71
        5.2.2 Comparative Study of Acoustic Features ................ 78
            5.2.2.1 LPC Parameter Versus Cepstrum Parameter ......... 78
            5.2.2.2 Other Acoustic Parameters ....................... 79
        5.2.3 Comparative Study Using Different Phonemes ............ 84
        5.2.4 Comparative Study of Filter Order Variation ........... 85
            5.2.4.1 LPC Log Likelihood and Cepstral
                    Distortion Measure Cases ........................ 85
            5.2.4.2 Euclidean Distance Versus
                    Probability Density Function .................... 87
        5.2.5 Comparative Study of Distance Measures ................ 88
        5.2.6 Comparative Study Using Different Procedures .......... 90
        5.2.7 Variability of Female Voices .......................... 93
    5.3 Comparative Study of Acoustic Parameters
        Using Fisher's Discriminant Ratio Criterion ................. 93
    5.4 Conclusions ................................................. 102

6 EXPERIMENTAL DESIGN BASED ON FINE ANALYSIS ........................ 106

    6.1 Introduction ................................................ 106
    6.2 Limitations of Conventional LPC ............................. 107
        6.2.1 Influence of Voice Periodicity ........................ 108
        6.2.2 Source-Tract Interaction .............................. 111
    6.3 Closed Phase WRLS-VFF Analysis .............................. 113
        6.3.1 Algorithm Description ................................. 113
        6.3.2 EGG Assisted Procedures ............................... 120
    6.4 Testing Methods ............................................. 122
        6.4.1 Two-way ANOVA Statistical Testing ..................... 123
        6.4.2 Automatic Recognition by Using Grouped Features ....... 128

7 EVALUATION OF VOWEL CHARACTERISTICS ............................... 130

    7.1 Vowel Characteristics of Gender ............................. 130
        7.1.1 Fundamental Frequency and Formant Features
              for Each Gender ....................................... 130
        7.1.2 Comparison with Peterson and Barney's Results ......... 142
        7.1.3 Results of Two-way ANOVA Statistical Test ............. 145
        7.1.4 Results of T Statistical Test ......................... 145
        7.1.5 Discussion ............................................ 145
    7.2 Relative Importance of Grouped Vowel Features ............... 151
        7.2.1 Recognition Results ................................... 152
        7.2.2 Discussion ............................................ 154
    7.3 Conclusions ................................................. 158

8 CONCLUDING REMARKS ................................................ 162

    8.1 Summary ..................................................... 162
    8.2 Future Research Extensions .................................. 166
        8.2.1 Short Term Extension .................................. 166
        8.2.2 Long Term Extension ................................... 168

APPENDICES

A RECOGNITION RATES FOR LPC AND CEPSTRUM PARAMETERS ................. 169

B RECOGNITION RATES FOR VARIOUS ACOUSTIC
  PARAMETERS AND DISTANCE MEASURES .................................. 179

REFERENCES .......................................................... 189

BIOGRAPHICAL SKETCH ................................................. 198














Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy


TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH


By

Ke Wu

May, 1990

Chairman: D. G. Childers
Major Department: Electrical Engineering

The purpose of this research was to investigate the potential effectiveness of
digital speech processing and pattern recognition techniques in the automatic
recognition of gender from speech. Some hypotheses concerning acoustic
parameters that may influence our ability to distinguish a speaker's gender were
researched.

The study followed two directions. One direction, coarse analysis, used
classical pattern recognition techniques and asynchronous linear prediction coding
(LPC) analysis of speech. Acoustic parameters such as autocorrelation, LPC,
cepstrum, and reflection coefficients were derived to form test and reference
templates. The effects of different distance measures, filter orders, recognition
schemes, and phonemes were comparatively assessed. Comparisons of acoustic
parameters using the Fisher's discriminant ratio criterion were also conducted.

The second direction, fine analysis, used pitch synchronous closed-phase
analysis to obtain accurate vowel characteristics for each gender. Detailed formant
features, including frequencies, bandwidths, and amplitudes, were extracted by a
closed-phase Weighted Recursive Least Squares with Variable Forgetting Factor
method. The electroglottograph signal was used to locate the closed-phase portion
of the speech signal. A two-way Analysis of Variance statistical analysis was
performed to test the difference between the features of the two genders, and the
relative importance of grouped vowel features was evaluated by a pattern
recognition approach.

The results showed that most of the LPC derived acoustic parameters worked
very well for automatic gender recognition. A within-gender and within-subject
averaging technique was important for generating appropriate test and reference
templates. The Euclidean distance measure appeared to be the most robust as well
as the simplest of the distance measures.

The statistical test indicated steeper spectral slopes for female vowels. Results
suggested that redundant gender information was embedded in the fundamental
frequency and vocal tract resonance. Features of female voices were observed to
have higher within-group variations than those of male voices.

In summary, this study demonstrated the feasibility of an efficient gender
recognition system. The importance of this system is that it would reduce the search
space of speech or speaker recognition in half. The knowledge gained from this
research might benefit the generation of synthetic speech with a desired male or
female voice quality.














CHAPTER 1
INTRODUCTION



1.1 Automatic Gender Recognition


Human listeners are able to capture and categorize the information of acoustic
speech signals. Categories include those that contribute a linguistic message, those
that identify the speaker, and those that convey clues about the speaker's
personality, emotional state, gender, age, accent, and the status of his/her health.
Automatic speech and speaker recognition systems are far less capable than
human listeners. Computerized speaker recognition can be accomplished but only
under highly constrained conditions. The major difficulty is that the number of
significant parameters is unmanageably large and little is known about the acoustic

speech features, articulation differences, vocal tract differences, phonemic
substitutions or deletions, prosodic variations and other factors that influence our
recognition ability.
Therefore, more insight and systematic study of intrinsically effective speaker
discrimination features are needed. A series of smaller experiments should be done
so that the experimental results will be mutually supportive and will lead to overall
understanding of the combined effects of all the parameters that are likely to be
present in actual situations (Rosenberg, 1976; Committee on Evaluation of Sound
Spectrograms, 1979).
Unlike automatic speech and speaker recognition, automatic gender
recognition was never proposed as a stand-alone problem. Little attention was paid
to either the theoretical basis or the practical techniques for the realization of a
system for the automatic recognition of gender from speech. Although

contemporary research on speech included investigation of physiological and

acoustic gender features and their correlation with perceived gender differences, no

attempt was made to classify the speaker's gender objectively, using features

automatically extracted by a computer. Childers and Hicks (1984) first proposed

such a study as a separate recognition task and, thus, this research resulted from that

proposal. A possible realization of such a system is shown in Figure 1.1.



1.2 Application Perspective

The significance of the proposed research is as follows:
o Accomplishing this task could facilitate speech recognition and

speaker identification or verification by reducing the required search

space to half. Such a pre-process may occur in the listening process

of human beings. One of the speech perception hypotheses proposed

by O'Kane (1987) stated that human listeners have to determine the
gender of the speaker first in order to determine the identity of the
sounds. Another perception hypothesis is that the identity of the

sounds can be roughly determined without knowledge of the

speaker's gender but final recognition is possible only after the

speaker's gender is known. In both cases, identification of the

speaker's gender is a necessary step before recognition of sounds.

o Accomplishing this task could be useful for speech synthesis. It is
well known that in synthesized speech, the female voice has not been

reproduced with the same level of success as the male voice (Monsen

and Engebretson, 1977). Further study of gender cues would
contribute to the solution of this problem since acoustic features for
synthesizing speech for either gender would be provided. Hence, the
voice quality of voice response systems and text-to-speech
synthesizers would be improved.

[Figure 1.1 A possible automatic gender recognition system. Legend: F0 -- fundamental frequency; F1 -- first formant frequency; BW1 -- first formant bandwidth.]
o Accomplishing this task could provide new guidelines and suggest

methods to identify the acoustic features related to dialect, age,
health conditions, etc.

o Accomplishing this task could provide a unique, or even the only, approach for

some applications (e.g., law enforcement applications). In a

criminal investigation, an attempt is usually made to identify the

speaker on a recording as a specific person. If an individual is able
to deceive the investigator as to his gender, he may well prevent his

detection. It is well known that speakers can disguise their

speech/voice to confound or prevent detection (Hollien and

McGlone, 1976--cited by Carlson, 1981). The female impersonator
is an example of intentional deception of the listener. In such a

case, identification of the speaker's gender is critical.
o Finally, we presumed that the research results could benefit clinical

applications such as correction for a person with a voice disorder or

handicap. Other applications include transsexual voice change, etc.

(Bralley et al., 1978; Carlson, 1981).


1.3 Literature Review

1.3.1 Basic Gender Features

The differences between male and female voices depend upon many factors.
Generally, there exist three types of parameters--physiological and acoustical

which are objective, and perceptual which is subjective (Figure 1.2).

[Figure 1.2 Basic gender features. Physiological: vocal fold length and thickness; vocal tract length, area, and shape. Acoustic: fundamental frequency; vocal tract features (formant frequency, bandwidth, and amplitude); glottal volume velocity waveshape. Also: intonation, stress, and speaking rate. All feed gender discrimination.]

Many physiological parameters of the male and female vocal apparatus have
been determined and compared. Fant (1976) showed that the ratio of the total

length of the female vocal tract to that of a male is about 0.87, and Hirano et al.

(quoted by Cheng and Guerin, 1987) showed that the ratio of the length of the
female vocal fold to that of the male is about 0.8. Titze (1987 and 1989) reported
that, anatomically, the female larynx also differs from the male larynx in thickness,

angle of the thyroid laminae, resting angle of the glottis, vertical convergence angle

in the glottis, and in other ways. The ratio of the length and the ratio of the area of

pharynx cavity of the female to that of the male are 0.8 and 0.82, respectively.

Similarly, the ratio of the length of the oral cavity of the female to that of the male

is taken as 0.95 and the ratio of its area as 1.0. The extra ratio for the area

of the oral cavity is due to the fact that the degree of openness of the oral cavity is

comparatively greater in the case of the female than in the case of the male (Ohman

quoted by Fant, 1966). Ohman also suggested that a proportionally larger female

mouth opening is a factor to consider. Figure 1.3 illustrates the human vocal

apparatus.

The differences in physiological parameters can lead to induced differences in
acoustical parameters. When comparing male and female formant patterns, the
average female formant frequencies are roughly related to those of the male by a

simple scaling factor that is inversely proportional to the overall vocal tract length.

On the average, the female formant pattern is said to be scaled upward in frequency

by about 20% compared to the average male formant pattern (Figure 1.4). It is also
well known that the individual size of the vocal cavities and thus of the formant

pattern scale factor may vary appreciably depending upon the age and gender of the

speaker. Peterson and Barney (1952) measured the first three formant frequencies

present in ten vowels spoken by men, women, and children. They reported that male
formants were the lowest in frequency, women had a higher range, and children had

[Figure 1.3 A cross section of the human vocal apparatus (velum, tongue body, tongue tip, lips, jaw).]

[Figure 1.4 An example of male and female formant features: amplitude (dB) versus frequency (kHz).]

[Figure 1.5 Fundamental frequency changes for two speakers (male and female) for the utterance "We were away a year ago": pitch period (msec) versus frame.]

the highest. Carlson (1981) gave a survey of the literature on the vocal tract
resonance characteristics as a gender cue.
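
The scaling relations above can be made concrete with a small computation. The sketch below (in Python; the male formant values are the classic Peterson and Barney (1952) averages for the vowel /A/, used here purely for illustration) applies the inverse of Fant's vocal tract length ratio as a uniform scale factor:

```python
# Illustrative only: a single uniform scale factor derived from the
# female/male vocal tract length ratio quoted above (Fant, 1976).
# Real male/female formant differences are vowel dependent and are
# not captured by one number.
length_ratio = 0.87                # female/male vocal tract length
scale = 1.0 / length_ratio         # ~1.15 (formants scale inversely)

male_f = [730.0, 1090.0, 2440.0]   # Peterson & Barney male /A/ averages (Hz)
female_f = [scale * f for f in male_f]

for i, (fm, ff) in enumerate(zip(male_f, female_f), start=1):
    print(f"F{i}: male {fm:6.0f} Hz -> predicted female {ff:6.0f} Hz")
```

A single factor of about 1.15 falls somewhat short of the roughly 20% average shift quoted above, which is consistent with Fant's point that the deviations are vowel dependent rather than uniform.
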
Fant (1966) has pointed out that the male and female vowels are typically

different in three groups:

1) rounded back vowels,

2) very open unrounded vowels, and
3) close front vowels.

The main physiological determinants of the specific deviations are that the ratio of

pharyngeal length to mouth cavity length is greater for males than for females and

the laryngeal cavities are more developed in males.

Schwartz and Rine (1968) also demonstrated that the gender of an individual
can be identified from voiceless fricative phonemes such as /S/ and /F/. This again is
induced by the vocal tract size differences between the genders.

The higher fundamental frequency (pitch) range of the female speaker is quite

well known. There is a general agreement that the fundamental frequency is an

important factor in the identification of gender from voice (Curry, 1940--cited by

Carlson 1981; Hollien and Malcik, 1967; Saxman and Burk, 1967; Hollien and Paul,
1969; Hollien and Jackson, 1973; Monsen and Engebretson, 1977; Stoicheff, 1981;

Horii and Ryan, 1981; Linville and Fisher, 1985; Henton, 1987). One often finds the

statement that the pitch level of the female speaking voice is approximately one

octave higher than that of the male speaking voice (Linke, 1973). However, there is

considerable discrepancy among values obtained by different investigators.

According to Hollien and Shipp (1972), the male subjects showed an intersubject
pitch range of 112-146 Hz. Stoicheff's (1981) data showed that the range for the
female subjects was 170-275 Hz. Titze (1989) found that the fundamental

frequency was scaled primarily according to the membranous lengths of the vocal

folds (scale factor 1.6). Figure 1.5 shows fundamental frequency changes for two
speakers for the utterance "We were away a year ago." Figure 1.6 shows the
corresponding speech signals.
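
Contours like those in Figure 1.5 are the output of a pitch analysis. As a minimal sketch of one common approach (autocorrelation peak picking; the function name, frame length, and search range are illustrative assumptions, not the dissertation's analysis settings), the fundamental frequency can be estimated for one voiced frame as follows:

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude autocorrelation pitch estimator for one voiced frame:
    pick the strongest autocorrelation peak inside a plausible
    male/female pitch range and convert its lag to Hz.
    Assumes the frame is longer than fs / f0_min samples."""
    x = frame - np.mean(frame)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    lo, hi = int(fs / f0_max), int(fs / f0_min)        # lag search range
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag
```

Run frame by frame over an utterance such as "We were away a year ago," an estimator of this kind would yield a male contour roughly an octave below the female one, as in Figure 1.5.
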

The female voice is slightly weaker than the male voice. On the average the
root mean square (rms) intensity of glottal periods produced by female subjects is -6
dB relative to comparable samples produced by males. A study by Karlsson (1986)
indicated a strong correlation between weak voice effort and constant air leakage
during closed-phase.

During the last few years, measuring the area of the glottis as well as
estimating the glottal volume-velocity waveform have become research topics of

interest (Holmberg et al., 1987). It is well known that the shape of the glottal
excitation wave is an important factor which can greatly affect speech quality
(Rothenberg, 1971). The wave shape produced by male subjects is typically

asymmetrical and frequently shows a prominent hump in the opening phase of the

wave (due to source-tract interaction). The closing portion of the wave generally
occupies 20%-40% of the total period and there may or may not be an easily
identifiable closed period (Monsen and Engebretson, 1977). Notable differences
between male and female waveforms are that the female waveform tends to be
symmetric. There is seldom a hump during the opening-phase indicating less or no

source-tract interaction, and both the opening and closing parts of the wave occupy

more nearly equal proportions of the period. Holmberg et al. (1987) found
statistically significant differences in male-female glottal waveform parameters. In

normal and loud voices, female waveforms indicated lower vocal fold closing
velocity, lower ac flow, and a proportionally shorter closed-phase of the cycle,

suggesting a steeper spectral slope for females. For softly spoken voices, spectral

slopes are more similar to those of males.
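
These shape differences can be visualized with a simple parametric pulse. The sketch below uses a Rosenberg-type cosine pulse, a classical synthesis model rather than the measured waveforms of the studies cited above; the opening and closing fractions are illustrative values chosen to mimic the asymmetric male shape with a distinct closed phase and the more symmetric female shape with a shorter closed phase:

```python
import numpy as np

def glottal_pulse(n, open_frac, close_frac):
    """Rosenberg-type glottal flow pulse over one period: cosine rise
    during the opening phase, quarter-cosine fall during the closing
    phase, zero (closed phase) for the remainder of the cycle."""
    t = np.arange(n) / n                       # normalized time in the cycle
    g = np.zeros(n)
    rising = t <= open_frac
    falling = (t > open_frac) & (t <= open_frac + close_frac)
    g[rising] = 0.5 * (1.0 - np.cos(np.pi * t[rising] / open_frac))
    g[falling] = np.cos(0.5 * np.pi * (t[falling] - open_frac) / close_frac)
    return g

# Male-like: asymmetric, closing portion ~30% of the period, clear closed phase.
male_pulse = glottal_pulse(200, open_frac=0.50, close_frac=0.30)
# Female-like: nearly symmetric opening/closing, shorter closed phase.
female_pulse = glottal_pulse(200, open_frac=0.45, close_frac=0.45)
```
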
These glottal-source differences between male and female subjects are
understandable in terms of the relative size of male and female vocal folds. It is

[Figure 1.6 Speech signals for (a) male and (b) female speakers for the utterance "We were away a year ago."]

possible that the asymmetrical, humped appearance of the male glottal wave may be

due to a slightly out-of-phase movement of the upper and lower parts of each vocal

fold. If this is so, then the generally symmetrical appearance of the female glottal

wave may be due to the fact that the shorter female vocal folds come into contact

with each other more nearly as a single mass (Ishizaka and Flanagan, 1972).

The perceptual parameters or strategies used to make decisions concerning

male/female voices are not delineated in the literature even though making this

decision is a discrimination task performed routinely by human listeners. However,

it is hypothesized that a limited number of perceptual cues for classifying voices do

exist in the repertoire of listeners, and these cues may include some sociological

factors such as cultural stereotyping.

Singh and Murry (1978) and Murry and Singh (1980) investigated the

perceptual parameters of normal male and female voices. They found that the

fundamental frequency and formant structure of the speaker appeared to carry

significant information for all judgments. The listeners' judgments that the voices

they heard were female were more dependent on judged qualities of voice and effort.

Effort, pitch, and nasality were the perceptual parameters used to characterize
female voices while male voices were judged on the basis of effort, pitch, and

hoarseness. Their results suggested that listeners may use different perceptual
strategies to classify male voices than they use to classify female ones. Coleman

(1976) also suggested that there was a possibility of a gender-specific listener bias

for one acoustic characteristic or for one gender over the other.

Many researchers also believe melodic (intonation, stress, and/or

coarticulation) cues are speech characteristics associated with female voices.

Furthermore, the female voice is typically more breathy than the male voice. This

can be modeled by a dc shift in the glottal wave or, as suggested by Singh and Murry

(1978), is a result of a large number of pitch shifts. As the subject shifts pitch








direction frequently, complete vocal fold approximation is less probable. A
study of the acoustic correlates of breathiness was performed by Klatt (1987) in
which three breathiness parameters (i.e., first harmonic amplitude, turbulence noise,
and tracheal coupling) were proposed. A detailed discussion of controlling these

parameters was presented in Klatt's paper (1987).
parameters was presented in Klatts' paper (1987).

A new trend in finding the features responsible for gender identification is to
apply the approach of synthesis. The work done by Yegnanarayana et al. (1984),
Wu (1985), Childers et al. (1985a, 1985b, 1987, 1989), and Pinto et al. (1989)
represented this aspect. In their experiments, the speech of a talker of one gender
was converted to sound like that of a talker of the other gender to examine factors

responsible for distinguishing gender features. They found that the fundamental
frequency, the glottal excitation waveshape and the spectrum, which included

formant locations and bandwidth, overall spectral shape and slope, and energy, are

crucial control parameters.


1.3.2 Acoustic Cues Responsible for Gender Perception

As part of current interest in speaker recognition, investigators have sought to
specify gender-bearing attributes of the human voice. Under normal speaking and

listening circumstances, listeners have little difficulty distinguishing the voices of

adult males and females, suggesting that the acoustic parameters which underlie

gender identity are perceptually prominent. The judgment of adult gender is
strongly influenced by acoustic variables reflecting gender differences in laryngeal

size and mass as well as vocal tract length. However, the issue of which specific
acoustic cues are mostly responsible for gender identification has not been

definitively resolved. Such a controversy partially dominated the previous research.

A series of experiments run by Schwartz (1968) and Ingemann (1968)

employed voiceless fricatives spoken in isolation as auditory stimuli and it was found








that listeners could identify speaker gender accurately from these stimuli, especially
from /H/, /S/, and /SH/ (and could not from /F/ and /TH/). Ingemann reported that
the most identifiable fricative was /H/, with identification of others ranging down to
little better than chance. Since the laryngeal fundamental (FO) was not available to
the listeners, their findings suggest that accurate gender identification is possible
from vocal tract resonance (VTR) information alone and, therefore, that formants
are important cues for speaker gender identification.

Further support for this conclusion came from studies by Schwartz & Rine
(1968) and Coleman (1971). Schwartz and Rine's study revealed that the listeners

were able to identify the speaker's gender from two whispered vowels (/i/ and /a/).
They found 100% correct identification for /a/ and 95% correct identification for /i/,

despite the absence of the laryngeal fundamental. In Coleman's study on male and

female voice quality and its relationship to vowel formant frequencies, /i/, /u/, and a

prose passage were employed to explore listeners' gender identification abilities.
All stimuli were produced at the same FO (85 Hz) by means of an electrolarynx.
Coleman discovered that the judges correctly recognized the speaker gender 88% of

the time (with 98% correct for male voices and 79% for female voices), even when
the FO remained constant for all speakers. He also discovered that the vowel
formant frequency averages were closely associated with the degree of male or
female voice quality.

Coleman (1973a and 1973b) attempted to reduce the influence of possible

differences in rate, juncture, and inflection between male and female speakers by

presenting their voiced productions of a prose passage backward to subjects. The

judgments should have, therefore, been based solely on VTR and FO information
which would be unaffected by the backward presentation. By correlation analysis
between measures of VTR, FO, and judgments of degree of male and female voice
quality in the voices of the speakers (with degree of correlation indicative of the








contribution of each of the vocal characteristics to listener judgments), he found that
listeners were basing their judgments of the degree of male or female voice quality

on the frequency of the laryngeal fundamental.

However, in a later study by Coleman (1976), there were inconsistent findings

from a pair of experiments concerned with a comparison of the contribution of two
vocal characteristics to the perception of male and female voice quality. The first
experiment, which utilized natural speech, indicated that the FO was very highly

correlated with the degree of gender perception while the VTR was less highly

correlated. When VTRs that were more characteristic of the opposite gender were

included experimentally in these voices, they did not affect the judges' estimates of

the degree of male or female voice quality. But, in the second experiment, when a

tone produced by a laryngeal vibrator was substituted for the normal glottal tone at
simulated FO representing both male (120 Hz) and female (240 Hz), and male and

female characteristics (i.e. vocal tract formants and laryngeal fundamentals) were

combined in the same voice experimentally, he found that the female FO was a weak

indicator of the female voice quality when it was combined with the male VTR

features although the male FO retained the perceptual prominence seen in the first
experiment. Thus, there was a difference in the manner that FO and VTR interact

for male and female perception.

Lass et al. (1976) conducted a study comparing listeners' gender identification

accuracy from voiced, whispered, and 255 Hz low-pass filtered isolated vowels.
They found that listener accuracy was greatest for the voiced stimuli (96% correct
out of 1800 identifications--20 speakers x 6 vowels x 15 listeners), followed by the

filtered stimuli (91% correct), and least accurate (75% correct) for the voiceless

vowels. Since the low-pass filtered vowels apparently eliminated formant

information, they concluded that the FO was a more important acoustic cue in
speaker gender identification tasks than the VTR characteristics of the speaker.








Lass et al. (1976) also reported that there were large gender differences in

their results. In all experimental conditions females were recognized at a
significantly lower level, which was in agreement with the results of Coleman (1971)
mentioned above. In another study supportive of this point, Brown and Feinstein

(1977) also used an electrolarynx (120 Hz) to control FO so that VTR was the variable.

Identification of male speakers was 84% correct and identification of female

speakers was 67% correct. Brown and Feinstein also found, as in the Coleman

(1971) study, that centralized spectra were more ambiguous to listeners. Again,

VTR appeared to play a determinant role in gender identification in the absence of
FO.

In a later experiment, the effect of temporal speech alterations on speaker
gender and race identification was investigated. Lass and Mertz (1978) found that

gender identification accuracy remained high and unaffected by temporal speech

alterations when the normal temporal features of speech were altered by means of

the backward playing and time compressing of speech samples. They concluded

that temporal cues appeared to play a role in speaker race, but not speaker gender

identification.

In another study concerned with the effect of phonetic complexity on speaker

gender identification, Lass et al. (1979) found that phonetic complexity did not

appear to play a major role for gender judgments. No regular trend was evident
from simple to complex auditory stimuli and listeners' accuracy was as great for

isolated vowels as it was for sentences.

In an attempt to investigate the relative importance of portions of the

broadband frequency speech spectrum in gender identification, Lass et al. (1980)
constructed three recordings representing the three experimental conditions in the

study: unfiltered, 255 Hz low pass filtered, and 255 Hz high pass filtered. The

recordings were played back to a group of 28 judges. The results of their judgments








indicated that gender identification was not significantly affected by such filtering;

listeners' accuracy in gender recognition remained high for all three experimental

conditions, showing that gender identification can be made accurately from acoustic

information available in different portions of the broadband speech spectrum.


1.3.3 Summary of Previous Research
A review of the literature shows that previous research revealed extensive
information about gender identification. However, it is clear that

much work still remains to be done.
What has not been completed

The relative importance of the FO versus VTR characteristics for
perceptual male or female voice quality is still controversial. The belief

that the FO is the strongest cue to gender seems to be substantiated by the

evidence. There is a hypothesis that in situations in which the role of FO

is diminished by deviancy, the effect of VTR characteristics upon gender

judgments increases from a minimal level to take on a large role equal to

and even sometimes greater than that played by FO (Carlson, 1981). But
this hypothesis remains unproven.

It is well known now that not only the vibration frequency of the

glottis (FO) but also the shape of the glottal excitation wave are
important factors which greatly affect speech quality (Rothenberg, 1971;

Holmes, 1973). Differences of glottal excitation wave shapes for male
and female were observed and investigated (Monsen and Engebretson,

1977; Karlsson, 1986; Holmberg and Hillman, 1987). But perceptive

justification of these characteristics was still limited (Carrell, 1981) and

the inverse filtering techniques need to be improved and more data

should be analyzed.








What was neglected

First of all, research on automatically classifying male/female

voices by using objective feature measurements was entirely missing.

Almost all previous work was concentrated on subjective testing which is

expensive, time and labor consuming, and subject dependent. Objective

gender recognition which is reliable, inexpensive, and consistent has not

been developed in parallel to subjective testing but such work is

necessary as we stated earlier.

Second, the influences of formant bandwidth and amplitude and

overall spectral shape on gender cues were not considered and

investigated. Traditionally, experiments on contribution of vocal tract

characteristics to gender perception were only concerned with formant

frequencies (Coleman, 1976). The bandwidths of the lowest formant

depend upon vocal tract wall loss and source-tract interaction (Rabiner

and Schafer, 1976; Rothenberg, 1981) while bandwidths of the higher

formants depend primarily upon the viscous friction, thermal loss, and

radiation loss (Flanagan, 1972). These factors may be different for each

gender so that the bandwidths and overall spectral shape are different for

each gender. Bladon (1983) pointed out that male vowels appeared to

have narrower formant bandwidths and perhaps also a less steeply

sloping spectrum. All these areas require further investigation.

What was the weakness

The acoustic features were obtained by short-time spectral

analysis which usually used analog spectrographic techniques.

Estimated FO and formant frequencies may be inaccurate due to

1. errors in determining the positions of the harmonic peaks (in

practice, the peaks were "read" by means of inspection by a

person and then the FO and formants were calculated).

2. errors in formant estimation due to the influences of the FO
and source-tract interaction.
3. large instrument errors (e.g., drift).

Lindblom (1962) estimated the accuracy of spectrographic

measurement to be approximately equal to the fundamental frequency

divided by 4. Flanagan (1955) and Nord and Sventelius (1979, quoted by

Monsen and Engebretson, 1983) suggested that a difference of about 50

Hz for the second formant and a difference of about 21 Hz for the first

formant was perceived. Therefore, formant frequency estimation should

be as accurate as possible in vowel analysis as well as synthesis.

However, the most frequently referenced paper on acoustic phonetics,

which contains the most comprehensive measurements of the vowel

formants of American English (Peterson and Barney, 1952), may involve

measurement errors as pointed out by Monsen and Engebretson (1983),
especially for female and child subjects since the data were obtained by

spectrographic measurement.

The technique frequently employed to examine the ability of VTR

to serve as gender cue was to standardize the FO (and therefore eliminate
it as a variable) by utilizing an artificial larynx (Coleman, 1971, 1976;

Brown and Feinstein, 1977). This allows evaluation of VTR in a sample

that contains an FO that is the same for both male and female subjects.

The electrolarynx itself has an unnatural sound to it that may confuse the

listener and depress the overall accuracy of perception.








The study populations were relatively small for most

investigations. Sometimes the database used consisted of fewer than 10

subjects for each gender (Ingemann, 1968; Schwartz and Rine, 1968;

Brown and Feinstein, 1977), making the interpretation of the results

unreliable.

The results of the listening tests may depend on the gender

distribution of the testing panel because males and females may use

different judging strategies. However, this point usually was not

emphasized so that the conclusions claimed from listening tests may be

biased (Coleman, 1976; Carlson, 1981).

In summary, previous research has measured and investigated the

physiological or anatomical parameters for each gender. Under certain

assumptions, the relationship between anatomical parameters and some of the

acoustic features was established. The major acoustic parameters responsible for

perceptually discriminating a speaker's gender from voice were investigated and

tested. However, no attempt was made to automatically classify male/female voices

by objective feature measurements. The vowel characteristics for each gender were

inaccurate because of the weakness of analog techniques. Various hypotheses and

preliminary results need to be verified on a more comprehensive database. All these

constituted the underlying problems and impetuses for this research.


1.4 Objectives of this Research

This research sought to address these problems through two specific

objectives.

One objective of this study was to explore the possible effectiveness of digital

speech processing and pattern recognition techniques for an automatic gender

recognition system. Emphasis was placed on the investigation of various objective
acoustic parameters and distance measures. The optimal combination of these

parameters and measures was sought. The extracted acoustic features that are
most effective for classifying a speaker's gender objectively were characterized. Efficient
recognition schemes and decision algorithms for this purpose were developed.

The other objective of this study was to validate and clarify hypotheses
concerning some acoustic parameters affecting the ability of algorithms to

distinguish a speaker's gender. Emphasis was placed on extraction of accurate

vowel characteristics including fundamental frequency and formant features such as

formant frequency, bandwidth, and amplitude for each gender. The relative

importance of these characteristics for gender identification was evaluated.



1.5 Description of Chapters

In Chapter 2, an overview of the research plan is given and a brief description

of the coarse and fine analysis is presented. The database and the techniques

associated with data collection and preprocessing are discussed in Chapter 3. The

details of the experimental design based on coarse analysis are described in Chapter

4. Asynchronous LPC analysis is reviewed. Different acoustic parameters, distance

measures, template formation, and recognition schemes are provided. The

recognition decision rule and the resubstitution and leave-one-out procedures are proposed as
well. In addition, the concept of the Fisher's discriminant ratio criterion is reviewed.

The recognition performance based on coarse analysis is assessed in Chapter 5.

Results of comparative studies of various phonemes, acoustic features, distance

measures, recognition schemes, and filter orders are reported. The gender
separability of acoustic features is also analyzed by using the Fisher's discriminant

ratio criterion. Chapter 6 expounds on the detailed experimental design of fine

analysis. In particular, the advantages of pitch synchronous closed phase analysis are








demonstrated. A review of the closed phase WRLS-VFF (Weighted Recursive Least
Squares with Variable Forgetting Factor) analysis and the EGG (electroglottograph)

assisted approaches is also presented. Chapter 6 also introduces testing methods for

fine analysis, which include the two-way ANOVA (Analysis of Variance) statistical

test and the automatic recognition test using grouped features. Chapter 7 analyzes
the vowel characteristics such as fundamental frequencies and formant features for

each gender. Statistical tests and relative importance of grouped vowel features are

also discussed. Finally in Chapter 8, a summary of the results of this dissertation is

offered. Recommendations and suggestions for future research conclude this last

chapter.














CHAPTER 2
APPROACHES TO GENDER RECOGNITION FROM SPEECH



2.1 Overview of Research Plan


The goal of this study was to explore the possible effectiveness of digital

speech processing and pattern recognition techniques for an automatic gender

recognition system from speech. In order to do this, some hypotheses concerning

acoustic parameters that act to affect our ability to distinguish a speaker's gender
needed to be validated and clarified.

Thus, this study was divided into two directions as illustrated in Figure 2.1.

One direction was called coarse analysis since it applied classical pattern recognition

techniques and asynchronous linear prediction coding (LPC) analysis of speech.

The specific goal of this direction was to develop and test candidate algorithms for
achieving gender recognition rapidly using only a brief speech data record.

The second research direction covered fine analysis since pitch synchronous

closed-phase analysis was utilized to obtain accurate vowel characteristics for each

gender. The specific aim of this direction was to compare the relative significance of

vowel characteristics for gender discrimination.


2.2 Coarse Analysis

The tool we used in this direction was asynchronous LPC analysis. The

advantages of using this technique are

[Figure 2.1 The overall research flow.]

1. The well-known linear prediction coding (LPC) vocoder is an

efficient vocoder which, when used as a model, encompasses the

features of the vocal source (except for the fundamental frequency) as

well as the vocal tract (Rabiner and Schafer, 1978). Since gender

features are believed to be included in both vocal source and tract,

satisfactory results would be expected using LPC derived

parameters.

2. The LPC all-pole model has a smoothed, accurate spectral envelope

matching characteristic, especially for vowels. Formant frequency

measurements obtained by LPC have also been found to compare

favorably to measures obtained by spectrographic analysis (Monsen

and Engebretson, 1983; Linville and Fisher, 1985). Thus it is

expected that features obtained by LPC would represent the spectral

characteristics of both genders more accurately.

3. The LPC model has been successfully applied in speech and speaker

recognition (Makhoul, 1975a; Atal, 1974b, 1976; Rosenberg, 1976;

Markel, 1977; Davis and Mermelstein, 1980; Rabiner and Levinson,

1981). Moreover, many related distortion or distance measurements

have been developed (Gray and Markel, 1976; Gray et al., 1980;

Juang, 1984; Nocerino et al., 1985) which could be conveniently

adopted for the preliminary experiments of gender recognition.

4. Deriving acoustic parameters from the LPC model is

computationally fast and efficient, and only short data records are

needed. This is a very important factor in designing an automatic

gender recognition system.








In the coarse analysis, acoustic parameters such as autocorrelation, LPC,

cepstrum, and reflection coefficients were derived to form test and reference

templates. The effects of using different distance measures, filter orders,

recognition schemes, and phonemes were comparatively evaluated. Comparisons of

acoustic parameters using the Fisher's discriminant ratio criterion were also

conducted.
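
To make the parameter derivation concrete, the sketch below (Python with NumPy; the window and filter order are illustrative, not the exact analysis conditions, which Chapter 4 specifies) computes all four coarse-analysis parameter sets for a single speech frame: autocorrelation coefficients, LPC coefficients via the Levinson-Durbin recursion, the reflection coefficients that recursion yields as a by-product, and LPC cepstrum coefficients:

```python
import numpy as np

def lpc_features(frame, order=12):
    """Autocorrelation, LPC, reflection, and LPC cepstrum coefficients
    for one windowed speech frame (illustrative sketch)."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]

    # Levinson-Durbin recursion for A(z) = 1 + a1 z^-1 + ... + ap z^-p;
    # the reflection coefficients k fall out at each step.
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / err
        a_prev = a.copy()
        a[i] = k[i - 1]
        a[1:i] = a_prev[1:i] + k[i - 1] * a_prev[i - 1:0:-1]
        err *= 1.0 - k[i - 1] ** 2

    # LPC cepstrum from the standard recursion
    # c_n = -a_n - sum_{m<n} (m/n) c_m a_{n-m}.
    c = np.zeros(order)
    for n in range(1, order + 1):
        c[n - 1] = -a[n]
        for m in range(1, n):
            c[n - 1] -= (m / n) * c[m - 1] * a[n - m]

    return {"autocorrelation": r / r[0], "lpc": a[1:],
            "reflection": k, "cepstrum": c}
```

Each of these vectors can then serve as the per-frame feature from which test and reference templates are averaged.
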

The linear prediction coding concepts and detailed experimental design based

on the coarse analysis will be given in Chapter 4.
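
As a preview of the recognition and separability machinery detailed in Chapters 4 and 5, the sketch below shows two of the simplest ingredients in isolation: a nearest neighbor decision using the Euclidean distance between averaged templates, and the single-feature Fisher discriminant ratio. The function names and the bare averaging are illustrative simplifications, not the four recognition schemes actually studied:

```python
import numpy as np

def make_template(feature_vectors):
    """Average several per-frame feature vectors into one template
    (cf. the within-subject averaging used to build templates)."""
    return np.mean(np.asarray(feature_vectors), axis=0)

def classify_gender(test_vectors, male_templates, female_templates):
    """Assign the gender of the reference template nearest to the
    test template in Euclidean distance."""
    test = make_template(test_vectors)
    refs = [(m, "male") for m in male_templates] + \
           [(f, "female") for f in female_templates]
    return min(refs, key=lambda rl: np.linalg.norm(test - rl[0]))[1]

def fisher_ratio(male_values, female_values):
    """Single-feature Fisher discriminant ratio: squared separation
    of the class means over the summed class variances."""
    m = np.asarray(male_values)
    f = np.asarray(female_values)
    return (m.mean() - f.mean()) ** 2 / (m.var() + f.var())
```
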


2.3 Fine Analysis


The objective of the fine analysis was to study and compare the relative

significance of vowel characteristics responsible for gender discrimination.

As we know, male/female vowel characteristics are captured by formant

positions, bandwidths, and amplitudes, so accurate formant estimation is

necessary. It is important to pay particular attention to the measurement technique

and to the degree of accuracy which can be achieved through it. Although formant

features have been measured for a variety of different studies, the accuracy of these
measurements is still a matter of conjecture.

Formant estimation is influenced by (Atal, 1974a; Childers et al., 1985a;
Krishnamurthy and Childers, 1986):

o the effect of the periodic vocal fold excitation, especially when the

harmonic is near the formant.

o the effect of the excitation-spectrum envelope.

o the effect of time averaging over several excitation cycles in the

analysis when the vocal folds are repeatedly in open-phase (large








source-tract interaction) and closed-phase (little or no source-tract
interaction) conditions.
Frame based asynchronous LPC analysis cannot reduce the effect of

source-tract interaction because this technique uses windows that average the data

over several excitation epochs. The pitch synchronized closed phase covariance

(CPC) method can reduce the effect of source-tract interaction. However, in certain

situations, the vocal tract filter derived by this method may be unstable because of

the short closed glottal intervals, especially for females and children (Ting et al.,

1988).

Sequential adaptive analysis methods offer an attractive alternative processing
strategy since they overcome some of the drawbacks of frame-based analysis. The
closed-phase WRLS-VFF method that tracks the time-varying parameters of the

vocal tract and updates the parameters during the glottal closed phase interval can

reduce the formant estimation error. Experimental results (Ting et al., 1988; Ting,

1989) show that the formant tracking ability and formant estimation accuracy of the

WRLS-VFF algorithm are superior to those of the LPC-based method. Detailed formant

features, including frequencies, bandwidths, and amplitudes in the fine analysis

stage were obtained by using this method. The EGG signals were used to assist in

locating the closed phase portion of the speech signal (Childers and Larar, 1984;
Krishnamurthy and Childers, 1986).

There were two approaches for testing the relative importance of various
vowel features for gender recognition:

Statistical tests. Since formant characteristics such as frequencies,

bandwidths, and amplitudes depend on or are influenced by two
factors (i.e., gender as well as vowels) and each experimental

subject produces more than one vowel, our experiments should be

referred to as two-factor experiments having repeated measures on








the same subject (Winer, 1971). Therefore, a two-way ANOVA was
used to perform the statistical test. The significance of the
difference between each individual feature in terms of male/female

groups was analyzed.
Automatic recognition. First, the individual or grouped features,

such as only the fundamental frequency or only the formant

frequencies or bandwidths (but from all formants), were used to

form the reference and test templates. Then automatic recognition

schemes were applied on these templates. Finally, the recognition

error rates for different features were compared.

In Chapter 6, the detailed background of the closed phase WRLS-VFF method

and the experimental design based on fine analysis will be presented.















CHAPTER 3
DATA COLLECTION AND PROCESSING



3.1 Database Description


The database consists of speech and EGG data collected from 52 normal

subjects (27 males and 25 females), with speakers' ages varying from 20 to 80 years.

The speech and EGG signals were synchronously and directly digitized.

Each subject read, after some practice, the following SAMPLE PROTOCOL that

includes 27 tasks.


SAMPLE PROTOCOL


Task 1.  Count 1-10 with comfortable pitch & loudness.
Task 2.  Count 1-5 with progressive increase in loudness.
Task 3.  Sustain phonation of the vowel /IY/ in the word BEET.
Task 4.  Sustain phonation of the vowel /I/ in the word BIT.
Task 5.  Sustain phonation of the diphthong /AI/ in the word BAIT.
Task 6.  Sustain phonation of the vowel /E/ in the word BET.
Task 7.  Sustain phonation of the vowel /AE/ in the word BAT.
Task 8.  Sustain phonation of the vowel /OO/ in the word BOOT.
Task 9.  Sustain phonation of the vowel /U/ in the word BOOK.
Task 10. Sustain phonation of the diphthong /OU/ in the word BOAT.
Task 11. Sustain phonation of the vowel /OW/ in the word BOUGHT.
Task 12. Sustain phonation of the vowel /A/ in the word BACH.
Task 13. Sustain phonation of the vowel /UH/ in the word BUT.
Task 14. Sustain phonation of the vowel /ER/ in the word BURT.
Task 15. Sustain phonation of the whisper /H/ in the word HAT.
Task 16. Sustain phonation of the fricative /F/ in the word FIX.
Task 17. Sustain phonation of the fricative /TH/ in the word THICK.
Task 18. Sustain phonation of the fricative /S/ in the word SAT.
Task 19. Sustain phonation of the fricative /SH/ in the word SHIP.
Task 20. Sustain phonation of the fricative /V/ in the word VAN.
Task 21. Sustain phonation of the fricative /TH/ in the word THIS.
Task 22. Sustain phonation of the fricative /Z/ in the word ZOO.
Task 23. Sustain phonation of the fricative /ZH/ in the word AZURE.
Task 24. Produce chromatic scale on "la" (attempt to go up, then down
         as one effort -- pause between top 2 notes).
Task 25. Sentence "We were away a year ago."
Task 26. Sentence "Early one morning a man and a woman ambled along a
         one mile lane."
Task 27. Sentence "Should we chase those cowboys?"


A subset of the above was used in this research. It consisted of

1. Ten sustained vowels: /IY/, /I/, /E/, /AE/, /OO/, /U/, /OW/, /A/, /UH/,

and /ER/. There were a total of 520 vowels for 52 subjects: 270
vowels from males and 250 vowels from females.

2. Five sustained unvoiced fricatives (including a whisper): /H/, /F/,

/TH/, /S/, and /SH/. There were a total of 260 unvoiced fricatives for
all subjects: 135 from males and 125 from females.

3. Four voiced fricatives: /V/, /TH/, /Z/, and /ZH/. There were a total

of 208 voiced fricatives for all subjects: 108 from males and 100
from females.


3.2 Speech and EGG Digitization

All of the experimental data were collected with the subjects situated inside an
Industrial Acoustics Company single wall sound booth. The speech was picked up

with an Electro Voice RE-10 dynamic cardioid microphone and the EGG signal was
monitored by a Synchrovoice device. Amplification of the speech and EGG signals
was accomplished with a Digital Sound Corporation DSC-240 Audio Control
Console. The two channels were alternately sampled at 20 kHz by a Digital Sound

Corporation DSC-200 Digital Audio Converter system with 16 bits of precision.








The low-pass, anti-aliasing and reconstruction filters of the DSC-200 were
connected to the analog side of the converter. Both signals were bandlimited to 5
kHz by these passive elliptic filters, with a specification of minimum stopband

attenuation of 55 dB and passband ripple of 0.2 dB. The DSC-240 station

provides audio signal interfacing to the DSC-200, which includes input and output
buffering as well as level metering and signal path switching.

The utterances were directly digitized since this choice avoids any distortion

that may be introduced in the tape recording process (Berouti et al., 1977; Naik,

1984). An extender attached to the microphone kept the speaker's lips 6 inches

away. With the microphone and EGG electrodes in place, the researcher ran the

data collection program on a terminal inside the sound room. A two channel

Tektronix Type 564B Storage Oscilloscope was connected to DSC-240 so both

speech and EGG signals were monitored. The program prompted the researcher by

presenting a list of commands on the screen. The researcher initiated digitization by

depressing the "D" key on the keyboard. Immediately after digitization, another
prompt indicated termination of the sampling process. The digitized utterance could

be played back and an option existed to repeat the digitization process if it was
thought that part of the utterance might have been spoken abnormally or the

digitized speech and EGG signals were unsatisfactory. For example, the speakers

were instructed to repeat an utterance if the panel of experts who were sitting in the

sound room or the speaker felt that it was rushed, mispronounced, too low, etc. The

entire protocol with utterances repeated as necessary took an average of 15-20

minutes to collect. About 150-200 seconds of speech and EGG were automatically

stored on disk. Thus, for each subject, about 12000-16000 blocks (512 bytes per

block) of data were collected.

Since the speech and EGG channels were alternately sampled, the resulting
file of digitized data had the two signals interleaved. The trivial task of








demultiplexing was performed off-line after data collection. Once the data were
demultiplexed, the speech and EGG were trimmed to discard the unnecessary data
before and after an utterance while keeping the onset and offset portions at each end
of the data. After trimming, about 4500-6500 blocks of data were stored on disk for
each subject.
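For illustration, the demultiplexing step can be sketched as follows. This is a minimal sketch in Python, not the routine actually used; the channel order (speech on the even-indexed samples) is an assumption, since it is not specified above.

    import numpy as np

    def demultiplex(interleaved):
        """Split an alternately sampled two-channel record into its
        speech and EGG streams (effective rate 10 kHz per channel)."""
        data = np.asarray(interleaved, dtype=np.int16)  # 16-bit samples
        speech = data[0::2]   # even-indexed samples: assumed speech channel
        egg = data[1::2]      # odd-indexed samples: assumed EGG channel
        return speech, egg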


3.3 Synchronization of Data


When the speech and EGG signals were used during the analysis stage, they
were time aligned to account for the acoustic propagation delay from the larynx to

the microphone. The microphone was kept a fixed 15.24 centimeters (6 inches)

away from the speakers' lips to reduce breath noises and to simplify the alignment
process. Synchronization of the waveforms had to account for the distance from the
vocal folds to the microphone. To do so, average vocal tract lengths of 17 cm for

males and 15 cm for females were assumed. The number of samples to discard

from the beginning of the speech record was then

# samples = Int[(32.24/34442) x 10000 + 0.5]    (3.1)

for males and

# samples = Int[(30.24/34442) x 10000 + 0.5]    (3.2)

for females.

Equations (3.1) and (3.2) show that a 10 sample correction is appropriate for
males and a 9 sample correction is appropriate for females. Examination of the data
also supported use of these figures for adult speakers. Examples of aligned speech

and EGG signals for a male and female speaker are shown in Figure 3.1.











Figure 3.1 Examples of aligned speech and EGG signals
for (a) male and (b) female speakers.














CHAPTER 4
EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS




As stated in Chapter 2, coarse analysis applies classical pattern recognition
techniques and asynchronous linear prediction coding (LPC) analysis of speech.

The specific goal of this analysis was to develop and test the algorithms to achieve

rapid gender recognition from only a brief speech data record. Figure 4.1 shows the
canonical pattern recognition model used in the gender recognition system. There are

four basic steps in the model:

1) acoustic parameter extraction,

2) test and reference pattern or template formation,
3) pattern similarity determination, and
4) decision rule.

The input is the acoustic waveform of the spoken speech signal; the desired
output is a "best" estimate of the speaker's gender in the input. Such a model can be
a part of a speech or speaker recognition system or a front end processor of the
system. The following discussion of the coarse analysis proceeds in the context of
Figure 4.1.


4.1 Asynchronous LPC Analysis

4.1.1 Linear Prediction Concepts

Linear prediction, also known as the autoregressive (AR) model, the all-pole model,
or the maximum entropy model, is widely used in speech processing. This method has





























































Figure 4.1 A pattern recognition model
for gender recognition from speech.








become the predominant technique for estimating the basic speech parameters (e.g.,
pitch, formants, spectra, vocal tract area functions) and for representing speech
for low bit-rate transmission or storage. The method was first applied to speech
processing by Atal and Schroeder (1970) and Atal and Hanauer (1971). For speech
processing, the term linear prediction refers to a variety of essentially equivalent
formulations of the problem of modeling the speech waveform (Markel and Gray,
1976; Makhoul, 1975b). These different formulations usually lead to similar results,
but each has provided an insight into the speech modeling problem, and the choice
among them is generally dictated by computational demands.
The particular form of this model that is appropriate for this research is

depicted in Figure 4.2. In this case, the composite spectrum effects of radiation,

vocal tract, and glottal excitation are represented by a time-varying digital filter
whose steady-state system function is of the form


H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}    (4.1)

This system is excited by an impulse train for voiced speech or a random noise
sequence for unvoiced speech.
The above system function can be written alternatively in the time domain as

s(n) = \sum_{k=1}^{p} \alpha_k s(n-k) + G u(n)    (4.2)

Let us assume we have available past data samples from (n-p) to (n-1) and we are

predicting a new nth sample of the data



















Figure 4.2 A digital model of speech production.








\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (4.3)

The error between the value of the actual nth sample and its estimate is



e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)    (4.4)

and equivalently,



s(n) = \sum_{k=1}^{p} a_k s(n-k) + e(n)    (4.5)

If the linear prediction model of Equation (4.5) conforms to the basic speech
production model given by Equation (4.2), then

e(n) = G u(n)    (4.6)

a_k = \alpha_k    (4.7)


Thus the coefficients {a_k} identify the system whose output is s(n). The problem
then is to determine the values of the coefficients {a_k} from the actual speech signal.

The criterion used to obtain the coefficients {a_k} is the minimization of the
short-time average prediction error E with respect to each coefficient a_i, over some
time interval, where


E = \sum_{n} [e(n)]^2    (4.8)


This leads to the following set of equations









\sum_{k=1}^{p} a_k \sum_{n} s(n-k)\, s(n-i) = \sum_{n} s(n)\, s(n-i),  1 \le i \le p    (4.9)


For a short-time analysis, the limits of summation are finite. The particular
choice of these limits has led to two methods of analysis (i.e., the autocorrelation
method (Markel and Gray, 1976) and the covariance method (Atal and Hanauer,
1971)).

The autocorrelation method results in a filter structure that is guaranteed to be
stable. It operates on a data segment that is windowed using a Hanning,

Hamming, or another window, typically 10-20 msec long (two to three pitch

periods).

The covariance method, on the other hand, gives a filter with no guaranteed

stability, but requires no explicit windowing. Hence it is eminently suitable for pitch

synchronous analysis.

One of the important features of the linear prediction model is that the
combined contribution of the glottal flow, the vocal tract, and the radiation effect at

the lips is represented by a single recursive filter. The difficult problem of

separating the contribution of the source function from that of the vocal tract system
is thus completely avoided.
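For illustration, a minimal Python sketch of the autocorrelation method follows; it is not the actual routine used in this work. It solves Equation (4.9) by the Levinson-Durbin recursion, windows the frame with the Hamming window specified in the analysis conditions below, and, as a by-product, yields the PARCOR coefficients that appear in Section 4.2.4.

    import numpy as np

    def lpc_autocorrelation(frame, order):
        """Solve Equation (4.9) by the autocorrelation method.
        Returns (a, k, E): predictor coefficients a_1..a_p of Equation
        (4.5), PARCOR coefficients k_1..k_p, and the prediction error E."""
        x = np.asarray(frame, float) * np.hamming(len(frame))
        # short-time autocorrelation r(0)..r(p)
        r = np.array([np.dot(x[:len(x) - i], x[i:]) for i in range(order + 1)])
        A = np.zeros(order + 1)       # inverse-filter coefficients, A[0] = 1
        A[0] = 1.0
        k = np.zeros(order)
        E = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(A[1:i], r[i - 1:0:-1])
            k[i - 1] = -acc / E
            A[1:i] = A[1:i] + k[i - 1] * A[i - 1:0:-1]
            A[i] = k[i - 1]
            E *= 1.0 - k[i - 1] ** 2  # prediction error shrinks at each order
        return -A[1:], k, E           # a_k = -A_k under the sign convention of (4.5)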


4.1.2 Analysis Conditions

In order to extract acoustic parameters rapidly, a conventional pitch

asynchronous autocorrelation LPC method was used, which applied a fixed frame
size, frame rate, and number of parameters per frame. These analysis conditions
were:
Order of the filter: 8, 12, 16, 20
Analysis frame size: 256 points/frame








Frame overlap: None
Preemphasis Factor: 0.95

Analysis Window: Hamming

Data set for coefficient calculations: six frames total. The first two of these

were picked from near the voice onset of an utterance, the next two from

the middle of the utterance, and the last two from near the voice offset of the

utterance. By averaging the six sets of coefficients obtained from these six frames, a

template coefficient set was calculated for each sustained utterance, such as a vowel,

an unvoiced fricative, or a voiced fricative.



4.2 Acoustic Parameters

One of the key issues in developing a recognition system is to identify

appropriate features and measures which will support good recognition
performance. Several acoustic parameters were considered as feature candidates in

this study.


4.2.1 Autocorrelation Coefficients

They are defined conventionally as (Atal, 1974b)


R(k) = \sum_{n=0}^{\infty} h(n)\, h(n+|k|)    (4.10)


where h(n) is the impulse response of the filter. The relationship between the P

autocorrelation function coefficients and P LPC coefficients is unique in that they

can be obtained from each other (Rabiner and Schafer, 1978).










4.2.2 LPC coefficients

LPC coefficients are defined conventionally as (Rabiner and Schafer, 1978)


\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (4.11)

where s(n-k) is the (n-k)th speech sample, \hat{s}(n) is the nth predicted output, and a_k is
the kth LPC coefficient. LPC coefficients are determined by minimizing the
short-time average prediction error.


4.2.3. Cepstrum Coefficients

Cepstral coefficients can be obtained by the following recursive formula
(Rabiner and Schafer, 1978)


c_0 = 0,

c_k = a_k + \sum_{i=1}^{k-1} \left( \frac{i}{k} \right) a_{k-i}\, c_i,  1 \le k \le p    (4.12)

where a_k is the kth LPC coefficient.
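A minimal sketch of this recursion, assuming the a_k come from an LPC analysis such as the one sketched in Section 4.1.1:

    import numpy as np

    def lpc_to_cepstrum(a):
        """Cepstral coefficients c_1..c_p from LPC coefficients a_1..a_p
        via the recursion of Equation (4.12), with c_0 = 0."""
        p = len(a)
        c = np.zeros(p + 1)
        for k in range(1, p + 1):
            c[k] = a[k - 1] + sum((i / k) * a[k - i - 1] * c[i]
                                  for i in range(1, k))
        return c[1:]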


4.2.4 Reflection Coefficients

If we consider a model for speech production that consists of a concatenation
of N lossless acoustic tubes, then the reflection coefficients are defined as (Rabiner
and Schafer, 1978)

r(k) = \frac{A(k+1) - A(k)}{A(k+1) + A(k)}    (4.13)








where A(k) is the area of the kth lossless tube. The reflection coefficient determines
the fraction of energy in a traveling wave that is reflected at each section boundary.

Further, r(i) is related to the PARCOR coefficient k(i) by (Rabiner and Schafer,
1978)


r(i) = k(i) (4.14)


where k(i) can be obtained from LPC coefficients by recursion.


4.2.5 Fundamental Frequency and Formant Information
This set of features consists of frequencies, bandwidths and amplitudes of the
first, second, third and fourth formants and the fundamental frequencies (not for
fricatives). Formant information was obtained by a peak-picking technique, using
an FFT on the LPC coefficients. Fundamental frequency was calculated based upon
a modified cepstral algorithm.



4.3 Distance Measures


Several distance measures were considered.


4.3.1 Euclidean Distance


D_{EUC} = \left[ (X-Y)^t (X-Y) \right]^{1/2}    (4.15)

where X and Y are the test and reference vectors, respectively, and t denotes the
transpose of the vector.








4.3.2 LPC Log Likelihood Distance
It was proposed by Itakura (1975) and defined as

D_{lpc}(a, \hat{a}) = \log \left[ \frac{a R a^t}{\hat{a} R \hat{a}^t} \right]    (4.16)

where a and \hat{a} are the LPC coefficient vectors of the reference and test speech,
and R is the matrix of autocorrelation coefficients of the test speech. An
interpretation of this formula is given in Figure 4.3 below, in which the subscript r
denotes reference and the subscript t denotes test.
The denominator term can be obtained by passing the test speech signal
s_t(n) through the inverse LPC system of the test speech, H_t(z), giving the energy \alpha
of the error signal. Similarly, the numerator term can be obtained by passing the same test
signal s_t(n) through the inverse LPC system of the reference, H_r(z), giving the energy \beta
of the error signal.
Thus we obtain


D_{lpc}(a, \hat{a}) = \log(\beta / \alpha)    (4.17)


It can also be shown that this distance measure is related to the spectra
dissimilarity between the test and reference speech signals.
For computational efficiency, variables can be changed and Equation (4.16)
or (4.17) can be rewritten as

D_{lpc}(a, \hat{a}) = \log \left[ \frac{\sum_{k=0}^{p} r_a(k)\, r(k)}{E} \right]    (4.18)






where r(k) is the autocorrelation of the speech segment to be recognized, E is the
total squared LPC prediction error associated with the estimates \hat{a}(k) from this
segment, and r_a(k) is the autocorrelation of the true (reference) LPC coefficients.
The block diagram for this subroutine is shown in Figure 4.4.

Figure 4.3 An interpretation of the LPC log likelihood
distance measure.

Figure 4.4 The block diagram of the LPC log likelihood
distance computation.
The spectral domain interpretation of Equation (4.18) is (Rabiner and
Levinson, 1981)


D_{lpc}(a, \hat{a}) = \log \left[ \int_{-\pi}^{\pi} \left| \frac{H_r(e^{j\omega})}{H_t(e^{j\omega})} \right|^2 \frac{d\omega}{2\pi} \right]    (4.19)


(i.e., an integrated square of the ratio of LPC spectra between reference and test
speech).
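For illustration, a minimal sketch of Equation (4.16) follows; it assumes the inverse-filter vectors are supplied in the form [1, a_1, ..., a_p], so that the quadratic forms are the residual energies \beta and \alpha of Figure 4.3.

    import numpy as np

    def lpc_log_likelihood(a_ref, a_test, r_test):
        """LPC log likelihood distance of Equation (4.16).
        a_ref, a_test: inverse-filter vectors [1, a_1..a_p] of the
        reference and test speech; r_test: autocorrelation r(0)..r(p)
        of the test speech."""
        n = len(a_ref)
        # Toeplitz autocorrelation matrix R of the test speech
        R = np.array([[r_test[abs(i - j)] for j in range(n)] for i in range(n)])
        beta = a_ref @ R @ a_ref      # residual energy through the reference inverse filter
        alpha = a_test @ R @ a_test   # residual energy through the test inverse filter
        return float(np.log(beta / alpha))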


4.3.3 Cepstral Distortion
It can be defined as (Nocerino et al., 1985)


D_{cep}(c, c') = \sum_{k=-n}^{n} (c_k - c'_k)^2    (4.20)


It can be shown that the power spectrum corresponding to the cepstrum is a
smoothed version of the true log spectral density function. The fewer the cepstral
coefficients used, the smoother the resultant log spectral density. It can also be
shown that this truncated cepstral distortion measure is a good approximation to the
L2 norm of the log spectral distortion measure between two time series, x(n) and
x'(n),



























D_{L2} = \int_{-\pi}^{\pi} \left[ \log |X(\omega)|^2 - \log |X'(\omega)|^2 \right]^2 \frac{d\omega}{2\pi}    (4.21)


where X(\omega) is the Fourier transform of x(n). Gray and Markel (1976) showed that
for an LPC analysis with a filter order of 10 the correlation coefficient between D_{L2}
and D_{cep} is 0.98, while for an order of 20 the correlation coefficient is 0.997.


4.3.4 Weighted Euclidean Distance


D_{WEUC} = \left[ (X-Y)^t W^{-1} (X-Y) \right]^{1/2}    (4.22)


where X is the test vector, W is the symmetric covariance matrix obtained using a
set of reference vectors (e.g., from a set of templates which represent subjects of the
same gender), and Y is the mean vector of this set of reference vectors. The
weighting compensates for correlation between features in the overall distance and
reduces the intragroup variations. The weighted Euclidean distance is a simplified
version of the likelihood distance measure using the probability density function
discussed below.


4.3.5 Probability Density Function

D_{PDF} = \left( \frac{1}{2\pi} \right)^{n/2} \frac{1}{|W|^{1/2}} \exp \left[ -\frac{(X-Y)^t W^{-1} (X-Y)}{2} \right]    (4.23)

where X, Y, and W are the same as in Equation (4.22) and |W| is the determinant of W.
The decision principle of this distance measure minimizes the probability of
error.
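The measures above can be stated directly in code; the following minimal Python fragment is illustrative only. Note that the PDF of Equation (4.23) is a similarity score, so a decision rule based on it picks the gender that maximizes the score rather than minimizes a distance.

    import numpy as np

    def d_euc(x, y):
        """Euclidean distance, Equation (4.15)."""
        d = np.asarray(x) - np.asarray(y)
        return float(np.sqrt(d @ d))

    def d_weuc(x, y, W):
        """Weighted Euclidean distance, Equation (4.22)."""
        d = np.asarray(x) - np.asarray(y)
        return float(np.sqrt(d @ np.linalg.solve(W, d)))

    def pdf_score(x, y, W):
        """Gaussian density of Equation (4.23); larger means closer."""
        d = np.asarray(x) - np.asarray(y)
        n = len(d)
        q = d @ np.linalg.solve(W, d)
        return float((2 * np.pi) ** (-n / 2)
                     / np.sqrt(np.linalg.det(W)) * np.exp(-0.5 * q))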








4.4 Template Formation and Recognition Schemes


4.4.1 Purpose of Design

Another important issue in developing a recognition system is the selection of

appropriate template formation and recognition schemes.
During initial exploratory studies of fixed-text recognition using spectral

pattern matching techniques in the Pruzansky study (1963), the use of the long-term

average technique to form a feature vector was discovered to have potential for

free-text speaker recognition. The speaker recognition error rate was found to

remain undegraded (at 11 percent) even after spectral amplitudes were averaged

over all frames of speech data into a single reference spectral amplitude vector for

each talker. Markel et al. (1977) demonstrated that the between-to-within speaker

variation ratio was significantly increased by long-term averaging of the parameter

sets (thus permitting free-text recognition).

Temporal cues also appeared not to play a role in speaker gender

identification (Lass and Mertz, 1978). They found that gender identification

accuracy remained high and unaffected by temporal speech alterations when the

normal temporal features of speech were altered by means of the backward playing

and time compressing of speech samples.

Therefore, we would reasonably believe that the gender information is

time-invariant. Thus, long-term averaging would also emphasize the speaker's

gender information and increase the between-to-within gender variation ratio. In

practice we would also achieve free-text gender recognition in which gender

identification would be determined before recognition of speech or speaker and

thus, reduce the speech or speaker recognition search space to half.

The purpose of using different test and reference template formation schemes

is to verify the hypothesis above and, if it is correct, to determine how much








averaging has to be performed to obtain the best gender recognition. In the

preliminary attempt, the averaging was first done within three classes of the sounds
(i.e., vowels, unvoiced fricatives, and voiced fricatives).


4.4.2 Test and Reference Template Formation

The averaging procedures used to create test and reference templates for the

present experiment employed a multi-level combination approach as illustrated in

Figure 4.5.

The lower layer templates were feature parameter vectors obtained from each

utterance by an LPC analysis as described in the last section. They can be

autocorrelation, LPC, or cepstrum coefficients, etc. A lower layer template

coefficient set was calculated by averaging six sets of coefficients obtained from six

frames for each sustained utterance such as a vowel, an unvoiced fricative, or a

voiced fricative for every subject.

The next level of combination averaged all templates in the lower layer for

each subject to form a single median layer template to represent this subject.

Templates of all utterances for the same phoneme group (e.g., vowels, unvoiced

fricatives, or voiced fricatives) were averaged.

In the last stage, the single male and female templates were formed

in the same manner as above. Each gender was represented by a single

token (centroid) obtained by averaging all templates in the median layer.

It is evident that from the lower layer to the upper layer, a higher degree of
averaging is achieved.
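For illustration, a minimal sketch of this multi-level averaging follows; the dictionary layout is an assumption introduced here, not the data structure actually used.

    import numpy as np

    def form_templates(features):
        """features: {gender: {subject: [utterance feature vectors]}}.
        Returns the median layer (one template per subject) and the
        upper layer (one 'universal token' per gender) of Figure 4.5."""
        median = {g: {s: np.mean(np.asarray(vs), axis=0)   # average over utterances
                      for s, vs in subjects.items()}
                  for g, subjects in features.items()}
        upper = {g: np.mean(list(subj.values()), axis=0)   # average over subjects
                 for g, subj in median.items()}
        return median, upper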

Figure 4.5 Test and reference template formation.

Figure 4.6 (a) Two reflection coefficient templates of vowels
for male and female speakers in the upper layer.
(b) The corresponding spectra.

Figure 4.6(a) shows two reflection coefficient templates of vowels for male
and female speakers in the upper layer. The filter order was 12 so that there were 12
elements in each template (vector). Each template can be considered a "universal
token" representing each gender. The data in the figure are shown as
means ± standard errors (SE), which were calculated from the median layer

templates. We will see in the next Chapter that by applying these two tokens as
reference templates with recognition Scheme 3 and the Euclidean distance, a
100% gender recognition rate can be achieved. The result is not surprising if we
notice that the within-gender variation for these reflection coefficients, as
represented by SE in the figure, was small, compared to the between-gender
variation. It is also easily noted that elements 1, 4, 5, 6, 7, 8, 9, and 10 of these
reference templates account for the most between-gender variation. On the other
hand, elements 2, 3, 11, and 12 of these reference templates account for little
between-gender variation and thus could be discarded to reduce the dimensionality
of the vector. Figure 4.6(b) shows the spectra corresponding to the two "universal"
reflection coefficient templates.
Similarly, Figure 4.7(a) and (b) show two reflection coefficient templates
(with the same filter order of 12) and the corresponding spectra of unvoiced

fricatives for male and female speakers in the upper layer. Interestingly, strong

peaks are present in the "universal" female spectrum for unvoiced fricatives, but only
several ripples appear in the male spectrum. We will see later that by using these
two tokens as reference templates with recognition Scheme 3 and the Euclidean
distance, an 80.8% gender recognition rate can be achieved. Finally, Figure 4.8(a)

and (b) are two cepstral coefficient templates (with the same filter order of 12) and
the corresponding spectra of voiced fricatives for male and female speakers in the
upper layer. It is shown later that by using these two tokens as reference templates
with the recognition Scheme 3 and the Euclidean distance, a 92.3% gender

recognition rate can be achieved. The "universal" spectra for the two genders in

Figure 4.6(b), 4.7(b), and 4.8(b) possess the basic properties for vowels, unvoiced
fricatives, and voiced fricatives. For example, while the energy of the vowels is
concentrated in the lower frequency portion of the spectrum, the energy of the



















unvoiced fricatives is concentrated in the higher frequency portion of the spectrum.
And the energy of voiced fricatives is more or less equally distributed.

Figure 4.7 (a) Two reflection coefficient templates of unvoiced fricatives
for male and female speakers in the upper layer.
(b) The corresponding spectra.

Figure 4.8 (a) Two cepstral coefficient templates of voiced fricatives
for male and female speakers in the upper layer.
(b) The corresponding spectra.


4.4.3 Nearest Neighbor Decision Rule

The other major step in the pattern recognition model is the decision rule

which chooses which reference template most closely matches the unknown test

template. Although a variety of approaches are applicable, only two decision rules

have been used in most practical systems, namely, the K-nearest neighbor rule

(KNN rule) and the nearest neighbor rule (NN rule).

The KNN rule is applied when each reference class (e.g., gender) is

represented by two or more reference templates (e.g., as would be used to make the

reference templates independent of the speaker). The KNN rule operates as follows:

Assume we have M reference templates for each of two genders, and for each

template a distance score is obtained. If we denote the distance for the ith reference

template of the jth gender as D_{i,j} (1 \le i \le M and j = 1, 2), this set of distance scores

can be ordered such that

D_{1,j} \le D_{2,j} \le \cdots \le D_{M,j}    (4.24)


Then for the KNN rule we compute the average distance (radius) for the jth gender

as

r_j = \frac{1}{K} \sum_{i=1}^{K} D_{i,j}    (4.25)


and we choose the index j* with the smallest average distance as the "recognized"

gender








j^* = \arg\min_{j=1,2} r_j    (4.26)


When K is equal to 1, the KNN rule becomes the NN rule (i.e., it chooses the

reference template with the smallest distance as the recognized template).

The importance of the KNN rule is seen for word recognition when M is from 6

to 12, in which case it has been shown that a real statistical advantage is obtained

using the KNN rule (with K = 2 or 3) over the NN rule (Rabiner et al., 1979).

However, since there was no previous knowledge of the decision rule as applied to

gender recognition, the NN rule was first used in this preliminary experiment.
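A minimal sketch of the KNN rule of Equations (4.24)-(4.26) follows; with K = 1 it reduces to the NN rule used in these experiments.

    import numpy as np

    def knn_decide(distances, K=1):
        """distances: {gender: list of distances D_ij to that gender's
        reference templates}. Sorts the scores per (4.24), averages the
        K smallest per (4.25), and returns the gender with the smallest
        average distance per (4.26)."""
        radii = {g: float(np.mean(np.sort(np.asarray(d))[:K]))
                 for g, d in distances.items()}
        return min(radii, key=radii.get)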


4.4.4 Structure of Four Recognition Schemes

To investigate how much averaging should be done for the test and reference

templates to gain the best performance for the gender recognizer, several

recognition schemes were designed. Table 4.1 presents a brief summary of these

schemes.



Table 4.1 Four recognition schemes


Test template from Reference template from


SCHEME 1 LOWER LAYER MEDIAN LAYER
SCHEME 2 LOWER LAYER UPPER LAYER
SCHEME 3 MEDIAN LAYER UPPER LAYER
SCHEME 4 MEDIAN LAYER MEDIAN LAYER




Figure 4.9 Structures of the four recognition schemes.

Scheme 1 is illustrated in Figure 4.9(a). In the training stage, one test

template for each test utterance (i.e., the lower layer), and one reference template








for each subject (i.e., the median layer), were formed. The set of the entire median
layer constituted the reference cluster that includes all median templates. In the
testing stage, the distance measure for each lower layer template of all test subjects
was calculated with respect to each of the median layer templates, and the minimum
distance was found. The speaker gender of the lower layer utterance was then

classified as male or female, according to the gender known for the median layer

reference template.

Scheme 2 is illustrated in Figure 4.9(b). In the training stage, one test
template for each test utterance (i.e., the lower layer), and one reference template

for each gender (i.e., the upper layer), were formed. The upper layer constituted the

reference cluster that includes only two gender templates. In the testing stage, the
distance measure for each lower layer template of all test subjects was calculated

with respect to each of those upper layer templates and the minimum distance was
found. The speaker gender of the lower layer utterance was then classified as male

or female, according to the gender known for the upper layer reference template.

Figure 4.9(c) shows Scheme 3. In the training stage, one test template for each
test subject (i.e., the median layer), and one reference template for each gender (
i.e., the upper layer), were formed. The set of the entire median layer constituted
the test pool that includes all median templates. In the testing stage, the distance

measure for each median layer template of all test subjects was calculated with

respect to each of those upper layer templates and the minimum distance was found.

The speaker gender of the median layer template was then classified as male or
female, according to the gender known for the upper layer reference template.

Figure 4.9(d) shows Scheme 4. In the training stage, only the median layer
templates were formed, and each subject was represented by a single template. The
median layer constituted both the test and reference pools. In the testing stage, the
Leave-One-Out or exclusive procedure (which is discussed in detail in the next








section) was applied. The distance measure for each median layer template was

calculated with respect to each of the rest of the median layer templates, and the

minimum distance was found. The speaker gender of the test template was then

classified as male or female, according to the gender known for the reference

template. The above steps were repeated until all subjects were tested.



4.5 Resubstitution and Leave-One-Out Procedures


After the classifier is designed, it is necessary to evaluate its performance

relative to competing approaches. The error rate was considered as the performance

measure.

Four popular empirical approaches that count the number of errors when

testing the classifier with a test data set are (Childers, 1989):

The Resubstitution Estimate (inclusive). In this procedure, the same data set

is used for both designing and testing the classifier. Experimentally and

theoretically this procedure gives a very optimistic estimate, especially when the

data set is small. Note, however, that when a large data set is available, this method

is probably as good as any procedure.

The Holdout Estimate. The data is partitioned into two mutually exclusive

subsets in this procedure. One set is used for designing the classifier and the other

for testing. This procedure makes poor use of the data since a classifier designed on

the entire data set will, on the average, perform better than a classifier designed on

only a portion of the data set. This procedure is known to give a very pessimistic

error estimate.

The Leave-One-Out Estimate (exclusive). This procedure assumes that there

are n data samples available. Remove one sample from the data set. Design the

classifier with the remaining (n-1) data samples and then test it with the removed








data sample. Return the sample removed earlier to the data set. Then repeat the

above steps, removing a different sample each time, for n times, until every sample

has been used for testing. The total number of errors is the leave-one-out error
estimate. Clearly this method uses the data very effectively. This method is also

referred to as the jackknife method.

The Rotation Estimate. In this procedure, the data set is partitioned into n/d

disjoint subsets, where d is a divisor of n. Then, remove one subset from the design
set, design the classifier with the remaining data and test it on the removed subset,

not used in the design. Repeat the operation n/d times until every subset is used

for testing. The rotation estimate is the average frequency of misclassification over

the n/d test sessions. When d=1 the rotation method reduces to the leave-one-out

method. When d=n/2 it reduces to the holdout method, where the roles of the design

and test sets are interchanged. The interchanging of design and test sets is known in

statistics as cross-validation in both directions. As we may expect, the properties of

the rotation estimate will fall somewhere between the leave-one-out method and

holdout method. The rotation estimate will be less biased than the holdout

estimate, and its variance will be less than that of the leave-one-out estimate.

In order to use the database effectively, the leave-one-out procedure was
adopted for the experiments. For comparison, the resubstitution procedure was also

used in selected experiments.
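For illustration, a minimal sketch of the leave-one-out procedure as applied to Scheme 4 follows; the distance and decision functions are placeholders for any of the measures of Section 4.3 and the rules of Section 4.4.3.

    def leave_one_out_error(templates, genders, distance, decide):
        """templates: median-layer templates; genders: matching labels;
        distance(x, y): a measure from Section 4.3; decide: a rule such
        as knn_decide. Returns the leave-one-out error estimate."""
        errors = 0
        for i, test in enumerate(templates):
            dists = {}
            for j, ref in enumerate(templates):
                if j == i:            # remove the test sample from the design set
                    continue
                dists.setdefault(genders[j], []).append(distance(test, ref))
            if decide(dists) != genders[i]:
                errors += 1
        return errors / len(templates)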



4.6 Separability of Acoustic Parameters
Using Fisher's Discriminant Ratio Criterion



4.6.1 Fisher's Discriminant and F ratio

After choosing acoustic feature candidates, we can see how well they separate

different genders by analytically studying the database. There are many measures








of separability which are generalizations of one kind or another of the Fisher's
discriminant ratio (Childers et al., 1982; Parsons, 1986) concept. The ratio usually

serves as a criterion for selecting features for discrimination.

The ability of a feature to separate classes depends on the distance between
classes and the scatter within classes (generally there will be more than two classes).
This separation is estimated by representing each class by its mean and taking the
variance of the means. This variance is then compared to the average width of the

distribution for each class (i.e., the mean of the individual variances). This measure

is commonly called the F ratio:


F = \frac{\text{Variance of the means (over all classes)}}{\text{Mean of the variances (within classes)}}    (4.27)


The F ratio is reduced to Fisher's discriminant when it is used for evaluating a
single feature and there are only two classes. For this reason, the F ratio is also
referred to as the generalized Fisher's discriminant.

In the case of pattern recognition, there are vectors of features, f, and
observed values of f for all the classes we are interested in recognizing. Then two

covariance matrices can be calculated, depending on how the data are grouped.

First, the covariance for a single recognition class can be computed by
selecting only feature measurements for class i. Let any vector from this class be f_i.
Then the within-class covariance matrix for class i is


W_i = \left\langle (f_i - \mu_i)(f_i - \mu_i)^t \right\rangle    (4.28)


where \langle \cdot \rangle represents the expectation or averaging operation and \mu_i represents the

mean vector for the ith class: \mu_i = \langle f_i \rangle. W stands for "within." Notice that each of








these covariance matrices describes the scatter within a class; hence it corresponds
to one term of the average in the denominator of (4.27). If we make the common
assumption that the vector f_i is normally distributed, then W_i is the covariance matrix
of the corresponding probability density function:

PDF_i(f) = \left( \frac{1}{2\pi} \right)^{n/2} \frac{1}{|W_i|^{1/2}} \exp \left[ -\frac{(f - \mu_i)^t W_i^{-1} (f - \mu_i)}{2} \right]


Then the denominator of the F ratio can be associated with the average of Wi
over all i; this is called the pooled within-class covariance matrix:


W = \langle W_i \rangle    (4.29)


Second, the variation within classes can be ignored and the covariance
between classes can be found, representing each class by its centroid. The feature
centroid for class i is \mu_i; hence the between-class covariance matrix is

B = \left\langle (\mu_i - \mu)(\mu_i - \mu)^t \right\rangle    (4.30)


where \mu is the mean of \mu_i over all classes. B stands for "between." Here we ignore
the detailed distribution within each class and represent all the data for that class by
its mean. Hence B describes the scatter from class to class regardless of the scatter
within a class and in that sense corresponds to the numerator of (4.27).
Then the generalization we seek should involve a ratio in which the numerator
is based on B and the denominator on W, since we are looking for features with








small covariances within classes and large covariances between classes. Fukunaga

(1972) lists four such measures, two of which are


J_1 = \mathrm{trace}(W^{-1} B)    (4.31)

and

J_4 = \frac{\mathrm{trace}\, B}{\mathrm{trace}\, W}    (4.32)


The motivation for these measures is clearer for J4 since we know that the trace of a

covariance matrix provides a measure of the total variance of its associated

variables (Parsons, 1986). If the value of J4 for a feature is relatively greater than
that for the other feature, then there is apparently more scatter between classes than

within classes for this feature, and this feature set is a better one than the other for

discrimination. J4 tests this ratio directly. The motivation for J1 is less obvious and

will have to await the presentation of the material below.


4.6.2 Divergence and Probability of Error

The distance between two classes in feature space may also be evaluated by
the divergence, which is defined as the difference in the expected values of their

log-likelihood ratios (Kullback, 1959; Tou and Gonzales, 1974). This measure has

its roots in information theory (Kullback, 1959) and is a measure of the average

amount of information available for discriminating between class i and class k. It is
shown that for features with multivariate normal densities, the divergence is given

by


D_{ik} = 0.5\, \mathrm{trace}\left[ (W_i - W_k)(W_k^{-1} - W_i^{-1}) \right]

+ 0.5\, \mathrm{trace}\left[ (W_i^{-1} + W_k^{-1})(\mu_i - \mu_k)(\mu_i - \mu_k)^t \right]    (4.33)







This can be related to more familiar material as follows. If the covariance
matrices of the two classes are equal, so that W_i and W_k can be replaced by an average
covariance matrix W, then the first term vanishes and the divergence reduces to

D_{ik} = \mathrm{trace}\left[ W^{-1} (\mu_i - \mu_k)(\mu_i - \mu_k)^t \right]
= (\mu_i - \mu_k)^t W^{-1} (\mu_i - \mu_k)

The term (\mu_i - \mu_k)(\mu_i - \mu_k)^t is the between-class covariance matrix B; hence in
this case D_{ik} is the separability measure J_1 = \mathrm{trace}(W^{-1} B).
Notice that D_{ik} or J_1 is the Mahalanobis distance. This distance is related to
the approximation of the expected probability of error (PE) by Lachenbruch (1968),
Achariyapaopan and Childers (1983), and Childers (1986). If p is the dimension of
the feature vector, n_1 and n_2 are the sample sizes for classes 1 and 2, and \Phi(z) is the
standard normal distribution function defined as

\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp(-0.5 u^2)\, du    (4.34)


PE can be written as



PE = 0.5\, \Phi[-(\alpha - \beta)] + 0.5\, \Phi[-(\alpha + \beta)]    (4.35)

where

\alpha = \frac{C\, J_1}{\left[ J_1 + p(n_1 + n_2)/(n_1 n_2) \right]^{1/2}}    (4.36)

\beta = \frac{C\, p\, (n_2 - n_1)}{\left[ n_1 n_2 \left( J_1 n_1 n_2 + p(n_1 + n_2) \right) \right]^{1/2}}    (4.37)

C = 0.5 \left[ \frac{(n_1 + n_2 - p - 2)(n_1 + n_2 - p - 5)}{(n_1 + n_2 - 3)(n_1 + n_2 - p - 3)} \right]^{1/2}    (4.38)


For fixed training sample sizes n_1 and n_2 and vector dimension p, PE decreases
as the Mahalanobis distance J_1 increases.
In the coarse analysis stage, the estimated J_1, J_4, and expected probabilities of
error of the acoustic features ARC, LPC, FFF, RC, and CC, which were derived
from the male and female groups (i.e., classes) in three phoneme categories, were
studied. Training sample sizes were 27 (n_1) for the male group and 25 (n_2) for the
female group, since median layer templates (one for each subject) were used to
constitute the training sample pools. The feature vector dimension p was equivalent
to the filter order selected. For each of the acoustic features ARC, LPC, FFF, RC,
and CC in each of the three phoneme categories, the estimated J_1, J_4, and expected
probability of error were computed as follows:
(1) Estimate the within-gender covariance matrix W_i for each gender
using Equation (4.28).
(2) Compute the pooled (averaged) within-gender covariance matrix
W using Equation (4.29).
(3) Estimate the between-gender covariance matrix B using Equation
(4.30).
(4) Obtain the values of J_4 and the Mahalanobis distance J_1 from
matrices W and B using Equations (4.32) and (4.31).
(5) Finally, calculate the value of PE from J_1, n_1, n_2, and p using
Equations (4.34) to (4.38), as in the sketch below.
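For illustration, a minimal sketch of steps (1) through (5) follows; here J_1 is computed directly as the Mahalanobis distance between the gender means, and the \alpha, \beta, and C terms follow the reconstructed forms of Equations (4.36)-(4.38) given above, which are an interpretation of the damaged source text.

    import numpy as np
    from math import erf, sqrt

    def phi(z):
        """Standard normal distribution function, Equation (4.34)."""
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def separability(male, female):
        """male, female: (n_i x p) arrays of median-layer templates.
        Returns (J1, J4, PE)."""
        male, female = np.asarray(male, float), np.asarray(female, float)
        n1, n2 = len(male), len(female)
        p = male.shape[1]
        W1 = np.cov(male, rowvar=False)                   # step (1), Equation (4.28)
        W2 = np.cov(female, rowvar=False)
        W = 0.5 * (W1 + W2)                               # step (2), Equation (4.29)
        m1, m2 = male.mean(axis=0), female.mean(axis=0)
        mu = 0.5 * (m1 + m2)
        B = 0.5 * (np.outer(m1 - mu, m1 - mu)             # step (3), Equation (4.30)
                   + np.outer(m2 - mu, m2 - mu))
        J4 = float(np.trace(B) / np.trace(W))             # step (4), Equation (4.32)
        d = m1 - m2
        J1 = float(d @ np.linalg.solve(W, d))             # Mahalanobis distance J_1
        C = 0.5 * sqrt((n1 + n2 - p - 2) * (n1 + n2 - p - 5)
                       / ((n1 + n2 - 3) * (n1 + n2 - p - 3)))         # (4.38)
        root = sqrt(J1 + p * (n1 + n2) / (n1 * n2))
        alpha = C * J1 / root                                          # (4.36)
        beta = C * p * (n2 - n1) / (n1 * n2 * root)                    # (4.37)
        PE = 0.5 * phi(-(alpha - beta)) + 0.5 * phi(-(alpha + beta))   # (4.35)
        return J1, J4, PE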

The analytical results were also compared to the empirical ones obtained from
experiments using recognition schemes. Section 5.3 in the next Chapter presents a

detailed discussion.














CHAPTER 5
RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS




Our results showed that most of the LPC-derived feature parameters

performed well for gender recognition. Among them, the reflection coefficient
combined with the Euclidean distance measure was the best choice for sustained

vowels (100%). While the cepstral distortion measure worked extremely well for
unvoiced fricatives, the LPC log likelihood distortion measure, the reflection

coefficient combined with the Euclidean distance, and the cepstral distortion

measure were good alternatives for voiced fricatives. Using the Euclidean

distance measure achieved better results than using the Probability Density
Function. Furthermore, the averaging techniques were very important in designing

appropriate test and reference templates and a filter order of 12 to 16 was sufficient
for most designs.


5.1 Coarse Analysis Conditions

Before we discuss in detail the performance assessments based on coarse
analysis, we briefly summarize the experimental conditions as follows:

Database

52 normal subjects: 27 male and 25 female

Phoneme group used: ten sustained vowels

five unvoiced fricatives

four voiced fricatives








Analysis Conditions

Method: asynchronous autocorrelation LPC

Filter order: 8, 12, 16, 20

Analysis frame size: 256 points/frame

Frame overlap: None

Preemphasis Factor: 0.95

Analysis Window: Hamming

Data set for coefficient calculations: six frames total. Then

by averaging six sets of coefficients obtained from these six

frames, a template coefficient set was calculated for each

sustained utterance.

Acoustic Parameters

LPC coefficients (LPC)

Cepstrum Coefficients (CC)

Autocorrelation Coefficients (ARC)

Reflection Coefficients (RC)

Fundamental Frequency and Formant Information (FFF)

FFF was obtained by the 12th-order closed-phase WRLS-VFF method

discussed in the next Chapter.

Distance Measures

Euclidean Distance (EUC)

LPC Log Likelihood Distance (LLD)

Cepstral Distortion (same as EUC)

Weighted Euclidean Distance (WEUC)

Probability Density Function (PDF)

Decision Rule

Nearest Neighbor








Recognition Schemes

Scheme 1

Scheme 2

Scheme 3

Scheme 4

Counting Error Procedures

Inclusive (Resubstitution)

Exclusive (Leave-One-Out)

Parameters Based on Fisher's Discriminant Ratio Criterion

J4, J1, and the expected probability of error



5.2 Performance Assessments


Since the WEUC is the simplified case of the PDF and the results produced by

the WEUC were very similar to those produced by the PDF in our experiments, only

the results obtained by the PDF will be discussed.

The complete results of the experiments are tabulated in Appendices A and B.
Appendix A presents the recognition rates for LPC log likelihood and cepstral

distortion measures with various phoneme categories, recognition schemes, and

filter orders. Inclusive procedures were only performed for the acoustic parameters

with a filter order of 16.

Appendix B presents the recognition rates for various acoustic parameters

(ARC, LPC, RC, CC) combined with EUC or PDF distance measures with different
phoneme categories and filter orders. Notice that only recognition Scheme 3 was
used for these experiments. Again, inclusive procedures were only performed for

the acoustic parameters with a filter order of 16. Since calculation of the cepstral








distortion measure was done by using the EUC, the results of the CC combined with

the EUC were directly extracted from Appendix A.


5.2.1 Comparative Study of Recognition Schemes

Tables 5.1 and 5.2 show the condensed results selected from Appendix A,

using the LPC log likelihood and cepstral distortion measures respectively.

Recognition rates for the four exclusive recognition schemes with various filter

orders are included. Figures 5.1 and 5.2 are graphic illustrations of Tables 5.1 and

5.2.

By observing curves of Figures 5.1 and 5.2, it can be immediately seen that

higher recognition rates were achieved using recognition Schemes 3 and 4 for all the

cases, including various filter orders combined with different phoneme categories.

Among them, by applying Schemes 3 and 4 to voiced fricatives, over 90%

recognition rates were accomplished for all filter orders and both distortion

measures. The highest correct recognition rate, 98.1%, was obtained for Scheme 4

by using the LPC log likelihood measure with a filter order of 8. The same rates

were obtained for Scheme 3 by using the LPC log likelihood measure with a filter

order of 20 and using the cepstral distortion measure with filter orders of 12 and 16.

The results indicated the following:

1. Choosing appropriate template forming and recognition schemes

was important in achieving high correct recognition rates.

Particularly, the use of averaging techniques was critical since the

highest recognition rates were obtained by using Schemes 3 and 4, in

both of which the test and reference templates were formed by

averaging all the utterances from the same subject or even the same

gender (Scheme 3). In contrast, Schemes 1 and 2, in which the test
template was formed from a single utterance, performed worse.



















Table 5.1
Results from exclusive recognition schemes
with various filter orders and
the LPC log likelihood distortion measure

                                  CORRECT RATE %
                        Order=8  Order=12  Order=16  Order=20

 Sustained   Scheme 1     63.1     69.6      74.2      74.2
 Vowels      Scheme 2     65.2     71.5      76.2      76.5
             Scheme 3     75.0     86.5      86.5      84.6
             Scheme 4     75.0     80.8      86.5      88.5

 Unvoiced    Scheme 1     59.2     64.2      67.7      65.0
 Fricatives  Scheme 2     61.5     63.9      64.2      64.2
             Scheme 3     67.3     75.0      75.0      78.9
             Scheme 4     76.9     75.0      73.1      69.3

 Voiced      Scheme 1     74.5     72.1      73.1      72.6
 Fricatives  Scheme 2     77.4     80.3      81.7      80.3
             Scheme 3     90.4     94.2      96.2      98.1
             Scheme 4     98.1     96.2      96.2      94.3



















Table 5.2
Results from exclusive recognition schemes
with various filter orders and
the cepstral distortion measure

                                  CORRECT RATE %
                        Order=8  Order=12  Order=16  Order=20

 Sustained   Scheme 1     61.3     68.3      69.4      70.6
 Vowels      Scheme 2     69.4     67.3      70.0      72.1
             Scheme 3     82.7     92.3      90.4      90.4
             Scheme 4     90.4     92.3      92.3      88.5

 Unvoiced    Scheme 1     61.2     65.8      63.9      64.6
 Fricatives  Scheme 2     58.8     61.5      62.7      64.2
             Scheme 3     71.2     75.0      84.6      82.7
             Scheme 4     78.8     88.5      90.4      88.5

 Voiced      Scheme 1     79.3     82.7      81.3      80.8
 Fricatives  Scheme 2     75.5     82.2      84.6      85.1
             Scheme 3     94.2     98.1      98.1      96.2
             Scheme 4     92.3     92.3      92.3      90.4



































Figure 5.1 Results from exclusive recognition schemes with various
filter orders and the LPC log likelihood distortion measure.

Figure 5.2 Results from exclusive recognition schemes with various
filter orders and the cepstral distortion measure.








2. Averaging techniques seemed more crucial than clustering

techniques. Recognition theory states that choosing several

clustering centers for the same reference group should increase the

correct classification rate, because intragroup variations are taken

into account. In Schemes 1 and 4, multi-clustering reference centers

were formed. However, the theory functioned well in Scheme 4 but

inadequately in Scheme 1, although in Scheme 1, a number of

clustering centers were selected for the reference of the same

gender. Furthermore, Scheme 3 was a further simplified version of

Scheme 2 and Scheme 4. Instead of using each test vowel of the

subject as in Scheme 2, only a single test template was employed for

each test subject and a single reference template for each reference

gender. But the results were almost as good as those achieved by

using Scheme 4. In Scheme 2, averaging was performed over the

reference template but not over the test template. The correct

recognition rates were low. Thus, the results suggested the

importance of averaging techniques. To a great extent, averaging on

both test and reference templates eliminated the intrasubject

variation or diversity within different vowels or fricatives of a given

speaker, but on the other hand, emphasized features representing

this speaker's gender.

3. Since the averaging was applied to the acoustic parameters extracted

from different phonemes uttered by a speaker or speakers of the

same gender, and the phonemes were produced at different times,

the averaging is essentially a time-averaging technique.
Therefore, we would reasonably deduce that gender information

is time-invariant, phoneme independent, and speaker independent.








Because of this, averaging emphasized the speaker's gender
information and increased the between-to-within gender variation

ratio. In practice we would achieve free-text gender recognition in

which gender identification would be determined before recognition

of speech or speaker and thus, reduce the speech or speaker

recognition search space to half.
The conclusion is consistent with the findings by Lass and Mertz

(1978) that temporal cues appeared not to play a role in speaker

gender identification. As we cited earlier, in their listening tests they

found that gender identification accuracy remained high and

unaffected by temporal speech alterations when the normal temporal

features of speech were altered by means of the backward playing

and time compressing of speech samples.

We have shown in the previous section that use of the long-term

average technique to form a feature vector was discovered to have

potential for free-text speaker recognition in the Pruzansky study

(1963). Speaker recognition error rates remained undegraded even

after averaging spectral amplitudes over all frames of speech data

into a single reference spectral amplitude vector for each talker.

Markel et al. (1977) also demonstrated that the between-to-within

speaker variation ratio was significantly increased by long-term

averaging of the parameter sets (thus text-free). Here we found that this

rule also applied to gender recognition.
4. In terms of Scheme 3 versus Scheme 4, neither was obviously

superior. However, from a practical point of view, Scheme 3 would

be easier to realize since only two reference templates are needed.








5. In further experiments, different weighting factors could be applied

to different phoneme feature vectors according to the probabilities of

their appearances in real situations. In this way, time-averaging

would be better approximated.


5.2.2 Comparative Study of Acoustic Features

5.2.2.1 LPC Parameter Versus Cepstrum Parameter

Although both LPC log likelihood and cepstral distortion measures were

effective tools in classifying male/female voices, the performance of the latter was

better than that of the former.

1. By comparing Figures 5.1 and 5.2, it is noted that except in the

category of voiced fricatives, in which the performances were

competitive (both measures were able to achieve recognition rates of

98.1%), cepstrum coefficient features proved to be more sensitive

than LPC coefficients for gender discrimination. By choosing

appropriate schemes and filter orders the recognition rates for the

cepstral distortion measure reached 92.3% for vowels (Scheme 3

with a filter order of 12 and Scheme 4 with filter orders of 12 and

16), and 90.4% for unvoiced fricatives (Scheme 4 with a filter order of

16). By using the LPC log likelihood distortion measure, the

corresponding highest recognition rates were 88.5% for vowels

(Scheme 4 with a filter order of 20) and 78.9% for unvoiced

fricatives (Scheme 3 with a filter order of 20).

2. Comparing the tables in Appendix A shows that the cepstral distortion
measure operated more evenly between the male and female groups,
indicating that this feature has some "normalization" characteristics.
As seen in Table A.1, there were large differences between the male and
female recognition rates for the LPC recognizer with a filter order of
16. The largest gaps came from Scheme 1 of the LPC: about 19% for
vowels, 24% for unvoiced fricatives, and 15% for voiced fricatives. The
cepstral distortion measure, on the other hand, worked evenly with the
same filter order; Table A.7 shows that the largest gaps were 6.6% for
vowels, 1.8% for unvoiced fricatives, and 2.4% for voiced fricatives.
Similar situations held for the results shown in the other tables and
with the inclusive schemes.
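
As background for this comparison, the cepstrum coefficients used in the
cepstral distortion measure can be obtained from the LPC coefficients by
the standard recursion for an all-pole model. The sketch below assumes
the convention H(z) = G / (1 - sum_k a_k z^-k); the sign convention and
truncation length actually used in this study may differ.

import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """a[0..p-1] holds the LPC coefficients a_1..a_p; returns the
    cepstrum coefficients c_1..c_{n_ceps} of the model log spectrum."""
    p = len(a)
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:
                acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c

The Euclidean distance between two such truncated cepstral vectors is
the cepstral distortion; it approximates the root-mean-square difference
between the two model log spectra.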

5.2.2.2 Other Acoustic Parameters

Tables 5.3 and 5.4 present the results from exclusive recognition Scheme
3 with various filter orders and other acoustic parameters, using the
EUC and PDF distance measures respectively. Figures 5.3 and 5.4 are
graphic illustrations of Tables 5.3 and 5.4.

1. The overall performance using RC and cepstrum coefficients was better
than that achieved using ARC and LPC coefficients when the Euclidean
distance measure was adopted. The following observations were made (a
sketch relating the ARC, LPC, and RC parameter sets appears after item 2
below):

o The RC functioned extremely well with sustained vowels. The
recognition rates remained at 100% for filter orders of 12, 16, and 20,
showing that RC features captured gender information from vowels
effectively. The results were also stable and filter-order independent
as long as the filter order was above 8. Table 5.3 shows that a 98.1%
recognition rate was reached by using the FFF, which was obtained with
the closed-phase WRLS-VFF method and a filter order of 12.

Table 5.3
Results from the exclusive recognition Scheme 3 with various filter
orders, acoustic parameters, and the Euclidean distance measure

                           Correct rate (%)
                       Order=8   Order=12   Order=16   Order=20
Sustained    ARC         78.8      78.8       78.8       82.7
Vowels       LPC         73.1      78.8       80.8       80.8
             FFF         N/A       98.1       N/A        N/A
             RC          88.5     100.0      100.0      100.0
             CC          82.7      92.3       90.4       90.4
Unvoiced     ARC         75.0      75.0       75.0       75.0
Fricatives   LPC         80.8      69.2       71.2       71.2
             RC          80.8      80.8       80.8       80.8
             CC          71.2      75.0       84.6       82.7
Voiced       ARC         86.5      88.5       86.5       88.5
Fricatives   LPC         92.3      92.3       92.3       90.4
             RC          94.2      96.2       96.2       96.2
             CC          94.2      98.1       98.1       96.2


Table 5.4
Results from the exclusive recognition Scheme 3 with various filter
orders, acoustic parameters, and the PDF (Probability Density Function)
distance measure

                           Correct rate (%)
                       Order=8   Order=12   Order=16   Order=20
Sustained    ARC         80.8      84.6       88.5       67.3
Vowels       LPC         84.6      98.1       92.3       80.8
             FFF         N/A       96.2       N/A        N/A
             RC          88.5      98.1       92.3       67.3
             CC          78.8      94.2       90.3       75.0
Unvoiced     ARC         69.2      65.4       57.7       N/A
Fricatives   LPC         78.8      86.5       78.8       53.8
             RC          78.8      73.1       67.3       55.8
             CC          80.8      73.1       69.2       57.7
Voiced       ARC         88.5      86.5       82.7       59.6
Fricatives   LPC         92.3      94.2       94.2       71.2
             RC          92.3      90.4       90.4       75.0
             CC          92.3      92.3       80.8       71.2

[Figure 5.3: line plots of correct rate (%) versus filter order (8, 12,
16, 20) for the ARC, LPC, RC, and CC parameters in three panels
(sustained vowels, unvoiced fricatives, voiced fricatives); the plotted
values are those of Table 5.3.]

Figure 5.3 Results from the exclusive recognition Scheme 3 with various
filter orders, acoustic parameters, and the Euclidean distance measure.


[Figure 5.4: line plots of correct rate (%) versus filter order (8, 12,
16, 20) for the ARC, LPC, RC, and CC parameters in three panels
(sustained vowels, unvoiced fricatives, voiced fricatives); the plotted
values are those of Table 5.4.]

Figure 5.4 Results from the exclusive recognition Scheme 3 with various
filter orders, acoustic parameters, and the PDF distance measure.

o The CC worked very well with unvoiced fricatives. The highest
recognition rate was 84.6%, with a filter order of 16.

o Both the RC and CC operated extremely well with voiced fricatives, and
the results remained stable across all the filter orders that were
tested. The results generated by the CC were slightly better than those
of the RC with filter orders of 12 and 16 (98.1% versus 96.2%).
2. When the PDF distance measure was adopted, LPC coefficients were a
good choice. The highest recognition rates for the three phoneme
categories, 98.1%, 86.5%, and 94.2%, all came from LPC coefficients with
a filter order of 12 (for voiced fricatives, also with a filter order of
16). Note, however, that the results from the PDF distance measure were
highly affected by the filter order.
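
As referenced above, the sketch below shows how the coarse-analysis
parameter sets are related: the Levinson-Durbin recursion takes the
autocorrelation (ARC) lags of a frame and produces the reflection
coefficients (RC) and the LPC coefficients together, after which the
cepstrum coefficients (CC) follow from the recursion sketched earlier
(with the coefficient signs flipped to match that sketch's convention).
The inverse-filter convention A(z) = 1 + sum_j a_j z^-j is an
assumption; the routine is illustrative, not necessarily the one used in
this study.

import numpy as np

def levinson_durbin(r, order):
    """r[0..order]: autocorrelation lags of one analysis frame.
    Returns (lpc, rc): LPC coefficients a_1..a_p and reflection
    coefficients k_1..k_p."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    rc = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        rc[i - 1] = k
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k  # prediction-error energy shrinks at each step
    return a[1:], rc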


5.2.3 Comparative Study Using Different Phonemes

The results of Figures 5.1, 5.2, 5.3, and 5.4 also indicated that
vowels, unvoiced fricatives, or voiced fricatives could each be used to
objectively classify a speaker's gender. As seen above, reflection
coefficients worked extremely well for the vowel category: the
recognition rates reached 100% with filter orders of 12, 16, and 20, and
a 98.1% recognition rate could also be accomplished by using LPC
coefficients and the FFF. Surprisingly, the cepstral distortion measure
could be used to discriminate a speaker's gender even from unvoiced
fricatives, where a 90.4% recognition rate was obtained. For the voiced
fricative category, a 98.1% recognition rate was achieved by using the
LPC log likelihood and cepstral distortion measures. Therefore, in terms
of the "most effective" phoneme category for gender recognition, the
first preference would be vowels, the second voiced fricatives, and the
last unvoiced fricatives.








As discussed in Chapter 4, the acoustic parameters used in the coarse
analysis were derived from the LPC all-pole model, which has a spectral
matching characteristic. The LPC log likelihood and cepstral distortion
measures are known to be directly related to the power spectral
differences between the test and reference signals (a sketch of the log
likelihood measure follows this paragraph). Thus, the results indicated
that spectral characteristics were major factors in distinguishing a
speaker's gender. Our results also suggested that significant
differences did exist between the spectral characteristics of unvoiced
fricatives for the two genders, indicating that a speaker's gender could
be distinguished on the basis of vocal tract characteristics alone,
since no vocal fold information exists in unvoiced fricatives. Moreover,
when some vocal fold information was combined with the vocal tract
characteristics, as in the vowel and voiced fricative cases, gender
discrimination improved, even though in both cases the fundamental
frequency information was not included.
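
The LPC log likelihood distortion referred to throughout this chapter is
commonly written as the Itakura log likelihood ratio; the following
sketch assumes that form (whether it matches this study's exact variant
is an assumption). The measure compares the residual energy of the
reference inverse filter applied to the test frame against the minimum
residual energy of the test frame's own filter, so it grows with the
power spectral mismatch between the two models.

import numpy as np

def lpc_log_likelihood(a_test, a_ref, r_test):
    """a_test, a_ref: inverse-filter vectors (1, a_1, ..., a_p);
    r_test: autocorrelation lags 0..p of the test frame."""
    a_test, a_ref = np.asarray(a_test), np.asarray(a_ref)
    p = len(r_test) - 1
    # Toeplitz autocorrelation matrix of the test frame.
    R = np.array([[r_test[abs(i - j)] for j in range(p + 1)]
                  for i in range(p + 1)])
    num = a_ref @ R @ a_ref    # reference filter's residual energy on test data
    den = a_test @ R @ a_test  # the test frame's own (minimum) residual energy
    return np.log(num / den)   # >= 0; zero only when the filters match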


5.2.4 Comparative Study of Filter Order Variation

5.2.4.1 LPC Log Likelihood and Cepstral Distortion Measure Cases

1. The resultant curves of Schemes 1 and 2 in Figures 5.1 and 5.2 show
an easily noted general trend: the recognition rates generally improved
with higher filter orders.

2. However, this trend was not observed for Schemes 3 and 4, and
specific cases had to be inspected individually. The recognition rates
for Schemes 3 and 4 with the LPC log likelihood and cepstral distortion
measures were considered first; note that there are a total of 12 rates
for each filter order.
o Comparing the recognition rates between filter orders of 8 and 12: 9
of the 12 rates increased, 2 decreased (Scheme 4 for voiced and unvoiced
fricatives with the LPC log likelihood distance), and 1 was tied (Scheme
4 for unvoiced fricatives with the cepstral distortion measure). This
demonstrated that performance improved from a filter order of 8 to 12.

o Comparing the recognition rates between filter orders of 12 and 16: of
the 12 rates, 4 dropped and 2 increased (unvoiced fricatives with Scheme
4 and the LPC distance measure, and sustained vowels with Scheme 3 and
the cepstral distortion measure); the remaining 6 were equal. This
indicated that with Scheme 3 or 4 there was no distinct difference
between filter orders of 12 and 16.

o Comparing the recognition rates between filter orders of 16 and 20: of
the 12 rates, 8 dropped, 3 increased, and 1 was tied. Performance
degraded from a filter order of 16 to 20.

o When only the cepstral distortion measure was applied (Figure 5.2),
the highest recognition rates appeared at filter orders of 12 and 16 for
all three phoneme categories. When only the LPC log likelihood
distortion was used (Figure 5.1), however, the highest recognition rates
were reached with filter orders of 8 and 20, so there was no manifest
trend of performance difference across filter orders for the LPC log
likelihood distortion measure. Since the cepstral distortion measure
showed better results than the LPC log likelihood, filter orders of 12
to 16 seemed to be the best options for the overall design.

5.2.4.2 Euclidean Distance Versus Probability Density Function

1. Figure 5.3 shows that with the Euclidean distance measure the
recognition rates increased slightly from a filter order of 8 to 12,
with the exception of the LPC for unvoiced fricatives. Recognition rates
with a filter order of 12 were almost the same as those with 16 and 20,
and no specific trend was observed. Except for the ARC applied to the
vowel category, all performances reached their peaks at a filter order
of either 12 or 16. It can be concluded that with the EUC distance
measure the best choice of filter order is in the range of 12 to 16.

2. Figure 5.4, however, shows a different picture: with the PDF, gender
recognition rates varied considerably with the filter order. The overall
trend for the vowel category was that the recognition rates increased
from a filter order of 8, peaked at a filter order of 12, and then
decreased; the one exception was the ARC, whose recognition rate peaked
at a filter order of 16 (Table 5.4). For voiced and unvoiced fricatives,
all acoustic parameters except the LPC showed decreasing recognition
rates from a filter order of 8 to an order of 20; with LPC coefficients,
performance improved somewhat from a filter order of 8 to orders of 12
and 16 and then degraded. Finally, recognition accuracies declined
severely from a filter order of 16 to an order of 20 for all three
phoneme categories and all acoustic parameters. It can be concluded that
with the PDF the best option for the filter order would be 8 or 12.

5.2.5 Comparative Study of Distance Measures

It is generally believed that the EUC distance measure is not as
effective as the PDF because the definition of the EUC involves no
normalization of the dimensions: the dimension with the largest values
becomes the most significant. The PDF approach, in contrast, provides
such normalization through the covariance matrix computation. It weights
the elements of a vector unequally, suppressing elements with large
values and emphasizing elements with small values according to their
importance in reducing the intragroup variation. (A sketch of both
measures follows.)
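
Here is a minimal sketch of the two measures, written under the
assumption that the PDF measure is a Gaussian log-likelihood-based
distance with a per-gender mean vector and covariance matrix; the exact
form used in this study may differ.

import numpy as np

def euc_distance(x, mu):
    """Euclidean distance: every dimension is weighted equally, so the
    dimension with the largest values dominates the measure."""
    return np.linalg.norm(np.asarray(x) - np.asarray(mu))

def pdf_distance(x, mu, cov):
    """Gaussian negative log-likelihood (up to an additive constant):
    the inverse covariance normalizes each dimension, suppressing
    large-variance elements and emphasizing reliable small ones."""
    d = np.asarray(x) - np.asarray(mu)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d @ np.linalg.solve(cov, d) + logdet)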

However, the PDF approach did not work well in our experiments. Tables
5.3 and 5.4, as well as Figures 5.3 and 5.4, show that the EUC
outperformed the PDF.

First, of the 48 corresponding pairs of EUC and PDF recognition rates
from Tables 5.3 and 5.4, 32 EUC recognition rates were higher than those
of the PDF, 3 were tied, and only 13 PDF recognition rates were higher
than the EUC rates.

Second, as demonstrated in the sections above, performance with the PDF
varied considerably with the filter order, and there were severe
performance declines from an order of 16 to an order of 20 for all three
phoneme categories and all acoustic parameters. The results of the EUC,
on the other hand, were relatively consistent across all the filter
orders that were tested, especially filter orders of 12, 16, and 20.

Third, the two highest rates from Tables 5.3 and 5.4 for each of the
three phoneme groups were achieved using the EUC distance measure. The
RC with the EUC yielded a 100% recognition rate for sustained vowels
with filter orders of 12, 16, and 20. The CC with the EUC yielded a
98.1% recognition rate for voiced fricatives with filter orders of 12
and 16. Even for unvoiced fricatives, the highest recognition rates came
from the CC with the EUC using Scheme 4 (Table 5.2): 88.5% with a filter
order of 12 and 90.4% with a filter order of 16.

Fourth, the EUC distance measure functioned more evenly on the male and
female groups than did the PDF. Examining all the tables of exclusive
schemes in Appendix B, male recognition rates were higher than female
rates in 43 of the 48 PDF pairs, and only 5 female rates were higher
than male rates; the largest difference between the gender groups was
68.6% for the PDF (from the ARC for unvoiced fricatives with a filter
order of 20). For the EUC, on the other hand, male rates were higher
than female rates in only 29 of the 48 pairs, and the largest gap
between the gender groups was 21% (from the LPC for vowels with a filter
order of 8).

A possible reason for this inferior PDF performance is the small ratio
of the available number of subjects per gender to the number of elements
(measurements) per feature vector. The assumption made when using the
PDF distance measure to design a classifier is that the data are
normally (Gaussian) distributed, and several factors then matter, such
as the size of the training or design set and the number of measurements
(observations, samples) in the data record (or vector). Foley (1972) and
Childers (1986) pointed out that if the ratio of the available number of
samples per class (in this study, the number of subjects per gender) to
the number of samples per data record (in this study, the number of
elements per feature vector) is small, then data classification for both
the design and test sets may be unreliable. This ratio should be on the
order of three or larger (Foley, 1972). In our study the ratios were
3.25 (26/8), 2.17 (26/12), 1.63 (26/16), and 1.30 (26/20) for filter
orders of 8, 12, 16, and 20 respectively. The value of 3.25 satisfied
the requirement, but the others were too small. Therefore, with the
exception of the results with a filter order of 8, where the
performances of the PDF and EUC were comparable, the PDF approach did
not function well; the smaller the ratio, the worse the PDF performed.
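
This dimensionality check reduces to simple arithmetic; the short
illustrative computation below reproduces the ratios quoted above.

# Foley's (1972) rule of thumb: samples per class divided by elements
# per feature vector should be on the order of three or larger.
subjects_per_gender = 26
for order in (8, 12, 16, 20):
    ratio = subjects_per_gender / order
    verdict = "adequate" if ratio >= 3 else "too small"
    print(f"filter order {order:2d}: ratio = {ratio:.2f} ({verdict})")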


5.2.6 Comparative Study Using Different Procedures
Performance differences between the resubstitution (inclusive) and
leave-one-out (exclusive) procedures were also tested with a filter
order of 16 (a sketch of the two procedures follows this discussion).
Tables A.9 and A.10 in Appendix A present the inclusive recognition
results for the LPC log likelihood and cepstral distortion measures with
various recognition schemes. Tables B.9 and B.10 in Appendix B show the
recognition results from inclusive recognition Scheme 3 with various
acoustic parameters, using the EUC and PDF distance measures
respectively.

The results presented in Appendix A indicate that the correct
recognition rates of the exclusive recognition procedure (Tables A.3 and
A.7) were not greatly degraded compared with those obtained from the
inclusive recognition procedure, especially for the cepstral distortion
measure. For the cepstral distortion measure with Scheme 3, the rates
decreased from 94.2% to 90.4% for vowels and from 86.5% to 84.6% for
unvoiced fricatives, and remained constant for voiced fricatives; the
maximum decrease was less than 4%. For the LPC log likelihood with
Scheme 3, the rates degraded from 92.3% to 86.5% for vowels, from 82.7%
to 75% for unvoiced fricatives, and from 100% to 96.2% for voiced
fricatives; here the maximum decrease, 7.7%, was observed for unvoiced
fricatives. In contrast, the results from the partial database of 21
subjects, which we analyzed before completing the data collection for
the entire database, showed a much larger decrease from the inclusive to
the exclusive procedure: the recognition rate dropped by more than 14%
for unvoiced fricatives and by more than 9% for voiced fricatives. This
convinced us that the larger the database, the smaller the performance
differences between the inclusive and exclusive procedures.
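
To make the two procedures concrete, here is a minimal sketch of
resubstitution versus leave-one-out evaluation for a nearest-template
gender recognizer; the helper names (make_templates, distance) are
illustrative assumptions.

def classify(x, templates, distance):
    """Return the gender label of the nearest reference template;
    templates maps each label to its template vector."""
    return min(templates, key=lambda g: distance(x, templates[g]))

def resubstitution_rate(features, labels, make_templates, distance):
    # Inclusive: every test subject also contributed to the references.
    templates = make_templates(features, labels)
    hits = sum(classify(x, templates, distance) == y
               for x, y in zip(features, labels))
    return hits / len(labels)

def leave_one_out_rate(features, labels, make_templates, distance):
    # Exclusive: the references are rebuilt with the test subject held out.
    hits = 0
    for i, (x, y) in enumerate(zip(features, labels)):
        rest_f = [f for j, f in enumerate(features) if j != i]
        rest_l = [l for j, l in enumerate(labels) if j != i]
        hits += classify(x, make_templates(rest_f, rest_l), distance) == y
    return hits / len(labels)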

One interesting observation from Tables B.9 and B.10 in Appendix B was
that when the PDF distance measure was used with the inclusive
procedure, the correct recognition rates were extremely high for all
types of phonemes and feature vectors, except the ARC with unvoiced
fricatives (still 98.1%). In addition, the LPC, RC, and CC were all able
to provide 100% correct gender recognition from unvoiced fricatives!
However, when the PDF was used with the exclusive procedure (Table B.7),
the correct recognition rates decreased significantly, with drops
ranging from a minimum of 5.8% to a maximum of 40.4% (for unvoiced
fricatives, from a minimum of 21.2% to a maximum of 40.4%). The EUC
distance measure, on the other hand, operated more evenly: from the
inclusive to the exclusive procedure (Table B.3), recognition rates
dropped very little, ranging from a minimum of 0% to a maximum of 3.9%,
and the rates for four feature parameters did not decrease at all.
Figures 5.5(a) and (b) are graphic illustrations of Tables B.3 and B.9;
they show only a minor performance difference between the inclusive and
exclusive procedures when the EUC distance measure was used. Our results
also suggested that the PDF excelled at capturing information about an
individual subject: as long as the subject's own data were included in
the reference data set, the PDF easily picked up such specific
information and identified the subject's gender accurately. This is why
the correct recognition rates of the inclusive procedure for the PDF
were extremely high. The much lower PDF rates for the exclusive
procedure, however, indicated that the PDF was clearly inferior at
capturing gender information from the other, "average" subjects of the
same gender. The EUC distance measure, on the other hand, was good at
capturing gender information from the other subjects without including
the characteristics of the test subject itself.

[Figure 5.5: horizontal bar charts of correct rate (%), from 55 to 100,
for vowels, unvoiced fricatives, and voiced fricatives, with bars for
the ARC (EUC), LPC (EUC), LPC (LLD), RC (EUC), and CC (EUC) parameters.]

Figure 5.5 Results of recognition Scheme 3 with the EUC and a filter
order of 16 for (a) the exclusive procedure and (b) the inclusive
procedure.
