TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
To my parents
and to my wife
The invaluable guidance, encouragement, and support I have received from
my adviser and committee chairman, Dr. D. G. Childers, during the years of my
graduate education are most appreciated. I am sincerely grateful for his direction,
insight, and patience throughout this dissertation research.
I would especially like to thank Dr. J. R. Smith, Dr. A. A. Arroyo, Dr. J. C.
Principe, and Dr. H. B. Rothman for their interest and participation in serving on my
supervisory committee and their productive criticism of my research project.
The partial support by the National Institutes of Health, the National Science
Foundation, and the University of Florida Center of Excellence Program is gratefully
acknowledged.
Special thanks are also extended to my fellow graduate students and other
members of the Mind-Machine Interaction Research Center for their friendship,
encouragement, and skillful technical help.
Last but not least, I am greatly indebted to my wife, Hong-gen, and my
parents for their love, support, understanding, and patience. My gratitude to them is
beyond words.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
1 INTRODUCTION
   1.1 Automatic Gender Recognition
   1.2 Application Perspective
   1.3 Literature Review
      1.3.1 Basic Gender Features
      1.3.2 Acoustic Cues Responsible for Gender Perception
      1.3.3 Summary of Previous Research
   1.4 Objectives of this Research
   1.5 Description of Chapters
2 APPROACHES TO GENDER RECOGNITION FROM SPEECH
   2.1 Overview of Research Plan
   2.2 Coarse Analysis
   2.3 Fine Analysis
3 DATA COLLECTION AND PROCESSING
   3.1 Database Description
   3.2 Speech and EGG Digitization
   3.3 Synchronization of Data
4 EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS
   4.1 Asynchronous LPC Analysis
      4.1.1 Linear Prediction Concepts
      4.1.2 Analysis Conditions
   4.2 Acoustic Parameters
      4.2.1 Autocorrelation Coefficients
      4.2.2 LPC Coefficients
      4.2.3 Cepstrum Coefficients
      4.2.4 Reflection Coefficients
      4.2.5 Fundamental Frequency and Formant Information
   4.3 Distance Measures
      4.3.1 Euclidean Distance
      4.3.2 LPC Log Likelihood Distance
      4.3.3 Cepstral Distortion
      4.3.4 Weighted Euclidean Distance
      4.3.5 Probability Density Function
   4.4 Template Formation and Recognition Schemes
      4.4.1 Purpose of Design
      4.4.2 Test and Reference Template Formation
      4.4.3 Nearest Neighbor Decision Rule
      4.4.4 Structure of Four Recognition Schemes
   4.5 Resubstitution and Leave-One-Out Procedures
   4.6 Separability of Acoustic Parameters Using Fisher's Discriminant Ratio Criterion
      4.6.1 Fisher's Discriminant and F Ratio
      4.6.2 Divergence and Probability of Error
5 RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS
   5.1 Coarse Analysis Conditions
   5.2 Performance Assessments
      5.2.1 Comparative Study of Recognition Schemes
      5.2.2 Comparative Study of Acoustic Features
         5.2.2.1 LPC Parameter Versus Cepstrum Parameter
         5.2.2.2 Other Acoustic Parameters
      5.2.3 Comparative Study Using Different Phonemes
      5.2.4 Comparative Study of Filter Order Variation
         5.2.4.1 LPC Log Likelihood and Cepstral Distortion Measure Cases
         5.2.4.2 Euclidean Distance Versus Probability Density Function
      5.2.5 Comparative Study of Distance Measures
      5.2.6 Comparative Study Using Different Procedures
      5.2.7 Variability of Female Voices
   5.3 Comparative Study of Acoustic Parameters Using Fisher's Discriminant Ratio Criterion
   5.4 Conclusions
6 EXPERIMENTAL DESIGN BASED ON FINE ANALYSIS
   6.1 Introduction
   6.2 Limitations of Conventional LPC
      6.2.1 Influence of Voice Periodicity
      6.2.2 Source-Tract Interaction
   6.3 Closed Phase WRLS-VFF Analysis
      6.3.1 Algorithm Description
      6.3.2 EGG Assisted Procedures
   6.4 Testing Methods
      6.4.1 Two-way ANOVA Statistical Testing
      6.4.2 Automatic Recognition by Using Grouped Features
7 EVALUATION OF VOWEL CHARACTERISTICS
   7.1 Vowel Characteristics of Gender
      7.1.1 Fundamental Frequency and Formant Features for Each Gender
      7.1.2 Comparison with Peterson and Barney's Results
      7.1.3 Results of Two-way ANOVA Statistical Test
      7.1.4 Results of T Statistical Test
      7.1.5 Discussion
   7.2 Relative Importance of Grouped Vowel Features
      7.2.1 Recognition Results
      7.2.2 Discussion
   7.3 Conclusions
8 CONCLUDING REMARKS
   8.1 Summary
   8.2 Future Research Extensions
      8.2.1 Short Term Extension
      8.2.2 Long Term Extension
A RECOGNITION RATES FOR LPC AND CEPSTRUM PARAMETERS
B RECOGNITION RATES FOR VARIOUS ACOUSTIC PARAMETERS AND DISTANCE MEASURES
REFERENCES
BIOGRAPHICAL SKETCH
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
TOWARDS AUTOMATIC GENDER RECOGNITION FROM SPEECH
Chairman: D. G. Childers
Major Department: Electrical Engineering
The purpose of this research was to investigate the potential effectiveness of
digital speech processing and pattern recognition techniques in the automatic
recognition of gender from speech. Some hypotheses concerning acoustic
parameters that may influence our ability to distinguish a speaker's gender were
also validated and clarified.
The study followed two directions. One direction, coarse analysis, used
classical pattern recognition techniques and asynchronous linear prediction coding
(LPC) analysis of speech. Acoustic parameters such as autocorrelation, LPC,
cepstrum, and reflection coefficients were derived to form test and reference
templates. The effects of different distance measures, filter orders, recognition
schemes, and phonemes were comparatively assessed. Comparisons of acoustic
parameters using the Fisher's discriminant ratio criterion were also conducted.
The second direction, fine analysis, used pitch synchronous closed-phase
analysis to obtain accurate vowel characteristics for each gender. Detailed formant
features, including frequencies, bandwidths, and amplitudes, were extracted by a
closed-phase Weighted Recursive Least Squares with Variable Forgetting Factor
method. The electroglottograph signal was used to locate the closed-phase portion
of the speech signal. A two-way Analysis of Variance statistical analysis was
performed to test the differences between the features of the two genders, and the relative
importance of grouped vowel features was evaluated by a pattern recognition approach.
The results showed that most of the LPC derived acoustic parameters worked
very well for automatic gender recognition. A within-gender and within-subject
averaging technique was important for generating appropriate test and reference
templates. The Euclidean distance measure appeared to be the most robust as well
as the simplest of the distance measures.
The statistical test indicated steeper spectral slopes for female vowels. Results
suggested that redundant gender information was embedded in the fundamental
frequency and vocal tract resonance. Features of female voices were observed to
have higher within-group variations than those of male voices.
In summary, this study demonstrated the feasibility of an efficient gender
recognition system. The importance of this system is that it would reduce the search
space of speech or speaker recognition by half. The knowledge gained from this
research might benefit the generation of synthetic speech with a desired male or
female voice quality.
INTRODUCTION
1.1 Automatic Gender Recognition
Human listeners are able to capture and categorize the information in acoustic
speech signals. Categories include those that convey a linguistic message, those
that identify the speaker, and those that convey clues about the speaker's
personality, emotional state, gender, age, accent, and the status of his/her health.
Automatic speech and speaker recognition systems are far less capable than
human listeners. Computerized speaker recognition can be accomplished but only
under highly constrained conditions. The major difficulty is that the number of
significant parameters is unmanageably large and little is known about the acoustic
speech features, articulation differences, vocal tract differences, phonemic
substitutions or deletions, prosodic variations, and other factors that influence our
ability to recognize speakers.
Therefore, more insight and systematic study of intrinsically effective speaker
discrimination features are needed. A series of smaller experiments should be done
so that the experimental results will be mutually supportive and will lead to overall
understanding of the combined effects of all the parameters that are likely to be
present in actual situations (Rosenberg, 1976; Committee on Evaluation of Sound
Spectrograms, 1979).
Unlike automatic speech and speaker recognition, automatic gender
recognition was never proposed as a stand-alone problem. Little attention was paid
to either the theoretical basis or the practical techniques for the realization of a
system for the automatic recognition of gender from speech. Although
contemporary research on speech included investigation of physiological and
acoustic gender features and their correlation with perceived gender differences, no
attempt was made to classify the speaker's gender objectively, using features
automatically extracted by a computer. Childers and Hicks (1984) first proposed
such a study as a separate recognition task and, thus, this research resulted from that
proposal. A possible realization of such a system is shown in Figure 1.1.
1.2 Application Perspective
The significance of the proposed research is as follows:
o Accomplishing this task could facilitate speech recognition and
speaker identification or verification by reducing the required search
space to half. Such a pre-process may occur in the listening process
of human beings. One of the speech perception hypotheses proposed
by O'Kane (1987) stated that human listeners have to determine the
gender of the speaker first in order to determine the identity of the
sounds. Another perception hypothesis is that the identity of the
sounds can be roughly determined without knowledge of the
speaker's gender but final recognition is possible only after the
speaker's gender is known. In both cases, identification of the
speaker's gender is a necessary step before recognition of sounds.
o Accomplishing this task could be useful for speech synthesis. It is
well known that in synthesized speech, the female voice has not been
reproduced with the same level of success as the male voice (Monsen
and Engebreston, 1977). Further study of gender cues would
contribute to the solution of this problem since acoustic features for
synthesizing speech for either gender would be provided. Hence, the
voice quality of voice response systems and text-to-speech
synthesizers would be improved.
Figure 1.1 A possible automatic gender recognition system. (FO: fundamental frequency; F1: first formant frequency; BW1: first formant bandwidth.)
o Accomplishing this task could provide new guidelines and suggest
methods to identify the acoustic features related to dialect, age,
health conditions, etc.
o Accomplishing this task could be a unique, or even the only, approach
for some applications (e.g., law enforcement). In a
criminal investigation, an attempt is usually made to identify the
speaker on a recording as a specific person. If an individual is able
to deceive the investigator as to his gender, he may well prevent his
detection. It is well known that speakers can disguise their
speech/voice to confound or prevent detection (Hollien and
McGlone, 1976--cited by Carlson, 1981). The female impersonator
is an example of intentional deception of the listener. In such a
case, identification of the speaker's gender is critical.
o Finally, we presumed that the research results could benefit clinical
applications such as voice correction for a person with a voice disorder
or handicap. Other applications include transsexual voice change
(Bralley et al., 1978; Carlson, 1981).
1.3 Literature Review
1.3.1 Basic Gender Features
The differences between male and female voices depend upon many factors.
Generally, there exist three types of parameters: physiological and acoustical,
which are objective, and perceptual, which are subjective (Figure 1.2).
Figure 1.2 Basic gender features.
Many physiological parameters of the male and female vocal apparatus have
been determined and compared. Fant (1976) showed that the ratio of the total
length of the female vocal tract to that of a male is about 0.87, and Hirano et al.
(quoted by Cheng and Guerin, 1987) showed that the ratio of the length of the
female vocal fold to that of the male is about 0.8. Titze (1987 and 1989) reported
that, anatomically, the female larynx also differs from the male larynx in thickness,
angle of the thyroid laminae, resting angle of the glottis, vertical convergence angle
in the glottis, and in other ways. The female-to-male ratios of the length and
area of the pharyngeal cavity are about 0.8 and 0.82, respectively. Similarly,
we take the female-to-male ratios of the length and area of the oral cavity to
be 0.95 and 1.0, respectively. The relatively larger area ratio for the oral
cavity reflects the fact that the degree of openness of the oral cavity is
comparatively greater for females than for males (Ohman, quoted by Fant, 1966).
Ohman also suggested that a proportionally larger female mouth opening is a
factor to consider. Figure 1.3 illustrates the human vocal apparatus.
The differences in physiological parameters lead to corresponding differences in
acoustical parameters. When comparing male and female formant patterns, the
average female formant frequencies are roughly related to those of the male by a
simple scaling factor that is inversely proportional to the overall vocal tract length.
On the average, the female formant pattern is said to be scaled upward in frequency
by about 20% compared to the average male formant pattern (Figure 1.4). It is also
well known that the individual size of the vocal cavities and thus of the formant
pattern scale factor may vary appreciably depending upon the age and gender of the
speaker. Peterson and Barney (1952) measured the first three formant frequencies
present in ten vowels spoken by men, women, and children. They reported that male
formants were the lowest in frequency, women had a higher range, and children had
the highest. Carlson (1981) gave a survey of the literature on vocal tract
resonance characteristics as a gender cue.
Figure 1.3 A cross section of the human vocal apparatus.
Figure 1.4 An example of male and female formant features.
Figure 1.5 Fundamental frequency changes for two speakers for the utterance "We were away a year ago."
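As a rough illustration of the vocal-tract-length scaling described above, the sketch below shifts a set of hypothetical male formant frequencies upward using the 0.87 length ratio; note that this ratio alone implies roughly a 15% shift, somewhat less than the 20% average cited above. The formant values in the example are assumptions for illustration, not measurements from this study.

```python
# Rough illustration of formant scaling with vocal tract length.
# The male formant values below are hypothetical, not data from this study.
MALE_TO_FEMALE_LENGTH_RATIO = 0.87  # Fant (1976): female/male tract length

def scale_formants(male_formants_hz, length_ratio=MALE_TO_FEMALE_LENGTH_RATIO):
    """Scale formant frequencies inversely with vocal tract length."""
    return [f / length_ratio for f in male_formants_hz]

male_f = [730.0, 1090.0, 2440.0]        # e.g., /A/-like formants F1-F3 (Hz)
female_f = scale_formants(male_f)       # about 15% higher with this ratio
print([round(f, 1) for f in female_f])  # [839.1, 1252.9, 2804.6]
```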
Fant (1966) has pointed out that the male and female vowels are typically
different in three groups:
1) rounded back vowels,
2) very open unrounded vowels, and
3) close front vowels.
The main physiological determinants of the specific deviations are that the ratio of
pharyngeal length to mouth cavity length is greater for males than for females and
the laryngeal cavities are more developed in males.
Schwartz and Rine (1968) also demonstrated that the gender of an individual
can be identified from voiceless fricative phonemes such as /S/ and /F/. This again is
induced by the vocal tract size differences between the genders.
The higher fundamental frequency (pitch) range of the female speaker is quite
well known. There is a general agreement that the fundamental frequency is an
important factor in the identification of gender from voice (Curry, 1940--cited by
Carlson 1981; Hollien and Malcik, 1967; Saxman and Burk, 1967; Hollien and Paul,
1969; Hollien and Jackson, 1973; Monsen and Engebretson, 1977; Stoicheff, 1981;
Horii and Ryan, 1981; Linville and Fisher, 1985; Henton, 1987). One often finds the
statement that the pitch level of the female speaking voice is approximately one
octave higher than that of the male speaking voice (Linke, 1973). However, there is
considerable discrepancy among values obtained by different investigators.
According to Hollien and Shipp (1972), the male subjects showed an intersubject
pitch range of 112-146 Hz. Stoicheff's (1981) data showed that the range for the
female subjects was 170-275 Hz. Titze (1989) found that the fundamental
frequency was scaled primarily according to the membranous lengths of the vocal
folds (scale factor 1.6). Figure 1.5 shows fundamental frequency changes for two
speakers for the utterance "We were away a year ago." Figure 1.6 shows the
corresponding speech signals.
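Given the largely non-overlapping pitch ranges just cited (roughly 112-146 Hz for males and 170-275 Hz for females), even a single-threshold classifier illustrates how much gender information FO carries. A minimal sketch follows; the 160 Hz cutoff is our assumption, chosen to fall between the two reported ranges, not a value from this study.

```python
# Minimal FO-threshold gender classifier sketch. The 160 Hz cutoff is an
# assumed midpoint between the reported male and female pitch ranges.
F0_THRESHOLD_HZ = 160.0

def classify_gender_by_f0(f0_hz):
    """Label a voice from its average fundamental frequency alone."""
    return "female" if f0_hz > F0_THRESHOLD_HZ else "male"

print(classify_gender_by_f0(128.0))  # "male"   (within the 112-146 Hz range)
print(classify_gender_by_f0(210.0))  # "female" (within the 170-275 Hz range)
```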
The female voice is slightly weaker than the male voice. On the average, the
root mean square (rms) intensity of glottal periods produced by female subjects is -6
dB relative to comparable samples produced by males. A study by Karlsson (1986)
indicated a strong correlation between weak voice effort and constant air leakage.
During the last few years, measuring the area of the glottis as well as
estimating the glottal volume-velocity waveform have become research topics of
interest (Holmberg et al., 1987). It is well known that the shape of the glottal
excitation wave is an important factor which can greatly affect speech quality
(Rothenberg, 1971). The wave shape produced by male subjects is typically
asymmetrical and frequently shows a prominent hump in the opening phase of the
wave (due to source-tract interaction). The closing portion of the wave generally
occupies 20%-40% of the total period and there may or may not be an easily
identifiable closed period (Monsen and Engebretson, 1977). Notable differences
between male and female waveforms are that the female waveform tends to be
symmetric. There is seldom a hump during the opening-phase indicating less or no
source-tract interaction, and both the opening and closing parts of the wave occupy
more nearly equal proportions of the period. Holmberg et al. (1987) found
statistically significant differences in male-female glottal waveform parameters. In
normal and loud voices, female waveforms indicated lower vocal fold closing
velocity, lower ac flow, and a proportionally shorter closed-phase of the cycle,
suggesting a steeper spectral slope for females. For softly spoken voices, spectral
slopes are more similar to those of males.
These glottal-source differences between male and female subjects are
understandable in terms of the relative size of male and female vocal folds. It is
possible that the asymmetrical, humped appearance of the male glottal wave may be
due to a slightly out-of-phase movement of the upper and lower parts of each vocal
fold. If this is so, then the generally symmetrical appearance of the female glottal
wave may be due to the fact that the shorter female vocal folds come into contact
with each other more nearly as a single mass (Ishizaka and Flanagan, 1972).
Figure 1.6 Speech signals for (a) male and (b) female speakers for the utterance "We were away a year ago."
The perceptual parameters or strategies used to make decisions concerning
male/female voices are not delineated in the literature even though making this
decision is a discrimination task performed routinely by human listeners. However,
it is hypothesized that a limited number of perceptual cues for classifying voices do
exist in the repertoire of listeners, and these cues may include some sociological
factors such as cultural stereotyping.
Singh and Murry (1978) and Murry and Singh (1980) investigated the
perceptual parameters of normal male and female voices. They found that the
fundamental frequency and formant structure of the speaker appeared to carry
significant information for all judgments. The listeners' judgments that the voices
they heard were female were more dependent on judged qualities of voice and effort.
Effort, pitch, and nasality were the perceptual parameters used to characterize
female voices while male voices were judged on the basis of effort, pitch, and
hoarseness. Their results suggested that listeners may use different perceptual
strategies to classify male voices than they use to classify female ones. Coleman
(1976) also suggested that there was a possibility of a gender-specific listener bias
for one acoustic characteristic or for one gender over the other.
Many researchers also believe melodic (intonation, stress, and/or
coarticulation) cues are speech characteristics associated with female voices.
Furthermore, the female voice is typically more breathy than the male voice. This
can be modeled by a dc shift in the glottal wave or, as suggested by Singh and Murry
(1978), is a result of a large number of pitch shifts. As the subject shifts pitch
direction frequently, complete vocal fold approximation is less probable. Research
on acoustic correlates of breathiness was performed by Klatt (1987), in
which three breathiness parameters (i.e., first harmonic amplitude, turbulence noise,
and tracheal coupling) were proposed. A detailed discussion of controlling these
parameters was presented in Klatt's paper (1987).
A newer approach to finding the features responsible for gender identification
is through synthesis. The work done by Yegnanarayana et al. (1984),
Wu (1985), Childers et al. (1985a, 1985b, 1987, 1989), and Pinto et al. (1989)
represents this approach. In their experiments, the speech of a talker of one gender
was converted to sound like that of a talker of the other gender to examine factors
responsible for distinguishing gender features. They found that the fundamental
frequency, the glottal excitation waveshape and the spectrum, which included
formant locations and bandwidth, overall spectral shape and slope, and energy, are
crucial control parameters.
1.3.2 Acoustic Cues Responsible for Gender Perception
As part of current interest in speaker recognition, investigators have sought to
specify gender-bearing attributes of the human voice. Under normal speaking and
listening circumstances, listeners have little difficulty distinguishing the voices of
adult males and females, suggesting that the acoustic parameters which underlie
gender identity are perceptually prominent. The judgment of adult gender is
strongly influenced by acoustic variables reflecting gender differences in laryngeal
size and mass as well as vocal tract length. However, the issue of which specific
acoustic cues are most responsible for gender identification has not been
definitively resolved. This controversy partially dominated previous research.
A series of experiments run by Schwartz (1968) and Ingemann (1968)
employed voiceless fricatives spoken in isolation as auditory stimuli and it was found
that listeners could identify speaker gender accurately from these stimuli, especially
from /H/, /S/, and /SH/ (and could not from /F/ and /TH/). Ingemann reported that
the most identifiable fricative was /H/, with identification of others ranging down to
little better than chance. Since the laryngeal fundamental (FO) was not available to
the listeners, their findings suggest that accurate gender identification is possible
from vocal tract resonance (VTR) information alone and, therefore, that formants
are important cues for speaker gender identification.
Further support for this conclusion came from studies by Schwartz & Rine
(1968) and Coleman (1971). Schwartz and Rine's study revealed that the listeners
were able to identify the speaker's gender from two whispered vowels (/i/ and /a/).
They found 100% correct identification for /a/ and 95% correct identification for /i/,
despite the absence of the laryngeal fundamental. In Coleman's study on male and
female voice quality and its relationship to vowel formant frequencies, /i/, /u/, and a
prose passage were employed to explore listeners' gender identification abilities.
All stimuli were produced at the same FO (85 Hz) by means of an electrolarynx.
Coleman discovered that the judges correctly recognized the speaker gender 88% of
the time (with 98% correct for male voices and 79% for female voices), even when
the FO remained constant for all speakers. He also discovered that the vowel
formant frequency averages were closely associated with the degree of male or
female voice quality.
Coleman (1973a and 1973b) attempted to reduce the influence of possible
differences in rate, juncture, and inflection between male and female speakers by
presenting their voiced productions of prose passage backward to subjects. The
judgments should have, therefore, been based solely on VTR and FO information
which would be unaffected by the backward presentation. By correlation analysis
between measures of VTR, FO, and judgments of degree of male and female voice
quality in the voices of the speakers (with degree of correlation indicative of the
contribution of each of the vocal characteristics to listener judgments), he found that
listeners were basing their judgments of the degree of male or female voice quality
on the frequency of the laryngeal fundamental.
However, in a later study by Coleman (1976), there were inconsistent findings
from a pair of experiments concerned with a comparison of the contribution of two
vocal characteristics to the perception of male and female voice quality. The first
experiment, which utilized natural speech, indicated that the FO was very highly
correlated with the degree of gender perception while the VTR was less highly
correlated. When VTRs that were more characteristic of the opposite gender were
included experimentally in these voices, they did not affect the judges' estimates of
the degree of male or female voice quality. But, in the second experiment, when a
tone produced by a laryngeal vibrator was substituted for the normal glottal tone at
simulated FOs representing both male (120 Hz) and female (240 Hz) voices, and male and
female characteristics (i.e., vocal tract formants and laryngeal fundamentals) were
combined in the same voice experimentally, he found that the female FO was a weak
indicator of the female voice quality when it was combined with the male VTR
features although the male FO retained the perceptual prominence seen in the first
experiment. Thus, there was a difference in the manner that FO and VTR interact
for male and female perception.
Lass et al. (1976) conducted a study comparing listeners' gender identification
accuracy from voiced, whispered, and 255 Hz low-pass filtered isolated vowels.
They found that listener accuracy was greatest for the voiced stimuli (96% correct
out of 1800 identifications--20 speakers x 6 vowels x 15 listeners), followed by the
filtered stimuli (91% correct), and least accurate (75% correct) for the voiceless
vowels. Since the low-pass filtered vowels apparently eliminated formant
information, they concluded that the FO was a more important acoustic cue in
speaker gender identification tasks than the VTR characteristics of the speaker.
Lass et al. (1976) also reported that there were large gender differences in
their results. In all experimental conditions females were recognized at a
significantly lower level, which was in agreement with the results of Coleman (1971)
mentioned above. In another study supportive of this point, Brown and Feinstein
(1977) also used electrolarynx (120 Hz) to control FO so that VTR was the variable.
Identification of male speakers was 84% correct and identification of female
speakers was 67% correct. Brown and Feinstein also found, as in the Coleman
(1971) study, that centralized spectra were more ambiguous to listeners. Again,
VTR appeared to play a determinant role in gender identification in the absence of FO
information.
In a later experiment, the effect of temporal speech alterations on speaker
gender and race identification was investigated. Lass and Mertz (1978) found that
gender identification accuracy remained high and unaffected by temporal speech
alterations when the normal temporal features of speech were altered by means of
the backward playing and time compressing of speech samples. They concluded
that temporal cues appeared to play a role in speaker race identification, but not
speaker gender identification.
In another study concerned with the effect of phonetic complexity on speaker
gender identification, Lass et al. (1979) found that phonetic complexity did not
appear to play a major role for gender judgments. No regular trend was evident
from simple to complex auditory stimuli and listeners' accuracy was as great for
isolated vowels as it was for sentences.
In an attempt to investigate the relative importance of portions of the
broadband frequency speech spectrum in gender identification, Lass et al. (1980)
constructed three recordings representing the three experimental conditions in the
study: unfiltered, 255 Hz low pass filtered, and 255 Hz high pass filtered. The
recordings were played back to a group of 28 judges. The results of their judgments
indicated that gender identification was not significantly affected by such filtering;
listeners' accuracy in gender recognition remained high for all three experimental
conditions, showing that gender identification can be made accurately from acoustic
information available in different portions of the broadband speech spectrum.
1.3.3 Summary of Previous Research
A review of the literature shows that previous research revealed extensive
information about gender identification. However, it is clear that
much work still remains to be done.
What has not been completed
The relative importance of the FO versus VTR characteristics for
perceptual male or female voice quality is still controversial. The belief
that the FO is the strongest cue to gender seems to be substantiated by the
evidence. There is a hypothesis that in situations in which the role of FO
is diminished by deviancy, the effect of VTR characteristics upon gender
judgments increases from a minimal level to take on a large role equal to
and even sometimes greater than that played by FO (Carlson, 1981). But
this hypothesis remains unproven.
It is well known now that not only the vibration frequency of the
glottis (FO) but also the shape of the glottal excitation wave as well are
important factors which greatly affect speech quality (Rothenberg, 1971;
Holmes, 1973). Differences of glottal excitation wave shapes for male
and female were observed and investigated (Monsen and Engebretson,
1977; Karlsson, 1986; Holmberg and Hillman, 1987). But perceptual
verification of these characteristics was still limited (Carrell, 1981), and
the inverse filtering techniques need to be improved and more data
should be analyzed.
What was neglected
First of all, research on automatically classifying male/female
voices by using objective feature measurements was entirely missing.
Almost all previous work was concentrated on subjective testing which is
expensive, time and labor consuming, and subject dependent. Objective
gender recognition which is reliable, inexpensive, and consistent has not
been developed in parallel to subjective testing but such work is
necessary as we stated earlier.
Second, the influences of formant bandwidth and amplitude and
overall spectral shape on gender cues were not considered and
investigated. Traditionally, experiments on contribution of vocal tract
characteristics to gender perception were only concerned with formant
frequencies (Coleman, 1976). The bandwidths of the lowest formant
depend upon vocal tract wall loss and source-tract interaction (Rabiner
and Schafer, 1976; Rothenberg, 1981) while bandwidths of the higher
formants depend primarily upon the viscous friction, thermal loss, and
radiation loss (Flanagan, 1972). These factors may be different for each
gender so that the bandwidths and overall spectral shape are different for
each gender. Bladon (1983) pointed out that male vowels appeared to
have narrower formant bandwidths and perhaps also a less steeply
sloping spectrum. All these areas require further investigation.
What was the weakness
The acoustic features were obtained by short-time spectral
analysis which usually used analog spectrographic techniques.
Estimated FO and formant frequencies may be inaccurate due to
1. errors in determining the positions of the harmonic peaks (in
practice, the peaks were "read" by means of visual inspection,
and then the FO and formants were calculated).
2. errors in formant estimation due to the influences of the FO
and source-tract interaction.
3. large instrument errors (e.g., drift).
Lindblom (1962) estimated the accuracy of spectrographic
measurement to be approximately equal to the fundamental frequency
divided by 4. Flanagan (1955) and Nord and Sventelius (1979, quoted by
Monsen and Engebretson, 1983) suggested that a difference of about 50
Hz for the second formant and a difference of about 21 Hz for the first
formant were perceptible. Therefore, formant frequency estimation should
be as accurate as possible in vowel analysis as well as synthesis.
However, the most frequently referenced paper on acoustic phonetics,
which contains the most comprehensive measurements of the vowel
formants of American English (Peterson and Barney, 1952), may involve
measurement errors as pointed out by Monsen and Engebretson (1983),
especially for female and child subjects, since the data were obtained by
manual spectrographic measurement.
The technique frequently employed to examine the ability of VTR
to serve as a gender cue was to standardize the FO (and therefore eliminate
it as a variable) by utilizing an artificial larynx (Coleman, 1971, 1976;
Brown and Feinstein, 1977). This allows evaluation of VTR in a sample
that contains an FO that is the same for both male and female subjects.
The electrolarynx itself has an unnatural sound to it that may confuse the
listener and depress the overall accuracy of perception.
The study populations were relatively small for most
investigations. Sometimes the database used consisted of less than 10
subjects for each gender (Ingemann, 1968; Schwartz and Rine, 1968;
Brown and Feinstein, 1977), making the interpretation of the results
uncertain.
The results of the listening tests may depend on the gender
distribution of the testing panel because males and females may use
different judging strategies. However, this point usually was not
emphasized so that the conclusions claimed from listening tests may be
biased (Coleman, 1976; Carlson, 1981).
In summary, previous research has measured and investigated the
physiological or anatomical parameters for each gender. Under certain
assumptions, the relationship between anatomical parameters and some of the
acoustic features was established. The major acoustic parameters responsible for
perceptually discriminating a speaker's gender from voice were investigated and
tested. However, no attempt was made to automatically classify male/female voices
by objective feature measurements. The vowel characteristics for each gender were
measured inaccurately because of the weaknesses of analog techniques. Various hypotheses and
preliminary results need to be verified on a more comprehensive database. All these
constituted the underlying problems and provided the impetus for this research.
1.4 Objectives of this Research
This research sought to address these problems through two specific
objectives.
One objective of this study was to explore the possible effectiveness of digital
speech processing and pattern recognition techniques for an automatic gender
recognition system. Emphasis was placed on the investigation of various objective
acoustic parameters and distance measures. The optimal combination of these
parameters and measures was sought. The acoustic features that are
most effective for objectively classifying a speaker's gender were characterized. Efficient
recognition schemes and decision algorithms for this purpose were developed.
The other objective of this study was to validate and clarify hypotheses
concerning some acoustic parameters affecting the ability of algorithms to
distinguish a speaker's gender. Emphasis was placed on extraction of accurate
vowel characteristics, including fundamental frequency and formant features such as
formant frequency, bandwidth, and amplitude, for each gender. The relative
importance of these characteristics for gender identification was evaluated.
1.5 Description of Chapters
In Chapter 2, an overview of the research plan is given and a brief description
of the coarse and fine analysis is presented. The database and the techniques
associated with data collection and preprocessing are discussed in Chapter 3. The
details of the experimental design based on coarse analysis are described in Chapter
4. Asynchronous LPC analysis is reviewed. Different acoustic parameters, distance
measures, template formation, and recognition schemes are provided. The
recognition decision rule and the resubstitution and leave-one-out procedures are proposed as
well. In addition, the concept of the Fisher's discriminant ratio criterion is reviewed.
The recognition performance based on coarse analysis is assessed in Chapter 5.
Results of comparative studies of various phonemes, acoustic features, distance
measures, recognition schemes, and filter orders are reported. The gender
separability of acoustic features is also analyzed by using the Fisher's discriminant
ratio criterion. Chapter 6 expounds on the detailed experimental design of fine
analysis. In particular, the advantages of pitch synchronous closed phase analysis are
demonstrated. A review of the closed phase WRLS-VFF (Weighted Recursive Least
Squares with Variable Forgetting Factor) analysis and the EGG (electroglottograph)
assisted approaches is also presented. Chapter 6 also introduces testing methods for
fine analysis, which include the two-way ANOVA (Analysis of Variance) statistical
test and the automatic recognition test using grouped features. Chapter 7 analyzes
the vowel characteristics such as fundamental frequencies and formant features for
each gender. Statistical tests and relative importance of grouped vowel features are
also discussed. Finally in Chapter 8, a summary of the results of this dissertation is
offered. Recommendations and suggestions for future research conclude this last chapter.
APPROACHES TO GENDER RECOGNITION FROM SPEECH
2.1 Overview of Research Plan
The goal of this study was to explore the possible effectiveness of digital
speech processing and pattern recognition techniques for an automatic gender
recognition system from speech. In order to do this, some hypotheses concerning
acoustic parameters that act to affect our ability to distinguish speaker's gender
needed to be validated and clarified.
Thus, this study was divided into two directions as illustrated in Figure 2.1.
One direction was called coarse analysis since it applied classical pattern recognition
techniques and asynchronous linear prediction coding (LPC) analysis of speech.
The specific goal of this direction was to develop and test candidate algorithms for
achieving gender recognition rapidly using only a brief speech data record.
The second research direction covered fine analysis since pitch synchronous
closed-phase analysis was utilized to obtain accurate vowel characteristics for each
gender. The specific aim of this direction was to compare the relative significance of
vowel characteristics for gender discrimination.
2.2 Coarse Analysis
The tool we used in this direction was asynchronous LPC analysis.
Figure 2.1 The overall research flow.
The advantages of using this technique are:
1. The well-known linear prediction coding (LPC) vocoder is an
efficient vocoder which, when used as a model, encompasses the
features of the vocal source (except for the fundamental frequency) as
well as the vocal tract (Rabiner and Schafer, 1978). Since gender
features are believed to be included in both vocal source and tract,
satisfactory results would be expected using LPC derived parameters.
2. The LPC all-pole model has a smoothed, accurate spectral envelope
matching characteristic, especially for vowels. Formant frequency
measurements obtained by LPC have also been found to compare
favorably to measures obtained by spectrographic analysis (Monsen
and Engebretson, 1983; Linville and Fisher, 1985). Thus it is
expected that features obtained by LPC would represent the spectral
characteristics of both genders more accurately.
3. The LPC model has been successfully applied in speech and speaker
recognition (Makhoul, 1975a; Atal, 1974b, 1976; Rosenberg, 1976;
Markel, 1977; Davis and Mermelstein, 1980; Rabiner and Levinson,
1981). Moreover, many related distortion or distance measurements
have been developed (Gray and Markel, 1976; Gray et al., 1980;
Juang, 1984; Nocerino et al., 1985) which could be conveniently
adopted for the preliminary experiments of gender recognition.
4. Deriving acoustic parameters from the LPC model is
computationally fast and efficient, and only short data records are
needed. This is a very important factor in designing an automatic
gender recognition system.
In the coarse analysis, acoustic parameters such as autocorrelation, LPC,
cepstrum, and reflection coefficients were derived to form test and reference
templates. The effects of using different distance measures, filter orders,
recognition schemes, and phonemes were comparatively evaluated. Comparisons of
acoustic parameters using the Fisher's discriminant ratio criterion were also conducted.
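For two classes and a single scalar feature, the Fisher discriminant ratio reduces to the squared difference of class means divided by the sum of class variances; a minimal sketch of that one-dimensional form is given below (the exact normalization used later in this dissertation may differ, and the sample values are hypothetical).

```python
import numpy as np

def fisher_ratio(x_male, x_female):
    """One-dimensional Fisher discriminant ratio: squared difference of
    class means over the sum of class variances (larger = more separable)."""
    m1, m2 = x_male.mean(), x_female.mean()
    v1, v2 = x_male.var(), x_female.var()
    return (m1 - m2) ** 2 / (v1 + v2)

# Hypothetical feature samples, e.g., one cepstrum coefficient per speaker.
rng = np.random.default_rng(0)
males = rng.normal(0.8, 0.2, size=27)    # 27 male speakers, as in Chapter 3
females = rng.normal(1.2, 0.3, size=25)  # 25 female speakers
print(fisher_ratio(males, females))
```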
The linear prediction coding concepts and detailed experimental design based
on the coarse analysis will be given in Chapter 4.
2.3 Fine Analysis
The objective of the fine analysis was to study and compare the relative
significance of vowel characteristics responsible for gender discrimination.
As we know, male/female vowel characteristics are distinguished by formant
positions, bandwidths, and amplitudes, so accurate formant estimation is
necessary. It is important to pay particular attention to the measurement technique
and to the degree of accuracy which can be achieved through it. Although formant
features have been measured for a variety of different studies, the accuracy of these
measurements is still a matter of conjecture.
Formant estimation is influenced by (Atal, 1974a; Childers et al., 1985a;
Krishnamurthy and Childers, 1986):
o the effect of the periodic vocal fold excitation, especially when the
harmonic is near the formant.
o the effect of the excitation-spectrum envelope.
o the effect of time averaging over several excitation cycles in the
analysis when the vocal folds are repeatedly in open-phase (large
source-tract interaction) and closed-phase (little or no source-tract
interaction) conditions.
Frame based asynchronous LPC analysis cannot reduce the effect of
source-tract interaction because this technique uses windows that average the data
over several excitation epochs. The pitch synchronized closed phase covariance
(CPC) method can reduce the effect of source-tract interaction. However, in certain
situations, the vocal tract filter derived by this method may be unstable because of
the short closed glottal intervals, especially for females and children (Ting et al., 1988).
Sequential adaptive analysis methods offer an attractive alternate processing
strategy since they overcome some of the drawbacks of frame-based analysis. The
closed-phase WRLS-VFF method that tracks the time-varying parameters of the
vocal tract and updates the parameters during the glottal closed phase interval can
reduce the formant estimation error. Experimental results (Ting et al., 1988; Ting,
1989) show that the formant tracking ability and formant estimation accuracy of the
WRLS-VFF algorithm is superior to the LPC based method. Detailed formant
features, including frequencies, bandwidths, and amplitudes in the fine analysis
stage were obtained by using this method. The EGG signals were used to assist in
locating the closed phase portion of the speech signal (Childers and Larar, 1984;
Krishnamurthy and Childers, 1986).
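As a rough sketch of the sequential core that such methods build on, the recursive least squares update with a forgetting factor is shown below. A fixed factor stands in for the variable one, and the EGG-gated closed-phase selection is omitted, so this is a simplification under stated assumptions, not a reproduction of the actual WRLS-VFF algorithm.

```python
import numpy as np

def rls_lpc(s, p=10, lam=0.98):
    """Recursive least squares estimate of all-pole (LPC) coefficients.
    A fixed forgetting factor 'lam' stands in for the variable one (VFF)."""
    theta = np.zeros(p)        # predictor coefficients a_1..a_p
    P = np.eye(p) * 1e4        # inverse correlation matrix estimate
    for n in range(p, len(s)):
        x = s[n - p:n][::-1]                    # [s(n-1), ..., s(n-p)]
        k = P @ x / (lam + x @ P @ x)           # gain vector
        theta = theta + k * (s[n] - x @ theta)  # update on prediction error
        P = (P - np.outer(k, x @ P)) / lam      # propagate inverse correlation
    return theta
```

In the actual method, the forgetting factor varies with time so that parameter updates concentrate in the glottal closed-phase intervals located with the EGG signal (Chapter 6).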
There were two approaches for testing the relative importance of various
vowel features for gender recognition:
Statistical tests. Since formant characteristics such as frequencies,
bandwidths, and amplitudes depend on or are influenced by two
factors (i.e., gender as well as vowels) and each experimental
subject produces more than one vowel, our experiments should be
referred to as two factor experiments having repeated measures on
the same subject (Winer, 1971). Therefore, a two-way ANOVA was
used to perform the statistical test (a minimal sketch follows this
list). The significance of the
difference between each individual feature in terms of male/female
groups was analyzed.
Automatic recognition. First, the individual or grouped features,
such as only the fundamental frequency or only the formant
frequencies or bandwidths (but from all formants), were used to
form the reference and test templates. Then automatic recognition
schemes were applied on these templates. Finally, the recognition
error rates for different features were compared.
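A minimal sketch of the first approach, using a conventional two-factor layout, is given below. The data frame, column names, and values are illustrative assumptions; the dissertation's analysis uses repeated measures on the same subjects, which this simple independent-samples sketch does not model.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format table: one row per measurement of a feature,
# e.g., first formant frequency F1 in Hz, by gender and vowel.
df = pd.DataFrame({
    "gender": ["M"] * 6 + ["F"] * 6,
    "vowel": ["IY", "IY", "A", "A", "ER", "ER"] * 2,
    "F1": [270, 280, 730, 710, 490, 500,   # hypothetical male F1 values
           310, 330, 850, 830, 560, 540],  # hypothetical female F1 values
})

# Two-factor ANOVA: gender, vowel, and their interaction.
model = ols("F1 ~ C(gender) + C(vowel) + C(gender):C(vowel)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```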
In Chapter 6, the detailed background of the closed phase WRLS-VFF method
and the experimental design based on fine analysis will be presented.
DATA COLLECTION AND PROCESSING
3.1 Database Description
The database consists of speech and EGG data collected from 52 normal
subjects (27 males and 25 females), with ages varying from 20 to 80 years.
The speech and EGG signals were digitized directly and simultaneously.
Each subject read, after some practice, the following SAMPLE PROTOCOL that
includes 27 tasks.
- 10 with comfortable pitch & loudness.
- 5 with progressive increase in loudness.
Sustain phonation of the vowel /IY/ in the word ...
Sustain phonation of the vowel /I/ in the word ...
Sustain phonation of the diphthong /AI/ in the word ...
Sustain phonation of the vowel /E/ in the word ...
Sustain phonation of the vowel /AE/ in the word ...
Sustain phonation of the vowel /OO/ in the word ...
Sustain phonation of the vowel /U/ in the word ...
Sustain phonation of the diphthong /OU/ in the word ...
Sustain phonation of the vowel /OW/ in the word ...
Sustain phonation of the vowel /A/ in the word ...
Sustain phonation of the vowel /UH/ in the word ...
Sustain phonation of the vowel /ER/ in the word ...
Sustain phonation of the whisper /H/ in the word ...
Sustain phonation of the fricative /F/ in the word ...
Sustain phonation of the fricative /TH/ in the word ...
Sustain phonation of the fricative /S/ in the word ...
Sustain phonation of the fricative /SH/ in the word SHIP.
Sustain phonation of the fricative /V/ in the word VAN.
Sustain phonation of the fricative /TH/ in the word THIS.
Sustain phonation of the fricative /Z/ in the word ZOO.
Sustain phonation of the fricative /ZH/ in the word AZURE.
Produce chromatic scale on "la" (attempt to go up, then down
as one effort -- pause between top 2 notes)
Sentence "We were away a year ago."
Sentence "Early one morning a man and a woman ambled along a
one mile lane."
Sentence "Should we chase those cowboys?"
A subset of the above was used in this research. It consisted of
1. Ten sustained vowels: /IY/, /I/, /E/, /AE/, /OO/, /U/, /OW/, /A/, /UH/,
and /ER/. There were a total of 520 vowels for 52 subjects: 270
vowels from males and 250 vowels from females.
2. Five sustained unvoiced fricatives (including a whisper): /H/, /F/,
/TH/, /S/, and /SH/. There were a total of 260 unvoiced fricatives for
all subjects: 135 from males and 125 from females.
3. Four voiced fricatives: /V/, /TH/, /Z/, and /ZH/. There were a total
of 208 voiced fricatives for all subjects: 108 from males and 100
from females.
3.2 Speech and EGG Digitization
All of the experimental data were collected with the subjects situated inside an
Industrial Acoustics Company single wall sound booth. The speech was picked up
with an Electro Voice RE-10 dynamic cardioid microphone and the EGG signal was
monitored by a Synchrovoice device. Amplification of the speech and EGG signals
was accomplished with a Digital Sound Corporation DSC-240 Audio Control
Console. The two channels were alternately sampled at 20 kHz by a Digital Sound
Corporation DSC-200 Digital Audio Converter system with 16 bits of precision.
The low-pass, anti-aliasing and reconstruction filters of the DSC-200 were
connected to the analog side of the converter. Both signals were bandlimited to 5
kHz by these passive elliptic filters, with a minimum stopband
attenuation of -55 dB and passband ripple of 0.2 dB. The DSC-240 station
provides audio signal interfacing to the DSC-200, which includes input and output
buffering as well as level metering and signal path switching.
The utterances were directly digitized since this choice avoids any distortion
that may be introduced in the tape recording process (Berouti et al., 1977; Naik,
1984). An extender attached to the microphone kept the speaker's lips 6 inches
away. With the microphone and EGG electrodes in place, the researcher ran the
data collection program on a terminal inside the sound room. A two channel
Tektronix Type 564B Storage Oscilloscope was connected to DSC-240 so both
speech and EGG signals were monitored. The program prompted the researcher by
presenting a list of commands on the screen. The researcher initiated digitization by
depressing the "D" key on the keyboard. Immediately after digitization, another
prompt indicated termination of the sampling process. The digitized utterance could
be played back and an option existed to repeat the digitization process if it was
thought that part of the utterance might have been spoken abnormally or the
digitized speech and EGG signals were unsatisfactory. For example, the speakers
were instructed to repeat an utterance if the panel of experts sitting in the
sound room, or the speaker, felt that it was rushed, mispronounced, too low, etc. The
entire protocol with utterances repeated as necessary took an average of 15-20
minutes to collect. About 150-200 seconds of speech and EGG were automatically
stored on disk. Thus, for each subject, about 12000-16000 blocks (512 bytes per
block) of data were collected.
Since the speech and EGG channels were alternately sampled, the resulting
file of digitized data had the two signals interleaved. The trivial task of
demultiplexing was performed off-line after data collection. Once the data were
demultiplexed, the speech and EGG were trimmed to discard the unnecessary data
before and after an utterance while keeping the onset and offset portions at each end
of the data. After trimming, about 4500-6500 blocks of data were stored on disk
for each subject.
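A minimal sketch of the demultiplexing step follows; the file name and the 16-bit little-endian sample format are assumptions for illustration, since the lab's actual programs and file formats are not described here.

```python
import numpy as np

# Interleaved 16-bit samples: speech, EGG, speech, EGG, ...
raw = np.fromfile("capture.raw", dtype="<i2")  # file name is hypothetical
speech = raw[0::2]  # even-indexed samples: speech channel (10 kHz each)
egg = raw[1::2]     # odd-indexed samples: EGG channel
```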
3.3 Synchronization of Data
When the speech and EGG signals were used during the analysis stage, they
were time aligned to account for the acoustic propagation delay from the larynx to
the microphone. The microphone was kept a fixed 15.24 centimeters (6 inches)
away from the speakers' lips to reduce breath noises and to simplify the alignment
process. Synchronization of the waveforms had to account for the distance from the
vocal folds to the microphone. To do so, average vocal tract lengths of 17 cm for
males and 15 cm for females were assumed. The number of samples to discard
from the beginning of the speech record was then
# samples = Int[(32.24/34442) × 10000 + 0.5]    (3.1)
for males and
# samples = Int[(30.24/34442) × 10000 + 0.5]    (3.2)
for females.
Equations (3.1) and (3.2) show that a 10 sample correction is appropriate for
males and a 9 sample correction is appropriate for females. Examination of the data
also supported use of these figures for adult speakers. Examples of aligned speech
and EGG signals for a male and female speaker are shown in Figure 3.1.
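Equations (3.1) and (3.2) simply convert the acoustic propagation distance into a whole number of samples at the 10 kHz per-channel rate. A direct transcription is sketched below, with the constants taken from the equations as printed; note that the rounded result is sensitive to the assumed speed of sound.

```python
def alignment_samples(tract_length_cm,
                      mic_distance_cm=15.24,       # 6 inch lip-to-mic distance
                      speed_of_sound_cm_s=34442.0,  # as printed in Eq. (3.1)
                      fs_per_channel_hz=10000.0):
    """Leading speech samples to discard so the speech channel aligns
    with the EGG channel (Equations 3.1 and 3.2)."""
    delay_s = (tract_length_cm + mic_distance_cm) / speed_of_sound_cm_s
    return int(delay_s * fs_per_channel_hz + 0.5)

# Per the text, corrections of 10 samples (males, 17 cm tract) and
# 9 samples (females, 15 cm tract) were used.
corrections = {"male": alignment_samples(17.0),
               "female": alignment_samples(15.0)}
```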
Figure 3.1 Examples of aligned speech and EGG signals for (a) male and (b) female speakers.
EXPERIMENTAL DESIGN BASED ON COARSE ANALYSIS
As stated in Chapter 2, coarse analysis applies classical pattern recognition
techniques and asynchronous linear prediction coding (LPC) analysis of speech.
The specific goal of this analysis was to develop and test the algorithms to achieve
rapid gender recognition from only a brief speech data record. Figure 4.1 shows the
canonic pattern recognition model used in the gender recognition system. There are
four basic steps in the model:
1) acoustic parameter extraction,
2) test and reference pattern or template formation,
3) pattern similarity determination, and
4) decision rule.
The input is the acoustic waveform of the spoken speech signal; the desired
output is a "best" estimate of the speaker's gender. Such a model can be
a part of a speech or speaker recognition system or a front end processor for such a
system. The following discussion of the coarse analysis proceeds in the context of
this model.
4.1 Asynchronous LPC Analysis
4.1.1 Linear Prediction Concepts
Linear prediction, also known as the autoregressive (AR), all-pole model, or
maximum entropy model, is widely used in speech processing.
Figure 4.1 A pattern recognition model for gender recognition from speech.
This method has become the predominant technique for estimating the basic speech
parameters (e.g., pitch, formants, spectra, vocal tract area functions) and for
representing speech
for low bit-rate transmission or storage. The method was first applied to speech
processing by Atal and Schroeder (1970) and Atal and Hanauer (1971). For speech
processing, the term linear prediction refers to a variety of essentially equivalent
formulations of the problem of modeling the speech waveform (Markel and Gray,
1976; Makhoul, 1975b). These different models usually lead to similar results, but
each formulation has provided insight into the speech modeling problem, and the
choice among them is generally dictated by computational demands.
The particular form of this model that is appropriate for this research is
depicted in Figure 4.2. In this case, the composite spectrum effects of radiation,
vocal tract, and glottal excitation are represented by a time-varying digital filter
whose steady-state system function is of the form
H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}    (4.1)
This system is excited by an impulse train for voiced speech or a random noise
sequence for unvoiced speech.
The above system function can be written alternatively in the time domain as
s(n) = \sum_{k=1}^{p} \alpha_k s(n-k) + G u(n)    (4.2)

Let us assume we have available the past data samples from (n-p) to (n-1) and that
we are predicting the nth sample as a linear combination of these past samples:
Figure 4.2 A digital model of speech production.
\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (4.3)
The error between the value of the actual nth sample and its estimate is

e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)    (4.4)

or, equivalently,

s(n) = \sum_{k=1}^{p} a_k s(n-k) + e(n)    (4.5)

If the linear prediction model of Equation (4.5) conforms to the basic speech
production model given by (4.2), then

e(n) = G u(n)    (4.6)

a_k = \alpha_k    (4.7)
Thus the coefficients {a_k} identify the system whose output is s(n). The problem
then is to determine the values of the coefficients {a_k} from the actual speech signal.
The criterion used to obtain the coefficients {a_k} is the minimization of the
short-time average prediction error E with respect to each coefficient a_i, over some
time interval, where

E = \sum_{n} [e(n)]^2    (4.8)

This leads to the following set of equations:

\sum_{k=1}^{p} a_k \sum_{n} s(n-k) s(n-i) = \sum_{n} s(n) s(n-i),    1 \le i \le p    (4.9)
For a short-time analysis, the limits of summation are finite. The particular
choice of these limits has led to two methods of analysis: the autocorrelation
method (Markel and Gray, 1976) and the covariance method (Atal and Hanauer,
1971).

The autocorrelation method results in a filter structure that is guaranteed to be
stable. However, it operates on a data segment that is windowed using a Hanning,
Hamming, or other window, typically 10-20 msec long (two to three pitch periods).

The covariance method, on the other hand, gives a filter with no guaranteed
stability, but requires no explicit windowing. Hence it is eminently suitable for pitch
synchronous analysis.
One of the important features of the linear prediction model is that the
combined contribution of the glottal flow, the vocal tract, and the radiation effect at
the lips is represented by a single recursive filter. The difficult problem of
separating the contribution of the source function from that of the vocal tract system
is thus completely avoided.
4.1.2 Analysis Conditions
In order to extract acoustic parameters rapidly, a conventional pitch
asynchronous autocorrelation LPC method was used, which applied a fixed frame
size, frame rate, and number of parameters per frame. These analysis conditions were:
Order of the filter: 8, 12, 16, 20
Analysis frame size: 256 points/frame
Frame overlap: None
Preemphasis Factor: 0.95
Analysis Window: Hamming
Data set for coefficient calculations: six frames total. The first two of these
were picked from near the voice onset of an utterance, the next two from
the middle of the utterance, and the last two from near the voice offset of the
utterance. By averaging six sets of coefficients obtained from these six frames, a
template coefficient set was calculated for each sustained utterance such as a vowel,
an unvoiced fricative, or a voiced fricative.
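To make these analysis conditions concrete, the following illustrative Python sketch (numpy assumed; function names are hypothetical and not part of the original experimental software) computes the autocorrelation-method LPC parameters of a single 256-point frame via the Levinson-Durbin recursion. The reflection coefficients of Section 4.2.4 are a by-product of the same recursion.

    import numpy as np

    def lpc_autocorrelation(frame, order=12, preemph=0.95):
        """Autocorrelation-method LPC of one 256-point frame.
        Returns (a, k, E): prediction-error filter [1, a1, ..., ap],
        reflection coefficients, and residual energy.
        The predictor coefficients of Eq. (4.11) are -a[1:]."""
        x = np.append(frame[0], frame[1:] - preemph * frame[:-1])  # preemphasis 0.95
        x = x * np.hamming(len(x))                                 # Hamming window
        N = len(x)
        r = np.correlate(x, x, mode='full')[N - 1:N + order]       # R(0)..R(p)

        a = np.zeros(order + 1)
        a[0] = 1.0
        k = np.zeros(order)
        E = r[0]
        for i in range(1, order + 1):                  # Levinson-Durbin recursion
            acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
            k_i = -acc / E
            a_prev = a.copy()
            for j in range(1, i):
                a[j] = a_prev[j] + k_i * a_prev[i - j]
            a[i] = k_i
            k[i - 1] = k_i
            E *= (1.0 - k_i * k_i)
        return a, k, E

A template coefficient set would then be formed by averaging six such coefficient sets drawn from the onset, middle, and offset of the utterance, as described above.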
4.2 Acoustic Parameters
One of the key issues in developing a recognition system is to identify
appropriate features and measures that will support good recognition
performance. Several acoustic parameters were considered as feature candidates in
this research.
4.2.1 Autocorrelation Coefficients
They are defined conventionally as (Atal, 1974b)
R(k) = \sum_{n} h(n) h(n+|k|)    (4.10)
where h(n) is the impulse response of the filter. The relationship between the P
autocorrelation function coefficients and P LPC coefficients is unique in that they
can be obtained from each other (Rabiner and Schafer, 1978).
4.2.2 LPC coefficients
LPC coefficients are defined conventionally as (Rabiner and Schafer, 1978)
\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (4.11)

where s(n-k) is the (n-k)th speech sample, \hat{s}(n) is the nth predicted output, and a_k is
the kth LPC coefficient. LPC coefficients are determined by minimizing the
short-time average prediction error.
4.2.3. Cepstrum Coefficients
Cepstral coefficients can be obtained by the following recursive formula
(Rabiner and Schafer, 1978):

c_0 = 0,
c_k = a_k + \sum_{i=1}^{k-1} (i/k)\, c_i a_{k-i},    1 \le k \le p    (4.12)

where a_k is the kth LPC coefficient.
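The recursion of Equation (4.12) transcribes directly to code; the following illustrative sketch assumes zero-based list indexing as noted in the comments.

    def lpc_to_cepstrum(a):
        """Cepstral coefficients c1..cp from LPC coefficients a1..ap
        via Eq. (4.12); a[j] holds a_{j+1}, and c0 is taken as 0 as in
        the text."""
        p = len(a)
        c = [0.0] * (p + 1)          # c[0] = 0
        for k in range(1, p + 1):
            c[k] = a[k - 1] + sum((i / k) * c[i] * a[k - 1 - i]
                                  for i in range(1, k))
        return c[1:]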
4.2.4 Reflection Coefficients
If we consider a model for speech production that consists of a concatenation
of N lossless acoustic tubes, then the reflection coefficients are defined as (Rabiner
and Schafer, 1978)
r(k) = \frac{A(k+1) - A(k)}{A(k+1) + A(k)}    (4.13)

where A(k) is the area of the kth lossless tube. The reflection coefficient determines
the fraction of energy in a traveling wave that is reflected at each section boundary.
Further, r(i) is related to the PARCOR coefficient k(i) by (Rabiner and Schafer, 1978)

r(i) = k(i)    (4.14)
where k(i) can be obtained from LPC coefficients by recursion.
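As an illustration of Equation (4.13), the reflection coefficients follow directly from a hypothetical list of tube areas (a sketch only):

    def reflection_from_areas(A):
        """Reflection coefficients r(k) of Eq. (4.13) from the areas
        A(1)..A(N) of a concatenation of lossless acoustic tubes,
        ordered from glottis to lips."""
        return [(A[k + 1] - A[k]) / (A[k + 1] + A[k])
                for k in range(len(A) - 1)]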
4.2.5 Fundamental Frequency and Formant Information
This set of features consists of the frequencies, bandwidths, and amplitudes of the
first, second, third, and fourth formants and the fundamental frequency (not for
fricatives). Formant information was obtained by a peak-picking technique, using
an FFT on the LPC coefficients. Fundamental frequency was calculated based upon
a modified cepstral algorithm.
4.3 Distance Measures
Several distance measures were considered.
4.3.1 Euclidean Distance
D_{EUC} = [ (X-Y)^t (X-Y) ]^{1/2}    (4.15)
where X and Y are the test and reference vectors respectively and t denotes the
transpose of the vector.
4.3.2 LPC Log Likelihood Distance
It was proposed by Itakura (1975) and is defined as

D_{LPC}(a, \hat{a}) = \log \left[ \frac{a R a^t}{\hat{a} R \hat{a}^t} \right]    (4.16)

where a and \hat{a} are the LPC coefficient vectors of the reference and test speech,
and R is the matrix of autocorrelation coefficients of the test speech. An
interpretation of this formula is given in Figure 4.3 below, in which the subscript r
denotes reference and the subscript t denotes test.

The denominator term can be obtained by passing the test speech signal
s_t(n) through the inverse LPC system of the test, [H_t(z)]^{-1}, giving the energy \alpha of the
error signal. Similarly, the numerator term can be obtained by passing the same test
signal s_t(n) through the inverse LPC system of the reference, [H_r(z)]^{-1}, giving the
energy \beta of the error signal.
Thus we obtain
D_{LPC}(a, \hat{a}) = \log(\beta / \alpha)    (4.17)
It can also be shown that this distance measure is related to the spectral
dissimilarity between the test and reference speech signals.
For computational efficiency, variables can be changed and Equation (4.16)
or (4.17) can be rewritten as
D_{LPC}(a, \hat{a}) = \log \left[ \frac{\sum_{k=-p}^{p} r(k)\, r_a(k)}{E} \right]    (4.18)

where r(k) is the autocorrelation of the speech segment to be recognized, E is the
total squared LPC prediction error associated with the estimates \hat{a}(k) from this
segment, and r_a(k) is the autocorrelation of the true (reference) LPC coefficients.

Figure 4.3 An interpretation of the LPC log likelihood distance measure.
The block diagram for this subroutine is shown in Figure 4.4.
The spectral domain interpretation of Equation (4.18) is (Rabiner and Schafer, 1978)

D_{LPC}(a, \hat{a}) = \log \left[ \int_{-\pi}^{\pi} \left| \frac{H_r(e^{j\omega})}{H_t(e^{j\omega})} \right|^2 \frac{d\omega}{2\pi} \right]    (4.19)

(i.e., an integrated square of the ratio of LPC spectra between reference and test
speech signals).
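Given the autocorrelation sequence of the test frame, Equations (4.16) and (4.17) reduce to a few lines. The sketch below (numpy assumed, illustrative only) uses the [1, a1, ..., ap] prediction-error convention of the LPC sketch in Section 4.1.2.

    import numpy as np

    def itakura_distance(a_ref, a_test, r_test):
        """LPC log likelihood distance of Eqs. (4.16)-(4.17).
        a_ref, a_test: prediction-error filters [1, a1, ..., ap];
        r_test: autocorrelation sequence R(0)..R(p) of the test frame."""
        n = len(r_test)
        # Toeplitz autocorrelation matrix of the test speech.
        R = np.array([[r_test[abs(i - j)] for j in range(n)]
                      for i in range(n)])
        beta = a_ref @ R @ a_ref     # error energy through the reference filter
        alpha = a_test @ R @ a_test  # error energy through the test filter
        return float(np.log(beta / alpha))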
4.3.3 Cepstral Distortion
It can be defined as (Nocerino et al., 1985)
D_{cep}(c, c') = \sum_{k=1}^{p} (c_k - c'_k)^2    (4.20)
It can be shown that the power spectrum which corresponds to the cepstrum is a
smoothed version of the true log spectral density function. The fewer the cepstral
coefficients used, the smoother the resultant log spectral density. It can also be
shown that this truncated cepstral distortion measure is a good approximation to the
L2 norm of the log spectral distortion measure between two time series, x(n) and
x'(n):

D_{L2} = \int_{-\pi}^{\pi} \left| \log|X(\omega)|^2 - \log|X'(\omega)|^2 \right|^2 \frac{d\omega}{2\pi}    (4.21)

Figure 4.4 The block diagram of the LPC log likelihood distance computation.
where X(\omega) is the Fourier transform of x(n). Gray and Markel (1976) showed that
for LPC analysis with a filter order of 10, the correlation coefficient between D_{L2}
and D_{cep} is 0.98, while for an order of 20 the correlation coefficient is 0.997.
4.3.4 Weighted Euclidean Distance
D_{WEUC} = [ (X-Y)^t W^{-1} (X-Y) ]^{1/2}    (4.22)
where X is the test vector, W is the symmetrical covariance matrix obtained using a
set of reference vectors (e.g., from a set of templates which represent subjects of the
same gender), and Y is the mean vector of this set of reference vectors. The
weighting compensates for correlation between features in the overall distance and
reduces the intragroup variations. The weighted Euclidean distance is a simplified
version of the likelihood distance measure using the probability density function.
4.3.5 Probability Density Function
D_{PDF} = \frac{1}{(2\pi)^{n/2} |W|^{1/2}} \exp \left[ -\frac{(X-Y)^t W^{-1} (X-Y)}{2} \right]    (4.23)

where X, Y, and W are the same as in Equation (4.22) and |W| is the determinant of W.
The decision principle of this distance measure minimizes the probability of
classification error.
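The remaining measures are equally compact. The following illustrative sketch covers Equations (4.15), (4.22), and (4.23) (numpy assumed); note that the PDF value is a similarity rather than a distance, so a decision rule based on it would select the class with the largest value.

    import numpy as np

    def d_euc(x, y):
        """Euclidean distance, Eq. (4.15)."""
        d = x - y
        return float(np.sqrt(d @ d))

    def d_weuc(x, y, W):
        """Weighted Euclidean distance, Eq. (4.22); y is the mean of the
        reference set and W its covariance matrix."""
        d = x - y
        return float(np.sqrt(d @ np.linalg.solve(W, d)))

    def d_pdf(x, y, W):
        """Gaussian density of Eq. (4.23), evaluated at the test vector x."""
        n = len(x)
        d = x - y
        norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(W))
        return float(np.exp(-0.5 * d @ np.linalg.solve(W, d)) / norm)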
4.4 Template Formation and Recognition Schemes
4.4.1 Purpose of Design
Another important issue in developing a recognition system is the selection of
appropriate template formation and recognition schemes.
During initial exploratory studies of fixed-text recognition using spectral
pattern matching techniques in the Pruzansky study (1963), the use of the long-term
average technique to form a feature vector was discovered to have potential for
free-text speaker recognition. The speaker recognition error rate was found to
remain undegraded (at 11 percent) even after spectral amplitudes were averaged
over all frames of speech data into a single reference spectral amplitude vector for
each talker. Markel et al. (1977) demonstrated that the between-to-within speaker
variation ratio was significantly increased by long-term averaging of the parameter
sets (thus permitting free-text operation).
Temporal cues also appeared not to play a role in speaker gender
identification (Lass and Mertz, 1978). They found that gender identification
accuracy remained high and unaffected by temporal speech alterations when the
normal temporal features of speech were altered by means of the backward playing
and time compressing of speech samples.
Therefore, we may reasonably believe that the gender information is
time-invariant. Thus, long-term averaging should also emphasize the speaker's
gender information and increase the between-to-within gender variation ratio. In
practice we could then achieve free-text gender recognition, in which gender
identification would be determined before recognition of speech or speaker and
thus reduce the speech or speaker recognition search space to half.
The purpose of using different test and reference template formation schemes
is to verify the hypothesis above and, if it is correct, to determine how much
averaging has to be performed to obtain the best gender recognition. In the
preliminary attempt, the averaging was first done within three classes of the sounds
(i.e., vowels, unvoiced fricatives, and voiced fricatives).
4.4.2 Test and Reference Template Formation
The averaging procedures used to create test and reference templates for the
present experiment employed a multi-level combination approach as illustrated in Figure 4.5.
The lower layer templates were feature parameter vectors obtained from each
utterance by an LPC analysis as described in the last section. They can be
autocorrelation, LPC, or cepstrum coefficients, etc. A lower layer template
coefficient set was calculated by averaging six sets of coefficients obtained from six
frames for each sustained utterance such as a vowel, an unvoiced fricative, or a
voiced fricative for every subject.
The next level of combination averaged all templates in the lower layer for
each subject to form a single median layer template to represent this subject.
Templates of all utterances from the same phoneme group (e.g., vowels, unvoiced
fricatives, or voiced fricatives) were averaged.
In the last stage, the single male and female templates were formed in the same
manner as above. Each gender was represented by a single token (centroid)
obtained by averaging all templates in the median layer.
It is evident that from the lower layer to the upper layer, a higher degree of
averaging is achieved.
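The three layers of averaging can be summarized by the following illustrative sketch (numpy assumed; each argument is a hypothetical list of equal-length coefficient vectors):

    import numpy as np

    def lower_layer(frame_coeff_sets):
        """One utterance template: average of the six per-frame sets."""
        return np.mean(frame_coeff_sets, axis=0)

    def median_layer(utterance_templates):
        """One subject template: average of all lower layer templates of
        a subject within one phoneme group."""
        return np.mean(utterance_templates, axis=0)

    def upper_layer(subject_templates):
        """One gender template ('universal token'): average of the median
        layer templates of all subjects of that gender."""
        return np.mean(subject_templates, axis=0)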
Figure 4.6(a) shows two reflection coefficient templates of vowels for male
and female speakers in the upper layer. The filter order was 12 so that there were 12
elements in each template (vector). Each template can be considered a "universal
token" representing each gender.

Figure 4.5 Test and reference template formation.

Figure 4.6 (a) Two reflection coefficient templates of vowels for male and female
speakers in the upper layer. (b) The corresponding spectra.

The data in the figure are shown as means ± standard errors (SE), which were calculated from the median layer
templates. We will see in the next Chapter that by applying these two tokens as
reference templates with the recognition Scheme 3 and the Euclidean distance a
100% gender recognition rate can be achieved. The result is not surprising if we
notice that the within-gender variation for these reflection coefficients, as
represented by SE in the figure, was small, compared to the between-gender
variation. It is also easily noted that elements 1, 4, 5, 6, 7, 8, 9, and 10 of these
reference templates account for the most between-gender variation. On the other
hand, elements 2, 3, 11, and 12 of these reference templates account for little
between-gender variation and thus could be discarded to reduce the dimensionality
of the vector. Figure 4.6(b) shows the spectra corresponding to the two "universal"
reflection coefficient templates.
Similarly, Figure 4.7(a) and (b) show two reflection coefficient templates
(with the same filter order of 12) and the corresponding spectra of unvoiced
fricatives for male and female speakers in the upper layer. Interestingly, strong
peaks are present in the "universal" female spectrum for unvoiced fricatives, but only
several ripples appear in the male spectrum. We will see later that by using these
two tokens as reference templates with the recognition Scheme 3 and the Euclidean
distance, an 80.8% gender recognition rate can be achieved. Finally, Figure 4.8(a)
and (b) are two cepstral coefficient templates (with the same filter order of 12) and
the corresponding spectra of voiced fricatives for male and female speakers in the
upper layer. It is shown later that by using these two tokens as reference templates
with the recognition Scheme 3 and the Euclidean distance, a 92.3% gender
recognition rate can be achieved. The "universal" spectra for the two genders in
Figure 4.6(b), 4.7(b), and 4.8(b) possess the basic properties for vowels, unvoiced
fricatives, and voiced fricatives. For example, while the energy of the vowels is
concentrated in the lower frequency portion of the spectrum, the energy of the
unvoiced fricatives is concentrated in the higher frequency portion, and the energy
of the voiced fricatives is more or less equally distributed.

Figure 4.7 (a) Two reflection coefficient templates of unvoiced fricatives for male
and female speakers in the upper layer. (b) The corresponding spectra.

Figure 4.8 (a) Two cepstral coefficient templates of voiced fricatives for male and
female speakers in the upper layer. (b) The corresponding spectra.
4.4.3 Nearest Neighbor Decision Rule
The other major step in the pattern recognition model is the decision rule
which chooses which reference template most closely matches the unknown test
template. Although a variety of approaches are applicable, only two decision rules
have been used in most practical systems, namely, the K-nearest neighbor rule
(KNN rule) and the nearest neighbor rule (NN rule).
The KNN rule is applied when each reference class (e.g., gender) is
represented by two or more reference templates (e.g., as would be used to make the
reference templates independent of the speaker). The KNN rule operates as follows:
Assume we have M reference templates for each of two genders, and for each
template a distance score is obtained. If we denote the distance for the ith reference
template of the jth gender as D_{i,j} (1 \le i \le M and j = 1, 2), this set of distance
scores can be ordered such that

D_{1,j} \le D_{2,j} \le ... \le D_{M,j}    (4.24)

Then for the KNN rule we compute the average distance (radius) for the jth gender
over the K smallest scores,

r_j = \frac{1}{K} \sum_{i=1}^{K} D_{i,j}    (4.25)

and we choose the index j* with the smallest average distance as the "recognized"
gender:

j^* = \arg\min_j r_j    (4.26)
When K is equal to 1, the KNN rule becomes the NN rule (i.e., it chooses the
reference template with the smallest distance as the recognized template).
The importance of the KNN rule is seen for word recognition when the number of
reference templates per class M is from 6 to 12, in which case it has been shown that a
real statistical advantage is obtained using the KNN rule (with K = 2 or 3) over the
NN rule (Rabiner et al., 1979).
However, since there was no previous knowledge of the decision rule as applied to
gender recognition, the NN rule was first used in this preliminary experiment.
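The decision rule of Equations (4.24)-(4.26) can be sketched as follows (illustrative Python; distances_by_gender is a hypothetical mapping from a gender label to its M distance scores):

    import numpy as np

    def knn_decision(distances_by_gender, K=1):
        """KNN rule: order the scores, average the K smallest per gender
        (Eq. 4.25), and return the gender with the smallest radius
        (Eq. 4.26). K = 1 reduces to the NN rule used here."""
        radii = {gender: float(np.mean(sorted(scores)[:K]))
                 for gender, scores in distances_by_gender.items()}
        return min(radii, key=radii.get)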
4.4.4 Structure of Four Recognition Schemes
To investigate how much averaging should be done for the test and reference
templates to gain the best performance for the gender recognizer, several
recognition schemes were designed. Table 4.1 presents a brief summary of these schemes.
Table 4.1 Four recognition schemes
Test template from Reference template from
SCHEME 1 LOWER LAYER MEDIAN LAYER
SCHEME 2 LOWER LAYER UPPER LAYER
SCHEME 3 MEDIAN LAYER UPPER LAYER
SCHEME 4 MEDIAN LAYER MEDIAN LAYER
Scheme 1 is illustrated in Figure 4.9(a). In the training stage, one test
template for each test utterance (i.e., the lower layer) and one reference template
for each subject (i.e., the median layer) were formed.

Figure 4.9 Structures of four recognition schemes.

The set of the entire median
layer constituted the reference cluster that includes all median templates. In the
testing stage, the distance measure for each lower layer template of all test subjects
was calculated with respect to each of the median layer templates, and the minimum
distance was found. The speaker gender of the lower layer utterance was then
classified as male or female, according to the gender known for the median layer
reference template.
Scheme 2 is illustrated in Figure 4.9(b). In the training stage, one test
template for each test utterance (i.e., the lower layer), and one reference template
for each gender (i.e., the upper layer), were formed. The upper layer constituted the
reference cluster that includes only two gender templates. In the testing stage, the
distance measure for each lower layer template of all test subjects was calculated
with respect to each of those upper layer templates and the minimum distance was
found. The speaker gender of the lower layer utterance was then classified as male
or female, according to the gender known for the upper layer reference template.
Figure 4.9(c) shows Scheme 3. In the training stage, one test template for each
test subject (i.e., the median layer), and one reference template for each gender (
i.e., the upper layer), were formed. The set of the entire median layer constituted
the test pool that includes all median templates. In the testing stage, the distance
measure for each median layer template of all test subjects was calculated with
respect to each of those upper layer templates and the minimum distance was found.
The speaker gender of the median layer template was then classified as male or
female, according to the gender known for the upper layer reference template.
Figure 4.9(d) shows Scheme 4. In the training stage, only the median layer
templates were formed, and each subject was represented by a single template. The
median layer constituted both the test and reference pools. In the testing stage, the
Leave-One-Out or exclusive procedure (which is discussed in detail in the next
section) was applied. The distance measure for each median layer template was
calculated with respect to each of the rest of the median layer templates, and the
minimum distance was found. The speaker gender of the test template was then
classified as male or female, according to the gender known for the reference
template. The above steps were repeated until all subjects were tested.
4.5 Resubstitution and Leave-One-Out Procedures
After the classifier is designed, it is necessary to evaluate its performance
relative to competing approaches. The error rate was considered as the performance
measure.
Four popular empirical approaches that count the number of errors when
testing the classifier with a test data set are (Childers, 1989):
The Resubstitution Estimate (inclusive). In this procedure, the same data set
is used for both designing and testing the classifier. Experimentally and
theoretically this procedure gives a very optimistic estimate, especially when the
data set is small. Note, however, that when a large data set is available, this method
is probably as good as any procedure.
The Holdout Estimate. The data is partitioned into two mutually exclusive
subsets in this procedure. One set is used for designing the classifier and the other
for testing. This procedure makes poor use of the data since a classifier designed on
the entire data set will, on the average, perform better than a classifier designed on
only a portion of the data set. This procedure is known to give a very pessimistic
estimate.
The Leave-One-Out Estimate (exclusive). This procedure assumes that there
are n data samples available. Remove one sample from the data set. Design the
classifier with the remaining (n-1) data samples and then test it with the removed
data sample. Return the sample removed earlier to the data set. Then repeat the
above steps, removing a different sample each time, for n times, until every sample
has been used for testing. The total number of errors is the leave-one-out error
estimate. Clearly this method uses the data very effectively. This method is also
referred to as the jackknife method.
The Rotation Estimate. In this procedure, the data set is partitioned into n/d
disjoint subsets, where d is a divisor of n. Then, remove one subset from the design
set, design the classifier with the remaining data and test it on the removed subset,
not used in the design. Repeat the operation for n/d times until every subset is used
for testing. The rotation estimate is the average frequency of misclassification over
the n/d test sessions. When d=1 the rotation method reduces to the leave-one-out
method. When d=n/2 it reduces to the holdout method where the roles of the design
and test sets are interchanged. The interchanging of design and test sets is known in
statistics as cross-validation in both directions. As we may expect, the properties of
the rotation estimate will fall somewhere between the leave-one-out method and
holdout method. The rotation estimate will be less biased than in the holdout
method and the variance is less than in the leave-one-out method.
In order to use the database effectively, the leave-one-out procedure was
adopted for the experiments. For comparison, the resubstitution procedure was also
used in selected experiments.
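For the Scheme 4 configuration, the leave-one-out procedure amounts to the following loop (an illustrative sketch; templates is a hypothetical list of median layer vectors, genders the matching labels, and distance any measure from Section 4.3):

    def leave_one_out_errors(templates, genders, distance):
        """Leave-one-out (exclusive) error count for an NN classifier
        over median layer templates (Scheme 4)."""
        errors = 0
        for i, test in enumerate(templates):
            # Exclude the test subject's own template from the references.
            candidates = [j for j in range(len(templates)) if j != i]
            j_best = min(candidates,
                         key=lambda j: distance(test, templates[j]))
            if genders[j_best] != genders[i]:
                errors += 1
        return errors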
4.6 Separability of Acoustic Parameters
Using Fisher's Discriminant Ratio Criterion
4.6.1 Fisher's Discriminant and F ratio
After choosing the acoustic feature candidates, we can see how well they separate
the genders by analytically studying the database. There are many measures
of separability which are generalizations of one kind or another of the Fisher's
discriminant ratio concept (Childers et al., 1982; Parsons, 1986). The ratio usually
serves as a criterion for selecting features for discrimination.
The ability of a feature to separate classes depends on the distance between
classes and the scatter within classes (generally there will be more than two classes).
This separation is estimated by representing each class by its mean and taking the
variance of the means. This variance is then compared to the average width of the
distribution for each class (i.e., the mean of the individual variances). This measure
is commonly called the F ratio:
F = \frac{\text{Variance of the means (over all classes)}}{\text{Mean of the variances (within classes)}}    (4.27)
The F ratio is reduced to Fisher's discriminant when it is used for evaluating a
single feature and there are only two classes. For this reason, the F ratio is also
referred to as the generalized Fisher's discriminant.
In the case of pattern recognition, there are vectors of features, f, and
observed values of f for all the classes we are interested in recognizing. Then two
covariance matrices can be calculated, depending on how the data are grouped.
First, the covariance for a single recognition class can be computed by
selecting only feature measurements for class i. Let any vector from this class be fi.
Then the within-class covariance matrix for class i is

W_i = \langle (f_i - \mu_i)(f_i - \mu_i)^t \rangle    (4.28)

where \langle \cdot \rangle represents the expectation or averaging operation and \mu_i represents the
mean vector for the ith class: \mu_i = \langle f_i \rangle. W stands for "within." Notice that each of
these covariance matrices describes the scatter within a class; hence it corresponds
to one term of the average in the denominator of (4.27). If we make the common
assumption that the vector f_i is normally distributed, then W_i is the covariance matrix
of the corresponding probability density function:

PDF_i(f) = \frac{1}{(2\pi)^{n/2} |W_i|^{1/2}} \exp \left[ -\frac{(f - \mu_i)^t W_i^{-1} (f - \mu_i)}{2} \right]
Then the denominator of the F ratio can be associated with the average of Wi
over all i; this is called the pooled within-class covariance matrix:
W = \langle W_i \rangle    (4.29)
Second, the variation within classes can be ignored and the covariance
between classes can be found, representing each class by its centroid. The feature
centroid for class i is \mu_i; hence the between-class covariance matrix is

B = \langle (\mu_i - \mu)(\mu_i - \mu)^t \rangle    (4.30)

where \mu is the mean of \mu_i over all classes. B stands for "between." Here we ignore
the detailed distribution within each class and represent all the data for that class by
its mean. Hence B describes the scatter from class to class regardless of the scatter
within a class and in that sense corresponds to the numerator of (4.27).
Then the generalization we seek should involve a ratio in which the numerator
is based on B and the denominator on W, since we are looking for features with
small covariances within classes and large covariances between classes. Fukunaga
(1972) lists four such measures, two of which are

J_1 = \text{trace}(W^{-1}B)    (4.31)

J_4 = \text{trace}(B) / \text{trace}(W)    (4.32)
The motivation for these measures is clearer for J4 since we know that the trace of a
covariance matrix provides a measure of the total variance of its associated
variables (Parsons, 1986). If the value of J4 for a feature is relatively greater than
that for the other feature, then there is apparently more scatter between classes than
within classes for this feature, and this feature set is a better one than the other for
discrimination. J4 tests this ratio directly. The motivation for J1 is less obvious and
will have to await the presentation of the material below.
4.6.2 Divergence and Probability of Error
The distance between two classes in feature space may also be evaluated by
the divergence, which is defined as the difference in the expected values of their
log-likelihood ratios (Kullback, 1959; Tou and Gonzales, 1974). This measure has
its roots in information theory (Kullback, 1959) and is a measure of the average
amount of information available for discriminating between class i and class k. It
can be shown that for features with multivariate normal densities, the divergence is
given by

D_{ik} = 0.5\, \text{trace}[(W_i - W_k)(W_k^{-1} - W_i^{-1})]
       + 0.5\, \text{trace}[(W_i^{-1} + W_k^{-1})(\mu_i - \mu_k)(\mu_i - \mu_k)^t]    (4.33)
This can be related to more familiar material as follows. If the covariance
matrices of the two classes are equal, so that W_i and W_k can be replaced by an average
covariance matrix W, then the first term vanishes and the divergence reduces to

D_{ik} = \text{trace}[ W^{-1} (\mu_i - \mu_k)(\mu_i - \mu_k)^t ]
       = (\mu_i - \mu_k)^t W^{-1} (\mu_i - \mu_k)

The term (\mu_i - \mu_k)(\mu_i - \mu_k)^t is the between-class covariance matrix B; hence in
this case D_{ik} is the separability measure J_1 = \text{trace}(W^{-1}B).
Notice that D_{ik} or J_1 is the Mahalanobis distance. This distance is related to
the approximation of the expected probability of error (PE) by Lachenbruch (1968),
Achariyapaopan and Childers (1983), and Childers (1986). If p is the dimension of
the feature vector, n_1 and n_2 are the sample sizes for classes 1 and 2, and \Phi(z) is the
standard normal distribution function defined as

\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp(-0.5 u^2)\, du    (4.34)

then PE can be written as

PE = 0.5\, \Phi[-(\alpha - \beta)] + 0.5\, \Phi[-(\alpha + \beta)]    (4.35)

where

\alpha = C \left[ J_1 + p(n_1 + n_2)/(n_1 n_2) \right]^{1/2}    (4.36)

\beta = C\, p(n_2 - n_1) \left[ n_1 n_2 (J_1 n_1 n_2 + p(n_1 + n_2)) \right]^{-1/2}    (4.37)

C = 0.5 \left[ \frac{(n_1 + n_2 - p - 2)(n_1 + n_2 - p - 5)}{(n_1 + n_2 - 3)(n_1 + n_2 - p - 3)} \right]^{1/2}    (4.38)
For fixed training sample sizes n_1 and n_2 and vector dimension p, PE decreases
as the Mahalanobis distance J_1 increases.
In the coarse analysis stage, the estimated J1, J4, and expected probabilities of
errors of the acoustic features ARC, LPC, FFF, RC, and CC, which were derived
from male and female groups (i.e., classes) in three phoneme categories, were
studied. Training sample sizes were 27 (n1) for the male group and 25 (n2) for the
female group since median layer templates (one for each subject) were used to
constitute the training sample pools. The feature vector dimension p was equivalent
to the filter order selected. For each of the acoustic features ARC, LPC, FFF, RC,
and CC in each of the three phoneme categories, the estimated J1, J4, and expected
probability of error were computed as follows:
(1) Estimate the within-gender covariance matrix Wi for each gender
using Equation (4.28).
(2) Compute the pooled (averaged) within-gender covariance matrix
W using Equation (4.29).
(3) Estimate the between-gender covariance matrix B using Equation (4.30).

(4) Obtain the values of J4 and the Mahalanobis distance J1 from
matrices W and B using Equations (4.32) and (4.31).

(5) Finally, calculate the value of PE from J1, n1, n2, and p using
Equations (4.34) to (4.38) (a sketch of these steps is given below).
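Steps (1) through (5) translate into the following illustrative sketch (numpy and scipy assumed; this is a sketch of Equations (4.27) through (4.38), not the original experimental software, and the PE expression in particular should be treated as indicative):

    import numpy as np
    from scipy.stats import norm   # standard normal CDF of Eq. (4.34)

    def separability(male_feats, female_feats):
        """J1, J4, and expected probability of error PE for two classes.
        Each argument is an (n_subjects x p) array of feature vectors."""
        n1, n2 = len(male_feats), len(female_feats)
        p = male_feats.shape[1]
        # (1)-(2) within-gender covariances and their pooled average.
        W = 0.5 * (np.cov(male_feats, rowvar=False)
                   + np.cov(female_feats, rowvar=False))
        # (3) between-gender covariance from the two centroids.
        mu1, mu2 = male_feats.mean(axis=0), female_feats.mean(axis=0)
        mu = 0.5 * (mu1 + mu2)
        B = 0.5 * (np.outer(mu1 - mu, mu1 - mu) + np.outer(mu2 - mu, mu2 - mu))
        # (4) separability measures J1 and J4.
        J1 = float(np.trace(np.linalg.solve(W, B)))
        J4 = float(np.trace(B) / np.trace(W))
        # (5) expected probability of error, Eqs. (4.35)-(4.38).
        C = 0.5 * np.sqrt((n1 + n2 - p - 2) * (n1 + n2 - p - 5)
                          / ((n1 + n2 - 3.0) * (n1 + n2 - p - 3)))
        alpha = C * np.sqrt(J1 + p * (n1 + n2) / (n1 * n2))
        beta = C * p * (n2 - n1) / np.sqrt(
            n1 * n2 * (J1 * n1 * n2 + p * (n1 + n2)))
        PE = 0.5 * norm.cdf(-(alpha - beta)) + 0.5 * norm.cdf(-(alpha + beta))
        return J1, J4, float(PE)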
The analytical results were also compared to the empirical ones obtained from
experiments using recognition schemes. Section 5.3 in the next Chapter presents a
detailed comparison.
RESULTS OF RECOGNITION BASED ON COARSE ANALYSIS
Our results showed that most of the LPC-derived feature parameters
performed well for gender recognition. Among them, the reflection coefficient
combined with the Euclidean distance measure was the best choice for sustained
vowels (100%). While the cepstral distortion measure worked extremely well for
unvoiced fricatives, the LPC log likelihood distortion measure, the reflection
coefficient combined with the Euclidean distance, and the cepstral distortion
measure were good alternatives for voiced fricatives. Using the Euclidean
distance measure achieved better results than using the Probability Density
Function. Furthermore, the averaging techniques were very important in designing
appropriate test and reference templates and a filter order of 12 to 16 was sufficient
for most designs.
5.1 Coarse Analysis Conditions
Before we discuss in detail the performance assessments based on coarse
analysis, we briefly summarize the experimental conditions as follows:
52 normal subjects: 27 male and 25 female
Phoneme group used: ten sustained vowels
five unvoiced fricatives
four voiced fricatives
Method: asynchronous autocorrelation LPC
Filter order: 8, 12, 16, 20
Analysis frame size: 256 points/frame
Frame overlap: None
Preemphasis Factor: 0.95
Analysis Window: Hamming
Data set for coefficient calculations: six frames total. Then
by averaging the six sets of coefficients obtained from these six
frames, a template coefficient set was calculated for each utterance.

Acoustic parameters:
LPC coefficients (LPC)
Cepstrum Coefficients (CC)
Autocorrelation Coefficients (ARC)
Reflection Coefficients (RC)
Fundamental Frequency and Formant Information (FFF)
FFF was obtained by a 12th order Closed Phase WRLS-VFF method
discussed in the next Chapter.
Distance measures:
Euclidean Distance (EUC)
LPC Log Likelihood Distance (LLD)
Cepstral Distortion (same as EUC)
Weighted Euclidean Distance (WEUC)
Probability Density Function (PDF)

Counting error procedures: Resubstitution (inclusive) and Leave-One-Out (exclusive)
Parameters based on Fisher's discriminant ratio criterion:
J4, J1, and the expected probability of error
5.2 Performance Assessments
Since the WEUC is the simplified case of the PDF and the results produced by
the WEUC were very similar to those produced by the PDF in our experiments, only
the results obtained by the PDF will be discussed.
The complete results of the experiments are tabulated in Appendices A and B.
Appendix A presents the recognition rates for LPC log likelihood and cepstral
distortion measures with various phoneme categories, recognition schemes, and
filter orders. Inclusive procedures were only performed for the acoustic parameters
with a filter order of 16.
Appendix B presents the recognition rates for various acoustic parameters
(ARC, LPC, RC, CC) combined with EUC or PDF distance measures with different
phoneme categories and filter orders. Notice that only recognition Scheme 3 was
used for these experiments. Again, inclusive procedures were only performed for
the acoustic parameters with a filter order of 16. Since calculation of the cepstral
distortion measure was done by using the EUC, the results of the CC combined with
the EUC were directly extracted from Appendix A.
5.2.1 Comparative Study of Recognition Schemes
Tables 5.1 and 5.2 show the condensed results selected from Appendix A,
using the LPC log likelihood and cepstral distortion measures respectively.
Recognition rates for the four exclusive recognition schemes with various filter
orders are included. Figures 5.1 and 5.2 are graphic illustrations of Tables 5.1 and
5.2, respectively.

By observing the curves of Figures 5.1 and 5.2, it can immediately be seen that
higher recognition rates were achieved using recognition Schemes 3 and 4 for all the
cases, including various filter orders combined with different phoneme categories.
Among them, by applying Schemes 3 and 4 to voiced fricatives, over 90%
recognition rates were accomplished for all filter orders and both distortion
measures. The highest correct recognition rate, 98.1%, was obtained for Scheme 4
by using the LPC log likelihood measure with a filter order of 8. The same rates
were obtained for Scheme 3 by using the LPC log likelihood measure with a filter
order of 20 and using the cepstral distortion measure with filter orders of 12 and 16.
The results indicated the following:
1. Choosing appropriate template forming and recognition schemes
was important in achieving high correct recognition rates.
Particularly, the use of averaging techniques was critical, since the
highest recognition rates were obtained by using Schemes 3 and 4, in
both of which the test and reference templates were formed by
averaging all the utterances from the same subject or even the same
gender (Scheme 3). In contrast, Schemes 1 and 2, in which the test
template was formed from a single utterance, performed worse.
Table 5.1 Results from exclusive recognition schemes with various filter orders and
the LPC log likelihood distortion measure

CORRECT RATE %
                     Order=8  Order=12  Order=16  Order=20
Scheme 1 63.1 69.6 74.2 74.2
Sustained Scheme 2 65.2 71.5 76.2 76.5
Vowels Scheme 3 75.0 86.5 86.5 84.6
Scheme 4 75.0 80.8 86.5 88.5
Scheme 1 59.2 64.2 67.7 65.0
Unvoiced Scheme 2 61.5 63.9 64.2 64.2
Fricatives Scheme 3 67.3 75.0 75.0 78.9
Scheme 4 76.9 75.0 73.1 69.3
Scheme 1 74.5 72.1 73.1 72.6
Voiced Scheme 2 77.4 80.3 81.7 80.3
Fricatives Scheme 3 90.4 94.2 96.2 98.1
Scheme 4 98.1 96.2 96.2 94.3
Table 5.2 Results from exclusive recognition schemes with various filter orders and
the cepstral distortion measure

CORRECT RATE %
                     Order=8  Order=12  Order=16  Order=20
Scheme 1 61.3 68.3 69.4 70.6
Sustained Scheme 2 69.4 67.3 70.0 72.1
Vowels Scheme 3 82.7 92.3 90.4 90.4
Scheme 4 90.4 92.3 92.3 88.5
Scheme 1 61.2 65.8 63.9 64.6
Unvoiced Scheme 2 58.8 61.5 62.7 64.2
Fricatives Scheme 3 71.2 75.0 84.6 82.7
Scheme 4 78.8 88.5 90.4 88.5
Scheme 1 79.3 82.7 81.3 80.8
Voiced Scheme 2 75.5 82.2 84.6 85.1
Fricatives Scheme 3 94.2 98.1 98.1 96.2
Scheme 4 92.3 92.3 92.3 90.4
Figure 5.1 Results from exclusive recognition schemes with various
filter orders and the LPC log likelihood distortion measure.
Figure 5.2 Results from exclusive recognition schemes with various
filter orders and the cepstral distortion measure.
2. Averaging techniques seemed more crucial than clustering
techniques. Recognition theory states that choosing several
clustering centers for the same reference group should increase the
correct classification rate, because intragroup variations are taken
into account. In Schemes 1 and 4, multi-clustering reference centers
were formed. However, the theory functioned well in Scheme 4 but
inadequately in Scheme 1, although in Scheme 1, a number of
clustering centers were selected for the reference of the same
gender. Furthermore, Scheme 3 was a further simplified version of
Scheme 2 and Scheme 4. Instead of using each test vowel of the
subject as in Scheme 2, only a single test template was employed for
each test subject and a single reference template for each reference
gender. But the results were almost as good as those achieved by
using Scheme 4. In Scheme 2, averaging was performed over the
reference template but not over the test template. The correct
recognition rates were low. Thus, the results suggested the
importance of averaging techniques. To a great extent, averaging on
both test and reference templates eliminated the intrasubject
variation or diversity within different vowels or fricatives of a given
speaker, but on the other hand, emphasized features representing
this speaker's gender.
3. Since the averaging was applied to the acoustic parameters extracted
from different phonemes uttered by a speaker or speakers of the
same gender, and the phonemes were produced at different times,
the averaging is essentially a time-averaging technique.
Therefore, we would reasonably deduce that gender information
is time-invariant, phoneme independent, and speaker independent.
Because of this, averaging emphasized the speaker's gender
information and increased the between-to-within gender variation
ratio. In practice we would achieve free-text gender recognition in
which gender identification would be determined before recognition
of speech or speaker and thus reduce the speech or speaker
recognition search space to half.
The conclusion is consistent with the findings by Lass and Mertz
(1978) that temporal cues appeared not to play a role in speaker
gender identification. As we cited earlier, in their listening tests they
found that gender identification accuracy remained high and
unaffected by temporal speech alterations when the normal temporal
features of speech were altered by means of the backward playing
and time compressing of speech samples.
We have shown in the previous section that use of the long-term
average technique to form a feature vector was discovered to have
potential for free-text speaker recognition in the Pruzansky study
(1963). Speaker recognition error rates remained undegraded even
after averaging spectral amplitudes over all frames of speech data
into a single reference spectral amplitude vector for each talker.
Markel et al. (1977) also demonstrated that the between-to-within
speaker variation ratio was significantly increased by long-term
averaging of the parameter sets (thus text-free). Here we found that this
rule also applied to the gender recognition.
4. In terms of Scheme 3 versus Scheme 4, neither was obviously
superior. However, from a practical point of view, Scheme 3 would
be easier to realize since only two reference templates are needed.
5. In further experiments, different weighting factors could be applied
to different phoneme feature vectors according to the probabilities of
their appearances in real situations. In this way, time-averaging
would be better approximated.
5.2.2 Comparative Study of Acoustic Features
5.2.2.1 LPC Parameter Versus Cepstrum Parameter
Although both LPC log likelihood and cepstral distortion measures were
effective tools in classifying male/female voices, the performance of the latter was
better than that of the former.
1. By comparing Figures 5.1 and 5.2, it is noted that except in the
category of voiced fricatives, in which the performances were
competitive (both measures were able to achieve recognition rates of
98.1%), cepstrum coefficient features proved to be more sensitive
than LPC coefficients for gender discrimination. By choosing
appropriate schemes and filter orders the recognition rates for the
cepstral distortion measure reached 92.3% for vowels (Scheme 3
with a filter order of 12 and Scheme 4 with filter orders of 12 and
16), 90.4% for unvoiced fricatives (Scheme 4 with a filter order of
16). By using the LPC log likelihood distortion measure, the
corresponding highest recognition rates were 88.5% for vowels
(Scheme 4 with a filter order of 20) and 78.9% for unvoiced
fricatives (Scheme 3 with a filter order of 20).
2. By comparing tables in Appendix A, it is noted that the cepstral
distortion measure operated more evenly between male and female
groups, showing this feature has some "normalization
characteristics". As seen in Table A.1, there existed large
differences between male and female recognition rates for the LPC
recognizer with a filter order of 16. The largest gaps came from
Scheme 1 of the LPC. The differences were about 19% for vowels,
24% for unvoiced fricatives, and 15% for voiced fricatives. On the
other hand, the cepstral distortion measure worked evenly with the
same filter order. Table A.7 shows that the largest gaps were 6.6%
for vowels, 1.8% for unvoiced fricatives, and 2.4% for voiced
fricatives. Similar situations held for the results shown in other
tables and with inclusive schemes.
5.2.2.2 Other Acoustic Parameters
Tables 5.3 and 5.4 demonstrate results from exclusive recognition Scheme 3
with various filter orders and other acoustic parameters, using EUC and PDF
distance measures respectively. Figures 5.3 and 5.4 are graphic illustrations of
Tables 5.3 and 5.4.
1. The overall performance using RC and cepstrum coefficients was
better than that achieved using ARC and LPC coefficients, when the
Euclidean distance measure was adopted. The following
observations were made:
o The RC functioned extremely well with sustained vowels.
The recognition rates remained 100% for filter orders of 12,
16, and 20, showing that RC features captured gender
information from vowels effectively. The results were also
stable and filter order independent, as long as the filter order
was above 8. Table 5.3 shows that a 98.1% recognition rate
was reached by using the FFF, which was obtained using the
closed-phase WRLS-VFF method with a filter order of 12.
Table 5.3 Results from the exclusive recognition Scheme 3 with various filter orders,
acoustic parameters, and the Euclidean distance measure

CORRECT RATE %
                  Order=8  Order=12  Order=16  Order=20
Sustained   ARC    78.8     78.8      78.8      82.7
Vowels      LPC    73.1     78.8      80.8      80.8
            FFF    N/A      98.1      N/A       N/A
            RC     88.5     100.0     100.0     100.0
            CC     82.7     92.3      90.4      90.4
Unvoiced    ARC    75.0     75.0      75.0      75.0
Fricatives  LPC    80.8     69.2      71.2      71.2
            RC     80.8     80.8      80.8      80.8
            CC     71.2     75.0      84.6      82.7
Voiced      ARC    86.5     88.5      86.5      88.5
Fricatives  LPC    92.3     92.3      92.3      90.4
            RC     94.2     96.2      96.2      96.2
            CC     94.2     98.1      98.1      96.2
Table 5.4 Results from the exclusive recognition Scheme 3 with various filter orders,
acoustic parameters, and the PDF (Probability Density Function) distance measure

CORRECT RATE %
                  Order=8  Order=12  Order=16  Order=20
Sustained   ARC    80.8     84.6      88.5      67.3
Vowels      LPC    84.6     98.1      92.3      80.8
            FFF    N/A      96.2      N/A       N/A
            RC     88.5     98.1      92.3      67.3
            CC     78.8     94.2      90.3      75.0
Unvoiced    ARC    69.2     65.4      57.7      N/A
Fricatives  LPC    78.8     86.5      78.8      53.8
            RC     78.8     73.1      67.3      55.8
            CC     80.8     73.1      69.2      57.7
Voiced      ARC    88.5     86.5      82.7      59.6
Fricatives  LPC    92.3     94.2      94.2      71.2
            RC     92.3     90.4      90.4      75.0
            CC     92.3     92.3      80.8      71.2
Figure 5.3 Results from the exclusive recognition scheme 3 with various
filter orders, acoustic parameters, and the Euclidean distance measure.
Figure 5.4 Results from the exclusive recognition scheme 3 with various
filter orders, acoustic parameters, and the PDF distance measure.
o The CC worked very well with unvoiced fricatives. The
highest recognition rate was 84.6% with a filter order of 16.
o Both RC and CC operated extremely well with voiced
fricatives, and the results remained stable across all the filter
orders that were tested. The results generated by the CC were
slightly better than those of the RC with filter orders of 12 and 16
(98.1% versus 96.2%).
2. When the PDF distance measure was adopted, using LPC
coefficients was a good choice. The highest recognition rates,
98.1%, 86.5%, and 94.2%, for the three phoneme categories all came from
LPC coefficients with a filter order of 12 (also for voiced fricatives
with a filter order of 16). However, it is noted that the results from
the PDF distance measure were highly affected by filter orders.
5.2.3 Comparative Study Using Different Phonemes
The results of Figures 5.1, 5.2, 5.3, and 5.4 also indicated that vowels, unvoiced
fricatives, or voiced fricatives could each be used to objectively classify a
speaker's gender. As we have seen before, for the vowel category, reflection coefficients
worked extremely well. The recognition rates reached 100% with filter orders of 12,
16, and 20. A 98.1% recognition rate could also be accomplished by using LPC
coefficients and the FFF. Surprisingly, the cepstral distortion measure could be used
to discriminate a speaker's gender from unvoiced fricatives, and a 90.4% recognition
rate was obtained. For the voiced fricative category, a 98.1% recognition rate was
achieved by using the LPC log likelihood and cepstral distortion measures. Therefore,
in terms of the "most effective" phoneme category for gender recognition, the first
preference would be vowels, the second voiced fricatives, and the last
unvoiced fricatives.
As discussed in Chapter 4, the acoustic parameters used in coarse analysis
were derived from the LPC all-pole model that has a spectral matching
characteristic. It is known that LPC log likelihood and cepstral distortion measures
are directly related to the power spectral differences of the test and reference
signals. Thus, the results indicated that the spectral characteristics were major
factors in distinguishing the speaker's gender. Also, our results suggested that there
did exist significant differences between the spectral characteristics of unvoiced
fricatives for the two genders, indicating that the speaker's gender could be
distinguished based only on the speaker's vocal tract characteristics, since no vocal fold
information existed in unvoiced fricatives. Moreover, when some of the vocal fold
information was combined with vocal tract characteristics, as in the vowel and voiced
fricative cases, the gender discrimination improved, even though in both of the
above cases the fundamental frequency information was not included.
5.2.4 Comparative Study of Filter Order Variation
5.2.4.1 LPC Log Likelihood and Cepstral Distortion Measure Cases
1. By observing the resultant curves of Schemes 1 and 2 from Figures
5.1 and 5.2, a general trend is easily noted. The recognition rates
generally improved by using higher order filters.
2. However, this trend was not observed for Schemes 3 and 4.
Individual inspection had to be made for specific cases. The
recognition rates for Schemes 3 and 4 together with LPC log
likelihood and cepstral distortion measures were first considered.
Notice that there are a total of 12 rates for each of the filter orders.
o Comparing the recognition rates between filter orders of 8
and 12, 9 out of 12 rates increased, 2 of them decreased
(Scheme 4 for voiced and unvoiced fricatives with the LPC
log likelihood distance), and 1 of them tied (Scheme 4 for
unvoiced fricatives with the cepstral distortion measure).
This demonstrated that performance improved from filter
orders of 8 to 12.
o Comparing the recognition rates between filter orders of 12
and 16, out of 12 rates, 4 of them dropped. Two of them
increased (unvoiced fricatives Scheme 4 with the LPC
distance measure and sustained vowels Scheme 3 with the
cepstral distortion measure). The remaining 6 were equal.
This indicated that by using Scheme 3 or 4, there was not a
distinct difference between filter orders of 12 and 16.
o Comparing the recognition rates between filter orders of 16
and 20, out of 12 rates, 8 of them dropped. Three of them
increased and one was tied. Performance degraded from
filter orders of 16 to 20.
o If only the cepstral distortion measure was applied (Figure
5.2), the highest recognition rates appeared at filter orders of
12 and 16 for all three phoneme categories. However, if only
the LPC log likelihood distortion was used (Figure 5.1), the
highest recognition rates were reached with filter orders of 8
and 20. Therefore, there was no manifest trend of
performance difference for the LPC log likelihood distortion
measure across filter orders. Since the cepstral distortion
measure showed better results than the LPC log likelihood,
the filter orders of 12 to 16 seemed -to be best options for the
5.2.4.2 Euclidean Distance Versus Probability Density Function
1. By examining Figure 5.3, it is seen that using the Euclidean distance
measure increased the recognition rates slightly from filter orders of
8 to 12, with the exception of the LPC for unvoiced fricatives.
Recognition rates with a filter order of 12 were almost the same as
with 16 and 20. No specific trend was observed. Except for the ARC
applied to the vowel category, all other performances reached their
peaks with either filter order of 12 or 16. It can be concluded that by
using the EUC distance measure, the best choice of filter order
would be around the range from 12 to 16.
2. However, Figure 5.4 shows us a different case. By inspecting Figure
5.4, it is immediately concluded that by using the PDF, gender
recognition rates varied considerably with the filter order. The
overall trend for the vowel category is that recognition rates
increased from a filter order of 8, reached their peak with a filter order
of 12, and then decreased. One exception is that by using the RC, the
recognition rate reached its peak with a filter order of 16. All
acoustic parameters, except the LPC, for voiced and unvoiced
fricatives showed decreasing recognition rates from a filter order of
8 to an order of 20. By using LPC coefficients, performance showed
some improvement from a filter order of 8 to 12 and 16 and then
degraded. Finally, recognition accuracies severely declined from a
filter order of 16 to an order of 20 for all three phoneme categories
and all acoustic parameters. It can be concluded that using the PDF,
the best option for filter order would be 8 or 12.
5.2.5 Comparative Study of Distance Measures
It is generally believed that the use of the EUC distance measure is not as
effective as the use of the PDF because there is no normalization of the dimensions
involved in the definition of the EUC. The largest value dimension becomes the
most significant. In contrast, the PDF approach has such a normalization function
through the covariance matrix computation. The PDF approach gives unequal
weighting for each element of a vector. It may suppress the elements with large
values but emphasize the elements with small values according to their importance
in reducing the intragroup variation.
However, the PDF approach did not work well in our experiments. By
observing Tables 5.3 and 5.4 as well as Figures 5.3 and 5.4, it can be seen that the
EUC outperformed the PDF.
First, out of 48 corresponding pairs of EUC and PDF recognition rates from
Tables 5.3 and 5.4, 32 EUC recognition rates were higher than those of the PDF.
Three of them were tied. Only 13 PDF recognition rates were higher than the EUC rates.
Second, as we have demonstrated in the sections above, performance using
PDF varied considerably with the filter order and there were severe performance
declines from the order of 16 to order of 20 for all three phoneme categories and all
acoustic parameters. On the other hand, the results of the EUC were relatively
consistent across all the filter orders that were tested, especially with filter orders of
12, 16, and 20.
Third, the two highest rates from Tables 5.3 and 5.4 for the three phoneme groups
were achieved using the EUC distance measure. The RC with the EUC yielded a 100%
recognition rate for sustained vowels with filter orders of 12, 16, and 20. The CC
with the EUC yielded a 98.1% recognition rate for voiced fricatives with filter orders
of 12 and 16. Even for unvoiced fricatives, the highest recognition rates
came from the CC with the EUC using Scheme 4 (Table 5.2).
They were 88.5% with a filter order of 12 and 90.4% with a filter order of 16.
Fourth, the EUC distance measure functioned more evenly on male and
female groups than did the PDF. By examination of all tables of exclusive schemes
in Appendix B, 43 male recognition rates were higher than those of females for a total
of 48 PDF pairs. Only 5 female recognition rates were higher than those of males. And
the largest difference between gender groups was 68.6% for the PDF (from the ARC
for unvoiced fricatives with a filter order of 20). On the other hand, in 29 out of 49
EUC pairs, male rates were higher than those of females. And the largest gap
between gender groups was only 21% (from the LPC for vowels with a filter order of
A possible reason for this inferior PDF performance is the small ratio of
the available number of subjects per gender to the number of elements
(measurements) per feature vector. The assumption when using the PDF distance
measure to design a classifier is that the data are normally (Gaussian) distributed.
In this case, many factors are considered (e.g., the size of the training or design set
and the number of measurements (observations, samples) in the data record (or
vector)). Foley (1972) and Childers (1986) pointed out that if the ratio of the
available number of samples per class (in this study, number of subjects per gender)
to the number of samples per data record (in this study, number of elements per
feature vector) is small, then data classification for both design and test sets may be
unreliable. This ratio should be on the order of three or larger (Foley, 1972). In our
study, the ratios were 3.25 (26/8), 2.17 (26/12), 1.63 (26/16), and 1.3 (26/20) for
filter orders of 8, 12, 16, and 20 respectively. The value of 3.25 satisfied the
requirement but the others were too small. Therefore, with the exception of the
results with a filter order of 8, where the performances of the PDF and EUC were
comparable, the PDF approach did not function well. The smaller the ratio, the
worse the PDF performed.
5.2.6 Comparative Study Using Different Procedures
Performance differences between resubstitution (inclusive) and
Leave-One-Out (exclusive) procedures were also tested with a filter order of 16.
Tables A.9 and A.10 in Appendix A present the inclusive recognition results for LPC
log likelihood and cepstral distortion measures with various recognition schemes
respectively. Tables B.9 and B.10 in Appendix B show the recognition results from
inclusive recognition Scheme 3 with various acoustic parameters, using EUC and
PDF distance measures respectively.
The results presented in Appendix A indicate that the correct recognition rates
of the exclusive recognition procedure (Tables A.3 and A.7) were not greatly degraded
compared to those obtained from the inclusive recognition procedure, especially for
the cepstral distortion measure. For the cepstral distortion measure with Scheme 3,
the rates decreased from 94.2% to 90.4% for vowels, from 86.5% to 84.6% for
unvoiced fricatives, and remained constant for voiced fricatives. The maximum
decrease of the rates was less than 4%. For the LPC log likelihood with Scheme 3,
the rates degraded from 92.3% to 86.5% for vowels, 82.7% to 75% for unvoiced
fricatives, and 100% to 96.2% for voiced fricatives. Here the maximum rate decrease was
7.7%, observed for unvoiced fricatives. In contrast, the results from the partial
database of 21 subjects, which we analyzed before we completed our data collection
of the entire database, showed a much larger decrease from inclusive to exclusive
procedures. Recognition rates dropped more than 14% for unvoiced fricatives
more than 9% for voiced fricatives. This convinced us that the larger the database,
the smaller the performance difference between inclusive and exclusive procedures.
One interesting observation from Tables B.9 and B.10 in Appendix B was that
when using the PDF distance measure with the inclusive procedure, the correct
recognition rates were extremely high for all types of phonemes and feature vectors,
except for the ARC with unvoiced fricatives (it was still 98.1%). In addition, the
LPC, RC and CC were all able to provide 100% correct gender recognition from
unvoiced fricatives! However, when using the PDF with the exclusive procedure (Table
B.7), the correct recognition rate decreased significantly, with drops ranging from a
minimum of 5.8% to a maximum of 40.4% (for unvoiced fricatives, drops ranging
from a minimum of 21.2% to a maximum of 40.4%). On the other hand, the EUC
distance measure operated more evenly. From inclusive to exclusive procedures
(Table B.3), recognition rates dropped very little, ranging from a minimum of 0% to
a maximum of 3.9%. The rates for four feature parameters did not decrease at all.
Figures 5.5(a) and (b) are graphic illustrations of Tables B.3 and B.9. It can be seen
that there was only minor performance difference between inclusive and exclusive
procedures when the EUC distance measure was used. Our results also suggested
that the PDF excelled at capturing the information from an individual subject. As
long as the data of the subject itself was included in the reference data set, the PDF
was able to pick up such specific information easily and then identify the subject's
gender accurately. Therefore, the correct recognition rates of the inclusive
procedure for the PDF were extremely high. However, the PDF recognition rates of
the exclusive procedure were much lower, indicating that the PDF was clearly
inferior at capturing gender information from the other 'average' subject with the
same gender. On the other hand, the EUC distance measure was good at capturing
gender information from the other subjects without including the characteristic of
the test subject itself.
Figure 5.5 Results of recognition Scheme 3 with the EUC and
a filter order of 16 for (a) exclusive procedure
(b) inclusive procedure.