
Title: An Evaluation of selected acoustic parameters for use in speaker identification /
Full Citation
Permanent Link: http://ufdc.ufl.edu/UF00098140/00001
 Material Information
Title: An Evaluation of selected acoustic parameters for use in speaker identification /
Alternate Title: Speaker identification
Physical Description: vi, 48 leaves : ; 28cm.
Language: English
Creator: Doherty, Edward Thomas, 1936-
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 1975
Copyright Date: 1975
Subject: Auditory perception   ( lcsh )
Speech thesis Ph. D   ( lcsh )
Dissertations, Academic -- Speech -- UF   ( lcsh )
Genre: bibliography   ( marcgt )
non-fiction   ( marcgt )
Thesis: Thesis--University of Florida.
Bibliography: Bibliography: leaves 45-47.
General Note: Typescript.
General Note: Vita.
Statement of Responsibility: by Edward Thomas Doherty.
 Record Information
Bibliographic ID: UF00098140
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: alephbibnum - 000229638
oclc - 02272610
notis - AAZ6926


Full Text







The author gratefully acknowledges the counsel of Dr. Harry Hollien

who provided guidance and support throughout the writer's career at

the University of Florida and throughout the course of this study.

The author is also pleased to acknowledge the constructive comments

of his supervisory committee, composed of Drs. Teas, Paige and Rothman.



TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF TABLES

ABSTRACT

I INTRODUCTION

II PROCEDURE

III RESULTS AND DISCUSSION

IV SUMMARY AND CONCLUSIONS

APPENDIX

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH



LIST OF TABLES

1 Classification of Observations Based on Long-Term Power Spectra (LTS)

2 A Comparison of the Classification of Speakers by Means of Long-Term Speech Spectra Using Euclidean Distance, Cross-Correlation and Discriminant Analysis Techniques (N=50)

3 Classification of Observations Based on Speaking Fundamental Frequency (SFF)

4 Classification of Speakers Based on Two Forms of Speaking Time Vector (N=50)

5 Classification of Observations Based on Speaking Time (ST)

6 Classification of Observations Based on Various Combinations of the Limited Passband Long-Term Power Spectra (LTS), Speaking Fundamental Frequency (SFF) and Speaking Time (ST)

Abstract of Dissertation Presented to the Graduate Council of
the University of Florida in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy



Edward Thomas Doherty

Chairman: Harry Hollien
Major Department: Speech

An investigation was conducted to examine the effectiveness of

certain acoustic and temporal properties of the speech signal in the

determination of a speaker's identity from his voice alone.

Specifically, the purposes of this research were to: (1) examine whether

long-term power spectra (LTS), speaking fundamental frequency (SFF),

and speaking time (ST) extracted separately from a speaker's vocal

output provide sufficient bases to make judgements of a talker's

identity; (2) test whether speaker identification rates can be im-

proved by using the three vectors (LTS, SFF and ST) in various combi-

nations; (3) test whether the specified vectors are sufficiently ro-

bust to function in the presence of distortions -- limited passband,

stress or disguise; and (4) evaluate differences in correct identifi-

cation levels when various analytical procedures, i.e., Euclidean dis-

tance, cross-correlation or discriminant analysis, are used.

Readings of "An Apology for Idlers" were recorded from two groups

of speakers. The first group consisted of 50 college-age males who

performed the reading "as naturally as possible." The second group was

made up of 25 males, aged 25-45, who first read the passage in a nor-

mal fashion but who also were required to read it while subjected to

stress and while attempting voice disguise.

Acoustic/temporal analyses were performed on the speakers' utter-

ances to extract the LTS, SFF and ST vectors. To simulate a limited

passband for LTS, only 11 of the 23 parameters were employed. The re-

sults indicated that: (1) the LTS vector was extremely effective for

identifying speakers if the utterances were produced normally, (2) SFF

and ST were far less effective, (3) combining vectors usually improved

correct identification levels, (4) when the speech was recorded while

the talker was under stress or attempting a disguise, no single vector

or combination adequately differentiated talkers, and (5) under the

limited conditions of this study, a discriminant analysis is a more

efficient and practical method of determining a talker's identity from

these vectors than is cross-correlation or Euclidean distance.

The fact that no single vector or combination was capable of

maintaining the high identification rates found for normal productions

under conditions of stress or disguise indicates that none of the vec-

tors measure an invariant characteristic of an individual's speech.


The concept of determining the identity of a talker from acous-

tic information alone is intuitively acceptable. For example, any

individual can recognize a large number of known speakers simply upon

hearing their voices. The speech sample in question may be produced

by a relative, politician, movie or television personality, etc., i.e.,

a person who previously has supplied sufficient material to provide the

listener with a set of discrimination criteria. In fact, several in-

vestigators (Bricker and Pruzansky, 1966; LaRiviere, 1974; LaRiviere,

1975; Majewski, Hollien and Doherty, unpublished manuscript; Pollack,

Pickett and Sumby, 1954; Stevens, Williams, Carbonell and Woods, 1968)

have demonstrated that the aural mechanism can be used effectively to

identify talkers. In particular, Stevens et al. (1968) found that, for

single words, an aural technique was more effective than spectrography.

Bricker and Pruzansky (1966) reported an aural identification level of

98% when the listeners heard entire sentences from the speakers. In

another study in this area, Majewski et al. (unpublished manuscript)

demonstrated that listeners can reliably identify their co-workers under

normal and stress conditions and, apparently, their identification cap-

ability is well above chance even when the talkers attempt to disguise

their voices.

In addition to aural speaker identification, several techniques

based on machine processing have been utilized. For example, spectro-

graphy can be employed to produce an analog representation of speech

which, in turn, can be examined visually in order to permit the ident-

ification of talkers. In fact, Kersta's (1962) claim that speakers

can be identified from simple visual examinations of spectrograms has

stirred a great deal of controversy. Because this technique has been

colloquially labeled "voiceprints," naive individuals are led to believe

it can approximate the precision of identification implied by

such a label (i.e., its implicit association with fingerprints). If

such were the case, all that would be necessary for the establishment

of an almost perfectly differentiating system -- based on a defined,

albeit subjective, set of parameters -- would be the refinement and

formulation of the procedure. However, as Hollien (1974) observed,

such efforts, even under controlled laboratory conditions, have not

reasonably shown that the "voiceprint" method is capable of producing

the order of identification accuracy claimed. Indeed, most other

scientists have repudiated the method; Vanderslice (1966) was among

the first to do so. While he was not able to test the specific met-

rics used by Kersta (since they were not disclosed), he did show

spectrograms of the same utterances which looked alike but were from

different talkers as well as some which looked different although pro-

duced by the same speaker. It would seem obvious from Vanderslice's

observations that "voiceprint" identification cannot be based on a

simple visual pattern matching scheme.

The voiceprint method has been attacked by others also; Bolt,

Cooper, David, Denes, Pickett and Stevens (1970)¹ stress that, since

the identification task has not been defined by a set of objective

measures, the validity and reliability of this method of speaker

identification currently is untestable. Moreover, controversy exists between

the antagonists concerning the applicability of this type of voice

identification technique to "real-life cases." Thus, even if the problems of

interference with the acoustic signal are ignored, Bolt et al. (1970)

contend that false identification rates can reach levels as high as 63% --

at least, for certain tasks and observer training. Again, in 1973 these

same authors questioned the feasibility of using this visual technique

for speaker identification. The research conducted on "voiceprints" in

the interim had not resolved the problems. For example, they considered

the error rates found by Tosi, Oyer, Lashbrook, Pedrey, Nichol and Nash

(1972) to be too large for practical applications, i.e., the use of

"voiceprint" identification in investigations and judicial or forensic

applications. In response, Black, Lashbrook, Nash, Oyer, Pedrey, Tosi

and Truby (1973) accused the Bolt group of judging the technique without

directly observing it and further charged that the critics of the

system disregarded factors important to its success, e.g., examiner

training, examination time, sample size, etc. However, even after this

discussion, Poza (1973) is still of the opinion that before the

"voiceprint" technique can be used, the reliability of the method must be

determined. Poza feels that speaker characteristics are probably

represented in spectrograms and can be detected by examiners but, since

"voiceprints" are to be used in a forensic situation, those examiners

must be trained on materials that approximate practical cases. In any

event, the argument is unresolved and it should be noted that a

substantial segment of the scientific community is currently unwilling to

accept "voiceprints" as a viable identification system and, moreover,

does not hold much hope for its being refined sufficiently to become a

reliable system. Even Tosi (1974), a long-term proponent of

"voiceprints," has admitted that a single completely reliable speaker

identification technique is not currently available and one may never be

¹ Bolt and his associates were selected by the Technical Committee on
Speech Communication, the Acoustical Society of America, to form an
opinion concerning the reliability and validity of "voiceprints."


The "machine" approach to speaker identification is not confined

to the time/amplitude/frequency type of spectrography. Recently, other

researchers have studied this problem using other techniques of this

type, i.e., quantifiable acoustic parameters that are amenable to ma-

chine processing. The spectral composition of an utterance can be por-

trayed in forms other than spectrograms. For example, the output of a

spectral analysis device can be a set of relative numeric values which

reflect the acoustic energy as a function of frequency. Using such

spectral energy representations, Bricker, Gnanadesikan, Mathews,

Pruzansky, Tukey, Wachter and Warner (1971) reported achieving rather

high speaker recognition scores (90% or better). Gubrynowicz (1973)

and Kosiel (1973), using 15 and 10 talkers, respectively, found that

they could identify all of their speakers from a spectral analysis of

their vocal output. Majewski and Hollien (1974) also used a spectral

analysis technique (namely, long-term power spectra), where an average

of nearly 95% correct identification was achieved for two groups of 50

talkers -- rather large populations in speaker identification research.

Therefore, it seems likely that some portrayal of the

distribution of acoustic energy -- either instantaneous or long-term --

may be incorporated in a speaker identification system.

Research on the speaker identification problem has included an

examination of the fundamental frequency of phonation as a viable

indicator (Atal, 1972; Hollien, Hollien and Majewski, 1974; Iles, 1972;

Sambur, 1973; Wolf, 1972). Thus far, it has not been shown to be as

effective as the measures of spectral composition. Sambur (1973) re-

ported that fo contours, which represent changes in speaking fundamen-

tal frequency, were useful but of somewhat less importance than other

measures, namely, formant location and bandwidth. However, Wolf (1972),

using an array of acoustic measures found that the glottal fundamental

frequency could be used with some success. Therefore, it appears that

some measures of the frequency of the source wave of the voice may prove

to be relatively functional in speaker identification.

Temporal factors may also prove to be useful in speaker identi-

fication. For example, speaking time (ST) may prove to have the capac-

ity to differentiate between talkers. In general, the term speaking

time is used to denote the portion of time during a complete utterance

that acoustic energy is present. While ST has not been tested as yet

as a discriminator of talkers, this characteristic of speech production

has been examined previously. For example, Holbrook (1973) reported

normative data on the total speaking time for four speech pathology

professors during an eight-hour day. He estimated that ST, the actual

time that a speaker is producing an acoustic signal, constituted 60%

of the total time to produce the utterance. The remaining 40% was de-

voted to voiceless speech-time and pauses. From data such as these,

one might hypothesize that talkers differ in terms of the relative

duration of segments containing acoustic energy. Moreover, if the times

are recorded when a speaker is and is not producing an acoustic signal

during the reading of a particular passage the intra-speaker differences

may be relatively small in comparison with inter-speaker differences and

thus provide an aid in identification.
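
As a rough illustration of the definition given above, speaking time can be expressed as the fraction of an utterance during which acoustic energy is present. The sketch below assumes the utterance has already been reduced to a sequence of equal-length frame levels in dB and that a fixed threshold separates energy from silence. The function name, the frame values and the threshold are hypothetical and are not taken from the study.

```python
def speaking_time_ratio(frame_levels_db, threshold_db):
    """Return the fraction of frames whose level exceeds threshold_db.

    frame_levels_db: per-frame signal levels in dB (equal-length frames assumed).
    threshold_db: level above which a frame counts as containing acoustic
                  energy (a hypothetical choice, not a value from the source).
    """
    if not frame_levels_db:
        raise ValueError("need at least one frame")
    active = sum(1 for level in frame_levels_db if level > threshold_db)
    return active / len(frame_levels_db)


# Ten invented frames: six lie above a 40 dB threshold, four below it.
levels = [55, 60, 20, 58, 15, 62, 57, 10, 59, 12]
ratio = speaking_time_ratio(levels, threshold_db=40)
```

With these invented frames the ratio is 0.6. A per-talker comparison would repeat the measurement over several readings of the same passage and compare the spread within a talker to the spread between talkers.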

There are many acoustical and temporal factors that might be used;

however, it would be beyond the scope of any single study to examine all

of the parameters or sets of parameters (vectors) which may be functional

in identifying individuals from their speech. Nevertheless, there are

several sets of objectively measured parameters that are of special in-

terest in that they hold the potential of eventually being incorporated

into some sort of speaker identification system. Briefly, several sets

of acoustic/temporal measures will be selected and their power in a

speaker identification system will be examined: (1) long-term speech

spectra (LTS), the distribution of acoustic energy as a function of

frequency, (2) speaking fundamental frequency (SFF), the characteris-

tics of a talker's glottal source wave and (3) speaking time (ST), the

amount of time phonation is present during the total speaking time.

Long-Term Power Spectra (LTS)

As stated, Bricker et al. (1971), Gubrynowicz (1973), Kosiel (1973),

Majewski and Hollien (1974) and Hollien, Majewski and Hollien (1974)

have used long-term speech spectra. In particular, Majewski and Hollien

(1974) obtained speech samples from two groups: 50 Americans and 50

Poles. Four sets of comparisons were made for each group resulting in

percent correct identification for the Poles of 96, 98, 98 and 98, and

for the Americans of 94, 96, 84 and 90. The overall average for both

was 94.2%, a rather impressive identification rate for populations as

large as those used in this research. In the same study, the authors

reported correct identification scores for a limited passband, 315-3150

Hz. In this case, the scores were 82% for the Poles and 70% for the

Americans. Considering the reduction in correct identifications, it is

apparent that some of the information eliminated was useful in

speaker identification. However, the overall performance of LTS for

these relatively large groups demonstrates that this measure has the

potential to be included in a speaker identification system.
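
The averages quoted above can be checked directly from the eight scores. The short sketch below assumes the overall figure is an unweighted mean of the four comparisons for each group; the variable names are illustrative only.

```python
# Percent correct identification for the four comparisons in each group,
# as quoted in the text from Majewski and Hollien (1974).
poles = [96, 98, 98, 98]
americans = [94, 96, 84, 90]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


mean_poles = mean(poles)
mean_americans = mean(americans)
overall = mean(poles + americans)
```

Under that assumption the group means are 97.5 and 91.0 and the overall mean is 94.25, which agrees with the reported 94.2% to within rounding.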

Speaking Fundamental Frequency (SFF)

The impression of the pitch of a speaker's voice is an intuitively

appealing characteristic providing a clue to his identity. For example,

Atal (1972) examined the feasibility of speaker recognition based on

pitch contours -- actually changes in SFF. Six utterances from each of

ten male talkers were normalized, i.e., adjusted to have the same dura-

tion, and moment-to-moment changes in SFF were extracted. First, a

cross-correlation between pairs of contours was computed; the highest

mean correlation coefficient indicated that the test samples consti-

tuted the product of one speaker. This analysis produced a score of 70%.

Later a minimum-distance classification technique (similar to that of

Hollien, Hollien and Majewski, 1974) was utilized and a correct ident-

ification rate of 68% was obtained. This result was not significantly

different from those obtained from the correlation method. Since pitch

contours provide a measure of discriminability among talkers, it is

possible that other measures of fundamental frequency, mean SFF (fo)

and its variability (PS), may also be useful in differentiating talkers.

Indeed, Hollien, Hollien and Majewski developed a technique based upon

these measures which produced correct identification rates ranging from

80.0 to 100.0% for three groups. Even if, as in Atal's study, the

measures are not exceptionally functional as determiners of speaker

identity when used alone, they may enhance overall identification cap-

ability when combined with other parameters.

Speaking Time (ST)

While it has been noted that ST has not been examined previously as

a measure upon which to base speaker identification, it is possible that

the vector could be rather effective. Holbrook's (1973) report provides

only normative data on a group of professors. However, the vector could

be powerful if talkers are consistent, i.e., the proportion of time that

acoustic energy is being produced is stable from one speech sample to

another. Further, there must be variability among subjects.

Vector Combinations

In addition to using each of the three vectors separately, as dis-

cussed above, it is possible to develop a composite vector comprised of

all or part of the basic set, thereby selecting a parameter set that will

be most effective. For example, Sambur (1973) found that in order of im-

portance, formant location (F3 and F4 in vowels), bandwidth, and mean

speaking fundamental frequency are all effective features upon which to

make judgements of a speaker's identity. Further, he achieved extremely

low error rates -- as low as 0.003 -- by combining these elements. Thus,

while the speaker identification potential of each of the vectors (LTS,

SFF and ST applied separately) is of particular interest, certain com-

binations may prove to be more effective when they are combined into a

composite vector -- even if one of the members appeared to be a rather

poor predictor when used alone. For example, if one vector is indepen-

dent of the second (with which it is paired), then the identification

rates of the pair should show substantial improvement over their

separate identification rates. It is probable, for instance, that SFF and

ST are independent. During speech, a talker can initiate and terminate

glottal activity at any time and anywhere in his phonational range.

Therefore, one vector may not affect the other and it is reasonable to

expect much better results when SFF and ST are used conjointly than

when they are used alone. Conversely, combining vectors may not im-

prove speaker identification levels if little information is present

in one vector that is not contained in the other. In any case, the

interrelationships of the three vectors currently are unknown and,

therefore, their discriminating power when used together cannot be pre-

dicted at this time.
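
One simple way to picture a composite vector is as the concatenation of whichever parameter sets are selected. The sketch below is only an illustration of that idea: the function name is hypothetical, all numeric values are invented, and the units given for the SFF and ST entries are assumptions rather than details from the study.

```python
def composite_vector(lts, sff, st, use=("LTS", "SFF", "ST")):
    """Concatenate the selected parameter sets into one feature vector."""
    parts = {"LTS": list(lts), "SFF": list(sff), "ST": list(st)}
    unknown = [name for name in use if name not in parts]
    if unknown:
        raise ValueError(f"unknown vector name(s): {unknown}")
    combined = []
    for name in use:
        combined.extend(parts[name])
    return combined


# Invented values for a single utterance (not data from the study).
lts = [42.0, 44.5, 47.1, 45.3, 43.0, 41.2, 39.8, 38.5, 36.0, 34.2, 31.9]  # 11 band levels, dB
sff = [118.0, 2.4]  # mean fundamental frequency and pitch sigma (units assumed)
st = [0.61]         # proportion of time acoustic energy is present (assumed form)

all_three = composite_vector(lts, sff, st)
sff_st = composite_vector(lts, sff, st, use=("SFF", "ST"))
```

Because the entries of such a vector are in different units, a plain distance measure would be dominated by the largest-valued parameters, which is one reason a covariance-based method may be better suited to composite vectors.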


To properly evaluate any identification system, some attention

must be paid to the ability of the system -- and its several compon-

ents -- to function properly in the presence of distortions of various

classes. Here, the term distortion is meant to refer to any factor

that may produce changes in the speech signal. Distortions may arise

either in the production or transmission of the acoustic signal.

During transmission, for example, the acoustic signal may be distorted

by effects of limited frequency response in the transmission line. The

most common occurrence of the filtering of speech is associated with

telephone conversations. While the filtering quality of telephone com-

munications is related to transmission distance, especially from the

subscriber to a control office, a nominal frequency response is from

300 to 3000 Hz. The obvious effect of such frequency limiting would be

to reduce the amount of information available to any vector, especially

one which is frequency dependent, thereby degrading the effectiveness

of that set of measures.

The distortions discussed thus far are related to one class of

transmission line interference. Additional distortions may arise in

the production of the speech signal. For example, Davitz and Davitz

(1959) demonstrated that emotions are portrayed in the acoustic sig-

nal. In this study, listeners were found to be able to reliably ident-

ify the intended emotion of male and female talkers reading the alpha-

bet. Thus, it is possible that such alterations by a speaker may inter-

fere with his correct identification, i.e., an individual may produce

sufficiently variant speech exemplars when emotionally aroused (as

opposed to when he is relaxed) that his utterances may be classified as

being from different talkers. Using both machine and perceptual tech-

niques, Majewski et al. (unpublished manuscript) and Hollien, Majewski

and Hollien (1974), respectively, demonstrated that samples drawn from

speakers while they were under stress were correctly identified less

frequently than samples produced normally.

Implicit in the discussion of the effects of psychological and

physiological factors upon the production of speech is the concept that

the talkers may not be consciously attempting to modify that signal.

The speaker is simply trying to say something and his vocal output may

reflect the pressure of an interfering stimulus. However, a speaker

also can exercise extensive, conscious control over his vocal output.

The extent to which a speaker can actively vary the acoustic signal

should be at least as great as that associated with emotionally based

distortions. Thus, a talker, by disguising the voice, should be able

to increase confusion about his identity. Again, Majewski et al.

(unpublished manuscript) and Hollien, Majewski and Hollien (1974)

reported fewer correct identifications of speakers attempting

disguise than were observed for normally produced samples.

Further, these levels were lower than those reported for the

stress condition.

In any case, there are a variety of ways to distort a speech sig-

nal and it is of interest to determine whether the selected objective

measures are sensitive to distortions generated in transmission, i.e.,

limited passband, and in production, i.e., stress or disguise. In

other words, if the same acoustic parameters used to analyze normal

productions are extracted from these altered samples, how will their

accuracy in discriminating talkers be altered?

Statistical Analysis

The statistical method used in a particular piece of research may

have some bearing on the results. For example, Atal (1972) employed

both minimum-distance and cross-correlation techniques to analyze his

data. Both methods should have yielded identical results but, in fact,

they did produce slightly different scores (68% vs 70% correct identi-

fications, respectively). In addition, Atal computed a moments of

pitch-period distribution for the same data and obtained 78% correct

identifications. Although the differences for the three approaches

are not dramatic, it is possible that other methods could produce sub-

stantially different levels of identification.

In any case, a variety of statistical approaches are available

to assist in the analysis/interpretation of research of this type.

For example, Majewski and Hollien (1974) utilized a Euclidean distance

technique in which a distance is calculated from the location of an

observation to the location of each reference and the observation is

assigned to the closest reference. Zalewski, Majewski and Hollien (in

press) employed a cross-correlation statistical technique and showed a

slight improvement over the results of a Euclidean distance technique --

based on the same data -- used by Majewski and Hollien (1974). For

cross-correlations, each observation is classified as coming from the

talker whose reference utterance correlates most highly with it. On the

other hand, Bricker et al. (1971) are of the opinion that a

discriminant analysis is an appropriate technique. In this case, a

generalized squared distance is computed based upon a covariance matrix -- a

slightly different approach from either the Euclidean distance or cross-

correlation approaches. Further, discriminant analysis does permit the

examination of the three vectors, LTS, ST and SFF, under the various
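
The first two classification rules described in this section can be sketched in a few lines. The example below is a minimal illustration, not the procedure used by the cited authors: "cross-correlation" is taken here to mean the Pearson correlation coefficient at zero lag, which is an assumption, and the talker names and spectra are invented.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation; undefined if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def classify_by_distance(observation, references):
    """Assign the observation to the talker with the closest reference."""
    return min(references,
               key=lambda t: euclidean_distance(observation, references[t]))


def classify_by_correlation(observation, references):
    """Assign the observation to the talker with the most correlated reference."""
    return max(references,
               key=lambda t: pearson_correlation(observation, references[t]))


# Hypothetical reference spectra (dB per band) for two talkers.
references = {
    "talker_A": [40.0, 45.0, 47.0, 44.0, 38.0],
    "talker_B": [44.0, 43.0, 41.0, 42.0, 46.0],
}
observation = [41.0, 44.0, 47.5, 43.0, 39.0]

by_distance = classify_by_distance(observation, references)
by_correlation = classify_by_correlation(observation, references)
```

Both rules assign this invented observation to talker_A. The two rules need not agree in general, since correlation ignores a constant offset between spectra while Euclidean distance does not.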


Statement of the Problem

There are many approaches to the speaker identification problem;

most have not been tested. Neither the characteristics of the voice

that make identification possible nor the techniques for evaluating

those qualities have been determined. It would be of practical benefit

to consider several sets of parameters which may be relatively powerful

in discriminating among talkers. Long-term speech spectra, speaking

time and speaking fundamental frequency will be examined. These vectors

will be studied both under idealized conditions and while being exposed

to possibly contaminating physical and psychological factors. LTS will

be subjected to filtering in order to determine the vector's effective-

ness under that type of distortion. All three basic vectors will be

examined when production distortions are present, i.e., stress and dis-

guise. The analysis system should not only make efficient use of the

data but also allow for the straightforward examination of any subset

of the parameters. Therefore, a discriminant analysis was used to

identify talkers from their speech alone and it permitted the exami-

nation of the effect of any portion of a vector or any combination of

the vectors.
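
A discriminant classification based on a generalized squared distance might look like the sketch below. It assumes a pooled within-talker covariance matrix and equal prior probabilities, neither of which is stated in this section, and it relies on NumPy. The function names and training values are hypothetical.

```python
import numpy as np


def fit_pooled(train):
    """Return per-talker means and the pooled within-talker covariance.

    train maps talker -> list of feature vectors (one per utterance).
    """
    means = {t: np.asarray(x, dtype=float).mean(axis=0) for t, x in train.items()}
    n_total = sum(len(x) for x in train.values())
    n_feat = len(next(iter(means.values())))
    scatter = np.zeros((n_feat, n_feat))
    for t, x in train.items():
        centered = np.asarray(x, dtype=float) - means[t]
        scatter += centered.T @ centered
    pooled_cov = scatter / (n_total - len(train))
    return means, pooled_cov


def classify(x, means, pooled_cov):
    """Assign x to the talker with the smallest generalized squared distance."""
    inv = np.linalg.pinv(pooled_cov)  # pseudo-inverse guards against singularity

    def d2(talker):
        diff = np.asarray(x, dtype=float) - means[talker]
        return float(diff @ inv @ diff)

    return min(means, key=d2)


# Invented two-parameter vectors (mean SFF in Hz, ST proportion) for two talkers.
train = {
    "talker_A": [[118, 0.60], [121, 0.62], [119, 0.58], [122, 0.61]],
    "talker_B": [[104, 0.66], [106, 0.70], [103, 0.67], [107, 0.69]],
}
means, pooled_cov = fit_pooled(train)
label_a = classify([120.0, 0.60], means, pooled_cov)
label_b = classify([105.0, 0.68], means, pooled_cov)
```

Examining a subset of parameters then amounts to selecting the corresponding columns before fitting, which is the flexibility the paragraph above attributes to the method.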

Currently, an adequate technique of speaker identification does

not exist and the analytic method to select the correct talker has not

been determined. It would seem that research directed at examining

voice qualities and analysis procedures is warranted. Therefore, this

research directly addresses itself to four specific questions:

A. Are long-term power spectra, speaking fundamental frequency

and speaking time sufficiently powerful to be used singly to

identify a talker from his speech?

B. Are these parameters effective measures upon which to base

judgements of speaker identification when used in various

combinations?


C. Are these specified parameters sufficiently robust to func-

tion in the presence of distortions -- limited passband,

stress or disguise?

D. Which statistical procedure, a Euclidean distance, cross-

correlation or discriminant analysis, provides the highest

correct speaker identification levels (as indicated by com-

parison based on LTS (fullband) vector for the group of 50




The present experimentation addresses two points: the assessment

of several acoustic parameter sets functioning in a speaker identifica-

tion task, and the most effective method to analyze the data.

The primary focus of this research is directed at the capability

of sets of selected acoustic measures -- used alone or in concert -- to

detect the identity of a talker from a number of speakers who may be

talking normally, under stress or while attempting a disguise.

Identification Parameters

Three sets of parameters are examined in terms of their contribution

to speaker identification: long-term power spectra, speaking

fundamental frequency and speaking time. The parameter sets are

evaluated to determine a) the effectiveness of each set as a

discriminator when it is the sole basis for identification and b) the

potential of the sets when used conjointly.

Long-Term Power Spectra.¹ LTS is a measure of the distribution of

acoustic energy during a set time interval. The long-term spectral

characteristics for all speakers were extracted from the recorded speech

samples by means of a General Radio 1921 Real-time Analyzer. These

analyses were performed in 23 one-third octave bands covering the frequency

range from 80 to 12,500 Hz. In order to minimize the influences of the

speech content upon the speech spectrum, the longest integration time

available (i.e., 32 sec.) was used. The sampling rate for this value of

integration time was 32 samples per second; four speech samples, each

of 32-sec. duration, were analyzed (serially) for each subject.

¹ While the description of long-term power spectra specifies a vector
of 23 parameters, virtually all analyses are conducted with only 11
of these values. Therefore, unless specifically noted as otherwise,
all references to LTS apply to the subset of 11 measures.

The values extracted during the spectral analyses, expressed in dB

levels for each of 23 frequency bands, were printed on paper tape by

means of an MDS 800 Printer. Since the printing time lasted only four

seconds, it was possible to analyze the samples without the necessity of

stopping the input tape recorder after each sample. For both groups, a

total of 300 speech spectra (each LTS vector expressed by 23 parameter

values) were obtained for further processing.

An intermediate data processing step was carried out on an IBM

370/165 computer. Specifically, the data for each speech sample were

normalized in order to equalize the overall levels of the samples. An

arbitrary total power level was used; i.e., 50 dB re: the full frequency

band under investigation. In other words, the energy measured in each

one-third octave band is recorded relative to 50 dB which represents

the energy present in the total signal.
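
The level normalization described above can be sketched as follows. This
is a minimal illustration, not the program run on the IBM 370/165: it
assumes that the "total power level" is the power sum of the one-third
octave band levels and that every band is shifted by the same number of
dB. The function names are hypothetical.

```python
import math


def total_level_db(levels_db):
    """Overall level (dB) obtained by summing the band powers."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))


def normalize_band_levels(levels_db, target_total_db=50.0):
    """Shift all band levels equally so that their power sum is target_total_db.

    A uniform shift changes the overall level but leaves the shape of the
    spectrum (the differences between bands) untouched.
    """
    shift = target_total_db - total_level_db(levels_db)
    return [level + shift for level in levels_db]


# Hypothetical three-band example; a real LTS vector would have 23 bands.
normalized = normalize_band_levels([60.0, 57.0, 54.0])
```

Because only a constant is added, two recordings of the same talker made
at different gains would yield the same normalized vector under these
assumptions.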

Speaking Fundamental Frequency. The second set of parameters,

SFF, are measures of the rate of vibration of the vocal folds, i.e., the

lowest frequency in the glottal wave that contains energy. Two measures

were utilized -- mean speaking fundamental frequency (fo) and pitch

sigma (PS), i.e., standard deviation of the distribution. Mean funda-

mental frequency measures of each subject's reading were obtained using

the Fundamental Frequency Indicator (FFI-6), a digital readout fo

tracking device consisting of a group of successive low-pass filters

with cut-offs at half-octave intervals coupled with high-speed switch-

ing circuits which are controlled by a logic system. FFI produces a

string of pulses -- each pulse marking the boundary of a fundamental

period from complex speech waves -- which are delivered to a Digital

Equipment Corporation PDP-8 computer. An interval clock marks the

time from pulse to pulse and these values are processed digitally to

yield (among other data) the geometric mean frequency and standard

deviation of the frequency distribution. The importance of SFF is

tested using discriminant analysis. This vector will be used alone

and in combination with other vectors to test its value in identifying

talkers.
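
The two SFF measures can be illustrated with a short sketch that starts
from pulse-to-pulse intervals such as those timed by the interval clock.
It is not the PDP-8 program. The text does not state the unit of pitch
sigma, so expressing the standard deviation in semitones about the
geometric mean is an assumption made here, as is the function name.

```python
import math


def sff_measures(periods_ms):
    """Return (geometric mean fo in Hz, standard deviation in semitones).

    periods_ms holds the durations of successive fundamental periods in
    milliseconds. Working in log frequency makes the mean a geometric mean.
    """
    log_freqs = [math.log2(1000.0 / period) for period in periods_ms]
    mean_log = sum(log_freqs) / len(log_freqs)
    geometric_mean_hz = 2.0 ** mean_log
    variance = sum((x - mean_log) ** 2 for x in log_freqs) / (len(log_freqs) - 1)
    sigma_semitones = 12.0 * math.sqrt(variance)  # 12 semitones per octave
    return geometric_mean_hz, sigma_semitones


# Hypothetical periods of 10 ms and 5 ms correspond to 100 Hz and 200 Hz.
mean_fo, pitch_sigma = sff_measures([10.0, 5.0])
```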


Speaking Time. Finally, ST, which measures the amount of time

any acoustic energy is present during a total utterance, was the third

parameter studied. As used in this research, this vector also is com-

posed of two parameters. The first ST parameter measures the total

time that an acoustic signal is present during an individual's total

utterance. In this case, time measures are extracted for each of the

four 32-second segments of the reading passage. The second (parameter)

is a measure of the talker's rate, i.e., the amount of speech material

completed during a 32-sec. segment. Initially, it is unclear whether

the number of words or the number of phonemes produced would produce

the more stable measure of a talker's output. Therefore, both word

and phoneme counts per segment were recorded and tested to verify if

one was more appropriate to the speaker identification task, with the

better one being used in all subsequent analyses.

To obtain the first measure, a special device has been developed

which has two outputs: 1) a 1 kHz continuous signal that is produced

throughout the duration of the utterance and 2) a 1 kHz pulse train

which is present whenever acoustic energy exceeds a threshold. By

feeding these signals to a pair of electronic counters, the total time

of the utterance and the amount of time that phonation is produced can

be measured in milliseconds (ms). The second parameter was obtained by

recording the first and last utterance in a segment and then counting

the number of words or phonemes produced in that segment. As with the

previous set of parameters, similar procedures were employed to deter-

mine their value in differentiating talkers and were tested singly and

in combination with other parameter sets.
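
A minimal sketch of how the counter readings translate into the ST
vector is given below. It assumes that each counted cycle of the 1 kHz
signal represents one millisecond; the function names and the example
counts are hypothetical and are not taken from the study.

```python
def speaking_time_vector(phonation_count, phoneme_count, clock_hz=1000.0):
    """Return the two ST parameters for one 32-second segment.

    phonation_count is the number of clock pulses counted while acoustic
    energy exceeded the threshold; phoneme_count is the amount of speech
    material completed in the segment.
    """
    phonation_ms = 1000.0 * phonation_count / clock_hz
    return [phonation_ms, float(phoneme_count)]


def phonation_fraction(total_count, phonation_count):
    """Share of the utterance during which acoustic energy was present."""
    return phonation_count / total_count


# Hypothetical readings for one segment.
st_vector = speaking_time_vector(24500, 410)
fraction = phonation_fraction(32000, 24000)
```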

Speaker Populations

The speaker population utilized in this research was composed of

two groups. The first group contained 50 college-age male talkers from

the University of Florida. They were adult speakers of English, aged

18-25 years, and demonstrated no observable speech or voice problems.

Generally, such talkers might be expected to be quite homogeneous in

their speech patterns, e.g., rate, dialect, etc., and present a reason-

ably difficult population from which to correctly identify members.

Further, they produced only samples of their normal everyday speech.

The second group of subjects was selected from a population of

more mature males. In this case, 25 normal speakers of American En-

glish from approximately 25 to 45 years of age were included. Members

of this smaller, more variable group should be more readily identifi-

able from their speech -- a premise that was tested. In addition,

these subjects supplied -- along with their normal productions -- sam-

ples that were distorted in two ways, i.e., they were produced under

stress and as the subjects attempted a disguise; these speaking condi-

tions will be described below.

Speech Material

All subjects read a modernization of Robert Louis Stevenson's "An

Apology for Idlers" (see Appendix); this passage takes about 2.5 min-

utes to read. In order to gather samples of normal speech production,

talkers were instructed to read the passage "as naturally as possible."

Additional readings of the essay were requested from one of the two

groups of the subjects while under stress and while attempting a dis-

guise. The stress was induced by randomly applying a mild, varying

electric shock during the subject's reading. For the disguised voice,

talkers were allowed to modify their voices in any way except through

the use of a "foreign dialect" or by whispering.

Signal Distortions

The speech samples used in this study are "ideal," i.e., they were

recorded with extreme care under laboratory conditions without observ-

able distortions. Such controls are necessary if one is to test the

best possible outcome from the speaker identification system -- a mea-

sure of considerable interest -- since the primary focus is directed at

the maximum discriminant capability of the vectors. However, the sys-

tem's effectiveness in the presence of typical distortions is also a

critical measure. It is not within the scope of this project to exam-

ine the identification vectors under very many conditions of distortion

or field situations but, considering the criticism of Bolt et al. (1970)

of the "voiceprint" technique (i.e., that consideration had not been

accorded to practicalities), some attempt will be given to evaluating

the ability of the system to function in less than ideal conditions.

For this reason, LTS was subjected to both production and transmission

distortions in order to obtain an indication of the magnitude of change

in the system's correct identification score. Specifically, LTS was

subjected to a simulated system distortion, i.e., a limited passband

of 315-3150 Hz. The simulation is accomplished by removing the six

parameters which contain the information below 315 Hz and, similarly,

by removing the six parameters which contain the information above

3150 Hz. The result is a LTS vector of 11 rather than 23 parameters.

ST should be insensitive to the effects of filtering of this type --

as would the SFF data. Accordingly, LTS was the only parameter sub-

jected to filtering. However, all three parameters were exposed to

production distortions, i.e., stress and disguise. Finally, all vectors

were subjected singly and in combination to the respective distortions.
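
The simulated passband can be sketched as a simple selection of bands.
The listing assumes that the analyzer's 23 bands fall on the standard
nominal one-third octave center frequencies between 80 and 12,500 Hz;
the text gives only the range and the number of bands, and the names
below are hypothetical.

```python
# Nominal one-third octave center frequencies (Hz), 80 Hz to 12.5 kHz.
BAND_CENTERS_HZ = [
    80, 100, 125, 160, 200, 250, 315, 400, 500, 630, 800, 1000,
    1250, 1600, 2000, 2500, 3150, 4000, 5000, 6300, 8000, 10000, 12500,
]


def limit_passband(lts_vector, low_hz=315, high_hz=3150):
    """Keep only the band levels whose center lies inside the passband."""
    if len(lts_vector) != len(BAND_CENTERS_HZ):
        raise ValueError("expected one level per one-third octave band")
    return [
        level
        for center, level in zip(BAND_CENTERS_HZ, lts_vector)
        if low_hz <= center <= high_hz
    ]


# Band indices stand in for dB levels so the surviving bands are visible.
passband = limit_passband(list(range(23)))
```

Under this assumption six bands lie below 315 Hz and six lie above
3150 Hz, which leaves the 11 parameters reported in the text.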


Statistical Analysis

A discriminant analysis technique was used to identify speakers.

Fundamentally, discriminant analysis uses measures obtained from known

classes -- speakers in this case -- in order to determine a classification

criterion. Then, additional observations are compared to the criterion

for each class, and each is assigned to the class which it most closely

resembles. Basically, this technique is one of pattern matching.

Obviously, the classification criteria for the test sample must

be present -- i.e., the unknown speaker must be represented in the set

of known talkers -- because the discriminant analysis classifies every

test sample into some category, i.e., the best match.

Specifically, the Statistical Analysis System (SAS) discriminant

analysis package was employed. Since the computation of the discrim-

inant function requires two or more sets of measurements from each

group (speaker) to establish a criterion for those groups, the second,

third, and fourth 32-second segments of the normally rendered reading

were used to describe the characteristics of the talkers. For both

groups of talkers, the first segment of the normal reading was treated

as the "unknown." For the second group (of 25 talkers) in addition to

the normal sample, the first 32-second segments of the "stress" and

"disguise" readings were also treated as "unknowns." In all cases, the

first segment constituted the test and the three remaining segments

were used to establish the reference (1 vs 2,3,4). All of the observations

defined as "unknowns" were classified according to the discriminant

functions. A "correct identification" was made when an observation

was matched with the remaining samples from the same speaker

and a total number of correct identifications was obtained for each

condition. After converting the totals to percent correct identifi-

cation, the effectiveness of each vector was examined both as a single

vector and in combination and under the specific distortions applied

to it.

The discriminant analysis also performs a posterior classifica-

tion of the reference samples, i.e., the observations used to define

each talker are subsequently classified as though they were "unknowns."

The level of correct classification of the reference observations is

indicative of the expected level of performance of that vector in

properly identifying the test observations. In other words, if the

three segments used as references are rather similar, the "knowns"

should be properly classified. Further, when the test samples are


correctly classified, we can assume with some confidence that it is

not a chance occurrence.
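
The test and reference procedure (1 vs 2,3,4), together with the
posterior classification of the references, can be sketched as follows.
This is not the SAS procedure, whose internal settings are not given in
the text. It is a generic linear discriminant that assumes equal prior
probabilities and a pooled within-speaker covariance matrix, written
without external libraries. All names and the toy data are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pooled_covariance(references, means):
    """Pooled within-speaker covariance matrix of the reference samples."""
    dim = len(next(iter(means.values())))
    cov = [[0.0] * dim for _ in range(dim)]
    dof = 0
    for speaker, vectors in references.items():
        for v in vectors:
            d = [a - b for a, b in zip(v, means[speaker])]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += d[i] * d[j]
        dof += len(vectors) - 1
    return [[value / dof for value in row] for row in cov]


def classify(x, means, cov):
    """Assign x to the speaker whose mean is nearest in Mahalanobis distance."""
    def distance(speaker):
        d = [a - b for a, b in zip(x, means[speaker])]
        return sum(di * yi for di, yi in zip(d, solve(cov, d)))
    return min(means, key=distance)


def evaluate(samples):
    """samples maps speaker -> [segment 1, 2, 3, 4].

    Returns (posterior % correct on the references, % correct on the tests).
    """
    references = {s: segs[1:] for s, segs in samples.items()}  # segments 2, 3, 4
    means = {s: mean_vector(r) for s, r in references.items()}
    cov = pooled_covariance(references, means)
    known = [(s, v) for s, r in references.items() for v in r]
    posterior_hits = sum(classify(v, means, cov) == s for s, v in known)
    test_hits = sum(classify(segs[0], means, cov) == s for s, segs in samples.items())
    return 100.0 * posterior_hits / len(known), 100.0 * test_hits / len(samples)


# Toy data: three talkers, four segments each, two parameters per segment.
toy = {
    "A": [[1.0, 1.1], [0.9, 1.0], [1.1, 1.2], [1.0, 0.9]],
    "B": [[5.0, 5.2], [5.1, 4.9], [4.9, 5.0], [5.0, 5.3]],
    "C": [[9.0, 1.0], [9.1, 1.2], [8.9, 0.9], [9.0, 1.1]],
}
posterior_rate, test_rate = evaluate(toy)
```

Every test sample is assigned to some speaker, which is why the unknown
talker must be among the references. The sketch does not guard against a
singular covariance matrix, which would arise if the references supplied
fewer within-speaker degrees of freedom than there are parameters.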



Several approaches have been incorporated in this study, i.e.,

various acoustic measures and analysis techniques as well as distor-

tions of the speech signal. The data for each of the single param-

eters are presented in toto followed by the results for these vec-

tors used in combination. Each of these sections contains the re-

sults of the posterior classification of the reference (or "known")

samples which reflect the relative within and between talker vari-

ability. The posterior classifications of the "known" talkers are

listed because they provide some insight into the stability of the

groups defined by the vector. Following the presentation of that

data, the information concerning the classification of the test sam-

ples is presented. Finally, identification rates in the presence of

speech distortions are reported. In the case of the LTS vector, two

additional factors are considered, namely, the effect of filtering on

speaker identification and a comparison of the statistical approaches

employed to classify the talkers.

Long-Term Power Spectra (LTS)

Table 1 lists the correct identification rates for the posterior

classifications performed on both groups. Both LTS vectors, fullband

and filtered, demonstrate rather high correct identification rates for

the posterior classifications. Therefore, the correct classification

of the test samples should be expected to be very high whenever either

of those vectors is used.

Table 1. Classification of Observations Based on Long-Term Power
Spectra (LTS).

                                    Correct Identification (in %)

                                  Group A           Group B
Experimental                       N=50              N=25
Condition                         Normal    Normal   Stress   Disguise

Posterior Classification
of "Knowns"

  LTS (fullband)                  100.0     100.0
  LTS (passband)                   98.0     100.0

Classification of "Unknowns"

  LTS (fullband)                  100.0     100.0     72.0      24.0
  LTS (passband)                   76.0      80.0     60.0      20.0

Specifically, the fullband LTS vector achieved the 100.0% levels

of correct speaker identification for the normal productions of both

populations. The 23 parameters of this vector apparently describe

patterns for each talker that are unique -- at least for differenti-

ating numbers of talkers as large as 50 when no distortions are present.

Some concern exists that such a high identification level may have

resulted from a fortuitous selection of the segments to be used as

tests and references. Majewski and Hollien (1974), aware of this prob-

lem, examined speaker identification levels for four combinations of

test and reference sets, as follows: a) the fourth sample vs the mean

of the first three samples (4-1,2,3), b) the mean of the first three

samples vs the fourth sample (1,2,3-4), c) the mean of the last two

samples vs the mean of the first two samples (3,4-1,2) and d) the mean

of the first two samples vs the mean of the last two samples (1,2-3,4).

Their results -- using LTS with the group of 50 speakers -- showed

that, while the identification scores varied somewhat (from 84 to 96%),

all combinations produced rather high levels of speaker identification.

Therefore, some change in the scores may be expected if various test

and reference segments are used but the levels observed for any combi-

nation containing at least two segments used as references should pro-

duce comparable levels of identification.
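
The four test and reference combinations listed above can be written
out explicitly. The sketch assumes that, wherever more than one segment
is named, the corresponding vectors are simply averaged parameter by
parameter, as the wording "the mean of" suggests. The names are
hypothetical.

```python
def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Label -> (test segment indices, reference segment indices), zero-based.
SPLITS = {
    "4-1,2,3": ([3], [0, 1, 2]),
    "1,2,3-4": ([0, 1, 2], [3]),
    "3,4-1,2": ([2, 3], [0, 1]),
    "1,2-3,4": ([0, 1], [2, 3]),
}


def split_means(segments, test_idx, ref_idx):
    """Return (test vector, reference vector) for one talker and one split."""
    test = mean_vector([segments[i] for i in test_idx])
    reference = mean_vector([segments[i] for i in ref_idx])
    return test, reference


# Hypothetical two-parameter vectors for the four segments of one talker.
segments = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
test_vec, ref_vec = split_means(segments, *SPLITS["3,4-1,2"])
```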

LTS was also examined, for this investigation, to see if it would

continue to be a viable means of identifying talkers if it were sub-

jected to filtering of the type that may be encountered in telephone

communications. In other words, could correct identifications of

speakers be made if only the spectral information from 315-3150 Hz

were available rather than from 80-12,500 Hz? To simulate such a

condition, 12 of the parameters were removed from the LTS vector. As

can be seen in Table 1, the filtered condition (i.e., 11 parameters)

showed a reduction in correct classification of 24.0 percentage points,

from 100.0% for the fullband to 76.0% for the limited one. The fact

that about three-fourths of the talkers were still properly identified

is encouraging but that level of accuracy is unacceptable for most

practical applications. The same comment is applicable to the smaller

group since its accuracy was reduced from 100.0% to 80.0%.

Production distortions, i.e., stress and disguise, have a marked

impact on speaker identification rates at least for this identifica-

tion parameter. Even the LTS (fullband) vector which functioned flaw-

lessly for normally produced speech could not adequately differentiate

among speakers. While a score of 72.0% is the highest achieved for

any vector or combination for the stress samples and 24.0% is among the

highest for disguised speech, both distortions have an obvious effect

on speaker identification. Under conditions of stress, LTS (fullband)

maintains a moderate level of performance. The remaining LTS vector,

already incorrectly classifying talkers who had produced normal utter-

ances, is even less effective with the stress and disguise conditions

providing correct identification rates of 60.0% and 20.0%, respectively.

Of particular interest in this study is a comparison of the three

data analysis techniques used in the evaluation of the effectiveness of

LTS in speaker identification: cross-correlation, Euclidean distance

and discriminant analysis. Previously, a special program had been

written to classify talkers based on the shortest distance from a test

sample to a reference sample in a polydimensional Euclidean space

(Majewski and Hollien, 1974). The discriminant analysis portion of the

SAS package was also used. The results of both statistical methods are

listed in Table 2, as are the scores for the cross-correlation technique

employed by Zalewski et al. (in press). For both conditions, i.e.,

using 11 or 23 of the parameters in the LTS vector, the SAS discriminant

analysis program correctly classified more talkers than the Euclidean

distance or cross-correlation techniques. Cross-correlations produced

the same results as were found using Euclidean distance. In this case,

discriminant analysis appears to be a more effective method of identi-

fying talkers from the spectral composition of their speech. Based on

the fact that every investigator cited (Bricker et al., 1971; Gubrynowicz,

1973; Kosiel, 1973; Majewski and Hollien, 1974 and Hollien, Majewski

and Hollien, 1974) observed levels of speaker identification in excess

of 90%, it would appear that the spectral composition of an individual's

voice contains substantial clues to his identity.
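
For comparison with discriminant analysis, the two simpler decision
rules can be sketched in a few lines. The original programs are not
reproduced in the text, so this is only an approximation: the Euclidean
rule picks the reference at the shortest distance, and a Pearson
correlation between spectra stands in for the cross-correlation
technique. Names and data are hypothetical.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation(a, b):
    """Pearson correlation between two spectra of equal length."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def nearest_by_distance(test, references):
    """Speaker whose reference spectrum is closest to the test spectrum."""
    return min(references, key=lambda s: euclidean_distance(test, references[s]))


def nearest_by_correlation(test, references):
    """Speaker whose reference spectrum correlates best with the test spectrum."""
    return max(references, key=lambda s: correlation(test, references[s]))


# Hypothetical four-band reference spectra (dB) for two talkers.
references = {"A": [50.0, 45.0, 40.0, 30.0], "B": [40.0, 45.0, 50.0, 30.0]}
unknown = [49.0, 46.0, 41.0, 31.0]
by_distance = nearest_by_distance(unknown, references)
by_correlation = nearest_by_correlation(unknown, references)
```

Neither rule weights the bands by their within-talker variability, which
is one plausible reason a discriminant function could classify more
talkers correctly from the same spectra.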

Again, it is possible that the high scores observed for LTS may

be related to the particular test/reference combinations employed.

In every other procedure in which discriminant analysis was used, the

first 32-sec. segment was the test and the remaining three segments

constituted the reference (1-2,3,4). However, in order to make the

results of the discriminant analysis technique directly comparable with

the others, the analysis was conducted again with the fourth segment

being treated as the test sample (4-1,2,3). The results are contained

in Tables 1 and 2; a 76.0% identification level was achieved when the

first segment was the test (1-2,3,4) and 84.0% in the

other case (4-1,2,3). In addition, a similar test was conducted with

a group of 25 talkers. The levels of correct identification were

80.0% (see Table 1) when the test sample was the first segment and

92.0% when the fourth segment was used. For both groups, there was

some improvement when the last segment was considered to be the test;

however, the levels of identification indicate that reasonably similar

results would be obtained for any single test sample vs the three

remaining segments.

Table 2. A Comparison of the Classification of Speakers by Means of
Long-Term Speech Spectra Using Euclidean Distance, Cross-
Correlation and Discriminant Analysis Techniques (N=50).

                                 Correct Identification (in %)

Classification Technique          Fullband     Filtered

Euclidean Distance                  96.0         70.0
Cross-Correlation                   96.0
Discriminant Analysis*             100.0         84.0

* In this case, only to be consistent with the approach used by the
other authors, the first, second and third 32-second samples were
used as the reference and the fourth was used as the test sample
(4-1,2,3).

Speaking Fundamental Frequency (SFF)

In all cases where SFF is employed in the discriminant analysis,

the number of speakers is reduced from 50 to 43 for Group A and from

25 to 20 for Group B because the SFF vector could not be extracted for

all subjects. FFI was not able to extract reliable measures of

fundamental frequency in several cases because there was excessive

background noise or weak signals on the recordings.

Table 3 indicates that the posterior classification of the ref-

erence samples, 79.1% for the larger group and 68.3% for the smaller

one, is well below the level achieved by the LTS vector. On this

basis, SFF would not be expected to be as effective in discriminating

speakers as would LTS. Further, the posterior classification of the

large group is higher than for the small one indicating that for SFF,

the within talker variability was less and/or the between talker vari-

ability was greater for the large group. This finding was unexpected

since the talkers in the group of 50 were college students who were

18-25 years of age and the group of 25 ranged in age from 25 to 45.

In this regard, Hollien and Shipp (1972) have shown that the mean

fundamental frequency of males drops from 120 Hz to 107 Hz from the

ages of 25 to 45 thus indicating that there should be greater

differences between members of Group B. Therefore, the within group

variability of Group A must be less than that of Group B.

Table 3. Classification of Observations Based on Speaking Fundamental
Frequency (SFF).

                                    Correct Identification (in %)

                                  Group A           Group B
                                   N=50*             N=25*
                                  Normal    Normal   Stress   Disguise

Posterior Classification
of "Knowns"                        79.1      68.3

Identification of "Unknowns"       30.2      35.0     30.0      10.0

* SFF could not be extracted for all talkers; hence, the identification
scores are based on N=43 for Group A and N=20 for Group B.

Table 3 also presents results of the classification of the normal

test samples from both groups. These data reveal much lower levels of

identification, about one-in-three correct for the normal productions.

Thus, it would appear that the SFF vector is unsatisfactory for differ-

entiating talkers when used alone. Further, these scores are slightly

(and artificially) inflated since the number of subjects is smaller for

both groups. Moreover, SFF demonstrates a diminished ability to identify

talkers in the presence of the production distortions, stress and dis-

guise. The reduction due to stress is minimal but is exceptionally

severe when the talker is attempting a disguise. Apparently, the ex-

posure to a stressful situation only mildly alters SFF. However,

speakers seem to be conscious of pitch as a vocal quality and frequently

change it to disguise their voices. Hence, SFF appears to be affected

in some degree by stress and is dramatically altered during disguise,

resulting in a decrease in this vector's effectiveness.
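
The percentages in Table 3 can be converted back to numbers of talkers,
and compared with a chance level, by the arithmetic sketched below. The
chance figure assumes that a random assignment among N equally likely
talkers is the appropriate baseline, which the text does not state. The
function names are hypothetical.

```python
def correct_count(percent, n):
    """Number of correctly identified talkers implied by a percentage."""
    return round(percent * n / 100.0)


def chance_percent(n):
    """Expected percent correct if talkers were assigned at random."""
    return 100.0 / n


group_a_normal = correct_count(30.2, 43)   # Group A, SFF, normal speech
group_b_normal = correct_count(35.0, 20)   # Group B, SFF, normal speech
group_b_disguise = correct_count(10.0, 20)  # Group B, SFF, disguise
chance_reduced = chance_percent(43)
chance_full = chance_percent(50)
```

Under these assumptions the reduced Group A has a slightly higher chance
level than the full group of 50, which is consistent with the remark
that scores based on fewer subjects are somewhat inflated.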

The measures of speaking fundamental frequency employed in this

study were only about one-half as effective as those chosen by Atal

(1972). Thus, it would appear that fo and its variability do not

adequately characterize individual speakers but the talker's pattern

of change in fundamental frequency throughout an utterance is somewhat

unique for an individual.

Speaking Time (ST)

The ST vector is composed of two measures: the amount of time

that the talker is producing acoustic energy above a threshold and the

amount of information he transmitted in each 32-second interval. The

second measure could be extracted in either of two ways: (1) the num-

ber of words and (2) the number of phonemes. These are related mea-

sures; only one need be included in the analysis. It is far easier

to determine word counts than phoneme counts; however, if the length

of words in a particular reading is different than the others, the

word count will vary independently of a speaker's speech rate. To

examine whether word or phoneme counts were more stable and thus pro-

vide more accurate identifications, discriminant analyses were conducted

with each parameter used with the other measure in the ST vector. Table

4 contains the results. From this limited amount of data, it is difficult

to make a definitive statement concerning the difference in the

sensitivity of the two parameters to speaker recognition. However,

there is an indication that the phoneme count is a somewhat more stable

measure. Therefore, all other analyses are performed using the number

of phonemes produced per interval -- as well as the measure of the

amount of time acoustic energy is generated during the utterance.

Table 5 shows the results of the discriminant analysis for both

groups of speakers based on ST. Again, the posterior classifications

of the reference samples indicate that this vector does not accurately

characterize each speaker. As would be expected with such poor group

definition, ST, with scores of 12.0 and 20.0% for Group A and Group B,

respectively, is an unsatisfactory vector upon which identifications

are to be made.

Table 4. Classification of Speakers Based on Two Forms of Speaking
Time Vector (N=50).

                                 Correct Identification (in %)

Form of ST Vector                 Reference      Test

Time and word count                 19.3          8.0
Time and phoneme count              26.7         12.0

Table 5. Classification of Observations Based on Speaking Time (ST).

                                    Correct Identification (in %)

                                   N=50              N=25
                                  Normal    Normal   Stress   Disguise

Posterior Classification
of "Knowns"                        26.7      56.0

Identification of "Unknowns"       12.0      20.0     12.0      16.0

Combined Vectors

The posterior classification of the reference samples, as listed

in Table 6, suggests that combining the vectors (LTS (passband), SFF

and ST) improved the definition of each group of talkers. Since all

posterior classifications produce levels of greater than 90% correct

identification, it would appear that the merging of two or more individual

vectors into a simple, larger vector should provide more accurate

speaker identifications.

Table 6. Classification of Observations Based on Various Combinations
of the Limited Passband Long-Term Power Spectra (LTS), Speaking
Fundamental Frequency (SFF) and Speaking Time (ST).

                                    Correct Identification (in %)

Experimental                       N=50*             N=25*
Condition                         Normal    Normal   Stress   Disguise

Posterior Classification
of "Knowns"

  LTS x SFF                       100.0     100.0
  LTS x ST                         99.3     100.0
  SFF x ST                         91.5      98.3
  LTS x SFF x ST                  100.0     100.0

Classification of "Unknowns"

  LTS x SFF                        97.7      85.0     55.0      20.0
  LTS x ST                         84.0      88.0     64.0      28.0
  SFF x ST                         55.8      45.0     40.0      15.0
  LTS x SFF x ST                  100.0     100.0     60.0      15.0

* SFF could not be extracted for all talkers; hence, wherever SFF is
used, identification scores are based on N=43 for Group A and N=20
for Group B.

LTS and SFF. As noted, all analyses involving the SFF vector are

based on 43 rather than 50 speakers (Group A) and 20 rather than 25

speakers (Group B). Therefore, some of the increase in identification

levels shown in Table 6 (for the SFF factor used in combination with

others) may be related to this reduction in the number of speakers.

However, even when this relationship is tempered by such considerations,

it is apparent that the LTS/SFF combination vector improves speaker

identification for normal productions. A comparison of the data contained in

Tables 1 and 3 with those in Table 6 shows that the identification

levels for LTS x SFF increased 21.7 and 5.0 percentage points above LTS

alone and 67.5 and 50.0 percentage points above SFF alone for Groups A

and B, respectively. On the other

hand, for the stress and disguise productions, the impact of this com-

bination vector is not clear. First, when LTS is combined with SFF,

it is notable that the level of identification is approximately doubled

for stress and disguise. Undoubtedly, the change is due to the addi-

tion of LTS. Conversely, however, speaker identification, under condi-

tions of stress or disguise, remains essentially unchanged when SFF is

added to LTS. The apparent depression of 5.0% for the stress condition

seems to be related to the reduced speaker population; i.e., since four

of the five subjects whose SFF could not be extracted had been correctly

identified using LTS alone, the correct identification level for the

LTS vector on the population of 20 could be less than was found for the

complete group of 25. Under conditions of production distortion, LTS

apparently is not aided by the addition of SFF.
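
The argument about the reduced population can be checked with simple
arithmetic. The sketch assumes that removing five talkers leaves the
classification of the remaining 20 unchanged, which need not hold
exactly because the discriminant functions are recomputed for the
smaller group. The function name is hypothetical.

```python
def rate_after_removal(percent, n, removed, removed_correct):
    """Percent correct after dropping talkers from a scored population.

    percent and n describe the original score; removed talkers are taken
    out, of whom removed_correct had been identified correctly.
    """
    correct = round(percent * n / 100.0)
    return 100.0 * (correct - removed_correct) / (n - removed)


# LTS (passband) under stress: 60.0% of 25 talkers. Five talkers lacked
# SFF data, and four of those five had been correctly identified.
lts_on_reduced_group = rate_after_removal(60.0, 25, 5, 4)
```

On these assumptions LTS alone would score 11 of 20 on the reduced
group, the same level reported for LTS x SFF under stress, so the
apparent drop need not reflect any harm done by adding SFF.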

LTS and ST. The results of discriminant analyses based on the

band limited LTS vector and ST also are listed in Table 6. In this

case, there is an increase in the correct identification rates for each

group for the combined vector relative to when either LTS or ST is used

alone. Again, when comparisons are made of the results shown in Table

6 to those in Tables 1 and 5, this paired vector is slightly better than

LTS alone. Moreover, the large improvement of LTS x ST over ST alone

apparently is attributable to the initial level of LTS. Further, ST

seems to have some impact on the identification of speakers -- especially

when it is combined with LTS -- who are disguising their voices.

Admittedly, the scores are low and may reflect little beyond chance

correct classifications. Nevertheless, LTS x ST appears to be the "best"

vector combination when attempts are made to determine the identity of

a talker producing a disguised voice.

SFF and ST. Again, these analyses are conducted with the reduced

subject population. The results for this vector pair can be seen in

Table 6; the findings for the separate vectors are contained in Tables

3 and 5. Identification of the normal speech samples for both groups

is about 50.0%, a level substantially better than when either vector is

used alone. The improvement resulting from pairing implies that the

information contained in SFF is, in fact, different than the information

attributable to ST. As a result, confusions based on the processing of

one vector are sometimes resolved by the other vector. The identifi-

cation of speakers who are under stress or attempting to disguise their

voices remained at an unacceptably low level with this vector pair.

Due to the reduced number of speakers evaluated when SFF is used,

it is difficult to state the effect of combining these vectors when the

speech is distorted. The changes are slight and may reflect the re-

moval of a disproportionate number of correctly identified talkers.

LTS and SFF and ST. Finally, all three vectors were combined in

one analysis for each group of speakers. Using the reduced subject

populations, Table 6 shows that all normal test samples were cor-

rectly classified by use of this triple vector. However, LTS (full-

band) alone had achieved this same result with the larger (complete)

set of talkers. The exact number of parameters -- between 12 and 23 --

from the complete LTS vector necessary to reach all correct classifica-

tions was not tested. However, when the full vector (23 parameters) is

used, all normal samples are properly classified. When using the com-

bined LTS x SFF x ST only 15 parameters are used to reach that level

of classification.
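
The parameter count for the triple vector follows from joining the
three vectors end to end. The sketch below assumes simple concatenation
with no rescaling, which the text does not spell out. The names and
example values are hypothetical.

```python
def combine_vectors(lts_passband, sff, st):
    """Join the passband LTS, SFF and ST vectors into one observation."""
    if len(lts_passband) != 11 or len(sff) != 2 or len(st) != 2:
        raise ValueError("expected 11 LTS, 2 SFF and 2 ST parameters")
    return list(lts_passband) + list(sff) + list(st)


# Hypothetical values: 11 band levels, (mean fo, pitch sigma),
# (phonation time in ms, phonemes per segment).
combined = combine_vectors([40.0] * 11, [110.0, 2.5], [24500.0, 410.0])
```

The result has 11 + 2 + 2 = 15 parameters, compared with 23 for the
fullband LTS vector.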

Correct identification rates for stress and disguise are slightly

lower in some cases than those observed for previous analyses using

subvectors. However, as noted for the LTS x SFF paired vector, the

apparent reduction may result from having removed a number of correctly

identified talkers from the population when SFF could not be extracted.

In any case, it must be said that the LTS x SFF x ST triple vector is

not capable of identifying talkers when the voices are disguised.

Further, this triad of vectors is only as successful at differenti-

ating among talkers who are under stress as is the LTS (passband) alone.

Indeed, LTS (fullband) is the most accurate of all vectors or vector

combinations when used to identify stressed speakers.

Further Discussion

Generally, the results of this study indicate that LTS (fullband)

is the most effective acoustic measure upon which to base speaker

identification -- it properly classified all observations of normal

speech productions. Furthermore, LTS (fullband) produced the highest

score for the stress condition and one of the highest for the disguise

condition.


When LTS (fullband) was reduced from 23 to 11 parameters to simulate

a limited passband, the identification level remained markedly better

than that of either of the other two vectors, ST and SFF, when they were

evaluated separately. LTS (passband) also performed at a higher level

(than the other single vectors) for the conditions of stress and dis-

guise. Moreover, ST appears to be the poorest vector for determining

a talker's identity with one exception, namely, when the speaker is

attempting to disguise his voice. However, the identification level

for ST under conditions of disguise may be little more than a chance

finding. At any rate, while the ST scores used alone are too low to

be of practical utility, the possibility remains that ST may ultimately

provide better discrimination of disguised voices when combined with

other parameters.

The SFF parameter, although better than ST, also proved to be

unsatisfactory when used alone. Indeed, correct classification rates

were low for both test and reference samples. SFF alone cannot differ-

entiate talkers but may be useful as an adjunct to another vector.

The relative levels of identification for the pairwise combina-

tions of the three vectors remain unchanged from their individual per-

formance. LTS (passband) is the most powerful -- of those fully

tested -- regardless of which other vector is associated with it. And,

in most cases, correct identification rates improve. Further, SFF gen-

erally produces better results than ST when each is paired with LTS

except that ST and LTS combine to produce the highest identification

rates for disguised voices. The combination of ST and SFF produces

substantially better results than would be predicted from their indi-

vidual scores. As previously noted, the information contained in each

set of measures seems to be independent and their pairing eliminates

some of the confusions present in the separate cases.
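
As an illustration only, the reduction of the LTS vector from 23 to 11 parameters and the pairing of vectors can be sketched in Python. The text does not state which 11 bands were retained or how the parameter sets were joined for analysis, so the contiguous band selection, the simple concatenation, the function names and every numeric value below are assumptions made for the example, not the procedure used in the study.

```python
def limit_passband(lts_fullband, first_band, n_bands):
    """Keep a contiguous run of bands to mimic a limited passband.

    Which bands the study retained is not stated; a contiguous slice
    is an assumption made for this sketch.
    """
    if first_band < 0 or first_band + n_bands > len(lts_fullband):
        raise ValueError("requested bands fall outside the full-band vector")
    return lts_fullband[first_band:first_band + n_bands]


def combine_vectors(*vectors):
    """Join several parameter vectors into one observation vector."""
    combined = []
    for vector in vectors:
        combined.extend(vector)
    return combined


# Hypothetical values, not data from the study.
lts_fullband = [float(50 - band) for band in range(23)]  # 23 band levels
lts_passband = limit_passband(lts_fullband, first_band=4, n_bands=11)
sff = [118.0, 2.9]  # hypothetical SFF parameters
st = [31.5]         # hypothetical ST parameter

# One observation for the LTS (passband) + SFF + ST combination.
observation = combine_vectors(lts_passband, sff, st)
```

In this sketch a pairwise combination is obtained by passing only two of the three vectors to `combine_vectors`.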

Another objective of this study was to determine whether various

statistical analyses of the data produced different results; it was

possible to do a partial evaluation since other analyses have been

carried out on the LTS (fullband). As noted previously, comparisons

were made between Euclidean distance (Majewski and Hollien, 1974),

cross-correlations (Zalewski et al., in press) and a discriminant

analysis. Comparisons could be made only for the LTS vector, and in

this case, there was a slight increase in correct identification be-

tween the lower rate achieved using the Euclidean distance and cross-

correlation techniques and the higher rate achieved with discriminant

analysis. While the differences between the three techniques are not

dramatic, the discriminant analysis did produce the most correct

identifications. Further, discriminant analysis is considered by

some researchers (e.g., Bricker et al., 1971) to be appropriate to the

speaker identification problem. Since the discriminant analysis can

be found in general purpose statistical packages, utilization of this

analysis procedure is both convenient and appropriate. Further, mod-

ification of the parameter set, i.e., the addition, deletion or altera-

tion of the vector, is a straightforward matter requiring only minor

changes in the program statements. To perform a similar change in a

user-generated program could require substantial changes in the estab-

lished routine.
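
To make the comparison concrete, the two simpler decision rules can be sketched in Python: a test vector is assigned to the reference talker at the smallest Euclidean distance, or to the one with the highest correlation. This is not the implementation of Majewski and Hollien (1974) or of Zalewski et al. (in press). The function names, the five-band vectors and the talker labels are hypothetical, and the discriminant analysis itself, which was run from a statistical package, is not reproduced.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def identify_by_distance(test, references):
    """Pick the reference talker closest to the test vector."""
    return min(references, key=lambda name: euclidean_distance(test, references[name]))


def identify_by_correlation(test, references):
    """Pick the reference talker most correlated with the test vector."""
    return max(references, key=lambda name: correlation(test, references[name]))


# Hypothetical five-band spectra in dB, not data from the study.
references = {
    "talker_A": [60.0, 55.0, 48.0, 40.0, 35.0],
    "talker_B": [58.0, 57.0, 52.0, 38.0, 30.0],
    "talker_C": [62.0, 50.0, 45.0, 44.0, 39.0],
}
test_sample = [59.5, 55.5, 48.5, 40.5, 34.0]

by_distance = identify_by_distance(test_sample, references)
by_correlation = identify_by_correlation(test_sample, references)
```

Both rules compare a test vector with one stored reference per talker, so changing the parameter set only changes the vectors passed in.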


Finally, it should be pointed out that only one of the speech analysis

systems utilized in this research is linked directly to a computer.

This means that all of the values for each of the remaining parameters

must be recorded manually and handled separately for computer pro-

cessing. If the results of each type of analysis could be fed directly

to a computer, then all three vectors could be extracted simultaneously

and stored temporarily. Simple listings or additional processing then

could be conducted rather easily and without the susceptibility to

human errors that exists in the present system.



Summary and Conclusions

The primary concern of this research was to evaluate the

effectiveness of specified vectors -- either singly or in various

combinations -- to correctly identify talkers from their speech alone. The

effectiveness of the acoustic/temporal characteristics of the voice

(long-term speech spectra, speaking fundamental frequency and speaking

time) was evaluated with and without distortions. The distortions

included a limited frequency band for transmission, and stress and dis-

guise for production.

The results indicate that LTS is generally the most powerful vec-

tor upon which to base speaker identifications. In most cases, SFF,

although not satisfactory, seems to be a better discriminator than ST.

With few exceptions, speaker identification improved when vectors were

combined. Finally, although ST performed poorly, it may be a useful

adjunct to other parameters in identifying talkers who are trying to

disguise their voices.

Several conclusions may be drawn from this study. It must be con-

cluded that, since none of the vectors or combinations of vectors could

sustain high levels of identification for all conditions of speech pro-

duction, the vectors examined in this study are not measuring an in-

variant characteristic of a speaker's voice. Obviously, the production

of speech signals under stress or disguise results in significant changes

in spectral composition, fundamental frequency and speaking time. Con-

sidering the level of identification observed with LTS, speakers may

operate within a limited portion of their range during most of their

speech. However, noting the decrease in identifications when their pro-

ductions are distorted, it is obvious that they -- either actively or

passively -- modify the speech signal for each of the acoustic/temporal


Secondly, ST seems to have a greater capacity to determine the

identity of talkers who are attempting a disguise than would be ex-

pected from its performance with normals. If the results under dis-

guise are not a chance finding, it would appear, as hypothesized, that

talkers do not attend to that temporal factor while they are attempting

to disguise their voice. For this reason, additional research seems

warranted to improve the overall performance level of ST.

In view of the observed speaker identification levels, the full-

band LTS vector used alone seems to be the best procedure for identi-

fying talkers, i.e., it produces the highest correct identification

rates except when a disguise is attempted. Therefore, unless further

research can demonstrate that SFF and ST can produce acceptable ident-

ification levels -- even if only for specified conditions -- it would

appear sufficient to extract only the LTS vector from the speech signal.


Finally, LTS (fullband) was selected to test for possible differ-

ences in speaker identification resulting from the use of a specific

analysis technique. For reasons of appropriateness, accuracy and ease

of data handling, the packaged discriminant analysis seems more effec-

tive than the Euclidean distance method of Majewski and Hollien (1974) or

the cross-correlation technique of Zalewski et al. (in press).


Adapted from: An Apology For Idlers

By Robert Louis Stevenson

If you look back on your own education, I am sure it will not
be the full, vivid, hours of truancy that you regret. You would
rather cancel out some of the lack-luster periods between sleep and
waking that you experienced in school. For my own part, I have
attended a good many lectures in my time---I still remember that the
spinning of a top is a case of kinetic stability. But though I would
not willingly part with such scraps of science, I do not set the same
store in them as by certain other odds and ends that I came upon in
the open street while I was playing truant.

Extreme busyness, whether at school or college, church or market,
is a symptom of deficient vitality. A faculty for idleness implies
a catholic appetite and a strong sense of personal identity. There
are a sort of dead-alive, hackneyed people about, who are scarcely
conscious of living except in the exercise of some conventional occu-
pation. Bring these fellows into the country, or set them on board
ship, and you will see how they pine for their desk or their study.
They have no curiosity; they cannot give themselves over to random
provocations nor do they take pleasure in the exercise of their fac-
ulties for its own sake. Unless necessity lays about them with a
stick, they will even stand still. It is no good speaking to such
folk. They cannot be idle; their nature is not generous enough. They
pass those hours, which are not dedicated to furious toiling in the
gold-mill, in a sort of coma. When they do not require to go to the
office, when they are not hungry or have no mind to drink, the whole
breathing world is a blank to them. If they have to wait an hour or
so for a train, they fall into a stupid trance with their eyes open.
To see them you would suppose there was nothing to look at and no one
to speak with. You would imagine they were hypnotized or frozen. Yet,
very possibly they are hard workers in their own way, and have good
eyesight for a flaw in a deed or a turn of the market. They have been
to school and college, but during all that time they had their eye only
on their grades. They have gone about in the world and mixed with
clever people, but all the time they were thinking of their own af-
fairs. As if a man's soul were not too small to begin with, they have
dwarfed and narrowed theirs by a life of all work and no play. Here
they are forty, with a listless attention, a mind vacant of all mater-
ial of amusement, and not one thought to rub against another while they
wait for that train. Before he grew up he might have clambered on boxes.
When he was twenty he would have stared at the girls. But now the pipe
is smoked out, the snuff-box empty, and my gentleman sits bolt upright
upon a bench with vacant eyes. This does not appeal to me as being a
"Success in Life."

But it is not only the person himself who suffers from his busy
habits, but his wife and children, his friends and relations, and even
the very people he sits with in a railway carriage or a bus. Perpetual
devotion to what a man calls his "business" is only to be sustained by
perpetual neglect of many other things. In fact, it is not by any
means certain that a man's "business" is the most important thing he
has to do.


Atal, B.S. (1972). "Automatic Speaker Recognition Based on Pitch
Contours," J. Acoust. Soc. Amer., 52, 1687-1697.

Black, J.W., Lashbrook, W., Nash, E., Oyer, H.J., Pedrey, C., Tosi,
O.I. and Truby, H. (1973). "Reply to 'Speaker Identification
by Speech Spectrograms: Some Further Observations,'" J. Acoust.
Soc. Amer., 54, 535-537.

Bolt, R.H., Cooper, F.S., David, E.E., Denes, P.B., Pickett, J.M.
and Stevens, K.N. (1970). "Speaker Identification by Speech
Spectrograms: A Scientist's View of Its Reliability for Legal
Purposes," J. Acoust. Soc. Amer., 47, 597-612.

Bolt, R.H., Cooper, F.S., David, E.E., Denes, P.B., Pickett, J.M.
and Stevens, K.N. (1973). "Speaker Identification by Speech
Spectrograms: Some Further Observations," J. Acoust. Soc. Amer.,
54, 531-534.

Bricker, P.D., Gnanadesikan, R., Mathews, M.W., Pruzansky, S., Tukey,
P.A., Wachter, K.W. and Warner, J.L. (1971). "Statistical Tech-
niques for Talker Identification," The Bell System Technical
Journal, 50, 1427-1454.

Bricker, P.D. and Pruzansky, S. (1966). "Effects of Stimulus Con-
tent and Duration on Talker Identification," J. Acoust. Soc.
Amer., 40, 1441-1449.

Davitz, J.R. and Davitz, L.J. (1959). "The Communication of Feelings
by Content-free Speech," J. Communication, 9, 6-13.

Gubrynowicz, R. (1973). "Application of a Statistical Spectrum
Analysis to Automatic Voice Identification," Speech Analysis and
Synthesis, ed. W. Jassem, Vol. 3, Polish Academy of Sciences,
Institute of Fundamental Technology Research, Warsaw, Poland,

Holbrook, A. (1973). "Measurement of Daily Talking Time with VIC
(Voice Intensity Controller)," Amer. Speech and Hearing Associa-
tion Convention.

Hollien, H. (1974). "The Peculiar Case of 'Voiceprints,'" J. Acoust.
Soc. Amer., 56, 210-213.

Hollien, H., Hollien, P. and Majewski, W. (1974). "Analysis of Fundamental
Frequency as a Speaker Identification Technique," Convention of the
American Association of Phonetic Sciences, St. Louis, Mo. (and in
manuscript).

Hollien, H., Majewski, W. and Hollien, P. (1974). "Speaker Identifi-
cation by Long-Term Spectra under Normal, Stress and Disguise
Conditions," J. Acoust. Soc. Amer., 55, 5-20.

Hollien, H. and Shipp, T. (1972). "Speaking Fundamental Frequency
and Chronologic Age in Males," J. Speech and Hearing Research,
15, 155-159.

Miles, M. (1972). "Speaker Identification as a Function of Fundamental
Frequency and Resonant Frequencies," Ph.D. Dissertation, University
of Florida.

Kersta, L.G. (1962). "Voiceprint Identification," Nature, 1253-1257.

Kosiel, U. (1973). "Statistical Analysis of Speaker-Dependent Differ-
ences in the Long-Term Average Spectrum of Polish Speech," Speech
Analysis and Synthesis, ed. W. Jassem, Vol. 3, Polish Academy of
Sciences, Institute of Fundamental Technology Research, Warsaw,
Poland, 181-208.

LaRiviere, C.L. (1974). "Speaker Identification from Turbulent Por-
tions of Fricatives," Phonetica, 29, 246-252.

LaRiviere, C.L. (1975). "Contribution of Fundamental Frequency and
Formant Frequencies to Speaker Identification," Phonetica, 31,

Majewski, W. and Hollien, H. (1974). "Euclidean Distance Between
Long-Term Speech Spectra as a Criterion for Speaker Identifica-
tion," Proceedings Speech Communication Seminar-74, Stockholm,
Sweden, 3, 303-310.

Majewski, W., Hollien, H. and Doherty, E.T. (Unpublished manuscript).
"Perceptual Identification of Voices Under Normal, Stress and
Disguised Speaking Conditions."

Pollack, I., Pickett, J.M. and Sumby, W.H. (1954). "On the Identi-
fication of Speakers by Voice," J. Acoust. Soc. Amer., 26, 403-

Poza, F. (1973). "Comments on Voiceprint Identification," California
Association of Criminalists.

Sambur, M.R. (1973). "Speaker Recognition and Verification Using
Linear Prediction Analysis," Speech Communication, XIX, 261-268.

Stevens, K.N., Williams, C.E., Carbonell, J.R. and Woods, B. (1968).
"Speaker Authentication and Identification: A Comparison of
Spectrographic and Auditory Presentations of Speech Material,"
J. Acoust. Soc. Amer., 44, 1596-1607.

Tosi, O. (1974). "Voice Identification: An Overview," Criminal Law
and Urban Problem Course Handbook Series, No. 73,
Law Institute, New York City.

Tosi, O., Oyer, H., Lashbrook, W., Pedrey, C., Nichol, J. and Nash,
E. (1972). "Experiment on Voice Identification," J. Acoust.
Soc. Amer., 51, 2030-2043.

Vanderslice, R. (1966). "The 'Voice Print' Game," UCLA Working Papers
in Phonetics.

Wolf, J.J. (1972). "Efficient Acoustic Parameters for Speaker Recog-
nition," J. Acoust. Soc. Amer., 51, 2044-2056.

Zalewski, J., Majewski, W. and Hollien, H. (in press). "Cross-Corre-
lation of Long-Term Speech Spectra as a Speaker Identification


Biographical Sketch

Edward Thomas Doherty was born December 18, 1936, at Boston,

Massachusetts. In June, 1954, he was graduated from Boston College

High School. From 1954 until 1958 Mr. Doherty was employed as a lab-

oratory technician at Cambridge, Massachusetts. From 1959 until 1961

he served in the United States Army and was stationed at Fort Sill,

Oklahoma. Following his discharge from the Army, he began his college

career and in June, 1965, he received the degree of Bachelor of Science

from Bridgewater State College. In 1965 he enrolled in the Graduate

School of the University of Florida. From September, 1965, until

March, 1968, he worked as a research trainee in the Department of

Speech and in August, 1967 received the degree of Master of Arts from

the University of Florida. From April, 1968 until June, 1970, he was

employed as a research associate at the Communication Sciences Labor-

atory. From July, 1970, until August, 1973, he was employed as a

research associate at the Speech Research Laboratory, San Francisco,

California. In September, 1973, he returned to the University of

Florida where, until the present time, he has worked as a research

trainee while he pursued his work toward the degree of Doctor of

Philosophy.


I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation and is fully
adequate, in scope and quality, as a dissertation for the degree of
Doctor of Philosophy.

Harry Hollien, Chairman
Professor of Speech

I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation and is fully
adequate, in scope and quality, as a dissertation for the degree of
Doctor of Philosophy.

Donald C. Teas
Professor of Psychology

I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation and is fully
adequate, in scope and quality, as a dissertation for the degree of
Doctor of Philosophy.

Arnold Paige
Associate Professor of Electrical

I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation and is fully
adequate, in scope and quality, as a dissertation for the degree of
Doctor of Philosophy.

Howard B. Rothman
Assistant Professor of Speech

This dissertation was submitted to the Graduate Faculty of the Depart-
ment of Speech in the College of Arts and Sciences and to the Graduate
Council, and was accepted as partial fulfillment of the requirements for
the degree of Doctor of Philosophy.

August, 1975

Dean, Graduate School
