Citation
Temporal parameters within the speech signal applied to speaker identification

Material Information

Title:
Temporal parameters within the speech signal applied to speaker identification
Creator:
Johnson, Charles Clifford, 1948-
Publication Date:
Language:
English
Physical Description:
xv, 115 leaves : ill. ; 28 cm.

Subjects

Subjects / Keywords:
Audio frequencies ( jstor )
Consonants ( jstor )
Discriminant analysis ( jstor )
Discriminative listening ( jstor )
Energy levels ( jstor )
Listening ( jstor )
Signals ( jstor )
Spoken communication ( jstor )
Telephones ( jstor )
Vowels ( jstor )
Automatic speech recognition ( lcsh )
Dissertations, Academic -- Speech -- UF ( lcsh )
Speech thesis Ph. D ( lcsh )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis--University of Florida.
Bibliography:
Bibliography: leaves 111-114.
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Charles Clifford Johnson, Jr.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
The University of Florida George A. Smathers Libraries respect the intellectual property rights of others and do not claim any copyright interest in this item. This item may be protected by copyright but is made available here under a claim of fair use (17 U.S.C. §107) for non-profit research and educational purposes. Users of this work have responsibility for determining copyright status prior to reusing, publishing or reproducing this item for purposes other than what is allowed by fair use or other copyright exemptions. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. The Smathers Libraries would like to learn more about this item and invite individuals or organizations to contact the RDS coordinator (ufdissertations@uflib.ufl.edu) with any additional information they can provide.
Resource Identifier:
022962102 ( ALEPH )
05360223 ( OCLC )













TEMPORAL PARAMETERS WITHIN THE SPEECH SIGNAL
APPLIED TO SPEAKER IDENTIFICATION









By

CHARLES CLIFFORD JOHNSON, JR.


A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF THE
UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY









UNIVERSITY OF FLORIDA

1978
































Copyright

1978




























To my wife, Christine, and my children

Charles III, and Cristen.














ACKNOWLEDGEMENTS


I would like to express my sincerest thanks to Dr. Harry Hollien. He guided me through very trying times and through his experience with students and insight into research problems has, more than anyone else, made this dissertation possible. Drs. Howard Rothman and Sam Brown also deserve special thanks for being friends as well as members of my committee. I also acknowledge Drs. Alan Agresti, William Sanders, and Donald Dewsbury for taking the time to be members of my committee.

I would like to acknowledge Drs. E. Thomas Doherty and Gilbert Tolhurst for freely giving of their time and experience during this dissertation. I also express thanks to the entire staff of the Institute for Advanced Study of the Communication Processes, especially Amy Burnett, Dave Campbell, Kathy Farley, Carlos Febles, Jim Fitzgerald, Norman Green, Patti Hollien, Angela Hunt, Bob Idzikowski, Arlene Malick, Debbie Martin, Sally Potter, Russ Pierce, and Jamie Stone. The above mentioned people were listed in alphabetical order since there is no way in which each person's contribution to this dissertation could be measured.










Special thanks go to the Alachua County Sheriff's Department, especially Captain Wes Schellenger. Extra special thanks go to all my friends who have given me much needed encouragement, guidance, and some very helpful insight. Thank you Jim Hicks, Paul Graycar, Alan Smith, Rich Hill, and Brian Klepper.

Finally and certainly not least, I would like to thank my wife, Christine. She has sacrificed greatly for me, encouraged me in the worst of times and never given up on me. To her and her alone do I owe the most, a debt I will never be able to repay.















TABLE OF CONTENTS


ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

I INTRODUCTION

Review of the Literature

Aural/Perceptual Speaker Identification
Spectrographic Speaker Identification
Automatic or Semi-Automatic Speaker Identification
Spectral Analysis (Long-term Speech Spectra)
Fundamental Frequency
Formant Frequency
Temporal Parameters

Objectives

II METHODS

Temporal Parameters

Time-Energy Distribution (TED)
Voiced/Voiceless Speech Time Vector (VVL)
Vowel/Consonant Duration Ratio (V/C)
Word and Phrase Durations (WPD)

Multiple Vectors

Experiment One

Laboratory, Normal (LN)
Subjects
Speech Material
Procedure

Experiment Two

Laboratory Distorted Speech (LD)
Subjects and Speech Material
Procedure

Experiment Three

Semi-Field Conditions (SF)
Subjects and Speech Material
Procedure

Statistical Analysis

III THE RESULTS AND DISCUSSION OF THE LABORATORY-NORMAL EXPERIMENT

Results
Discussion

Single Vector Effectiveness
Multiple Vector Effectiveness
Parameter Selection
Test Sample Selection

IV THE RESULTS AND DISCUSSION OF THE LABORATORY-DISTORTED SPEECH EXPERIMENT

Results

Normal Speaking Condition
Stress Speaking Condition
Disguised Speaking Condition

Discussion

Normal Speaking Condition
Stress Speaking Condition
Disguised Speaking Condition
Parameter Selection Techniques
Test Sample Selection

V THE RESULTS AND DISCUSSION OF THE SEMI-FIELD EXPERIMENT

Results
Discussion

VI SUMMARY AND CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH














LIST OF TABLES


Table

1 Parameters Measured and Investigated for Possible Utilization as the Time-Energy Distribution (TED) Vector; Speech Bursts Represent the Portion of the Speech Energy Which Is Above a Given Relative Intensity Level and Pause Periods Are Defined as Those Areas Between the Speech Bursts at a Given Energy Level Where the Energy Falls Below the Given Level

2 Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the Time-Energy Distribution (TED), Voiced/Voiceless Speech Time (VVL), Vowel/Consonant Duration Ratio (V/C), and Word and Phrase Duration (WPD) Vectors on an Individual Basis

3 Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the TED, VVL, V/C, and WPD Vectors in All Possible Pairs

4 Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the TED, VVL, V/C, and WPD Vectors in All Three- and Four-Vector Combinations

5 Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the Time-Energy Distribution (TED), Voiced/Voiceless Speech Time (VVL), Vowel/Consonant Duration Ratio (V/C) and Word and Phrase Duration (WPD) Vectors in the Normal Speaking Condition

6 Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the TED, VVL, V/C, and WPD Vectors in the Stress Speaking Condition

7 Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the TED, VVL, V/C, and WPD Vectors in the Disguised Speaking Condition

8 Identification Scores for the Laboratory-Distorted Speech Experiment Obtained Utilizing the Fourth Speech Sample as the Test; Only the TED, VVL, and TED X VVL Vectors Were Examined

9 Identification Ranking for the Semi-Field Experiment Obtained Utilizing Discriminant Analysis on the Time-Energy Distribution (TED) and the Voiced/Voiceless Speech Time (VVL) Vectors

10 Identification Rankings for the Semi-Field Experiment Obtained Utilizing Cross-Correlations on the TED and VVL Vectors; the Correlation Coefficients Between the Known and Unknown Speakers Are Also Listed














LIST OF FIGURES


Figure

1 Block diagram of the equipment utilized to generate an energy envelope from the analog speech signal and extract the time-energy distribution (TED) parameters

2 Schematic representation of a typical energy envelope as generated from the TED equipment configuration







Abstract of Dissertation Presented to the Graduate Council
of the University of Florida in Partial Fulfillment of
the Requirements for the Degree of Doctor of Philosophy

TEMPORAL PARAMETERS WITHIN THE SPEECH SIGNAL
APPLIED TO SPEAKER IDENTIFICATION

By

Charles Clifford Johnson, Jr.

August, 1978

Chairman: Harry Hollien
Major Department: Speech

This research project was carried out with the purpose of investigating some of the idiosyncratic speech characteristics which permit an individual to be identified from his voice alone. The specific objectives of this study were to: (1) select and examine certain temporal speech parameters, with reference to their speaker identification capabilities, (2) test the speaker identification effectiveness of the selected parameters under stress and disguise conditions, and (3) examine the effects of simulated field conditions on the speaker identification capabilities of the selected temporal vectors.

From all the possible temporal characteristics which exist within the speech signal, four general sets of parameters were chosen. These temporal vectors included durational analysis of: (1) relative energy at several levels of intensity, (2) voiced and voiceless activity, (3) vowel/consonant ratios, and (4) specific words and phrases. Each of these vectors was composed of from 2 to 40 variables. These temporal vectors were extracted from speech samples generated from three experiments.

The initial experiment was a laboratory-based study. Forty adult males read a standard prose passage while being recorded in an "ideal" laboratory setting. The results of this, the first, experiment demonstrated the time-energy distribution (TED) vector as the most effective of the selected temporal parameters. The voiced/voiceless speech time (VVL), vowel/consonant duration ratio (V/C), and word and phrase duration (WPD) vectors followed in descending order of identification effectiveness.

The second experiment in this research also was laboratory-based. In this case, the subjects (20 adult males) were recorded under conditions similar to those of the first experiment. However, these subjects read the passage in three different manners. These speaking conditions were: (1) normal, (2) stress (applied via electric shock), and (3) free disguise. This experiment resulted in the same order of vector effectiveness as the first experiment. That is, application of the TED vector yielded the highest levels of identification and the VVL, V/C, and WPD vectors followed in effectiveness. In addition, it was found that stress and disguise speaking conditions do reduce the identification power of the selected temporal vectors. It should be noted that, while the disguise condition yielded much lower scores than the normal condition, these scores were higher than those reported in other similar studies.

In the third study, the temporal parameters were investigated under conditions which would parallel those found in the forensic model. A speaker simulated a "crime" over the telephone and a "suspect pool" was created by recording subjects in a simulated interrogation procedure. The findings in this study demonstrated that the vectors were relatively ineffectual in this very restrictive situation. However, the TED and VVL vectors did show some limited potential, indicating that these vectors may, at some later date, be useful in a speaker identification system suitable for the forensic world.

In general, a few overall conclusions can be made based on the findings of the three completed studies.

1. Temporal characteristics found within the
speech signal are important in the speaker
identification process.

2. Certain temporal characteristics are idiosyncratic
of an individual's speech patterns.












3. Stressful and disguised speaking conditions
reduce the levels of identification exhibited
by these selected temporal vectors.

4. The temporal parameters examined in this
research program are less affected than
frequency parameters when a speaker
disguises his voice.

5. The restrictive condition of a simulated field
situation greatly interferes with the identification powers of these temporal vectors.

6. The temporal parameters may be a useful
addition to an established speaker identification system.














CHAPTER I


INTRODUCTION


In general, the most important information contained within the speech signal, produced by a given individual, is the linguistic message. However, this message is by no means the only information transmitted to a listener via the utterance. Data about the speaker's general emotional state, educational background, geographic origin, and/or specific identity also may be provided. All this information is important and warrants investigation. However, it was the focus of this research to examine only the speaker identifying information which is conveyed through the speech signal.

The identification of a speaker from his/her voice alone routinely occurs under a variety of familiar circumstances; for example, in telephone conversations, at cocktail parties, from radio broadcasts, etc. In some situations, it is not just desirable but crucial to be able to extract the speaker's identity from his voice alone. For example, "infallible" speaker identification techniques must be available to the military before voice activated weaponry can be developed and utilized with safety. In addition, businesses, banks, security companies and similar organizations have a need for voice activated computers, electronic devices, and machinery.

Speaker identification is of special interest to law enforcement agencies. Currently, the recording of conversations by criminals and/or suspects is common practice in criminal investigations. In addition, these recorded conversations often are admissible in courts of law. Therefore, a reliable and objective speaker identification system would be of inestimable value in the identification and conviction of criminals.

A substantial amount of research has been carried out in response to the need for speaker identification techniques that are reliable and objective. This research may be classified into three categories: (1) aural/perceptual speaker recognition, (2) visual recognition (spectrogram matching), and (3) machine recognition. The aural/perceptual "method" is simply speaker recognition by listening. That is, the technique utilizes the abilities of the human auditory system and the cognitive powers of the human brain to determine the identity of a speaker. Research examining aural/perceptual speaker recognition has shown it to be quite valid under some circumstances (e.g., when the talker is well known to the listener), but severe limitations also have been demonstrated. A review and discussion of the literature appropriate to aural/perceptual speaker identification will be found below.

The so-called "spectrographic method" of speaker identification uses the aural/perceptual approach combined with a visual pattern matching technique based upon frequency-by-time-by-intensity sound spectrograms ("voiceprints"). These spectrograms are compared by means of a pattern matching procedure. A review of the relevant spectrographic speaker identification literature also will be presented below. However, controversy exists regarding the predictive value of the "voiceprint" research; reviews of this conflict may be found in the following: Black et al., 1973; Bolt et al., 1970, 1973; Hollien, 1974, 1977; and Hollien and McGlone, 1976.










The third general approach currently being employed in speaker identification tasks utilizes sophisticated electronic devices (other than the sound spectrograph). In reality there are many "machine" approaches; however, all appear to exhibit four important advantages: (1) the specific acoustic and/or temporal parameters to be employed may be extracted from the speech signal serially and/or simultaneously, (2) the parameters (or group of parameters) utilized may be used in various combinations, (3) the subjectivity of human analysis is eliminated to a great degree (Hollien and Majewski, 1977), and (4) the analysis can be done to any level of desired accuracy. In sum, the machine approach appears to show the greatest potential for ultimately producing a valid and objective speaker identification system.

It should be pointed out that most research using the machine approach utilizes parameters based on acoustic analysis--particularly frequency spectrum and fundamental frequency--and the importance of these speaker characteristics to speaker identification has been demonstrated in the research literature. On the other hand, there are various temporal parameters that can be extracted from the speech signal; they generally have not been studied for their potential as speaker identification cues. However, temporal speech parameters, such as vowel and consonant duration, have been extensively investigated relative to the area of speech perception. A review of some of the relevant literature may be found in Lehiste (1967).

In summary, speaker recognition has been examined by three approaches: aural/perceptual speaker recognition; spectrographic or "voiceprint" speaker identification; and automatic and semiautomatic recognition. Each of these approaches has been examined in the research literature and some of their advantages and disadvantages explored. A review of the relevant literature follows.


Review of the Literature


Aural/Perceptual
Speaker Identification


One of the earliest systematic attempts to examine speaker identification by listening (aural) was reported by McGehee (1937). In this study she first concealed auditors behind a screen and had a single speaker read (to them) a passage of 56 words. These auditors returned the second day for a second set of trials. In this case they listened to five speakers (including the original talker) reading the same 56-word passage and they were asked to identify which of the speakers was the original talker. McGehee reports a correct identification rate of 83 per cent. The process was repeated at two-week, three-month, and five-month intervals. In these cases the scores for correct identification were 68 per cent, 35 per cent, and 13 per cent, respectively. On this basis, McGehee concluded that the ability to identify speakers aurally deteriorated as a function of time.

In a later study, Pollack, Pickett, and Sumby (1954) investigated certain aspects of the aural speaker identification task. They had normal male talkers of similar age, dialect, and rate of speaking read a single passage both in a whispered and a voiced speaking mode. Comparisons were made between the identification scores of these two speaking conditions. The authors reported that if similar correct identification scores were to be obtained for both conditions, the utterance duration for the whispered passage had to be three times that of the voiced passage. On this basis the authors suggested that duration plays a significant role in the aural speaker identification process; primarily it allows larger samples of the speaker's repertoire to be tested. Indeed, results of the voiced samples showed that duration was important to identification only up to 1200 milliseconds; beyond that point no further improvement in performance was noted. The authors also investigated the effects of low- and high-band pass filtering on listener performance. The findings from this portion of the Pollack et al. study demonstrated that correct identification is not critically dependent upon any delicate balance of frequency components in any single portion of the spectrum.

Compton (1963) also examined the effects of filtering and duration on the aural identification process. He used sustained productions of the vowel /i/ while varying duration and filtering conditions. He reported that duration is a factor in listener performance; specifically, levels of correct identification increased with lengthening of the sample up to a duration of about 1250 milliseconds. This finding is in close agreement with that reported by Pollack et al. (1954). Compton also found that if the speech sample was filtered above 1020 Hz, speaker identification rates were substantially reduced, but filtering below 1020 Hz appeared to have no significant effect on identification performance.










Bricker and Pruzansky (1966) investigated the aural-perceptual effects of both duration and content on speaker identification by listening. Ten male talkers recorded five types of speech samples: excerpted vowels, excerpted consonant-vowel (CV) sequences, monosyllabic words, disyllabic nonsense words, and sentences. Listeners exhibited the highest correct scores for samples of the greatest length (sentences) and identification performance decreased as length of sample decreased. Moreover, better listener performances were obtained with CV speech samples than with vowel excerpts of equal duration. The experimenters inferred that the number of phonemes within a speech sample was of greater importance to identification than its absolute duration.

In the second part of their experiment, Bricker and Pruzansky utilized the vowels /i/ and /a/ as experimental stimuli to study the effects of content in aural identification. Their results indicate that speaker recognition is not independent of the utterance. They also found that the confusion matrix for talker identification was not symmetrical. That is, talker A may be confused with talker B but talker B may not be confused with talker A. These findings raise some interesting questions about the decision criteria utilized by human listeners in determining a given speaker's identity.
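
The notion of a non-symmetrical confusion matrix can be illustrated with a minimal sketch. The counts below are invented purely for illustration and are not data from Bricker and Pruzansky; the names TALKERS, CONFUSIONS, is_symmetric, and asymmetric_pairs are likewise hypothetical. Rows are assumed to represent the talker who actually spoke and columns the talker named by the listener.

```python
# Hypothetical confusion counts, invented for illustration only.
# Rows: talker who actually spoke. Columns: talker named by the listener.
TALKERS = ["A", "B", "C"]
CONFUSIONS = [
    [40, 9, 1],   # talker A is named as B nine times
    [2, 45, 3],   # talker B is named as A only twice
    [1, 4, 45],
]


def is_symmetric(matrix):
    """True if every off-diagonal count equals its mirror-image count."""
    n = len(matrix)
    return all(
        matrix[i][j] == matrix[j][i]
        for i in range(n)
        for j in range(i + 1, n)
    )


def asymmetric_pairs(matrix, labels, min_gap=1):
    """List (talker_i, talker_j, i_named_as_j, j_named_as_i) for every pair
    whose two confusion counts differ by at least min_gap."""
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            forward, backward = matrix[i][j], matrix[j][i]
            if abs(forward - backward) >= min_gap:
                pairs.append((labels[i], labels[j], forward, backward))
    return pairs


if __name__ == "__main__":
    print("symmetric:", is_symmetric(CONFUSIONS))
    for a, b, fwd, back in asymmetric_pairs(CONFUSIONS, TALKERS, min_gap=5):
        print(f"{a} named as {b}: {fwd}; {b} named as {a}: {back}")
```

In this invented example, talker A is frequently named as talker B while the reverse error is rare, which is the kind of one-directional confusion described above.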










As one aspect of a much larger study, Stevens et al.¹ (1968) investigated aural speaker identification using the vowels /i/ and /a/ as speech stimuli. The talkers utilized in this study were homogeneous with respect to their voices; each read isolated words from which the experimental vowels were extracted. Identifications based on the aural portion of this study demonstrated that a word with the front vowel, /i/, was a better cue to the identification of the speaker than a word containing the back vowel /a/. The authors suggested that the higher second formant of /i/ might have aided in the improved listener performances.

Iles (1972) also found that speaker identification scores varied when different vowels were used as speaker-identifying features. This study involved six speakers and 16 listeners. Speech stimuli for the listeners were excerpted from a passage read by the talkers; they consisted of several sentences and four different vowels. Among other things, Iles found that speaker differentiation cues were present to a greater degree in the low vowels. She concluded that the first formant may contain some idiosyncratic characteristics which aid in speaker identification. Iles' results are somewhat at variance with those of Stevens et al. (1968), but the manner of presentation may have had a differential effect on the results of the two studies.

¹(The main purpose of this study was to compare aural and visual methods of speaker identification. The aural method produced higher identification scores than the visual method in all portions of this study.)

Clarke and Becker (1969) used a rating system to study aural speaker identification. In this study, each listener rated the talker on six different scales: pitch, pitch variability, rate, click-like elements, sibilant intensity and breathiness. The listeners used a seven-point scale for pitch and five-point scales for the other variables. Identifications were made utilizing each scale singly and then in all possible permutations. Pitch was found to be the most effective speaker discrimination characteristic when it was used singly to determine a speaker's identity. Click-like elements, sibilant intensity, breathiness, rate and pitch variability followed in decreasing order of effectiveness.

LaRiviere (1975) studied the role of voice source in, and the effects of vocal tract transfer characteristics on, speaker identification. He used voiced, whispered and low-pass filtered vowels in order to examine (1) source information (filtered), (2) vocal-tract information (whispered), and (3) both (voiced). Using these three sample conditions, LaRiviere found the whispering (vocal tract) and low-pass (source) samples resulted in about equal correct identification scores. In addition, he found that the summed scores for the whispering and low-pass filtering conditions were about equal to the scores for the voiced full-band condition. The author concluded that both voice source and vocal tract characteristics were of equal importance to speaker identification and seemed to be communicating different information to the listener.

A study examining the effects of stress and disguise on perceptual identification of talkers was carried out by Hollien, Majewski, and Hollien (1974). Adult male talkers read an extended prose passage under the following conditions: (1) normal speech, (2) stress (talkers were subjected to randomly distributed electric shock while speaking), and (3) disguised speech. Three types of listeners were utilized: (1) listeners who knew the talkers, (2) listeners who did not know the talkers, and (3) listeners who knew neither the talkers nor the language (i.e., native speakers of Polish who did not know English). As would be expected, the results of this study revealed that the group who knew the talkers did best under all speaking conditions, with the other two groups exhibiting poorer scores. The authors concluded that exposure to a talker aids in the speaker identification process. By examining across speaking conditions, rather than listener type, it was found that stress (as utilized in this study) had little effect on the speaker identification task. However, disguise greatly impaired the listener's ability to identify the talkers. The same speaking condition trends were consistent for all three listener types.

Finally, in a recent study, Rothman (1977) investigated the effects of non-contemporary speech samples on aural speaker identification. Pairs of speakers were chosen for their vocal similarities. Some were father-son combinations; some were brothers or twins; still others were simply sound-alikes. These subject pairs all had a long history of being confused with one another. Rothman recorded the talkers twice, at a one-week interval. These recorded samples were presented to listeners in two-second speech segments for each pair of talkers. Same or different talker judgements were made by the listeners under the following conditions: (1) same talker/contemporary sample; (2) same talker/noncontemporary sample; and (3) different talkers. The results: (1) speakers paired with their own contemporary sample were identified 94 per cent of the time; (2) speakers paired with their own noncontemporary sample were identified 42 per cent of the time; and (3) speakers paired with the claimed vocal twin were correctly identified as themselves 58 per cent of the time. The fact that the percentage of correct identifications is greater for group 3 than for group 2 (58% vs. 42%) would imply that listeners can detect some idiosyncratic cues which aid in the determination of a speaker. The results also indicated that within the constraints of the population utilized, i.e., adult males chosen for similarity of their voices, time appeared to have played the most important role in aural speaker identification. It would, therefore, appear that even when recordings are made only one week apart, aural identification of speakers is greatly impaired.

To summarize the research on the aural/perceptual approach to speaker identification, the following relationships have been observed.

A. Duration. Some evidence suggests that speaker
identification may be largely a function of the absolute duration of an utterance. More recent
studies, however, suggest that duration is
important only insofar as it allows listeners to sample a larger repertoire of the talker's
speech behavior.

B. Fundamental Frequency. There is evidence that
the fundamental frequency of a speaker plays an
important role in speaker identification.










C. Formant Frequency. The reported research suggests
that there is a relationship between formants
(especially the second formant) and speaker
identification. However, these relationships are
not consistent over individuals.

D. Phoneme Effects. Speaker confusion appears to
vary with the phoneme; however, this confusion
also varies with specific vowels as well as voice
inflections and consonant-vowel sequence.

E. Speaker Conditions. Speakers have the ability to
disguise their voices and considerably reduce
correct identification, even if they are familiar to the listeners. However, there is some evidence
to suggest that speakers recorded under physical
stress are not much more difficult to identify than
normally recorded speakers.

F. Contemporaneousness. There also is evidence to
indicate that if speech samples are noncontemporary,
aural/perceptual speaker identifications are
greatly impaired.

G. Familiarity. The research reports demonstrate that
exposure of a listener to a talker plays an important role in aural speaker identification. Also,
evidence indicates that knowledge of the speaker's
language aids in the identification process.


Spectrographic Speaker
Identification


A second method by which speaker identification has been investigated utilizes the frequency-time-intensity sound spectrograph. The sound spectrograph was originally developed at the Bell Telephone Laboratory primarily for the purposes of teaching the deaf to speak (see Potter et al., 1966). However, in 1944, Gray and Kopp discussed the identification of speakers by visual inspection of spectrograms and concluded that this method appeared to offer good potential for such application.

Later, Kersta (1962) proposed that speech spectrograms (voiceprints) contained cues which would enable observers to identify speakers from spectrograms of their utterances alone. In his "investigation," talkers apparently produced ten monosyllabic words in isolation; spectrograms were made of each word. The observers (young females) were given a five-day training period in which they were taught to identify speakers from their spectrograms (the exact nature of this training was not specified). Upon completion of their training, these observers were required to match known spectrograms with unknown spectrograms. Kersta reports that the results of this matching technique yielded a 99 per cent correct identification rate.

Young and Campbell (1967) attempted to replicate elements of the Kersta research. They used the same ten monosyllabic words Kersta used, both in isolation and in context. Sound spectrograms were made of each word in a fashion identical to that employed by Kersta. The observers utilized in this study also went through a training procedure. Then they were presented with spectrograms of known and unknown speakers for identification. The results indicate that words in isolation produce much higher correct identification scores than words in context (78.4% in isolation, 37.3% in context). These scores are also substantially different from those reported by Kersta. Young and Campbell (1967) concluded that one reason for the difference between the two scores (isolation and context) may be the duration of the word. In isolation, the words tended to be much longer in duration than when put within the context of several words. The authors also inferred that different contexts change the speaker identification scores.

Referring back to the Stevens et al. (1968) study discussed earlier, it may be observed that they also investigated the spectrographic speaker identification procedures by means of open and closed sets. The speech samples used were monosyllables, disyllables, and phrases. In the closed set tests, the correct match was always present, while in open set tests, the correct match might or might not have been present. The closed set tests yielded a mean identification score of 79 per cent; the open sets, 47 per cent identification. The results of the Stevens et al. study appear to be in fair agreement with Young and
Campbell (1967). However, these results are not in agreement with the results reported by Kersta (1962).

In 1972, Tosi et al. reported a comprehensive laboratory study that examined the spectrographic method of speaker identification. These authors had speakers produce words in isolation, in fixed context, and in random context. The observers for this study were given a month of training in "phonetics and spectrogram matching procedures." In order to proceed to the more difficult tasks, each observer had to reach an identification score of 96 per cent correct for closed set tasks. The more difficult experimental trials consisted of open sets, words in context, and non-contemporary samples. It was found that closed sets produced lower error rates than open sets (5.5% vs. 9.9%), and that contemporary samples were more identifiable than non-contemporary samples (4.8% vs. 12.1% errors). In addition, the error rates for words in isolation, in fixed context, and in random context were 4.2, 7.6, and 13.4 per cent, respectively. The results of this study indicate that visual speaker identification performance is best for (a) words in isolation, (b) closed sets, and (c) contemporary speech samples. Tosi et al. concluded from the data in this experiment that the spectrographic method of speaker identification demonstrates interspeaker variability to be greater than, or different from, intraspeaker variability. The authors also state that their results confirm the findings of Kersta (1962).

In a contrasting study, Hazen (1973) examined speaker identification in open and closed sets under differing contextual conditions. For speech samples, Hazen utilized 60 speakers recorded while talking spontaneously. Four cue words were then isolated from the speech sample and spectrograms made of each word. The matching or identification procedure was done by subjects who were trained in the same manner as those receiving Kersta's "Voiceprint Identification Training Course." Results at all levels of this experiment demonstrated higher error rates than those of either Kersta (1962) or Tosi et al. (1972). Hazen concluded that spectral similarities due to intra-speaker consistency were not apparent enough to outweigh the similarities due to a different phonetic context.

Hollien and McGlone (1976) examined the effects of disguise on the spectrographic approach to speaker identification. The talkers they utilized were instructed to read an extended prose passage in their normal voice and then repeat the reading using a "disguised" voice. The auditors in this experiment consisted of faculty and a
graduate student in phonetic sciences. All were well acquainted with the use of spectrograms for speaker identification purposes. These skilled auditors were asked to match the disguised sample spectrograms with a spectrogram made of the same speaker in his normal voice. The average score of these observers was 25 per cent correct identification. The authors concluded that the disguised condition greatly affected spectrographic speaker identification. In a later experiment, McGlone, Hollien, and Hollien (1977) demonstrated possible reasons why the spectrographic method of speaker identification is not able to recognize an individual who is disguising his speech. Specifically, these authors found variations in speaking fundamental frequency, formant frequencies, and formant bandwidths. They also found that the duration of speech samples was generally greater for the disguised voice than for the normal voice. These results imply that the speech of a talker can be altered significantly. They concluded that until these acoustic alterations can be generalized for the whole population and their effects predicted, the spectrographic approach to speaker identification would appear to be inadequate for use as a practical speaker identification system.









Reich, Moll, and Curtis (1976) designed an experiment to investigate the effects of vocal disguise upon spectrographic speaker identification. Forty male speakers provided speech samples on two separate occasions, one week apart. The samples consisted of several words excerpted from different sentences. This sample arrangement was analogous to the non-contemporary, random context of Tosi et al. (1972). The speaking modes utilized were: (1) normal, (2) old-age disguise, (3) hoarse disguise, (4) hypernasal disguise, (5) slow-rate disguise, and (6) free disguise. The examiners were Ph.D. candidates in the speech sciences program at the University of Iowa and were extensively trained in the use of spectrograms for speaker identification. Matching the undisguised test spectrograms to the undisguised reference spectrograms, the examiners correctly identified 56.7 per cent of the talkers. When the disguised spectrograms (all types) were matched to the undisguised reference samples, only 33 per cent of the speakers were correctly identified. The authors conclude that: (1) the type of disguise affects the degree to which spectrographic speaker identifications can be made, (2) certain speakers are more difficult to identify, and (3) the findings of this study do not substantiate the prior claims that spectrographic speaker identification is unaffected by attempts at vocal disguise.

Houlihan (1977) also examined spectrographic speaker identification of disguised voices. She carried out two related experiments; only the second will be reviewed. The speakers (eight females and eight males) in this experiment each produced sentences in five speaking modes: undisguised, lowered fundamental frequency, falsetto, whispered, and muffled. The mean identification score for the undisguised condition was 85.5 per cent (71 per cent for females and 100 per cent for males). However, for the disguised conditions, they exhibited overall an identification score of 27 per cent (28.5 per cent for females and 25 per cent for males). Houlihan's results demonstrate that female speakers are not more difficult to identify than male speakers under a normal or undisguised condition. The results of this experiment also confirm earlier studies that the disguised voice confounds the spectrographic method of speaker identification.

Finally, in the second part of a study previously reviewed, Rothman (1977) investigated the effects of similar sounding talkers on the spectrographic approach to speaker identification. As described earlier, two
recorded speech samples (separated by one week) were produced by twelve talkers. Spectrograms were made of these samples and were presented to examiners for identification. Rothman found that contemporary phrases produced identification rates of 24 per cent while noncontemporary samples were considerably lower; about 6 per cent correct identification. On the basis of these data, he suggests that within the constraints of the study, time of utterance is a very important factor for speaker identification purposes. The author also concluded that the aural/perceptual approach to speaker identification is a significantly better approach than is the spectrographic method.

In summary, while some studies (notably those of Kersta and Tosi) show high identification rates utilizing the spectrographic method for speaker recognition, most of the reported research does not confirm these high scores. The bulk of the literature reviewed above demonstrates that there are many factors which seriously affect the identifications made from spectrograms. Some of these factors are: (1) the effect of training of examiners; (2) whether or not the speaker is altering his/her voice in any way; (3) the phonemic context of the utterance; (4) whether or not the utterance to be identified is recorded in a contextual setting or in isolation; (5) utterance duration; (6) whether or not identification trials are in open or closed sets; and (7) whether or not the sample recordings were contemporary or noncontemporary. As McGlone et al. (1977) have pointed out, it seems apparent from the literature that spectrographic representation of the voice is easily and greatly altered by numerous means. Therefore, a great deal of further investigation is necessary before the spectrographic approach can be considered a valid and reliable means of speaker identification.


Automatic or Semi-Automatic
Speaker Identification


A number of "machine" approaches have been utilized in the study of the speaker identification process. These approaches may be categorized in a number of ways including (1) method of analysis, (2) the statistical technique, or (3) the speech features studied. For the purposes of this review, the automated approaches will be divided into groups on the basis of the acoustic features utilized to determine a speaker's idiosyncratic speaking characteristics. The features or parameters which have been most extensively investigated are spectral analysis (long-term speech spectra), fundamental frequency, formant frequencies, and in a few cases, temporal features. A review of the more relevant literature dealing with automatic speaker identification follows.


Spectral Analysis
(Long-term Speech Spectra)


An early study utilizing spectral analysis was reported by Pruzansky (1963). Spectral patterns were developed from ten words, excerpted from context, spoken by ten talkers (seven males and three females). The spectral patterns of three utterances of the same word by the same talker were used as a reference. The remaining utterance of the word was used as the test. Product moment coefficients of correlation were calculated between the reference pattern and the test pattern. The test and reference sample patterns which were most highly correlated were identified as being produced by the same speaker. Utilizing this correlation method of identification, Pruzansky correctly classified 89 per cent of 393 test utterances. She concluded that spectral distinctiveness of talkers is retained in long-term spectra.
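
The correlation-matching idea described above can be illustrated with a minimal sketch. It is not Pruzansky's implementation: the function names, the assumption that each talker's reference utterances have already been combined into one pattern, and the toy numbers are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def identify_by_correlation(test_pattern, references):
    """Return the talker whose reference pattern correlates best with the test."""
    return max(references, key=lambda talker: pearson(test_pattern, references[talker]))

# Hypothetical spectral patterns in arbitrary units (not data from the study).
references = {
    "talker_a": [1.0, 3.0, 2.0, 0.5],
    "talker_b": [0.5, 1.0, 3.0, 2.5],
}
test_pattern = [1.1, 2.8, 2.2, 0.4]
best_match = identify_by_correlation(test_pattern, references)
```

In this sketch the test pattern is attributed to whichever reference yields the highest coefficient, which corresponds to a closed set decision: some talker is always chosen.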










Majewski and Hollien (1974) and Zalewski, Majewski, and Hollien (1975) examined the usefulness of long-term speech spectra as a cue for speaker identification. Both studies utilized the same group of subjects (50 Americans and 50 Poles). Each subject read a prose passage as a speech sample, from which power spectral information was extracted. Majewski and Hollien (1974) utilized Euclidean distances to classify the speakers and Zalewski et al. (1975) applied cross-correlations to the spectral data in order to make the identifications. The mean error rate for both the Majewski and Hollien and the Zalewski et al. studies was about 4 per cent. In a third study, Doherty (1976) used the 50 American speakers and the same feature extraction techniques as the two previous studies. However, he applied discriminant analysis in order to recognize the speaker patterns. Utilizing the same full bandwidth (80 Hz to 12.5 kHz), no errors in identification were found. Doherty at this point limited the bandwidth of the spectral information to the range between 315 Hz and 3.5 kHz, and he recalculated his error rate. In this case it was 24 per cent. Doherty concluded that even though error rates under the bandpass condition were encouraging, the 24 per cent was unacceptable for most practical applications. It should also be
noted that, while all of these studies (Majewski and Hollien, 1974; Zalewski et al., 1975; and Doherty, 1976) used basically similar populations, each utilized a different statistical technique. In other words, Doherty also demonstrated that the selection of statistical technique is of importance in the automatic approach to speaker identification.

In order to test the resistance of spectral analysis to distorted speech, Hollien and Majewski (1977) studied 25 talkers who produced speech under psychologically stressful and disguise conditions. Power spectral information was extracted for all talkers in both full band (80 Hz to 17.5 kHz) and limited band (315 Hz to 3.5 kHz), and Euclidean distances were calculated for the talkers' individual spectral data. Under the stress condition, 92 per cent correct identification was achieved for the full bandwidth and 68 per cent identification with the limited bandwidth. The disguised condition yielded considerably lower scores, 20 per cent correct with full bandwidth and 10 per cent identification with limited bandwidth. Doherty and Hollien (1977) also examined speaker identification of distorted speech. The authors employed the same talkers, speech samples and feature extraction technique as those
employed by the previous study. However, in this case discriminant analysis was used as the statistical procedure. Results of this study were 72 per cent correct identification for the stress condition and 24 per cent correct for the disguise. The results of both studies demonstrate that stress has little effect on the recognition of a speaker, while disguised voices are much harder to identify than undisguised voices. The conclusion that may be made is that disguise, not stress, greatly alters the spectral characteristic of a talker's voice.


Fundamental Frequency


In order to examine another element within the frequency domain, Atal (1972) investigated pitch contours (fundamental frequency variations) as a cue to speaker identification. Atal formed a 20-dimensional vector based on the pitch data of ten speakers and utilized linear transfer functions to maximize the ratio of the interspeaker and intraspeaker variations of these pitch contours. From these data, reference and test vectors were formed and Euclidean distances calculated between these vectors. By matching the reference and test vectors with the smallest distance, the author was able to achieve an identification rate of 98 per cent. On the basis of these results, Atal concludes that the pitch contours are useful acoustic features in a speaker recognition system.
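
The notion of a ratio of interspeaker to intraspeaker variation can be made concrete with a simple per-feature calculation. The sketch below is only a stand-in: it does not reproduce Atal's linear transformation, and the function name, the use of population variances, and the toy values are assumptions made for illustration.

```python
def variation_ratio(samples_by_speaker):
    """Variance of the speaker means divided by the mean within-speaker variance."""
    means = {spk: sum(vals) / len(vals) for spk, vals in samples_by_speaker.items()}
    grand_mean = sum(means.values()) / len(means)
    # Interspeaker variation: how far the speaker means lie from the grand mean.
    between = sum((m - grand_mean) ** 2 for m in means.values()) / len(means)
    # Intraspeaker variation: average spread of each speaker around his own mean.
    within = sum(
        sum((v - means[spk]) ** 2 for v in vals) / len(vals)
        for spk, vals in samples_by_speaker.items()
    ) / len(samples_by_speaker)
    return between / within

# Hypothetical repeated measurements of one feature (Hz); not data from any study.
stable_feature = {"a": [100, 102, 98], "b": [140, 138, 142]}
variable_feature = {"a": [100, 120, 80], "b": [105, 125, 85]}

stable_ratio = variation_ratio(stable_feature)
variable_ratio = variation_ratio(variable_feature)
```

A feature whose ratio is large separates talkers well relative to each talker's own variability; a ratio near zero suggests the feature varies as much within a talker as between talkers.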

In 1972, Wolf used several classes of frequency data (fundamental frequency and features of vowel and consonant spectra) as speaker identification cues. The author had 21 male speakers record six sentences whereupon the frequency information was extracted from the recordings and a linear classification procedure applied. Wolf did not analyze his parameters separately but did state that fundamental frequency was a very useful parameter in his identification paradigm. However, he also states that fundamental frequency is perhaps the easiest and most obvious acoustic feature to modify in vocal disguise. In a similar study, Sambur (1975) utilized the same speech material and features as those of Wolf (1972). However, the Sambur recording sessions spanned a period of three and one-half years and included eleven talkers. Average fundamental frequency was not actually tested in this recognition experiment. However, by use of probability of error criterion, Sambur ranked average fundamental frequency twelfth among 38 possible recognition features. These calculations indicate that fundamental frequency
can be (possibly) a useful cue in recognizing an unknown speaker.

In referring back to the Doherty (1976) study, it can be found that he also extracted speaking fundamental frequency from the speech of his 50 speakers. Using only a two parameter vector to specify fundamental frequency, a 30.2 per cent correct identification score was achieved. This vector was also used in conjunction with limited bandwidth long-term speech spectra. The combination of the two vectors yielded an identification score of 97.7 per cent. As an overall conclusion, Doherty states that fundamental frequency does contain useful idiosyncratic data and appears to be independent of the information carried in long-term spectra.

In the Doherty and Hollien (1978) study reported previously, the authors also used fundamental frequency as a cue to identify the 25 talkers. In this case, the individuals who produced stressed speech samples were correctly recognized 30 per cent of the time but the disguised samples only 10 per cent of the time. Since results of the stress condition are in agreement with those of Doherty (1976), it may be concluded that stress, of the type examined in these experiments, at least, does not
greatly alter speaking fundamental frequency. The poor identification scores for the disguise condition would seem to indicate that speaking fundamental frequency is greatly changed when a speaker attempts to disguise his voice. This relationship confirms the statement made by Wolf (1972) about disguised voices.


Formant Frequency


The resonances or formant frequencies of vowels and consonants have also been investigated as cues to speaker identification. In a study described earlier, Wolf (1972) extracted selected formants from his 21 male talkers. Utilizing 17 parameters, of which six were formant frequency, 100 per cent recognition was achieved. Wolf concluded the vowel and consonant spectral information is useful in the classification of speakers. In a study also evaluated earlier, Sambur (1975) reports four measurements of the same type. They were: (1) the second formant of /n/, (2) the third formant of /u/, (3) the second formant of /i/, and (4) the third formant of /m/. Utilizing these four measurements and one temporal measurement, only one identification error in 320 trials was made. These results lead Sambur to conclude that
formant information was among the most useful in recognizing a speaker from his voice alone.

Finally, Goldstein (1976) evaluated ten vowel formant structures as speaker identifying features. Ten adult males were recorded reading sentences containing key words. The formant tracks were extracted from these key words; 199 measurements were made and evaluated for effectiveness in speaker identification. In an identification experiment using two formant measurements, only 12 errors were made in 80 identifications. Furthermore, Goldstein evaluated five formant measurements with a technique called probability-of-error (described by Sambur, 1975). Utilizing this technique an error rate of 0.25 per cent was calculated for the five measurements. Based on the findings in this study, Goldstein concludes that, since certain formant measures demonstrate large speaker differences, the variations must be dependent more upon the speaker's habits than on vocal-tract configuration. In addition, she states that it is possible that first and second formant measures contain more information than just vocal-tract length information.










Temporal Parameters


Temporal features within the speech signal have not been examined for speaker dependency in the same detail as have frequency features. However, in a study described earlier, Pruzansky (1963) tested the effect of only utilizing temporal data on speaker recognition success. In brief, she used ten talkers who recorded several repetitions of ten words in context. Two-dimensional patterns, consisting of the total energy in time segments, were formed by summing the energy over the several frequency bands for each time section. With a pattern matching technique, the temporal patterns yielded a 47 per cent correct identification score. Pruzansky concluded from her results that the temporal patterns were more correlated to the individual words than to the speakers.

In the Wolf (1972), Sambur (1975), and Goldstein (1976) studies, limited temporal measurements were evaluated for possible speaker dependent characteristics. Wolf (1972) measured the total duration of the word "bought." He states that certain learned idiosyncratic voice characteristics possibly deal mainly in timing. However, his single example of a gross timing measure did not provide useful identification
information, that is, in comparison to his spectral measurements. Goldstein reports similar results for her timing measurements. She investigated the duration of several formant tracks and their ratios. None of the temporal parameters were included in the final identification analysis because of low inter-speaker variations. In contrast, Sambur found temporal measures useful in speaker identification. He made two measurements: (1) the slope of the second formant of the diphthong /ai/ and (2) the duration of the frication and aspiration noise of the plosive /k/ in "cash." Both these features were among Sambur's 10 most effective speaker identification cues.

In the Doherty (1976) experiment previously discussed, two temporal parameters were examined for use in speaker recognition. These two measurements were speaking time and phoneme rate. Utilizing this two-parameter vector, only six of the total 50 speakers were correctly identified. However, when the temporal vector was added to his other vectors (speaking fundamental frequency and long-term speech spectra), the identification rates increased from 8 per cent to 26 per cent. This finding would suggest that the speaker-dependent information within spectral features was different from that contained in the temporal elements
of the speech signal. These two temporal features, speaking time and phoneme rate, also were examined under stress and disguise speech conditions (Doherty and Hollien, 1978). The identification scores for this study were very low (12 per cent for the stress and 16 per cent for the disguise). In addition, when these temporal features were combined with the spectral vectors, correct identification rates only increased from 4 per cent to 10 per cent. These results suggest there is not enough idiosyncratic information within these temporal measurements to justify their use alone. However, the results also indicate that the temporal elements within the speech signal contain some identification information that may be insensitive to stress or disguise.

In summarizing the experimental findings examined in the preceding section, several conclusions can be drawn pertaining to the automatic or semi-automatic approach to speaker identification. Initially, it can be seen that the spectral components of the speech wave contain certain speaker-dependent features. It should be noted also that when studies utilized similar spectral measures, similar results were obtained. Thus, there appears to be a great deal of consistency among these experiments. Moreover, it was found that limited portions of the spectrum, viz., fundamental
frequency and formant frequencies, appeared to contain idiosyncratic characteristics relevant to the speaker identification task. A finding of special interest to this research is that temporal features extracted from the speech waveform were of value in speaker identification systems. Finally, "machine" approaches to speaker identification appear to exhibit the greatest potential. There are two specific relationships which support this conclusion. First, when all the research carried out on speaker identification (aural, visual, and semi-automatic) is compared, this approach has produced consistently the highest and most reliable correct identification scores. Second, the semi-automatic approach has, for the most part, removed subjective judgements from the identification process.


Objectives


The primary aim of this research was to develop and test a system of inter-speaker differentiation which will seek to discover if certain speaker dependent features will remain relatively invariant during both physical and psychological modifications to the acoustic environment. Specifically, this study examined selected temporal parameters which exist within the speech waveform. These parameters
were: (1) durational measurements of energy at several levels of magnitude, (2) presence of vocal activity, (3) patterns of silence, and (4) the duration of several specific words and phrases. It also was the aim of this study to investigate the effects of several speaking conditions upon the identification capabilities of the selected temporal parameters.

The specific goals of this research may be stated in the form of several questions. They are as follows:

A. Which of the several selected temporal elements,
found within the speech signal, are invariant,
idiosyncratic characteristics of an individual's
speaking repertoire?

1. Can these temporal measurements be made
reliably?

2. Once measured, do these parameters predict
a speaker's identity?

B. Will stressful or disguised speaking conditions
significantly reduce the speaker identification
capabilities of these selected temporal
parameters?

C. What changes occur in the identification strength
of these parameters (or set of parameters) when
the method of choosing a test speech sample is
altered?














CHAPTER II


METHODS


As has been stated, the aim of this research is to develop a speaker identification system based on the analysis of certain temporal features. In general, these temporal measurements include durational analysis of (1) relative energy at several levels of intensity, (2) voiced or voiceless activity, (3) patterns of silence, and (4) specific words and phrases. These parameters were grouped into several vectors consisting of from 2 to 40 measurements and studied under a variety of speaking conditions. These experiments were classified as (1) laboratory-normal (LN), (2) laboratory-distorted (LD), and (3) semi-field (SF). Specifically, in the first experiment the temporal measurements were tested under "ideal" laboratory conditions for the purpose of establishing baseline data on the speaker-identifying ability of the selected features. In the second (LD) experiment, the vectors were applied to a speaker identification task under three speaker conditions: normal, stress, and disguise. This experiment was designed
to evaluate the effects of speaker distortion (voluntary and involuntary) on the selected (temporal) vectors' identification capabilities. The third and final experiment can be described as a semi-field study (SF). In this experiment recordings were made of a "crime-related" telephone message and several "suspects." An attempt then was made to identify the "criminal caller" from among the "suspects" in a closed set paradigm. The purpose of this last study was to evaluate the temporal parameters in a quasi-forensic environment. A detailed description of the temporal vectors, experiments, and the statistical analyses utilized follows.


Temporal Parameters


From all the possible temporal characteristics which may exist within the speech signal, four general sets of vectors have been chosen. They are: (1) time-energy distribution (TED), (2) voiced/voiceless speech time (V/VL), (3) vowel/consonant duration ratios, and (4) word and phrase durations (WPD). These vectors were chosen on the basis of two factors: (1) their potential to contain information idiosyncratic to a speaker (based on previous research) and (2) the potential of obtaining these parameters from the speech signal.










Time-Energy Distribution (TED)


This temporal vector is based on a group of time-by-energy measurements, none of which have been studied previously in relation to speaker identification. In general terms, this analysis reflects the total accumulated time a talker's speech intensity remains at a specific energy level (relative to his peak amplitude). It also provides indications of the speaker's speech pattern with respect to speech bursts and pause periods.

For the purposes of this research, the speech bursts are defined as the portion of the speech energy which is above a given energy level and pause periods represent those areas between the speech bursts at a given level when the speech energy falls below that given energy level. Operationally, the TED procedure was carried out utilizing a resistor-capacitor circuit to generate an energy envelope of the speech signal. This signal then was digitized by an analogue to digital conversion on a Digital Equipment Corporation PDP 8i minicomputer. Figure 1 gives a block diagram of the equipment configuration. The digitized sequence was analyzed for duration relative to ten intensity levels via specific programming written especially for this analysis. That is, the digitized envelope was partitioned, functionally, into ten linearly equal energy levels, initiating with its peak amplitude. Figure 2 gives a graphic representation of a typical energy envelope as it would be partitioned. The speech bursts and pause periods are also shown in this figure. The number of speech bursts, the mean and standard deviation of the speech bursts, and the standard deviation of the pause periods were computed for each energy level. The mean pause periods and number of pauses were not computed as they are direct reciprocals of the speech bursts. The total number of features measured in this vector was 40 (see Table 1). This vector was used in all three experiments.


Fig. 1. Block diagram of the equipment utilized to
generate an energy envelope from the analogue
speech signal and extract the time-energy
distribution (TED) parameters.
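
The TED measurements described in this section can be sketched in software as follows. This is an illustrative sketch only, not the PDP 8i program: the function names are hypothetical, the threshold for each level is assumed to be the lower edge of ten equal bands below the peak, strict "greater than" is assumed for "above," population standard deviations are assumed, and the toy envelope is invented.

```python
from statistics import mean, pstdev

def bursts_and_pauses(envelope, threshold, dt):
    """Durations of speech bursts (above threshold) and of the pauses between them."""
    above = [sample > threshold for sample in envelope]
    if True not in above:
        return [], []
    # Ignore leading and trailing silence: pauses lie BETWEEN bursts.
    first = above.index(True)
    last = len(above) - 1 - above[::-1].index(True)
    bursts, pauses = [], []
    run_is_burst, run_len = True, 0
    for flag in above[first:last + 1]:
        if flag == run_is_burst:
            run_len += 1
        else:
            (bursts if run_is_burst else pauses).append(run_len * dt)
            run_is_burst, run_len = flag, 1
    bursts.append(run_len * dt)  # the trimmed sequence always ends in a burst
    return bursts, pauses

def spread(values):
    """Population standard deviation; 0.0 when fewer than two values exist."""
    return pstdev(values) if len(values) > 1 else 0.0

def ted_vector(envelope, dt, levels=10):
    """Four measurements at each energy level (4 x 10 = 40 features)."""
    peak = max(envelope)
    features = []
    for k in range(levels):
        threshold = peak * k / levels  # assumed lower edge of level k + 1
        bursts, pauses = bursts_and_pauses(envelope, threshold, dt)
        features += [
            len(bursts),                       # number of bursts
            mean(bursts) if bursts else 0.0,   # mean burst duration
            spread(bursts),                    # SD of burst durations
            spread(pauses),                    # SD of pause durations
        ]
    return features

# Hypothetical digitized envelope, one sample every 10 ms (not thesis data).
envelope = [0, 0, 5, 10, 5, 0, 0, 4, 4, 0]
vector = ted_vector(envelope, dt=10)
```

With a real recording, the lowest threshold would probably need to sit above the noise floor rather than at zero; that choice is left open here because the text does not specify it.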


Voiced/Voiceless Speech
Time Vector (VVL)


This vector is created by combining the duration of the voiced and voiceless portions of the speech signal. The first parameter was defined as the duration of articulated speech during a given speech sample. The second term of the VVL vector consisted of the total duration of phonation or vocal activity during a speech sample. This specific vector previously has not been used in the speaker identification task. However, both voiced and voiceless phonemes
























.10



a


LU 7 ...
LU


1= 5LU
z
LU 4

31

2






TIME -----A aC 0
I ... ... .. I , I . . . . .... . . .
A-8 and C-0 are speech bursts
B-C is the pause period
A-8 plus C-0 is total articulation time


Fig. 2. Schematic representation of a typical energy
envelope as generated from the TED equipment configuration.















Table 1.


Parameters Measured and Investigated for Possible Utilization as the Time-Energy Distribution (TED) Vector; Speech Bursts Represent the Portion of the Speech Energy Which Is Above a Given Relative Intensity Level and Pause Periods Are Defined as Those Areas Between the Speech Bursts at a Given Energy Level Where the Energy Falls Below the Given Level.


Parameters                                              Number of
(Measured at Each of Ten Energy Levels)                 Parameters

A. Speech Bursts
   1. Number of bursts                                      10
   2. Mean duration of each burst                           10
   3. Standard deviation of the duration of bursts          10

B. Pause Periods
   1. Standard deviation of the duration of pauses          10

C. Total Number of Parameters Available                     40










have been studied by a number of researchers in speaker identification (see, for example, LaRiviere, 1975 and Coleman, 1973). Their research has indicated the importance of both voiced and voiceless speech sounds to the identification of talkers from their speech.

Phonation time is obtained from the IASCP Fundamental Frequency Indicator (FFI-8) linked to a PDP8i minicomputer. FFI-8 is a digital readout fundamental frequency tracking device. It consists of a group of successive low-pass filters with cutoffs at half-octave intervals, coupled with high-speed switching circuits which are controlled by a logic system. FFI-8 produces a string of pulses which are delivered to the PDP8i computer. Each pulse marks the boundary of a fundamental period in the complex wave. An electronic clock marks the time from pulse to pulse and these values are processed digitally to yield (along with other data) the geometric mean frequency level and standard deviation of the frequency distribution. While FFI-8 is basically designed to extract fundamental frequency from a complex wave, it will also calculate the duration of the vocal activity (phonation time).

The second phase of this procedure was to extract total articulation time. Referring back to the TED procedure, the summed duration of the speech bursts of energy level one represents the total amount of articulated speech (see Figure 2). The difference between the total articulation time and the phonation time represents the voiceless speech component of the sample. This two-dimensional vector was utilized in all three experiments.
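The relation between the two measured durations can be written out as a short Python sketch. It assumes that both durations are already available in seconds (phonation time from the frequency tracker, articulation time from the level-one speech bursts) and that the voiceless component is taken as their difference; the function name is hypothetical.

```python
def vvl_vector(articulation_time, phonation_time):
    """Return (voiced time, voiceless time) in seconds.

    Assumption: voiceless time is whatever part of the articulated
    speech is not accounted for by phonation.
    """
    voiceless_time = articulation_time - phonation_time
    return phonation_time, voiceless_time
```

For a hypothetical sample with 24.0 seconds of articulated speech and 15.5 seconds of phonation, this yields 15.5 seconds voiced and 8.5 seconds voiceless.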


Vowel/Consonant Duration
Ratio (V/C)


The V/C vector is made up of the ratios of the duration of selected vowels to the duration of their consonantal environments. For this procedure, a separate ratio was calculated for each of the following words: "good," "not," "cannot," and "sort." A vector of this type previously has not been utilized in speaker identification research. However, many researchers have demonstrated the individual importance of vowels and consonants in identification of talkers (Clarke and Becker, 1969; LaRiviere, 1975; and Glenn and Kleiner, 1968).

The procedure for the extraction of the V/C vector utilized time-by-frequency-by-intensity speech spectrograms made on a Voiceprint Identification, Inc., Model 700 spectrograph. Speech spectrograms were made of the words "good," "not," "cannot," and "sort." Hand measurements of the durations of the vowels and their associated consonants were made from these spectrograms. These measurements were spot checked for accuracy by an independent observer. The V/C ratios were formed from the duration of the vowel or vowels in a selected word and the duration of the consonant or consonants in that same word. This process allowed the formation of one ratio for each sampling of the selected word. The V/C vector was utilized as a speaker identification feature only in the first two experiments.

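As an illustration of how one V/C ratio is formed from the hand measurements, the following Python sketch sums the vowel durations and the consonant durations of a single word and divides one by the other. The function name is hypothetical and the durations in the note below are invented for the example; they are not measurements from this study.

```python
def vc_ratio(vowel_durations, consonant_durations):
    """Ratio of total vowel duration to total consonant duration in one word.

    Both arguments are lists of durations in the same unit (for
    example, milliseconds read from a spectrogram).
    """
    total_consonant = sum(consonant_durations)
    if total_consonant <= 0:
        raise ValueError("consonant duration must be positive")
    return sum(vowel_durations) / total_consonant
```

With an invented vowel of 120 ms flanked by consonants of 60 ms and 90 ms, the ratio is 120 / 150 = 0.8. A word with two vowels, such as "cannot," would simply pass two vowel durations.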

Word and Phrase
Durations (WPD)


The WPD vector is generated from measurements of the individual durations of several selected words and phrases. The words chosen for use in the WPD vector were: "good," "not," "cannot," and "sort." In addition, the durations of three selected phrases were computed. These phrases were "they have," "they cannot," and "it is not." A WPD vector of this type has been examined previously by Wolf (1972), who investigated a number of acoustic parameters for use in the identification of speakers. Specifically, one of the parameters he used was the duration of the word "bought." He indicated that this word duration (alone) did not provide good speaker identification information but that such a parameter may be useful.

In order to obtain the WPD vector, the cited words and phrases were processed in the same manner as were the words for the V/C vector. That is, speech spectrograms were made and the durations calculated from hand measurements. This technique yielded a vector made up of seven parameters. The WPD vector was only utilized in the laboratory-normal and laboratory-distorted speech experiments.


Multiple Vectors


In addition to examining the speaker identification capabilities of each of the vectors separately, these vectors were investigated in all possible combinations. Specifically, the vector combinations were utilized in groups of two vectors (TED VVL, TED V/C, TED WPD, VVL V/C, VVL WPD, and V/C WPD), three vectors (TED VVL V/C, TED VVL WPD, TED V/C WPD, and VVL V/C WPD), and all four vectors (TED VVL V/C WPD). The technique of combining several parameter sets has been shown to improve speaker identification systems (Doherty, 1976; Sambur, 1973; and Goldstein, 1976).










This multiple vector approach provides the most efficient use of these temporal vectors in a speaker identification system. For example, if two vectors contain the same or similar information about a speaker's voice, then the combining of these vectors would not produce improved identification scores. Conversely, if two vectors vary independently of one another, then their combination should produce better identification scores. Therefore, this procedure demonstrated which vectors are contributing new information and which vectors contain duplicate information.
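The set of vector groupings listed above (six pairs, four triples, and the single four-vector group) can be enumerated mechanically. The Python sketch below does so with the standard library; the function name and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations

VECTORS = ("TED", "VVL", "V/C", "WPD")


def vector_groupings(vectors=VECTORS):
    """Map group size (2, 3, ...) to every combination of that size."""
    return {
        size: list(combinations(vectors, size))
        for size in range(2, len(vectors) + 1)
    }


groupings = vector_groupings()
for size, groups in groupings.items():
    print(size, [" X ".join(g) for g in groups])
```

Each grouping would then be formed by joining the parameters of its member vectors into one longer sample vector before classification.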


Experiment One


Laboratory, Normal (LN)


This initial experiment was a laboratory-based study. The subjects read speech material in a laboratory environment and were recorded with "ideal" laboratory equipment. The purpose of this experiment was to develop baseline data on the speaker identification capabilities of the temporal vectors. A detailed description of the subjects, speech material, and recording procedure will follow.











Subjects


Forty adult speakers were chosen from a volunteer pool of faculty and students at the University of Florida. Subject selection was made on the following basis: (a) 18 to 40 years of age, (b) no apparent speech defects, and (c) no unusual regional or foreign dialects. These minimal requirements yielded a relatively homogeneous population. Thus, these subjects permitted initial baseline testing of the temporal parameters and presumably reduced any obvious inter-speaker variations.


Speech Material


Subjects read a modernization of "An Apology for Idlers" by R. L. Stevenson, an approach which permitted the speech samples to be context independent. The passage was chosen because it is relatively long (approximately 600 words) and contains most phonemes of the English language. Therefore, it provided a good representation of the subject's speech repertoire and allowed the sample to be divided into several smaller subunits where necessary.










Procedure


The subjects utilized in the LN experiment were recorded under laboratory conditions, in an IAC-1200 sound-treated chamber with an Ampex Model No. 354 tape recorder coupled to an Electro-Voice microphone Model No. EV 664. The recorded speech samples were divided into four subsamples (30 seconds) in order to permit the extraction of the TED and VVL vectors. Also, selected words and phrases were processed as described in the V/C and WPD sections.


Experiment Two


Laboratory Distorted
Speech (LD)



As stated earlier, the purpose of the second experiment was to investigate the effects of speaker distortions (voluntary and involuntary) on the robustness of the temporal vectors, re: the speaker identification task. The subjects were recorded under the same laboratory conditions as those of the first experiment. However, this experiment involved three speaking conditions (normal, stress, and disguise). This experiment provided data on the consequences of speaker distortion on this speaker identification system. A complete description of the procedures utilized will be found below.










Subjects and Speech Material


The speakers were 20 male faculty and graduate students at the Institute for Advanced Study of the Communication Processes and the Department of Speech at the University of Florida. These subjects were normal speakers of American English ranging in age from approximately 25 to 45 years; they exhibited no unusual dialects or speech or voice disorders. The speech material for this experiment was the same as that used in the first experiment.


Procedure


The subjects were recorded using the same equipment described for the first experiment. However, three speaking conditions were included in the recording procedure. They were: (a) normal speech (control), (b) stress, and (c) disguise. Emotional stress can be defined in a number of ways; in this case, it was induced by applying electric shock, delivered randomly, while the subject was speaking. For the third condition, subjects were requested to disguise their speech as completely as they could. The only restriction placed on them was that they could not use a "foreign dialect" or whisper; in addition, they were encouraged to use only the modal voice register.










The recorded speech samples were divided into four subsamples (30 seconds) for TED and VVL analysis. For the V/C and WPD vectors the selected words ("good," "not," "cannot," and "sort") and phrases ("they have," "they cannot," and "it is not") were extracted as specified above.


Experiment Three


Semi-Field Conditions (SF)


The third and final experiment had as its purpose the testing of the identification capabilities of the temporal vectors under less than laboratory conditions. In this case, subjects simulated a "crime" over a telephone. Later, suspect interrogation procedures were carried out. It was hoped that the results of this experiment would provide information as to how well the selected temporal vectors performed under conditions which more closely parallel the forensic model.


Subjects and Speech Material


The subjects for this semi-field experiment were 12 adult volunteers drawn from local law enforcement agencies. From this pool of volunteers, one of the subjects was selected to assume the role of the criminal; he simulated a telephone-related "crime" (a kidnapper's demand call). The remaining eleven subjects were designated as "suspects." All subjects (the 11 suspects and the caller) were recorded during an interview in which they recited statements made by the original caller. This interrogation procedure permitted the control of context and provided a closed set approach.


Procedure


The criminal call was recorded over a telephone on a reel-to-reel tape recorder via a direct line hookup. It was made from a telephone in a relatively quiet environment--a procedure that provided for reasonably high quality recordings of this type. However, the interrogation procedure was recorded in quite a different manner. In this case, all subjects (suspects and caller) were recorded in a large and relatively noisy room. This procedure was followed in order to model a typical forensic situation. Only the TED and VVL vectors were extracted from these recordings. The V/C and WPD vectors could not be utilized because the particular subject recordings did not provide sufficient repeated words and phrases.










Statistical Analysis


Of the many techniques available, discriminant analysis was chosen as the statistical approach to the identification of the speakers. This particular technique was chosen over several others because it has demonstrated higher speaker identification results than have such other methods as Euclidean distance analysis or cross-correlation (Doherty, 1976; Doherty and Hollien, 1978; and Zalewski et al., 1975).

Discriminant analysis computes a set of linear functions, termed discriminant functions, which then are utilized to classify individual samples or observations into one of several groups. In the case of this research, the discriminant functions were utilized to classify an unknown speaker's test sample against reference sets generated on known speaker sets. The input data to this procedure consisted of sets of samples; each of which contained values for all the parameters. From the parameters an F-statistic was calculated and used to determine which parameters were the most powerful as identification cues.

Three classification methods were utilized within the statistical technique. Initially, the known or reference set consisted of all four of the talker's speech samples. Then, each sample was reclassified, in turn, with respect to the reference sets. This method was labeled posterior classification and was used to "pretest" the speaker identifying features investigated in this research. The second method chosen was the jackknife approach. In this procedure, each sample was eliminated (in turn) from the reference set before computation of the discriminant functions was carried out. In this case, classification was made on the "removed" speech samples. A third method was utilized to simulate the forensic model. The first of a talker's samples was arbitrarily designated the test sample and the reference set consisted of the remaining three samples. This final method was employed in the identification task.
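The three classification methods differ only in which samples form the reference sets and which samples are tested. The Python sketch below illustrates that bookkeeping. It is not the discriminant analysis used in this research: a nearest-centroid rule with squared Euclidean distance stands in for the discriminant functions, and the data layout (a dictionary mapping each talker to a list of sample vectors, at least two per talker) and all function names are assumptions made for illustration.

```python
def centroid(samples):
    """Component-wise mean of a list of equal-length sample vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def classify(sample, references):
    """Return the talker whose reference centroid is nearest to the sample."""
    def distance(talker):
        return sum((a - b) ** 2 for a, b in zip(sample, references[talker]))
    return min(references, key=distance)


def posterior_score(data):
    """Every sample is in its own reference set and is then reclassified."""
    refs = {talker: centroid(s) for talker, s in data.items()}
    trials = [(talker, x) for talker, s in data.items() for x in s]
    return sum(classify(x, refs) == talker for talker, x in trials) / len(trials)


def jackknife_score(data):
    """Each sample is removed in turn before the references are computed."""
    hits = total = 0
    for talker, samples in data.items():
        for i, x in enumerate(samples):
            refs = {
                t: centroid(s[:i] + s[i + 1:] if t == talker else s)
                for t, s in data.items()
            }
            hits += classify(x, refs) == talker
            total += 1
    return hits / total


def identification_score(data):
    """The first sample is the test; the remaining samples are the reference."""
    refs = {talker: centroid(s[1:]) for talker, s in data.items()}
    return sum(classify(s[0], refs) == talker for talker, s in data.items()) / len(data)
```

Each function returns the proportion of correct selections. In the posterior method the test sample contributes to its own reference centroid, whereas the other two methods keep it out of the reference set.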














CHAPTER III


THE RESULTS AND DISCUSSION OF THE
LABORATORY-NORMAL EXPERIMENT


The initial experiment had two major purposes. The first was to examine certain temporal characteristics of an individual's speech in order to discover if they permit him to be identified from his voice alone; a second objective was to establish baseline speech data. As described in previous sections, adult males provided speech samples from which the following temporal vectors were extracted: (1) a time-energy distribution vector; (2) a voiced/voiceless speech time vector; (3) a vowel/consonant duration vector; and (4) a vector resulting from the analysis of the durations of several selected words and phrases. These vectors were made up of a number of parameters; each was tested for its speaker identification capability with respect to the acoustically controlled environment of this laboratory type experiment. A description of the obtained results and a discussion of these findings follow.










Results


The pretest and identification scores which were obtained when the vectors were utilized singly may be found in Table 2. Three types of speaker classifications were used to test these vectors; two of these were pretests and the third was an identification task. The first pretest was labeled "posterior classification." In this case, all samples were utilized in the reference set determination and then each sample was classified in relationship to one of the reference sets. The purpose of this initial pretest was to examine the inter-sample variability. By means of this first method, it was found that the time-energy distribution (TED) vector could be utilized to correctly recognize the speakers 100 per cent of the time; a correct classification rate of 57.5 per cent was found for the voiced/voiceless speech time (VVL) vector. The scores for the vowel/consonant duration ratio (V/C) vector decreased to only 22.5 per cent correct while the word and phrase duration (WPD) vector yielded no correct classifications at all.















Table 2.


Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the Time-Energy Distribution (TED), Voiced/Voiceless Speech Time (VVL), Vowel/Consonant Duration Ratio (V/C), and Word and Phrase Duration (WPD) Vectors on an Individual Basis, N = 40.


             Pretest Classifications
Vectors      Posterior    Jackknifed    Identifications

TED          100.0%       82.5%         25.0%
VVL           57.5        37.5          25.0
V/C           22.5         7.5           5.0
WPD            0.0         0.0           0.0









The second type of pretest used was a "jackknifing" method. That is, each sample, in turn, was eliminated from the discriminant computations and a reference set formed from the remaining samples. Each sample then was classified according to the particular reference set developed when it was removed. This pretesting procedure allowed all of a given individual's speech samples, taken in a complete set, to be evaluated for their identification capabilities. When the jackknifing method of speaker selection was utilized, consistently lower classification scores were recorded. However, the same general trends exhibited by the posterior pretesting method also were found when this second procedure was employed. That is, a classification score of 82.5 per cent correct was attained by the TED vector. The VVL and the V/C vectors yielded scores of 37.5 per cent and 7.5 per cent, respectively; the WPD again did not result in any correct speaker selections.

Finally, identifications were carried out in the following manner. The first speech sample for each talker was chosen as the test sample and the remaining samples were utilized as the reference set. The first sample was chosen as the test because it is believed to represent the most variable portion of the entire speech sample. In the identifications, the TED and VVL vectors both resulted in a 25 per cent correct level of speaker selection. Utilizing the V/C vector, only 5 per cent of the speakers were correctly identified, and the WPD, as would be expected from the pretests, yielded no correct identifications at all.

The second phase of this experiment consisted of testing the identification abilities of the temporal vectors in all possible combinations. Identification scores of the two-vector combinations may be found in Table 3. As can be seen from the table, the same general trends were found as when the vectors were tested singly. Specifically, it should be noted that the TED and VVL combination yielded the highest score in the identification task (40 per cent correct). Also, it is apparent that the addition of V/C and WPD to either of the other two vectors did not increase the identification rates appreciably. The three- and four-vector combinations may be found in Table 4. The speaker recognition scores for this group of vector combinations appear to have resulted in essentially diminishing returns. That is, the scores did not change substantially from one vector grouping to another. This plateau effect can be noted especially with respect to the identification task. The combinations of TED, VVL, and V/C and the TED, V/C,
















Table 3. Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the TED, VVL, V/C, and WPD Vectors in All Possible Pairs, N = 40.


               Pretest Classification
Vectors        Posterior    Jackknifed    Identifications

TED X VVL      100.0%       80.0%         40.0%
TED X V/C       95.0        50.0          27.5
TED X WPD      100.0        82.5          25.0
VVL X V/C       82.5        45.0          30.0
VVL X WPD       57.5        37.5          25.0
V/C X WPD       37.5        17.5           5.0

















Table 4. Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the TED, VVL, V/C, and WPD Vectors in All Three- and Four-Vector Combinations, N = 40.


                           Pretest Classification
Vectors                    Posterior    Jackknifed    Identifications

TED X VVL X V/C            100.0%       72.5%         47.5%
TED X VVL X WPD            100.0        80.0          42.5
TED X V/C X WPD            100.0        65.0          47.5
VVL X V/C X WPD             87.5        47.5          35.0
TED X VVL X V/C X WPD      100.0        72.5          45.0










and WPD both yielded 47.5 per cent correct identification scores, whereas the TED, VVL, and WPD combination and the four-vector combination yielded identification scores of 42.5 per cent and 45 per cent, respectively.


Discussion


Single Vector Effectiveness


Based on the results of the laboratory-normal experiment, it would appear possible to consider, at least, some of the basic questions asked by this research. For example, of the several selected temporal parameter groups, the time-energy distribution (TED) vector apparently contained the most idiosyncratic characteristics, at least for these normal or "ideal" recording conditions. This judgement is based on the fact that application of the TED resulted in the highest correct classifications being made. Utilizing correct speaker selection as the judgement criterion, the voiced/voiceless speech-time (VVL) vector, vowel/consonant duration ratio (V/C) vector, and the word and phrase duration (WPD) vector followed TED in a decreasing order of effectiveness.

There are several possible explanations for this ranking of the temporal vectors. The high identification scores for the TED vector might well have been predicted, since it contained the largest number of parameters (40) and, therefore, possibly contained more information about the talker's speech characteristics than did any of the other vectors investigated in this research. In addition, previous research has demonstrated the possible effectiveness of a time-energy vector. For example, in 1963 Pruzansky examined a time-energy distribution similar to the one utilized in this study. She found that, of her ten speakers, about half were correctly identified. Also, spectral information may be considered to be a frequency counterpart of the TED vector, and spectral analysis has been shown to exhibit a number of speaker dependent properties. That is, while it does not necessarily follow that the work of Majewski and Hollien (1974), Hollien and Majewski (1977), Doherty and Hollien (1978), and others would predict the relative success of the TED vector, their results do imply that an attempt to utilize the time-energy information may be warranted.

The voiced/voiceless speech-time vector (containing only two parameters) also demonstrated identification scores that were sometimes equal in magnitude to those of the TED vector. These identification levels possibly could have been predicted on the basis of studies that examined the function of phonation in the speaker identification process (see, for example, LaRiviere, 1975; Atal, 1972; Wolf, 1972; Sambur, 1975; Doherty, 1976; and Doherty and Hollien, 1978). The results of these investigations seem to support the supposition that the level of phonation or vocal activity plays a role in the speaker identification task.

In direct contrast to the TED and VVL vectors, the application of the vowel/consonant duration ratio and the word and phrase duration vectors resulted in very low identification scores. These low scores may be due, in part, to the method by which they were obtained. Specifically, hand measurements made on sound spectrograms were used and therefore lent themselves to imprecision. Also, the number of samples and type of samples were severely limited. Even though the V/C and WPD vectors did not produce high identification scores, they should not be discounted totally for future studies as not having potential as speaker-identifying features. Indeed, the speaker identification literature supports the importance of analysis of vowels and consonants in the identification task (for example, see Bricker and Pruzansky, 1966; Stevens et al., 1968; Illes, 1972; LaRiviere, 1974; Wolf, 1972; Goldstein, 1976; and others).


Multiple Vector
Effectiveness


The results of the paired vectors appear to exhibit the same general trends as in the case of the findings of the single vectors. However, most of the vector pairs exhibited at least slight increases in their identification levels on both the pretest and identification tasks. As would be expected (from examining the single vector scores), the TED-VVL vector combination resulted in the highest identifications for any paired vectors. This finding supports the assumption made earlier that the TED and VVL vectors contain more idiosyncratic (talker) characteristics than do either of the remaining two vectors (i.e., V/C or WPD). The high levels for the TED-VVL combination also suggest that these two vectors are, for the most part, sampling different types of information. That is, if the information contained in the vectors was redundant, there should not have been an increase in the combined identification score. On the other hand, if the information was mutually exclusive, it would be expected that the combined identification rates would be about equal to the sum of the single vector identification levels. In the case of the TED-VVL combination, the individual scores were 25 per cent correct identification each, but the score when they were paired was 40 per cent correct. Thus, the paired scores support the suggestion that these two vectors sample different speaker-dependent characteristics. At this stage in the research it is impossible to know exactly what characteristics are being measured. However, some speculation as to the composition of this speaker information is possible. Since the TED vector is a temporal correlate of the speech spectra, it is possible that some of the information being measured by the TED vector is spectral in nature. On the other hand, the VVL vector deals with the level of phonation. Therefore, this vector most likely contains only fundamental frequency information.
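The reasoning about redundant versus mutually exclusive information amounts to a pair of bounds on the combined score, which the following Python sketch makes explicit. The function name is hypothetical, and treating the two cases as a strict lower and upper bound is a simplification of the argument in the text.

```python
def combination_bounds(score_a, score_b):
    """Expected combined score (per cent) under the two limiting cases.

    Lower bound: fully redundant information, so no gain over the
    better single vector. Upper bound: mutually exclusive
    information, so the single scores add (capped at 100 per cent).
    """
    return max(score_a, score_b), min(100.0, score_a + score_b)


low, high = combination_bounds(25.0, 25.0)  # TED and VVL single scores
observed = 40.0                             # TED-VVL paired score
print(low, high, low < observed < high)
```

For the TED-VVL pair the bounds are 25 and 50 per cent, and the observed 40 per cent falls between them, nearer the non-redundant case.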

Further examination of the paired vectors demonstrates that when either the V/C or WPD vectors are added to the other vectors (or to each other), the results provide little or no improvement in identification level. However, there is an interesting relationship which occurs when the V/C vector is paired with either the TED or VVL vectors. When the TED-V/C combination was utilized, the pretest scores were lower than for the TED alone--thus, the overall performance of the TED vector was, in a sense, degraded. This finding would suggest that the information being sampled by V/C was enhancing speaker similarities and the procedures utilized in this research were insufficiently sensitive to separate and use the speaker-dependent information. On the other hand, the pairing of the VVL and V/C vectors resulted in some improvement in the identification levels of both the pretest and identification tasks. Therefore, the results of the VVL-V/C vector combination suggest that the data being sampled are significantly different and thus speaker identification is enhanced.

The pretest classification and identification scores reached a peak of "effectiveness" in the three- and four-vector combinations. However, these increases were not of the same magnitude as those achieved when vectors were paired. An explanation for this relationship may be found in the parameter selection procedure. For example, once the TED and VVL vectors have been included in a vector combination, most of the parameters available for identification are accounted for. Therefore, addition of the remaining vectors contributes little in the way of new parameters. Hence, very little improvement in speaker identification levels can be expected.










Parameter Selection


As stated previously, from two to 40 separate parameters were entered into the discrimination process. However, a statistical selection procedure was utilized in order to maximize the correct identification scores and to reduce the number of parameters needed for identification. That is, a stepwise statistical method of parameter inclusion and exclusion was carried out during the discriminant analysis. The determination of predictiveness for each parameter and/or group of parameters was carried out utilizing an F-statistic with a computation formula that may be found in Forsytie et al. (1973). In accordance with this system of parameter selection, only the most predictive measurements were entered into the discrimination process. In this first experiment, the 40 parameters of the time-energy distribution vector were reduced to only 17 usable parameters. However, both of the VVL parameters, voiced speech time and voiceless speech time, were utilized. Of the four vowel/consonant ratios, only two met the sample requirements of discriminant analysis: the V/C ratio of "good" and the V/C ratio of "not." Of these two parameters, only the V/C ratio of "not" was of any functional value as a predictive element in the identification process. As a single vector, the word and phrase duration (WPD) vector yielded no usable parameters.
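To indicate how an F-statistic can rank parameters by predictiveness, the Python sketch below computes a univariate one-way analysis-of-variance F ratio for each parameter, with talkers as the groups. This is a stand-in only: the stepwise inclusion and exclusion procedure and the computation formula of the cited reference are not reproduced here, and the data layout and function names are assumptions made for illustration.

```python
def f_statistic(groups):
    """One-way ANOVA F ratio for one parameter.

    groups: one list of values per talker. Raises ZeroDivisionError
    when there is no within-talker variation at all.
    """
    g = len(groups)
    n_total = sum(len(values) for values in groups)
    grand_mean = sum(sum(values) for values in groups) / n_total
    means = [sum(values) / len(values) for values in groups]
    ss_between = sum(
        len(values) * (m - grand_mean) ** 2 for values, m in zip(groups, means)
    )
    ss_within = sum(
        (x - m) ** 2 for values, m in zip(groups, means) for x in values
    )
    return (ss_between / (g - 1)) / (ss_within / (n_total - g))


def rank_parameters(data):
    """Parameter indices ordered from largest to smallest F ratio.

    data: dictionary mapping each talker to a list of sample vectors.
    """
    n_params = len(next(iter(data.values()))[0])
    scores = [
        f_statistic([[x[j] for x in samples] for samples in data.values()])
        for j in range(n_params)
    ]
    return sorted(range(n_params), key=lambda j: scores[j], reverse=True)
```

A parameter whose values differ greatly between talkers but little within a talker receives a large F ratio and is ranked first; a parameter with the same mean for every talker receives an F ratio of zero.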

Further, in the parameter selection process of the multiple vectors, some variables were added to the vector group while others, formerly included, were dropped. In the larger vector (TED), some of the parameters utilized in the single pattern were dropped when the parameters of the VVL vector were included. However, the inclusion of the V/C and WPD vectors resulted in no change in the TED parameter selection. Generally, only one of the V/C parameters was included in multiple vector groups. This relationship may have been caused by the type of information being measured in the various parameters. In this case, certain parameters of the vectors duplicate the data measured by parameters of other vectors. Therefore, some of the parameters have reduced predictive power and are dropped.


Test Sample Selection


As should be readily apparent from the results in Tables 2-4, the method of classification plays an important role in determining the magnitude of the pretest or identification scores. In all cases, the posterior classification procedure facilitated the highest percentage of correct subject selection--and there appeared to be two explanations for these higher scores. First, in this case the test sample was utilized in the computation of the reference discriminant functions. Consequently, the classification procedure is biased toward the selection of the correct reference set. Second, by utilizing each of the four samples (in turn) as test samples, four opportunities for recognizing the correct reference set were allowed. Therefore, this procedure makes the chances of matching the correct test and reference much greater than if only one sample was utilized.

The jackknifed classification method demonstrated the next highest correct classification levels. As stated earlier, this method did not utilize the test sample within the reference set computations and this difference may account for the decreased scores, relative to the posterior classifications. However, the jackknifed classifications were better than those for the identification task. The most obvious explanation for this finding is that in the jackknife pretest all four samples were used as tests, whereas the identification task utilized only one of the speech samples as the test.










The identification task was the most important test of the vectors' speaker identification effectiveness. The pretest classifications serve the purpose of demonstrating which parameter groups have the greatest potential as speaker-identifiers. However, they are only usable in closed set tests (when the criminal is known). Therefore, the identification task must be considered the core test of this research. This approach utilized only the first of the four speech samples as the test sample. Hence, classifications in this case were based solely on the information being sampled by that single speech sample. A second identification task was investigated utilizing the fourth speech sample as the test. This procedure was carried out in order to test the consistency of the first identification task. The vector effectiveness resulting from the second identification task followed more closely the trends set in the posterior and jackknifed procedures. Specifically, the TED vector resulted in the highest identification score (52.0 per cent correct). The VVL vector followed with a considerably lower identification score of 12.5 per cent correct, and both the V/C and WPD vectors attained the same low scores that resulted from the first identification task. This set of findings seems to indicate that the identification effectiveness of a vector is not consistent throughout the four speech samples. That is, for the TED vector, the first sample was less idiosyncratic of a given speaker and therefore less recognizable than was the fourth speech sample. However, for the VVL vector, the first sample was more idiosyncratic of a particular speaker than was the fourth sample. In addition, while the identification tasks attempt to approximate more closely the forensic model, this fourth sample identification task suggests that the pretest procedures (especially the jackknifed method) provide a better indication of the relative identification capabilities of these selected temporal vectors. Finally, the identification task demonstrates that the more restrictive the test sample selection procedure becomes, the more difficult the identification of the correct speaker becomes. Nevertheless, it also should be noted that the type of classification method utilized to make the speaker discriminations did not change the vector effectiveness ranking among the temporal vectors.

In summary, the following conclusions may be stated with respect to the selected temporal vectors and their speaker identification capabilities.










1. Under the constraints of this experiment, the time-energy distribution and the voiced/voiceless speech time vectors (and to a lesser degree the vowel/consonant duration ratio) appear to exhibit idiosyncratic speaker identifying characteristics.

   a. The V/C and WPD vectors, as defined and measured in this study, do not function well as cues to speaker identification.

2. The TED and VVL vectors appear to contain distinctly different speaker-related information.

3. The V/C vector enhances the performance of the VVL vector while it seems to degrade the speaker identifying abilities of the TED vector.















CHAPTER IV


THE RESULTS AND DISCUSSION OF THE LABORATORY-DISTORTED SPEECH EXPERIMENT


The second experiment undertaken in this program of research was carried out in order to test the speaker identification capabilities of the selected temporal parameters with respect to speaker distortions of a specific type. The speakers (20 adult males) were recorded under the same conditions as in the first experiment. However, three experimental speaking situations were imposed upon the talkers. As detailed earlier, these speaking conditions were: (1) normal, (2) stress, and (3) disguise. The results of this second experiment may be found below.


Results


Normal Speaking Condition


In a manner similar to that utilized in the first experiment, three test selection procedures were used to examine the identification capabilities of the selected temporal measures. Two pretesting methods, posterior and jackknifed classification, again were used to investigate the intra- and inter-sample variability. The third approach was that of speaker identification; this third method was utilized to simulate the forensic model. The results of these three experimental procedures may be found in Table 5. As may be seen from these results, the time-energy distribution (TED) vector, when applied in isolation, produced the highest pretest and identification scores. This vector correctly classified all twenty speakers in the posterior procedure, 95 per cent of the speakers in the jackknifed pretest, and 60 per cent of the speakers in the identification task. Next in vector effectiveness (with respect to speaker identity) was the voiced/voiceless speech time (VVL) vector. In this case, the correct speaker classification rates were 65 per cent and 40 per cent in the two pretesting methods but only 7.5 per cent in the identification task. Application of the vowel/consonant duration ratio (V/C) and the word and phrase duration (WPD) vectors resulted in no correct classifications or identifications at all.

Table 5. Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the Time-Energy Distribution (TED), Voiced/Voiceless Speech Time (VVL), Vowel/Consonant Duration Ratio (V/C) and Word and Phrase Duration (WPD) Vectors in the Normal Speaking Condition: N = 20; All Scores Are in Percentages.

                            Pretest Classifications
Vectors                     Posterior   Jackknifed   Identifications

A. Single Vectors
  TED                           100.0         95.0              60.0
  VVL                            65.0         40.0               7.5
  V/C                             0.0          0.0               0.0
  WPD                             0.0          0.0               0.0

B. Paired Vectors
  TED X VVL                     100.0         95.0              55.0
  TED X V/C                     100.0         95.0              60.0
  TED X WPD                     100.0         95.0              60.0
  VVL X V/C                      65.0         40.0              15.0
  VVL X WPD                      80.0         40.0              20.0
  V/C X WPD                       0.0          0.0               0.0

C. Three-Vector Combinations
  TED X VVL X V/C               100.0         95.0              55.0
  TED X VVL X WPD               100.0         90.0              50.0
  TED X V/C X WPD               100.0         95.0              60.0
  VVL X V/C X WPD                65.0         40.0              15.0

D. Four-Vector Combinations
  TED X VVL X V/C X WPD         100.0         90.0              55.0

By further examination of Table 5, it may be seen that the classification and identification scores of the combined vectors reveal several relationships; that is, those that could be predicted from the examination of the single vectors. Moreover, it should be noted that combining the other vectors with the TED vector resulted in little or no improvement in the overall pretest or identification scores. This finding resulted from the fact that in the posterior classification procedure, all the speakers were correctly identified; therefore no improvement was possible. In the case of the jackknifed and identification procedures, no improvement was observed because few new parameters were included in the vector groupings.
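The two pretest procedures differ in whether the sample being classified also contributes to the reference data. The following Python sketch illustrates that difference. It is a minimal illustration only, assuming that "posterior" classification means resubstitution (each sample is classified against references built from all samples, itself included) and that "jackknifed" classification means leave-one-out. A nearest-centroid rule stands in for the discriminant analysis, and the speakers, feature values, and function names are hypothetical rather than taken from this study.

```python
# Hypothetical sketch: resubstitution ("posterior") versus leave-one-out
# ("jackknifed") classification. A nearest-centroid rule is an assumed
# stand-in for discriminant analysis; the data are invented for illustration.

def centroid(samples):
    """Mean vector of a list of equal-length feature tuples."""
    n = len(samples)
    return tuple(sum(values) / n for values in zip(*samples))

def nearest_speaker(sample, centroids):
    """Label of the centroid closest (squared Euclidean distance) to sample."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def posterior_score(data):
    """Per cent correct when every sample also helps build its own reference."""
    centroids = {spk: centroid(samples) for spk, samples in data.items()}
    trials = [(spk, s) for spk, samples in data.items() for s in samples]
    hits = sum(nearest_speaker(s, centroids) == spk for spk, s in trials)
    return 100.0 * hits / len(trials)

def jackknifed_score(data):
    """Per cent correct when the test sample is left out of its own reference."""
    hits = total = 0
    for spk, samples in data.items():
        for i, s in enumerate(samples):
            centroids = {k: centroid(v) for k, v in data.items() if k != spk}
            centroids[spk] = centroid(samples[:i] + samples[i + 1:])
            hits += nearest_speaker(s, centroids) == spk
            total += 1
    return 100.0 * hits / total

# Two invented speakers, four one-dimensional samples each.
data = {
    "A": [(0.0,), (1.0,), (2.0,), (5.5,)],
    "B": [(8.0,), (9.0,), (10.0,), (11.0,)],
}
print(posterior_score(data))   # 100.0
print(jackknifed_score(data))  # 87.5
```

With these invented values the sample 5.5 is assigned to speaker A only when it contributes to A's own centroid, so the jackknifed score falls below the posterior score. This is the same direction of difference seen between the two pretest columns of Table 5.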


Stress Speaking Condition


In the case of the speaking condition during which subjects were stressed, the posterior classification procedure was eliminated. This classification procedure was judged inappropriate since the stressed speaking samples were compared only with the normal samples. Therefore, only the jackknifed pretest and identification tasks could be carried out. The results of this set of tests may be found in Table 6. Under this stress condition, the TED vector again yielded the highest levels of speaker classification and identification (70 per cent and 40 per cent correct). However, the VVL vector did almost as well as the TED vector in the jackknifed method, 65 per cent. On the other hand, the identification rate of the VVL vector was 20 per cent correct, considerably lower than that achieved by the TED vector. The remaining two vectors, V/C and WPD, did not correctly classify any of the talkers when these vectors were used in isolation.

Table 6. Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the TED, VVL, V/C, and WPD Vectors in the Stress Speaking Condition: N = 20; All Scores Are in Percentages.

                            Pretest Classifications
Vectors                     Jackknifed               Identifications

A. Single Vectors
  TED                             70.0                          40.0
  VVL                             65.0                          20.0
  V/C                              0.0                           0.0
  WPD                              0.0                           0.0

B. Paired Vectors
  TED X VVL                       65.0                          30.0
  TED X V/C                       70.0                          40.0
  TED X WPD                       70.0                          40.0
  VVL X V/C                       65.0                          20.0
  VVL X WPD                       35.0                          15.0
  V/C X WPD                        0.0                           0.0

C. Three-Vector Combinations
  TED X VVL X V/C                 65.0                          30.0
  TED X VVL X WPD                 65.0                          35.0
  TED X V/C X WPD                 70.0                          40.0
  VVL X V/C X WPD                 35.0                          20.0

D. Four-Vector Combinations
  TED X VVL X V/C X WPD           65.0                          30.0

The multiple vectors produced no real improvement in the classification or identification of the stress talkers over those of the single vectors. The TED X VVL vector combination achieved the same jackknifed score (65 per cent) as that of the VVL vector and an identification score (30 per cent) only slightly better than that of the VVL vector. Combinations involving the V/C and WPD vectors resulted in no improvement, and in some cases their effects tended to degrade the level of correct identification. For example, the VVL X WPD vector combination produced a jackknifed classification of 35 per cent correct and an identification rate of 15 per cent; both of these scores were lower than those of the VVL vector utilized singly. In the case of the four-vector combination, the jackknifed pretest resulted in only 65 per cent correct (Table 6). This score is 5 percentage points lower than the score attained when the TED vector was utilized in isolation.










Disguised Speaking Condition


The results of the disguise condition of the second experiment may be seen in Table 7. From this table it can be seen that only the jackknifed and identification procedures were utilized. The posterior classifications could not be carried out because all classifications of this speaking condition were done by comparing the disguised samples to the normal samples. As with the two preceding conditions, the TED vector for this speaking condition resulted in the highest set of scores. In the jackknifed pretest, 45 per cent of the disguised voices were correctly matched to the talkers who produced them; correct identifications were made in 30 per cent of the cases. The VVL vector achieved 35 per cent correct disguise-to-normal matches, but in the identification task only 5 per cent of the voices were correctly identified. Also, as was seen from the previous speaking conditions, the V/C and WPD vectors were unable to produce any correct classifications or identifications.

Table 7. Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the TED, VVL, V/C, and WPD Vectors in the Disguised Speaking Condition: N = 20; All Scores Are in Percentages.

                            Pretest Classifications
Vectors                     Jackknifed               Identifications

A. Single Vectors
  TED                             45.0                          30.0
  VVL                             35.0                           5.0
  V/C                              0.0                           0.0
  WPD                              0.0                           0.0

B. Paired Vectors
  TED X VVL                       60.0                          40.0
  TED X V/C                       45.0                          30.0
  TED X WPD                       45.0                          30.0
  VVL X V/C                       35.0                           5.0
  VVL X WPD                       35.0                          15.0
  V/C X WPD                        0.0                           0.0

C. Three-Vector Combinations
  TED X VVL X V/C                 60.0                          40.0
  TED X VVL X WPD                 60.0                          35.0
  TED X V/C X WPD                 45.0                          30.0
  VVL X V/C X WPD                 35.0                           5.0

D. Four-Vector Combinations
  TED X VVL X V/C X WPD           60.0                          40.0

Combining the vectors resulted in some classification and identification improvement for these disguised conditions. Specifically, the TED X VVL vector combination correctly classified the disguised voices 60 per cent of the time; this combination also resulted in 40 per cent correct in the identification task. All other vector combinations resulted in scores which were representative of the vector with the highest single vector score. For example, vector combinations which contain the TED parameters (except TED X VVL) resulted in a set of scores which were identical with those of the TED used singly. This same relationship is found for combinations containing VVL and TED X VVL parameters.
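The pattern described in this paragraph can be checked directly against the Table 7 scores. The following Python sketch is an illustration only: the scores are transcribed from Table 7, while the comparison rule (each combination is compared with the best-scoring single vector, or the TED X VVL pair, that it contains) and all names in the code are assumptions introduced here, not part of the original analysis.

```python
# Scores transcribed from Table 7: (jackknifed, identification), in per cent.
TABLE_7 = {
    ("TED",): (45.0, 30.0),
    ("VVL",): (35.0, 5.0),
    ("V/C",): (0.0, 0.0),
    ("WPD",): (0.0, 0.0),
    ("TED", "VVL"): (60.0, 40.0),
    ("TED", "V/C"): (45.0, 30.0),
    ("TED", "WPD"): (45.0, 30.0),
    ("VVL", "V/C"): (35.0, 5.0),
    ("VVL", "WPD"): (35.0, 15.0),
    ("V/C", "WPD"): (0.0, 0.0),
    ("TED", "VVL", "V/C"): (60.0, 40.0),
    ("TED", "VVL", "WPD"): (60.0, 35.0),
    ("TED", "V/C", "WPD"): (45.0, 30.0),
    ("VVL", "V/C", "WPD"): (35.0, 5.0),
    ("TED", "VVL", "V/C", "WPD"): (60.0, 40.0),
}

# Assumed reference sets: the four single vectors plus the TED X VVL pair.
REFERENCES = [("TED",), ("VVL",), ("V/C",), ("WPD",), ("TED", "VVL")]

def reference_scores(combo):
    """Best scores among the reference sets strictly contained in combo."""
    contained = [r for r in REFERENCES if set(r) < set(combo)]
    return max(TABLE_7[r] for r in contained)

def departures():
    """Combinations whose scores differ from their reference scores."""
    return [combo for combo in TABLE_7
            if len(combo) > 1 and TABLE_7[combo] != reference_scores(combo)]

for combo in departures():
    print(" X ".join(combo), TABLE_7[combo], "vs", reference_scores(combo))
```

Under these assumptions the check flags TED X VVL (the improvement noted in the text) and two identification scores involving WPD: VVL X WPD (15.0 against 5.0 for VVL alone) and TED X VVL X WPD (35.0 against 40.0 for TED X VVL). All other combinations match their reference scores exactly.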


Discussion


Normal Speaking Condition


It will be remembered that in this first procedure the classifications are made by comparing the normal speaking samples to the normals. The results of the first speaking condition of this second experiment reaffirm many of the findings of the initial experiment. That is, the time-energy distribution (TED) vector appeared to be the most effective predictor of a talker's identity, at least for the vectors investigated in this research. It also is demonstrated that the voiced/voiceless speech time (VVL) vector was ranked second to TED in speaker identification capabilities, whereas the remaining two vectors (vowel/consonant duration ratio and word and phrase durations) showed little speaker identification power.

The high levels of identification found for this normal condition might have been expected on the basis of the results of the initial experiment. In that regard, previous researchers have shown that "TED-like" information is a reasonable predictor of a talker's identity (Majewski and Hollien, 1974; Doherty and Hollien, 1978, and others). For example, Pruzansky (1963) reported that a time-energy distribution was effective as a speaker identification cue. Therefore, it may be assumed that a vector such as TED contains idiosyncratic speech characteristics which would permit identification of a speaker from his voice alone. Also, the higher identification scores provided by the TED vector, in relation to the other vectors, may be due, at least in part, to the large number of parameters (40) available for the identification process. Thus, a great deal of speaker-dependent material apparently is utilized in this classification and identification technique.

The VVL vector also demonstrated a modest level of effectiveness. This finding also was shown in the first experiment. In addition, previous investigation has shown vocal activity to play a role in the speaker identification process (see, for example, LaRiviere, 1975; Atal, 1972; Wolf, 1972; Sambur, 1975; Doherty, 1976; and Doherty and Hollien, 1978). Therefore, it would appear that the VVL vector measures certain invariant speaker identification features, and this vector may be a viable tool in a speaker identification system.


Stress Speaking Condition


This experimental condition attempts to discover whether stressful speaking situations reduce the speaker identification capabilities of the selected temporal parameters. The results of this experiment demonstrated that the type of stress utilized in this investigation had the effect of lowering the level of identification. For example, the TED vector yielded classification scores reduced by 25 percentage points (95 per cent vs. 70 per cent) and identification scores reduced by 20 percentage points (60 per cent vs. 40 per cent). These two relationships suggest that the features being measured by the TED vector are altered or varied when a speaker is placed in a stressful situation. However, it should be remembered that the stress speech samples are being compared to the normal speech samples. Thus, the samples were non-contemporary.
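Because the scores are themselves percentages, the TED reductions quoted for the stress condition are differences in percentage points. The short Python sketch below, offered only as a worked illustration with hypothetical function names, computes both the point difference and the corresponding relative reduction for the TED scores cited in the text.

```python
def point_drop(normal, stressed):
    """Difference between two percentage scores, in percentage points."""
    return normal - stressed

def relative_drop(normal, stressed):
    """Reduction expressed as a per cent of the normal-condition score."""
    return 100.0 * (normal - stressed) / normal

# TED scores (per cent correct) cited in the text: normal vs. stress.
ted = {"jackknifed": (95.0, 70.0), "identification": (60.0, 40.0)}

for task, (normal, stressed) in ted.items():
    print(task, point_drop(normal, stressed),
          round(relative_drop(normal, stressed), 1))
# jackknifed 25.0 26.3
# identification 20.0 33.3
```

Stated relative to the normal-condition scores, the identification task lost proportionally more (about one third) than the jackknifed pretest (about one quarter).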




Full Text

TEMPORAL PARAMETERS WITHIN THE SPEECH SIGNAL APPLIED TO SPEAKER IDENTIFICATION

BY

CHARLES CLIFFORD JOHNSON, JR.

A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1978

Copyright 1978

To my wife, Christine, and my children, Charles III and Cristen.

ACKNOWLEDGEMENTS

I would like to express my sincerest thanks to Dr. Harry Hollien. He guided me through very trying times and, through his experience with students and insight into research problems, has, more than anyone else, made this dissertation possible. Drs. Howard Rothman and Sam Brown also deserve special thanks for being friends as well as members of my committee. I also acknowledge Drs. Alan Agresti, William Sanders, and Donald Dewsbury for taking the time to be members of my committee.

I would like to acknowledge Drs. E. Thomas Doherty and Gilbert Tolhurst for freely giving of their time and experience during this dissertation. I also express thanks to the entire staff of the Institute for Advanced Study of the Communication Processes, especially Amy Burnett, Dave Campbell, Kathy Farley, Carlos Febles, Jim Fitzgerald, Norman Green, Patti Hollien, Angela Hunt, Bob Idzikowski, Arlene Malick, Debbie Martin, Sally Potter, Russ Pierce, and Jamie Stone. The above mentioned people were listed in alphabetical order since there is no way in which each person's contribution to this dissertation could be measured.

Special thanks go to the Alachua County Sheriff's Department, especially Captain Wes Schellenger.

Extra special thanks go to all my friends who have given me much needed encouragement, guidance, and some very helpful insight. Thank you Jim Hicks, Paul Graycar, Alan Smith, Rich Hill, and Brian Klepper.

Finally and certainly not least, I would like to thank my wife, Christine. She has sacrificed greatly for me, encouraged me in the worst of times and never given up on me. To her and her alone do I owe the most, a debt I will never be able to repay.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

I INTRODUCTION
    Review of the Literature
        Aural/Perceptual Speaker Identification
        Spectrographic Speaker Identification
        Automatic or Semi-Automatic Speaker Identification
            Spectral Analysis (Long-term Speech Spectra)
            Fundamental Frequency
            Formant Frequency
            Temporal Parameters
    Objectives

II METHODS
    Temporal Parameters
        Time-Energy Distribution (TED)
        Voiced/Voiceless Speech Time Vector (VVL)
        Vowel/Consonant Duration Ratio (V/C)
        Word and Phrase Durations (WPD)
        Multiple Vectors
    Experiment One
        Laboratory, Normal (LN)
        Subjects
        Speech Material
        Procedure
    Experiment Two
        Laboratory Distorted Speech (LD)
        Subjects and Speech Material
        Procedure
    Experiment Three
        Semi-Field Conditions (SF)
        Subjects and Speech Material
        Procedure
    Statistical Analysis

III THE RESULTS AND DISCUSSION OF THE LABORATORY-NORMAL EXPERIMENT
    Results
    Discussion
        Single Vector Effectiveness
        Multiple Vector Effectiveness
        Parameter Selection
        Test Sample Selection

IV THE RESULTS AND DISCUSSION OF THE LABORATORY-DISTORTED SPEECH EXPERIMENT
    Results
        Normal Speaking Condition
        Stress Speaking Condition
        Disguised Speaking Condition
    Discussion
        Normal Speaking Condition
        Stress Speaking Condition
        Disguised Speaking Condition
        Parameter Selection Techniques
        Test Sample Selection

V THE RESULTS AND DISCUSSION OF THE SEMI-FIELD EXPERIMENT
    Results
    Discussion

VI SUMMARY AND CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

1 Parameters Measured and Investigated for Possible Utilization as the Time-Energy Distribution (TED) Vector; Speech Bursts Represent the Portion of the Speech Energy Which Is Above a Given Relative Intensity Level and Pause Periods Are Defined as Those Areas Between the Speech Bursts at a Given Energy Level Where the Energy Falls Below the Given Level

2 Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the Time-Energy Distribution (TED), Voiced/Voiceless Speech Time (VVL), Vowel/Consonant Duration Ratio (V/C), and Word and Phrase Duration (WPD) Vectors on an Individual Basis

3 Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the TED, VVL, V/C, and WPD Vectors in All Possible Pairs

4 Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the TED, VVL, V/C, and WPD Vectors in All Three- and Four-Vector Combinations

5 Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the Time-Energy Distribution (TED), Voiced/Voiceless Speech Time (VVL), Vowel/Consonant Duration Ratio (V/C) and Word and Phrase Duration (WPD) Vectors in the Normal Speaking Condition

6 Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the TED, VVL, V/C, and WPD Vectors in the Stress Speaking Condition

7 Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the TED, VVL, V/C, and WPD Vectors in the Disguised Speaking Condition

8 Identification Scores for the Laboratory-Distorted Speech Experiment Obtained Utilizing the Fourth Speech Sample as the Test; Only the TED, VVL, and TED X VVL Vectors Were Examined

9 Identification Ranking for the Semi-Field Experiment Obtained Utilizing Discriminant Analysis on the Time-Energy Distribution (TED) and the Voiced/Voiceless Speech Time (VVL) Vectors

10 Identification Rankings for the Semi-Field Experiment Obtained Utilizing Cross-Correlations on the TED and VVL Vectors; the Correlation Coefficients Between the Known and Unknown Speakers Are Also Listed

LIST OF FIGURES

Figure

1 Block diagram of the equipment utilized to generate an energy envelope from the analog speech signal and extract the time-energy distribution (TED) parameters

2 Schematic representation of a typical energy envelope as generated from the TED equipment configuration

Abstract of Dissertation Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TEMPORAL PARAMETERS WITHIN THE SPEECH SIGNAL APPLIED TO SPEAKER IDENTIFICATION

By

Charles Clifford Johnson, Jr.

August, 1978

Chairman: Harry Hollien
Major Department: Speech

This research project was carried out with the purpose of investigating some of the idiosyncratic speech characteristics which permit an individual to be identified from his voice alone. The specific objectives of this study were to: (1) select and examine certain temporal speech parameters, with reference to their speaker identification capabilities, (2) test the speaker identification effectiveness of the selected parameters under stress and disguise conditions, and (3) examine the effects of simulated field conditions on the speaker identification capabilities of the selected temporal vectors.

From all the possible temporal characteristics which exist within the speech signal, four general sets of parameters were chosen. These temporal vectors included durational analysis of: (1) relative energy at several levels of intensity, (2) voiced and voiceless activity, (3) vowel/consonant ratios, and (4) specific words and phrases. Each of these vectors was composed of from 2 to 40 variables. These temporal vectors were extracted from speech samples generated from three experiments.

The initial experiment was a laboratory-based study. Forty adult males read a standard prose passage while being recorded in an "ideal" laboratory setting. The results of this first experiment demonstrated the time-energy distribution (TED) vector to be the most effective of the selected temporal parameters. The voiced/voiceless speech time (VVL), vowel/consonant duration ratio (V/C), and word and phrase duration (WPD) vectors followed in descending order of identification effectiveness.

The second experiment in this research also was laboratory-based. In this case, the subjects (20 adult males) were recorded under conditions similar to those of the first experiment. However, these subjects read the passage in three different manners. These speaking conditions were: (1) normal, (2) stress (applied via electric shock), and (3) free disguise. This experiment resulted in the same vector effectiveness as the first experiment. That is, application of the TED vector yielded the highest levels of identification and the VVL, V/C, and WPD vectors followed in effectiveness. In addition, it was found that stress and disguise speaking conditions do reduce the identification power of the selected temporal vectors. It should be noted that, while the disguise condition yielded much lower scores than the normal condition, these scores were higher than those reported in other similar studies.

In the third study, the temporal parameters were investigated under conditions which would parallel those found in the forensic model. A speaker simulated a "crime" over the telephone and a "suspect pool" was created by recording subjects in a simulated interrogation procedure. The findings in this study demonstrated that the vectors were relatively ineffectual in this very restrictive situation. However, the TED and VVL vectors did show some limited potential, indicating that these vectors may, at some later date, be useful in a speaker identification system suitable for the forensic world.

In general, a few overall conclusions can be made based on the findings of the three completed studies.

1. Temporal characteristics found within the speech signal are important in the speaker identification process.

2. Certain temporal characteristics are idiosyncratic of an individual's speech patterns.

3. Stressful and disguised speaking conditions reduce the levels of identification exhibited by these selected temporal vectors.

4. The temporal parameters examined in this research program are less affected than frequency parameters when a speaker disguises his voice.

5. The restrictive condition of a simulated field situation greatly interferes with the identification powers of these temporal vectors.

6. The temporal parameters may be a useful addition to an established speaker identification system.

CHAPTER I

INTRODUCTION

In general, the most important information contained within the speech signal, produced by a given individual, is the linguistic message. However, this message is by no means the only information transmitted to a listener via the utterance. Data about the speaker's general emotional state, educational background, geographic origin, and/or specific identity also may be provided. All this information is important and warrants investigation. However, it was the focus of this research to examine only the speaker identifying information which is conveyed through the speech signal.

The identification of a speaker from his/her voice alone routinely occurs under a variety of familiar circumstances; for example, in telephone conversations, at cocktail parties, from radio broadcasts, etc. In some situations, it is not just desirable but crucial to be able to extract the speaker's identity from his voice alone. For example, "infallible" speaker identification techniques must be available to the military before voice activated weaponry can be developed and utilized with safety. In addition, businesses, banks, security companies and similar organizations have a need for voice activated computers, electronic devices, and machinery.

Speaker identification is of special interest to law enforcement agencies. Currently, the recording of conversations by criminals and/or suspects is common practice in criminal investigations. In addition, these recorded conversations often are admissible in courts of law. Therefore, a reliable and objective speaker identification system would be of inestimable value in the identification and conviction of criminals.

A substantial amount of research has been carried out in response to the need for speaker identification techniques that are reliable and objective. This research may be classified into three categories: (1) aural/perceptual speaker recognition, (2) visual recognition (spectrogram matching), and (3) machine recognition.

The aural/perceptual "method" is simply speaker recognition by listening. That is, the technique utilizes the abilities of the human auditory system and the cognitive powers of the human brain to determine the identity of a speaker. Research examining aural/perceptual speaker recognition has shown it to be quite valid under some circumstances (e.g., when the talker is well known to the listener), but severe limitations also have been demonstrated. A review and discussion of the literature appropriate to aural/perceptual speaker identification will be found below.

The so-called "spectrographic method" of speaker identification uses the aural/perceptual approach combined with a visual pattern matching technique based upon frequency-by-time-by-intensity sound spectrograms ("voiceprints"). These spectrograms are compared by means of a pattern matching procedure. A review of the relevant spectrographic speaker identification literature also will be presented below. However, controversy exists relevant to the predictive value of the "voiceprint" research; review of this conflict may be found in the following: Black et al., 1973; Bolt et al., 1970, 1973; Hollien, 1974, 1977; and Hollien and McGlone, 1976.

The third general approach currently being employed in speaker identification tasks utilizes sophisticated electronic devices (other than the sound spectrograph). In reality there are many "machine" approaches; however, all appear to exhibit four important advantages: (1) the specific acoustic and/or temporal parameters to be employed may be extracted from the speech signal serially and/or simultaneously, (2) the parameters (or group of parameters) utilized may be used in various combinations, (3) the subjectivity of human analysis is eliminated to a great degree (Hollien and Majewski, 1977), and (4) the analysis can be done to any level of desired accuracy. In sum, the machine approach appears to show the greatest potential for ultimately producing a valid and objective speaker identification system.

It should be pointed out that most research using the machine approach utilizes parameters based on acoustic analysis, particularly frequency spectrum and fundamental frequency, and the importance of these speaker characteristics to speaker identification has been demonstrated in the research literature. On the other hand, there are various temporal parameters that can be extracted from the speech signal; they generally have not been studied for potential as speaker identification cues. However, temporal speech parameters, such as vowel and consonant duration, have been extensively investigated relative to the area of speech perception. A review of some of the relevant literature may be found in Lehiste (1967).

In summary, speaker recognition has been examined by three approaches: aural/perceptual speaker recognition; spectrographic or "voiceprint" speaker identification; and automatic and semiautomatic recognition. Each of these approaches has been examined in the research literature and some of their advantages and disadvantages explored. A review of the relevant literature follows.

Review of the Literature

Aural/Perceptual Speaker Identification

One of the earliest systematic attempts to examine speaker identification by listening (aural) was reported by McGehee (1937). In this study she first concealed auditors behind a screen and had a single speaker read (to them) a passage of 56 words. These auditors returned the second day for a second set of trials. In this case they listened to five speakers (including the original talker) reading the same 56-word passage and they were asked to identify which of the speakers was the original talker. McGehee reports a correct identification rate of 83 per cent. The process was repeated at two-week, three-month, and five-month intervals. In these cases the scores for correct identification were 68 per cent, 35 per cent, and 13 per cent, respectively. On this basis, McGehee concluded that the ability to identify speakers aurally deteriorated as a function of time.

In a later study, Pollack, Pickett, and Sumby (1954) investigated certain aspects of the aural speaker identification task. They had normal male talkers of similar age, dialect, and rate of speaking read a single passage both in a whispered and voiced speaking mode. Comparisons were made between the identification scores of these two speaking conditions. The authors reported that if similar correct identification scores were to be obtained for both conditions, the utterance duration for the whispered passage had to be three times that of the voiced passage. On this basis the authors suggested that duration plays a significant role in the aural speaker identification process; primarily it allows larger samples of the speaker's repertoire to be tested. Indeed, results of the voiced samples showed that duration was important to identification only up to 1200 milliseconds; beyond that point no further improvement in performance was noted. The authors also investigated the effects of low- and high-band pass filtering on listener performance. The findings from this portion of the Pollack et al. study demonstrated that correct identification is not critically dependent upon any delicate balance of frequency components in any single portion of the spectrum.

Compton (1963) also examined the effects of filtering and duration on the aural identification process. He used sustained productions of the vowel /i/ while varying duration and filtering conditions. He reported that duration is a factor in listener performance; specifically, levels of correct identification increased with lengthening of the sample up to a duration of about 1250 milliseconds. This finding is in close agreement with that reported by Pollack et al. (1954). Compton also found that if the speech sample was filtered above 1020 Hz, speaker identification rates were substantially reduced, but filtering below 1020 Hz appeared to have no significant effect on the identification performance.

Bricker and Pruzansky (1966) investigated the aural-perceptual effects of both duration and content on speaker identification by listening. Ten male talkers recorded five types of speech samples: excerpted vowels, excerpted consonant-vowel (CV) sequences, monosyllabic words, disyllabic nonsense words, and sentences. Listeners exhibited the highest correct scores for samples of the greatest length (sentences) and identification performance decreased as length of sample decreased. Moreover, better listener performances were obtained with CV speech samples than with vowel excerpts of equal duration. The experimenters inferred that the number of phonemes within a speech sample was of greater importance to identification than its absolute duration. In the second part of their experiment, Bricker and Pruzansky utilized the vowels /i/ and /a/ as experimental stimuli to study the effects of content in aural identification. Their results indicate that speaker recognition is not independent of the utterance. They also found that a confusion matrix for talker identification was not symmetrical. That is, talker A may be confused with talker B but talker B may not be confused with talker A. These findings raise some interesting questions about the decision criterion utilized by human listeners determining a given speaker's identity.

As one aspect of a much larger study, Stevens et al. (1968) investigated aural speaker identification using the vowels /i/ and /a/ as speech stimuli. (The main purpose of this study was to compare aural and visual methods of speaker identification. The aural method produced higher identification scores than the visual method in all portions of this study.) The talkers utilized in this study were homogeneous with respect to their voice; each read isolated words from which the experimental vowels were extracted. Identifications based on the aural portion of this study demonstrated that a word with the front vowel, /i/, was a better cue to the identification of the speaker than a word containing the back vowel /a/. The authors suggested that the higher second formant of /i/ might have aided in the improved listener performances. Iles (1972) also found that speaker identification scores varied when different vowels were used as speaker-identifying features. This study involved six speakers and 16 listeners. Speech stimuli for the listeners were excerpted from a passage read by the talkers; they consisted of several sentences and four different vowels. Among other things, Iles found that speaker differentiation cues were present to a greater degree in the low vowels. She concludes that the

first formant may contain some idiosyncratic characteristics which aid in speaker identification. However, Iles' results are somewhat at variance with those of Stevens et al. (1968), but the manner of presentation may have had a differential effect on the results of the two studies. Clarke and Becker (1969) used a rating system to study aural speaker identification. In this study, each listener rated the talker on six different scales: pitch, pitch variability, rate, click-like elements, sibilant intensity and breathiness. The listeners used a seven-point scale for pitch and five-point scales for the other variables. Identifications were made utilizing each scale singly and then in all possible permutations. Pitch was found to be the most effective speaker discrimination characteristic when it was used singly to determine a speaker's identity. Click-like elements, sibilant intensity, breathiness, rate and pitch variability followed in decreasing order of effectiveness. LaRiviere (1975) studied the role of voice source in, and the effects of vocal tract transfer characteristics on, speaker identification. He used voiced, whispered and low-pass filtered vowels in order to examine (1) source information (filtered), (2) vocal-tract information (whispered), and (3) both (voiced). Using these three

sample conditions, LaRiviere found the whispering (vocal tract) and low-pass (source) samples resulted in about equal correct identification scores. In addition, he found that the summed scores for the whispering and low-pass filtering conditions were about equal to the scores for the voiced full-band condition. The author concluded that both voice source and vocal tract characteristics were of equal importance to speaker identification and seemed to be communicating different information to the listener. A study examining the effects of stress and disguise on perceptual identification of talkers was carried out by Hollien, Majewski, and Hollien (1974). Adult male talkers read an extended prose passage under the following conditions: (1) normal speech, (2) stress (talkers were subjected to randomly distributed electric shock while speaking), and (3) disguised speech. Three types of listeners were utilized: (1) listeners who knew the talkers, (2) listeners who did not know the talkers, and (3) listeners who knew neither the talkers nor the language (i.e., native speakers of Polish who did not know English). As would be expected, the results of this study revealed that the group who knew the talkers did best under all speaking conditions, with the other two groups exhibiting poorer scores. The authors concluded

that exposure to a talker aids in the speaker identification process. By examining across speaking condition, rather than listener type, it was found that stress (as utilized in this study) had little effect on the speaker identification task. However, disguise greatly impaired the listener's ability to identify the talkers. The same speaking condition trends were consistent for all three listener types. Finally, in a recent study, Rothman (1977) investigated the effects of noncontemporary speech samples on aural speaker identification. Pairs of speakers were chosen for their vocal similarities. Some were father-son combinations; some were brothers or twins; still others were simply sound-alikes. These subject pairs all had a long history of being confused with one another. Rothman recorded the talkers twice, one week apart. These recorded samples were presented to listeners in two-second speech segments for each pair of talkers. Same or different talker judgements were made by the listeners under the following conditions: (1) same talker/contemporary sample; (2) same talker/noncontemporary sample; and (3) different talkers. The results: (1) speakers paired with their own contemporary sample were identified 94 per cent of the time; (2) speakers paired with their own noncontemporary sample were identified 42

per cent of the time; and (3) speakers paired with the claimed vocal twin were correctly identified as themselves 58 per cent of the time. The fact that the per cent of correct identifications is greater for group 3 than for group 2 (58% vs. 42%) would imply that listeners can detect some idiosyncratic cues which aid in the determination of a speaker. The results also indicated that within the constraints of the population utilized, i.e., adult males chosen for similarity of their voices, time appeared to have played the most important role in aural speaker identification. It would, therefore, appear that even when recordings are made only one week apart, aural identification of speakers is greatly impaired. To summarize the research on the aural/perceptual approach to speaker identification, the following relationships have been observed.
A. Duration. Some evidence suggests that speaker identification may be largely a function of the absolute duration of an utterance. More recent studies, however, suggest that duration is important only insofar as it allows listeners to sample a larger repertoire of the talker's speech behavior.
B. Fundamental Frequency. There is evidence that the fundamental frequency of a speaker plays an important role in speaker identification.

C. Formant Frequency. The reported research suggests that there is a relationship between formants (especially the second formant) and speaker identification. However, these relationships are not consistent over individuals.
D. Phoneme Effects. Speaker confusion appears to vary with the phoneme; however, this confusion also varies with specific vowels as well as voice inflections and consonant-vowel sequence.
E. Speaker Conditions. Speakers have the ability to disguise their voices and considerably reduce correct identification, even if they are familiar to the listeners. However, there is some evidence to suggest that speakers recorded under physical stress are not much more difficult to identify than normally recorded speakers.
F. Contemporaneousness. There also is evidence to indicate that if speech samples are noncontemporary, aural/perceptual speaker identifications are greatly impaired.
G. Familiarity. The research reports demonstrate that exposure of a talker to a listener plays an important role in aural speaker identification. Also, evidence indicates that knowledge of the speaker's language aids in the identification process.

Spectrographic Speaker Identification

A second method by which speaker identification has been investigated utilizes the frequency-time-intensity sound spectrograph. The sound spectrograph was originally developed at the Bell Telephone Laboratory primarily for the purposes of teaching the deaf to speak (see Potter

et al., 1966). However, in 1944, Gray and Kopp discussed the identification of speakers by visual inspection of spectrograms and concluded that this method appeared to offer good potential for such application. Later, Kersta (1962) proposed that speech spectrograms (voiceprints) contained cues which would enable observers to identify speakers from spectrograms of their utterances alone. In his "investigation," talkers apparently produced ten monosyllabic words in isolation, and spectrograms were made of each word. The observers (young females) were given a five-day training period in which they were taught to identify speakers from their spectrograms (the exact nature of this training was not specified). Upon completion of their training, these observers were required to match known spectrograms with unknown spectrograms. Kersta reports that the results of this matching technique yielded a 99 per cent correct identification rate. Young and Campbell (1967) attempted to replicate elements of the Kersta research. They used the same ten monosyllabic words Kersta used, both in isolation and in context. Sound spectrograms were made of each word in a fashion identical to that employed by Kersta. The observers utilized in this study also went through a training

procedure. Then they were presented with spectrograms of known and unknown speakers for identification. The results indicate that words in isolation produce much higher correct identification scores than words in context (78.4% in isolation, 37.3% in context). These scores are also substantially different from those reported by Kersta. Young and Campbell (1967) concluded that one reason for the difference between the two scores (isolation and context) may be the duration of the word. In isolation, the words tended to be much longer in duration than when put within the context of several words. The authors also inferred that different contexts change the speaker identification scores. Referring back to the Stevens et al. (1968) study discussed earlier, it may be observed that they also investigated the spectrographic speaker identification procedures by means of open and closed sets. The speech samples used were monosyllables, disyllables, and phrases. In the closed set tests, the correct match was always present, while in open set tests, the correct match might or might not have been present. The closed set tests yielded a mean identification score of 79 per cent; the open sets, 47 per cent identification. The results of the Stevens et al. study appear to be in fair agreement with Young and

Campbell (1967). However, these results are not in agreement with the results reported by Kersta (1962). In 1972, Tosi et al. reported a comprehensive laboratory study that examined the spectrographic method of speaker identification. These authors had speakers produce words in isolation, in fixed context, and in random context. The observers for this study were given a month of training in "phonetics and spectrogram matching procedures." In order to proceed to the more difficult tasks, each observer had to reach an identification score of 96 per cent correct for closed set tasks. The more difficult experimental trials consisted of open sets, words in context, and noncontemporary samples. It was found that closed sets produced lower error rates than open sets (5.5% vs. 9.9%), and that contemporary samples were more identifiable than noncontemporary samples (4.8% vs. 12.1% errors). In addition, the error rates for words in isolation, in fixed context and in random context were 4.2, 7.6, and 13.4 per cent, respectively. The results of this study indicate that visual speaker identification performance is best for (a) words in isolation, (b) closed sets, and (c) contemporary speech samples. Tosi et al. concluded that from the data in this experiment the spectrographic method of speaker identification demonstrates

interspeaker variability as greater or different than the intraspeaker variability. The authors also state that their results confirm the findings of Kersta (1962). In a contrasting study, Hazen (1973) examined speaker identification in open and closed sets under differing contextual conditions. For speech samples, Hazen utilized 60 speakers recorded while talking spontaneously. Four cue words were then isolated from the speech sample and spectrograms made of each word. The matching or identification procedure was done by subjects who were trained in the same manner as those receiving Kersta's "Voiceprint Identification Training Course." Results at all levels of this experiment demonstrated higher error rates than either Kersta (1962) or Tosi et al. (1972). Hazen concluded that spectral similarities due to intra-speaker consistency were not apparent enough to outweigh the similarities due to a different phonetic context. Hollien and McGlone (1976) examined the effects of disguise on the spectrographic approach to speaker identification. The talkers they utilized were instructed to read an extended prose passage in their normal voice and then repeat the reading using a "disguised" voice. The auditors in this experiment consisted of faculty and a

graduate student in phonetic sciences. All were well acquainted with the use of spectrograms for speaker identification purposes. These skilled auditors were asked to match the disguised sample spectrograms with a spectrogram made of the same speaker in his normal voice. The average score of these observers was 25 per cent correct identification. The authors concluded that the disguised condition greatly affected spectrographic speaker identification. In a later experiment, McGlone, Hollien, and Hollien (1977) demonstrated the possible reasons why the spectrographic method of speaker identification is not able to recognize an individual who is disguising his speech. Specifically, these authors found variations in speaking fundamental frequency, formant frequencies, and formant bandwidths. They also found the duration of speech samples was generally greater for the disguised voice than for the normal voice. These results imply that the speech of a talker can be altered significantly. They concluded that until these acoustic alterations can be generalized for the whole population and their effects predicted, the spectrographic approach to speaker identification would appear to be inadequate for use as a practical speaker identification system.

Reich, Moll, and Curtis (1976) designed an experiment to investigate the effects of vocal disguise upon spectrographic speaker identification. Forty male speakers provided speech samples on two separate occasions, one week apart. The samples consisted of several words excerpted from different sentences. This sample arrangement was analogous to the noncontemporary, random context of Tosi et al. (1972). The speaking modes utilized were: (1) normal, (2) old-age disguise, (3) hoarse disguise, (4) hypernasal disguise, (5) slow-rate disguise, and (6) free disguise. The examiners were Ph.D. candidates in the speech sciences program at the University of Iowa and were extensively trained in the use of spectrograms for speaker identification. Matching the undisguised test spectrograms to the undisguised reference spectrograms, the examiners correctly identified 56.7 per cent of the talkers. When the disguised spectrograms (all types) were matched to the undisguised reference samples, only 33 per cent of the speakers were correctly identified. The authors conclude that: (1) the type of disguise affects the degree to which spectrographic speaker identifications can be made, (2) certain speakers are more difficult to identify, and (3) the findings of this study do not substantiate the prior

claims that spectrographic speaker identification is unaffected by attempts at vocal disguise. Houlihan (1977) also examined spectrographic speaker identification of disguised voices. She carried out two related experiments; only the second will be reviewed. The speakers (eight females and eight males) in this experiment each produced sentences in five speaking modes: undisguised, lowered fundamental frequency, falsetto, whispered, and muffled. The mean identification score for the undisguised condition was 85.5 per cent (71 per cent for females and 100 per cent for males). However, for the disguised conditions, they exhibited overall an identification score of 27 per cent (28.5 per cent for females and 25 per cent for males). Houlihan's results demonstrate that female speakers are not more difficult to identify than male speakers under a normal or undisguised condition. The results of this experiment also confirm earlier studies that the disguised voice confounds the spectrographic method of speaker identification. Finally, in the second part of a study previously reviewed, Rothman (1977) investigated the effects of similar sounding talkers on the spectrographic approach to speaker identification. As described earlier, two

recorded speech samples (separated by one week) were produced by twelve talkers. Spectrograms were made of these samples and were presented to examiners for identification. Rothman found that contemporary phrases produced identification rates of 24 per cent while noncontemporary samples were considerably lower: about 6 per cent correct identification. On the basis of these data, he suggests that within the constraints of the study, time of utterance is a very important factor for speaker identification purposes. The author also concluded that the aural/perceptual approach to speaker identification is a significantly better approach than is the spectrographic method. In summary, while some studies (notably those of Kersta and Tosi) show high identification rates utilizing the spectrographic method for speaker recognition, most of the reported research does not confirm these high scores. The bulk of the literature reviewed above demonstrates that there are many factors which seriously affect the identifications made from spectrograms. Some of these factors are: (1) the effect of training of examiners; (2) whether or not the speaker is altering his/her voice in any way; (3) the phonemic context of the utterance; (4) whether or not the utterance to be identified is

recorded in a contextual setting or in isolation; (5) utterance duration; (6) whether or not identification trials are in open or closed sets; and (7) whether or not the sample recordings were contemporary or noncontemporary. As McGlone et al. (1977) have pointed out, it seems apparent from the literature that spectrographic representation of the voice is easily and greatly altered by numerous means. Therefore, a great deal of further investigation is necessary before the spectrographic approach can be considered a valid and reliable means of speaker identification.

Automatic or Semi-Automatic Speaker Identification

A number of "machine" approaches have been utilized in the study of the speaker identification process. These approaches may be categorized in a number of ways including (1) method of analysis, (2) the statistical technique, or (3) the speech features studied. For the purposes of this review, the automated approaches will be divided into groups on the basis of the acoustic features utilized as the determination of a speaker's idiosyncratic speaking characteristics. The features or

parameters which have been most extensively investigated are spectral analysis (long-term speech spectra), fundamental frequency, formant frequencies, and in a few cases, temporal features. A review of the more relevant literature dealing with automatic speaker identification follows.

Spectral Analysis (Long-term Speech Spectra)

An early study utilizing spectral analysis was reported by Pruzansky (1963). Spectral patterns were developed from ten words, excerpted from context, spoken by ten talkers (seven males and three females). The spectral patterns of three utterances of the same word by the same talker were used as a reference. The remaining utterance of the word was used as the test. Product moment coefficients of correlation were calculated between the reference pattern and the test pattern. The test and reference sample patterns which were most highly correlated were identified as being produced by the same speaker. Utilizing this correlation method of identification, Pruzansky correctly classified 89 per cent of 393 test utterances. She concluded that spectral distinctiveness of talkers is retained in long-term spectra.
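The correlation method described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reconstruction of Pruzansky's procedure: the five-value "spectral patterns," the talker labels, and the helper names are all hypothetical, and each reference is assumed to be the average of several utterance patterns.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation coefficient between two equal-length patterns."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_pattern(patterns):
    """Element-wise mean of several utterance patterns (assumed way of
    forming one reference pattern per talker)."""
    return [sum(values) / len(values) for values in zip(*patterns)]


def identify(test_pattern, references):
    """Return the talker whose reference correlates most highly with the test."""
    return max(references, key=lambda talker: pearson_r(test_pattern, references[talker]))


# Hypothetical patterns, invented for illustration only.
references = {
    "talker_1": average_pattern([[0.9, 0.7, 0.4, 0.2, 0.1],
                                 [0.9, 0.7, 0.4, 0.2, 0.1]]),
    "talker_2": [0.3, 0.8, 0.9, 0.5, 0.2],
    "talker_3": [0.1, 0.2, 0.5, 0.8, 0.9],
}
test_pattern = [0.8, 0.75, 0.35, 0.25, 0.05]

print(identify(test_pattern, references))
```

Because the correlation coefficient ignores overall level and scale, this kind of matching responds to the shape of the pattern rather than to how loudly the utterance was produced.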

Majewski and Hollien (1974) and Zalewski, Majewski, and Hollien (1975) examined the usefulness of long-term speech spectra as a cue for speaker identification. Both studies utilized the same group of subjects (50 Americans and 50 Poles). Each subject read a prose passage as a speech sample, from which power spectral information was extracted. Majewski and Hollien (1974) utilized Euclidean distances to classify the speakers and Zalewski et al. (1975) applied cross-correlations to the spectral data in order to make the identifications. The mean error rate for both the Majewski and Hollien and the Zalewski et al. studies was about 4 per cent. In a third study, Doherty (1976) used the 50 American speakers and the same feature extraction techniques as the two previous studies. However, he applied discriminant analysis in order to recognize the speaker patterns. Utilizing the same full bandwidth (80 Hz to 12.5 kHz), no errors in identification were found. Doherty at this point limited the bandwidth of the spectral information to the range between 315 Hz and 3.5 kHz, and he recalculated his error rate. In this case it was 24 per cent. Doherty concluded that even though error rates under the bandpass condition were encouraging, the 24 per cent was unacceptable for most practical applications. It should also be

noted that, while all of these studies (Majewski and Hollien, 1974; Zalewski et al., 1975; and Doherty, 1976) used basically similar populations, each utilized a different statistical technique. In other words, Doherty also demonstrated that the selection of statistical technique is of importance in the automatic approach to speaker identification. In order to test the resistance of spectral analysis to distorted speech, Hollien and Majewski (1977) studied 25 talkers who produced speech under psychologically stressful and disguise conditions. Power spectral information was extracted for all talkers in both full band (80 Hz to 17.5 kHz) and limited band (315 Hz to 3.5 kHz), and Euclidean distances calculated for the talker's individual spectral data. Under the stress condition, 92 per cent correct identification was achieved for the full bandwidth and 68 per cent identification with the limited bandwidth. The disguised condition yielded considerably lower scores, 20 per cent correct with full bandwidth and 10 per cent identification with limited bandwidth. Doherty and Hollien (1977) also examined speaker identification of distorted speech. The authors employed the same talkers, speech samples and feature extraction technique as those

employed by the previous study. However, in this case discriminant analysis was used as the statistical procedure. Results of this study were 72 per cent correct identification for the stress condition and 24 per cent correct for the disguise. The results of both studies demonstrate that stress has little effect on the recognition of a speaker, while disguised voices are much harder to identify than undisguised voices. The conclusion that may be made is that disguise, not stress, greatly alters the spectral characteristics of a talker's voice.

Fundamental Frequency

In order to examine another element within the frequency domain, Atal (1972) investigated pitch contours (fundamental frequency variations) as a cue to speaker identification. Atal formed a 20-dimensional vector based on the pitch data of ten speakers and utilized linear transfer functions to maximize the ratio of the interspeaker and intraspeaker variations of these pitch contours. From these data, reference and test vectors were formed and Euclidean distances calculated between these vectors. By matching the reference and test vectors with the smallest distance, the author was able to achieve an

identification rate of 98 per cent. On the basis of these results, Atal concludes that the pitch contours are useful acoustic features in a speaker recognition system. In 1972, Wolf used several classes of frequency data (fundamental frequency and features of vowel and consonant spectra) as speaker identification cues. The author had 21 male speakers record six sentences, whereupon the frequency information was extracted from the recordings and a linear classification procedure applied. Wolf did not analyze his parameters separately but did state that fundamental frequency was a very useful parameter in his identification paradigm. However, he also states that fundamental frequency is perhaps the easiest and most obvious acoustic feature to modify in vocal disguise. In a similar study, Sambur (1975) utilized the same speech material and features as those of Wolf (1972). However, the Sambur recording sessions spanned a period of three and one-half years and included eleven talkers. Average fundamental frequency was not actually tested in this recognition experiment. However, by use of a probability-of-error criterion, Sambur ranked average fundamental frequency twelfth among 38 possible recognition features. These calculations indicate that fundamental frequency

can (possibly) be a useful cue in recognizing an unknown speaker. In referring back to the Doherty (1976) study, it can be found that he also extracted speaking fundamental frequency from the speech of his 50 speakers. Using only a two-parameter vector to specify fundamental frequency, a 30.2 per cent correct identification score was achieved. This vector was also used in conjunction with limited bandwidth long-term speech spectra. The combination of the two vectors yielded an identification score of 97.7 per cent. As an overall conclusion, Doherty states that fundamental frequency does contain useful idiosyncratic data and appears to be independent of the information carried in long-term spectra. In the Doherty and Hollien (1978) study reported previously, the authors also used fundamental frequency as a cue to identify the 25 talkers. In this case, the individuals who produced stressed speech samples were correctly recognized 30 per cent of the time but the disguised samples only 10 per cent of the time. Since results of the stress condition are in agreement with those of Doherty (1976), it may be concluded that stress, of the type examined in these experiments, at least, does not

greatly alter speaking fundamental frequency. The poor identification scores for the disguise condition would seem to indicate that speaking fundamental frequency is greatly changed when a speaker attempts to disguise his voice. This relationship confirms the statement made by Wolf (1972) about disguised voices.

Formant Frequency

The resonances or formant frequencies of vowels and consonants have also been investigated as cues to speaker identification. In a study described earlier, Wolf (1972) extracted selected formants from his 21 male talkers. Utilizing 17 parameters, of which six were formant frequencies, 100 per cent recognition was achieved. Wolf concluded that vowel and consonant spectral information is useful in the classification of speakers. In a study also evaluated earlier, Sambur (1975) reports four measurements of the same type. They were: (1) the second formant of /n/, (2) the third formant of /u/, (3) the second formant of /i/, and (4) the third formant of /m/. Utilizing these four measurements and one temporal measurement, only one identification error in 320 trials was made. These results lead Sambur to conclude that

formant information was among the most useful in recognizing a speaker from his voice alone. Finally, Goldstein (1976) evaluated ten vowel formant structures as speaker identifying features. Ten adult males were recorded reading sentences containing key words. The formant tracks were extracted from these key words; 199 measurements were made and evaluated for effectiveness in speaker identification. In an identification experiment using two formant measurements, only 12 errors were made in 80 identifications. Furthermore, Goldstein evaluated five formant measurements with a technique called probability-of-error (described by Sambur, 1975). Utilizing this technique, an error rate of 0.25 per cent was calculated for the five measurements. Based on the findings in this study, Goldstein concludes that, since certain formant measures demonstrate large speaker differences, the variations must be dependent more upon the speaker's habits than on vocal-tract configuration. In addition, she states that it is possible that first and second formant measures contain more information than just vocal-tract length information.

Temporal Parameters

Temporal features within the speech signal have not been examined for speaker dependency in the same detail as have frequency features. However, in a study described earlier, Pruzansky (1963) tested the effect of utilizing only temporal data on speaker recognition success. In brief, she used ten talkers who recorded several repetitions of ten words in context. Two-dimensional patterns, consisting of the total energy in time segments, were formed by summing the energy over the several frequency bands for each time section. With a pattern matching technique, the temporal patterns yielded a 47 per cent correct identification score. Pruzansky concluded from her results that the temporal patterns were more correlated to the individual words than to the speakers. In the Wolf (1972), Sambur (1975), and Goldstein (1976) studies, limited temporal measurements were evaluated for possible speaker dependent characteristics. Wolf (1972) measured the total duration of the word "bought." He states that certain learned idiosyncratic voice characteristics possibly deal mainly in timing. However, his single example of a gross timing measure did not provide useful identification

information, that is, in comparison to his spectral measurements. Goldstein reports similar results for her timing measurements. She investigated the duration of several formant tracks and their ratios. None of the temporal parameters were included in the final identification analysis because of low inter-speaker variations. In contrast, Sambur found temporal measures useful in speaker identification. He made two measurements: (1) the slope of the second formant of the diphthong /ai/ and (2) the duration of the frication and aspiration noise of the plosive /k/ in "cash." Both these features were among Sambur's 10 most effective speaker identification cues. In the Doherty (1976) experiment previously discussed, two temporal parameters were examined for use in speaker recognition. These two measurements were speaking time and phoneme rate. Utilizing this two-parameter vector, only six of the total 50 speakers were correctly identified. However, when the temporal vector was added to his other vectors (speaking fundamental frequency and long-term speech spectra), the identification rates increased from 8 per cent to 26 per cent. This finding would suggest that speaker dependent information within spectral features was different from that contained in the temporal elements

of the speech signal. These two temporal features, speaking time and phoneme rate, also were examined under stress and disguise speech conditions (Doherty and Hollien, 1978). The identification scores for this study were very low (12 per cent for the stress and 16 per cent for the disguise). In addition, when these temporal features were combined with the spectral vectors, correct identification rates only increased from 4 per cent to 10 per cent. These results suggest there is not enough idiosyncratic information within these temporal measurements to justify their use alone. However, the results also indicate that the temporal elements within the speech signal contain some identification information that may be insensitive to stress or disguise. In summarizing the experimental findings examined in the preceding section, several conclusions can be drawn pertaining to the automatic or semi-automatic approach to speaker identification. Initially, it can be seen that the spectral components of the speech wave contain certain speaker-dependent features. It should be noted also that when studies utilized similar spectral measures, similar results were obtained. Thus, there appears to be a great deal of consistency among these experiments. Moreover, it was found that limited portions of the spectrum, viz., fundamental

frequency and formant frequencies, appeared to contain idiosyncratic characteristics relevant to the speaker identification task. A finding of special interest to this research is that temporal features extracted from the speech waveform were of value in speaker identification systems. Finally, it would seem that "machine" approaches to speaker identification appear to exhibit the greatest potential. There are two specific relationships which support this conclusion. First, when all the research carried out on speaker identification (aural, visual, and semi-automatic) is compared, this approach has produced consistently the highest and most reliable correct identification scores. Second, the semi-automatic approach has, for the most part, removed subjective judgements from the identification process.

Objectives

The primary aim of this research was to develop and test a system of inter-speaker differentiation which will seek to discover if certain speaker-dependent features will remain relatively invariant during both physical and psychological modifications to the acoustic environment. Specifically, this study examined selected temporal parameters which exist within the speech waveform. These parameters

were: (1) durational measurements of energy at several levels of magnitude, (2) presence of vocal activity, (3) patterns of silence, and (4) the duration of several specific words and phrases. It also was the aim of this study to investigate the effects of several speaking conditions upon the identification capabilities of the selected temporal parameters.

The specific goals of this research may be stated in the form of several questions. They are as follows:

A. Which of the several selected temporal elements, found within the speech signal, are invariant, idiosyncratic characteristics of an individual's speaking repertoire?
   1. Can these temporal measurements be made reliably?
   2. Once measured, do these parameters predict a speaker's identity?
B. Will stressful or disguised speaking conditions significantly reduce the speaker identification capabilities of these selected temporal parameters?
C. What changes occur in the identification strength of these parameters (or set of parameters) when the method of choosing a test speech sample is altered?

CHAPTER II
METHODS

As has been stated, the aim of this research is to develop a speaker identification system based on the analysis of certain temporal features. In general, these temporal measurements include durational analysis of (1) relative energy at several levels of intensity, (2) voiced or voiceless activity, (3) patterns of silence, and (4) specific words and phrases. These parameters were grouped into several vectors consisting of from 2 to 40 measurements and studied under a variety of speaking conditions. These experiments were classified as (1) laboratory-normal (LN), (2) laboratory-distorted (LD), and (3) semi-field (SF). Specifically, in the first experiment the temporal measurements were tested under "ideal" laboratory conditions for the purpose of establishing baseline data on the speaker-identifying ability of the selected features. In the second (LD) experiment, the vectors were applied to a speaker identification task under three speaker conditions: normal, stress, and disguise. This experiment was designed

to evaluate the effects of speaker distortion (voluntary and involuntary) on the selected (temporal) vectors' identification capabilities. The third and final experiment can be described as a semi-field study (SF). In this experiment recordings were made of a "crime-related" telephone message and several "suspects." An attempt then was made to identify the "criminal caller" from among the "suspects" in a closed-set paradigm. The purpose of this last study was to evaluate the temporal parameters in a quasi-forensic environment. A detailed description of the temporal vectors, experiments, and the statistical analyses utilized follows.

Temporal Parameters

From all the possible temporal characteristics which may exist within the speech signal, four general sets of vectors have been chosen. They are: (1) time-energy distribution (TED), (2) voiced/voiceless speech time (V/VL), (3) vowel/consonant duration ratios (V/C), and (4) word and phrase durations (WPD). These vectors were chosen on the basis of two factors: (1) their potential to contain information idiosyncratic to a speaker (based on previous research) and (2) the potential of obtaining these parameters from the speech signal.

Time-Energy Distribution (TED)

This temporal vector is based on a group of time-by-energy measurements, none of which have been studied previously in relation to speaker identification. In general terms, this analysis reflects the total accumulated time a talker's speech intensity remains at a specific energy level (relative to his peak amplitude). It also provides indications of the speaker's speech pattern with respect to speech bursts and pause periods. For the purposes of this research, the speech bursts are defined as the portion of the speech energy which is above a given energy level, and pause periods represent those areas between the speech bursts at a given level when the speech energy falls below that given energy level.

Operationally, the TED procedure was carried out utilizing a resistor-capacitor circuit to generate an energy envelope of the speech signal. This signal then was digitized by an analog-to-digital conversion on a Digital Equipment Corporation PDP 8i minicomputer. Figure 1 gives a block diagram of the equipment configuration. The digitized sequence was analyzed for duration relative to ten intensity levels via specific programming written especially

[Figure 1: block diagram. Components shown: Tape Recorder (Input); Low-Pass Filter; Operational Amplifier; Resistor-Capacitor Circuit (Rectifier/Integrator); A/D; Minicomputer; Teletype (Output); Time-Gated Analog Switch; Meter; Mixer; DC Power Supply.]

Fig. 1. Block diagram of the equipment utilized to generate an energy envelope from the analog speech signal and extract the time-energy distribution (TED) parameters.

for this analysis. That is, the digitized envelope was partitioned, functionally, into ten linearly equal energy levels, initiating with its peak amplitude. Figure 2 gives a graphic representation of a typical energy envelope as it would be partitioned. The speech burst and pause periods also are shown in this figure. The number of speech bursts, the mean and standard deviation of the speech bursts, and the standard deviation of the pause periods were computed for each energy level. The mean pause periods and number of pauses were not computed as they are direct reciprocals of the speech bursts. The total number of features measured in this vector was 40 (see Table 1). This vector was used in all three experiments.

Voiced/Voiceless Speech Time Vector (V/VL)

This vector is created by combining the duration of the voiced and voiceless portions of the speech signal. The first parameter was defined as the duration of articulated speech during a given speech sample. The second term of the V/VL vector consisted of the total duration of phonation or vocal activity during a speech sample. This specific vector previously has not been used in the speaker identification task. However, both voiced and voiceless phonemes

Fig. 2. Schematic representation of a typical energy envelope as generated from the TED equipment configuration.

Table 1. Parameters Measured and Investigated for Possible Utilization as the Time-Energy Distribution (TED) Vector; Speech Bursts Represent the Portion of the Speech Energy Which Is Above a Given Relative Intensity Level and Pause Periods Are Defined as Those Areas Between the Speech Bursts at a Given Energy Level Where the Energy Falls Below the Given Level.

Parameters (Measured at Each of Ten Energy Levels)      Number of Parameters
A. Speech Bursts
   1. Number of bursts                                   10
   2. Mean duration of each burst                        10
   3. Standard deviation of the duration of bursts       10
B. Pause Periods
   1. Standard deviation of the duration of pauses       10
C. Total Number of Parameters Available                  40
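
The measurements listed in Table 1 can be illustrated with a short sketch. It is not the original PDP 8i program, which is not reproduced in the text. The function names, the placement of the thresholds at k/10 of the peak (k = 0 to 9, lowest level first), the strict "above threshold" comparison, and the use of the population standard deviation are all assumptions made for illustration.

```python
from itertools import groupby
from statistics import mean, pstdev


def burst_and_pause_durations(envelope, threshold, dt):
    """Durations (s) of runs above the threshold and of the gaps between those runs."""
    runs = [(above, sum(1 for _ in grp) * dt)
            for above, grp in groupby(e > threshold for e in envelope)]
    # Pauses are only the gaps BETWEEN bursts, so drop leading/trailing quiet runs.
    while runs and not runs[0][0]:
        runs.pop(0)
    while runs and not runs[-1][0]:
        runs.pop()
    bursts = [d for above, d in runs if above]
    pauses = [d for above, d in runs if not above]
    return bursts, pauses


def ted_vector(envelope, dt, n_levels=10):
    """Four features per energy level: burst count, burst mean, burst SD, pause SD."""
    peak = max(envelope)
    features = []
    for k in range(n_levels):
        threshold = peak * k / n_levels  # assumed: linear steps relative to the peak
        bursts, pauses = burst_and_pause_durations(envelope, threshold, dt)
        features += [
            len(bursts),
            mean(bursts) if bursts else 0.0,
            pstdev(bursts) if bursts else 0.0,  # assumed: population SD
            pstdev(pauses) if pauses else 0.0,
        ]
    return features


# Hypothetical digitized envelope sampled every 10 ms.
example_envelope = [0, 2, 2, 0, 0, 10, 10, 10, 0, 1]
example_vector = ted_vector(example_envelope, dt=0.01)
```

With ten levels the sketch returns 40 values, matching the parameter count in Table 1. Under the same assumptions, the summed burst durations at the lowest level correspond to the total articulation time used for the voiced/voiceless vector.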

have been studied by a number of researchers in speaker identification (see, for example, LaRiviere, 1975 and Coleman, 1973). Their research has indicated the importance of both voiced and voiceless speech sounds to the identification of talkers from their speech.

Phonation time is obtained from the IASCP Fundamental Frequency Indicator (FFI-8) linked to a PDP 8i minicomputer. FFI-8 is a digital readout fundamental frequency tracking device. It consists of a group of successive low-pass filters with cutoffs at half-octave intervals coupled with high-speed switching circuits which are controlled by a logic system. FFI produces a string of pulses which are delivered to a PDP 8i computer. These pulses mark the boundary of a fundamental period from a complex wave. An electronic clock marks the time from pulse to pulse and these values are processed digitally to yield (along with other data) the geometric mean frequency level and standard deviation of the frequency distribution. While FFI-8 is basically designed to extract fundamental frequency from a complex wave, it will also calculate the duration of the vocal activity (phonation time).

The second phase of this procedure was to extract total articulation time. Referring back to the TED procedure,

the summed duration of the speech bursts of energy level one represents the total amount of articulated speech (see Figure 2). Taken together, the total articulation time and the phonation time also represent the voiceless speech component of the sample. This two-dimensional vector was utilized in all three experiments.

Vowel/Consonant Duration Ratio (V/C)

The V/C vector is made up of the ratios of the duration of selected vowels to the duration of their consonantal environments. For this procedure, a separate ratio was calculated for each of the following words: "good," "not," "cannot," and "sort." A vector of this type previously has not been utilized in speaker identification research. However, many researchers have demonstrated the individual importance of vowels and consonants in identification of talkers (Clarke and Becker, 1969; LaRiviere, 1975; and Glenn and Kliener, 1968).

The procedure for the extraction of the V/C vector utilized time-by-frequency-by-intensity speech spectrograms, made on a Voiceprint Identification, Inc., Model 700 spectrograph. Speech spectrograms were made of the words

"good," "not," "cannot," and "sort." Hand measurements of the durations of the vowels and their associated consonants were made from the time-frequency-intensity spectrograms. These measurements were spot checked for accuracy by an independent observer. The V/C ratios were formed from the duration of a vowel or vowels in a selected word and the duration of the consonant or consonants in that same word. This process allowed the formation of one ratio for each sampling of the selected word. The V/C vector was utilized as a speaker identification feature only in the first two experiments.

Word and Phrase Durations (WPD)

The WPD vector is generated from measurements of the individual duration of several selected words and phrases. The words chosen for use in the WPD vector were: "good," "not," "cannot," and "sort." In addition, the durations of three selected phrases were computed. These phrases include "they have," "they cannot," and "it is not." A WPD vector of this type has been examined previously by Wolf (1972), who investigated a number of acoustic parameters for use in the identification of speakers. Specifically, one of the parameters he used was the duration of the word "bought."

He indicated that this word duration (alone) did not provide good speaker identification information but that such a parameter may be useful.

In order to obtain the WPD vector, the cited words and phrases were processed in the same manner as were the words for the V/C vector. That is, speech spectrograms were made and the durations calculated from hand measurements. This technique yielded a vector made up of seven parameters. The WPD vector was utilized only in the laboratory-normal and laboratory-distorted speech experiments.

Multiple Vectors

In addition to examining the speaker identification capabilities of each of the vectors separately, these vectors were investigated in all possible combinations. Specifically, the vector combinations were utilized in groups of two vectors (TED V/VL, TED V/C, TED WPD, V/VL V/C, V/VL WPD, and V/C WPD), three vectors (TED V/VL V/C, TED V/VL WPD, TED V/C WPD, and V/VL V/C WPD), and all four vectors (TED V/VL V/C WPD). The technique of combining several parameter sets has been shown to improve speaker identification systems (Doherty, 1976; Sambur, 1973; and Goldstein, 1976).
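
The eleven combinations listed above can be enumerated mechanically. The following is only an illustrative sketch: the names VECTOR_SIZES, vector_combinations, and combined_size are hypothetical. The parameter counts are taken from the vector descriptions (40 for TED, 2 for V/VL, 7 for WPD). The count of 4 for V/C assumes one ratio per selected word, and treating a combination as a simple concatenation of parameters is likewise an assumption.

```python
from itertools import combinations

# Parameters per vector; the V/C count assumes one ratio for each of the four words.
VECTOR_SIZES = {"TED": 40, "V/VL": 2, "V/C": 4, "WPD": 7}


def vector_combinations(names):
    """All groupings of two or more vectors, smallest groups first."""
    return [combo
            for size in range(2, len(names) + 1)
            for combo in combinations(names, size)]


def combined_size(combo):
    """Number of parameters if the vectors in a combination are concatenated."""
    return sum(VECTOR_SIZES[name] for name in combo)


combos = vector_combinations(list(VECTOR_SIZES))
for combo in combos:
    print(" x ".join(combo), combined_size(combo))
```

The enumeration yields six pairs, four triples, and one four-vector grouping, in the same order as the list in the text.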

This multiple vector approach provides the most efficient use of these temporal vectors in a speaker identification system. For example, if two vectors contain the same or similar information about a speaker's voice, then the combining of these vectors would not produce improved identification scores. Conversely, if two vectors vary independently of one another, then their combination should produce better identification scores. Therefore, this procedure demonstrated which vectors are contributing new information and which vectors contain duplicate information.

Experiment One: Laboratory, Normal (LN)

This initial experiment was a laboratory-based study. The subjects read speech material in a laboratory environment and were recorded with "ideal" laboratory equipment. The purpose of this experiment was to develop baseline data on the speaker identification capabilities of the temporal vectors. A detailed description of the subjects, speech material, and recording procedure will follow.

Subjects

Forty adult speakers were chosen from a volunteer pool of faculty and students at the University of Florida. Subject selection was made on the following basis: (a) 18 to 40 years of age, (b) no apparent speech defects, and (c) no unusual regional or foreign dialects. These minimal requirements yielded a relatively homogeneous population. Thus, these subjects permitted initial baseline testing of the temporal parameters and presumably reduced any obvious inter-speaker variations.

Speech Material

Subjects read a modernization of "An Apology for Idlers" by R. L. Stevenson, an approach which permitted the speech samples to be context independent. The passage was chosen because it is relatively long (approximately 600 words) and contains most phonemes of the English language. Therefore, it provided a good representation of the subject's speech repertoire and allowed the sample to be divided into several smaller subunits where necessary.

Procedure

The subjects utilized in the LN experiment were recorded under laboratory conditions, in an IAC-1200 sound-treated chamber with an Ampex Model No. 354 tape recorder coupled to an Electro-Voice microphone Model No. EV 664. The recorded speech samples were divided into four subsamples (30 seconds) in order to permit the extraction of the TED and V/VL vectors. Also, selected words and phrases were processed as described in the V/C and WPD sections.

Experiment Two: Laboratory Distorted Speech (LD)

As stated earlier, the purpose of the second experiment was to investigate the effects of speaker distortions (voluntary and involuntary) on the robustness of the temporal vectors with respect to the speaker identification task. The subjects were recorded under the same laboratory conditions as those of the first experiment. However, this experiment involved three speaking conditions (normal, stress, and disguise). This experiment provided data on the consequences of speaker distortion on this speaker identification system. A complete description of the procedures utilized will be found below.

Subjects and Speech Material

The speakers were 20 male faculty and graduate students at the Institute for Advanced Study of the Communication Processes and the Department of Speech at the University of Florida. These subjects were normal speakers of American English ranging in age from approximately 25 to 45 years; they exhibited no unusual dialects, speech or voice disorders. The speech material for this experiment was the same as that used in the first experiment.

Procedure

The subjects were recorded using the same equipment described for the first experiment. However, three speaking conditions were included in the recording procedure. They were: (a) normal speech (control), (b) stress, and (c) disguise. Emotional stress can be defined in a number of ways; in this case, it was induced by applying electric shock, delivered randomly, while the subject was speaking. For the third condition, subjects were requested to disguise their speech as completely as they could. The only restriction placed on them was that they could not use a "foreign dialect" or whisper; in addition, they were encouraged to use only the modal voice register.

The recorded speech samples were divided into four subsamples (30 seconds) for TED and V/VL analysis. For the V/C and WPD vectors the selected words ("good," "not," "cannot," and "sort") and phrases ("they have," "they cannot," and "it is not") were extracted as specified above.

Experiment Three: Semi-Field Conditions (SF)

The third and final experiment had as its purpose the testing of the identification capabilities of the temporal vectors under less than laboratory conditions. In this case, subjects simulated a "crime" over a telephone. Later, suspect interrogation procedures were carried out. It was hoped that the results of this experiment would provide information as to how well the selected temporal vectors performed under conditions which more closely parallel the forensic model.

Subjects and Speech Material

The subjects for this semi-field experiment were 12 adult volunteers drawn from local law enforcement agencies. From this pool of volunteers, one of the subjects was selected to assume the role of the criminal; he simulated

a telephone-related "crime" (a kidnapper's demand call). The remaining eleven subjects were designated as "suspects." All subjects (the 11 suspects and the caller) were recorded during an interview in which they recited statements made by the original caller. This interrogation procedure permitted the control of context and provided a closed-set approach.

Procedure

The criminal call was recorded over a telephone on a reel-to-reel tape recorder via a direct line hookup. It was made from a telephone in a relatively quiet environment, a procedure that provided for reasonably high quality recordings of this type. However, the interrogation procedure was recorded in quite a different manner. In this case, all subjects (suspects and caller) were recorded in a large and relatively noisy room. This procedure was followed in order to model a typical forensic situation. Only the TED and V/VL vectors were extracted from these recordings. The V/C and WPD vectors could not be utilized because the particular subject recordings did not provide sufficient repeated words and phrases.

Statistical Analysis

Of the many techniques available, discriminant analysis was chosen as the statistical approach to the identification of the speakers. This particular technique was chosen over several others because it demonstrated higher speaker identification results than have such other methods as Euclidean distance analysis or cross-correlation (Doherty, 1976; Doherty and Hollien, 1978; and Zalewski et al., 1975). Discriminant analysis computes a set of linear functions, termed discriminant functions, which then are utilized to classify individual samples or observations into one of several groups. In the case of this research, the discriminant functions were utilized to classify an unknown speaker's test sample against reference sets generated on known speaker sets. The input data to this procedure consisted of sets of samples, each of which contained values for all the parameters. From the parameters an F-statistic was calculated and used to determine which parameters were the most powerful as identification cues.

Three classification methods were utilized within the statistical technique. Initially, the known or reference set consisted of all four of the talker's speech samples.

Then, each sample was reclassified, in turn, with respect to the reference sets. This method was labeled posterior classification and was used to "pretest" the speaker-identifying features investigated in this research. The second method chosen was the jackknife approach. In this procedure, each sample was eliminated (in turn) from the reference set before computation of the discriminant functions was carried out. In this case, classification was made on the "removed" speech samples. A third method was utilized to simulate the forensic model. The initial sample of each talker was arbitrarily designated the test sample and the reference set consisted of the remaining three samples. This final method was employed in the identification task.
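
The difference between the three classification methods lies in which samples form the reference set when a given sample is classified. The sketch below illustrates only that sample handling. It deliberately substitutes a nearest-centroid rule with Euclidean distance for the discriminant functions actually used in this research, so the scores it produces are not comparable to those reported. All function names and the toy data are hypothetical.

```python
from math import dist


def centroid(samples):
    """Mean feature vector of a speaker's reference samples."""
    return [sum(column) / len(column) for column in zip(*samples)]


def classify(sample, references):
    """Stand-in classifier: speaker whose reference centroid is nearest the sample."""
    return min(references, key=lambda spk: dist(sample, centroid(references[spk])))


def posterior_rate(data):
    """Every sample is classified against reference sets that include it."""
    hits = sum(classify(s, data) == spk
               for spk, samples in data.items() for s in samples)
    return hits / sum(len(samples) for samples in data.values())


def jackknife_rate(data):
    """Each sample is removed from its speaker's reference set before classification."""
    hits = total = 0
    for spk, samples in data.items():
        for i, sample in enumerate(samples):
            references = dict(data)
            references[spk] = samples[:i] + samples[i + 1:]
            hits += classify(sample, references) == spk
            total += 1
    return hits / total


def first_sample_rate(data):
    """The first sample of each talker is the test; the remaining samples are references."""
    references = {spk: samples[1:] for spk, samples in data.items()}
    hits = sum(classify(samples[0], references) == spk
               for spk, samples in data.items())
    return hits / len(data)


# Hypothetical two-speaker data, four samples each, two features per sample.
toy = {
    "A": [[3.0, 0.0], [0.0, 0.0], [0.2, 0.0], [-0.2, 0.0]],
    "B": [[4.0, 0.0], [4.2, 0.0], [3.8, 0.0], [4.0, 0.2]],
}
print(posterior_rate(toy), jackknife_rate(toy), first_sample_rate(toy))
```

In the toy data the first sample of speaker A is atypical, so the first-sample method scores lower than the two pretests.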

CHAPTER III
THE RESULTS AND DISCUSSION OF THE LABORATORY-NORMAL EXPERIMENT

The initial experiment had two major purposes. The first was to examine certain temporal characteristics of an individual's speech in order to discover if they permit him to be identified from his voice alone; a second objective was to establish baseline speech data. As described in previous sections, adult males provided speech samples from which the following temporal vectors were extracted: (1) a time-energy distribution vector; (2) a voiced/voiceless speech time vector; (3) a vowel/consonant duration vector; and (4) a vector resulting from the analysis of the durations of several selected words and phrases. These vectors were made up of a number of parameters; each was tested for its speaker identification capability with respect to the acoustically controlled environment of this laboratory-type experiment. A description of the obtained results and a discussion of these findings follow.

Results

The pretest and identification scores which were obtained when the vectors were utilized singly may be found in Table 2. Three types of speaker classifications were used to test these vectors, two of which were pretests and the third an identification task. The first pretest was labeled "posterior classification." In this case, all samples were utilized in the reference set determination and then each sample was classified in relationship to one of the reference sets. The purpose of this initial pretest was to examine the inter-sample variability.

By means of this first method, it was found that the time-energy distribution (TED) vector could be utilized to recognize the speakers correctly 100 per cent of the time; a correct classification rate of 57.5 per cent was found for the voiced/voiceless speech time (V/VL) vector. The scores for the vowel/consonant duration ratio (V/C) vector decreased to only 22.5 per cent correct, while the word and phrase duration (WPD) vector yielded no correct classifications at all.

Table 2. Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the Time-Energy Distribution (TED), Voiced/Voiceless Speech Time (V/VL), Vowel/Consonant Duration Ratio (V/C), and Word and Phrase Duration (WPD) Vectors on an Individual Basis, N = 40.

Vectors    Pretest Classifications      Identifications
           Posterior    Jackknifed
TED        100.0%       82.5%           25.0%
V/VL        57.5        37.5            25.0
V/C         22.5         7.5             5.0
WPD          0.0         0.0             0.0

The second type of pretest used was a "jackknifing" method. That is, each sample, in turn, was eliminated from the discriminant computations and a reference set formed from the remaining samples. Each sample then was classified according to the particular reference set developed when it was removed. This pretesting procedure allowed all of a given individual's speech samples, taken in a complete set, to be evaluated for their identification capabilities.

When the jackknifing method of speaker selection was utilized, consistently lower classification scores were recorded. However, the same general trends exhibited by the posterior pretesting method also were found when this second procedure was employed. That is, a classification score of 82.5 per cent correct was attained by the TED vector. The V/VL and the V/C vectors yielded scores of 37.5 per cent and 7.5 per cent, respectively; the WPD again did not result in any correct speaker selections.

Finally, identifications were carried out in the following manner. The first speech sample for each talker was chosen as the test sample and the remaining samples were utilized as the reference set. The first sample was chosen as the test because it is believed to represent the most variable portion of the entire speech sample. In

the identifications, the TED and V/VL vectors both resulted in a 25 per cent correct level of speaker selection. Utilizing the V/C vector, only 5 per cent of the speakers were correctly identified, and the WPD, as would be expected from the pretests, yielded no correct identifications at all.

The second phase of this experiment consisted of testing the identification abilities of the temporal vectors in all possible combinations. Identification scores of the two-vector combinations may be found in Table 3. As can be seen from the table, the same general trends were found as when the vectors were tested singly. Specifically, it should be noted that the TED and V/VL combination yielded the highest score in the identification task (40 per cent correct). Also, it is apparent that the addition of V/C and WPD to either of the other two vectors did not increase the identification rates appreciably.

The three- and four-vector combinations may be found in Table 4. The speaker recognition scores for this group of vector combinations appear to have resulted in essentially diminishing returns. That is, the scores did not change substantially from one vector grouping to another. This plateau effect can be noted especially with respect to the identification task. The combinations of TED, V/VL, and V/C and the TED, V/C,

Table 3. Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the TED, V/VL, V/C, and WPD Vectors in All Possible Pairs, N = 40.

Vectors        Pretest Classification       Identifications
               Posterior    Jackknifed
TED X V/VL     100.0%       80.0%           40.0%
TED X V/C       95.0        50.0            27.5
TED X WPD      100.0        82.5            25.0
V/VL X V/C      82.5        45.0            30.0
V/VL X WPD      57.5        37.5            25.0
V/C X WPD       37.5        17.5             5.0

Table 4. Pretest and Identification Scores for the Laboratory-Normal Experiment Obtained by Utilizing the TED, V/VL, V/C, and WPD Vectors in All Three- and Four-Vector Combinations, N = 40.

Vectors                    Pretest Classification       Identifications
                           Posterior    Jackknifed
TED X V/VL X V/C           100.0%       72.5%           47.5%
TED X V/VL X WPD           100.0        80.0            42.5
TED X V/C X WPD            100.0        65.0            47.5
V/VL X V/C X WPD            87.5        47.5            35.0
TED X V/VL X V/C X WPD     100.0        72.5            45.0

and WPD both yielded 47.5 per cent correct identification scores, whereas the TED, V/VL, and WPD combination and the four-vector combination yielded identification scores of 42.5 per cent and 45 per cent, respectively.

Discussion

Single Vector Effectiveness

Based on the results of the laboratory-normal experiment, it would appear possible to consider at least some of the basic questions asked by this research. For example, of the several selected temporal parameter groups, the time-energy distribution (TED) vector apparently contained the most idiosyncratic characteristics, at least for these normal or "ideal" recording conditions. This judgement is based on the fact that application of the TED resulted in the highest correct classifications being made. Utilizing correct speaker selection as the judgement criterion, the voiced/voiceless speech-time (V/VL) vector, vowel/consonant duration ratio (V/C) vector, and the word and phrase duration (WPD) vector followed TED in a decreasing order of effectiveness.

There are several possible explanations for this ranking of the temporal vectors. The high identification

scores for the TED vector may well have been predicted since it contained the largest number of parameters (40) and, therefore, it possibly contained more information about the talker's speech characteristics than did any of the other vectors investigated in this research. In addition, previous research has demonstrated the possible effectiveness of a time-energy vector. For example, in 1963 Pruzansky examined a time-energy distribution similar to the one utilized in this study. She found that, of her ten speakers, about half were correctly identified. Also, spectral information may be considered to be a frequency counterpart of the TED vector, and spectral analysis has been shown to exhibit a number of speaker-dependent properties. That is, while it does not necessarily follow that the work of Majewski and Hollien (1974), Hollien and Majewski (1977), Doherty and Hollien (1978), and others would predict the relative success of the TED vector, their results do imply that an attempt to utilize the time-energy information may be warranted.

The voiced/voiceless speech-time vector (containing only two parameters) also demonstrated identification scores that were sometimes equal in magnitude to those of the TED vector. These identification levels possibly could

have been predicted on the basis of studies that examined the function of phonation in the speaker identification process (see, for example, LaRiviere, 1975; Atal, 1972; Wolf, 1972; Sambur, 1975; Doherty, 1976; and Doherty and Hollien, 1978). The results of these investigations seem to support the supposition that the level of phonation or vocal activity plays a role in the speaker identification task.

In direct contrast to the TED and V/VL vectors, the application of the vowel/consonant duration ratio and the word and phrase duration vectors resulted in very low identification scores. These low scores may be due, in part, to the method by which they were obtained. Specifically, hand measurements made on sound spectrograms were used and therefore lent themselves to imprecision. Also, the number of samples and type of samples were severely limited. Even though the V/C and WPD vectors did not produce high identification scores, they should not be discounted totally for future studies as not having potential as speaker-identifying features. Indeed, the speaker identification literature supports the importance of analysis of vowels and consonants in the identification task (for example, see Bricker and Pruzansky, 1966; Stevens

et al., 1968; Illes, 1972; LaRivere, 1974; Wolf, 1972; Goldstein, 1976; and others).

Multiple Vector Effectiveness

The results of the paired vectors appear to exhibit the same general trends as in the case of the findings of the single vectors. However, most of the vector pairs exhibited at least slight increases in their identification levels on both the pretest and identification tasks. As would be expected (from examining the single vector scores), the TED-V/VL vector combination resulted in the highest identifications for any paired vectors. This finding supports the assumption made earlier that the TED and V/VL vectors contain more idiosyncratic (talker) characteristics than do either of the remaining two vectors (i.e., V/C or WPD).

The high levels for the TED-V/VL combination also suggest that these two vectors are, for the most part, sampling different types of information. That is, if the information contained in the vectors was redundant, there should not have been an increase in the combined identification score. On the other hand, if the information was mutually exclusive, it would be expected that the combined identification rates would be about equal to the sum of

the single vector identification levels. In the case of the TED-WL vector pair, the individual scores were 25 per cent correct identification each, but the score when they were paired was 40 per cent correct. Thus, the paired scores support the suggestion that these two vectors sample different speaker-dependent characteristics. At this stage in the research it is impossible to know exactly what characteristics are being measured. However, some speculation as to the composition of this speaker information is possible. Since the TED vector is a temporal correlate of the speech spectra, it is possible that some of the information being measured by the TED vector is spectral in nature. On the other hand, the WL vector deals with the level of phonation. Therefore, this vector most likely contains only fundamental frequency information.

Further examination of the paired vectors demonstrates that when either the V/C or WPD vectors are added to the other vectors (or to each other), the results provide little or no improvement in identification level. However, there is an interesting relationship which occurs when the V/C vector is paired with either the TED or WL vectors. When the TED-V/C combination was utilized, the pretest scores were lower than for the TED alone; thus, the overall

performance of the TED vector was, in a sense, degraded. This finding would suggest that the information being sampled by V/C was enhancing speaker similarities and that the procedures utilized in this research were insufficiently sensitive to separate and use the speaker-dependent information. On the other hand, the pairing of the WL and V/C vectors resulted in some improvement in the identification levels of both the pretest and identification tasks. Therefore, the results of the WL-V/C vector combination suggest that the data being sampled are significantly different and thus speaker identification is enhanced.

The pretest classification and identification scores reached a peak of "effectiveness" in the three- and four-vector combinations. However, these increases were not of the same magnitude as those achieved when vectors were paired. An explanation for this relationship may be found in the parameter selection procedure. For example, once the TED and WL vectors have been included in a vector combination, most of the parameters available for identification are accounted for. Therefore, addition of the remaining vectors contributes little in the way of new parameters. Hence, very little improvement in speaker identification levels can be expected.
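
The reasoning applied to the paired vectors (redundant information should give no gain, whereas mutually exclusive information should give a roughly additive gain) can be illustrated with a short calculation. The following is only a sketch of that reasoning and was not part of the original analysis; the helper name `pairing_gain` and the two bounds it uses are assumptions introduced here for illustration.

```python
def pairing_gain(score_a, score_b, score_pair):
    """Place a paired score between the redundant and the additive case.

    Hypothetical helper (not from the source). The lower bound assumes
    fully redundant information (no gain over the better single vector);
    the upper bound assumes mutually exclusive information (the single
    scores simply add, capped at 100 per cent).
    """
    redundant = max(score_a, score_b)
    additive = min(score_a + score_b, 100.0)
    if additive == redundant:
        return None  # no room for improvement, so the fraction is undefined
    return (score_pair - redundant) / (additive - redundant)


# TED and WL each scored 25 per cent alone and 40 per cent when paired.
fraction = pairing_gain(25.0, 25.0, 40.0)
print(f"fraction of the additive gain realized: {fraction:.2f}")
```

Under these assumptions, a value near 0 would correspond to redundant vectors and a value near 1 to vectors sampling non-overlapping information; the TED-WL pair falls between the two.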

Parameter Selection

As stated previously, from two to 40 separate parameters were entered into the discrimination process. However, a statistical selection procedure was utilized in order to maximize the correct identification scores and to reduce the number of parameters needed for identification. That is, a stepwise statistical method of parameter inclusion and exclusion was carried out during the discriminant analysis. The determination of predictiveness for each parameter and/or group of parameters was carried out utilizing an F-statistic with a computation formula that may be found in Forsythe et al. (1973). In accordance with this system of parameter selection, only the most predictive measurements were entered into the discrimination process.

In this first experiment, the 40 parameters of the time-energy distribution vector were reduced to only 17 usable parameters. However, both of the WL parameters, voiced speech time and voiceless speech time, were utilized. Of the four vowel/consonant ratios, only two met the sample requirements of discriminant analysis: the V/C ratio of "good" and the V/C ratio of "not." Of these two parameters, only the V/C ratio of "not" was of any functional value as a predictive element in the identification process. As a

single vector, the word and phrase duration (WPD) vector yielded no usable parameters.

Further, in the parameter selection process of the multiple vectors, some variables were added to the vector group while others, formerly included, were dropped. In the larger vector (TED), some of the parameters utilized in the single pattern were dropped when the parameters of the WL vector were included. However, the inclusion of the V/C and WPD vectors resulted in no change in the TED parameter selection. Generally, only one of the V/C parameters was included in multiple-vector groups. This relationship may have been caused by the type of information being measured in the various parameters. In this case, certain parameters of the vectors duplicate the data measured by parameters of other vectors. Therefore, some of the parameters have reduced predictive power and are dropped.

Test Sample Selection

As should be readily apparent from the results in Tables 2-4, the method of classification plays an important role in determining the magnitude of the pretest or identification scores. In all cases, the posterior classification

procedure facilitated the highest percentage of correct subject selection, and there appear to be two explanations for these higher scores. First, in this case the test sample was utilized in the computation of the reference discriminant functions. Consequently, the classification procedure is biased toward the selection of the correct reference set. Second, by utilizing each of the four samples (in turn) as test samples, four opportunities for recognizing the correct reference set were allowed. Therefore, this procedure makes the chances of matching the correct test and reference much greater than if only one sample were utilized.

The jackknifed classification method demonstrated the next highest correct classification levels. As stated earlier, this method did not utilize the test sample within the reference set computations, and this difference may account for the decreased scores relative to the posterior classifications. However, the jackknifed classifications were better than those for the identification task. The most obvious explanation for this finding is that in the jackknife pretest all four samples were used as tests, whereas the identification task utilized only one of the speech samples as the test.
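
The difference between the posterior and jackknifed pretests can be sketched in a few lines. The example below is an illustration under stated assumptions, not the procedure used in this research: it swaps the discriminant functions for a much simpler nearest-centroid rule, the function names are hypothetical, and the two-parameter samples are invented. It only shows why including the test sample in its own reference set tends to raise the score.

```python
import math


def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def nearest(sample, references):
    """Label of the reference centroid closest to the sample (Euclidean)."""
    return min(references, key=lambda label: math.dist(sample, references[label]))


def posterior_score(data):
    """Posterior pretest: each test sample also helps build its own reference."""
    refs = {spk: centroid(rows) for spk, rows in data.items()}
    hits = sum(nearest(s, refs) == spk for spk, rows in data.items() for s in rows)
    return 100.0 * hits / sum(len(rows) for rows in data.values())


def jackknifed_score(data):
    """Jackknifed pretest: the test sample is removed before the reference is built."""
    hits = total = 0
    for spk, rows in data.items():
        for i, s in enumerate(rows):
            refs = {k: centroid(v) for k, v in data.items() if k != spk}
            refs[spk] = centroid(rows[:i] + rows[i + 1:])
            hits += nearest(s, refs) == spk
            total += 1
    return 100.0 * hits / total


# Invented two-parameter samples, four per speaker (not data from this study).
data = {
    "A": [[0.0, 1.0], [0.1, 1.1], [-0.1, 0.9], [1.5, 1.0]],
    "B": [[2.8, 1.0], [2.9, 1.1], [2.7, 0.9], [2.8, 1.0]],
}
print(posterior_score(data), jackknifed_score(data))
```

In this invented data set the atypical fourth sample of speaker A pulls its own reference toward itself in the posterior pretest and is classified correctly, but it is misclassified once it is withheld from the reference in the jackknifed pretest.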

The identification task was the most important test of the vectors' speaker identification effectiveness. The pretest classifications serve the purpose of demonstrating which parameter groups have the greatest potential as speaker-identifiers. However, they are only usable in closed set tests (when the criminal is known). Therefore, the identification task must be considered the core test of this research. This approach utilized only the first of the four speech samples as the test sample. Hence, classifications in this case were based solely on the information being sampled by that single speech sample.

A second identification task was investigated utilizing the fourth speech sample as the test. This procedure was carried out in order to test the consistency of the first identification task. The vector effectiveness resulting from the second identification task followed more closely the trends set in the posterior and jackknifed procedures. Specifically, the TED vector resulted in the highest identification score (52.0 per cent correct). The WL vector followed with a considerably lower identification score of 12.5 per cent correct, and both the V/C and WPD vectors attained the same low scores that resulted from the first identification task. This set of findings seems to

indicate that the identification effectiveness of a vector is not consistent throughout the four speech samples. That is, for the TED vector, the first sample was less idiosyncratic of a given speaker and therefore less recognizable than was the fourth speech sample. However, for the WL vector, the first sample was more idiosyncratic of a particular speaker than was the fourth sample. In addition, while the identification tasks attempt to model more closely the forensic model, this fourth sample identification task suggests that the pretest procedures (especially the jackknifed method) provide a better indication of the relative identification capabilities of these selected temporal vectors.

Finally, the identification task demonstrates that the more restrictive the test sample selection procedure becomes, the more difficult the identification of the correct speaker becomes. Nevertheless, it also should be noted that the type of classification method utilized to make the speaker discriminations did not change the vector effectiveness ranking among the temporal vectors.

In summary, the following conclusions may be stated with respect to the selected temporal vectors and their speaker identification capabilities.

1. Under the constraints of this experiment, the time-energy distribution and the voiced/voiceless speech time vectors (and to a lesser degree the vowel/consonant duration ratio) appear to exhibit idiosyncratic speaker identifying characteristics.
   a. The V/C and WPD vectors, as defined and measured in this study, do not function well as cues to speaker identification.
2. The TED and WL vectors appear to contain distinctly different speaker related information.
3. The V/C vector enhances the performance of the WL vector while it seems to degrade the speaker identifying abilities of the TED vector.

CHAPTER IV
THE RESULTS AND DISCUSSION OF THE LABORATORY-DISTORTED SPEECH EXPERIMENT

The second experiment undertaken in this program of research was carried out in order to test the speaker identification capabilities of the selected temporal parameters with respect to speaker distortions of a specific type. The speakers (20 adult males) were recorded under the same conditions as in the first experiment. However, three experimental speaking situations were imposed upon the talkers. As detailed earlier, these speaking conditions were: (1) normal, (2) stress, and (3) disguise. The results of this second experiment may be found below.

Results

Normal Speaking Condition

In a manner similar to that utilized in the first experiment, three test selection procedures were used to examine the identification capabilities of the selected temporal measures. Two pretesting methods, posterior and

jackknifed classification, again were used to investigate the intra- and inter-sample variability. The third approach was that of speaker identification; this third method was utilized to simulate the forensic model. The results of these three experimental procedures may be found in Table 5. As may be seen from these results, the time-energy distribution (TED) vector, when applied in isolation, produced the highest pretest and identification scores. This vector correctly classified all twenty speakers in the posterior procedure, 95 per cent of the speakers in the jackknifed pretest, and 60 per cent of the speakers utilizing the identification task. Following in vector effectiveness (with respect to the speaker's identity) were the voiced/voiceless speech time (WL) parameters. In this case, the correct speaker classification rates were 65 per cent and 40 per cent in the two pretesting methods but only 7.5 per cent in the identification method. Application of the vowel/consonant duration ratio (V/C) and the word and phrase duration (WPD) vectors resulted in no correct classifications or identifications at all.

By further examination of Table 5, it may be seen that the classification and identification scores of the combined vectors reveal several relationships; that is, those that

Table 5. Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the Time-Energy Distribution (TED), Voiced/Voiceless Speech Time (WL), Vowel/Consonant Duration Ratio (V/C) and Word and Phrase Duration (WPD) Vectors in the Normal Speaking Condition: N = 20; All Scores Are in Percentages.

                                Pretest Classifications
Vectors                         Posterior    Jackknifed    Identifications

A. Single Vectors
   TED                            100.0         95.0            60.0
   WL                              65.0         40.0             7.5
   V/C                              0.0          0.0             0.0
   WPD                              0.0          0.0             0.0

B. Paired Vectors
   TED X WL                       100.0         95.0            55.0
   TED X V/C                      100.0         95.0            60.0
   TED X WPD                      100.0         95.0            60.0
   WL X V/C                        65.0         40.0            15.0
   WL X WPD                        80.0         40.0            20.0
   V/C X WPD                        0.0          0.0             0.0

C. Three-Vector Combinations
   TED X WL X V/C                 100.0         95.0            55.0
   TED X WL X WPD                 100.0         90.0            50.0
   TED X V/C X WPD                100.0         95.0            60.0
   WL X V/C X WPD                  65.0         40.0            15.0

D. Four-Vector Combinations
   TED X WL X V/C X WPD           100.0         90.0            55.0

could be predicted from the examination of the single vectors. Moreover, it should be noted that the effects of the other vectors when combined with the TED vector resulted in little or no improvement in the overall pretest or identification scores. This finding resulted from the fact that in the posterior classification procedure, all the speakers were correctly identified; therefore no improvement was possible. In the case of the jackknifed and identification procedures, no improvement was observed because few new parameters were included in the vector groupings.

Stress Speaking Condition

In the case of the speaking condition during which subjects were stressed, the posterior classification procedure was eliminated. This classification procedure was judged inappropriate since the stressed speaking samples were compared only with the normal samples. Therefore, only the jackknifed pretest and identification tasks could be carried out. The results of this set of tests may be found in Table 6. Under this stress condition, the TED vector, again, yielded the highest levels of speaker classification and identification (70 per cent and 40 per cent correct). However, the WL vector did almost as well as TED in the jackknifed method,

Table 6. Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the TED, WL, V/C, and WPD Vectors in the Stress Speaking Condition: N = 20; All Scores Are in Percentages.

                                Pretest Classifications
Vectors                         Jackknifed                 Identifications

A. Single Vectors
   TED                             70.0                         40.0
   WL                              65.0                         20.0
   V/C                              0.0                          0.0
   WPD                              0.0                          0.0

B. Paired Vectors
   TED X WL                        65.0                         30.0
   TED X V/C                       70.0                         40.0
   TED X WPD                       70.0                         40.0
   WL X V/C                        65.0                         20.0
   WL X WPD                        35.0                         15.0
   V/C X WPD                        0.0                          0.0

C. Three-Vector Combinations
   TED X WL X V/C                  65.0                         30.0
   TED X WL X WPD                  65.0                         35.0
   TED X V/C X WPD                 70.0                         40.0
   WL X V/C X WPD                  35.0                         20.0

D. Four-Vector Combinations
   TED X WL X V/C X WPD            65.0                         30.0

65 per cent. On the other hand, the identification rate of the WL vector was 20 per cent correct, considerably lower than that achieved by the TED. The remaining two vectors, V/C and WPD, did not correctly classify any of the talkers when these vectors were used in isolation.

The multiple vectors produced no real improvement in the classification or identification of the stress talkers over those of the single vectors. The TED X WL vector combination achieved the same jackknifed score (65 per cent) as that of the WL vector and only a slightly better identification score (30 per cent). Combinations involving the V/C and WPD vectors resulted in no improvement, and in some cases their effects tended to degrade the level of correct identification. For example, the WL X WPD vector combination produced a jackknifed classification of 35 per cent correct and an identification rate of 15 per cent; both of these scores were lower than those of the WL vector utilized singly. In the case of the four-vector combination, the jackknifed pretest resulted in only 65 per cent correct. This score is 5 per cent lower than the score attained when the TED vector was utilized in isolation.

Disguised Speaking Condition

The results of the disguise condition of the second experiment may be seen in Table 7. From this table it can be seen that only the jackknifed and identification procedures were utilized. The posterior classifications could not be carried out because all classifications of this speaking condition were done by comparing the disguised samples to the normal samples.

As with the two preceding conditions, the TED vector for this speaking condition resulted in the highest set of scores. In the jackknifed pretest, 45 per cent of the disguised voices were correctly matched to the talkers who produced them; correct identifications were made in 30 per cent of the cases. The WL vector achieved 35 per cent correct disguise-to-normal matches, but in the identification task only 5 per cent of the voices were correctly identified. Also, as was seen from the previous speaking conditions, the V/C and WPD vectors were unable to produce any correct classifications or identifications.

Combining the vectors resulted in some classification and identification improvement for these disguised conditions. Specifically, the TED X WL vector combination correctly classified the disguised voices 60 per cent of the time;

Table 7. Pretest and Identification Scores of the Laboratory-Distorted Speech Experiment Obtained Utilizing the TED, WL, V/C, and WPD Vectors in the Disguised Speaking Condition: N = 20; All Scores Are in Percentages.

                                Pretest Classifications
Vectors                         Jackknifed                 Identifications

A. Single Vectors
   TED                             45.0                         30.0
   WL                              35.0                          5.0
   V/C                              0.0                          0.0
   WPD                              0.0                          0.0

B. Paired Vectors
   TED X WL                        60.0                         40.0
   TED X V/C                       45.0                         30.0
   TED X WPD                       45.0                         30.0
   WL X V/C                        35.0                          5.0
   WL X WPD                        35.0                         15.0
   V/C X WPD                        0.0                          0.0

C. Three-Vector Combinations
   TED X WL X V/C                  60.0                         40.0
   TED X WL X WPD                  60.0                         35.0
   TED X V/C X WPD                 45.0                         30.0
   WL X V/C X WPD                  35.0                          5.0

D. Four-Vector Combinations
   TED X WL X V/C X WPD            60.0                         40.0

this combination also resulted in 40 per cent correct in the identification task. All other vector combinations resulted in scores which were representative of the vector with the highest single vector score. For example, vector combinations which contain the TED parameters (except TED X WL) resulted in a set of scores which were identical with those of the TED used singly. This same relationship is found for combinations containing WL and TED X WL parameters.

Discussion

Normal Speaking Condition

It will be remembered that in this first procedure the classifications are made by comparing the normal speaking samples to the normals. The results of the first speaking condition of this second experiment reaffirm many of the findings of the initial experiment. That is, the time-energy distribution (TED) vector appeared to be the most effective predictor of a talker's identity, at least for the vectors investigated in this research. It also is demonstrated that the voiced/voiceless speech time (WL) vector was ranked second to TED in speaker identification capabilities, whereas the remaining two vectors (vowel/consonant duration ratio

and word and phrase durations) showed little speaker identification power.

The high levels of identification found for this normal condition might have been expected on the basis of the results of the initial experiment. In that regard, previous researchers have shown that "TED-like" information is a reasonable predictor of a talker's identity (Majewski and Hollien, 1974; Doherty and Hollien, 1978; and others). For example, Pruzansky (1963) reported that a time-energy distribution was effective as a speaker identification cue. Therefore, it may be assumed that a vector such as TED contains idiosyncratic speech characteristics which would permit identification of a speaker from his voice alone. Also, the higher identification scores provided by the TED vector, in relation to the other vectors, may be due, at least in part, to the large number of parameters (40) available for the identification process. Thus, a great deal of speaker-dependent material apparently is utilized in this classification and identification technique.

The WL vector also demonstrated a modest level of effectiveness. This finding also was shown in the first experiment. In addition, previous investigation has shown vocal activity to play a role in the speaker identification

process (see, for example, LaRiviere, 1975; Atal, 1972; Wolf, 1972; Sambur, 1975; Doherty, 1976; and Doherty and Hollien, 1978). Therefore, it would appear that the WL vector measures certain invariant speaker identification features and this vector may be a viable tool in a speaker identification system.

Stress Speaking Condition

This experimental condition attempts to discover if stressful speaking situations reduce the speaker identification capabilities of the selected temporal parameters. The results of this experiment demonstrated that the type of stress utilized in this investigation had the effect of lowering the level of identification. For example, the TED vector yielded classification scores reduced by 25 per cent (95 per cent vs. 70 per cent) and identification scores reduced by 20 per cent (60 per cent vs. 40 per cent). These two relationships suggest that the features being measured by the TED vector are altered or varied when a speaker is placed in a stressful situation. However, it should be remembered that the stress speech samples are being compared to the normal speech samples. Thus, the samples were non-contemporary.
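
The reductions quoted for the TED vector are differences between two percentage scores, that is, percentage points. As a small illustration (the helper `reduction` is hypothetical and the relative figures it prints are not reported in the source), the same drops can also be expressed relative to the normal-condition score:

```python
def reduction(normal, distorted):
    """Return (percentage-point drop, drop relative to the normal score in per cent)."""
    points = normal - distorted
    return points, 100.0 * points / normal


# TED scores reported for the normal and stress conditions (Tables 5 and 6).
for task, normal, stress in [("jackknifed", 95.0, 70.0), ("identification", 60.0, 40.0)]:
    points, relative = reduction(normal, stress)
    print(f"{task}: {points:.1f} points, {relative:.1f} per cent relative")
```

Read this way, the identification task loses fewer points than the jackknifed pretest but a larger share of its normal-condition score.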

Non-contemporariness of speech samples has been demonstrated to be very deleterious to speaker identification (for example, see McGehee, 1937; Rothman, 1977; and Tosi et al., 1972). Therefore, the lack of contemporariness of the speech samples must be considered as a contributing factor to the reduced identification rates.

The jackknifed classification pretest scores obtained utilizing the WL vector demonstrated trends similar to TED. That is, the level of correct speaker classification was reduced comparing the normal and stress conditions. However, the WL vector yielded improved scores under the stress speaking condition. A possible explanation for these rather confusing findings may be found in examining the physiological mechanisms which are affected under certain stressful conditions.

When the combination vectors were evaluated, little or no improvement in the classification or identification scores was demonstrated. As a possible explanation for this finding, it would appear that not enough new information was being added to the vector combinations. In the first experiment, it was found that the addition of more variables improved, to varying degrees, the overall performance of the vectors. It is possible in this second experiment that

there may not be a large enough population to express the small changes which may be occurring.

Previous research on the relationship between stressful speaking conditions and speaker identification has been very limited. However, a few studies have been completed which demonstrated that stress speaking conditions do have some degrading effects upon the speaker identification processes. For example, Hollien and Majewski (1977) found that, under stress, their speaker identification rates utilizing long-term speech spectra were reduced by 8 per cent to 20 per cent. Further, in a similar LTS study, Doherty and Hollien (1978) report reductions in their identification system for the stress condition. These reductions were in the order of 5 per cent to 20 per cent contrasted to the normal condition.

To date no investigators have examined the same parameters as this research under stressed speaking conditions. However, the findings of this experiment are in general agreement with those limited studies completed. Therefore, it seems certain that stress reduces the reliability of some speaker identification systems. These findings and those of other investigators demonstrate that any speaker identification system must be tested with respect to stressful speaking conditions. Such testing of systems helps

predict the robustness of these procedures in real-world situations.

Disguised Speaking Condition

The results of this speaking condition demonstrate that, while voice disguise does interfere with and lower classification scores, some scores were recorded which were substantially above chance. This finding demonstrates that the speaker identification features measured by the TED vector, and to a lesser extent the WL vector, are still present and effective even when the speaker attempts to disguise his voice. Therefore, these two vectors are functional in a speaker identification system. Furthermore, comparing the results of this disguised speaking condition with those of other investigators shows that the TED X WL combination yielded the highest set of scores (Hollien et al., 1974; Hollien and McGlone, 1976; Houlihan, 1977; Hollien and Majewski, 1977; Doherty and Hollien, 1978; and Reich et al., 1976). Therefore, it would appear that the speaker is not altering the features being measured by the TED and WL vectors to the same degree as other such features (speaking fundamental frequency and long-term speech spectra).

Further investigation of the combination vector, TED X WL, shows an increase of 15 per cent compared to the single vectors. This finding is in general agreement with those of the initial experiment. Therefore, the explanations for these results may be similar. That is, the TED and WL vectors are, for the most part, sampling different types of speaker-dependent information. This explanation is reasonable since it appears that the TED vector is measuring characteristics which may be considered a temporal counterpart to spectral information, whereas the WL parameters are a direct result of vocal activity. Thus, these two parameter groups should measure different speech characteristics.

Parameter Selection Techniques

The experimental selection of parameters for this second investigation is the same as that utilized in the initial experiment. As stated previously, a statistical procedure was utilized in order to maximize the level of correct speaker identification and to reduce the number of parameters needed to make the identifications. In this case, a stepwise statistical technique of parameter inclusion was carried out during the discriminant analysis.
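
The idea of admitting parameters according to an F-statistic can be sketched as follows. This is a deliberately simplified illustration and not the computation used in this research (that formula is given in Forsythe et al., 1973, and is not reproduced here): it computes an ordinary one-way analysis-of-variance F ratio for each parameter separately, and the function names, the entry threshold of 4.0, and the measurements are all assumptions made for the example.

```python
def f_statistic(groups):
    """One-way ANOVA F ratio for one parameter; groups holds one list per speaker."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (between / (k - 1)) / (within / (n - k))


def select_parameters(table, f_to_enter=4.0):
    """Keep parameters whose F ratio reaches the (assumed) entry threshold."""
    scores = {name: f_statistic(groups) for name, groups in table.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = [name for name in ranked if scores[name] >= f_to_enter]
    return kept, scores


# Invented measurements: two speakers, four samples each, two parameters.
table = {
    "p1": [[1.0, 1.1, 0.9, 1.0], [2.0, 2.1, 1.9, 2.0]],  # separates the speakers
    "p2": [[1.0, 2.0, 1.5, 1.2], [1.1, 1.9, 1.4, 1.3]],  # speakers overlap
}
kept, scores = select_parameters(table)
print(kept)
```

A genuinely stepwise procedure would re-evaluate every parameter after each entry, so that a parameter already included can later be removed when another parameter carrying similar information is admitted; the sketch omits that step.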

A determination of predictiveness for each parameter and/or group of parameters was conducted utilizing an F-statistic with a computation formula that may be found in Forsythe et al. (1973). In accordance with this system of parameter selection, only the most predictive measurements were entered into the discrimination process.

As was seen in the first experiment, only a small number of the total available variables were utilized in the actual discrimination process. Of the 40 TED parameters, 19 produced sufficiently high F-values to contribute to the identification process. As a point of interest, the parameters selected in the laboratory-distorted speech experiment were not the same as those chosen in the laboratory-normal experiment. This finding indicates that the set of parameters varies with the circumstances of the particular population and/or recording conditions. Therefore, all variables should be investigated in order to determine which set will prove the most efficient in the classification or identification procedure.

Examination of the vector combinations demonstrates that the addition of new variables may in some cases result in the exclusion of others. For example, when the WL vector was combined with the TED vector, four parameters

formerly included in the single TED vector were excluded from the combination vector of TED X WL. In this particular case the overall classification and identification scores remain relatively unchanged. Therefore, it can be concluded that total voiced speech time and total articulation time contained a small amount of redundant information. However, since only four parameters were removed, it also would appear that not very much TED information is replicated by the WL vector.

Test Sample Selection

The classification procedures utilized in this experiment (posterior, jackknifed, and identification) demonstrate the importance of test sample selection. As was seen in the first experiment, posterior classification of the speech samples often resulted in the highest set of scores. An explanation of this finding may be found in the method and number of test sample selections. This method involved all samples being used in the computation of the reference set. Also, all four samples were utilized, each in turn, as test samples, thus affording the best possible method of speaker selection.

The jackknifed method of speaker selection resulted in slightly lower classification scores; this finding is in agreement with the first experiment. That is, in both the first and second phases of this research project, the jackknifed procedure demonstrated lower levels of speaker identification than did the posterior classifications. The consistency of these pretesting methods lies in the fact that they are not dependent upon the type of information utilized. Rather, classification scores are affected primarily by the method by which the test sample was selected.

As was expected, the identification task yielded the lowest set of scores. As stated in an earlier section, this procedure utilized only the first of four speech samples as the test. Therefore, speaker selection for the identification task was based entirely on the information measured in the first sample. In order to test the reliability of this identification task, a second identification task was carried out (this was also done in experiment one). In this case the fourth speech sample was utilized as the test sample. The results of this task may be seen in Table 8. Comparing the first and second identification techniques indicates that the speaker predictiveness is not consistent through all the speech samples. That is, the

Table 8. Identification Scores for the Laboratory-Distorted Speech Experiment Obtained Utilizing the Fourth Speech Sample as the Test; Only the TED, WL, and TED X WL Vectors Were Examined; N = 20; All Scores Are in Percentages.

                        Identifications
Vectors         Normal      Stress      Disguise

TED              60.0        15.0         25.0
WL               10.0        35.0         10.0
TED X WL         65.0        25.0         25.0

first sample is not necessarily the most idiosyncratic of a given talker's speech. There also are indications that the vector effectiveness varies from sample to sample. Specifically, the TED vector demonstrated some decrease in identification level while the WL vector shows an increase. The findings appear to indicate that the selection of a test sample may be critical for certain speaker-dependent measures. In addition it should be noted that the more difficult the identification task becomes, the lower the identification scores become.

In examining the findings of this distorted speech experiment, the following conclusions may be stated.

1. Under all three speaking conditions, normal, stress, and disguise, the time-energy distribution (TED) vector and, to a lesser degree, the voiced/voiceless speech time (WL) vector are effective as speaker identification cues.
   A. The vowel/consonant duration ratio (V/C) and the word and phrase duration (WPD) vectors did not appear to function reasonably as predictors of a talker's identity.
   B. The findings of this experiment are in general agreement with those of the first experiment and those of previous researchers.
2. Stressful and disguised speaking conditions appear to have deleterious effects on the speaker identification abilities of the selected temporal parameters.

   A. The degrading effects reduced the levels of identification by varying degrees, depending on the sample selection and vector utilized in the test.
   B. The temporal parameters utilized in this experiment seem to be able to identify a disguised voice with a higher degree of accuracy than any previous set of vectors known to this author.

CHAPTER V
THE RESULTS AND DISCUSSION OF THE SEMI-FIELD EXPERIMENT

The final experiment of this research program constituted an attempt to test the speaker identification capabilities of the specified temporal parameters in a situation that would parallel the conditions found in the forensic model. As cited previously, a speaker (the unknown) simulated a "crime" over the telephone; the known exemplars were simulated by an interrogation procedure carried out later. In other words, an unknown caller was recorded over the telephone and the suspect pool was recorded during a reading session which was set up to simulate an interrogation.

The time-energy distribution (TED) and the voiced/voiceless speech time (WL) vectors were applied to these recordings for use in the identification process. The vowel/consonant duration ratio and the word and phrase duration vectors were not utilized in this experiment. Because these vectors yielded few, if any, correct speaker identifications in either of the two previous experiments, it was concluded that these vectors (V/C and WPD)

PAGE 112

are ineffectual as speaker identification cues. Moreover, the examples recorded were limited in word content and did not contain words and phrases that were repeated an adequate number of times. It should be noted that the unknown speaker also was recorded as one of the known exemplars; this procedure permitted the identifications to be made on a closed set.

Results

The simulated field situation presented some unique problems. Some of these restrictive conditions were important in determining the discrimination or identification procedure that could be applied to the data. For example, while the "unknown" call lasted over two minutes, the recordings of the "suspects" seldom were longer than 60 seconds in duration. Therefore, the "suspect" recordings could not be utilized to develop the needed reference sets. This problem was circumvented by using the unknown call as the reference set. That is, the tape recording of the criminal call was divided into four 30-second samples and these samples were used to generate the reference. In turn, the recordings made by the suspects were used as tests, and only one 30-second sample per talker was

PAGE 113

necessary for the process. Another problem existed because there was only one correct speaker selection, so the identification was a binary decision: it could be either right or wrong, 100 per cent or 0 per cent correct. Therefore, a ranking system was utilized that would demonstrate the vocal similarities between each of the known speakers and the unknown speaker. The results of this ranking method may be found in Table 9. As can be seen from the table, the TED vector ranked suspect No. 11 as the one most similar to the unknown from among the twelve possible suspects; the correct suspect was ranked sixth. Applying the WL vector to these data yielded only slightly better results. Again, suspect No. 11 was judged as most like the unknown; the correct choice, suspect No. 9, was ranked fifth. Combining the two vectors yielded results identical to those of the TED vector when used in isolation.

A second method of speaker selection also was investigated. In this subsequent set of tests, a cross-correlation procedure was utilized as the statistical technique rather than discriminant analysis. The algorithm used to generate the correlations was the same as that employed by Zalewski et al. (1975). The parameters selected for

PAGE 114

Table 9. Identification Ranking for the Semi-Field Experiment Obtained Utilizing Discriminant Analysis on the Time-Energy Distribution (TED) and the Voiced/Voiceless Speech Time (WL) Vectors, N = 12.

Vectors     Suspect Identified    Correct    Ranking of
            as the Unknown        Choice     Correct Choice
TED         11                    9          6
WL          11                    9          5
TED X WL    11                    9          6

PAGE 115

inclusion in the correlation procedure were chosen on the basis of the previous experiments. Of the 40 TED parameters, only a limited number (17) were entered into the statistical technique. In the case of the WL vector, both total articulation time and total voiced speech time were included in the correlation process. The results of this cross-correlation technique are enumerated in Table 10. Again, the relative speaker identification effectiveness of each vector is represented as a function of its suspect rankings. In this case, TED chose suspect No. 6 as the unknown; the correlation coefficient was .9628. However, the correct choice, suspect No. 9, was ranked second with a correlation coefficient of .9611. The WL vector selected the correct suspect as being most similar to the unknown speaker. The correlation coefficient in this case was 1.0000. Rankings based on the TED X WL vector combination resulted in suspect No. 11 being chosen as the unknown; suspect No. 9 (correct choice) was ranked fourth. The correlation coefficients for this vector grouping were .9391 and .8675, respectively.

PAGE 116

Table 10. Identification Rankings for the Semi-Field Experiment Obtained Utilizing Cross-Correlations on the TED and WL Vectors; the Correlation Coefficients Between the Known and Unknown Speakers Are Also Listed, N = 12.

Vectors     Suspect Identified    Correct       Ranking of
            as the Unknown        Choice        Correct Choice
TED         6 (.9628)             9 (.9611)     2
WL          9 (1.000)             9 (1.000)     1
TED X WL    11 (.9391)            9 (.8675)     4
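The ranking-by-correlation procedure summarized in Table 10 can be sketched as follows. This is a minimal illustration only: the text does not reproduce the algorithm of Zalewski et al. (1975), so a Pearson product-moment coefficient is assumed here, and the function names and all parameter values are hypothetical rather than taken from the experiment.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def rank_suspects(unknown, suspects):
    """Return (suspect_id, r) pairs sorted from most to least correlated."""
    scored = [(sid, pearson(unknown, vec)) for sid, vec in suspects.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def rank_of(correct_id, ranking):
    """1-based position of the correct suspect within a ranking."""
    return [sid for sid, _ in ranking].index(correct_id) + 1


# Hypothetical five-parameter vectors (NOT the experimental data).
unknown = [0.10, 0.25, 0.40, 0.15, 0.10]
suspects = {
    6: [0.12, 0.24, 0.38, 0.16, 0.10],
    9: [0.11, 0.26, 0.41, 0.13, 0.09],
    11: [0.30, 0.20, 0.15, 0.20, 0.15],
}
ranking = rank_suspects(unknown, suspects)

# Caveat under the Pearson assumption: with only two parameters per vector,
# the coefficient can only be +1 or -1, whatever the values are.
two_point_r = pearson([3.2, 1.1], [2.9, 1.4])
```

With these invented values, suspect 6 narrowly outranks suspect 9, which mirrors the kind of near-tie reported for the TED vector. If the coefficient actually used was of the Pearson type, the two-point caveat may be relevant to the 1.0000 reported for the two-parameter WL vector; the source does not say.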

PAGE 117

Discussion

This third and final experiment provided the greatest challenge to the speaker identification capabilities of the selected temporal parameters. In the first two sets of experiments, discriminant analysis was utilized to test the power of these vectors as cues for speaker identification. Hence, this approach also was employed in this, the third, experiment. However, the vectors provided only very low levels of correct identification. A possible explanation for these low scores may be found, at least in part, in the method of parameter selection. Generally, in discriminant analysis the number of parameters included in the discrimination process may not exceed one-half the number of reference samples. In this particular case, only the four speech samples of the unknown speaker were utilized to generate the reference set. Therefore, only two parameters could be included from among the vectors. It would appear from the findings that the inclusion of only a few parameters did not provide enough speaker-dependent information to match the correct known speaker to the unknown. In an attempt to force the inclusion of more variables, a cross-correlation procedure was applied to the

PAGE 118

data. This method was selected because it permits the application of any number of parameters to the statistical process. As was seen in Table 10, the second method of speaker selection resulted in improved levels of identification. Possibly this finding is a direct result of the addition of more speech characteristics to the identification process, an explanation which seems plausible because the cross-correlation method permits more speech parameters to be utilized (17 vs. 2 parameters). It also should be noted that the correlation coefficients produced using the TED vector for the two highest-ranked suspects (Nos. 6 and 9) were only .0017 apart. Therefore, these two suspects were about equally correlated to the unknown speaker. This finding demonstrates that some ambiguity does exist between the suspects' and the unknown speaker's speech patterns. These results may also indicate that cross-correlations are not sensitive to some of the more subtle talker variations.

It is difficult to compare the results obtained for this semi-field experiment with those provided by other research. Few studies have been carried out investigating speaker identification within the forensic model. However, one such study has been reported by Johnson, Hollien, and Doherty (1977). These authors, using this same data base,

PAGE 119

employed long-term speech spectra as a speaker identification cue. The data provided by this study demonstrated very low levels of speaker identification. Johnson et al. concluded that most of the degradation produced was due to the poor recording conditions utilized in the gathering of the speech samples. Therefore, it may be assumed that these same poor recording conditions contributed, at least in part, to the low levels of identification reported in the present experiment.

To summarize the results of this simulated field experiment, the following statements may be made.

1. The TED and WL vectors demonstrated relevance to the speaker identification task. The power levels of these vectors were not high, but some relevance was found even under these simulated field situations.
2. When very small numbers of samples are available, discriminant analysis may not yield the highest possible levels of speaker identification.
3. It appears that when very few controls are employed in gathering the data base, the levels of speaker identification may be reduced.
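The sample-size constraint behind the second statement can be worked through in a few lines. This is a sketch of the rule of thumb as stated in the Discussion (parameters may not exceed one-half the number of reference samples); rounding down for odd sample counts is an assumption, and the function names are hypothetical.

```python
def max_discriminant_parameters(n_reference_samples):
    """Upper bound on parameters under the one-half rule cited in the text.

    Floor division is an assumption for odd sample counts.
    """
    if n_reference_samples < 0:
        raise ValueError("sample count cannot be negative")
    return n_reference_samples // 2


def exceeds_limit(n_parameters, n_reference_samples):
    """True when a parameter set is too large for the reference set."""
    return n_parameters > max_discriminant_parameters(n_reference_samples)


# Four 30-second reference samples were cut from the unknown call.
semi_field_limit = max_discriminant_parameters(4)

# The 17 TED parameters used in the cross-correlation tests would not fit.
ted_subset_too_large = exceeds_limit(17, 4)
```

Under this rule the four reference samples admit only two parameters, which is why the 17-parameter TED subset could be applied only in the cross-correlation procedure.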

PAGE 120

CHAPTER VI
SUMMARY AND CONCLUSION

The basic purpose of this research project was to seek potential invariant characteristics within the temporal elements of an individual's speech which permit him to be identified from his voice alone. Generally, these temporal measurements included duration analysis of (1) relative energy at several levels of intensity (TED), (2) voiced and voiceless activity (WL), (3) vowel/consonant ratios (V/C), and (4) specific words and phrases (WPD). In more specific terms, the TED vector was based on a group (40) of time-energy measurements; this analysis reflected the total accumulated time a talker remains at a specific energy level. The WL vector was made up of two parameters which represented the total duration of voiced and voiceless activity during a speech sample. The third vector, V/C, was composed of the ratios of the duration of selected vowels to the duration of their consonantal environment. For the purposes of this vector, four specific words were utilized in the formation of the

PAGE 121

V/C ratios. Finally, the WPD vector consisted of the overall duration of several words and phrases. In this case four words and three phrases were chosen for use in the WPD vector.

The selected temporal vectors were investigated under a variety of speaking conditions. Three experiments were utilized to study these conditions: (1) laboratory-normal, (2) laboratory-distorted-speech, and (3) semi-field. In the initial experiment, the subjects (40 adult males) read speech material while being recorded in an "ideal" laboratory environment. The aim of this experiment was to develop baseline data on the selected temporal parameters. The findings in this case resulted in 100 per cent to 25 per cent correct identification scores for the TED vector. The WL vector yielded levels of identification in the 57 per cent to 25 per cent range. The remaining two vectors demonstrated very low identification power. The vector combinations yielded improved speaker identification scores. Specifically, the identification tasks showed the greatest increases. The TED X WL X V/C vector combination identified 19 of 40 speakers. This represents about a 50 per cent increase over any of the vectors when used in isolation.
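The TED and WL definitions summarized at the start of this chapter can be illustrated with a short sketch. It is not the measurement procedure used in the study: the per-frame level and voicing inputs, the frame length, the band edges, and the function names are all assumptions made for illustration, and three energy bands stand in for the 40 time-energy measurements.

```python
def time_energy_distribution(frame_levels_db, frame_sec, level_edges_db):
    """Accumulated time (seconds) spent within each energy band.

    Band i covers level_edges_db[i] <= level < level_edges_db[i + 1];
    frames outside every band are ignored.
    """
    counts = [0] * (len(level_edges_db) - 1)
    for level in frame_levels_db:
        for i in range(len(counts)):
            if level_edges_db[i] <= level < level_edges_db[i + 1]:
                counts[i] += 1
                break
    return [c * frame_sec for c in counts]


def voiced_voiceless_time(voicing_flags, frame_sec):
    """Total voiced and voiceless durations (seconds).

    Flags: True = voiced frame, False = voiceless frame, None = silence.
    """
    voiced = sum(1 for flag in voicing_flags if flag is True)
    voiceless = sum(1 for flag in voicing_flags if flag is False)
    return voiced * frame_sec, voiceless * frame_sec


# Hypothetical 10 ms frames (NOT data from the study).
levels_db = [52, 58, 61, 67, 63, 49, 71]
ted = time_energy_distribution(levels_db, 0.01, [50, 60, 70, 80])

flags = [True, True, False, None, True, False]
wl = voiced_voiceless_time(flags, 0.01)
```

Each element of `ted` is the total time the talker's signal stayed within one energy band, and `wl` holds the two duration totals that make up the WL vector as described above.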

PAGE 122

The purpose of the laboratory-distorted speech experiment was to examine the effects of speech distortion on the robustness of the temporal vectors. The speakers (20 adult males) were recorded under the same laboratory conditions as in the first experiment. However, in this case, three speaking conditions were imposed upon the speakers: normal, stress, and disguise. When the TED parameters were applied to the normal data, 100 per cent to 40 per cent of the subjects were correctly identified. Application of the WL vector resulted in correct identification scores of from 65 per cent to 7.5 per cent. The remaining two vectors (V/C and WPD) resulted in no correct identification at all. In the various vector combinations, the correct identification rates were 100 per cent to 15 per cent. The levels of identification obtained in the stress and disguise conditions were substantially reduced. However, the same trends which were exhibited in the normal speaking condition were found in both the stress and disguise data. In addition, the TED X WL vector combination correctly classified 40 per cent of the disguised voices to their normal counterparts.

The third and final experiment was included in this research program in order to test the speaker identification

PAGE 123

capabilities of the selected temporal parameters under conditions which would parallel more closely those found in the forensic model. In this case, a speaker simulated a "crime" over the telephone; later, a suspect pool was created by recordings made at a simulated interrogation. The TED vector under this situation was unable to identify the unknown caller from the suspect pool. On the other hand, the WL vector did correctly match the known suspect with the caller. The V/C and WPD vectors were not utilized in this experiment because of some restrictive recording conditions.

It appears from the findings of these three experiments that the questions stated at the onset of this project have been answered, at least within the constraints of this study. It is apparent that temporal speech characteristics play a role in the speaker identification process. Moreover, the selected temporal parameters are able to identify a speaker (recorded under normal conditions) with a fair degree of accuracy. Therefore, it must be concluded that these vectors do measure some invariant, idiosyncratic characteristics of an individual's speech repertoire.

In addition, these experiments demonstrated that the levels of speaker identification are reduced when a speaker

PAGE 124

is placed in a stressful situation or when he attempts to disguise his voice. Thus, under these conditions, the talker alters (voluntarily or involuntarily) certain temporal elements of the speech waveform. An important result was produced in the laboratory-distorted speech experiment. That is, the TED X WL vector combination resulted in the highest identification rate for disguised voices of any study known to the author. Therefore, it must be concluded that the temporal vectors measured in this study are less affected by vocal disguise than certain frequency elements of speech previously investigated.

Finally, the results of the third experiment demonstrated the effects of simulated forensic conditions upon the selected temporal parameters. From these findings it must be concluded that poor recording conditions and sampling techniques degrade the speaker identification powers of the selected temporal parameters.

In conclusion, this research program appears to have succeeded in demonstrating the role of temporal measurement (in various situations) in the speaker identification process. However, it also has shown that these temporal vectors are inadequate identification cues to be utilized

PAGE 125

solely in a speaker identification system. Therefore, further investigations into temporal measurement should examine the possible incorporation of these vectors into speaker identification systems which utilize frequency parameters.

PAGE 126

REFERENCES

Atal, B. S. Automatic Speaker Recognition Based on Pitch Contours. J. Acoust. Soc. Amer., 52, 1687-1697, 1972.

Black, J. W., Lashbrook, W., Nash, W., Oyer, H. J., Pedrey, C., Tosi, O. I., and Truby, H. Reply to Speaker Identification by Speech Spectrograms: Some Further Observations. J. Acoust. Soc. Amer., 54, 535-537, 1973.

Bolt, R. H., Cooper, F. S., David, E. C., Denes, P. B., Pickett, J. M., and Stevens, K. N. Speaker Identification by Speech Spectrograms. J. Acoust. Soc. Amer., 47, 597-613, 1970.

Bolt, R. H., Cooper, F. S., David, E. C., Denes, P. B., Pickett, J. M., and Stevens, K. N. Speaker Identification by Speech Spectrograms: Some Further Observations. J. Acoust. Soc. Amer., 54, 531-534, 1973.

Bricker, P. and Pruzansky, S. Effects of Stimulus Content and Duration on Talker Identification. J. Acoust. Soc. Amer., 40, 1441-1450, 1966.

Clarke, F. R. and Becker, R. W. Comparison of Techniques for Discriminating Among Talkers. J. Speech Hearing Res., 12, 747-761, 1969.

Coleman, R. O. Speaker Identification in the Absence of Intersubject Differences in Glottal Source Characteristics. J. Acoust. Soc. Amer., 53, 1741-1743, 1973.

Compton, A. J. Effects of Filtering and Vocal Duration Upon the Identification of Speakers Aurally. J. Acoust. Soc. Amer., 35, 1748-1752, 1963.

Doherty, E. T. An Evaluation of Selected Acoustic Parameters for Use in Speaker Identification. J. Phonetics, 4, 321-326, 1976.

PAGE 127

Doherty, E. T. and Hollien, H. Multiple-Factor Speaker Identification of Normal and Distorted Speech. J. Phonetics, 6, 1-8, 1978.

Forsythe, A. B., Engelman, L., Jennrich, R., and May, P. R. A. A Stopping Rule for Variable Selection in Multiple Regression. J. Amer. Stat. Assoc., 68, 75-77, 1973.

Glenn, J. W. and Kliener, N. Speaker Identification Based on Nasal Phonations. J. Acoust. Soc. Amer., 43, 368-372, 1968.

Goldstein, U. G. Speaker-Identifying Features Based on Formant Tracks. J. Acoust. Soc. Amer., 59, 176-182, 1976.

Grey, C. and Kopp, G. Voiceprint Identification. Report presented to the Bell Telephone Laboratory, Inc., 1-14, 1944.

Hazen, B. M. Effects of Differing Phonetic Context on Spectrographic Speaker Identification. J. Acoust. Soc. Amer., 54, 650-660, 1973.

Hollien, H. Peculiar Case of "Voiceprints." J. Acoust. Soc. Amer., 56, 210-213, 1974.

Hollien, H. Status Report of "Voiceprint" Identification in the United States. Proceedings, International Conference on Crime Countermeasures, Science and Engineering. Oxford, England, July 25-29, 1977.

Hollien, H. and Majewski, W. Speaker Identification by Long-Term Speech Spectra under Normal and Distorted Speech Conditions. J. Acoust. Soc. Amer., 62, 975-980, 1977.

Hollien, H., Majewski, W., and Hollien, P. Perceptual Identification of Voice under Normal, Stress and Disguise Speaking Conditions. J. Acoust. Soc. Amer., 56, 553, 1974.

Hollien, H. and McGlone, R. E. The Effects of Disguise on "Voiceprint" Identification. Nat. J. Crim. Def., 2, 117-130, 1976.

PAGE 128

Houlihan, K. The Effects of Disguise on Speaker Identification from Sound Spectrograms. Proceedings, IPS-77, Miami Beach, Florida, December 17-19, 1977 (in press).

Iles, M. Speaker Identification as a Function of Fundamental Frequency and Resonant Frequencies. Ph.D. Dissertation, University of Florida, 1972.

Johnson, C. C., Hollien, H., and Doherty, E. T. Long-term Power Spectra and Formant Tracks as Speaker Identification Cues in Simulated Forensic Situations. Occasionally, 2, 41-43, 1977.

Kersta, L. G. Voiceprint Identification. Nature, 196, 1253-1257, 1962.

LaRiviere, C. L. Speaker Identification from Turbulent Portions of Fricatives. Phonetica, 29, 246-252, 1974.

LaRiviere, C. L. Contributions of Fundamental Frequency and Formant Frequencies to Speaker Identification. J. Phonetics, 31, 185-197, 1975.

Lehiste, I. Reading in Acoustic Phonetics. The MIT Press, Cambridge, Mass., 358p., 1967.

Majewski, W. and Hollien, H. Euclidean Distances Between Long-term Speech Spectra as a Criterion for Speaker Identification. Proceedings, Speech Communication Seminar 74. Stockholm, Sweden, 202-210, 1974.

McGehee, F. The Reliability of the Identification of the Human Voice. J. Gen. Psychol., 17, 246-271, 1937.

McGlone, R. E., Hollien, P., and Hollien, H. Acoustic Analysis of Voice Disguise Related to Voice Identification. Proceedings, International Conference on Crime Countermeasures, Science and Engineering. Oxford, England, July 25-29, 1977.

Pollack, I., Pickett, J. M., and Sumby, W. H. On the Identification of Speakers by Voice. J. Acoust. Soc. Amer., 26, 403-412, 1954.

PAGE 129

Potter, R. K., Kopp, G., and Kopp, H. G. Visible Speech. Dover Press, New York, N. Y., 439p., 1966.

Pruzansky, S. Pattern Matching Procedure for Automatic Talker Recognition. J. Acoust. Soc. Amer., 35, 354-358, 1963.

Reich, A. R., Moll, K. L., and Curtis, J. F. Effects of Selected Vocal Disguises upon Spectrographic Speaker Identification. J. Acoust. Soc. Amer., 60, 919-925, 1976.

Rothman, H. B. A Perceptual (Aural) and Spectrographic Identification of Talkers with Similar Sounding Voices. Proceedings, International Carnahan Conference on Crime Countermeasures, Oxford, England, 1977.

Sambur, M. R. Selection of Acoustic Features for Speaker Identification. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23, 169-176, 1975.

Stevens, K. N., Williams, C. E., Carbonnell, J. R., and Woods, D. Speaker Identification and Authentication: A Comparison of Spectrographic and Auditory Presentation of Speech Materials. J. Acoust. Soc. Amer., 44, 1596-1607, 1968.

Tosi, O., Oyer, H., Lashbrook, W., Predrey, C., and Nash, W. Experiment on Voice Identification. J. Acoust. Soc. Amer., 51, 2030-2043, 1972.

Wolf, J. J. Efficient Acoustic Parameters for Speaker Recognition. J. Acoust. Soc. Amer., 51, 2044-2056, 1972.

Young, M. A. and Campbell, R. A. Effects of Context on Talker Identification. J. Acoust. Soc. Amer., 42, 1238-1254, 1967.

Zalewski, J., Majewski, W., and Hollien, H. Cross-Correlation Between Long-term Speech Spectra as a Criterion for Speaker Identification. Acoustica, 34, 20-24, 1975.

PAGE 130

BIOGRAPHICAL SKETCH

Charles Clifford Johnson, Jr. was born on September 17, 1948, in Glen Cove, New York. He graduated from Flushing High School in 1967. He received an Associate of Arts degree from Manhattan Community College in June 1969. From September 1969 to June 1971, Mr. Johnson attended City College of New York and received a Bachelor of Science degree in Biological Oceanography in 1971. During the years from 1971 to 1973, he pursued and received a Master of Science degree in Marine Science from C. W. Post Center, Long Island University. In June 1973, he entered a doctorate program at the University of Florida, and has since pursued work for a Doctor of Philosophy degree at the Institute for Advanced Study of the Communication Processes.

On June 3, 1972, Mr. Johnson was married to Christine Olga Ginal. They now have two children, Charles Clifford III and Cristen Elizabeth.

PAGE 131

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Harry Hollien, Chairman
Professor of Linguistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Howard B. Rothman
Associate Professor of Speech

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Professor of Psychology

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Alan Agresti
Associate Professor of Statistics

PAGE 132

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

W. S. Brown
Associate Professor of Speech

This dissertation was submitted to the Graduate Faculty of the Department of Speech in the College of Arts and Sciences and to the Graduate Council, and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August, 1978

Dean, Graduate School