Citation
Some acoustic and perceptual correlates of speaker identification

Material Information

Title:
Some acoustic and perceptual correlates of speaker identification
Creator:
Lariviere, Conrad Louis, 1942-
Publication Date:
Language:
English
Physical Description:
ix, 100 leaves : illus. ; 28 cm.

Subjects

Subjects / Keywords:
Acoustic data ( jstor )
Analysis of variance ( jstor )
Audio frequencies ( jstor )
Consonants ( jstor )
Discriminative listening ( jstor )
Ions ( jstor )
Listening ( jstor )
Mutual intelligibility ( jstor )
Phonemes ( jstor )
Vowels ( jstor )
Dissertations, Academic -- Speech -- UF ( lcsh )
Speech perception ( lcsh )
Speech thesis Ph. D ( lcsh )
Speech, Intelligibility of ( lcsh )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis - University of Florida.
Bibliography:
Bibliography: leaves 97-99.
General Note:
Manuscript copy.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
This item is presumed in the public domain according to the terms of the Retrospective Dissertation Scanning (RDS) policy, which may be viewed at http://ufdc.ufl.edu/AA00007596/00001. The University of Florida George A. Smathers Libraries respect the intellectual property rights of others and do not claim any copyright interest in this item. Users of this work have responsibility for determining copyright status prior to reusing, publishing or reproducing this item for purposes other than what is allowed by fair use or other copyright exemptions. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. The Smathers Libraries would like to learn more about this item and invite individuals or organizations to contact the RDS coordinator(ufdissertations@uflib.ufl.edu) with any additional information they can provide.
Resource Identifier:
022270297 ( ALEPH )
13573734 ( OCLC )

Full Text


















Some Acoustic and Perceptual Correlates
of Speaker Identification















By

CONRAD LOUIS LARIVIERE


A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF
THE UNIVERSITY OF FLORIDA IN PARTIAL
FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY










UNIVERSITY OF FLORIDA 1971














ACKNOWLEDGEMENTS


The author gratefully acknowledges the guidance and support of Dr. Harry Hollien, not only through the course of this study, but also throughout the writer's graduate career at the University of Florida.

The author is pleased to acknowledge the constructive comments of his supervisory committee, composed of Drs. Brandt, Rae, Jensen, Dew, and Algeo.

Special thanks are due the author's wife, Penny, for her unflagging love and encouragement, and the author's son, Chris, who could not care less about the initials after his father's name.

Finally, the author wishes to acknowledge his immense debt to his parents, Roland and Bella, who thought the mills unfit places for young men.




















TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

I    INTRODUCTION
       Review of the Literature
       Statement of the Problem

II   THE PROCEDURE
       General Experimental Approaches
       Details of the Experimental Procedure

III  RESULTS AND DISCUSSION
       Sentences
       Vowels
       Consonants
       Utterance Intelligibility and Speaker Identification
       Sample Time Interval
       Acoustic Analyses
       Differences Among Listeners and Speakers

IV   SUMMARY AND CONCLUSIONS

APPENDIX A

APPENDIX B

APPENDIX C

APPENDIX D

APPENDIX E

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH














LIST OF TABLES

TABLE

I.    ANALYSIS OF VARIANCE SUMMARY FOR LISTENERS' RESPONSES TO SENTENCE STIMULI

II.   ANALYSIS OF VARIANCE SUMMARY FOR VOICED VOWEL STIMULI

III.  A POSTERIORI COMPARISONS AMONG VOICED VOWELS

IV.   ANALYSIS OF VARIANCE SUMMARY FOR WHISPERED VOWELS

V.    A POSTERIORI COMPARISONS AMONG WHISPERED VOWELS

VI.   AVERAGE FORMANT FREQUENCY AMPLITUDES FOR VOICED AND WHISPERED VOWELS (dB DOWN FROM F1 AMPLITUDE)

VII.  ANALYSIS OF VARIANCE SUMMARY FOR FILTERED VOWELS

VIII. ANALYSIS OF VARIANCE SUMMARY FOR CONSONANT STIMULI

IX.   A POSTERIORI COMPARISONS AMONG CONSONANTS

X.    ANALYSIS OF VARIANCE SUMMARY FOR MONOSYLLABLES

XI.   RANK ORDER CORRELATIONS BETWEEN ACTUAL AND EXPECTED CONFUSIONS AMONG SPEAKERS FOR VOICED VOWELS

XII.  RANK ORDER CORRELATIONS BETWEEN ACTUAL AND EXPECTED CONFUSIONS AMONG SPEAKERS FOR WHISPERED AND FILTERED VOWELS

XIII. RANK ORDER CORRELATIONS BETWEEN ACTUAL AND EXPECTED CONFUSIONS AMONG SPEAKERS FOR CONSONANTS

XIV.  RANK ORDER CORRELATIONS BETWEEN ACTUAL AND EXPECTED CONFUSIONS AMONG SPEAKERS FOR /va/

XV.   FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY MEASURES FOR THE VOICED AND WHISPERED VOWELS

XVI.  FUNDAMENTAL FREQUENCY, FORMANT FREQUENCY, AND FORMANT BANDWIDTHS FOR THE CONSONANTS

XVII. FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY MEASURES FOR THE MONOSYLLABLE /va/














LIST OF FIGURES

FIGURE

1   SCHEMATIC REPRESENTATION OF GENERAL PROCEDURE

2   THE STRUCTURAL SCHEME OF THE FACTORIAL DESIGN; DATA FOR THE CONSONANT STIMULI

3   LISTENER PERFORMANCE FOR SENTENCES PRODUCED BY EACH SPEAKER

4   OVERALL LISTENER PERFORMANCE FOR VOWEL STIMULI

5   OVERALL LISTENER PERFORMANCE FOR CONSONANT STIMULI

6   OVERALL INTELLIGIBILITY LEVELS OF THE UTTERANCES EMPLOYED IN SPEAKER IDENTITY TASKS

7   PROPORTION OF LISTENER RESPONSE TYPES FOR UTTERANCES WHERE INTELLIGIBILITY AND IDENTITY JUDGMENTS WERE MADE

8   OVERALL LISTENER PERFORMANCE FOR EQUIVALENT DURATIONS OF MONOSYLLABLES AND ISOLATED PHONEMES

9   PERFORMANCE BY LISTENER FOR VOICED, WHISPERED, AND FILTERED VOWELS

10  PERFORMANCE BY LISTENER FOR VOICED AND VOICELESS CONSONANTS

11  PERFORMANCE BY LISTENER FOR MONOSYLLABLES

12  PERFORMANCE BY SPEAKER FOR VOICED, WHISPERED, AND FILTERED VOWELS

13  FORMANT FREQUENCY VALUES BY SPEAKER FOR WHISPERED /i/ AND /u/

14  FORMANT FREQUENCY VALUES BY SPEAKER FOR WHISPERED /æ/ AND /a/

15  PERFORMANCE BY SPEAKER FOR CONSONANTAL STIMULI

16  PERFORMANCE BY SPEAKER FOR MONOSYLLABIC STIMULI

17  CONFUSIONS AMONG SPEAKERS FOR VOICED /i/

18  CONFUSIONS AMONG SPEAKERS FOR VOICED /u/

19  CONFUSIONS AMONG SPEAKERS FOR VOICED /æ/

20  CONFUSIONS AMONG SPEAKERS FOR VOICED /a/

21  CONFUSIONS AMONG SPEAKERS FOR WHISPERED /i/

22  CONFUSIONS AMONG SPEAKERS FOR WHISPERED /u/

23  CONFUSIONS AMONG SPEAKERS FOR WHISPERED /æ/

24  CONFUSIONS AMONG SPEAKERS FOR WHISPERED /a/

25  CONFUSIONS AMONG SPEAKERS FOR FILTERED /i/

26  CONFUSIONS AMONG SPEAKERS FOR FILTERED /u/

27  CONFUSIONS AMONG SPEAKERS FOR FILTERED /æ/

28  CONFUSIONS AMONG SPEAKERS FOR FILTERED /a/

29  CONFUSIONS AMONG SPEAKERS FOR /s/

30  CONFUSIONS AMONG SPEAKERS FOR /z/

31  CONFUSIONS AMONG SPEAKERS FOR /f/

32  CONFUSIONS AMONG SPEAKERS FOR /v/

33  CONFUSIONS AMONG SPEAKERS FOR /v+a/

34  CONFUSIONS AMONG SPEAKERS FOR /va/

















Abstract of Dissertation Presented to the Graduate Council
of the University of Florida in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy


SOME ACOUSTIC AND PERCEPTUAL CORRELATES OF SPEAKER IDENTIFICATION

By

Conrad Louis Lariviere

Chairman: Harry Hollien
Major Department: Speech


An investigation was undertaken concerning the ability of listeners to identify speakers solely on the basis of voice. The purposes of this study were: (1) to establish the relative contributions of source and vocal tract transfer characteristics to speaker identification, (2) to determine whether or not speaker identification was possible on the basis of isolated utterances of continuant consonants, (3) to investigate the nature of the relation between utterance intelligibility and speaker identification, and (4) to determine whether sample duration was a variable in speaker identification in absolute or relative terms.

The subjects for this study were eight male speakers and twelve listeners. The listeners were exposed to the following speaker utterances: two prose sentences; four vowels under three conditions--voiced, whispered, and low-pass filtered; four consonants; and two consonant-vowel monosyllables, one natural and one "synthetic." The three vowel conditions were taken to simulate the presence only of: (1) source information (filtered), (2) vocal tract transfer information (whispered), or (3) both (voiced). The monosyllables allowed for the evaluation of the role of duration in speaker identification. The listeners' task was to identify the speaker they felt produced each item. For the vowel and consonant stimuli, they were also required to identify the utterance.

The acoustic analyses of the speakers' utterances were performed and the parameters extracted were then correlated with the confusions among speakers. The parameters extracted included fundamental frequency, the first three formant frequencies, and formant bandwidths.

The results indicated that: (1) all stimuli yielded speaker identification performance above chance levels, (2) sentences yielded performance far above any other stimuli, (3) the performances achieved for whispered vowels and filtered vowels were very nearly equal and summed to the performance achieved for voiced vowels, (4) voiced consonants yielded significantly higher performances than voiceless consonants, (5) the natural monosyllable resulted in higher performance than the "synthetic" monosyllable, and (6) utterance intelligibility was neither a necessary nor sufficient concomitant to speaker identification.

The major conclusions reached were that there seem to be no acoustic invariants related to speaker identification, and that speech intelligibility and speaker identification appear, then, to be qualitatively different percepts. This indicates that an adequate model for phoneme identification would not necessarily constitute an adequate model for speaker identification, and vice versa.













The following conclusions also seemed tenable: (1) speaker identification for vowels seems based on both fundamental frequency and formant frequency information, and the influence of these parameters is both equal and additive, (2) speaker identification is possible on the basis of isolated continuant consonants; the acoustic cues responsible for the identification of these consonants proved to be elusive, and they deserve further investigation, (3) duration is an important variable in speaker identification performance in that it allows listeners to sample added information on the basis of some integral measure of multi-phonemic utterances, and (4) the very high performance yielded by the sentence stimuli points to the possible importance of suprasegmental cues to speaker identification.




















I

INTRODUCTION

The extraction of a speaker's identity solely from his voice is a familiar perceptual phenomenon that can be observed in conversations, cocktail parties, audio portions of television broadcasts, etc. Indeed, in certain circumstances, speaker identification from voice cues is crucial; a pilot, for example, must be able to isolate and attend to one voice amid a welter of voices if he is to guide his aircraft appropriately; an individual will not reveal confidences over a telephone unless he is sure of his listener's identity. As demonstrated below, there is ample experimental evidence to support such anecdotal observations.

Superficially, the process of speaker identification from voice may seem to be of little consequence, constituting little more than an empirically established curiosity. On closer inspection, however, one finds several compelling reasons underlying the investigation of speaker identity from voice cues. Recently, for example, there has been considerable interest concerning speaker identification from spectrographic representations of voice ('voiceprints'). The forensic implications of such a process are obvious, but there has been some skepticism and controversy regarding its validity. Bolt et al. (1970) point out that (1) experimental studies of voice identification using visual inspection of spectrograms have yielded false identification rates ranging from zero to 60%, depending on the type of task set for the observer and the latter's training, (2) as yet, experimental studies designed to assess the reliability of identification from voiceprints under practical conditions have not been carried out, and (3) no experimental studies have been attempted which deal with speakers who are attempting to disguise their voices. Tosi et al.'s (1971) recent work concerning speaker identification from 'voiceprints' seems to represent an attempt to resolve some of these issues. Although their preliminary results are promising, Tosi et al. (1971) note a need for a great deal more research in this area.

One must also countenance the (for some) Orwellian notion that man's voice will serve as his universal identifying credential for financial, security, or professional matters. For example, Kersta (1971) has reported the successful use of machine speaker recognition for security purposes. However, he worked with a small (N = 16), closed, and cooperative speaker society, and warned that extensions of this approach to larger societies will be very difficult. Tosi et al. (1971) have also stressed the manifest difficulties of using large and/or non-cooperative speaker inventories.

Another motivation for investigating speaker identification from voice revolves around possible implications for models of speech perception; specifically, the relation between speaker identification and speech intelligibility presently is unknown, as it is largely uninvestigated. Most models of speech perception or speech recognition posit a gross preliminary analysis of the signal prior to entry into either a strategy section, which converges on a solution through an iterative process, or directly into a lexicon (of phonemes, or words, or phrases, etc.), where the input signal is matched against the items stored there. Preliminary analysis (e.g., Halle and Stevens, 1962) is usually taken to consist of a bank of contiguous bandpass filters, whose output somehow serves as a rough guide (in the frequency domain) for lexical selection.
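As an illustrative aside, such a bank of contiguous bandpass filters can be sketched digitally as follows. The band edges, the Butterworth filters, and the Python/scipy tooling are assumptions of this sketch, not part of the models cited above.

    import numpy as np
    from scipy.signal import butter, sosfilt

    def contiguous_filterbank(x, rate, edges_hz=(100, 300, 600, 1200, 2400, 4800)):
        """Pass a signal through a bank of contiguous bandpass filters and
        return one output channel per band (band edges are illustrative)."""
        outputs = []
        for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
            sos = butter(2, [lo, hi], btype="band", fs=rate, output="sos")
            outputs.append(sosfilt(sos, x))
        return np.stack(outputs)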

If the perception of an utterance is found to be a sufficient and/or necessary concomitant to speaker identification, then it would be reasonable to infer that an adequate model for speech recognition could also be an adequate model for speaker identification. If such a relation does not obtain, one is left with the prospect that speaker identification is a qualitatively different percept from speech recognition, and may be mediated by different cues in the acoustic signal. It may be tenable, then, to ascribe the speaker identification process only to the preliminary analysis segment of speech recognition schemes.

The main difficulty in all these approaches is that there exists no explicit and valid model for speaker identification. For instance, no known speech perception model offers or entertains an explanation of an observer's ability to identify speakers; machine recognition schemes (such as Kersta's, 1971) rely largely on selected time-frequency-amplitude measures, yet no systematic attempt has been made to relate such measures to speaker identification performance. It is not the intent of this study to propose or establish such a model, for it seems premature to formulate one at this time. Speech is first and foremost an acoustical signal, and whatever information a listener can extract from speech, including the speaker's identity, is coded completely among the acoustic characteristics of the signal. Note that only a portion of such characteristics are available in visual representations of voice, and an observer who attempts to identify a speaker from spectrograms may well be at a distinct disadvantage from one who can avail himself of the original acoustic signal. Yet there remains some ignorance concerning the acoustic correlates to speaker identification. The literature (outlined below) concerning speaker identification by aural approaches has been mainly concerned with quantifying the extent of the effect; there has yet to be a systematic series of experiments attempting to relate speaker identification to the acoustic parameters of the signal. Conceivably, such acoustic correlates could play a key role in an eventual speaker identification model. Thus the focus of the present study concerns the investigation of some possible acoustical and perceptual correlates of speaker identification from voice.¹



¹It should be noted at the outset that some liberties with traditional terminology have been taken in the literature. The terms 'speech' and 'voice' have usually been taken to refer to different phenomena, the former dealing with linguistic events (e.g., events relating to the phonemes of some natural language) and the latter dealing with non-linguistic events (e.g., laryngeal parameters). Although all of the known research concerning aural speaker identification utilizes 'speech' cues (texts, sentences, phonemes), the common appellation for such a paradigm is 'speaker identification from voice.' One suspects that this is due primarily to the clumsiness of 'speaker identification from speech'; in any event, the phenomenon of identifying a speaker solely from his utterances will be universally referred to here, for conformability, as speaker identification from voice.















Review of the Literature

Pollack et al. (1954) were among the first to empirically establish that listeners could identify speakers on the basis of voice alone. Their speakers read identical, unspecified texts under two conditions: voiced and whispered. The independent variables were duration and filtering. Results for the voiced samples indicated that speaker identification scores were directly proportional to duration up to 1.2 seconds, beyond which there was no improvement in performance. Listener performance under varying high- and low-pass filtering conditions suggested that identification performance was not "critically dependent upon any delicate balance of different frequency components in any single portion of the speech spectrum." On the other hand, for the whispered productions it was found that performance was equivalent to the voiced condition if duration was increased by a factor of three. The authors concluded that duration was a significant variable in speaker identification performance, in that it admitted larger samplings of a speaker's repertoire.

Compton (1963) used only sustained productions of the vowel /i/ and varied duration and filtering conditions. He, too, found that performance increased with duration only up to durations of 1250 msec. Moreover, he found that filtering frequencies above 1020 Hz substantially reduced speaker identification performance and filtering frequencies below 1020 Hz had relatively little effect on performance. Compton concluded that confusions among speakers were largely explained by similarities of their fundamental frequencies, yet he reported no attempt to account for the confusions on any other basis.

Bricker and Pruzanski (1966) used a variety of speech materials as stimuli in a speaker identification task. Of particular interest here are their results when the stimuli were vowels (V) as opposed to consonant-vowel monosyllables (CV): when the V's and CV's were presented to listeners at identical durations (140 msec.), they found that the CV stimuli yielded significantly better identification performance than the V stimuli. They concluded that the number of phonemes in a speech sample correlated more closely with identification performance than did the sample's absolute duration. They also noted that confusions among speakers were not independent of the vowel uttered--i.e., the pattern of speaker confusions differed over vowels. Stevens et al. (1968) also noted vowel effects; specifically, they found that utterances containing a front vowel (/i/) yielded higher identification scores than utterances containing a back vowel (/a/). They speculated that this latter result may have been due to the importance of the second formant as a cue to speaker identification, since front vowels are characterized by a wide frequency gap between the first and second formant and a high absolute frequency location of the second formant.


Statement of the Problem


In the acoustic domain, the basic parameters of a speech signal are time, frequency, and amplitude. Given constant amplitude, the variables of interest are time and frequency. Phonemic effects have been noted in previous experimentation and will also be considered here. To summarize the literature with respect to these influences:

A. Duration. Some evidence (Compton, 1963) suggests that speaker identification is a function largely of absolute duration. Bricker and Pruzanski (1966), Pollack et al. (1954), and, by inference, Stevens et al. (1968) suggest that duration is important only insofar as it allows listeners to sample a larger repertoire of the speaker's phonemes. Perhaps a more cogent approach to the role of duration in speaker identification is to consider its contributions in terms of information--i.e., increases in duration yield more information about the speaker. The problem then becomes how and where the added information is coded in the speech signal.

B. Frequency. Although Compton (1963) has attributed speaker confusions to fundamental frequency, he reports no attempt to explain the confusions among speakers on any other basis, e.g., formant frequencies. Indeed, the speculations of Stevens et al. (1968) suggest that the second formant may deserve a closer look as a possible source of speaker confusion, and further research in this area seems indicated.

C. Phonemic effects. Bricker and Pruzanski (1966) have shown that speaker confusions vary with the vowel uttered. Stevens et al. (1968) noted that front vowels yield higher scores than back vowels. There are no known investigations concerning speaker identification from continuant consonants (although Schwartz, 1968, has successfully used continuant consonants as stimuli in a sex identification experiment), or the effects of consonant-to-vowel formant transitions, or the relation, if any, between sample intelligibility and speaker identification.

A systematic attempt to resolve the inconsistencies and gaps noted in this review may serve as a useful preliminary to the eventual establishment of a model for speaker identification and would seem to constitute a viable general problem for research. Hence, the research proposed here addresses itself to four specific problems for research:

A. To what extent do source characteristics (e.g., fundamental frequency) and vocal tract transfer characteristics (e.g., formant center frequencies) contribute to speaker identification and confusions among speakers?

B. Is speaker identification possible from isolated samples of continuant consonants?

C. What is the relation between utterance intelligibility and speaker identification?

D. Does the time interval of a sample (herein referred to as duration) contribute to speaker identification from an absolute standpoint, or is it significant only because it allows the listener to sample more of the phonemes in a speaker's repertoire? If the latter, is the extra information coded in target or transitional phonemic cues?



















II

PROCEDURE


Since the experimental methods employed in this study were rather involved, an overview of the general experimental approaches related to each specific problem for research is discussed initially in this section. This is followed by a description of the procedural details employed in the actual experimentation--e.g., subject and utterance selection, utterance treatments, data reduction, etc.

The scope of the present experimentation was limited to the identification of a closed set of speakers from auditory cues alone. The speech samples addressed to the specific research problems were restricted to 'simple' utterances--vowels, consonants, and consonant-vowel monosyllables--for two reasons: (a) there is ample empirical evidence that speaker identification can be reliably accomplished from such samples, and (b) sample control (for duration and amplitude) and sample analysis (for the extraction of acoustic parameters) are much more easily achieved for these utterances than for more complex materials (e.g., sentences). The latter, however, contain speech characteristics, such as rate of articulation, inflection, and dialect, which are not present in the simple utterances used in this study. In order to obtain some estimate of the contributions of such characteristics, two prose sentences were also used in this study.











General Experimental Approaches

Concerning the first research problem, the evidence noted above (Compton, 1963) that sustained vowels yield speaker identification information helps to narrow down the possible acoustic cues which the listeners avail themselves of, for a 'steady state' vowel may be described in terms of its source (fundamental frequency and spectral overtones) and vocal tract transfer (formant structure) characteristics (Fant, 1960). The question remains whether either of these characteristics alone is sufficient for speaker identification, or whether their contributions are additive. Although it is functionally impossible to uncouple the source (larynx) and the vocal tract, an attempt was made to simulate such an uncoupling. The speakers were asked to produce isolated vowels, voiced and whispered. These were presented to listeners under three conditions:

(a) voiced: contained both source and transfer characteristics.

(b) whispered: contained primarily transfer characteristics.

(c) voiced, low-pass filtered below the first formant (at 200 Hz): primarily simulated source characteristics.
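Condition (c) was produced in this study with an analog low-pass filter (described later in this chapter); a minimal digital sketch of the same operation is given below. The fourth-order Butterworth response (roughly 24 dB/octave, close to the 23 dB/octave reported for the analog filter) and the file-handling details are assumptions for illustration only.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfiltfilt

    def simulate_source_only(path_in, path_out, cutoff_hz=200.0, order=4):
        """Low-pass filter a voiced vowel below the first formant (condition c);
        a 4th-order Butterworth rolls off at about 24 dB/octave."""
        rate, x = wavfile.read(path_in)     # hypothetical input file
        sos = butter(order, cutoff_hz, btype="low", fs=rate, output="sos")
        y = sosfiltfilt(sos, x.astype(np.float64))
        wavfile.write(path_out, rate, y.astype(np.int16))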

Thie second research problem was examined by having identity judgments imale of continuant consonant cogniate pairs. Continuants were selected because they could be produced and presented in isolation and their durational characteristics could be controlled without affecting choir intelligibility. Cognate pairs were chosen because they differ in the presence or absence of fundamental frequency w..ithin each pair, so that the data collected hero could also be addressed to the first research prob-lem, i.e., if speaker










ident ificat ion is largely coded InM fundaiient a I frctj ukncy information, oc2.e xTould exp'ect sig-nifijcanItly betty?__ > tict~ scores for tae voiced consonants than for '._i : ir-I consi ant S.

The third problem concerns the possible relationship between speech intelligibility and speaker identification. It was investigated by having listeners make intelligibility judgments for the vowel and consonant stimuli simultaneously with speaker identity judgments. A limited inventory of vowels and consonants (four of each) was produced by the speakers. The listeners were apprised of the inventory and forced to make intelligibility judgments as well as speaker identity judgments.

The fourth problem, regarding the role of sample time interval, was investigated by comparing the listeners' identification performance for equivalent durations of isolated phonemes vs consonant-vowel monosyllables. Two types of consonant-vowel monosyllables were employed. One was a "synthetic" monosyllable, assembled from sustained productions of a continuant consonant and a vowel used above. The other was a naturally produced monosyllable, containing the same two phonemes as the "synthetic" monosyllable. Note that, given equivalent time intervals (duration), one obtains a progression of:

(a) continuant consonant: one phoneme.

(b) vowel: one phoneme.

(c) synthetic monosyllable: two contiguous phonemes, but lacks consonant-to-vowel formant transitional information.

(d) natural monosyllable: two phonemes, intact transitional information.













Such an ensemble provides a means to evaluate the informational aspects of duration in speaker identification. If duration is important only in absolute terms, one expects no significant difference in listener performance over these stimuli, since they have equal durations. If increases in duration are consequential because they allow listeners added information in terms of the steady state portions of a greater number of phonemes, then there would be no significant difference between the natural and "synthetic" monosyllables, but the latter two would yield higher scores than the continuant consonant or vowel stimulus alone. Finally, if the import of duration is that it allows the sampling of consonant-to-vowel transitional information, the natural monosyllable should yield higher identification scores than all other stimuli.


Details of the Experimental Procedure





(1) Speakers. Eight male speakers were used in this study. Selection criteria were: (a) age range twenty to thirty-five years, (b) no known speech defects, and (c) speakers of general American English.

(2) Listeners. The observers were twelve individuals free from any known hearing defects, who had been in routine contact with each of the speakers for a period of at least six months. It was reasonable to assume, then, that the listeners were familiar with the speakers' voices and that practice sessions in the actual experimentation would not be required.









B. Stimulus Materials.

Selection of vowels. Because of the observations of Stevens et al. (1968) and Bricker and Pruzanski (1966) noted above, two front vowels, /i/ and /æ/, and two back vowels, /u/ and /a/, were chosen for speaker production. Note that within the categories front vowel-back vowel, one of the selected vowels is a high vowel and the other a low vowel. Peterson and Barney (1952) demonstrated that, for any given speaker, fundamental frequency probably does not vary significantly over these vowels, that front vowels are characterized by a high second formant vs back vowels, and that high vowels are characterized by a low first formant frequency vs low vowels. Moreover, these particular vowels were chosen because they occupy extreme positions in the traditional vowel diagram; hence their formant frequency structures should show wide contrasts over a range of speakers. Additionally, the use of four vowels, as noted above, allows for incorporating a closed set of utterances for the generation of intelligibility and confusion measures. As outlined above, the talkers produced these vowels in two conditions, voiced and whispered.

Selection of consonants. Four continuant consonants were used: /f, v, s, z/. Four consonants were used because, again, such a selection allowed for the generation of intelligibility and confusion data. These particular consonants were chosen because of their high frequency of occurrence in English and because they are relatively easy to produce and identify in isolation. Additionally, some data regarding their acoustic characteristics are available in the literature (Flanagan, 1965; Harris, 1953; Heinz and Stevens, 1961; Hughes and Halle, 1956).

Selection of consonant-vowel monosyllables. The talkers produced the monosyllable /va/. This monosyllable was selected because its two phonemes, /v/ and /a/, were also produced in isolation by each speaker. From these isolated productions, a synthetic monosyllable, /v + a/, was assembled by a procedure detailed below in Section C.

Overall, these stimulus materials sufficiently address themselves to the specific questions for research outlined above: the vowel and consonant conditions allow the evaluation of the contributions of source and transfer characteristics, the use of several vowels and consonants allows for the evaluation of possible relations between utterance intelligibility and speaker identification, and the monosyllables should yield some insight into the informational role of sample time interval (duration).

C. Preparing Stimulus Materials.

Figure 1 is a schematic representation of the four principal steps employed in the procedure of this study.

Recording conditions. Each talker was seated in a sound-treated room (IAC 1204-A) and positioned approximately six inches from a dynamic microphone (Electro-Voice 664). All utterances were recorded at 7.5 inches per second on a single-track tape recorder (Magnecord PT6-1), located outside the room.

The speakers were asked to produce the following utterances:

(a) name, (b) the first two sentences of tho Rainbow Passage (Fairbanks, 1956), us Lg natural rane and inflection, (Q) five threesecond uroduc tins of werh vco:;e , whi s pered, (e) Mie chre secondd







[Figure 1: Schematic representation of general procedure.]









productions of each consonant, and (f) five productions of /va/ in which they prolonged the final vowel for two seconds.

Speakers were instructed to achieve a constant VU meter deflection (-2) for all of their utterances. Their performance was monitored by both speaker and experimenter.

At the time of the recordings, a black and white photograph was taken of each speaker against a flat background with a high quality 35mm camera. Eight-by-ten inch enlargements of these photographs were obtained for use in the listening sessions.

Utterance selection and treatment. The five productions of each stimulus were evaluated by a panel of four experienced listeners (phoneticians). The latter chose the most representative production of each stimulus from the five original productions made by the speakers. It was these "preferred" productions which were treated for duration, filtered where appropriate, and repeated and randomized for the actual experimental listening tapes. Instructions to the evaluators are contained in Appendix A.

One duration, 1250 milliseconds, was used for all stimulus materials. This duration was selected on the basis of the observations of Compton (1963) and Pollack et al. (1954) that durations exceeding this had no effect on identification performance. The selected duration was generated in the following way: the "preferred" utterances served as input to an electronic switch (Grason-Stadler 471-1), whose duty cycle (1250 msec.) was manually initiated by the experimenter at the beginning of each utterance. Rise and fall times were set at 25 milliseconds. This selection was based on data from Bricker and Pruzanski (1966); they reported that rise and fall times above 15 msec. did not introduce any artifactual consonantal effects. The output of the electronic switch was monitored by a cathode ray oscilloscope (Tektronix 56A).
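A digital counterpart of this gating operation might look as follows; the linear ramps and the array-based handling are assumptions standing in for the electronic switch, offered only as a sketch.

    import numpy as np

    def gate_excerpt(x, rate, duration_s=1.25, ramp_s=0.025):
        """Excerpt 1250 msec from a sustained utterance and impose
        25 msec rise and fall ramps (x is a 1-D array of samples)."""
        n, r = int(duration_s * rate), int(ramp_s * rate)
        y = np.array(x[:n], dtype=float)
        ramp = np.linspace(0.0, 1.0, r)
        y[:r] *= ramp          # rise
        y[-r:] *= ramp[::-1]   # fall
        return y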

In order to construct the synthetic /v + a/ utterances, the following procedure was used: 600 millisecond excerpts of a sustained production of /v/ were obtained (as described above) and recorded on one channel of a Sony 530 recorder; similarly, a 620 msec. excerpt of /a/ was recorded on the second channel; a tape loop was constructed containing both of the phonemes thus recorded. This tape loop was placed on PAMMS (a pause adjustment mechanism and measurement system developed by Jensen et al., 1970), a device which allows a variable, monitorable delay to be introduced between the material on each channel. The delay was adjusted to approximately 5 msec., and the output of PAMMS was then recorded on a Sony 530 recorder at 7.5 i.p.s.

To generate the filtered vowels, the 1250 msec. excerpts of the voiced vowels were low-pass filtered at 200 Hz. The frequency response of the filter (Krohn-Hite 3100) showed an attenuation rate of 23 dB/octave.
Generating the experimental tapes. Six experimental tapes, one for each stimulus category (i.e., sentences, voiced vowels, whispered vowels, filtered vowels, consonants, and monosyllables), were generated. In all cases, the interstimulus interval was four seconds; each production was repeated five times and randomized according to a random number table. Since the listener response










forms contained 25 items per page, a ten-second interval followed every 25 items on all tapes. It should be noted that there was no direct correspondence between any experimental tape and the specific problems for research. The grouping of utterance types on these tapes was solely to facilitate the listeners' task and the analysis of data.

Experimental Tape 1 consisted of the voiced vowels. The eight speakers had produced four different vowels and, since each stimulus was repeated five times, this tape consisted of 160 stimulus items. This construction also applies for Experimental Tapes 2, 3, and 4, which contained, respectively, whispered vowels, filtered vowels, and consonants. Playback time for Tapes 1 through 4 was approximately 15 minutes each.
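For concreteness, the construction of one such tape order (8 speakers x 4 utterances x 5 repetitions = 160 items) can be sketched as below; the seeded software shuffle stands in for the random number table and is an assumption of the sketch, not the procedure actually used.

    import random

    def tape_order(n_speakers=8, utterances=("i", "u", "ae", "a"), n_reps=5, seed=1):
        """Randomized presentation order for one experimental tape:
        8 speakers x 4 utterances x 5 repetitions = 160 stimulus items."""
        items = [(spk, utt)
                 for spk in range(1, n_speakers + 1)
                 for utt in utterances
                 for _ in range(n_reps)]
        random.Random(seed).shuffle(items)
        return items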

The fifth experimental tape contained all of the consonant-vowel monosyllables. There were two monosyllables for each speaker, and each monosyllable was repeated five times, so that Tape 5 consisted of 80 stimulus items. The playback time for Tape 5 was approximately 7.5 minutes.

Additionally, a control tape was assembled consisting of five randomized repetitions of each subject's sentence productions. The data collected from this tape were used as a performance baseline over speakers and listeners--i.e., the listeners here could avail themselves of relatively long durations of natural speech (including speaking rate and inflection, as well as phonemic effects), so their performance here was used as a metric in assessing their proficiency for all other stimuli.










D. Acoustical Analysis.

Since one of the objectives of this study is to assess to what extent speaker identification and confusions among speakers may be accounted for by source and transfer characteristics of the vocal mechanism, each talker's "preferred" utterances from Section C, above, were subjected to acoustical analyses for the extraction of the following parameters: fundamental frequency, formant frequencies, and, for the voiceless consonants, formant bandwidths.

Fundamental frequency was determined, for all voiced stimuli, by oscillographic analysis (Honeywell Visicorder, operated at 15 inches per second). Formant frequencies were estimated from wideband spectrograms generated by a Kay Electric 6061 spectrographic unit. The protocols suggested by Dew et al. (1969) for such measurements were followed. Formant bandwidths, for the voiceless fricatives, were estimated from narrow-band amplitude sections. Bandwidths were measured at a point 3 dB down from the peak of each formant (this was found to correspond to a distance of 5 mm on the spectrographic paper).

E. Stimulus-Response Task.

Portraits of each speaker were attached to the wall of the listening room (IAC 1204-A). Immediately below each portrait, the initials of the individual portrayed were printed in large block letters. Instructions given the listeners are shown in Appendix B.

For the control tape and Experimental Tape 5, the listeners were presented with answer sheets denoting the stimulus number, followed by the eight sets of identifying speaker initials. The listeners' task was simply to circle the initials of the speaker whom they felt produced that stimulus item.

For Experimental Tapes 1 through 4, the answer sheet consisted of the stimulus number, followed by the catalogue of utterances on each tape and the speakers' initials. The listener was asked to circle the utterance heard and circle the initials of the speaker whom they felt produced it. An example of an answer sheet the listeners used when they were hearing Experimental Tape 4 is contained in Appendix C.

F. Playback Conditions.

All experimental tapes were played from an Ampex 351 tape recorder, through one channel of a Marantz Model 7 pre-amplifier and a Marantz 8-B power amplifier. The loudspeaker, an acoustic suspension device (AR-4), was located in a sound-treated room (IAC 1204-A).

Stimuli were presented to listeners at 70 dB SPL. One or two observers per listening session were used, seated equidistantly (approximately three feet) from the loudspeaker. A sound level meter (General Radio), positioned where listeners were to be seated, was used as a calibration device. Calibration was accomplished via a 1000 Hz tone which was recorded at the same VU level (-2) as the speech samples. The RMS voltage at the loudspeaker's input corresponding to 70 dB SPL in the room was noted on a vacuum tube voltmeter (Ballantine 321-C), and the latter was monitored by the experimenter throughout the listening sessions.

All listeners heard the control tape first. The other experimental tapes were presented in random order. All listening sessions were conducted over a period of five days. All listeners indicated their responses by circling the appropriate items with a black ink pen. Listener response forms were graded by overlaying them on a coded key, on which the correct responses were circled in colored ink, each color denoting the particular utterance and speaker involved in a given trial. All responses by a given listener were then transferred to confusion matrices for each type of utterance--for example, actual speaker vs perceived speaker for whispered /i/, etc. The diagonals of these matrices represented correct listener responses and were used in all analysis of variance procedures.

Although this grading and response summarization procedure was a rather laborious one, it should be noted that this study involved 9920 separate listener responses. It is felt that the procedure used here was not only time saving, in the long run, but also tended to minimize the intrusion of experimenter error, since the number of responses in any given matrix may be summed, after the scoring procedure, and used as a check for errors (for example, a given listener's responses for whispered /i/ must sum to 40, etc.). As a further check, all responses were also graded by an independent observer; the few discrepancies which were found were then resolved.

Although the possibility of experimenter error in a paper and pencil response paradigm of the magnitude used here is always high, it is felt that the technique employed tended to minimize this unwanted source of variation to the point where it is not a significant factor in the results presented in this investigation.

G. Data Reduction.

The paradigm used here is a randomized block factorial design, representing a mixed model (utterances are considered fixed effects, speakers and listeners are considered random effects). The consonantal conditions, for example, involve the following factors: eight levels of speakers and four levels of consonants. Each listener is considered as a block, since each is exposed to all stimuli.

In his discussion of randomized block designs, Kirk (1968) notes that it is not necessary, for mixed models, to assume that the block and treatment effects are additive in order to test the treatment effects. Consequently, block (listener) by treatment (speaker and utterance) interactions were not considered in the statistical analyses employed in this study.

Factorial designs of this sort allow the analysis of two or more variables, both in terms of their individual ('main') effects and in terms of their interactions with one another. The presence of significant differences among treatment means is determined by an analysis of variance procedure which utilizes the F distribution. This procedure yields information concerning factor effects and interactions; when any of the fixed effects factors showed significant effects, comparisons among treatment level means were made in order to determine the contributions of the levels within each factor to the test scores. The procedure employed here involved a posteriori comparisons among means using Scheffé's method (Hays, 1963).
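A sketch of such an analysis in modern statistical software is given below; the long-format data frame, its column names, and the use of statsmodels are assumptions of the sketch, and a full mixed-model treatment would test effects against the appropriate error terms rather than the single residual shown here.

    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    def consonant_anova(df: pd.DataFrame):
        """Randomized-block factorial ANOVA sketch for the consonantal data.
        `df` is assumed to hold one row per listener x speaker x consonant
        cell, with columns: score, listener, speaker, consonant."""
        model = smf.ols(
            "score ~ C(listener) + C(speaker) * C(consonant)", data=df
        ).fit()
        return anova_lm(model)  # F tests for speaker, consonant, interaction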

Figure 2 shows the structural design for the consonantal stimuli. The designs for the vowels and monosyllables are similar.







[Figure 2: The structural scheme of the factorial design; data for the consonant stimuli.]









Concerning the relation of speaker identification and utterance intelligibility, it should be noted that, for all vowel and consonant stimuli, four types of responses are possible--that is, identification is either correct or incorrect and utterance intelligibility is either correct or incorrect. An inspection of the distribution of these responses was made in order to ascertain whether or not utterance intelligibility is a necessary and/or sufficient condition for speaker identification.
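A minimal tally of these four response types, assuming a hypothetical list of (identification correct, utterance correct) pairs, might be sketched as follows.

    from collections import Counter

    def tally_response_types(trials):
        """Count the four possible response types: speaker identification
        correct/incorrect crossed with utterance intelligibility
        correct/incorrect. `trials` is an iterable of
        (id_correct, utterance_correct) booleans."""
        labels = {(True, True): "ID correct / utterance correct",
                  (True, False): "ID correct / utterance incorrect",
                  (False, True): "ID incorrect / utterance correct",
                  (False, False): "ID incorrect / utterance incorrect"}
        return Counter(labels[(bool(i), bool(u))] for i, u in trials)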

Confusion Matrices. Confusion matrices, similar to those employed by Miller and Nicely (1955), were generated for all vowels, consonants, and monosyllables. These matrices are frequency plots relating actual speakers to perceived speakers; hence, they not only indicate correct responses but also yield information concerning the pattern of errors. An attempt was made to account for the observed confusions among speakers on the basis of rank order correlation techniques (Kendall tau--Siegel, 1956) relating confusions to the parameters obtained from acoustic analyses, above. The rationale here is that if, for a given utterance type, speaker identification is largely coded in some acoustic parameter, then that parameter should correlate highly with confusions among speakers.
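The construction of such a matrix and one way of relating its off-diagonal confusions to an acoustic parameter are sketched below; the data layout and the use of Kendall's tau from scipy are assumptions of the sketch, with parameter closeness standing in for the expected confusions.

    import numpy as np
    from scipy.stats import kendalltau

    def speaker_confusions(actual, perceived, n_speakers=8):
        """Frequency plot of actual vs perceived speaker (labels 1..n_speakers);
        the diagonal holds correct identifications."""
        m = np.zeros((n_speakers, n_speakers), dtype=int)
        for a, p in zip(actual, perceived):
            m[a - 1, p - 1] += 1
        return m

    def confusion_parameter_correlation(conf, param):
        """Rank-order (Kendall tau) correlation between observed confusions
        and closeness of an acoustic parameter (param[i] is the measure,
        e.g. f0 or F2, for speaker i); speakers whose values lie nearer are
        expected to be confused more often."""
        pairs = [(conf[i, j], -abs(param[i] - param[j]))
                 for i in range(conf.shape[0])
                 for j in range(conf.shape[0]) if i != j]
        observed, closeness = zip(*pairs)
        return kendalltau(observed, closeness)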

For the voiced vowels, confusions were related to fundamental frequency (f0), first formant frequency (F1), second formant frequency (F2), third formant frequency (F3), and the ratio of formant two to formant one frequency (F2/F1). For the whispered vowels, confusions were related to F1, F2, F3, and F2/F1. For the low-pass filtered vowels, f0 only was employed.









For the voiced consonants, /v/ and /z/, confusions were related to f0, F1, and F2. The confusions among speakers for the voiceless consonants, /s/ and /f/, were related to F1, F2, and the bandwidths (BW) of those formants, BW1 and BW2, respectively.

For the monosyllable, /va/, the following characteristics were chosen: f0, the first three formants of the vowel /a/, and the locus of the second formant transition.

The acoustic parameters employed here were chosen because they not only represent characteristics which are both easily extracted from the signal and frequently used in acoustic descriptions in the literature, but they also, in some instances, represent those acoustic parameters thought to be germane to speech intelligibility. Peterson and Barney (1952), for instance, concluded that vowels are differentiated from one another largely on the basis of the F2/F1 ratio. In connection with the voiceless fricatives, Heinz and Stevens (1961) and Flanagan (1965) have indicated that these are largely differentiated on the basis of upper formant frequency locations; Hughes and Halle (1956) concluded that voiceless and voiced fricatives are differentiated largely on the basis of the presence of substantial spectral energy in the latter below 1000 Hz. Stevens and House (1956) and Harris (1953) have shown that, in CV monosyllables, the perception of the consonant is signaled largely by the locus of the second formant transition.

Additional insight into the relation of speaker identification and utterance intelligibility was sought, then, by incorporating these acoustic characteristics into the attempt to account for the confusions among speakers in the identification task. If identification and intelligibility are coded in similar acoustic cues, the confusions should largely be explained by those cues.















III

RESULTS AND DISCUSSION


It will be recalled that the sentence stimuli were used to obtain some estimate of the contributions to speaker identification of suprasegmental features, such as rate of articulation and inflection. The vowel stimuli are directed toward the first research problem, concerning the relative contributions of source and vocal tract transfer characteristics to speaker identification. The consonantal stimuli concern the second research problem but also have implications for the first problem. The monosyllabic stimuli are directed toward evaluating the informational aspects of duration.

In this chapter, results are discussed initially by utterance type. Later subsections are concerned more specifically with the relations obtaining between acoustic characteristics and identification, and between utterance intelligibility and identification.


Sentences


The grand mean identification performance for the sentence stimuli was 97%. Bricker and Pruzanski (1966) also used sentence stimuli in their investigation, and their result (98%) is in good agreement with that obtained here.

Analysis of variance, summarized in Table I, showed no significant differences over listeners, but a significant difference over speakers. Figure 3 illustrates the differences over speakers;











TABLE I. ANALYSIS OF VARIANCE SUMMARY FOR LISTENERS' RESPONSES TO SENTENCE STIMULI.

SOURCE                 SS       df     MS       F-ratio
Between Speakers       4.959     7    .7084     4.627*
Between Listeners      3.21     11    .2918     1.9059
Residual              11.791    77    .1531
Total                 19.96     95

*p < .05








[Figure 3: Listener performance for sentences produced by each speaker.]


the range of scores yielded by the speakers' utterances was 87 to 100%. The range of listener performance was 90 to 100%.

Listener performance recorded for the sentences was by far higher than for any other stimulus used in this study. (The next best stimulus type in terms of identification performance was the monosyllable /va/, for which the grand mean was 58%.) This seems to point to the contribution of dynamic variations in articulation, for here the listeners could avail themselves not only of such characteristics as fundamental frequency and formant frequencies, but also of such suprasegmental features as tempo and intonation. A systematic investigation of the contributions to speaker identification of such extra-phonemic features would indeed constitute a viable problem for future research. It would, for example, be interesting to amplitude modulate white noise against sentences, thus preserving tempo and amplitude characteristics, and discover whether or not speakers may be identified solely on this basis.
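A sketch of the suggested manipulation is given below, under the assumption of digitized sentences; the envelope cutoff and the scipy-based smoothing are illustrative choices, not a tested procedure.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def noise_with_speech_envelope(x, rate, env_cutoff_hz=30.0):
        """Amplitude-modulate white noise with the smoothed envelope of a
        sentence, preserving tempo and amplitude cues while removing
        spectral detail."""
        sos = butter(2, env_cutoff_hz, btype="low", fs=rate, output="sos")
        env = np.maximum(sosfiltfilt(sos, np.abs(np.asarray(x, dtype=float))), 0.0)
        return np.random.randn(len(x)) * env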


Vowels


Overall listener performance for the vowel stimuli is shown in Figure 4. The critical value shown in this figure, and in subsequent ones, was determined by reference to a table of cumulative binomial distributions (Staff of the Computation Laboratory, 1955), and represents the minimum performance level for rejecting the null hypothesis (no performance above chance levels) at a significance level of .05. The overall means associated with each type of stimulus in Figure 4 are 40.2% for the voiced vowels, 21.8% for the whispered vowels, and 20.7% for the filtered vowels.







[Figure 4: Overall listener performance for vowel stimuli (voiced, whispered, and filtered); the asterisk marks the critical value.]









The data in Figure 4 were derived from separate experimental tapes, and the interactions which may have obtained among voiced, whispered, and filtered stimuli are hence unknown. It is nonetheless interesting to note that performance where source characteristics alone were present (filtered vowels) and performance where vocal tract characteristics alone were present (whispered vowels) sum almost exactly to the performance where both are present (voiced vowels). From this evidence, one might tentatively infer that the contributions of source and vocal tract transfer characteristics toward speaker identification are approximately equal and additive (an inference which is further supported by correlation of the acoustic characteristics of these stimuli to confusions among speakers, below). Additionally, it is clear from these data that speaker identification may be achieved on the basis of source characteristics alone and, likewise, on the basis of vocal tract transfer characteristics alone.


Voiced Vowels


Analysis of variance procedures for main effects showed a significant vowel-speaker interaction for the voiced vowels. Hence, tests for simple main effects, as outlined by Kirk (1968), were conducted, and are summarized in Table II. As indicated, there are significant differences over speakers at three of the vowels, and significant differences over vowels at six speakers. At each of the latter, a posteriori comparisons among vowels were performed using Scheffé's method (Hays, 1963). Results of this procedure, summarized in Table III, show a general trend for the low vowels, /a/ and /æ/, to














TABLE II. ANALYSIS OF VARIANCE SUMMARY FOR VOICED VOWEL STIMULI

SOURCE                  SS        df     MS        F-ratio
1.  Vowels (V)          74.5204    3    24.8401    11.5422*
2.  V at Speaker 1       4.667     3     1.5556      .7228
3.  V at Speaker 2      46.334     3    15.445      7.1765*
4.  V at Speaker 3       8.5       3     2.8333     1.3165
5.  V at Speaker 4      31.5       3    10.5        4.8789*
6.  V at Speaker 5      58.75      3    19.5833     9.0996*
7.  V at Speaker 6      16.896     3     5.632      2.6169*
8.  V at Speaker 7      31.084     3    10.3613     4.8145*
9.  V at Speaker 8      29.062     3     9.6873     4.5013*
10. Speakers (S)        92.8324    7    13.2617     6.1622*
11. S at /i/            61.1567    7     8.7366     4.0595*
12. S at /u/            18.458     7     2.6368     1.2252
13. S at /æ/           120.49      7    17.2128     7.9931*
14. S at /a/            45.0       7     6.4285     2.987*
15. Vowel-Speaker      152.2726   21     7.251      3.3692*
16. Residual           733.8766  341     2.1521

*p < .05








[TABLE III. A POSTERIORI COMPARISONS AMONG VOICED VOWELS AT EACH SPEAKER. *No significant differences among voiced vowels were found at speakers 1 and 3.]








yield better identification performance than the high vowels, /i/ and /u/, and a clear trend for /u/ to yield significantly poorer identification performance than the low vowels. This is somewhat at variance with the conclusion of Stevens et al. (1968) that front vowels generally result in higher identification performances than back vowels; it should be noted, however, that these investigators did not employ sustained, isolated vocalic utterances.


Whispered Vow,,els


The results of analysis of variance procedures for the whispered vowels are shown in Table IV. Significant differences among speakers obtain at all vowels but /u/. For three speakers, there are significant differences between the performances yielded by vowels.

Table V summarizes the results of a posteriori comparisons among whispered vowels, and indicates the same trends observed for differences among voiced vowels--viz., that the low vowels, /a/ and /æ/, yield better identification scores than the high vowels.

In connection with this general trend, an explanation for this high vowel-low vowel difference was sought by comparing the relative formant frequency amplitudes for the four vowels, voiced and whispered. These measures were extracted from narrow band spectrographic sections, and are shown in Table VI. As indicated, the formant amplitudes for the low vowels are considerably greater than for the high vowels, especially in the voiced condition. This distinction is present, but not as great, for the whispered vowels;














TABLE IV. ANALYSIS OF VARIANCE SUMMARY FOR WHISPERED VOWELS

SOURCE                     SS        df   MS        F-ratio
1.  Vowels (V)              7.8409    3    2.6136    2.5704
2.  V at Speaker 1          1.562     3     .5206     .5119
3.  V at Speaker 2         27.0       3    9.0       8.8512*
4.  V at Speaker 3         58.729     3   19.5763   19.2528*
5.  V at Speaker 4         34.75      3   11.5833   11.3919*
6.  V at Speaker 5          3.229     3    1.0763    1.0585
7.  V at Speaker 6          2.229     3     .743      .7307
8.  V at Speaker 7          3.417     3    1.139     1.1201
9.  V at Speaker 8          4.562     3    1.5206    1.4954
10. Speakers (S)          117.2479    7   16.7497   16.4729*
11. S at /i/               15.4067    7    2.2009    2.1645*
12. S at /u/               10.4893    7    1.4984    1.4736
13. S at /æ/              101.49      7   14.4985   14.2589*
14. S at /a/               67.4997    7    9.6428    9.4834*
15. Vowel-Speaker         127.6371   21    6.0779    5.9774*
16. Residual              346.7441  341    1.0168

*p < .05







TABLE V. A POSTERIORI COMPARISONS AMONG WHISPERED VOWELS AT SPEAKER*

*No significant differences among vowels were found at speakers 1, 5, 6, 7, and 8.







TABLE VI. AVERAGE FORMANT FREQUENCY AMPLITUDES FOR VOICED AND WHISPERED VOWELS (dB DOWN FROM F1 AMPLITUDE)

            VOICED                      WHISPERED
        /i/   /u/   /æ/   /a/       /i/   /u/   /æ/   /a/
F2       15    19     9     4         5     3     5     2
F3       23    30    15    15         7    13     6     7
Filtered Vowels

As indicated in the analysis of variance summary in Table VII, no significant differences among vowels were found for these stimuli. Since these stimuli do not contain formant structures, this result supports the notion that the differences among vowels found for the voiced and whispered stimuli are indeed due to formant amplitude differences.

Overall, performances for the vowel stimuli indicate that speaker identification is possible on the basis of source information only and on the basis of vocal tract characteristics only. Additionally, there is some indication that, for the voiced vowels, these contributions are equal and additive.


Consonants


The grand mean identification performance for the continuant consonants was 21.82%. Figure 5 summarizes overall listener performance for each consonant, and shows that all consonants yielded speaker identification performance at a level significantly above chance.

Analysis of variance for the consonantal stimuli, shown in Table VIII, indicates significant differences among speakers at the voiced consonants, /v/ and /z/, and significant differences among consonants for four speakers. Comparisons among consonant means at each of these speakers are shown in Table IX. The general trend observed is that the voiced consonants, /v/ and /z/, yield significantly higher identification performance than their voiceless cognates, /f/



















TABLE VII. ANALYSIS OF VARIANCE SUMMARY FOR FILTERED VOWELS

SOURCE              SS        df   MS       F-ratio
1. Listeners        18.5506   11   1.6864   1.5594
2. Vowels (V)        2.2606    3    .7535    .6585
3. Speakers (S)     63.8656    7   9.1236   7.9737*
4. Vowel-Speaker    24.0294   21   1.1442   1.058
5. Residual        368.7834  341   1.0814
6. Total           477.4896  383

*p < .05








































Figure 5: Overall listener performance for consonant stimuli (*critical value).













TABLE VIII. ANALYSIS OF VARIANCE SUMMARY FOR CONSONANT STIMULI

SOURCE                      SS        df   MS        F-ratio
1.  Consonants (C)          36.1119    3   12.0373    8.3038*
2.  C at Speaker 1           5.75      3    1.9166    1.3221
3.  C at Speaker 2          37.584     3   12.528     8.6423*
4.  C at Speaker 3           5.084     3    1.6946    1.169
5.  C at Speaker 4          11.167     3    3.7223    2.5678
6.  C at Speaker 5          21.396     3    7.132     4.9199*
7.  C at Speaker 6            .896     3     .2986     .2049
8.  C at Speaker 7          13.084     3    4.3613    3.0086*
9.  C at Speaker 8          29.729     3    9.9096    6.836*
10. Speakers (S)            11.7049    7    1.6721    1.1534
11. S at /s/                 5.4583    7     .7797     .5378
12. S at /z/                50.125     7    7.1607    4.9397*
13. S at /f/                 7.24      7    1.0342     .7134
14. S at /v/                37.458     7    5.3511    3.6914*
15. Consonant-Speaker       83.5781   21    4.218     2.9097*
16. Residual               494.3261  341    1.4496

*p < .05
















TABLE IX. A POSTERIORI COMPARISONS AMONG CONSONANTS AT SPEAKER*

*No significant differences among consonants were found at speakers 1, 3, 4, and 6.







and /s/.

These data would seem to indicate that source characteristics, at least for these stimuli, are more heavily weighed by the listener making identity judgments than are resonance characteristics of the vocal tract. It must be noted that such a conclusion constitutes an oversimplification, for the cognates employed here are not differentiated simply by the presence or absence of fundamental frequency. Glottal excitation, in the case of /v/ and /z/, also yields vocal tract resonances not present in /f/ and /s/.

Actually, it was rather surprising that /f/ and /s/ did yield speaker identification performance above a chance level. The articulatory constrictions which serve as a turbulent source for these phonemes are located very far forward in the vocal tract, and the anterior portion of the vocal tract which resonates to this excitation is quite short; hence one would expect that the articulatory and acoustic distinctions among speakers would be lost. Although the data indicate that such distinctions are not lost, they do indicate that differences are minimized. The performances recorded for /f/ and /s/ (15.2% and 16.2%, respectively) were the two lowest scores encountered in this study.


Utterance Intelligibility and Speaker Identification

The overall intelligibility levels for those utterances where the listeners were forced to make intelligibility decisions as well as speaker identity decisions are shown in Figure 6. As a group, the filtered vowels were not intelligible; individually, /u/

















Figure 6: Overall intelligibility levels of the utterances employed in speaker identity tasks (voiced, whispered, and filtered vowels; consonants /s/, /z/, /f/, /v/; *critical value).











was the only intelligible filtered vowel.

The listeners could produce only four types of responses for these stimuli: (a) speaker identification correct, stimulus intelligibility correct; (b) speaker identification incorrect, stimulus intelligibility correct; (c) speaker identification correct, stimulus intelligibility incorrect; and (d) speaker identification incorrect, stimulus intelligibility incorrect.

The proportion of each type of response for all utterances is shown in Figure 7. Of particular interest here are response type (b) (identification incorrect, intelligibility correct) and response type (c) (identification correct, intelligibility incorrect). For the voiced and whispered vowels and the continuant consonants, response type (b) is by far the most typical. This would indicate that stimulus intelligibility is not a sufficient preliminary to speaker identification. On the other hand, the filtered vowels, unintelligible as a group, yielded speaker identification performances which were significantly above chance. The inference here is that stimulus intelligibility is not a necessary preliminary to speaker identification.
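The four response types amount to a two-by-two classification of each trial (speaker identification correct or not, crossed with stimulus intelligibility correct or not). A small sketch of how such proportions might be tabulated; the function and variable names are illustrative and are not taken from the study:

    from collections import Counter

    def response_proportions(id_correct, intelligibility_correct):
        """Tally the four response types (a)-(d) described above and
        return each as a proportion of all trials."""
        labels = {
            (True, True): "(a) ID correct, stimulus correct",
            (False, True): "(b) ID incorrect, stimulus correct",
            (True, False): "(c) ID correct, stimulus incorrect",
            (False, False): "(d) ID incorrect, stimulus incorrect",
        }
        counts = Counter(labels[(i, s)]
                         for i, s in zip(id_correct, intelligibility_correct))
        total = len(id_correct)
        return {label: counts[label] / total for label in labels.values()}

    # Illustrative call with invented judgments for four trials
    print(response_proportions([True, False, False, True], [True, True, False, False]))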

The evidence noted here suggests, then, that speech intelligibility and speaker identification are not necessary concomitants--that is, it is possible to have one without the other. That they are qualitatively different percepts is also reinforced by the fact that the cues which have been traditionally viewed as important to the perception of an utterance generally do not correlate highly, as mentioned below, with the speaker confusions generated by the









Figure 7: Proportion of listener response types for utterances where intelligibility and identity judgments were made (legend: ID correct, stimulus correct; ID incorrect, stimulus correct; ID correct, stimulus incorrect; ID incorrect, stimulus incorrect).


identification tasks.

In the introduction to this study, it was noted that most models of phoneme recognition involve, as a preliminary, analysis of the input in the time-frequency-amplitude domain. One might ascribe the process of speaker identification entirely to such a preliminary analysis component if and only if evidence of acoustic invariances to speaker identification had been found. Although some of the acoustic parameters extracted in this study did correlate rather well with speaker identification performance, the correlations were not high enough to characterize those parameters as invariants to the process. It seems clear, then, that time-frequency-amplitude information is also only a preliminary to speaker identification, and that this information must undergo further analysis, or sharpening, before decisions are reached.

The nature of this additional analysis is unknown. Further insight into this problem may be offered by dichotic listening tests in which the subjects' task is to identify speakers, not utterances. In the dichotic paradigm, pairs of speech samples are delivered simultaneously to the listener, one to the right ear and one to the left. The presence of an ear advantage (that is, if the listener reports the stimuli presented to one ear more accurately than the stimuli presented to the other ear) is taken to indicate that perceptual processing is mediated chiefly at the cortex contralateral to the ear showing the advantage. It has been shown that there is a right ear-left hemisphere effect for categorical or coded speech materials (as discussed by Liberman et al., 1967), for example,







consonant-vowel monosyllables; the most cogent inference drawn from these data is that the left hemisphere is specialized for linguistic processing (Studdert-Kennedy and Shankweiler, 1970). Further, Kimura has shown that a left ear-right hemisphere advantage is found for musical passages, and hence the right hemisphere is considered to be specialized for the perception of auditory patterns.

Darwin (1969) has shown that there exists no ear advantage for vowels, indicating that perceptual processing for these is perhaps occurring at a sub-cortical level.

Hence, dichotic presentations involving speaker identification judgments might be expected to yield information concerning the locus and nature of the processing involved in those judgments. A right ear advantage would indicate that speaker identification is associated with linguistic cues; a left ear advantage would indicate that the processing is chiefly a matter of auditory pattern analysis (and would point toward the importance of suprasegmental features); finally, the absence of an ear advantage would indicate that the processing is either low-level or involves a combination of linguistic and pattern analyses.



Sample Time Interval

Figure 8, a summary of listener performance for the natural and synthetic monosyllables and equivalent durations (1250 msec.) of the isolated phonemes from which the latter was generated, defies a simple interpretation. There is evidence of a "stairstep" function in Figure 8. It is interesting to note, for example, that if the












































Figure 8: Overall listener performance for equivalent durations of monosyllables and isolated phonemes (/v/, /a/, /va/, /v+a/; *critical value).


overall performance for /v/ is averaged with the overall performance for /a/, the result is very nearly the performance yielded by the synthetic monosyllable, /v+a/. This seems to indicate that the listeners were reaching their decisions for this stimulus on the basis of steady state acoustic characteristics. Although they were indeed treating /v+a/ as a two-phoneme utterance, note that the overall performance effect was not additive, but averaged.

On the other hand, there was a significant difference between /va/ and /v+a/, as indicated by the analysis of variance summary in Table X. The most cogent explanation of this difference is that listeners are not reaching speaker identification judgments for /va/ only on the basis of the target acoustic values of this monosyllable's constituent phonemes, but also on the basis either of the added formant transition values or of suprasegmental features.

In general, the trend indicated by these data would tend to support the notion that utterance duration is an important variable in speaker identification in that it allows listeners to sample larger segments of a speaker's phonemic repertoire. Furthermore, this added information is based not on steady state phonemic cues but on some more integral basis.

Acoustic Analyses

Results of acoustic analyses of the utterances used in this investigation are tabled in Appendix D. (In connection with the latter, the fundamental frequencies extracted for the voiced vowels are the same as those for the low-pass filtered vowels, since the














TABLE X. ANALYSIS OF VARIANCE SUMMARY FOR MONOSYLLABLES

SOURCE                      SS       df   MS        F-ratio
1.  Monosyllables (M)       64.172    1   64.172    31.0956*
2.  M at Speaker 1          15.042    1   15.042     7.2888*
3.  M at Speaker 2           4.166    1    4.166     2.0187
4.  M at Speaker 3          15.041    1   15.041     7.2883*
5.  M at Speaker 4          18.375    1   18.375     8.9039*
6.  M at Speaker 5           0.0      1    0.0       0.0
7.  M at Speaker 6          54.0      1   54.0      26.1665*
8.  M at Speaker 7          12.042    1   12.042     5.8351*
9.  M at Speaker 8           5.041    1    5.041     2.4422
10. Speakers (S)            65.33     7    9.3328    4.5223*
11. S at /va/               71.9584   7   10.2798    4.9812*
12. S at /v+a/              52.906    7    7.558     3.6623*
13. Monosyllable-Speaker    59.535    7    8.505     4.1212*
14. Residual               340.522  165    2.0637

*p < .05







latter were generated directly from the former.)

In an attempt to discover the bases on which listeners were arriving at speaker identification judgments, the speakers were assigned ranks according to the order in which they would be expected to be confused with each other had some acoustic parameter been the basis of speaker identification. These rank orders in expected confusions among speakers were then correlated with the rank orders in actual confusions among speakers (contained in Appendix E). A high rank order correlation obtained on the basis of some acoustic parameter would indicate that the parameter was heavily weighed by the listener in reaching his decisions.

Vowels


Table XI summarizes the results of this procedure for the voiced vowels. Each cell entry in Table XI represents the degree of correlation between the actual confusions among speakers against each speaker (Xi) and the expected confusions among speakers predicted by the rank of some acoustic parameter against each speaker (Yi).

The statistic employed in Tables XI through XIV is Kendall's tau. The actual cell entries in the tables are the numerator of this statistic, S; Siegel (1956) notes that S has the same probability distribution as tau, and provides a table for the probability of obtaining any given S. The last column in Tables XI through XIV lists the probability of obtaining the actual S values under the null hypothesis of no association between ranks.
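The S referred to above is the numerator of Kendall's tau: for two rankings of the same eight speakers, S is the number of concordant pairs minus the number of discordant pairs, and tau = S / [n(n-1)/2]. A minimal sketch of the computation follows (ties are ignored, and the example rankings are invented for illustration):

    from itertools import combinations

    def kendall_S(rank_x, rank_y):
        """Concordant minus discordant pairs between two rankings of the
        same items -- the S on which Kendall's tau is based."""
        s = 0
        for i, j in combinations(range(len(rank_x)), 2):
            product = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
            if product > 0:
                s += 1   # concordant pair
            elif product < 0:
                s -= 1   # discordant pair
        return s

    # Invented rankings of the eight speakers: actual confusions (X) and
    # confusions expected from some acoustic parameter (Y).
    actual_rank = [1, 2, 3, 4, 5, 6, 7, 8]
    expected_rank = [2, 1, 3, 5, 4, 6, 8, 7]
    S = kendall_S(actual_rank, expected_rank)
    tau = S / (len(actual_rank) * (len(actual_rank) - 1) / 2)
    print(S, tau)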









TABLE XI. RANK ORDER CORRELATIONS BETWEEN ACTUAL CONFUSIONS AMONG SPEAKERS (Xi) AND EXPECTED CONFUSIONS AMONG SPEAKERS (Yi) FOR VOICED VOWELS (parameters fo, F1, F2, F3, and F2/F1 for each of /i/, /u/, /æ/, /a/; columns X1Y1 through X8Y8, Σ, S, and p(S)).


It should be noted at the outset that none of the tabled probabilities are equal to or smaller than .05. Although this does not invalidate the trends which may be drawn from these data, it does seem to indicate that, at least for the utterances and acoustic parameters employed in this study, there exists no set of acoustic invariances to speaker identification.

For the voiced vowels, the data in Table XI indicate that, in general, (1) fundamental frequency, (2) the second formant frequency (F2), and (3) the third formant frequency (F3) are equally good predictors of confusions among speakers. This is entirely consistent with the speaker identification results obtained with these stimuli, discussed above, and indicates that these parameters are indeed the basis of speaker identification judgments for voiced vowels.

Incidentally, it was noted in the introduction that Compton (1963) had used only the vowel /i/ in his investigation and had concluded that fundamental frequency was the basis for speaker identification decisions. The data here bear this out, but they also show that this conclusion does not generalize to vowels as a whole.

For the whispered vowels, the data in Table XII show that F3 is the best predictor overall for confusions among speakers. Once again, /i/ exhibits a trend distinct from the other vowels; for this whispered phoneme, the ratio F2/F1 is the best predictor of speaker confusions. Overall, though, the F2/F1 ratio is a poor predictor of confusions; as noted previously, this ratio has been













TABLE XII. RANK ORDER CORRELATIONS BETWEEN ACTUAL CONFUSIONS AMONG SPEAKERS (Xi) AND EXPECTED CONFUSIONS AMONG SPEAKERS (Yi) FOR WHISPERED AND FILTERED VOWELS (whispered vowels: F1, F2, F3, and F2/F1 for each vowel; filtered vowels: fo for each vowel; columns X1Y1 through X8Y8, Σ, S, and p(S)).













TABLE XIII. RANK ORDER CORRELATIONS BETWEEN ACTUAL CONFUSIONS AMONG SPEAKERS (Xi) AND EXPECTED CONFUSIONS AMONG SPEAKERS (Yi) FOR CONSONANTS (fo, F1, and F2 for /v/ and /z/; F1, BW1, F2, and BW2 for /s/ and /f/; columns X1Y1 through X8Y8, Σ, S, and p(S)).










considered as a primary cue to vowel intelligibility. There is an indication here, then, that utterance intelligibility and speaker identification are based on different cues.

The correlations obtained for the low-pass filtered vowels, also shown in Table XII, are the highest obtained for all utterances, demonstrating that the listeners are indeed using fundamental frequency cues in reaching identity judgments for these stimuli. Overall, the correlations obtained for the vowel stimuli offer more conclusive evidence for the notion that the contributions of source and vocal tract transfer characteristics are equal and additive. Also of interest is the finding that cues which are thought to be crucial to speech intelligibility are poor predictors of confusions among speakers.


Consonants

Table XIII shows that fundamental frequency is a better predictor of confusions than the formant structure for the voiced continuants, /v/ and /z/. For /s/ and /f/, however, the center frequency of the first formant accounts for more confusions than formant bandwidths.

The correlations obtained for /v/ are uniformly lower than those obtained for /z/, yet the fundamental frequency data for these stimuli were very similar. This would seem to indicate that there are other factors involved in speaker identification for these stimuli. As noted by Hughes and Halle (1956), the acoustic characteristics for these phonemes are quite complex, since they represent the











interaction of two sound sources (one quasi-periodic, at the larynx, and the other, turbulent, at the point of constriction) and the vocal tract resonances which they excite. That these resonances do play a role in speaker identification may be inferred by noting that the portion of the vocal tract anterior to the consonantal constriction associated with /z/, an alveolar phoneme, is considerably longer (and hence would more enhance interspeaker variations) than that associated with /v/, a labio-dental phoneme.

This same trend between degree of correlation and place of articulation is evident for the voiceless consonants, /s/ and /f/.

Monosyllables

As shown in Table XIV, confusions among speakers for the monosyllable, /va/, are highly predictable from fundamental frequency. Note also, however, that F2, F3, and the locus of the F2 transition also correlate reasonably well with the obtained confusions. Fundamental frequency and the F2 locus represent here characteristics of the entire utterance, and are not characteristics proper of its constituent phonemes. This reinforces the notion, discussed above, that multi-phonemic utterances yield higher speaker identification scores not because of target acoustic values but rather on the basis of phonemic interactions. These data are not adequate for firmly establishing the nature of such interactions, but they strongly suggest their existence.

Differences Among Listeners and Speakers

The analysis of variance procedures detailed above indicated significant differences among listeners for the voiced and whispered
















TABLE XIV. RANK ORDER CORRELATIONS BETWEEN ACTUAL CONFUSIONS AMONG SPEAKERS (Xi) AND EXPECTED CONFUSIONS AMONG SPEAKERS (Yi) FOR /va/ (parameters fo, F1, F2, F3, and the locus of the F2 transition; columns X1Y1 through X8Y8, S, and p(S)).







vowels, consonants, and monosyllables. For the vowels, speaker identification performance by listener, pooled over speakers, is shown in Figure 9. Listener performances for the consonant and monosyllabic stimuli are shown in Figure 10 and Figure 11, respectively.

The general trend evidenced in these representations is that listeners 1, 2, 3, 5, and 6 perform consistently well, while the performances of listeners 4, 7, and 11 appear to be consistently depressed.

In regard to these trends, it is interesting to note that each of the listeners whose performances were consistently high are better acquainted with the speakers, as a group, than are listeners 4, 7, and 11. Additionally, the latter are the only listeners who are not professionals or advanced students in the field of speech and hearing science, a group which has had considerable experience in serving as subjects in behavioral experiments (it should be noted, however, that listener 4 does have such experience).

The important feature in the listeners' performance is consistency. This would seem to indicate that the differences encountered among listeners are not due to such transient effects as fatigue or attention, but rather to experience both with the speakers and with stimulus-response paradigms.

Significant differences among speakers were found for the voiced and whispered vowels, voiced consonants, and the monosyllables. Identification performance by speaker for the vowel stimuli,










Figure 9: Performance by listener for voiced, whispered, and filtered vowels (*critical value).

Figure 10: Performance by listener for voiced and voiceless consonants (*critical value).

Figure 11: Performance by listener for monosyllables (/va/, /v+a/; *critical value).







pooled over listeners, is shown in Figure 12. Overall, the additivity effect of source and vocal tract transfer characteristics holds here. There is some evidence, however, that if a speaker's utterances showed some acoustic characteristic which was quite distinct from the others in the group, then the listeners tended to weigh that parameter more heavily when making identity judgments for that speaker. Note, for instance, that the performances yielded by speakers 2 and 5 tend to be higher, for voiced and filtered vowels, than those of the other speakers. Acoustic analyses revealed that the mean fundamental frequencies for speakers 2 and 5 (126 Hz and 144 Hz, respectively) are higher than those for the rest of the group.

When speaker 5's distinctive source information is absent--as in the whispered vowels--the performance yielded by his utterances deteriorates dramatically. The formant frequencies for each speaker's whispered vowels are plotted in Figure 13 and Figure 14. Note that the formant frequency values for the utterances of speakers 5 and 8--which were identified at a level below chance--tend to be unexceptional (but for F3 for speaker 5's utterance of /i/). On the other hand, the formant frequencies for speaker 2's whispered vowels, which yielded the highest identification scores among the group, tend to be exceptional; note the low F3 for /u/, the high F3 for /i/, and the very high F3 for /a/.

Performance by speaker for the consonants, pooled over listeners, is shown in Figure 15. The performances for the voiced consonants yielded by speakers 1, 2, 5, and 7 tend to be








Figure 12: Performance by speaker for voiced, whispered, and filtered vowels (*critical value).

Figure 13: Formant frequency values by speaker for whispered /i/ and /u/.

Figure 14: Formant frequency values by speaker for whispered /æ/ and /a/.

Figure 15: Performance by speaker for consonantal stimuli (voiced and voiceless consonants; *critical value).






higher than those of the other speakers. Fundamental frequency measures tend to offer an explanation for these differences: the fundamental frequencies for the voiced consonant utterances of speakers 1, 2, and 5 (137 Hz, 130 Hz, and 144 Hz, respectively) are the three highest measures for the group, while speaker 7's utterances show the lowest such measure (106.5 Hz) for the group.

For the monosyllabic stimuli, performance by speaker is shown in Figure 16. Differences among speakers for the 'synthetic' monosyllable tend to show the same pattern exhibited for the voiced consonants (Figure 15) and the low-pass filtered vowels (Figure 12), indicating that the same acoustic cue (i.e., fundamental frequency) is being used for /v+a/ as for these stimuli. For the natural monosyllable, no explanation for the differences among speakers is offered, however, by the distribution of the acoustic parameters which were extracted in this study. It may well be that speaker identification performance for this stimulus is determined by a suprasegmental feature such as inflection.

Overall, the trends exhibited in the differences among speakers, although they do not apply universally, confirm that the acoustic parameters which were found to correlate with speaker confusions are indeed the basis for listener judgments. There is also an indication that listeners more heavily weigh a given acoustic correlate if, for a given speaker, it stands in distinction from the general speaker group.















Figure 16: Performance by speaker for monosyllabic stimuli (*critical value).

















IV

SUMMARY AND CONCLUSIONS


An investigation was undertaken concerning the ability of subjects to identify speakers solely on the basis of voice. The purposes of this study were: (1) to establish the relative contributions of source and vocal tract transfer characteristics to speaker identification, (2) to establish whether or not speakers could be identified on the basis of isolated utterances of continuant consonants, (3) to investigate the nature of the relation between utterance intelligibility and speaker identification, and (4) to determine whether sample duration was a variable in speaker identification in absolute or relative terms.

The subjects for this study consisted of eight male speakers and twelve listeners; the latter had been in routine contact with the former for a period of at least six months. The following speaker utterances, equated for intensity, were presented to the listeners: two prose sentences; four vowels (/i, u, æ, a/) under three conditions, voiced, whispered, and low-pass filtered at 200 Hz; four consonants (/s, f, v, z/); and two monosyllables, one natural (/va/) and one generated by abutting two steady state phonemic excerpts (/v+a/).
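As a point of reference for the filtered condition, a 200 Hz low-pass filter of the kind described can be approximated digitally along the following lines; the filter order, sampling rate, and synthetic test signal are assumptions for illustration, since the original stimuli were prepared with analog equipment.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def lowpass_200hz(signal, sample_rate, order=6):
        """Zero-phase low-pass filter at 200 Hz, approximating the analog
        filtering used to leave only glottal-source information."""
        nyquist = sample_rate / 2.0
        b, a = butter(order, 200.0 / nyquist, btype="low")
        return filtfilt(b, a, signal)

    # Illustrative use on a synthetic 1250-msec vowel-like signal at 10 kHz:
    # a 120 Hz "fundamental" plus a 700 Hz component standing in for a formant.
    sr = 10_000
    t = np.arange(0, 1.25, 1.0 / sr)
    vowel_like = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 700 * t)
    source_only = lowpass_200hz(vowel_like, sr)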

The three vowel conditions were taken to simulate the presence only of (1) source information (filtered vowels),












(2) vocal tract transfer information (whispered vowels), or (3) both (voiced vowels). Except for the sentences, all stimuli were presented at a duration of 1250 msec. The inclusion of the monosyllables allowed for the evaluation of the contributions of the informational aspects of duration; if the latter were a variable only in absolute terms, no differences in speaker identification performance for single phoneme vs. two phoneme utterances would have been obtained.

The listeners were presented with forms listing each speaker by initials, and their task was to circle the speaker they felt produced each item. The listeners were also required to choose which stimulus item was presented for all the vowel and consonant stimuli employed in the study.

Acoustic analyses of the speakers' utterances were performed and the following parameters were extracted: fundamental frequency, the first three formant frequencies, the ratio of the second to the first formant frequency, formant bandwidths (for the voiceless consonants), and formant amplitudes (for the voiced and whispered vowels). The confusions among speakers predicted by each of these parameters were correlated with the actual confusions among speakers in an attempt to ascertain which acoustic characteristics serve as important cues to speaker identification.

The results of this study may be summarized as follows:

1. All stimuli yielded speaker identification performance at a level significantly above chance.

2. The sentence stimuli resulted in performance far above any other stimulus type.







3. The performances achieved for whispered vowels and filtered vowels were very nearly equal and summed to the performance achieved for voiced vowels. The correlations between acoustic characteristics and confusions among speakers revealed that fundamental frequency, the second formant, and the third formant were equally good predictors of speaker confusions. There was a general trend for low vowels to yield higher performances than high vowels.

4. The voiced continuant consonants yielded significantly higher performances than their voiceless counterparts. Fundamental frequency was the best predictor of speaker confusions for the voiced consonants; for the voiceless consonants, the first formant frequency was the best such predictor obtained, though the correlation was weak in absolute terms.

5. In regard to duration, /va/ yielded significantly better results than /v+a/, and also resulted in performances above equivalent durations of /v/ and /a/. Also, if the performances for /v/ and for /a/ were averaged, the result was very nearly the performance yielded by /v+a/.

6. Differences among listeners were accounted for in terms of their relative familiarity both with speakers and with behavioral paradigms. The trends present in the differences among speakers were largely accounted for in terms of the acoustic parameters of their utterances.

7. Little correspondence was found between the cues important for speech intelligibility and those thought to be important for speaker identification. Utterance intelligibility was found to be











neither a necessary nor a sufficient concomitant to speaker identification.

The major conclusions provided by this investigation are that, although one can point to acoustic correlates of speaker identification, there seem to be no acoustic invariants related to speaker identification; furthermore, speech intelligibility and speaker identification seem to be qualitatively different percepts. This would indicate that an adequate model for phoneme identification would not necessarily serve as an adequate model for speaker identification, and vice-versa. Further research into the nature and locus of speaker identification processing is strongly recommended, and a dichotic listening paradigm may prove particularly fruitful.

Other and more specific conclusions also seem warranted. First, speaker identification for vowels appears to be based on both fundamental frequency and formant frequency information; the influence of these parameters is both equal and additive. The general trend for low vowels to yield higher performance than high vowels may be accounted for by systematic differences in the formant amplitudes of these vowels.

Secondly, speaker identification is possible on the basis of isolated continuant consonants. The level of performance achieved for these stimuli, although above chance, was the lowest encountered in this study. Although identification of the voiced consonants correlates well with fundamental frequency, additional research into the nature of the acoustic cues which allow identification of the voiceless consonants is needed.










Thirdly, the sample time interval contributes to speaker identification in a relative sense only--i.e., what is important is not the absolute duration of this interval, but the nature of the utterance contained in the interval. Specifically, this study demonstrates that, for a given duration, multiphonemic utterances yield better speaker identification performances than single phoneme samples; further, this added information is based on some integral measure of a multiphonemic utterance and not on the target values of its constituent phonemes.

Finally, the very high performance yielded by the sentence stimuli points to the possible importance of suprasegmental cues such as tempo and inflection to speaker identification.









































APPENDIX A

INSTRUCTIONS TO EVALUATORS OF SPEAKERS' UTTERANCES











NAME:

DATE:



Instructions:

The speakers you will hear will produce each of the listed utterances five times each. The utterances are approximately three seconds in duration, but only the initial 1.5 seconds of each will be used in experimentation. If you feel that any repetition is not a representative production of the utterance, please place an X in the appropriate cell. If you feel that it is a representative production, please assign it a rating on a scale of 1 to 5, where 1 represents minimal acceptability and 5 represents maximal acceptability. If you have any questions, please ask them now.









































APPENDIX B

INSTRUCTIONS TO LISTENERS
















INSTRUCTIONS


This is an experiment in speaker identification. You will be listening to various speech samples, each produced by one of the eight speakers pictured on the wall in front of you. Your task is to listen to each sample and then decide which speaker produced it; in some cases, you will also decide which phoneme was produced.

Indicate your decision by circling the appropriate speaker and phoneme. Please respond to all items (if you are not sure, guess). If, after circling an item, you change your mind, cross out the former decision and circle the new one.

There will be a four second interval between each item. After every 25th sample, there will be a ten second pause, so that you may turn to a new response page. If you find that you have not completed a response page when the longer pause occurs, notify the experimenter immediately.

Please note that the pictures of the speakers and their identifying initials appear on the wall in the same left to right order as the initials on the response form (please also note that John Booth and John Brandt have the same initials; the latter is designated here as 'DrB' and not as 'JB').

The first series of samples you will hear will be the utterance: "When the sunlight strikes raindrops in the air, they act like a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors."

Before the start of subsequent series, the experimenter will inform you of the specific samples you will be hearing.

If you have any questions, please ask them now.










































APPENDIX C

EXAMPLE OF LISTENER RESPONSE FORM












CODE

Speaker initials:  RI   EM   BB   JB   DrB   PW   CL        Phonemes:  s   f   z   v
(Each response line of the form repeated the speakers' initials and the phonemes s, f, z, v to be circled.)








































APPENDIX D

MEASURES DERIVED FROM ACOUSTIC ANALYSES OF SPEAKERS' UTTERANCES















TABLE XV. FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY (Hz) MEASURES FOR THE VOICED* AND WHISPERED VOWELS

                          SPEAKER
                  1     2     3     4     5     6     7     8
VOICED
/i/   fo        101   124   113   105   141   114   109   130
      F1        240   280   280   280   320   280   320   320
      F2       2440  2560  2800  2520  2560  2440  2040  2720
      F3       2960  3160  3640  3880  3160  3040  2960  3680
/u/   fo        103   122   118   107   149   115   112   128
      F1        200   280   280   280   280   320   320   320
      F2        960  1040   960   760  1120  1000   880  1000
      F3       2360  2320  2560  1640  2800  2560  3080  2600
/æ/   fo         95   132   118   105   142   113   109   117
      F1        840   720   760   760   800   800   680   800
      F2       1640  1680  1880  1960  1720  1840  1520  1680
      F3       2560  2440  2520  2600  2400  2560  2680  2440
/a/   fo         96   126   116   104   143   113   106   127
      F1        680   800   640   800   780   760   640   800
      F2       1140  1080  1160  1330  1200  1160  1060  1180
      F3       2400  2640  2560  2320  2420  2600  2120  2600
WHISPERED
/i/   F1        240   320   320   400   360   320   280   320
      F2       2320  2480  2600  2680  2840  2600  2720  2720
      F3       2840  3440  3040  3240  3840  3120  3480  3520
/u/   F1        400   400   320   440   400   360   360   440
      F2       1480  1120  1040   920  1080  1080   880   960
      F3       2520  2240  2560  2520  2680  2640  2600  3040
/æ/   F1         40   840   880   800   800   880   760  1000
      F2       1840  1680  2000  2400  1920  2160  1760  1920
      F3       2720  2400  2640  2920  2520  2760  2520  2800
/a/   F1        760   960   880   920   800   800   840   960
      F2       1240  2520  1280  1480  1200  1240  1280  1560
      F3       2560  3080  2720  2440  2600  2720  2640  2680

*Fundamental frequencies for the filtered vowels are the same as those reported for the voiced vowels.





















TABLE XVI. FUNDAMENTAL FREQUENCY, FORMANT FREQUENCY, AND FORMANT BANDWIDTHS (Hz) FOR THE CONSONANTS

                          SPEAKER
                  1     2     3     4     5     6     7     8
      fo        148   132   116    98   144   112   108   114
/v/   F1        280   320   440   280   280   320   320   280
      F2       1O!&   720  1120   920  1000  1280  1200  1000
/z/   F1        280   320   360   400   320   280   320   280
      F2        760   680  1000   800   800  1040  1000   960
/s/   F1        320   200   200   160   320   360   520   400
      BW1      1000  1520   640   320   560  1360  1240   920
      F2       4560  2400  3120  4400  5600  4640  2960  2840
      BW2       830   520  1360  1680   640   480   480   800
/f/   F1        320   280   320   360   320   320   360   360
      BW1       880  1200  1360  1520   960   840   840  1120
      F2       2800  3120  2720  2760  2480  2440  2720  2840
      BW2       760   520   840   960   800   920   760   480


























TABLE XVII. FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY (Hz) MEASURES FOR THE MONOSYLLABLE, /va/

                          SPEAKER
                  1     2     3     4     5     6     7     8
fo               95   129   113    98   140   111   101   114
F1              760   560   640   800   800   760   640   800
F2             1200  1120  1240  1240  1200  1200  1160  1200
F3             2520  2160  2520  2120  2400  2640  1840  2360
F2 LOCUS        920   800  1000   840   960  1000   760   960








































APPENDIX E

CONFUSIONS AMONG SPEAKERS












Figure 17: Confusions among speakers for voiced /i/.

Figure 18: Confusions among speakers for voiced /u/.

Figure 19: Confusions among speakers for voiced /æ/.

Figure 20: Confusions among speakers for voiced /a/.

Figure 21: Confusions among speakers for whispered /i/.

Figure 22: Confusions among speakers for whispered /u/.

Figure 23: Confusions among speakers for whispered /æ/.

Figure 24: Confusions among speakers for whispered /a/.

Figure 25: Confusions among speakers for filtered /i/.

Figure 26: Confusions among speakers for filtered /u/.

Figure 27: Confusions among speakers for filtered /æ/.

Figure 28: Confusions among speakers for filtered /a/.

Figure 29: Confusions among speakers for /s/.

Figure 30: Confusions among speakers for /z/.

Figure 31: Confusions among speakers for /f/.

Figure 32: Confusions among speakers for /v/.

Figure 33: Confusions among speakers for /v+a/.

Figure 34: Confusions among speakers for /va/.

PAGE 14

5 Review of the Literature Pollack et_ al . (1954) V7ere among the first to empirically establish that listeners could idnntify speakers on the basis of voice alone. Their speakers read identical, unspecified texts under two conditions: voiced and v/hispered. The independent variables were duration and filtering. Results for the voiced samples indicated that speaker identification scores were directly proportional to duration up to 1.2 seconds, beyond v;hich there was no improvement in performance. Listener performance under varying highand lov7-pass filtering conditions suggested that identification perform.ance was not "critically dependent upon any delicate balance of different frequency components in any single portion of the speech spectrum." On the other hand, for the whispered productions it was found that performance was equivalent to the voiced condition if duration was increased by a factor of three. The authors concluded that duration was a significant variable in speaker identification performance, in that it admitted larger samplings of a speaker's repertoire. Compton (1963) used only sustained productions of the vowel ,/i/, and varied duration and filtering conditions. He, too, found that performance increased with duration only up to durations of 1250 msec. Moreover, he found that filtering frequencies above 1020 substantially reduced speaker identification performance and filtering frequencies below 1020 Hz his utterances will be universally referred to here, for couformability, as speaker identification from voice.

PAGE 15

6 had no significant effect on performance. Compton concluded that confusions among spcakei-s were largely explained by similarities of their fundamental frequencies, yet he reported no attempt to account for the confusions on any other basis. Bricker and Pruzanski (1966) used a variety of speech materials as stimuli in a speaker identification task. Of particular interest here are their results when the stimuli x-jere vowels (V) as opposed to consonant vowel monosyllables (CV) : when the V's and C V s were presented to listeners at identical durations (14Q msec), they found that CV stimuli yielded significantly better identification performance than the V stimuli, lliey concluded that the number of phonemes in a speech sample correlated more closely with identification performance than did the sample's absolute duration. They also noted that confusions among speakers were not independent of the vowel uttered-i.e., the pattern of speaker confusions differed over vowels. Stevens e_t aj^ (1968) also noted vowel effects; specifically, they found that utterances containing a front vowel (/i/) yielded higher identification scores than utterances containing a back vowel (/a/). They speculated that this latter result may have been due to the importance of the second formant as a cue to speaker identification, since front vowels are characterized by a xvide frequency gap between the first and second formant and a high absolute frequency location of the second formant. Statem.ent o f the Problem In the acoustic domain, the basic parameters of a speech

PAGE 16

7 signal are time, frequency and amplitude. Given constant amplitude, the variables of interest are time and frequency. Phonemic effects have been noted in previous experimentation and will also be considered here. To summarize the literature x>7ith respect to these influences: ^Duration . Some evidence (Compton, 1963) suggests that speaker ; identification is a function largely of absolute duration. Bricker and Pruzanski (1966), Pollack et aj^ (1954), and by inference, Stevens et al . (1968) suggest that duration is important only insofar as it allows listeners to sample a larger repertoire of the speaker's phonemes. Perhaps a more cogent approach to the role of duration in speaker identification is to consider its contributions in terms of inform.ation--i.e. , increases in duration yield more information about the speaker. The problem then becomes how and where the added information is coded in the speech signal. B, Frequency . Although Compton (1963) has attributed speaker confusions to fundamental frequency, he reports no attempt to explain the confusions among speakers on any other basis, e.g., formant frequencies. Indeed, the speculations of Stevens et al_^ (1963) suggest that the second formant may deserve a closer look as a possible source of speaker confusion, and further research in this area seems indicated. C. Phonem.ic effects . Bricker and Pruzanski (1966) hava shown that speaker confusions vary with the vowel uttered, Stevens et al. (1968) noted that front vowels yield higher scores than back vowels. There are no known investigations concerning speaker

PAGE 17

identification from continuant: consonants (although Schwartz, 1968, has successfully used continuant consonants as stimuli in a sex identification experiment), or the effects of consonant to vowel formant transitions, or the relation, if any, between sample intelligibility and speaker identification. A systematic attempt to resolve the inconsistencies and gaps noted in this review may serve as a useful preliminary to the eventual establishment of a model for speaker identification and \'70uld seem to constitute a viable general problem for research. Hence, the research proposed here addresses itself to four specific problems for research: A. To what extent do source characteristics (e.g., fundamental . frequency) and vocal tract transfer characteristics (e.g., formant center f requeiii_ies) contribute to speaker identification and confusions among speakers? B. Is speaker identification possible from isolated samples of continuant consonants? C. Wliat is the relation between utterance intelligibility and speaker identification? D. Does the time interval of a sample (herein referred to as duration) contribute to speaker identification from an absolute standpoint, or is it significant only because it allows the listener to sample more of the phonemes in a speaker's repertoire? If the latter, is the excra information coded in target or transitional phonemic cues?

PAGE 18

II PROCEDURE Since the experimental methods employed in this study were rather involved, an overview of the general experimental approaches related to each specific problem for research is discussed initially in this section. This is followed by a description of the procedural details employed in the actual experimentation -e.g., subject and utterance selection, utterance treatments, data reduction, etc. The scope of the present experimentation was limited to the identification of a closed set of speakers from auditory cues alone. The speech samples addressed to the specific research problems were restricted to 'simple' utterances -vov;cls, consonants, and consonant-vowel monosyllables -for two reasons: (a) there is ample empirical evidence that speaker identification can be reliably accomplished from such samples, and (b) sample control (for duration and amplitude) and sample analysis (for the extraction of acoustic parameters) are m.uch more easily achieved for these utterances than for more complex materials (e.g., sentences). The lacter, however, contain speech characteristics, such as rate of articulation, inflectior, and dialect, which are i.ot present in the simple utterances used in this study. In order to obtain some estimate of the contributions of such characteristics, two prose sentences were also used in this study.

PAGE 19

10 General ExpGX'imental Approaches Concerning the first research question, tlie evidence noted above (Compton, 19o3) that swstained vov/els yield speaker identification information helps to narrow down the possible acoustic cues ivhich the lisfeners avail themselves of, for a 'steady state' vowel raay be described in terms of its source (fundamentcil frequency and spectral overtones) and vocal tract transfer (formant structure) characteristics (Fant, 1960). The question remains whether either of these characteristics alone is sufficient for speaker identification, or whether their contributions are additive. Although it is functionally impossible to uncouple the source (larynx) and the vocal tract, an attempt was made to simulate such an uncoupling. The speakers were asked to produce isolated vowels, voiced and whispered. These were presented to listeners under three conditions: (a) voiced: contained both source and transfer characteristics. (b) whispered: contained primarily transfer characteristics. (c) voiced, lew-pass filtered below the first formant (at 200Hz) : prim.arily simulated source characteristic. The second research problem was examined by having identity judgments made of continuant consonant cognate pairs. Continuants were selected because they could be produced and presented in isolation and their durational characteristics could be controlled without affecting cheir intelligibility. Cognate pairs were chosen because they differ in the presence or absence of fundamental frequency within each pair, so that the data collected here could also be addressed to the first research problem, i.e., if speaker

PAGE 20

11 identification is largely coded in fundamental frequency information, one would expect significantly better speaker identification scores for the voiced consonants than for the unvoiced consonants. The third problem concerns possible relationships between speech intelligibility and speaker identification. It was investigated by having listeners make intelligibility judgments for the va\?el and consonant stimuli simultaneously with speaker identity judgments. A limited inventory of vowels and consonants (four of each) were produced by the speakers. The listeners were apprised of the inventory and forced to make intelligiblity judgments as well as speaker identity judgm.ents. The fourth problem, regarding the role of sample timeinterval, was investigated by comparing the listeners' identification performance for equivalent durations of isolated phonemes vs cousonant-vowel monosyllables. T\
PAGE 21

12 Such an ensemble provides an opportunity to evaluate the informational aspects of duration in speaker identification. If duration is important only in absolute terms, one expects no significant difference in listener performance over these stimuli, since they have equal durations. If increases in duration are consequential because they allow listeners added information in terms of the steady state portions of a greater number of phonemes, then there would be no significant difference between the natural and "synthetic" monosyllables, but the latter tv;o V70uld yield higher scores than the continuant consoiiant or vowel stimulus alone. Finally, if the import of duration is that it allows the sampling of consonant to vowel transitional information, the natural monosyllable should yield higher identification scores than all other st iniuli . Details of th e E xperimental Procedure A . Sub iec ts . (1) Speakers. Kight male speakers were used in this study. Selection criteria were: (a) age range twenty to thirtyfive years, (b) no known speech defects, (c) speakers of general American English. (2) Listeners. The observers were twelve individuals free from any know.' hearing defects, who had been in routine contact with each of the speakers for a period of at least six months, it was rensoiiable to assume, then, that the listeners were familiar with the speakers' voices and that practice sessions in the actual experimentation would not be required.

PAGE 22

1 13. B. Stimulus Natei'ials . Selecti on of vowels . Because of the observations of Stevens et_ a l . (1968) and Bricker and Pruzanski (1956) noted above, two fi-ont vowels, /i/ and M^/ , and two back vowels, /u/ and /a/, were chosen for speaker production. Note that within the categories front vowel-back vowel, one of the selected vowels is a high vowel and the other a low vowel. Peterson and Barney (1952) demonstrated that, for any given speaker, fundamental frequency probably does not vary significantly over these vowels, that front vowels are character i::ed by a high second formant vs back vowels, and that high vowels are characterized by a low first formant frequency vs low vowels. Moreover, these particular vowels were chosen because they occupy extreme positions in the traditional vowel diagram; hence their formant frequency structures should show viide contrasts over a range of speakers. Additionally, the use of four vowels, as noted above, allows for incorporating a closed set of utterances for the generation of intelligibility and confusion measures. As outlined above, the talkers produced these vowels in tv/o conditions, voiced and whispered. .Selecti on of c o nsonants . Four continuant consonants were used: / f ^ v, s, z/ , Four consonants were used because, again, such a selection allowed for the generation of intelligibility and confusion data. These particular consonants were chosen because of their high frequency of occurrence in English and because they are relatively easy to produce and identify in isolation. Additionally, some data regarding their acoustic 'i characteristics are available in the literature . (Flanagan, 1965;

PAGE 23

14 Harris, 1953; Ueinz and Stc-vcus, 1961; Hughes and Halle, 1956). Selection of consonant: yowgj^rQoiio^^^^ The talkers produced the T?.onosyllable ,/va/ . This inonosyllable was selected because its two phonemes, /v/ and /a/, v/cre also produced in isolation by each, speaker. From these isolated productions, a synthetic monosyllable, /v + a/, was assembled by a procedure detailed below in section C. Overall, these stimulus materials sufficiently address themselves to the specific questions for research outlined above: the vowel and consonant conditions allov/ the evaluation of the contributions of source and transfer characteristics, the use of several vowels and consonants allow for the evaluation of possible relations between utterance intelligibility and speaker identification, and the m.onosyllables should yield some insight into the informational role of sample tim.e interval (duration). C. Preparing Stimulus Materials. Figure 1 is a schematic representation of the four principal steps employed in the procedure of this study. Recordin g conditions. Each talker was seated in a sound treated roora (lAC 1204 -A) and positioned approximately six inches from a dynamic microphone (Electro-Voice 664). All utterances were recorded at 7% inches per second on a single-track tape recorder (Magnecord PT6-1), located outside the room. The speakers were asked to produce the following utcerances: (a) name, (b) the first two sentences of the Rainbow Passage (Fairbanks, 1956), using natural rate and inflection, (c) five threesecond oroductions of each vowel, whispered, (e) five rhree-second

PAGE 24

15 A. UTTERAl'CES lAG Rorvi JALKER TAPE RECOPHER B. SELECTION OF PREFERRED DTTERANCES EVALUATCRS CRO TAPE PJ-CORDER D, STI?IlJl,US-PvESPCNSE PROCEDURE LISTENERS AR-3V.. TAPE RECORDER Figure 1: Schematic representation of general procedure.

PAGE 25

16 productions of each consonant, and (f) five productions of Ival , in which they prolonged the final vowel for two seconds. Speakers were instructed to achieve a constant VU meter deflection (-2) for all of their utteiances. Their performance v/as monitored by both speaker and experimenter. At the time of the recordings, a black and v^7hite photograph T/as taken of each speaker against a flat background with a high quality 35ram camera. Eight-by-ten inch enlargements of these photographs were obtained for use i.n the listening sessions. Utterance selection and treatment. The five productions of each stimulus were evaluated by a panel of four experienced listeners (phoneticians). The latter chose the most representative production of each stimulus from the five original productions made by the speakers. It V7as these "preferred" productions which were treated for duration, filtered where appropriate, and repeated and randomized for the actual experimental listening tapes. Instructions to the evaluators are contained in Appendix A. One duration, 1250 milliseconds, was used for all stimulus materials. This duration was selected on the basis of the observations of Compton (1963) and Pollack et aU (1954) that durations ex--:;aeding this had no effect cn identification performance. The selected duration was generated in the following way: the "preferred" utterances served as input to an electronic switch (GrasonStadler ) , whose duty cycle (1250 m.sec.) was manually initiated by the experimenter at the beginning of each utterance. Rise and fall times were set at 25 milliseconds. This selection was based

PAGE 26

17 on data from Bricker and Pruzanski (1966); they reported that rise aad fall tirr:es above 15 msec, did not introduce any artifactual consonantal effects. The output of electronic sv/itch was monitored by a cathode ray oscilloscope (Tektronix 56''0 . In order to construct the synthetic /v + a/ utterances, the following procedure was used: 600 millisecond excerpts of a sustained production of /v/ were obtained (as described above) and recorded on one channel of a Sony 530 recorder; similarly, a 620 msec, excerpt of /a/ was recorded on the second channel; a tape loop was constructed containing both of the phonemes thus recorded. This tape loop was placed on PAMMS (a pause adjustment laochanism and measurement system developed by Jensen et ad . , 1970), a device which allows a variable, monitorable delay to be introduced between the material on each channel. The delay was adjusted to approximately 5 msec, and the output of PAMl-IS was then recorded on a Sony 530 recorder at 7.5 i.p.s. To generate the filtered vowels, the 1250 msec, excerpts of the voiced vowels were low-pass filtered at 200 Hz. The frequency response of the filter (Krohn-Kite 3100) showed an attenuation rate of 23 dB/ octave. Generat ing the experim.ental tapes . Six experimental tapes, one for each stimulus category (i.e., sentences, voiced vowels, whispered vowels, filtered vowels, consonants, and monosyllables) ^ V7ere generated. In all cases, the interst imulus interval was four seconds; each production was repeated five times and randomized according to a random number table. Since the listener response

PAGE 27

18 forms contained 25 items per page, a ten second interval follov/ed every 25th item on all tapes. It should be noted that there was no direct correspondence between any experimental tape and the specific problems for research. The grouping of utterance types on these tapes was solely to facilitate the listeners' task and the analysis of data. Experimental Tape 1 consisted of the voiced vowels. The eight speakers had produced four different vowels and, since each stimulus was repeated five times, this tape consisted of 160 stimulus items. This construction also applies for Experimental Tapes 2, 3, and 4, which contained, respectively, whispered voxrcls, filtered vowels, and consonants. Playback time for T&pes I through 4 was approximately 15 minutes each. The fifth experimental tape contained all of the consonant-vowel monosyllables. There were two monosyllables for each speaker, and each monosyllable was repeated five times, so that Tape 5 consisted of 80 stimulus items. The playback tim.e for Tape 5 was approximately 7.5 minutes. Additonally, a control tape was assembled consisting of five randomized repetitions of each subject's sentence productions. The data collected from this tape were used as a performance baseline over speakers and 1 is teners-i.e . , the listeners here could avail themselves of relatively long durations of natural speech (including speaking rate and inflection, as well as phonemic effects), so their performance here, then, was used as a metric in assessing their proficiency for all other stimuli.

PAGE 28

19 D . Acous tical Analysis . Since one of the objectives of this study is to assess to what extent speaker identification and confusions among speakers may be accounted for by source and transfer characteristics of the vocal mechanism, each talker's "preferred" utterances from Section C, above, were subjected to acoustical analyses for the extraction of the following parameters: fundamental frequency, formant frequencies, and, for the voiceless consonants, formant bandwidths. Fundamental frequency was determined, for all voiced stimuli, by oscillographic analysis (Honey^"7ell Visicorder, operated at 15 inches per second). Formant frequencies were estimated from wideband spectrographs generated by a Kay PUectric 6061 spectrographic unit. The protocols suggested by Dew ejt a_l, (1969) for such measurements were followed. Formant bandwidths, for the voiceless fricatives, were estimated from narrow-band amplitude sections. Bandwidths were measured at a point 3 dB down from the peak of each formant (this was found to correspond to a distance of 5 mm on the spectrographic paper). E . Stimulus-Response Task . Portraits of each spor;ker were attached to the wall of the listening room (lAC 1204 A). Immediately below each portrait, the initials of the individual portrayed were printed in large block letters. Instructions given the listeners are shown in Appendix B. For the control tape and Experimental Tape 5, uhe listeners were presented with answer sheets denoting the stimulus number, followed by the eight sets of identifying speaker initials. The

PAGE 29

20 listeners' task was to simply circle the initials of the speaker whom they felt produced that stimulus item. For Experimental Tapes 1 through 4, the answer sheet consisted of the stimulus number, followed by the catalogue of utterances in each tape, and the speakers' initials. The listener was asked to circle the utterance heard and circle the initials of the speaker whom they felt produced it. An example of an answer sheet the listeners used when they were hearing Experimental Tape 4 is contained in Appendix C. F . Play back Conditions . All experimental tapes were played from an Ampex 351 tape recorder, through one channel of a Marantz Model 7 pre-amplif ier and Marantz 8-B power amplifier. The loudspeaker, an acoustic suspension device (AR-4) , was located in a sound treated room (lAC 1204 -A). Stimuli were presented to listeners at 70 dB SPL. One or two observers per listening session were used, seated equidistantly (approximately three feet) from the loudspeaker. A sound level meter (General Radio), positioned where listeners were to be seated, v;as used as a calibration device. Calibration was accomplished via a 1000 Hz tone which was recorded at the same VU level (-2) as the speech samples. The RMS voltage at the loudspeaker's input corresponding to 70 dB SPL in the room was noted on a vacuum tube voltmeter (Ballantine 321-C), and the latter was monitored by the experimenter throughout the listening sessions. All listeners heard the control tape first. The other experimental tapes were presented in random order. All listening

PAGE 30

21 sessions were conducted over a period of five days. All listeners indicated their responses by circling the appropriate items with a black ink pen. Listener response forms were graded by overlaying them on a coded key, on which the correct responses \i7ere circled in colored ink, each color denotiiig the particular utterance and speaker involved in a given trial. All responses by a given listener were then transferred to confusion matrices for each type of utterancefor example, actual speaker vs perceived speaker for whispered /i/, etc. The diagonals of these matrices represented correct listener responses and were used in all analysis of variance procedures. Although this grading and response summarization procedure was a rather laborious one, it should be noted that this study involved 9920 separate listener responses. It is felt that the procedure used here was not only time saving, in the long run, but also tended to minimize the intrusion of experimenter error, since the number of responses in any given matrix may be summed, after the scoring procedure, and used as a check for errors (for example, a given listener's responses for v/hispered /i/ must sum to 40, etc.). As a further check, all responses were also graded by an independent observer; the few discrepancies which were found were then resolved. Although the possibility of experimenter error in a paper and pencil response paradigm of the magnitude used here is always high, it is felt that the techni.que employed tended to minim.ize this unwanted source of variation to the point where it is not a significant factor in the results presented in this investigation. • Data Reduction . The paradigm used here is a randomized block factorial design,

PAGE 31

representing a mixed model (utterances are considered fixed effects, speakers and listeners are considered random effects). The consonantal conditions, for example, involve the following factors: eight levels of speakers and four levels of consonants. Each listener is considered as a block, since each is exposed to all stimuli. In his discussion of randomized block designs. Kirk (1968) notes that it is not necessary, for mixed models, to assume that the block and treatment effects are additive in order to test the treatment effects. Consequently, block (listener) by treatment (speaker and utterance) interactions were not considered in the statistical analyses employed in this study. Factorial designs of this sort allow the analysis of two or more variables, both in terms of their individual ('main') effects and in terms of their interactions with one another. The presence of significant differences among treatment means is determined by an analysis of variance procedure which utilizes the F distribution. Tliis procedure yields information concerning factor effects and interactions; when any of the fixed effects factors show significant effects, com.parisions among treatment level means were made in order to determine the contributions of the levels within each factor to the test scores. Tlie procedure employed here involved a p osteriori comparisons among means using Scheffe's method (Kays, 1963). Figure 2 shows the structural design for the consonantal stimuli. The designs for the vowels and monosyllables are similar.

PAGE 32

23 1 Figure 2: 'Ihe structural scheme of the factorial design; data for the consonant stimuli.

PAGE 33

24 Concerning the relation of speaker identification and utterance intelligibility, it should be noted that, for all vowel and' consonant stimuli, four types of responses are poss ible-that is, identification is either correct or incorrect and utterance intelligibility is either correct or incorrect. An inspection of the distribution of these responses v.'as made in order to ascertain whether or not utterance intelligibility is a necessary and/or sufficient condition for speaker identification. Confusion Matrices . Confusion matrices, similar to those employed by Miller and Nicely (1955), were generated for all vowels, consonants, and monosyllables. These matrices are frequency plots relating actual speakers to perceived speakers; hence, they not only indicate correct responses but also yield information concerning the pattern of errors. An attempt was made to account for the observed confusions among speakers on the basis of rank order correlation techniques (Kendall's tau--Siegel , 1956) relating confusions to the parameters obtained from acoustic analyses, above. The rationale here is that if, for a given utterance type, speaker identification is largely coded in some acoustic parameter, then that parameter should correlate highly with confusions among speakers. For the voiced vowels, confusions were related to fundamental frequency (fo), first formant frequency (Fl), second formant frequency (F2), third formant frequency (F3), and the ratio of formant two to form.ant one frequency (F2/F1) . For the whi^^pered vowels, confusions were related to Fl, F2, F3, and F2/F1. For the low-pass filtered vowels, fg only was employed.

PAGE 34

25 For the voiced consonants, /v/ and /z/, confusions ware related to f^j, ?1, and F2, The confusions among speakers for the voiceless consonants, I si and /f/, were related to Fl, F2, and the bandwidths (BW) of those formants, 3W1 and BW2, respectively. For the rnouosyllable, /va/, the following parameters were chosen: f^, the first three formants of the vowel portion, and the locus of the second formant transitions. The acoustic parameters employed here were chosen because they not only represent characteristics which are both easily extracted from the signal and frequently used in acoustic descriptions in the literature, but they also, in some instances, represent those acoustic parameters thought to be germane to speech intelligibility. Peterson and Barney (1952), for instance, concluded that vowels are differentiated from one another largely on the basis of the F2/F1 ratio. In connection with the voiceless fricatives, Heinz and Stevens (1961), and Flanagan (1965) have indicated that these are largely differentiated on the basis of upper formant frequency locations; Hughes and Halle (1955) concluded that voiceless and voiced fricatives are differentiated largely on the basis of the presence of substantial spectral energy in the latter below 1000 Hz. Stevens and House (1956) and Harris (1958) have shown that, in CV monosyllables, the perception of the consonant is signaled largely by the locus of the second formant t:'-ansition. Additional insight into the relation of speaker identification and utterance intelligibility was sought, then, by incorporating these acoustic characteristics into the attempt to account for the confusions among speakers in the identification task. If

PAGE 35

26 identification and intelligibility are based on similar acoustic cues, the confusions ijhould largely be explained by those cues.

PAGE 36

TIT RESULTS AND DISCUSSION It will be recalled that the sentence stimuli were used to obtain some estimate of the contributions to speaker identification of suprasegmental features, such as rate of articulation and inflection. The vov7cl stimuli are directed toward the first research problem, concerning the relative contributions of source and vocal tract transfer characteristics to speaker identification. The consonantal stimuli concern the second research problem but also have implications for the first problem. The monosyllabic stimuli are directed toward evaluating the informational aspects of duration. In this chapter, results are discussed initially by utterance type. Later subsections are concerned m.ore specifically with the relations obtaining between acoustic characteristics and identification, and between utterance intelligibility and identification. Sentences The grand mean identification performance for the sentence stimuli was 977o. Brickcr and Pruzanski (1966) also used sentence stimuli in their investigation, and their result (98%) is in good agreement with that obtained here. Analysis of variance, summarized in Table 1, showed no significant differences over listeners, but a significant difference over speakers. Figure 3 illustrates the differences over speakers; ?7

PAGE 37

23 TABLE I. ANALYSIS OF VARIANCE SIM-1/\RY FOR LISTENERS' RESPONSES TO SENTENCE STIMULI. SOURCE SS Between Speakers 4.959 Between Listeners 3.21 Residual 11.791 Total 19.96 df MS F-ratio 7 .7084 4.627* 11 .2918 1.9059 77 .1531 95 "P< . 05

PAGE 38

29 .. -J -j. jp , J1 2 3 4 5 6 7 8 SPEA--'ERS jbigure 3: Listener performance for sentences produced by each speaker.

PAGE 39

30 the range of scores yielded by the speakers' utterances was 87 to 100%. The range of listener performance was 90 to 100%. Listene.r perf onnance recorded for the sentences was by far higher than for any other stiv.iulus used in this study. (The next best stimulus type in terras of identification performance was the monosyllable /va/, for which the grand mean vias 58%.) This seems to point to the contribution of dynamic variations in articulation, for here the listeners could avail themselves not only of such characteristics as fundamental frequency and formant frequencies, but also of such suprasegmental features as tempo and intonation. A systematic investigation of the contributions to speaker identifications of such extra-phonemic features would indeed constitute a viable problem for future research. It would, for example, be interesting to amplitude modulate white noise against sentences, thus preserving tempo and amplitude characteristics, and discover whether or not speakers may be identified solely on this basis. Vowels Overall listener performance for the vowel stimuli is shown in Figure 4. The critical value shown in this figure, and in subsequent ones, was determined by reference to a table of cumulative binomial distributions (Staff or Computational Laboratory, 1955), and represents the minimum performance level for rejecting the null hypothesis (no performance above chance levels) at a significance level of .05. The overall means associated with each type of stimulus in Figure 4 are 40.2% for the voiced vowels, 2Io8% for the whispered vowels, and 20.7% for the filtered vowels.

PAGE 40

31

PAGE 41

32 The data In Figure 4 were derived from separate experimental tapes, and the interactions which may have obtained among voiced, whispered, and filtered stimuli are hence unknown. It is nonetheless interesting to note that performance, where source characteristics alone were present (filtered vowels) and performance v;here vocal tract characteristics alone were present (whispered vowels), sura almost exactly to the performance where both are present (voiced vowels). From this evidence, one might tentatively infer that the contributions of source and vocal tract transfer characteristics toward speaker identification are approximately equal and additive (an inference Xi?hich is further supported by correlation of the acoustic characteristics of these stimuli to confusions among speaker, below). Additionally, it is clear from these data that speaker identification may be achieved only on the basis of source characteristics and only on the basis of vocal tract transfer characteristics. Voiced Vowels Analysis of variance procedures for main effects showed a significant vowel-speaker interaction for the voiced vowels. Hence, tests for simple main effects, as outlined by Kirk (1968), were conducted, and are summarized in Table II. As indicated, there are significant differences over speakers at three of the vowels, and significant differences over vowels at six speakers. At each of the latter, a posteriori comparisons among vowels were performed using Scheffe's m.echod (Hays, 1953). Results of this procedure, summarized in Table III, shoij a general trend for low vowels, /a/ and /c5/, to

PAGE 42

33 o •1-1 •u nJ U I cxD in iri C7^ O a> LO ro CN IT) < CM oi ^ >0 CO C^ «3 U-) r-.-I iN o IT) CX5 o V (X M !:3 i-toinn CO ncor--vDooooLo i-i Ou-)-lCM ;S coiriLnooinu->v£)coocMr-~vocM o M o > O i w u <: M Pi < > o T3 CO cococoncnc")cococoi^r^r^r--r^i-ii-i CO Or~CO CslvO c-ivoco iricr\oovxicniriLOcr\ r--l^ Lovoc^Lou~lr^cooooOl-l<^<}•oc^Joo -<^<^M^co^-loovo^-lc^cMr-lcoou-)CMc^ <1coiri.-icnc~jcnwDi-ir\il l-J W H w u O CO -H O-J c^l nJ ci3 (t! ct3 a O Q) •D Ph a. d, W CO CO CO CO o > vi ;-( }_i ^ CD QJ CJ m CO ,^ 4!i ^ ^ n3 cvi cij n} d) O 0) w fju Ph a. M CO CO CO >>>>>>>> CO CO CO 3( »4 (1) CO nj U 4J 3 123 1— ICvlcO
PAGE 43

34 TABLE III, A POSTERIORI COMPARISONS A>IONG VOICED VOWELS. AT SPEAKER* 2 4 5 6 7 8 P^l>lul /a/>/-c$/ /i/>/u/ /^/>/u/ l^>lnl /35/>/i/ /a/>/u/ /^>/u/ /a/>/u/ /a/>/i/ /a/>/u/ -''no significant differences among voiced vowels Xv-ere found at speakers 1 and 3.

PAGE 44

35 yield better identification performance than high vowels, /i/ and /u/ , and a clear trend for /u/ to yield significantly poorer identification performance than the lov; vox-jals. This is somewhat at variance with the conclusion of Stevens et al^ (1968) that front vowels generally result Ln higher identification performances than back vowels; it should be noted, however, that these investigators did not employ sustained, isolated vocalic utterances. I-Thispered Vowels The results of analysis of variance procedures for the whispered vcv/els are shown in Table IV. Significant differences among speakers obtain at all vowels but /u/. For three speakers, there are significant differences between the performances yielded by vowels. Table V summarizes the results of a pos tcrior i comparisons among whispered vowels, and indicates the same trends observed for differences among voiced vowels--viz , , that lov/ vowels, /a/ and /^/ , yield better identification scores than high vowels. In connection with this general trend, an explanation for this high vowel-low vowel difference was sought in comparing the relative formant frequency amplitudes for the four vov/els, voiced and whispered. These measures were extracted from narrow band spectrographic sections, and are shown in Table VI. As indicated, the formant amplitudes for the low vowels are considerably greater rhan for the high vowels, especially in the voiced condition. This distinction is present, but not as great, for the whispered vov/els;

PAGE 45

36 V V J^ -' J' J' <^O^|^Jc0C^^n^~-^-l<}•O^L^v.oc3^<}-<| Or-<.-IO^r-IOOOOinCN coo v0cnoo^c^ocr^oooooc^^r-.va CM ONOrHr-l I— li— |i.OCNli-<^l-l CM -dCO o oi cTi criCJ>r^cNr^vocr\ C7>r-'^ c3 0) 'J u) 0) CD o > ca >>>>>>>> CO C/^ CO 00 CO pc3 CO u u cd td 111 3 •H (U Pi i-ic^Jco^j-'nMDr^-oooNOi-ic^ioTvtu-)*^ 7 -1 r-l 1-1 r 4 r-l r-l I-I

PAGE 46

37 TABLE V. A POSTERIORI COMPARISONS AMONG TOISPERED VOWELS. AT SPEAKER" 2 3 4 /5$/>/i/ /«/>/a/ /a/>/u/ /^>/u/ /a/>/i/ /a/>/ae/ ''no significant differences among vowels were found at speakers 1, 5, 6, 7, and 8.

PAGE 47

38 TABLE VI. AVEPAGE FORMANT FREQUENCY AMPLITUDES FOR VOICED AND WHISPERED VOWELS (dB DOWN FROM Fjl AMPLITUDE) . VOICED WHISPERED /i u ^ a/ /i u "(^ a/ F 23 30 15 1.5 7 13 6 7

PAGE 48

39 Fil tered Vowels As indicated ia the aiuilysis of variance surnmary in Table VII, no significant differences among vov.'els were found for these stimuli. Since these stimuli do not contain '"orraant structures, this rsult supports the notion that the differences among vcv;elo found for the voiced and whispered stimuli are indeed due to foriTiant amplitude differences . Overall, performances for the vowel stimuli indicate that speaker identification is possible on the basis of source information only and on the basis of vocal tract characteristics only. Additionally, there is some indication that, for the voiced vowels, these contributions are equal and additive. Consonants The grand mean identification performance for the continuant consonants was 21.82%. Figure 5 summarizes overall listener performance for each consonant, and shows that all consonants yielded speaker identification performance at a level significantly above chance. Analysis of variance for the consonantal stim.uli, shown in Table VIII, indicates significant differences among speakers at the voiced consonants, /v/ and /z/, and significant differences among consonants for four speakers. Comparisons among consonant means at each of these speakers are shown in Table IX. The general trend observed is that the voiced consonants, /v/ and /,'./, yield significantly higher identification performance than their voiceless cognates, /f/

PAGE 49

40 o •rH •<) U 0^ CO CO LO in u LO o 1 1-1 >-l CO 1-1 CO JD n cn M H i-J M Pi O fl4 M-l T3 r-l ro t-i 1-1 CO •vt 00 CO CO CO w Pi > o 03 CO >i) vO -<} > J W o CO 03 u 0) d 4J OJ •H f-5 ^ at CO ^ 0) to ex 1 CO 01 I a p. o CO > o H 1— I Osl CO
PAGE 50

41

PAGE 51

42 I 1 t^vOCO•^-^cNc^o^vD^~l— ioncvi cocovo i-iinoocO'X)cric^ooc--)oir>cNvl-Lor-cs i-it--LoOr-icnc»or^r^vi-r-iog w r-f C^J CO in 00 U i-l U u >-i U — V o CO cn ^ u 03 03 03 03 03 d CJ a o O CU a) to w D, p. a a. a p. to N IW u d CO CO CO C/i CO CO C/D C/3 OJ — ^ o CO u 4J u 4-1 U 4J 4J 4-1 d 4J 4-) 4J 4-1 o !-! t3 03 03 oJ . 03 03 03 (1) 03 (Ti 03 cj CO O a. U U CJ CJ CJ CJ U CJ) GO o:) CO CO CO t-i CM 4in CO C?i o r-t CM on .-I r-4 r-l r-l 1-1 u (U M cd
PAGE 52

43 o CO o o i (/> !z; O CO M Pi <« U M o M Pi W H CO o Oil CO w CO H CM 03 CO N CO to '4-1 A > > A N CO fcO c o B O o, CO CO CD O 4J d ca (U M cu d 4-1 4-1 O •H 4-1 (U • •U MS d CCJ o d •H CO •r-l d d CO
PAGE 53

44 and /s/. These data would seem to indicate that source characteristics, at least for these stimuli, are more heavily weighed by the listener making identity judgments than are resonant characteristics of the vocal tract. It must be noted that such a conclusion construes an oversimplification, for the cognates employed here are not differentiated simply by the presence or absence of fundamental frequency. Glottal excitation, in the case of /v/ and /z/, also yields vocal tract resonances not present in /f/ and /s/. Actually, it v/as rather surprising that /f/ and /s/ did yield speaker identification performance above a chance level. The articulatory constrictions which serve as a turbulent source for these phonemes are located very far forward in the vocal tract, and the anterior portion of the vocal tract which resonates to this excitation is quite short; hence one would expect that the articulatory and acoustic distinctions among speakers would be lost. Although the data indicate that such distinctions are not lost, they do indcate that differences are minimized. The performances recorded for /f/ and I si (15.27o and 16.2%, respectively) were the two lov/est scores encountered in this study. ]Jtterj?iice_Intoll igibility and Speaker I dentification The overall intelligibility levels for those utterances where the listeners Xv^re forced to make intelligibility decisions as well as speaker identity decisions are shewn in Figure 6. As a group, the filtered vowels V7cre not intelligible; individually, /u/

PAGE 54

45 [ [ o o — r o o 00 o o o o O CD o •H U •H !-i O w H CO o > o o > > w CJ C/3 W 1-4 P-i W M O a > CO •u •H 13 a. (0 H •o 0) o 1-1 0) 01 OJ u c cd M (U 4J U 3 U o > <3i •l-l 5 (90 0) •ri u 3 00 •H [1:4 o • . o CM i-l

PAGE 55

46 was the only intelligible filtered vowel. The listeners could produce only four types of responses for these stimuli: (a) speaker identification correct, stimulus intelligibility correct, (b) speaker identification incorrect, stimulus intelligibility correct, (c) speaker identification correct, stimulus intelligibility incorrect, and (d) speaker identification incorrect, stimulus identif icatioii incorrect. The proportion of each type of response for all utterances is shown in Figure 7. Of particular interest here are response type (b) (identification incorrect, intelligibility correct) and response type (c) (identification correct, intelligibility incorrect). For the voiced and whispered vowels and the continuant consonants, response type (b) is by far the most typical. This would indicate that stimulus intelligibility is not a sufficient preliminary to speaker identification. On the other hand, the filtered vowels, unintelligible as a group, yielded speaker identification performances which were significantly above chance. The inference here is that stimulus intelligiblity is not a necessary preliminary to speaker identification. The evidence noted here suggests then that speech intelligibility and speaker identification are not necessary concomitants-that is, it is possible to have one ^^7ithout the other. That they are qualitatively different percepts is also reinforced by the fact that the cues which have been traditionally viewed as important to the perception of an utterance generally do not correlate highly, as mentioned below, with speaker confusions generated by the

PAGE 56

47 4J o w o o iJ o rj o a V-i u i) !-i ^-1 o O a u o o a o d • e i s •H e •H 4J u JJ 4J CO CO JJ JJ o u 4J a u f / V > < \ 3 >-' j w 1 w J3 M -ft/ Q W O > 13 C nJ >. 4-J •H r-l ^ •H 60 •H r-l r-l (U rt •H Q) M 0) ca ^ 4J 4J !-i •H O 4J fX c o
PAGE 57

48 identiif icatiau tasks. In the introduction to this scudy, it was noted that most, models of phoneme recognition involve, as a preliminary, analysis of the input in the timefrequency-amplitude domain. One might ascribe the process of speaker identification entirely to such a preliminary analysis component if and only if evidence of acoustic invar iances to speaker identification had been found. Although some of the acoustic parameters extracted in this study did correlate rather v/ell with speaker identification performance, the correlations were not high enough to characterize those parameters as invariants to the process. It seems clear then that timefrequencyamplitude information is also only a preliminary to speaker identification, and that this information m.ust undergo further analysis, or sharpening, before decisions are reached. The nature of this additonal analysis is unknown. Further insight into this problem may be offered by dichotic listening tests in which the subjects' task is to identify speakers, not utterances. In the dichotic paradigm, pairs of speech samples are delivered simultaneously to the listener, one to the right ear and one to the left. The presence of an ear advantage (that is, if the listener reports the stimuli presented to one ear more accurately than the stimuli presented to the other ear) is taken to indicate that perceptual processing is mediated chiefly at the cortex contralateral to ear showing the advantage. It has been shown that there is a right ear-left hemisphere effect for categorical or coded speech materials (as discussed by Liberman et al, , 1967), for example.

PAGE 58

49 consoiiant-vowel monosyllables; the most cogent inference drawn from these data is that the left hemisphere is specialized for linguistic processing (Studdert-Kennedy and Shankvreiler , 1970). Kimura (1964) has shown that a left ear-rlglit hemisphere advantage is found for musical passages, and hence the left hemisphere is considered to be specialized for the perception of auditory pattern?. Darwin (1969) has shown that there exists no ear advantage for vov7els, indicating that perceptual processing for these is perhaps occurring at a sub-cortical level. Hence, dichotic presentations involving speaker identification judgments might be expected to yield information concerning the locus and. nature of the processing involved in those judgments. A right ear advantage would indicate that the speaker identification is associated with linguistic cues; a left ear advantage v/ould indicate that the processing is chiefly a matter of auditory pattern analysis (and would point toward the importance of suprasegmental features); finally, the absence of an ear advantage V70uld indicate that the processing is either lowlevel or involves a combination of linguistic and pattern analyses. Sample Time Interval Figure 8, a summary of listener performance for the natural and synthetic monosyllables and equivalent durations (1250 m.sec.) of the isolated phonemes froi.i which the latter was generated, defies a simple interpretation. There is evidence of a "stairstep" function in Figure 8, It is interesting to note, for example, that if the

PAGE 59

50 t3 (U 1-1 o to D t-l > y •H •H o w < W H H — ^. 1 CO CO (U r-( ,o nJ t-i 1-1 >^ w o £3 O 6 o CO f-3 O •rl 4J at 3 •o d (U 1-1 <« .^i a o*
PAGE 60

51 overall performance for /v/ is averaged with the overall performance for /a/, the result is very nearly the performance yielded by the synthetic monosyllable ,/v+a/ . This seems to indicate that the listeners were reaching their decisions for this stimulus on the basis of steady state acoustic characteristics. Although they were indeed treating /v-l-a/ as a tv;o phoneme utterance, note that the overall performance effect was not additive, but averaged. On the other hand, there was a significant difference between /va/ and /v-l-a/, as indicated by the analysis of variance summary in Table X. The most cogent explanation of this difference is that listeners are not reaching speaker identification judgments for /va/ only on the basis of the target acoustic values of this monosyllable's constituent phonemes, but also on the basis either of the added formant transition values or suprasegmental features. In general, the trend indicated by these data would tend to support the notion that utterance duration is an important variable in speaker identification in that it allows listeners to sample larger segments of a speaker's phonemic repertoire. Furthermore, this added information is based not on steady state phonemic cues but on some more integral basis. Acoustic Analyses Results of acoustic analyses of the utterances used in this investigation arc tabled in Appendix D. (In connection with the latter, the fundamental frequencies extracted for the voice vowels are the same as those for the low-pass filtered -vowels, since the

PAGE 61

CO (T) o in 00 CO 00 n o CO i—i 00 o O CM o c^i o\ Lo n vOr-l r-lr^Olr^OOO^LOCM CO CM -t . w o o -M n lU ctj J-l a CO w to i-l (U . 3 CO t3 O -r-l C W o OJ i-icNirn
PAGE 62

53 latter V7ere generated directly from the former.) In an attempt to discover the bases on v/hich listeners were arriving at speaker identif cation judgments, they were assigned ranks according to the order in which they would be expected to be confused with each other had some acoustic parameter been the basis of speaker identification. These rank orders in expected confusions among speakers were then correlated with rank orders in actual confusions (contained in Appendix E) among speakers. A high rank order correlation obtained on the basis of some acoustic parameter would indicate that the parameter was heavily weighed by the listener in reaching his decisions. Vowels Table XI summarizes the results of tnis procedure for the voiced vowels. Each cell entry in Table X represents the degree of correlation between the actual confusions among speakers against each speaker (Xi) and the expected confusions among speakers predicted by the rank of some acoustic parameter against each speaker (Yi). Tlie statistic employed in Tables XI through XIV is Kendall's tau. The actual cell entries in the tables are the denominator of this statistic, S; Siegel (1956) notes that S has the same probability distribution as tau, and provides a table for the probability of obtaining any given S. The last column in Tables XI through XIV lists the probability of obtaining the actual S values under the null hypothesis of no association between ranks.

PAGE 63

54 I C/3 .— I I.N VD cn ^ 1— I cn m CM A. CO -^f o> O CO vo <» rin on 00 >> 00 CO CM c\i ON CO -' CO f-^ r-l o o Q CO w O 1— 1 CO M O t — 1 pi-i erf o O in >< in X H O CO ^ 3 CO csl o O X C^J O d o w Q Q t-lM W O f-H r-l CO < X CO C3N '>0 O CO
PAGE 64

It should be noted at the outset that none of the tabled probabilities are equal to or smaller than .05. Although this does not invalidate the trends which may be drawn from these data, it does seem to indicate that, at least for the utterances and acoustic parameters employed in this study, there exists no set of acoustic invariances to speaker identification.

For the voiced vowels, the data in Table XI indicate that, in general, (1) fundamental frequency, (2) the second formant frequency (F2), and (3) the third formant frequency (F3) are equally good predictors of confusions among speakers. This is entirely consistent with the speaker identification results obtained with these stimuli, discussed above, and indicates that these parameters are indeed the basis of speaker identification judgments for voiced vowels. Incidentally, it was noted in the introduction that Compton (1963) had used only the vowel /i/ in his investigation and had concluded that fundamental frequency was the basis for speaker identification decisions. The data here bear this out, but they also show that this conclusion does not generalize to vowels as a whole.

For the whispered vowels, the data in Table XII show that F3 is the best predictor overall for confusions among speakers.

[Table XII. Rank correlations (Kendall's S) between expected and actual confusions among speakers for the whispered and low-pass filtered vowels, with the probability of each S under the null hypothesis of no association.]

Once again, /i/ exhibits a trend distinct from the other vowels; for this whispered phoneme, the ratio F2/F1 is the best predictor of speaker confusions. Overall, though, the F2/F1 ratio is a poor predictor of confusions; as noted previously, this ratio has been
considered as a primary cue to vowel intelligibility. There is an indication here, then, that utterance intelligibility and speaker identification are based on different cues. The correlations obtained for the low-pass filtered vowels, also shown in Table XII, are the highest obtained for all utterances, demonstrating that the listeners are indeed using fundamental frequency cues in reaching identity judgments for these stimuli. Overall, the correlations obtained for the vowel stimuli offer more conclusive evidence for the notion that the contributions of source and vocal tract transfer characteristics are equal and additive. Also of interest is the finding that cues which are thought to be crucial to speech intelligibility are poor predictors of confusions among speakers.

Consonants. Table XIII shows that fundamental frequency is a better predictor of confusions than the formant structure for the voiced continuants, /v/ and /z/. For /s/ and /f/, however, the center frequency of the first formant accounts for more confusions than formant bandwidths. The correlations obtained for /v/ are uniformly lower than those obtained for /z/, yet the fundamental frequency data for these stimuli were very similar. This would seem to indicate that there are other factors involved in speaker identification for these stimuli. As noted by Hughes and Halle (1956), the acoustic characteristics for these phonemes are quite complex, since they represent the
interactions of two sources (one, quasi-periodic, at the larynx and the other, turbulent, at the point of constriction) and the vocal tract resonances which they excite. That these resonances do play a role in speaker identification may be inferred by noting that the portion of the vocal tract anterior to the consonantal constriction associated with /z/, an alveolar phoneme, is considerably longer (and hence would more enhance interspeaker variations) than that associated with /v/, a labio-dental phoneme. This same trend between degree of correlation and place of articulation is evident for the voiceless consonants, /s/ and /f/.

Monosyllables. As shown in Table XIV, confusions among speakers for the monosyllable, /va/, are highly predictable from fundamental frequency. Note also, however, that F2, F3, and the locus of the F2 transition also correlate reasonably well with obtained confusions. Fundamental frequency and the F2 locus represent here characteristics of the entire utterance, and are not characteristics proper of its constituent phonemes. This reinforces the notion, discussed above, that multi-phonemic utterances yield higher speaker identification scores not because of target acoustic values but rather on the basis of phonemic interactions. These data are not adequate for firmly establishing the nature of such interactions, but they strongly suggest their existence.

[Tables XIII and XIV. Rank correlations (Kendall's S) between expected and actual confusions among speakers for the consonants and for the monosyllables, with the probability of each S under the null hypothesis of no association.]

Differences Among Listeners and Speakers

The analysis of variance procedures detailed above indicated significant differences among listeners for the voiced and whispered
vowels, consonants, and monosyllables. For the vowels, speaker identification performance by listener, pooled over speakers, is shown in Figure 9. Listener performances for the consonant and monosyllabic stimuli are shown in Figure 10 and Figure 11, respectively. The general trend evidenced in these representations is that listeners 1, 2, 3, 5, and 6 perform consistently well, while the performances of listeners 4, 7, and 11 appear to be consistently depressed. In regard to these trends, it is interesting to note that each of the listeners whose performances were consistently high is better acquainted with the speakers, as a group, than are listeners 4, 7, and 11. Additionally, the latter are the only listeners who are not professionals or advanced students in the field of speech and hearing science, a group which has had considerable experience in serving as subjects in behavioral experiments (it should be noted, however, that listener 4 does have such experience). The important feature in the listeners' performance is consistency. This would seem to indicate that the differences encountered among listeners are not due to such transient effects as fatigue or attention, but rather to experience both with the speakers and stimulus-response paradigms.

[Figure 9. Speaker identification performance (percent correct) by listener, pooled over speakers: vowel stimuli.]
[Figure 10. Speaker identification performance (percent correct) by listener, pooled over speakers: consonant stimuli.]
[Figure 11. Speaker identification performance (percent correct) by listener, pooled over speakers: monosyllabic stimuli.]

Significant differences among speakers were found for the voiced and whispered vowels, voiced consonants, and the monosyllables. Identification performance by speaker for the vowel stimuli,
pooled over listeners, is shown in Figure 12. Overall, the additivity effect of source and vocal tract transfer characteristics holds here. There is some evidence, however, that if a speaker's utterances showed some acoustic characteristic which was quite distinct from the others in the group, then the listeners tended to weigh that parameter more heavily when making identity judgments for that speaker. Note, for instance, that the performances yielded by speakers 2 and 5 tend to be higher, for voiced and filtered vowels, than those of the other speakers. Acoustic analyses revealed that the mean fundamental frequencies for speakers 2 and 5 (126 Hz and 144 Hz, respectively) are higher than those for the rest of the group. When speaker 5's distinctive source information is absent--as in the whispered vowels--the performance yielded by his utterances then deteriorates dramatically. The formant frequencies for each speaker's whispered vowels are plotted in Figure 13 and Figure 14. Note that the formant frequency values for the utterances of speakers 5 and 8--which were identified at a level below chance--tend to be unexceptional (but for F3 for speaker 5's utterance of /i/). On the other hand, the formant frequencies for speaker 2's whispered vowels, which yielded the highest identification scores among the group, tend to be exceptional; note the low F3 for /u/, the high F3 for /æ/, and the very high F2 for /a/. Performance by speaker for the consonants, pooled over listeners, is shown in Figure 15.

[Figure 12. Speaker identification performance by speaker, pooled over listeners: vowel stimuli.]
[Figure 13. Formant frequency values by speaker for whispered /i/ and /u/.]
[Figure 14. Formant frequency values by speaker for whispered /æ/ and /a/.]
[Figure 15. Speaker identification performance by speaker, pooled over listeners: consonant stimuli.]

The performances, for the voiced consonants, yielded by speakers 1, 2, 5, and 7 tend to be
higher than those of other speakers. Fundamental frequency measures tend to offer an explanation for these differences: the fundamental frequencies for the voiced consonant utterances of speakers 1, 2, and 5 (137 Hz, 130 Hz, and 144 Hz, respectively) are the three highest measures for the group, while speaker 7's utterances show the lowest such measure (106.5 Hz) for the group. For the monosyllabic stimuli, performance by speaker is shown in Figure 16. Differences among speakers for the 'synthetic' monosyllable tend to show the same pattern exhibited for the voiced consonants (Figure 15) and the low-passed vowels (Figure 12), indicating that the same acoustic cue (i.e., fundamental frequency) is being used for /v+a/ as for these stimuli. For the natural monosyllable, however, no explanation for the differences among speakers is offered by the distribution of the acoustic parameters which were extracted in this study. It may well be that speaker identification performance for this stimulus is determined by a suprasegmental feature such as inflection. Overall, the trends exhibited in the differences among speakers, although they do not apply universally, confirm that the acoustic parameters which were found to correlate with speaker confusions are indeed the basis for listener judgments; there is also an indication that listeners more heavily weigh a given acoustic correlate if, for a given speaker, it stands in distinction from the general speaker group.
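The per-speaker bookkeeping described above can be tabulated directly from confusion matrices of the kind given in Appendix E. The sketch below is only a minimal illustration of that tabulation; the confusion counts and fundamental frequency values in it are hypothetical, not figures from this study.

    # Minimal sketch: per-speaker percent-correct identification from a confusion
    # matrix, alongside each speaker's distance from the group-mean fundamental
    # frequency. All numbers here are hypothetical.
    import numpy as np

    # Rows = perceived speaker, columns = actual speaker (the layout used in Appendix E).
    confusions = np.array([
        [21,  4,  2,  8],
        [ 0, 30,  1,  2],
        [ 3,  1, 24,  5],
        [ 0,  2,  7, 14],
    ])

    percent_correct = 100.0 * np.diag(confusions) / confusions.sum(axis=0)

    f0 = np.array([101.0, 126.0, 113.0, 144.0])   # illustrative mean F0 per speaker (Hz)
    f0_distance = np.abs(f0 - f0.mean())          # distinctiveness relative to the group

    for spk, (pc, d) in enumerate(zip(percent_correct, f0_distance), start=1):
        print(f"speaker {spk}: {pc:5.1f}% correct, |F0 - group mean| = {d:5.1f} Hz")

In the terms used above, a speaker whose F0 lies far from the group mean and whose scores rise under the voiced and filtered conditions would be consistent with listeners weighting source information more heavily for that speaker.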


IV  SUMMARY AND CONCLUSIONS

An investigation was undertaken concerning the ability of subjects to identify speakers solely on the basis of voice. The purposes of this study were: (1) to establish the relative contributions of source and vocal tract transfer characteristics to speaker identification, (2) to establish whether or not speakers could be identified on the basis of isolated utterances of continuant consonants, (3) to investigate the nature of the relation between utterance intelligibility and speaker identification, and (4) to determine whether sample duration was a variable in speaker identification in absolute or relative terms.

The subjects for this study consisted of eight male speakers and twelve listeners; the latter had been in routine contact with the former for a period of at least six months. The following speaker utterances, equated for intensity, were presented to the listeners: two prose sentences; four vowels (/i, u, æ, a/) under three conditions, voiced, whispered, and low-pass filtered at 200 Hz; four consonants (/s, f, v, z/); two monosyllables, one natural (/va/) and one generated by abutting two steady-state phonemic excerpts (/v+a/). The three vowel conditions were taken to simulate the presence only of (1) source information (filtered vowels),
(2) vocal tract transfer information (whispered vowels), or (3) both (voiced vowels). Except for the sentences, all stimuli were presented at a duration of 1250 msec. The inclusion of the monosyllables allowed for the evaluation of the contributions of the informational aspects of duration; if the latter was a variable only in absolute terms, no differences in speaker identification performance for single-phoneme vs. two-phoneme utterances would have been obtained. The listeners were presented with forms listing each speaker by initials, and their task was to circle the speaker they felt produced each item. The listeners were also required to choose which stimulus item was presented for all the vowel and consonant stimuli employed in the study. Acoustic analyses of the speakers' utterances were performed and the following parameters were extracted: fundamental frequency, the first three formant frequencies, the ratio of the second to the first formant frequency, formant bandwidths (for the voiceless consonants), and formant amplitudes (for the voiced and whispered vowels). The confusions among speakers predicted by each of these parameters were correlated with the actual confusions among speakers in an attempt to ascertain which acoustic characteristics serve as important cues to speaker identification.

The results of this study may be summarized as follows:

1. All stimuli yielded speaker identification performance at a level significantly above chance (see the sketch at the end of this chapter).

2. The sentence stimuli resulted in performance far above any other stimulus type.
3. The performances achieved for whispered vowels and filtered vowels were very nearly equal and summed to the performance achieved for voiced vowels. The correlations between acoustic characteristics and confusions among speakers revealed that fundamental frequency, the second formant, and the third formant were equally good predictors of speaker confusions. There was a general trend for low vowels to yield higher performances than high vowels.

4. The voiced continuant consonants yielded significantly higher performances than their voiceless counterparts. Fundamental frequency was the best predictor of speaker confusions for the voiced consonants; for the voiceless consonants, the first formant frequency was the best such predictor obtained, though the correlation was weak in absolute terms.

5. In regard to duration, /va/ yielded significantly better results than /v+a/, and also resulted in performances above equivalent durations of /v/ and /a/. Also, if the performances for /v/ and for /a/ were averaged, the result was very nearly the performance yielded by /v+a/.

6. Differences among listeners were accounted for in terms of their relative familiarity both with the speakers and with behavioral paradigms. The trends present in the differences among speakers were largely accounted for in terms of the acoustic parameters of their utterances.

7. Little correspondence was found between the cues important for speech intelligibility and those thought to be important for speaker identification. Utterance intelligibility was found to be
neither a necessary nor a sufficient concomitant to speaker identification.

The major conclusions provided by this investigation are that, although one can point to acoustic correlates of speaker identification, there seem to be no acoustic invariants related to speaker identification; furthermore, speech intelligibility and speaker identification seem to be qualitatively different percepts. This would indicate that an adequate model for phoneme identification would not necessarily serve as an adequate model for speaker identification, and vice versa. Further research into the nature and locus of speaker identification processing is strongly recommended, and a dichotic listening paradigm may prove particularly fruitful.

Other and more specific conclusions also seem warranted. First, speaker identification for vowels appears to be based on both fundamental frequency and formant frequency information; the influence of these parameters is both equal and additive. The general trend for low vowels to yield higher performance than high vowels may be accounted for by systematic differences in the formant amplitudes of these vowels. Secondly, speaker identification is possible on the basis of isolated continuant consonants. The level of performance achieved for these stimuli, although above chance, was the lowest encountered in this study. Although identification of the voiced consonants correlates well with fundamental frequency, additional research into the nature of the acoustic cues which allow identification of the voiceless consonants is needed.
Thirdly, the sample time interval contributes to speaker identification in a relative sense only--i.e., what is important is not the absolute duration of this interval, but the nature of the utterance contained in the interval. Specifically, this study demonstrates that, for a given duration, multiphonemic utterances yield better speaker identification performances than single-phoneme samples; further, this added information is based on some integral measure of a multiphonemic utterance and not on the target values of its constituent phonemes. Finally, the very high performance yielded by the sentence stimuli points to the possible importance of suprasegmental cues such as tempo and inflection to speaker identification.
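The chance-level comparisons referred to in the summary above can be illustrated with a short sketch. With eight speakers the guessing rate is 1/8, and the probability of observing at least a given number of correct identifications under pure guessing follows the cumulative binomial distribution, which is the role served by the binomial tables cited in the bibliography. The trial and score counts below are hypothetical, not figures from this study.

    # Hypothetical chance-level check: with eight speakers the guessing rate is 1/8,
    # and P(at least k correct in n trials) follows the binomial distribution.
    from scipy.stats import binom

    n_trials = 384          # assumed number of judgments for one stimulus type
    k_correct = 70          # assumed number of correct identifications
    chance = 1.0 / 8.0

    p_value = binom.sf(k_correct - 1, n_trials, chance)   # P(X >= k_correct) under guessing
    expected = n_trials * chance

    print(f"{k_correct}/{n_trials} correct vs. {expected:.0f} expected by chance; "
          f"p = {p_value:.3g} under the guessing hypothesis")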


APPENDIX A INSTRUCTIONS TO EVALUATORS OF SPEAKERS' UTTERANCES

[The text of the instructions given to the evaluators of the speakers' utterances is not legible in the scanned copy.]

APPENDIX B INSTRUCTIONS TO LISTENERS

INSTRUCTIONS

This is an experiment in speaker identification. You will be listening to various speech samples, each produced by one of the eight speakers pictured on the wall in front of you. Your task is to listen to each sample and then decide which speaker produced it; in some cases, you will also decide which phoneme was produced. Indicate your decision by circling the appropriate speaker and phoneme. Please respond to all items (if you are not sure, guess). If, after circling an item, you change your mind, cross out the former decision and circle the new one.

There will be a four-second interval between each item. After every 25th sample, there will be a ten-second pause, so that you may turn to a new response page. If you find that you have not completed a response page when the longer pause occurs, notify the experimenter immediately.

Please note that the pictures of the speakers and their identifying initials appear on the wall in the same left-to-right order as the initials on the response form (please also note that John Booth and John Brandt have the same initials; the latter is designated here as 'DrB' and not as 'JB').

The first series of samples you will hear will be the utterance: "When the sunlight strikes raindrops in the air, they act like a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors." Before the start of subsequent series, the experimenter will inform you of the specific samples you will be hearing.

If you have any questions, please ask them now.


APPENDIX C EXAMPLE OF LISTENER RESPONSE FORM

[Example response page: each of the 25 items lists the eight speakers' initials (PW  CL  RI  EM  BB  JB  DrB  SB) and, for the consonant items, the four phonemes (s  f  z  v), to be circled by the listener; a 'CODE:' line also appears on the page.]


APPENDIX D MEASURES DERIVED FROM ACOUSTIC ANALYSES OF SPEAKERS' UTTERANCES

TABLE XV. FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY (Hz) MEASURES FOR THE VOICED AND WHISPERED VOWELS

Voiced vowels
               Speaker:    1     2     3     4     5     6     7     8
  /i/   F0               101   124   113   105   141   114   109   130
        F1               240   280   280   280   320   280   320   320
        F2               (entries not legible)
        F3              2960  3160  3640  3880  3160  3040  2960  3680
  /u/   F0               103   122   118   107   149   115   112   128
        F1               200   280   280   280   280   320   320   320
        F2               960  1040   960   760  1120  1000   880  1000
        F3              2360  2320  2560  1640  2800  2560  3080  2600
  /æ/   F0                95   132   118   105   142   113   109   117
        F1               840   720   760   760   800   800   680   800
        F2              1640  1680  1880  1960  1720  1840  1520  1680
        F3              2560  2440  2520  2600  2400  2560  2680  2440
  /a/   F0                96   126   116   104   143   113   106   127
        F1               680   800   640   800   780   760   640   800
        F2              1140  1080  1160  1330  1200  1160  1060  1180
        F3              2400  2640  2560  2320  2420  2600  2120  2600

Whispered vowels
  /i/   F1               240   320   320   400   360   320   280   320
        F2              2320  2480  2600  2680  2840  2600  2720  2720
        F3              2840  3440  3040  3240  3840  3120  3480  3520
  /u/   F1               400   400   320   440   400   360   360   440
        F2              1480  1120  1040   920  1080  1080   880   960
        F3              2520  2240  2560  2520  2680  2640  2600  3040
  /æ/   F1               840   840   880   800   800   880   760  1000
        F2              1840  1680  2000  2400  1920  2160  1760  1920
        F3              2720  2400  2640  2920  2520  2760  2520  2800
  /a/   F1               760   960   880   920   800   800   840   960
        F2              1240  2520  1280  1480  1200  1240  1280  1560
        F3              2560  3080  2720  2440  2600  2720  2640  2680

Note: Fundamental frequencies for the filtered vowels are the same as those reported for the voiced vowels.


TABLE XVI. FUNDAMENTAL FREQUENCY, FORMANT FREQUENCY, AND FORMANT BANDWIDTHS (Hz) FOR THE CONSONANTS

               Speaker:    1     2     3     4     5     6     7     8
  /v/   F0               148   132   116    98   144   112   108   114
        F1               280   320   440   280   280   320   320   280
        F2              1040   720  1120   920  1000  1280  1200  1000
  /z/   F0               125   128   117   103   143   115   105   111
        F1               280   320   360   400   320   280   320   280
        F2               760   680  1000   800   800  1040  1000   960
  /s/   F1               320   200   200   160   320   360   520   400
        BW1             1000  1520   640   320   560  1360  1240   920
        F2              4560  2400  3120  4400  5600  4640  2960  2840
        BW2              880   520  1360  1680   640   480   480   800
  /f/   F1               320   280   320   360   320   320   360   360
        BW1              880  1200  1360  1520   960   840   840  1120
        F2              2800  3120  2720  2760  2480  2440  2720  2840
        BW2              760   520   840   960   800   920   760   480


TABLE XVII. FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY (Hz) MEASURES FOR THE MONOSYLLABLE, /va/

               Speaker:    1     2     3     4     5     6     7     8
        F0                95   129   113    98   140   111   101   114
        F1               760   560   640   800   800   760   640   800
        F2              1200  1120  1240  1240  1200  1200  1160  1200
        F3              2520  2160  2520  2120  2400  2640  1840  2360
        F2 locus         920   800  1000   840   960  1000   760   960

APPENDIX E CONFUSIONS AMONG SPEAKERS

[Figure 17. Confusions among speakers for voiced /i/.]
[Figure 18. Confusions among speakers for voiced /u/.]
[Figure 19. Confusions among speakers for voiced /æ/.]
[Figure 20. Confusions among speakers for voiced /a/.]
[Figure 21. Confusions among speakers for whispered /i/.]
[Figure 22. Confusions among speakers for whispered /u/.]
[Figure 23. Confusions among speakers for whispered /æ/.]
[Figure 24. Confusions among speakers for whispered /a/.]
[Figure 25. Confusions among speakers for filtered /i/.]
[Figure 26. Confusions among speakers for filtered /u/.]
[Figure 27. Confusions among speakers for filtered /æ/.]
[Figure 28. Confusions among speakers for filtered /a/.]
[Figure 29. Confusions among speakers for /s/.]
[Figure 30. Confusions among speakers for /z/.]
[Additional confusion matrices (8 x 8, perceived by actual speaker) for /f/, /v/, and the monosyllables follow; their entries are not legible in the scanned copy.]

BIBLIOGRAPHY

Bolt, R.H., Cooper, F.S., David, E.E., Denes, P.B., Pickett, J.M., and Stevens, K.N. (1970). "Speaker Identification by Speech Spectrogram," J. Acoust. Soc. Amer. 47, 597-613.

Bricker, P., and Pruzansky, S. (1966). "Effects of Stimulus Content and Duration on Talker Identification," J. Acoust. Soc. Amer. 40, 1441-1450.

Compton, A.J. (1963). "Effects of Filtering and Duration upon the Identification of Speakers, Aurally," J. Acoust. Soc. Amer. 35, 1748-1753.

Darwin, C.J. (1969). "Laterality Effects in the Recall of Steady-State and Transient Speech Sounds," J. Acoust. Soc. Amer. 46, 114(A).

Dew, D., Jensen, P., and Menon, K. (1969). "Objective Methods for Measuring Selected Acoustic Features from Sonagrams," Quart. Prog. Rep. Commun. Sci. Lab., Univ. of Fla. 7.1, 6-12.

Fairbanks, G. (1956). Drill and Articulation Handbook (Appleton-Century-Crofts, New York).

Fant, G. (1960). Acoustic Theory of Speech Production (Mouton & Co., 's-Gravenhage).

Flanagan, J.L. (1965). Speech Analysis, Synthesis and Perception (Academic Press, New York).

Halle, M., and Stevens, K.N. (1962). "Speech Recognition: A Model and a Program for Research," IRE Trans. Info. Theory 8, 155-159.

Harris, K.S. (1958). "Cues for the Discrimination of American English Fricatives in Spoken Syllables," Language and Speech 1, 1-7.

Hays, W.H. (1963). Statistics (Holt, Rinehart, and Winston, New York).

Heinz, J.M., and Stevens, K.N. (1961). "On the Properties of Voiceless Fricative Consonants," J. Acoust. Soc. Amer. 33, 589-596.

Hughes, G.W., and Halle, M. (1956). "Spectral Properties of Fricative Consonants," J. Acoust. Soc. Amer. 28, 303-310.

Jensen, P.J., Harrington, W.W., and Ruder, K.F. (1970). "Pause Adjustment Mechanism and Measurement System," Paper SU-7, Convention of Amer. Speech & Hearing Assoc.

Kersta, L.G. (1971). "Progress Report on Automated Speaker-Recognition Systems," J. Acoust. Soc. Amer. 49, 139(A).

Kimura, D. (1964). "Left-Right Differences in the Perception of Melodies," Quart. J. Exp. Psych. 16, 355-358.

Kirk, R.E. (1968). Experimental Design: Procedures for the Behavioral Sciences (Brooks-Cole, Belmont, California).

Liberman, A.M., Cooper, F.S., Shankweiler, D.P., and Studdert-Kennedy, M. (1967). "Perception of the Speech Code," Psychol. Rev. 74, 431-461.

Miller, G.A., and Nicely, P.E. (1955). "Analysis of Perceptual Confusions Among Some English Consonants," J. Acoust. Soc. Amer. 27, 339-352.

Peterson, G.E., and Barney, H.L. (1952). "Control Methods Used in a Study of the Vowels," J. Acoust. Soc. Amer. 24, 175-184.

Pollack, I., Pickett, J.M., and Sumby, W.H. (1954). "On the Identification of Speakers by Voice," J. Acoust. Soc. Amer. 26, 403-412.

Schwartz, M.F. (1968). "Identification of Speaker Sex from Isolated Voiceless Fricatives," J. Acoust. Soc. Amer. 43, 1178.

Siegel, S. (1956). Non-Parametric Statistics (McGraw-Hill, New York).

Staff of the Computation Laboratory (1955). Tables of the Cumulative Binomial Probability Distribution (Harvard University Press, Cambridge).

Stevens, K.N., and House, A.S. (1956). "Studies of Formant Transitions Using a Vocal Tract Analog," J. Acoust. Soc. Amer. 28, 578-585.

Stevens, K.N., Williams, C.E., Carbonell, J.R., and Woods, B. (1968). "Speaker Authentication and Identification: A Comparison of Spectrographic and Auditory Presentation of Speech Materials," J. Acoust. Soc. Amer. 44, 1596-1607.

Studdert-Kennedy, M., and Shankweiler, D. (1970). "Hemispheric Specialization for Speech Perception," J. Acoust. Soc. Amer. 48, 579-594.

Tosi, O., Oyer, H., Pedrey, C., Lashbrook, W., and Nicol, J. (1971). "Voice Identification Research: A Report to Law Enforcement Assistance Administration, United States Department of Justice," Grant Number NI 70-004.

BIOGRAPHICAL SKETCH

Conrad Louis LaRiviere was born on July 23, 1942, at Goffstown, New Hampshire. He graduated from Bishop Bradley High School, Manchester, New Hampshire, in 1960. From 1960 to 1964, Mr. LaRiviere participated in the Honors Science Program at Providence College, in Providence, Rhode Island, and received the Bachelor of Arts degree in June 1964, with a major in Biology. In September 1965, he matriculated to the University of Arizona, where he majored in Audiology. As a URA trainee, he worked at the University Speech and Hearing Clinic; he also served as Clinical Audiologist at the Thomas-Davis Clinic, Tucson, Arizona, from June 1966 to June 1967. He received the Master of Arts degree in June 1967.

From September 1967 to June 1969, Mr. LaRiviere held an appointment as Assistant Professor of Speech at the State University of New York at Albany, where he taught courses in Audiology and language. He first enrolled at the University of Florida in the summer of 1968, and returned to full-time graduate study in June 1969. Since that time he has pursued work for the degree of Doctor of Philosophy in the Communication Sciences Laboratory, Department of Speech.

Mr. LaRiviere is married to the former Priscilla Anne Laflamme and has one son, Christopher John.

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

    Harry Hollien, Chairman
    Professor of Speech

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

    John F. Brandt
    Associate Professor of Speech

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

    Paul J. Jensen
    Associate Professor of Speech

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

    Donald Dew
    Associate Professor of Speech

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

    [signature block not legible]

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

    Madelaine M. Ramey
    Assistant Professor of Psychology

This dissertation was submitted to the Department of Speech in the College of Arts and Sciences and to the Graduate Council, and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August, 1971

    Dean, Graduate School