Hemispheric involvement in the perception of synthetic syllables, natural syllables, and "chirps"

Material Information

Gelfer, Marylou Pausewang, 1953- ( Dissertant )
Hollien, Harry F. ( Thesis advisor )
Brown, W. S. ( Reviewer )
Fischler, Ira S. ( Reviewer )
Rothman, Howard B. ( Reviewer )
Sullivan, William J. ( Reviewer )
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
1984
Physical Description:
vii, 175 leaves : ill. ; 28 cm.


Subjects / Keywords:
Auditory perception ( jstor )
Chirp ( jstor )
Consonants ( jstor )
Frequency discriminators ( jstor )
Hemispheres ( jstor )
Phonemes ( jstor )
Sensory discrimination ( jstor )
Speech instruction ( jstor )
Syllables ( jstor )
Vowels ( jstor )
Dissertations, Academic -- Speech -- UF
Speech -- Physiological aspects ( lcsh )
Speech perception ( lcsh )
Speech thesis Ph. D
bibliography ( marcgt )
non-fiction ( marcgt )


The procedures utilized in this research permitted several assumptions underlying a proposed model of speech perception to be tested. It was hypothesized that the voiced stop consonants /b/ and /d/ would be differentiated in the left hemisphere, and that speech and nonspeech tasks involving identical ambiguous stimuli would reveal different perceptual processes and hemispheric asymmetry patterns. Hemispheric involvement in the processing of vowels and acoustic differences between stimulus sets also was examined. Finally, the effect of a task variable--stimulus difficulty/required attention--was assessed. In addressing these issues, average evoked responses (AER's) from the left and right hemispheres of 12 subjects were collected. The evoking stimuli consisted of the syllables /bi, bæ, bo, di, dæ, do/, both synthetic and spoken. In addition, a set of "chirps" (isolated F2-F3 transitions associated with the above syllables) was included. The chirp stimuli were presented twice: once with instructions to discriminate them as /b/ and /d/, and secondly with instructions to discriminate "high" vs. "low" onset frequencies. Subjects indicated whether they heard /b/ or /d/ (or "high" or "low") during the electrocortical recording procedure. The resulting AER's were later analyzed utilizing Principal Components Analysis, Analyses of Variance, and by preplanned and post hoc comparisons. Results revealed an early bilateral differentiation of /b/ and /d/, but inconsistent left hemisphere unilateral processing. Speech vs. nonspeech instructions for identical stimuli elicited dissimilar perceptual strategies; however, differences in hemispheric asymmetry did not reach significance. No evidence for hemispheric asymmetry in vowel discrimination was revealed, although acoustic differences between stimulus sets were discriminated in both bilateral and left hemisphere processes. Finally, stimulus difficulty/required attention appeared to influence patterns of hemispheric involvement.
It was concluded that stop consonant perception is mediated primarily through bilateral cortical processes except when discrimination is particularly difficult and that the perceptual results of this study support the concept of a "speech mode" of perception.
Thesis (Ph. D.)--University of Florida, 1984.
Bibliography: leaves 166-174.
Statement of Responsibility:
by Marylou Pausewang Gelfer.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright [name of dissertation author]. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
000479245 ( AlephBibNum )
11802445 ( OCLC )
ACP5969 ( NOTIS )



ACKNOWLEDGMENTS

I would like to thank my chairman, Dr. Harry

F. Hollien, for his support of this project, and my

committee members for their helpful comments and criticisms.

I would also like to express my appreciation to Robert

B. Netzer for his programming and extensive assistance with

the computer-related aspects of this research. Special

thanks go to all my subjects, who endured many hours of

electrode placement and re-placement, and endless stimulus

tapes. And finally, I would like to thank my husband, John

P. Gelfer, for his patience, understanding, and love through

a very difficult time.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTER I: INTRODUCTION
The Question of Invariance
Coarticulation
Categorical Perception
Levels of Processing in Speech Perception
Stimulus Expectation
Hemispheric Specialization
A Theory of Speech Perception
The Problem of Task Variables
Purpose

CHAPTER II: METHODS
Overview
Methods
The Principal Components Analysis Procedure

CHAPTER III: RESULTS
Preliminary AER Data Analysis
Analysis One: The Full Data Set
Analysis Two: Synthetic and Natural Syllables Only
Analysis of Perceptual Data

CHAPTER IV: DISCUSSION
Hemispheric Involvement in Stop Consonant Perception
Stimulus Expectation
Hemispheric Involvement in Vowel Discrimination
Hemispheric Involvement in Stimulus Class Differentiation
A (Revised) Theory of Speech Perception
Conclusions

APPENDIX

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Marylou Pausewang Gelfer

August, 1984

Chairman: Harry F. Hollien, Ph.D.
Major Department: Speech

The procedures utilized in this research permitted

several assumptions underlying a proposed model of speech

perception to be tested. It was hypothesized that the

voiced stop consonants /b/ and /d/ would be differentiated

in the left hemisphere, and that speech and nonspeech tasks

involving identical ambiguous stimuli would reveal different

perceptual processes and hemispheric asymmetry patterns.

Hemispheric involvement in the processing of vowels and

acoustic differences between stimulus sets also was

examined. Finally, the effect of a task variable--stimulus

difficulty/required attention--was assessed. In addressing

these issues, average evoked responses (AER's) from the left


and right hemispheres of 12 subjects were collected. The

evoking stimuli consisted of the syllables /bi, bæ, bo, di,

dæ, do/, both synthetic and spoken. In addition, a set of

"chirps" (isolated F2-F3 transitions associated with the

above syllables) was included. The chirp stimuli were

presented twice: once with instructions to discriminate

them as /b/ and /d/, and secondly with instructions to

discriminate "high" vs. "low" onset frequencies. Subjects

indicated whether they heard /b/ or /d/ (or "high" or "low")

during the electrocortical recording procedure. The

resulting AER's were later analyzed utilizing Principal

Components Analysis, Analyses of Variance, and by preplanned

and post hoc comparisons. Results revealed an early

bilateral differentiation of /b/ and /d/, but inconsistent

left hemisphere unilateral processing. Speech vs. nonspeech

instructions for identical stimuli elicited dissimilar

perceptual strategies; however, differences in hemispheric

asymmetry did not reach significance. No evidence for

hemispheric asymmetry in vowel discrimination was revealed,

although acoustic differences between stimulus sets were

discriminated in both bilateral and left hemisphere

processes. Finally, stimulus difficulty/required attention

appeared to influence patterns of hemispheric involvement.

It was concluded that stop consonant perception is mediated

primarily through bilateral cortical processes except when

discrimination is particularly difficult and that the

perceptual results of this study support the concept of a

"speech mode" of perception.


CHAPTER I: INTRODUCTION

How is speech perceived? What auditory and

neurological mechanisms, what conscious and unconscious

strategies must a listener use in order to decode the rapid,

complex stream of sound that is speech? Investigations of

the speech perceptual processes have revealed that phenomena

such as categorical perception and coarticulation appear to

be important; and various investigators have hypothesized

the existence of a special "speech mode" in which speech

stimuli are processed in a different manner than nonspeech

auditory stimuli. More recently, the apparent asymmetry of

hemispheric function has been discussed in relation to

processing the diverse acoustic patterns which comprise

speech (Kimura, 1961; Shankweiler and Studdert-Kennedy,

1967; Cutting, 1974; Wood, 1975; Molfese, 1978a, 1978b,

1980a; Molfese and Schmidt, 1983).

Because of the complexity of both the speech signal

itself and the human listener, the question "how is speech

perceived?" has proven a particularly difficult one to

answer. The acoustic structure of the signal must be

considered; and within that signal, the specific parameters

critical to decoding must be isolated. In addition, the

receptive, perceptual and cognitive structures and processes

of the human listener must be taken into account in order to

fully describe speech perception. In this research, an

attempt will first be made to review the literature

pertinent to the cited speech perception problem; and a

model suggesting how speech is perceived will be generated.

Finally, experiments which extend current findings in this

area will be carried out.

The Question of Invariance

One of the problems in determining how speech is

perceived involves the isolation and identification of

consistent acoustic patterns which correspond to particular

phonemes. According to Liberman, Cooper, Shankweiler and

Studdert-Kennedy (1967), their own initial attempts to

isolate speech segments that would be perceived as phonemes

were quite unsuccessful, except for steady-state

portions of vowels and prolonged fricatives. Indeed, the

acoustic cues which give rise to the perception of a

particular consonant appear to depend greatly on phonemic

context.


Early investigations into this issue utilized a speech

spectrograph, which displays speech in terms of frequency on

the vertical axis over time on the horizontal axis. With

such a display, invariant acoustic properties, with the

exception of vowels and fricatives, were not observed. On

the contrary, it appeared that in some cases, different

acoustic cues were perceived as the same phoneme. For

example, the formant transitions cueing /d/ in the syllables

/di/ and /du/ were observed to have different frequency

compositions and directions, yet both sets of transitions

were consistently perceived as /d/ in their appropriate

vowel context (Liberman et al., 1967; Liberman and

Studdert-Kennedy, 1978). It was also reported that for the

syllable /pu/, the /p/ could be signalled by a rising F2

transition leading into the vowel /u/, while in the syllable

/spu/, the /p/ could be signalled by a silent interval of

approximately 60 ms between the /s/ noise and the onset of

the /u/ (Liberman and Studdert-Kennedy, 1978). In both

cases, two markedly different acoustic cues gave rise to the

consistent identification of a particular phoneme.

Conversely, a single acoustic cue was observed to signal two

different phonemes, depending on context. Cooper, Delattre,

Liberman, Borst and Gerstman (1952) describe a study in

which a noise burst centered around 1440 Hz was perceived

alternately as /p/ or /k/ as a function of the following

vowel.


In summary, a simple listing of which acoustic cues

correspond to which phonemes did not appear to be an

adequate method of addressing the complexity of speech

perception. Indeed, some researchers argued that the

problem of invariance furnished support for the theory of a

"special speech processor" (Liberman et al., 1967), or a

"speech mode" (Mattingly, Liberman, Syrdal and Halwes,

1971), a neural mechanism utilized by a listener during the

perception of speech.

More recent evidence based on spectral analysis has

begun to show that the problem of invariance may be due more

to measurement limitations than actual variations in the

acoustic cues for particular phonemes. For example, Stevens

and Blumstein (1978) and Blumstein and Stevens (1980)

advanced a theory, supported by both pattern-matching of

spectra and listener judgements, that the invariant cue for

stop consonants is contained within the first "20-odd" ms

following the release of the burst. Kewley-Port, Pisoni and

Studdert-Kennedy (1983) challenged that viewpoint somewhat,

but proposed a very similar theory in which both the static

burst information in the first 5 ms following stop release

and two additional "dynamic" (or time-varying) cues within

the first 40 ms were adequate for listeners to reliably

discriminate between stop consonants.

Not all recent research, however, supports theories of

invariance in acoustic cues. Howell and Rosen (1983) have

shown that the "boundary" between /ʃ/ and /tʃ/, previously

hypothesized to be a 40 ms rise time (Gerstman, 1957;

Cutting and Rosner, 1974), in fact varies considerably

depending on speaking situation in a production task and

stimulus range in a perceptual task.

In summary, although invariant cues to individual

phonemes may exist within the speech stream, their brief

duration and complex nature would appear to make speech

perception a difficult task. Early researchers hypothesized

an innate "special speech processor" or "speech mode" to

account for a listener's ability to extract a particular

phoneme regardless of variability in the acoustic signal.

Now, it appears that the invariance problem may not be as

insoluble as once believed. However, it is possible that

the concept of a "speech mode" of perception may still be

useful in order to explain the seeming ease with which most

listeners are able to decode speech. (See the appendix for

a discussion of the nature and possible origin of the

hypothesized "speech mode.")

Coarticulation


The difficulty in identifying invariant acoustic cues

for each phoneme may be due to the dynamic nature of speech

and the resulting phenomenon of coarticulation (Ohman, 1966;

Daniloff and Moll, 1968; Kuehn and Moll, 1972). In

continuous discourse, phonemes are produced in sequences,

not as isolated units. As the articulators move to modify

the laryngeal spectrum, articulatory positions blend smoothly

into one another, and one articulatory position influences

another. This influence goes in both directions: the

configuration of the oral structure for one phoneme may be

carried forward into the next; or the anticipation of

producing a certain phoneme may affect the one which

preceded it (Daniloff and Moll, 1968). In any case, the

acoustic properties of a particular phoneme depend to a

large extent on the surrounding phonemic context--a

relationship which results from the overlapping and

interrelating movements involved in speech production.

The challenge to perception presented by coarticulation

would seem substantial; however, perceptual research has

shown that coarticulation actually aids in the rapid

perception and processing of speech. Kuehn and Moll (1972)

studied listeners' perceptions of portions of spoken CV

syllables and found above-chance levels of identification

for both phonemes when only the consonant and the initial

part of the formant transitions (preceding the vowel

formants) were presented. When they contrasted phonemes in

terms of manner, voicing and place of production, they found

that place of production was the feature most often

correctly perceived. Ostreicher and Sharf (1976) examined

both forward and backward coarticulatory effects for

separated portions of CV, VC and CVC syllables. They found

that place of articulation, voicing features and manner

features of consonants could be determined from the

associated vowel; tongue height and tongue advancement of

vowels could be deduced from contiguous consonants; and

that subjects were able to determine vowel and consonant

features more correctly from preceding sounds than from

following sounds (backward coarticulation). Place of

articulation was correctly identified significantly more

often than manner in nine of twelve instances. Two of their

conclusions were that "coarticulatory effects are perceived

and may be used by listeners to help identify adjacent

sounds in conversational speech," and that "adjacent phoneme

perception involves parallel processing of features"

(Ostreicher and Sharf, 1976, pg. 297).

Thus, coarticulation appeared to facilitate speech

perception by simultaneously encoding information about

several phonemes at any given point in the speech stream.

Place information was conveyed particularly well, even when

syllables (and thus available cues) were truncated

(Ostreicher and Sharf, 1976). Formant transitions appeared

to play an important part in perception of articulatory

coarticulation, although place information was not coded

exclusively through these transitions (Kuehn and Moll,

1972). This knowledge, however, tells us very little about

the mechanism which processes speech, except that parallel

decoding must be involved. Is a "speech mode" of perception

necessary? It can be argued that the acoustic stimuli which

comprise speech vary lawfully, depending on coarticulation

effects; however, the complexity required to decode the

signal in purely acoustic terms is formidable.

Categorical Perception

Another phenomenon which has been used by researchers

to support a theory of a speech mode is categorical

perception. This concept refers to the tendency of a

listener to perceive synthetic speech stimuli varied

continuously along some dimension as belonging to two or

three discrete categories. For example, voice onset time

(VOT), the amount of time between the release of a stop

consonant and the onset of voicing for the following vowel,

serves as a cue for voicing of stop consonants. In English,

voiced stop consonants were found to have VOT's of less than

30 ms (depending on the consonant), while voiceless stops

generally had VOT's greater than 30 ms (Lisker and Abramson,

1964). According to Liberman et al. (1967), when VOT was

varied in 20 ms steps from 0 to 60 ms, English-speaking

subjects were generally able to discriminate well between

VOT's of 20 and 40 ms, corresponding to the "phoneme

boundary" between /b/ and /p/. However, these same

listeners were not able to discriminate adequately between

VOT's of 0 and 20 ms, or between 40 and 60 ms VOT's,

occurring within phoneme categories. Thus, perception of

voicing was hypothesized to be categorical: different

stimuli belonging to the same category were not

discriminated, while stimuli belonging to different

categories were discriminated very well along some

continuous dimension. Categorical perception has also been

demonstrated for place contrasts through continuous

variation of the second formant transition (Liberman, Harris,

Hoffman and Griffith, 1957). These researchers found that

when onset frequency of the second formant transition is

varied continuously, listeners do not hear a continuum of

step-wise changes, for example, from /b/ to /d/. Instead,

as with the VOT-varied stimuli, they perceive the stimuli

categorically as belonging to one phoneme class (/b/) or

another (/d/), with an abrupt boundary between.
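
The relationship between labeling and discrimination described above can be illustrated with a brief computational sketch. The Python code below is not drawn from any of the studies cited. It assumes a logistic identification function with a hypothetical 30 ms boundary and slope, and it uses one commonly cited label-based prediction of two-category ABX discrimination, P = 0.5[1 + (pA - pB)^2], in which listeners are assumed to discriminate only on the basis of the labels they assign. The ABX task itself, the function names, and the parameter values are all illustrative assumptions.

```python
import math


def p_voiceless(vot_ms, boundary_ms=30.0, slope=0.4):
    """Probability of labeling a stimulus as voiceless (/p/).

    The logistic shape, the 30 ms boundary, and the slope are
    assumptions made for illustration only.
    """
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))


def predicted_abx(p_a, p_b):
    """Label-based prediction of ABX accuracy for a two-category continuum."""
    return 0.5 * (1.0 + (p_a - p_b) ** 2)


# VOT continuum in 20 ms steps, as in the example described in the text.
vots = [0, 20, 40, 60]
pairs = list(zip(vots[:-1], vots[1:]))

# Predicted discrimination accuracy for each adjacent pair of stimuli.
predictions = {
    pair: predicted_abx(p_voiceless(pair[0]), p_voiceless(pair[1]))
    for pair in pairs
}

for (a, b), accuracy in predictions.items():
    print(f"{a}-{b} ms: predicted accuracy {accuracy:.2f}")

# The pair with the highest predicted accuracy marks the discrimination peak.
best_pair = max(predictions, key=predictions.get)
print("Discrimination peak at pair:", best_pair)
```

Under these assumptions, predicted accuracy stays near chance for the within-category pairs (0-20 ms and 40-60 ms) and peaks for the 20-40 ms pair that straddles the assumed boundary, which is the pattern that defines categorical perception.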

Liberman et al. (1967) contrast the categorical

perception that appears to characterize encoded consonants

with the continuous perception noted with other acoustic

stimuli. They point out that in general, listeners can

discriminate about 1200 pitches, although they can identify

only about seven. As described above, when the stimuli are

consonants, listeners can only discriminate as many as they

can identify. These researchers argue that the phenomenon

of categorical perception furnishes evidence that speech is

perceived and processed in a different manner from other

auditory stimuli, and that categorical perception is

characteristic of the "speech mode."

It is interesting to note that Liberman et al. (1967)

specifically exclude vowels from their categorical

perception mode. According to these investigators, vowels

are considered long-duration "unencoded" stimuli and can be

perceived along a continuum. This hypothesis formed the

basis for a number of challenges to the theories of Liberman

and his colleagues, most notably one by Lane (1965).

According to Lane's review, if vowels are degraded by being

presented in noise, they will also be perceived

categorically with steep boundaries similar to those for

consonants.


Other investigators have challenged the notion that

categorical perception is unique to speech stimuli. Among

them are Miller, Wier, Pastore, Kelly and Dooling (1976),

who examined a nonspeech noise burst-buzz analog of VOT in

consonant-vowel (CV) syllables. These researchers found a

sharp boundary and discrimination peaks at category

boundaries, similar to the patterns reported for the /pa-ba/

continuum. Pisoni (1977) obtained similar results for two

simultaneously presented tones with varying lead or lag

times. Cutting, Rosner and Foard (1976) synthesized speech

analogs with various rise times: a longer-duration rise time,

which was perceived as a bowed note on a violin ("bow"), and

a shorter-duration rise time, which was perceived as a

plucked note on a violin ("pluck"). These investigators

also demonstrated that listeners perceived such nonspeech

stimuli categorically.

The implication of these categorical perception studies

is that categorical perception is not limited to

brief-duration consonantal speech sounds, and thus cannot be

viewed as evidence for a speech mode of perception.

However, categorical perception remains an important concept

in the investigation of speech perception. Brief-duration

consonantal stimuli do appear to be perceived categorically,

while vowel stimuli (under normal circumstances) do not.

Therefore, it would seem that different types, levels or

modes of processing do exist, regardless of "auditory"

vs. "phonetic" distinctions, and somehow interact during the

process of decoding running speech. The nature of these

modes of processing, their characteristics and their

patterns of interaction have been the subject of a number of

studies.


Levels of Processing in Speech Perception

Interference is one type of interaction which can be

utilized in the study of processing levels. For example,

Day and Wood (1972a) used a reaction time (RT) paradigm to

examine the interference patterns of varying vowels and stop

consonants. Their results demonstrated that, when two stop

consonants occurred in a variable vowel context (/ba, be,

da, de/), reaction times (RT's) for discrimination were

slower than when the same stop consonants were paired with a

single vowel (/ba, da/). Conversely, when the target sounds

for discrimination were two vowels, reaction times (RT's)

were longer when consonant context varied (/ba, be, da,

de/) than when consonant context remained the same (/ba,

be/). In both cases, interference was mutual. These

findings contrast with those of Day and Wood (1972b) and Wood (1975),

who varied stop consonants and fundamental frequency. In

these studies, changing consonant context did not

significantly increase RT for fundamental frequency

discrimination; however, varying fundamental frequency did

significantly increase RT for stop consonant discrimination.

In this case, interference was unilateral; i.e., changing

the phonetic context (consonants) did not affect nonphonetic

discrimination (fundamental frequency), while changing the

nonphonetic context did affect phonetic discrimination.

Wood (1975) also studied the interference patterns of two

nonphonetic aspects of speech: fundamental frequency and

intensity. In this context, interference was again mutual,

with irrelevant variation of one dimension increasing RT for

the target dimension. Collectively, these studies appear to

support a hypothesis of two different modes of processing,

apparently auditory vs. phonetic, as evidenced by changes in

patterns of interference.
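
The logic of these interference comparisons can be sketched as follows. The Python example below is only an illustration: the reaction times are invented placeholder values, not data from Day and Wood or from Wood, and the fixed 10 ms criterion is a hypothetical stand-in for the statistical tests that the original studies would have applied. The function names are likewise assumptions.

```python
def interference_ms(rt_control, rt_orthogonal):
    """RT cost of irrelevant variation: orthogonal condition minus control."""
    return rt_orthogonal - rt_control


def classify_interference(cost_a, cost_b, criterion_ms=10.0):
    """Label the pattern as mutual, unilateral, or none.

    criterion_ms is a hypothetical threshold standing in for a
    significance test on the RT difference.
    """
    a_affected = cost_a > criterion_ms
    b_affected = cost_b > criterion_ms
    if a_affected and b_affected:
        return "mutual"
    if a_affected or b_affected:
        return "unilateral"
    return "none"


# Placeholder mean RTs in ms (invented for illustration only).
# Consonant judgments slow down when fundamental frequency varies.
consonant_cost = interference_ms(rt_control=400.0, rt_orthogonal=450.0)
# Fundamental frequency judgments barely change when the consonant varies.
pitch_cost = interference_ms(rt_control=380.0, rt_orthogonal=382.0)

pattern = classify_interference(consonant_cost, pitch_cost)
print("Interference pattern:", pattern)
```

With these placeholder values only the consonant judgment shows a cost, so the pattern is classified as unilateral. Substituting a cost above the criterion for both dimensions would yield the mutual pattern reported for vowel and consonant variation.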

Additional research, however, has caused investigators

to question the "phonetic" level of processing. Blechner,

Day and Cutting (1976) used the "bowed" and "plucked" speech

analogs cited above (Cutting, Rosner and Foard, 1976) in a

reaction time paradigm. In their study, Blechner et

al. (1976) varied rise time and intensity. According to the

previous interference studies (Day and Wood, 1972a, 1972b;

Wood, 1975), it was expected that rise time and intensity

would show mutual interference patterns, since both are

"auditory" (as opposed to "phonetic") stimuli. However,

Blechner et al. found that irrelevant variations in rise

time did not significantly affect RT for intensity, while

irrelevant variations in intensity did affect RT for rise

time, a unilateral interference pattern similar to that

obtained by Day and Wood (1972b) for consonant

vs. fundamental frequency discrimination. From their

results, Blechner et al. hypothesized that the "auditory"

level of processing most likely consists of several levels;

that unilateral interference patterns might be due to ease

of discrimination; and that perhaps in previous studies,

"the importance of the linguistic-nonlinguistic dimension

may have been overrated, and the role of acoustic

factors may have been underrated" (Blechner et al., 1976,

pg. 264).

Pastore, Ahroon, Puleo, Crimmins, Golowner and Berger

(1976) also questioned the notion of a "phonetic" level and,

like Blechner et al., studied interference patterns of

nonphonetic acoustic tokens. Their stimuli consisted of

narrow bandwidth frequency glides (similar to second formant

transitions) followed by high and low pitched buzzes

(analogous to vowels with different fundamental

frequencies). These investigators found unilateral patterns

of interference closely resembling those obtained with

consonants and fundamental frequency in the Day and Wood

studies. Pastore et al. concluded that their results--as

well as those of their predecessors--could be explained in

terms of general properties of the auditory system without

involving a speech mode of processing.

The two methodologies discussed so far, categorical

perception and reaction time, have failed to demonstrate

that speech is perceived in a unique and different way from

nonspeech stimuli. However, two important points remain to

be reviewed. The first is the concept that the unique

aspect of speech perception is a function of higher cortical

processes; the second is the concept of hemispheric

asymmetry and its role in speech perception.

Stimulus Expectation

The effect of expectation on the comprehension of more

complex linguistic units is a topic that has been discussed

frequently, especially by linguists (Chomsky, 1965;

Bickerton, 1979; Chu, 1977; Oh and Godden, 1979). The

basis of this expectation appears to be semantic and

syntactic knowledge of a particular language and cultural

conventions. There also is evidence that phonological

knowledge of a language establishes certain stimulus

expectations and affects auditory perception. For example,

Warren and Warren (1970) report research in which certain

phonemes in connected discourse were replaced by tones,

buzzes or hisses. As a consequence of these operations,

listeners were unable to tell which phonemes had been

replaced. They continued to perceive the missing phoneme

even after being told where to listen for the substitutions.

Although Warren and Warren did not interpret their data in

this manner, it is possible that the listeners' knowledge of

the structure of English phonology and semantics predisposed

them to perceive what they expected to hear.

Day (1970) showed a similar if less consistent

phenomenon occurring with dichotic fusion. She took words

beginning with consonant clusters such as black, and made

two component words, each beginning with one letter of the

cluster; for example, black became back and lack. She then

presented one component word to a listener's right ear and

the other component word to the left. Onset times of the

component words were varied such that the back component

word was presented with lead times of 100, 75, 50, 25, or 0

ms. Then the lack component word was presented with lead

times at the same intervals. However, when the listeners

were asked which consonant they heard first (the /b/ or the

/l/), two markedly different response patterns emerged.

Some subjects consistently reported /b/ as occurring first,

even when the lack component word actually led; while

others were able to accurately choose which consonant was

actually presented first. Interestingly, even when subjects

in the former category were told that in some trials the /l/

would precede the /b/, they were still unable to accurately

order the component words. This phenomenon is consistent

with the findings of Warren and Warren (1970), where

listeners were not able to identify the missing phoneme even

when told its general location.

Day's results lead to a number of interesting

speculations. First, it appeared that for at least some

listeners, knowledge of the phonological rules of English

interfered with temporal order judgements. Second, since

all listeners did not show this pattern, it might be

hypothesized that at least two levels of processing were

involved in the task. At one level, phonemes were decoded

in the actual order of presentation. At the second level, a

reordering of the incoming stimuli occurred for some

subjects who appeared to be particularly

"language-dependent." This restructuring may have been based

on phonological knowledge of acceptable phoneme sequences.

The interaction between the second level and the first in

the "language-dependent" subjects suggests that higher

cortical processes can act in an efferent or down-feeding

direction to influence perception. In fact, anatomical

evidence for two descending auditory pathways has been found

(Harrison and Howe, 1974), and physiological research

(Desmedt, 1971; Wiederhold and Kiang, 1970) suggests that

cortical involvement can affect sensory processes.

If certain expectations about a stimulus are important

influences on perception, the issue of how expectations are

engaged must be considered. On the surface, it would appear

probable that acoustic and/or linguistic aspects of the

stimulus itself would trigger expectations. However, in the

case of ambiguous stimuli, perhaps direct instructions to

listeners can be demonstrated to vary mode of perception or

level of processing. One example of this technique is

demonstrated by Schwab (1981) in a study in which

synthesized sinewave analogs of the syllables /bu, ub, du,

ud/ (among others) were utilized. These syllables contained

appropriate formant frequencies for the vowel and transition

onsets, but had bandwidths of 1 Hz. They were not

immediately recognizable as speech, according to the author.

Schwab instructed half her listeners to discriminate these

tokens on the basis of rising or falling frequency, a task

presumably requiring an "auditory" mode of perception. The

other listeners were told that these tokens were

computer-generated speech samples, and were asked to label

the tokens, a task requiring a "phonetic" mode. Results of

the five experiments contained in Schwab's article

consistently indicated that sinewave syllable analogs could

be discriminated above chance levels in both modes, and that

the discrimination functions for the "auditory" listeners

and "phonetic" listeners were different. For example, as

additional formants were added, making the signal more

speech-like, phonetic discrimination improved while auditory

discrimination deteriorated. In addition, there were marked

backward masking effects and frequency masking effects for

the auditory group, but not for the phonetic group. Schwab

interpreted her results as supporting the concept of a

speech mode of information processing. Indeed, her work is

particularly important in that she demonstrated that

identical stimuli could be perceived in two distinctly

different ways, depending on the instructions to the

subjects. Expectations about the stimulus (speech

vs. nonspeech) appeared sufficient to vary the way an

ambiguous signal was processed. However, as hypothesized

above, it would seem that the stimulus characteristics

themselves are important in engaging or maintaining

expectations: the more speech-like the sinewave analogs

were, the better they were discriminated in the speech mode.

Other types of ambiguous stimuli also might be used to

show how perceptual mode can be influenced. For example,

Mattingly, Liberman, Syrdal and Halwes (1971) used a

categorical perception paradigm to study perception of

second formant transitions, both in isolation and embedded

in syllables. Mattingly et al. (1971) found that second

formant transitions embedded in syllables were perceived

categorically as voiced stop consonants, while those in

isolation were not perceived categorically. Further, they

reported that the transitions in isolation sounded like

clicks or "chirps," and bore no resemblance to speech

sounds. In a later study, however, Nusbaum, Schwab and

Sawusch (1983) discussed additional research suggesting that

with proper instructions to listeners, chirps might be

perceived as speech. Although this was not the focus of

their research, they demonstrated that if listeners were

told that chirps were parts of phonemes and given practice

in identifying them as such, these stimuli could be

perceived categorically in a manner similar to intact

(synthetic) syllables. Thus, ambiguous auditory stimuli

were shifted from being perceived noncategorically in the

Mattingly et al. research to being perceived categorically

in the Nusbaum et al. study through manipulation of stimulus expectation.


Taken in concert, the results of Schwab (1981), Nusbaum

et al. (1983), Warren and Warren (1970) and Day (1970)

support a theory of a speech mode of perception and the

importance of stimulus expectation in determining how

acoustic signals are perceived. Under normal circumstances,

this speech mode is probably utilized by a listener when the

incoming stimuli have the appropriate frequency and temporal

characteristics. However, due to variability in the signal

and transmission distortion, the human perceptual mechanism

must be capable of processing a degraded signal and still

extracting meaning. Thus, stimulus expectation "fills the

gaps," and the incoming signal is restructured to conform to

some previously learned pattern. This restructuring can

occur at many linguistic levels: the perception of a word

or phoneme may be altered to maintain correct syntax or

semantic sense (Warren and Warren, 1970); or perception of

various acoustic cues may be altered to maintain a correct

phonological sequence (Day, 1970). Thus, the hypothesized

"speech mode" may be the result of efferent feedback from

the cortical level based on the listeners' expectations

about the stimulus.

Hemispheric Specialization

In any model of speech perception, it is important to

consider the respective roles of the left and right cerebral

hemispheres. It has been assumed since the late 1800's that

the left hemisphere of the brain is somehow specialized for

language (Gevins et al., 1979). Indeed, the pervasiveness

of various types of aphasia following injury to or disease

of the left hemisphere in right-handed individuals gives

credence to this view. But what exactly is the left

hemisphere's role in speech perception? The studies cited

above appear to indicate the importance of processing

levels. If an auditory vs. speech perceptual task could be

shown to evoke different patterns of hemispheric

involvement, further support would be provided for a theory

of different levels or modes of processing.

Unfortunately, "speech" is composed of many

acoustically diverse elements, and hemispheric involvement

in processing these elements has been difficult to

determine. One popular methodology in the study of

hemispheric asymmetry has been dichotic listening. This

method was first reported by Broadbent (1954) who found that

when two simultaneous competing stimuli were presented

binaurally, most right-handed listeners appeared to have a

bias for signals coming into the right ear. Kimura (1961)

was the first to label this phenomenon as "right ear

advantage" or REA. She reasoned that since most of the

fibers carrying information from the right ear go to the

left side of the brain, this right ear advantage must be

indicative of a left hemisphere superiority for processing

speech and language. Kimura further hypothesized that these

contralateral neural pathways were superior to ipsilateral

fibers in conducting sensory information to the auditory cortex.


Right-handed subjects are typically used in

lateralization studies, because hemispheric dominance in the

left-handed is less predictable (McGlone and Davidson,

1973). However, it should be noted that not all

right-handed subjects show REA's for speech stimuli.

According to Sidtis (1982), REA is demonstrated by only

70-75% of all right-handed subjects. Further, Sidtis (1982)

has hypothesized that only 50% of the dextral population

fits Kimura's model of dominant contralateral/secondary

ipsilateral pathways. Thus, when applying models of left

vs. right hemispheric processing to individuals, it is

important to take into account the range of normal

variability. Additionally, when assessing the strength of

left or right hemispheric asymmetry as revealed by group

trends, it should be remembered that the population mean may

include data from subjects with questionable lateralization


Shankweiler and Studdert-Kennedy (1967) investigated

REA for syllables consisting of a stop consonant and vowel,

and for vowels alone. These researchers found that a right

ear advantage existed, but only for CV stimuli; that is,

steady-state vowels did not elicit a significant REA.
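
The size and direction of an ear advantage of the kind discussed above is often summarized as a single number. The following is a minimal sketch, assuming a conventional laterality index of the form 100 x (R - L) / (R + L), where R and L are the numbers of correct reports for the right and left ears. The index, the function name and the example counts are illustrative assumptions and are not taken from the studies cited here.

```python
def laterality_index(right_correct, left_correct):
    """Return a percentage laterality index for one listener.

    Positive values indicate a right ear advantage (REA),
    negative values a left ear advantage (LEA).
    The formula is an assumed convention, not the cited authors' measure.
    """
    total = right_correct + left_correct
    if total == 0:
        raise ValueError("no correct reports in either ear")
    return 100.0 * (right_correct - left_correct) / total


# Hypothetical counts of correctly reported dichotic syllables.
print(laterality_index(30, 20))  # positive: REA
print(laterality_index(20, 30))  # negative: LEA
```

Because the index is normalized by total correct reports, listeners with different overall accuracy can be compared on the same scale.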

Cutting (1974) supported and extended the Shankweiler

and Studdert-Kennedy findings. While earlier researchers

had used only stop consonants in CV stimuli and vowels in

isolation, Cutting (1974) included liquids (/r,l/) in

addition to stops and vowels. He also examined REA effects

related to consonant position, the presence or absence of

formant transitions, nonspeech sinewave formant "CV" analogs

and inverted or "nonphonetic" formant transitions (labelled

as such because they could not have been produced by the

human vocal tract). Results revealed that stops had a

significantly greater REA than liquids, which in turn had a

significantly greater REA than vowels. Both stops and

liquids showed an REA, but while final stops retained their

REA, final liquids showed an LEA. When results were

averaged across conditions, all sounds identifiable as

speech (consonants and vowels) showed some degree of REA.

On the other hand, of the sinewave formant CV

approximations, those with transitions showed a small,

nonsignificant REA, while those without transitions showed a

slight LEA. Cutting interpreted these results as supporting

a theory that two specific and different mechanisms operate

in the left hemisphere, and that both are important to

speech perception. The first is involved in processing

complex acoustic aspects of the signal, such as the rapid

frequency over time changes characteristic of (but not

limited to) formant transitions. This mechanism was

hypothesized to be "acoustic" in nature rather than

"phonetic," since it was activated by nonphonetic (i.e.,

inverted) transition stimuli, as well as phonetic stimuli.

Cutting's hypothesized second mechanism was "phonetic,"

i.e., a system which responds differentially to speech

sounds. He based this second hypothesis on the observation

that both CV syllables and vowels with normal speech-like

bandwidths evoked an REA, while stimuli that were not

perceived as "speech"--sinewave formant CV analogs--did not

evoke a significant REA even when transitions were present.

Molfese (1978a) attempted to replicate some of

Cutting's results, but employed average evoked responses

(AER's) rather than dichotic listening to demonstrate the

differential hemispheric responses. Like Cutting, he used

stop consonant-vowel syllables with normal (phonetic)

transitions, CV syllables with inverted (nonphonetic)

transitions, and sinewave formant stimuli with both phonetic

and nonphonetic transitions. Molfese used Principal

Components Analysis in analyzing his AER data. This

procedure permitted him to identify underlying components of

the AER's which might vary with experimental manipulations.

Results revealed that both /b/-/g/ with phonetic transitions

and /b/-/g/ with nonphonetic transitions were differentiated

in the left hemisphere, but in different ways. Thus, the

left hemisphere appeared to be sensitive to normal formant

/b/-/g/ contrasts, nonphonetic /b/-/g/ contrasts, and normal

vs. nonphonetic transitions. No such differences were

observed in the right hemisphere. It should be noted,

however, that the bandwidth variable was not a significant

factor in this interaction. That is, in assessing left

hemisphere sensitivity to /b/-/g/ contrasts, both the

responses to normal formant syllables and sinewave formant

CV analogs were averaged together. Thus, the results of

Molfese (1978a) appeared to support Cutting's proposed left

hemisphere mechanism which processed all stimuli containing

transitions (although Cutting himself did not demonstrate

processing of sinewave formant stimuli in the left

hemisphere). Cutting's second hypothesized left hemisphere

mechanism which processed only "speech" (normal formant

bandwidth speech or speech-like stimuli) was not supported

by Molfese's results.
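
The general idea of applying Principal Components Analysis to AER's can be illustrated with a small sketch. It extracts only the first component from a set of averaged waveforms, using power iteration on the covariance matrix between time points, and then scores each waveform on that component. This is not a reconstruction of Molfese's procedure, whose computational details are not described here; the function name and the waveform values are hypothetical.

```python
import math


def first_principal_component(waveforms, iterations=200):
    """Return (component, scores) for the first principal component.

    waveforms: list of equal-length lists, one averaged waveform per row.
    component: unit-length weights, one per time point.
    scores: one value per waveform (projection of the centred waveform).
    """
    n_rows = len(waveforms)
    n_points = len(waveforms[0])
    # Centre each time point on its mean across waveforms.
    means = [sum(row[t] for row in waveforms) / n_rows for t in range(n_points)]
    centred = [[row[t] - means[t] for t in range(n_points)] for row in waveforms]
    # Covariance matrix between time points.
    cov = [[sum(r[i] * r[j] for r in centred) / (n_rows - 1)
            for j in range(n_points)] for i in range(n_points)]
    # Power iteration: repeated multiplication converges on the
    # eigenvector with the largest eigenvalue.
    vec = [1.0] * n_points
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(n_points))
               for i in range(n_points)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance in the data
        vec = [x / norm for x in nxt]
    scores = [sum(r[t] * vec[t] for t in range(n_points)) for r in centred]
    return vec, scores


# Hypothetical waveforms: two conditions differing only at the third point.
example = [
    [0.0, 1.0, 4.0, 1.0, 0.0],
    [0.0, 1.0, 4.2, 1.0, 0.0],
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 1.8, 1.0, 0.0],
]
component, scores = first_principal_component(example)
print(component)
print(scores)
```

In this toy case the component loads on the one time point that varies, and the scores separate the two pairs of waveforms. Scores of this kind are what a subsequent analysis of variance would take as its dependent variable.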

In a 1979 experiment, Molfese and Molfese used stimuli

and methods similar to those previously employed by Molfese

(1978a), but in this case they used infants of approximately

one day old as subjects. Their stimuli consisted of normal

bandwidth /ba/ and /ga/ syllables and sinewave formant CV

analogs (phonetic vs. nonphonetic transitions were not

utilized). The results of this research were similar to

those of Molfese (1978a); i.e., both demonstrated a left

hemisphere differentiation between the CV syllables /ba/ and

/ga/. However, in contrast to the adult subjects, the

bandwidth variable was a significant factor in the infants'

responses. Results showed that for infants, only /b/

vs. /g/ syllables with normal bandwidth formants were

discriminated in the left hemisphere, while sinewave formant

syllables were not. Neither were differentiated in the

right hemisphere. This study furnished support for

Cutting's second proposed left hemisphere mechanism, which

processes only stimuli perceived as speech. Molfese and

Molfese (1979) attributed this difference in results to

possible "maturational" factors, although they did not

elaborate on what these factors might be.

One possible explanation for the difference in results

between Molfese (1978a) and Molfese and Molfese (1979) is

purely acoustic. It is possible that the left hemisphere

does indeed process all transitional stimuli, regardless of

bandwidth, and does so on the basis of transition direction.

The infants' responses were different from the adults'

because presumably they did not have enough experience with

acoustic stimuli to respond appropriately to sinewave

formant transitions. This explanation, however, is not

consistent with Cutting (1974), who did not find left

hemisphere processing of sinewave formant stimuli in his

adult subjects.

Another possible explanation for the difference in

results between Molfese (1978a) and Molfese and Molfese

(1979) may relate to the findings of Schwab (1981). As

previously reported, Schwab (1981) found that sinewave

formant analogs of /ba/ and /ga/ could be perceived either

as speech or as nonspeech, depending on instructions to the

subjects, as evidenced by differences in discrimination

functions. Molfese's (1978a) subjects heard the sinewave

formant CV analogs interspersed with normal bandwidth

stimuli. Although they were not given specific instructions

to label each stimulus as /ba/ or /ga/, it is possible that

they did so, thus utilizing the speech mode and differential

left hemisphere processing regardless of formant structure.

The infant subjects of Molfese and Molfese (1979), however,

were not mature enough to employ this strategy, and thus

only /ba/ and /ga/ syllables with normal formant structure

were differentially processed in the left hemisphere. This

interpretation might be reconciled with Cutting (1974) if

the different modes of stimulus presentation used by Cutting

(1974) and Molfese (1978a) are taken into account. As noted

above, Molfese randomized both his normal bandwidth and

sinewave formant stimuli on the same tape; thus subjects

heard both types of stimuli in the same trial, and possibly

attempted to interpret all stimuli as "speech." The subjects

of Cutting, on the other hand, heard all normal bandwidth

CV's in one trial, and all sinewave formant CV's in a

separate trial, and would therefore have less motivation to

try to process all stimuli in the same manner.

Additional studies by Molfese and colleagues (Molfese,

1980a; Molfese and Schmidt, 1983) have generally supported

the finding that adult subjects tend to discriminate both

normal bandwidth and sinewave formant /b/ and /g/ in the

left hemisphere. Molfese (1980a) utilized /b,g/ in varying

vowel contexts (/i,æ,o/). Results revealed a significant

Hemisphere by Consonant interaction such that /b/ and /g/

(regardless of vowel environment--or formant structure) were

differentiated in the left hemisphere but not in the right.

Molfese and Schmidt (1983) essentially replicated the

Molfese (1980a) preliminary study, reporting similar (though

more detailed) results.

Molfese (1980a) and Molfese and Schmidt (1983) were the

first AER studies to reveal a consistent left hemisphere

response to consonants in varying vowel contexts. This is a

significant finding, as the acoustic cues for each consonant

are different, depending on the following vowel. However,

in these studies, the effects of several possible

confounding factors were not examined. First, transition

direction was positive for all /b/ stimuli, regardless of

vowel context, and negative for all /g/ stimuli, despite

different onset frequencies. Thus, transition direction may

have furnished an acoustic cue for consonant identification.

Second, subjects' expectations regarding the nature of the

ambiguous stimuli were not discussed. In this regard,

Molfese and Schmidt (1983) concluded that their results

furnished support for a "lateralized mechanism that is

sensitive to or extracts relevant linguistic information"

(pg. 68). However, it must be assumed that the sinewave

formant CV analogs were perceived as speech signals if this

conclusion is to be accepted. Finally, it has been

demonstrated by Kewley-Port (1982) that formant transitions

alone are not sufficient cues in natural speech for accurate

stop consonant identification, despite their frequent use in

speech perceptual studies. It is possible that perceptual

processes identified in the literature could vary

significantly as a function of the type of speech stimulus

used (synthetic vs. natural).

In summary, the research results reported by Molfese

and his colleagues are generally consistent with a theory of

a "speech mode" of perception. This speech mode is

characterized by left hemisphere differentiation of stimuli

that contain transitions and that subjects perceive as

"linguistic," regardless of acoustic differences in cues

related to varied vowel contexts. However, the results of

Molfese (1978b) are contradictory to a theory that the

speech mode is a left hemisphere function.

Molfese (1978b) has suggested that one of the acoustic

cues for voicing of stop consonants, voice onset time (VOT),

appears to be processed in the right hemisphere. That is,

in a typical categorical perception paradigm, a differential

response to between-category VOT changes (20 and 40 ms) was

only found in the right hemisphere. To be specific, the

right hemisphere response correlated with listeners'

perception of /b/ vs. /p/, while the left hemisphere did

not. However, differential left hemisphere responses to the

endpoints of the continuum (0 and 60 ms) were observed; and

a second response showed discrimination of the endpoints (0

and 60 ms) from the midpoints (20 and 40 ms) of the

continuum. Similar results were obtained by Molfese (1980b)

when nonspeech tonal stimuli with varying relative onset

times were utilized. Thus, the idea of a simple correlation

between a speech mode of perception and left hemisphere

activity appears to be inadequate to describe the actual

complexity of speech perception.

A Theory of Speech Perception

While a theory explaining speech perception is

desirable, it should be one which includes hemispheric

asymmetry data and the concept of stimulus expectation.

Such a model is presented in Figure 1-1.

The Feature Level

As may be seen, the model specifies that both the left

and right cerebral hemispheres are active in the primary

processing of acoustic stimuli. Complex, rapidly-changing

frequency over time information (including formant

transitions) is analyzed in the left hemisphere. Left

hemisphere involvement in the processing of phonetic and

nonphonetic transitions has been demonstrated by Cutting

(1974), and Molfese (1978a), although evidence for similar

asymmetrical differentiation of sinewave formant stimuli is

less clear (Molfese, 1980a; Molfese and Molfese, 1979;

Molfese and Schmidt, 1983). At the same level, further

analysis of the spectral and temporal characteristics of the

acoustic signal may take place in the right hemisphere.

These perceptual processes in both the left and right

hemispheres could be considered the "feature level" of

speech perception, because decisions as to place, manner and

voicing are made at this point, in accordance with feedback

from higher cortical processes. If the stimuli were not

speech-like, a different set of expectations would be

utilized by the listener, and another model would be


[Figure 1-1. A model of speech perception (panels: Left Hemisphere, Right Hemisphere); see text.]

The Phoneme Level

At the next level, the feature information is combined

in the left hemisphere, and it is determined if the stimulus

is speech or some nonspeech signal. Again, a basis for this

mechanism can be seen in Molfese (1978a), where both

phonetic and nonphonetic /b/ vs. /g/ were discriminated

differently in the left hemisphere. Stimuli presumably

would be classified as speech or nonspeech on the basis of

1) frequency, bandwidth and other characteristics of the

signal (Schwab, 1981), 2) the presence of acoustic cues

appropriate to a particular place or manner of articulation

and voicing, and 3) the temporal characteristics of the

signal. However, even in the absence of clear cues for

speech, feedback from higher cortical centers may override

the inadequate acoustic cues, and the signal may be

perceived as "speech" (Schwab, 1981; Nusbaum et al., 1983).

Simultaneous with the speech-nonspeech decision is one

regarding the identity of the phoneme. Again, stimulus

expectation, now based on linguistic knowledge, may override

actual acoustic cues. Expectations at this level may also

include some knowledge of how speech is produced; thus, if

an articulatory referent exists (Liberman et al., 1967), the

left hemisphere phoneme level is the processing stage where

such cross-correlations would take place in this model.

In the right hemisphere, initial signal processing at

the feature level involves the spectral and temporal aspects

of a signal, as mentioned above. At the phoneme level, the

relevant features extracted from such an analysis are

transmitted to the left hemisphere. Thus, right hemisphere

input contributes to the speech vs. nonspeech decision and

phoneme labelling. There can also be feedback from the left

hemisphere to the right hemisphere at this level if the

feature information does not conform to expectations or if

it results in an ambiguous speech/nonspeech decision or

phoneme label. There is constant interaction between the

hemispheres at this level. These processes take place below

the level of consciousness.

The Word Level

At the word level of processing, phonemes are combined

and sequenced in the left hemisphere according to

phonological rules. Marked individual differences can be

seen at this level, with some subjects' perception highly

dependent on phonological knowledge while others' perception

is not (Day, 1970). A number of auditory "illusions" may

also occur (Warren and Warren, 1970) when the actual

acoustic cues for a particular phoneme are omitted or

distorted. Again, there is constant feedback between the

phoneme and word levels of processing. If the incoming

phoneme sequence violates phonological or semantic rules,

the sequence of phonemes may be altered in the left

hemisphere, or the questionable sound may be shunted back to

the phoneme level to be relabelled. At this point, there is

also interaction between the hemispheres. Feedback from the

left hemisphere to the right can influence further spectral

and temporal analysis, while feedback from the right

hemisphere to the left can influence phoneme sequencing.

This level is at the borderline of consciousness; thus,

some listeners may be aware of modifications while others

are not (Day, 1970).

The Concept Level

At this juncture, a particular meaning must be

associated with the sequence of phonemes--a left hemisphere

function. The "lexical store," or long-term word memory, is

searched for similar phoneme sequences. If such a sequence

is found, the same meaning is assigned to the incoming

stimulus item. If a similar sequence is not found, several

alternate steps may ensue. The listener may accept the word

as an unknown, and process no further. Or, the word may be

stored while the meaning is gradually learned, thus adding

to the lexical store. Alternatively, the listener may try

to find the best possible "match" already existing in the

lexical store, and accept the sequence as a known but

distorted word. The latter occurs when attempting to

understand a speaker with a speech problem or foreign

accent. At this point of processing, modifications are

easily accessible by the conscious mind, although individual

variables such as intelligence, training and motivation

determine the extent of conscious involvement.
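
The search of the lexical store described above can be sketched as a lookup with a fallback to the closest stored entry. This is only an illustration under stated assumptions: phoneme sequences are written as plain ASCII strings, the lexicon entries are hypothetical, and string similarity from Python's difflib module stands in for whatever matching process listeners actually use. The option of storing an unknown word while its meaning is gradually learned is not modelled.

```python
import difflib

# Hypothetical lexical store: phoneme sequence -> word meaning.
LEXICON = {"blak": "black", "bak": "back", "lak": "lack"}


def lookup(sequence, lexicon, cutoff=0.6):
    """Classify an incoming phoneme sequence against the lexical store.

    Returns ("known", meaning) for an exact match,
    ("distorted", meaning) for the best sufficiently similar entry,
    or ("unknown", None) when nothing is similar enough.
    The cutoff value is an arbitrary illustrative choice.
    """
    if sequence in lexicon:
        return ("known", lexicon[sequence])
    close = difflib.get_close_matches(sequence, list(lexicon), n=1, cutoff=cutoff)
    if close:
        return ("distorted", lexicon[close[0]])
    return ("unknown", None)


print(lookup("bak", LEXICON))   # exact entry
print(lookup("blek", LEXICON))  # accepted as a known but distorted word
print(lookup("zzz", LEXICON))   # accepted as an unknown
```

Raising the cutoff makes the sketch less willing to accept a distorted sequence as a known word, which loosely parallels a listener who is less tolerant of an unfamiliar accent.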

In the right hemisphere at the conceptual level,

sensory imagery contributes to establishing word meaning and

memory. The listener associates information from the

visual, auditory, tactile, kinesthetic and olfactory

modalities with phoneme sequences in order to fully

comprehend the meaning of a particular word. Interaction

between the left and right hemispheres at this level serves

as the link between external language and internal


As the final step in a feedback loop, the acoustic

characteristics and features of the word are channeled to

lower levels in both hemispheres. In the left hemisphere,

this information influences the manner in which future

phoneme input will be sequenced and modified. In the right

hemisphere, this feedback will affect the phoneme level and

future extraction of salient spectral and temporal features.

This model can be considered the "speech mode" of

perception. Although previous investigators have

demonstrated that adequately complex nonspeech stimuli can

evoke similar perceptual processes, only with speech stimuli

do expectations regarding syntactic, semantic, phonological

and possibly physical constraints feed downward through the

left hemisphere to affect perception at a basic level.

Most of the literature cited in this paper supports

this model, particularly in terms of the importance of

stimulus expectation and efferent feedback. Day (1970) has

shown that a specific sequence of phonemes can be

unconsciously reordered by a listener in order to conform to

English phonological rules. Warren and Warren's (1970)

results indicate that missing phonemes can be perceived as

being present, presumably based on listeners' expectations.

Finally, Nusbaum et al. (1983) and Schwab (1981) have

demonstrated that identical stimuli can be perceived in

distinctly different ways based on instructions to the

listeners for engaging different sets of expectations. On

the other hand, Molfese's (1978b) results are not completely

consistent with the proposed model. According to this

model, /ba/ and /pa/ should have elicited some differential

processing in the left hemisphere, since they are presumably

processed with reference to stimulus expectation. No such

left hemisphere involvement at the phoneme boundary was

discovered. However, it may be the case that since place of

articulation (and therefore second formant transitions) was

the same for both consonants, the left hemisphere did not

differentially process these syllables; while the temporal

processing in the right hemisphere based on learning of the

appropriate VOT's in English did discriminate between the

two in this particular study. In contrast, research in

aphasia (Gandour and Dardarananda, 1982) has revealed that

patients with left hemisphere lesions were significantly

impaired in VOT perception. This would tend to confirm the

importance of the left hemisphere in phonetically labelling

features, and in perceiving a signal as speech. Finally, as

noted previously, the results of Molfese (1978a, 1980a) and

Molfese and Schmidt (1983) are not completely compatible

with the hypotheses advanced in this model unless one

accepts the premise that their sinewave formant CV analogs

were perceived by listeners as "speech." Research which

includes manipulation of subjects' expectations regarding

the "speech" or "nonspeech" nature of identical ambiguous

stimuli is needed to clarify this issue.

The Problem of Task Variables

Much of the research cited in support of the model

presented above involves the AER methodology and the work of

Molfese and colleagues (Molfese, 1978a; 1980a; Molfese and

Molfese, 1979; Molfese and Schmidt, 1983). Their research

generally included both normal syllables and nonspeech CV

analogs; and in their analysis, responses from both sets of

stimuli were averaged together. This research design,

however, does not take into account two important variables:

first, one set of stimuli is less familiar than the other,

more difficult to discriminate, and presumably requires a

greater degree of attention from subjects; and second, some

of the subjects' perceptual judgements of the ambiguous

stimuli will be incorrect.

Regarding the differences in stimulus

difficulty/required attention, the results of numerous AER

studies suggest that as difficulty in discriminating among

stimuli increases, AER latencies become longer (Ritter,

Simson and Vaughan, 1972) and amplitude increases (Poon,

Thompson and Marsh, 1976). Other studies have shown that as

the amount of attention required by a task increases, so do

AER amplitudes (Eason, Harter and White, 1969; Harter and

Salmon, 1972). Further, dichotic listening studies have

shown that increasing task difficulty results in larger

hemispheric differences. For example, when listeners were

asked to identify vowels in noise (Weiss and House, 1973)

and vowels of brief duration (Godfrey, 1974), a tendency

toward right ear advantage increased. Further, Kasischke

(1979) demonstrated that increasing the complexity of tonal

stimuli resulted in asymmetric left hemispheric involvement.

Thus, it is possible that the left hemisphere /b/-/g/

discrimination found in the Molfese research may be

dependent upon the inclusion of ambiguous stimuli in the

research design, and does not reflect normal speech perception.


The second confounding variable mentioned above,

incorrect perceptual judgements, is also potentially

serious. If one assumes that electrocortical activity

reflects a cognitive process or series of processes, an

incorrect perceptual judgement should result in a slightly

different waveshape than a correct perceptual judgement.

Thus, it would appear important to include only correct

perceptual judgements when averaging trials to obtain AER's.
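
Restricting the average to correctly judged trials can be sketched as follows. The data layout (a waveform paired with the presented and the reported consonant), the function name and the sample values are assumptions made for illustration; they do not describe the recording or averaging equipment used in any of the studies cited.

```python
def average_correct_trials(trials):
    """Average single-trial waveforms, keeping only correct judgements.

    trials: list of (waveform, presented, reported) tuples, where
    waveform is a list of amplitude samples of equal length.
    Returns the point-by-point mean, or None if no trial was correct.
    """
    kept = [wave for wave, presented, reported in trials
            if presented == reported]
    if not kept:
        return None
    n_points = len(kept[0])
    return [sum(wave[t] for wave in kept) / len(kept) for t in range(n_points)]


# Hypothetical trials: the third was misjudged and is excluded.
trials = [
    ([1.0, 2.0], "b", "b"),
    ([3.0, 4.0], "b", "b"),
    ([100.0, 100.0], "b", "d"),
]
print(average_correct_trials(trials))
```

Without the filter, the misjudged trial would contribute to every point of the averaged waveform, which is the contamination the preceding paragraph warns against.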

In summary, it is possible that uncontrolled task

variables affected the results obtained in previous AER

studies. A research design which includes stimulus

difficulty/required attention and accuracy of judgement as

independent variables would appear to be necessary in order

to separate hemispheric response to stimulus characteristics

from hemispheric response to task variables.


The purpose of this study is to test several aspects of

the theory of a speech mode of perception presented above.

According to this theory, perception of stop consonants

should result in a cognitive process specific to the left

hemisphere. Further, when ambiguous stimuli are utilized,

subjects who have been instructed to perceive these tokens

as "speech" should demonstrate a similar left hemisphere

differentiation. When subjects are instructed to process

the same stimuli in a nonspeech manner, a different pattern

of hemispheric involvement is predicted.

In addition to testing these hypotheses, a number of

more general questions related to inter- and

intra-hemispheric processing of the various classes of

stimuli will be explored. They are: 1) Are there bilateral

processes which differentiate /b/ from /d/, regardless of

vowel context? 2) Do AER's from the left and right

hemispheres differ significantly, regardless of consonant,

vowel or trial? 3) Are there bilateral processes which

discriminate vowels, regardless of consonant context or

trial? 4) Are there bilateral processes which discriminate

between trials (natural syllable trial, synthetic syllable

trial, chirps with speech instructions, chirps with

nonspeech instructions)? 5) Is /b/ differentiated from /d/

in the left hemisphere regardless of trial? 6) Do trials

appear to be discriminated in one hemisphere or the other?

7) Is there any evidence for hemispheric asymmetry in the

perception of vowels?

Finally, an additional purpose of this study is to

explore the effect of two task variables (stimulus

difficulty/required attention and correct judgements) on the

obtained pattern of cortical responses. In order to assess

the importance of these task variables, the data obtained in

this study in response to synthetic syllables and natural

syllables will be analyzed separately from the chirp data.

With the exclusion of the chirp trials and incorrect

syllable perceptions, only stimulus presentations in which

subjects correctly judged /b/ or /d/ will be included when

calculating individual AER's.

The primary hypothesis to be tested in this second

analysis is that /b/ and /d/ will be significantly different

in the left hemisphere, but not the right, for both the

natural and synthetic syllables. Such a finding would

support a theory of stimulus expectation, and reject a

hypothesis that stimulus difficulty/attentional variables

caused the left hemisphere differences noted in the previous


As in the first analysis, a number of secondary

questions will also be considered. These are: 1) Is there

a similar pattern of hemispheric involvement for /b/ vs. /d/

discrimination for both synthetic syllables and natural

syllables? 2) Are there bilateral processes which

differentiate /b/ from /d/, regardless of vowel context or

trial? 3) Are there bilateral processes which differentiate

synthetic and natural syllables? 4) Are there left or right

hemispheric processes which differentiate between the two

types of syllables? 5) Are there bilateral processes which

differentiate vowels regardless of consonant context? 6) Is

there any evidence of hemispheric asymmetry in the

perception of vowels? 7) Are there significant differences

between the AER's from the left and right hemispheres

regardless of consonant, vowel or trial?

Finally, this study will examine subjects' perceptual

responses in the two chirp trials (speech vs. frequency

instructions). Accuracy of response between the two trials


will be compared, both as a main effect and as a function of

the order in which the instructions were presented to

subjects. Error patterns between the two trials will also

be compared as a main effect and as a function of order of

instructions. Finally, subjects' perceptions of their

strategies for discriminating between the two classes of

stimuli in the "speech" trial and the "nonspeech" trial will

be informally compared.



The purpose of this study was to investigate

hemispheric involvement during the perception of

phonemes--specifically, stop consonants--and to evaluate a

theory of a "speech mode" of perception. Cortical responses

were collected from twelve subjects in response to both

synthetic and spoken (natural) /bi, be, bo, di, de, do/, and

to isolated F2-F3 transitions ("chirps"). In one chirp

trial, subjects were instructed to label the stimuli as

beginning with /b/ or /d/; and in a second, they were

instructed to listen for "high" vs. "low" onset frequencies.

Subjects' cortical responses from the left and right

hemispheres were digitized, averaged and normalized on a PDP

11/23 computer. The resulting average evoked responses

(AER's) were later subjected to off-line Principal Component

Analysis. The resulting factor scores were used as

dependent variables in a number of Analyses of Variance in

order to determine if any changes in AER could be related

systematically and significantly to the independent

variables (hemispheres, consonants, vowels or trials).



Stimuli of three types were utilized in this research.

They included synthetic syllables, natural syllables and

"chirps," or isolated F2-F3 transitions.

The synthetic stimuli were six CV syllables, /bi, be,

bo, di, de, do/; each consisted of a 50 ms transition

followed by a 300 ms steady-state segment. These vowel and

transition durations parallel those reported by Cutting

(1974), Molfese (1978a, 1980a) and Molfese and Schmidt

(1983). They were used in this study in order to facilitate

cross-research comparisons. Specific onset values for each

transition and the associated steady-state formant are given

in Table 2-1. Transition onset frequencies were taken from

data presented by Kewley-Port (1982) and Klatt (1980), and

modified as necessary during synthesis in order to achieve

optimal discriminability. Vowel formant frequencies for F1

through F3 were taken from Peterson and Barney's (1952)

data. It will be noted that F4 and F5 are constant across

the entire syllable duration, and are the same for each

vowel. The upper formants were included in order to make

the synthetic syllables sound more natural. For all

syllables, bandwidth of F1 was 60 Hz, for F2, 90 Hz, and for

F3 through F5, 120 Hz (Cutting, 1974; Molfese, 1978a,

1980a; Molfese and Schmidt, 1983). Each syllable had an

associated fundamental frequency of 130 Hz (Peterson and

Barney, 1952), and a rise time of 30 ms.

Table 2-1. Onset and steady-state frequencies for synthetic syllables. [Columns: Syllable, Formant, Onset Frequency (Hz), Steady-state (Hz); the table values did not survive.]

All synthetic syllables were produced by a Klatt

software synthesizer (Klatt, 1980) implemented by a Data

General IV computer, digital to analog converter and

low-pass filter with a cutoff frequency of 5000 Hz.

Stimulus parameters were entered using a Hewlett-Packard

2648A Graphics Terminal. All syllables were recorded on one

channel of a TEAC 6120 dual channel tape recorder. See

Figure 2-1 for the equipment configuration.

The natural syllable stimuli were produced by a male

speaker with clearly identifiable vowel formants and the

ability to modify fundamental frequency upon request.

During production of the stimuli, the speaker was seated in

a double-walled Industrial Acoustics Company (IAC) booth.

Stimuli were recorded using a B&K 5065 half-inch condenser

microphone and a B&K 37A preamplifier, coupled with a Revox

B-77 tape recorder. First, the speaker produced each

syllable five times. Each of the recorded syllables was

then examined on a Voiceprint Model 700 t-f-a spectrograph

for vowel and transition durations, and clarity and

stability of formant structure. At this juncture, the best

two or three examples of each syllable were modified by

eliminating prevoicing of the consonant and by reducing

vowel duration to conform as closely as possible to the

synthetic stimuli (50 ms transitions, 300 ms vowel

durations). Mean transition duration of the selected

syllables was calculated to be 48 ms (range: 24-72 ms).

Mean vowel duration was found to be 292 ms (range: 262-304

ms). Finally, the rise time of each syllable was calculated

from the output of a Honeywell 1508 A Visicorder, and the

exemplars with a rise time most closely approximating 30 ms

were selected for inclusion as natural syllable stimuli.

Mean rise time of the selected syllables was 38.1 ms (range:

31.2-41.6 ms). Actual onset (transition) and steady-state

(vowel) frequencies for the first three formants of the

natural syllable stimuli are provided in Table 2-2.

Figure 2-1. Equipment configuration for generation of synthetic speech.

The "chirps," or isolated F2-F3 formant transitions,

were utilized as ambiguous stimuli in this study. They were

selected because such brief-duration signals do not sound at

all like "speech" to naive listeners, and most probably

would not be perceived as speech without special

instructions. Thus, stimulus expectation could be

controlled through instructions to the subject.

The chirp stimuli were taken from their respective

synthesized complete syllables. When the program for each

syllable was run, only the first 50 ms (the transition

portion) was activated. A digital filter developed by

J. J. Yea at the University of Florida was utilized in order

to eliminate fundamental frequency, F1 and F4-F5. The

isolated F2-F3 transitions, or "chirps," for each of the six

syllables were recorded in the manner described above.

Table 2-2. Onset and steady-state frequencies for natural syllables. [Columns: Syllable, Formant, Onset Frequency (Hz), Steady-state (Hz); the table values did not survive.]

Tape Construction

Three stimulus tapes were constructed for presentation

during the experimental sessions. One contained the

synthetic syllables, one the natural syllables and a third

the chirps (F2-F3 transitions in isolation). Each of the

six syllable (or chirp) stimuli was repeated 20 times in

random order on each tape, for a total of 120 stimuli per

tape. A random numbers table was utilized in establishing

the stimulus sequence. The original order was maintained on

all three tapes due to program limitations.

The specific recording procedures for each tape were as

follows: the 120 stimuli were recorded in the specified

sequence on both channels of an Akai GX-77 tape recorder

from the master tape played on a Revox B-77 tape recorder.

Inter-stimulus intervals were varied from two to nine

seconds in order to avoid eliciting a time-locked cortical

expectancy response from the subjects. Maximum amplitude of

each syllable or chirp was monitored on the VU meter of the

Akai tape recorder, and adjusted prior to recording so that

all stimuli peaked at 0 VU.
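The stimulus sequence described above can be illustrated with a brief sketch. It is an illustration only: the original order was drawn from a random numbers table rather than generated by program, and the uniform distribution of inter-stimulus intervals, the function name `build_tape_sequence`, and the ASCII syllable labels are assumptions made here.

```python
import random

# ASCII stand-ins for the six stimuli; the labels are illustrative only.
STIMULI = ["bi", "be", "bo", "di", "de", "do"]


def build_tape_sequence(stimuli, repetitions=20, min_isi_s=2.0,
                        max_isi_s=9.0, seed=None):
    """Return a list of (stimulus, inter-stimulus interval in seconds)."""
    rng = random.Random(seed)
    # Each stimulus appears `repetitions` times, then the list is shuffled.
    order = [s for s in stimuli for _ in range(repetitions)]
    rng.shuffle(order)
    # Assumption: intervals are drawn uniformly between the stated limits.
    return [(s, rng.uniform(min_isi_s, max_isi_s)) for s in order]


sequence = build_tape_sequence(STIMULI, seed=1)
```

With six stimuli and 20 repetitions, the sketch yields the 120 items per tape reported above.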


Subjects were 12 young adults--six males and six

females--aged 23-33 years. The mean age for male subjects

was 28.2 years with a range of 23.3 to 32.8 years; and mean

age for females was 27.8 with a range of 23.4 to 30.7 years.

All subjects were majoring/employed in the fields of

experimental phonetics or speech pathology, and participated

in this study at the request of the experimenter.

In the first selection protocol, subjects were required

to demonstrate pure tone thresholds of better than 20 dB at

0.5, 1.0, 2.0, 4.0, and 8.0 kHz, with a mean between-ear

threshold difference of less than 5 dB. In addition, any

potential subject with a 10 dB or greater between-ear

threshold difference at any single frequency was rejected.

These criteria were included in order to eliminate potential

asymmetric hemispheric effects due to failure to control for

differences in peripheral sensation level. Subjects also

were selected on the basis of a strong right-hand preference

as measured by the Edinburgh Inventory of Handedness, or EIH

(Oldfield, 1971). This second selection protocol was

included because previous research had indicated that there

might be some interaction between hand preference and

cortical responses to syllables (Molfese, 1978a). The study

of such an interaction is undoubtedly important for refining

and generalizing theories of speech perception. However, an

investigation of individual differences in perceptual

asymmetries as a function of handedness is not the focus of

this research. Subjects were limited to those demonstrating

a strong right-hand preference with the assumption that such

an effect reflects a dominant left hemisphere. After a

model of hemispheric involvement in speech perception has

been established for this population, modifications of the

model may be added through further research involving

sinistrals, ambidextrals and other groups of questionable

cerebral dominance. In the present study, the average

laterality quotient on the EIH was 92.8 (range: 83-100),

with a mean decile of 8.21 (range: 6-10). These scores

suggest a strong right-hand preference in the subjects

utilized in the experiment. As a final step in the

selection process, a screening test of the synthetic stimuli

was presented. A tape was played, containing ten randomized

samples of each syllable; and subjects indicated which

consonant they heard at the beginning of each stimulus item.

This protocol was included in order to ensure that these

stimuli were perceived correctly. A score of 95% or better

on the 60-item screening test was required in order for

volunteer subjects to be included in the experiment.

Subjects were allowed up to three attempts to pass the test.

On the final trial, mean percent correct was 98.5% with a

range of 97-100%.
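The audiometric and syllable screening criteria stated above can be summarized in a short sketch. The function names are hypothetical, and one reading of the text is assumed: the "mean between-ear threshold difference" is taken as the mean of the absolute left-right differences across the five test frequencies.

```python
TEST_FREQUENCIES_KHZ = (0.5, 1.0, 2.0, 4.0, 8.0)


def passes_hearing_screen(left_db, right_db):
    """Apply the pure tone criteria to thresholds given in dB, one per frequency."""
    diffs = [abs(l - r) for l, r in zip(left_db, right_db)]
    thresholds_ok = all(t < 20 for t in list(left_db) + list(right_db))
    # Assumption: mean difference = mean of absolute per-frequency differences.
    mean_diff_ok = sum(diffs) / len(diffs) < 5
    # A 10 dB or greater difference at any single frequency is disqualifying.
    single_diff_ok = all(d < 10 for d in diffs)
    return thresholds_ok and mean_diff_ok and single_diff_ok


def passes_syllable_screen(n_correct, n_items=60):
    """A score of 95% or better is required (57 of 60 items)."""
    return n_correct * 100 >= 95 * n_items
```

On the 60-item test, the 95% criterion corresponds to no more than three errors.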


The experimental procedure included a training protocol

prior to presentation of the syllable stimuli, determination

of electrode locations and placement on the subjects' heads,

actual electrocortical recording of the subjects' responses

to the syllable stimuli, a second training protocol prior to

presentation of the chirp stimuli, and electrocortical

recording of subjects' responses to the chirps. This

procedure took approximately five hours for each subject.

Training prior to presentation of syllable stimuli.

Subjects were seated individually in a double-walled IAC

booth and familiarized with the testing environment. At

this juncture, they were instructed in the response

procedure. As has been discussed, it was considered

important to monitor the accuracy of subjects' perceptual

judgements; thus, subjects were required to make an overt

response to each stimulus presented.

The response procedure involved use of a Wollensak 4055

battery-powered tape recorder coupled to a microphone.

Subjects were instructed to hold this microphone in a

comfortable position such that the index finger of one hand

rested on it while the fourth finger of either hand did not

make contact with the microphone in any way. When subjects

perceived the first phoneme of the syllable as /b/, they

were instructed to gently raise and lower the index fingers

of both hands. Subjects were further instructed to respond

to syllables beginning with /d/ by gently raising and

lowering the fourth fingers of both hands. A bilateral

motoric response was judged necessary in order to eliminate

potential hemispheric asymmetry associated with a unilateral

response. This procedure resulted in a sound being recorded

on the Wollensak in response to stimuli perceived as /b/,

and no sound in response to stimuli perceived as /d/. This

code was later utilized by the experimenter in determining

correct and incorrect perceptual responses.

At this time, the screening test for synthetic

syllables was administered. If potential subjects exhibited

more than three errors in the 60 trials, they were permitted

to listen to the training tape a second time, then took the

test again. If they failed the screening test a second

time, they could choose to terminate their participation in

the study or return a third time for a last attempt.

A 60-item screening procedure for the natural stimuli

was also administered in order to familiarize subjects with

these experimental syllables. As in the synthetic syllable

screening procedure, if any subject had been unable to

achieve a score of 95%, they would have been eliminated from

further participation. However, these stimuli were not

difficult to discriminate, and no subject exhibited any

difficulty whatsoever with this set of protocols.

Electrode placement. The active electrode sites chosen

for this study were T3 and T4 as described in Jasper's

(1958) "10-20 Electrode System." These locations were chosen

because they (theoretically) overlie the left and right

posterior superior temporal gyri, areas of the brain

associated with primary auditory reception (Penfield and

Roberts, 1959). Additionally, recent AER research has shown

that right and left hemisphere differences can be observed

at those locations (Molfese, 1978a; Molfese, 1980a;

Molfese and Schmidt, 1983; Wood, 1975). Finally, use of

standardized electrode placements within the 10-20 System

was judged desirable in order to facilitate interlaboratory

comparisons of data.

In the 10-20 System, recording sites are located either

10% or 20% of the distance between several standard

reference points for measurement. These standard points are

the nasion, or bridge of the nose; the inion, or occipital

protuberance; the left and right aural clefts (A1 and A2

respectively); and CZ, the intersection of a line drawn

from the nasion to the inion with another from A1 to A2.

The T3 location as described by Jasper (1958) is 10% of the

distance from A1 to A2 as measured upward along a line from

A1 to CZ. The T4 location was measured the same way except

from A2 to CZ (see Figure 2-2). These locations are

designated "T" because they are assumed to overlie the

temporal lobe (anatomical studies presented in Jasper, 1958,

support this assumption). The "T3" location denotes the

left hemisphere, as all odd numbers are on the left side of

the head, while the "T4" location denotes the corresponding

point on the right hemisphere.
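The measurement rule for T3 and T4 amounts to a single proportion, shown here as a brief worked example. The 36 cm distance is a hypothetical head measurement, and the function name is an assumption introduced for illustration.

```python
def temporal_site_offset_cm(a1_to_a2_cm, fraction=0.10):
    """Distance to measure upward from the aural cleft toward CZ."""
    return fraction * a1_to_a2_cm


# Hypothetical subject whose A1-to-A2 distance is 36 cm.
offset_cm = temporal_site_offset_cm(36.0)
```

For this hypothetical subject, T3 would lie about 3.6 cm above A1 on the line toward CZ, and T4 the same distance above A2.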

The active electrodes (T3 and T4) were referenced to

contralateral earlobes. These inactive sites were selected

because there is little muscle tissue in that area to

generate EMG artifacts, and they are less subject to picking

up temporal lobe activity than mastoid sites (Goff, 1974).



Figure 2-2. Electrode configuration. Electrocortical responses were recorded at T3 and T4. The EOG electrodes recorded eyeblinks and facial movements for input to an artifact rejection channel.

Additionally, one electrode was placed above the inside

corner of the subject's right eye while another was placed

at the lateral superior aspect of the right orbital ridge.

These electrodes were used to record extraocular eye

movements (electrooculogram, or EOG) and blinks, for an

artifact rejection channel. Finally, one electrode was

placed on the left mastoid process to serve as a grounding

electrode.


Once the seven electrode sites were located and marked,

the skin at each was thoroughly cleaned with a cotton swab

dipped in alcohol. This type of cleansing is necessary to

remove any skin oils or dead epithelial cells which reduce

electrical conductivity. Grass E6SH chlorided silver

electrodes were then filled with paste (Grass EC-2) and

attached to the subject's head with surgical tape. This

type of electrode has been recommended for recording "slow"

electroencephalographic (EEG) waves because of the

resistance of this combined substance (silver-silver

chloride) to polarization (Goff, 1974).

Next, resistance in kOhms was measured between 1) T3

and the right earlobe, 2) T4 and the left earlobe, and 3)

the two EOG (facial) electrodes, both prior to the recording

session and at its conclusion. For the active electrodes,

resistances were as follows: for T3, average initial

resistance was 3.82 kOhms (range: 1.1-8.9 kOhms); for T4,

average initial resistance was 4.03 kOhms (range: 1.5-6.4

kOhms). Final resistances measured at the end of each

recording session averaged 4.39 kOhms for T3

(range: 1.3-11.1 kOhms) and 4.59 kOhms for T4

(range: 1.7-7.8 kOhms).

Training and presentation procedures for syllable

stimuli. Subjects reclined on a bed in a comfortable

position in a double-walled, electrically shielded IAC booth

during the electrocortical recording protocol. They were

instructed to keep their eyes closed, jaws relaxed and move

as little as possible during the stimulus presentation in

order to minimize movement artifacts. Subjects were

provided the microphone of a small battery-powered tape

recorder and reminded to raise both index fingers if the

stimulus item began with /b/ and both fourth fingers if

the stimulus item began with /d/, as they had done during

the training procedures. Presentations were

counterbalanced; that is, the synthetic stimuli were

presented first to half the subjects, and the natural

syllables first for the other half. All subjects were

permitted a short break following the presentation of the

first set of syllable stimuli, and electrode resistances

were checked. This procedure was carried out in order to

ensure that the electrodes were still properly attached and

in good contact with the scalp. Subjects then returned to

the booth for the second set of syllable stimuli.

Training and presentation procedures for the chirp

stimuli. At the conclusion of the second syllable trial,

subjects were again given a short break and the electrode

resistances tested. At this juncture, subjects were

presented stimuli from a second training tape in order to

familiarize them with the chirp stimuli. Half the subjects

were first instructed that the chirps were frequency glides,

and they were to discriminate high vs. low onset

frequencies; while the other half of the subjects were

first instructed that the chirps were parts of syllables and

they were to discriminate /b/ from /d/. It should be

emphasized that the stimuli in both trials were exactly the

same; only the instructions varied. All subjects were

given both instruction conditions, with order of

presentation balanced across subjects. After the first

chirp trial, subjects were given a short break, electrode

resistances were checked, and the second training tape was

played. They then returned to the booth for the last trial.

At the end of the session, electrode resistances were

measured one final time.

Electrocortical recording. The procedures followed

during electrocortical recording and stimulus presentation

were as follows: stimuli from the right channel of an Akai

GX-77 tape recorder were played through a Kenwood KA-7100

amplifier outside the booth to an ADS 810 speaker located

inside the booth, at an intensity level of 62 dB re: 0.0002

dyne/cm² at the subject's ear. The speaker was positioned

approximately 78 inches directly in front of the subject.

The syllables or chirps on the left channel of the stimulus

tape were input directly to a Schmidt trigger, which

produced a 4 V pulse at the onset of each syllable. This

pulse was utilized to synchronize stimulus onsets during the

averaging procedure.

During the cortical site recordings, two Grass 7P122A

Low Level DC Amplifiers switched to AC settings were used.

Bandpass was flat (half amplitude) from .04 Hz to 60 Hz.

This bandpass setting ensured that frequencies from .3 Hz to

35 Hz would be amplified at 100% of maximum gain. Such a

range was desired in order to maximally amplify all

frequencies which might be associated with syllable

discrimination, while attenuating the very slow (DC)

potentials associated with the contingent negative variation

and the very high frequencies associated with electrical

interference. System gain was set at 28k, in order to

amplify the raw EEG wave to +/- 1.25 V, the optimal range

for input to the A/D converter.
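As a rough check on the gain setting, the input amplitude that would just fill the stated A/D range can be computed. The sketch assumes that "28k" denotes a voltage gain of 28,000; the names used are illustrative.

```python
SYSTEM_GAIN = 28_000   # assumption: "28k" read as a voltage gain of 28,000
MAX_OUTPUT_V = 1.25    # optimal A/D input range is +/- 1.25 V


def input_amplitude_uv(output_v, gain=SYSTEM_GAIN):
    """Convert an amplified voltage back to the input level in microvolts."""
    return output_v / gain * 1e6


max_input_uv = input_amplitude_uv(MAX_OUTPUT_V)
```

Under that assumption, raw EEG excursions of roughly 45 microvolts in either direction would span the full converter range.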

For the EOG (facial) electrodes, a Grass 7P3B AC

Preamplifier coupled with a Grass 7DAF DC Driver Amplifier

was used. Bandpass was flat (half amplitude) from .3 Hz to

75 Hz, with gain set at 11k. Because the data from this

channel served only for artifact rejection purposes,

bandpass and gain settings were less crucial. Gain was

determined during a pilot study such that eyeblinks and

facial movements resulted in amplified potentials which did

not exceed the limits of the amplifier, but were measurably

greater than ongoing facial electrical activity.

Analog to digital conversion. Immediately following

amplification, the unprocessed EEG waves from the three

amplifiers were input to three channels of an analog to

digital (A/D) conversion device and Digital Equipment

Corporation PDP 11/23 computer. The Schmidt trigger pulse

was fed into a fourth channel of the A/D board. At the

occurrence of each pulse, corresponding to the onset of each

stimulus item, the electrocortical waves in channels one

through three were digitized at the rate of 200 Hz (one

sampling every 5 ms) for a period of 500 ms, resulting in

100 voltage values per wave. These digitized waves were

stored on hard disk for later averaging on the PDP 11/23.

Channel three, which received input from electrodes placed

near the eye, was utilized as an artifact rejection channel.

When the absolute voltage of channel three exceeded 1.8 V,

indicating an eyeblink or facial muscle movement, the data

on channels one and two were dropped, and thus were not

available for later averaging. See Figure 2-3 for the

equipment configuration.
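The sampling and artifact rejection rules just described can be sketched as follows. This is an illustration rather than the program run on the PDP 11/23; the function names are hypothetical, and the first sample is assumed to fall at stimulus onset (0 ms).

```python
SAMPLE_RATE_HZ = 200
EPOCH_MS = 500
SAMPLES_PER_EPOCH = SAMPLE_RATE_HZ * EPOCH_MS // 1000  # 100 values per wave
ARTIFACT_LIMIT_V = 1.8


def sample_times_ms():
    """Sample latencies in ms, assuming the first sample is taken at onset."""
    step_ms = 1000 // SAMPLE_RATE_HZ  # one sample every 5 ms
    return [i * step_ms for i in range(SAMPLES_PER_EPOCH)]


def keep_epoch(eog_channel_v):
    """Keep a trial only if channel three never exceeds the 1.8 V limit."""
    return all(abs(v) <= ARTIFACT_LIMIT_V for v in eog_channel_v)
```

A trial failing `keep_epoch` would have its channel one and channel two data discarded before averaging.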

Figure 2-3. Equipment configuration for stimulus presentation during the electrocortical recording procedure.

Preliminary Data Analysis

Extraction of AER's. The subject's perceptual

responses were scored at the conclusion of each experimental

session. At this juncture, a selective averaging program

was utilized; it first allowed the digitized response waves

associated with incorrect responses to be excluded from the

averaging process. This procedure was carried out for both

the natural and synthetic syllable trials, on the assumption

that it would maximize differences between AER's associated

with /b/ and /d/. If more than five responses for any

syllable had to be eliminated, due to incorrect perception

and/or muscle artifacts, the subject was tested on that

particular trial a second time. For the synthetic syllable

trial and the natural speech trial, an average of 19.5

responses and 18.7 responses per syllable, respectively,

were available as a basis for obtaining each AER.

Subjects incorrectly identified approximately half the

chirps in the two chirp trials, so elimination of all

incorrect responses was not possible. As a result, all

chirp responses not contaminated with muscle artifacts were

included in the averaging process. For the chirp trials, an

average of 19.7 responses per syllable was utilized in

obtaining each AER. The procedures described above resulted

ultimately in 576 separate AER's based on 12 subjects, 4

trials, 2 consonants, 3 vowels and 2 hemispheres.
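The selective averaging step and the resulting AER count can be illustrated with a minimal sketch. The function name and data layout are assumptions; the original averaging program is not reproduced here.

```python
def average_evoked_response(epochs, correct_flags):
    """Average point by point over the trials flagged as correct."""
    kept = [e for e, ok in zip(epochs, correct_flags) if ok]
    if not kept:
        raise ValueError("no correct trials available for averaging")
    n = len(kept)
    return [sum(e[i] for e in kept) / n for i in range(len(kept[0]))]


# One AER per subject x trial x consonant x vowel x hemisphere.
N_AERS = 12 * 4 * 2 * 3 * 2

# Toy two-point epochs; the third trial is an incorrect judgement.
toy_aer = average_evoked_response(
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [True, True, False]
)
```

For the chirp trials, where incorrect responses could not be excluded, every flag would simply be set to true for trials free of muscle artifact.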

Normalization of AER's. Due to equipment limitations,

precise calibration of the biological amplifiers was not

possible. They were calibrated prior to each use with a 1

mV square wave pulse, but calibration in microvolts could

not be accomplished because measuring devices of adequate

sensitivity were not available. Two methods were employed

to compensate for this limitation and ensure that the

amplifiers associated with T3 and T4 were equivalent.

First, the two 7P122A amplifiers were balanced over

hemispheres and conditions so that for each condition, half

the responses from a particular hemisphere were amplified by

one amplifier and half by the other. A second manner in

which potential amplifier differences were eliminated from

the data was by normalizing each AER. This process was

carried out by converting each of the 100 voltage points to

Z-scores--a procedure which involves subtracting the mean of

all points of a particular wave from each individual point

and dividing by the standard deviation (i.e.,

Z=(x-MEAN)/STANDARD DEVIATION). This procedure had the

effect of aligning all the AER's along a common baseline and

equalizing peak amplitudes. Once normalized, the entire

data set was sent via modem to an Amdahl 470 V/6-II computer

in the University of Florida's Northeast Regional Data

Center for statistical processing.
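The normalization formula given above can be sketched in a few lines. Whether the population or the sample standard deviation was used is not stated; the population form is assumed here, and the function name is illustrative.

```python
import statistics


def normalize_aer(wave):
    """Convert each voltage point of one AER to a Z-score."""
    mean = statistics.fmean(wave)
    sd = statistics.pstdev(wave)  # assumption: population standard deviation
    return [(x - mean) / sd for x in wave]


# Two toy waves that differ only by amplifier gain and DC offset.
wave_a = [0.0, 1.0, 4.0, 2.0, -1.0]
wave_b = [2.5 * x + 0.3 for x in wave_a]
z_a = normalize_aer(wave_a)
z_b = normalize_aer(wave_b)
```

In this toy case the two normalized waves coincide, which is the sense in which the procedure removes amplifier differences in gain and baseline.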

Analysis of the average evoked responses. Analysis of

waveforms comprising AER's traditionally has been a

difficult task, due to the complexity of the response.

Thus, the question arises: what procedures can be used to

measure a waveform of this type? Several methods have been

frequently utilized by AER researchers, including various

types of peak measurement and area analysis.

Peak analysis is based on the assumption that it is

only necessary to measure the waveform at a limited number

of points in relating electrophysiological response with

cognitive variables. Although peak analysis of individual

responses is intuitively appealing and does not require

sophisticated computer interface, it has a number of

disadvantages. First, peak identification is dependent on

experimenter interpretation. Due to variations in latency,

and the frequent presence of several peaks at the desired

latency, the experimenter must often make subjective

decisions relative to the precise point at which measurement

should be made. Second, there may be a large number of

individual AER's to be analyzed, depending on the number of

subjects and independent variables. Since peak measurements

are made by hand, the time and effort required for this type

of analysis may be prohibitive for the more complex

experimental designs. A third serious disadvantage with

this technique is the necessary assumption that the peaks

observed in the waveform are independent, and not caused by

some single underlying process. Finally, additional

technical problems, such as reliable estimates of baseline

and spurious values at the point being measured, reduce the

utility of this approach.

Some experimenters, such as Wood (1975), have utilized

peak analysis on grand mean AER's rather than on individual

waveforms. A "grand mean AER" is a composite waveform

derived from averaging responses over all subjects for a

particular experimental condition. This technique has the

advantage of producing smooth waveforms with easily defined

peaks, since a large number of individual AER's are generally

averaged in calculating each grand mean AER. Further, the

averaging can be done by computer, and results in several

composite AER's rather than hundreds of individual

waveforms, thus simplifying the final peak and latency

measurements which are done by hand. However, the problem

of reliable baseline estimates remains. In addition,

comparisons between group averaged AER's do not take into

account inter-subject variability; thus, comparing peaks

(or all points comprising a wave, as Wood (1975) did) for

significant differences may produce inaccurate results due

to large variances in the data.

Area measurements overcome some of the disadvantages of

peak analysis, but this technique also is somewhat limited.

In this case, amplitude measures within a latency range of

interest are integrated; hence the measure is less

subjective than peak measurement and less subject to

spurious values. However, a number of disadvantages exist.

It is not possible to specify the underlying components

present in the wave, and how these components may relate to

multidimensional experimental variables. Further,

integration limits must often be set arbitrarily because the

experimenter does not know the location of the underlying

components, and baseline estimates continue to be a problem.
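The two traditional measures discussed above can be contrasted in a short sketch. It assumes a zero baseline, a 5 ms sampling interval with the first sample at 0 ms, and a "peak" defined as the most positive value in the window; all three are simplifications, and the function names are hypothetical.

```python
def window_indices(n_samples, start_ms, end_ms, step_ms=5):
    """Indices of samples whose latency falls inside the window."""
    return [i for i in range(n_samples) if start_ms <= i * step_ms <= end_ms]


def peak_in_window(wave, start_ms, end_ms, step_ms=5):
    """Return (amplitude, latency in ms) of the most positive point."""
    idx = window_indices(len(wave), start_ms, end_ms, step_ms)
    best = max(idx, key=lambda i: wave[i])
    return wave[best], best * step_ms


def area_in_window(wave, start_ms, end_ms, step_ms=5):
    """Trapezoidal integral of amplitude over the latency window."""
    idx = window_indices(len(wave), start_ms, end_ms, step_ms)
    return sum((wave[a] + wave[b]) / 2 * step_ms for a, b in zip(idx, idx[1:]))


toy_wave = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0]
```

Both functions require the experimenter to choose the window limits, which is the arbitrariness noted above.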

The concept of "underlying components" is an important

one when considering AER measurement techniques. According

to Donchin, Ritter and McCallum (1978), most researchers

consider the individual peaks comprising their observed

waveforms as "components." However, as Donchin et

al. (1978) point out, it is more probable that the observed

AER waveform is the sum of a number of underlying "component

waves," which occur both sequentially (in serial) and

simultaneously (in parallel). These authors define

components as reflecting "the activity of . . .

functionally distinct neuronal aggregates" (pg. 5). Thus,

"components" are hypothesized to represent specific neural

processes which occur in response to particular aspects of a


In any case, it is possible that these cited component

waves vary reliably as a function of experimental

manipulations, and result in a more complete description of

cognitive processing than peak or area analysis reveals.

Chapman, McCrary, Bragdon and Chapman (1979) furnished

support for this theory, by relating underlying components

extracted through Principal Component Analysis to various

aspects of information-processing tasks. Their results

revealed two components which correlated with previously

identified surface phenomena, the contingent negative

variation (CNV) and a late positive peak (P300). These two

features were associated with expectancy of relevant stimuli

and the presentation of relevant stimuli, respectively.

However, Chapman et al. were also able to isolate additional

AER components correlating with other processing tasks which

had not been previously noted. Thus, it appeared that

Principal Component Analysis allowed a more complex analysis

of the effects of experimental variables than traditional

measurement techniques. For this reason, Principal

Component Analysis (PCA) was chosen for use in the present


According to Donchin and Heffley (1978), there are

several disadvantages to be considered in applying PCA to

AER research. First, it is not intuitively obvious how the

PCA values relate to the original waveforms, and the

experimental results may be difficult to interpret. A more

serious flaw in terms of data analysis is that PCA is not

resistant to artifacts created by variations in peak

latency. Amplitude differences at a particular latency are

treated as if all waves peaked at the same point in time,

which may or may not be the case. Potentially, this

disadvantage is overcome by use of careful recording

techniques, by examination of the data prior to PCA

application and by adjustment of latencies if necessary.

Chapman et al. (1979) did not appear to consider this

latency variation a problem in extracting components and

reconstructing original AER's, and other researchers using

this technique have not mentioned latency variation as a

problem prior to analysis or a confounding factor post hoc

(Donchin et al., 1978; Molfese, 1978a; Molfese, 1978b;

Molfese, 1980a; Molfese and Schmidt, 1983).

An additional consideration when applying this type of

analysis is the lack of physiological evidence to support

the validity of components. Although "neuronal aggregates"

have been hypothesized as the source of these factors, such

structures have not been isolated in the cortex. PCA is a

mathematically parsimonious procedure, which isolates

components solely on the basis of correlations, axis

rotations, and other formulae. Proponents of PCA, such as

Donchin et al. (1978), would argue that the strong

relationship between a component and an experimental

variable can furnish important information about cognitive

processing, regardless of the source of the component. That

point of view is adhered to in this study.

The Principal Component Analysis Procedure

The AER waveform can be conceptualized as a series of

voltage measurements; and the "variables" in PCA are these

voltage values. The number of variables in any given study

is determined by digitization rate of the computer and

duration of the averaging epoch. For example, in the

present study, the sampling rate was 200 Hz for 500 ms,

resulting in 100 voltage values, or variables, for each of

the 576 AER's. Thus, each waveform was represented as a

series of 100 discrete numbers.
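The arithmetic behind the number of variables can be sketched as follows. This is a minimal illustration only; the function name `samples_per_epoch` is an assumption introduced here and does not appear in the original analysis.

```python
# Hypothetical helper: number of voltage samples (PCA variables) per epoch.
def samples_per_epoch(sampling_rate_hz: float, epoch_ms: float) -> int:
    # Samples = rate (per second) x epoch duration (in seconds).
    return round(sampling_rate_hz * epoch_ms / 1000.0)

# Values stated in the text: 200 Hz sampling over a 500 ms epoch.
n_variables = samples_per_epoch(200, 500)
print(n_variables)  # 100
```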

Calculating the Centroid

The first step in PCA is to average each variable over

all AER's in the data set. This procedure results in a

grand mean AER known as the centroid. In turn, the centroid

reflects the average voltage value at each time point for

all AER's. This measure is used as the basis for factor


Matrix Construction

The next step in PCA is to construct a matrix in which

all the voltage values are correlated with each other. If

the raw data are used, this matrix is referred to as a

"cross-products" matrix. In this case, the total variance

of the data set is analyzed. Alternatively, a "covariance

matrix" can be used, in which the mean of all the voltage

values at a particular time point is subtracted from each

original AER at that time point, prior to computing the

matrix. This procedure has the effect of removing that

portion of the variance due to differences in means.

Finally, a "correlation matrix" may be used, in which the

mean of all values at a certain time point is subtracted

from each original AER (as in the covariance matrix) and the

remainder is divided by the standard deviation of all

voltage values at that particular time point. The result of

this treatment is to normalize peak amplitudes over all


According to Donchin and Heffley (1978), use of the

covariance matrix is most desirable in AER research. The

cross-products matrix may result in components related more

to subject variability than to experimental manipulation,

and the correlation matrix may give too much weight (due to

normalization) to small, unreliable differences in

waveforms. Analysis of the covariance matrix is based on

the difference between an individual AER and the grand mean,

and this is most useful when an analysis of the effects of

experimental manipulations across subjects is planned.
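The centroid and the three matrix types described above can be sketched with NumPy. This is an illustrative reconstruction under stated assumptions, not the program used in the study: the function name, the toy data, and the scaling of the cross-products matrix by the number of waveforms are all assumptions.

```python
import numpy as np

def association_matrices(aers):
    """aers: array of shape (n_waveforms, n_timepoints)."""
    x = np.asarray(aers, dtype=float)
    n = x.shape[0]
    centroid = x.mean(axis=0)            # grand mean AER at each time point
    centered = x - centroid              # remove the time-point means
    std = centered.std(axis=0, ddof=1)   # per-time-point standard deviation
    z = centered / std                   # normalized values
    cross_products = x.T @ x / n         # raw data (scaling is an assumption)
    covariance = centered.T @ centered / (n - 1)
    correlation = z.T @ z / (n - 1)
    return centroid, cross_products, covariance, correlation

# Toy data only: 20 made-up "waveforms" with 5 time points each.
rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 5))
centroid, cross_products, covariance, correlation = association_matrices(demo)
```

The covariance matrix differs from the cross-products matrix only in that the centroid is subtracted first, which is the property the text singles out as desirable for comparisons across subjects.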

Extraction of Principal Components

Following the calculation of the centroid and the

construction of the matrix, the next step in PCA is to

extract the principal components or factors. (In this

study, the terms "components" and "factors" will be used

interchangeably, as they are in current AER literature;

however, according to Donchin and Heffley, 1978, the label

"component" is correct). Factor extraction involves

reduction of the variables in the matrix to a predetermined

number of linear combinations, or factor loadings, which

account for the most possible variance in the data. Each

factor loading consists of n coefficients corresponding to

the original time points, and reflects the influence of each

factor (component) on that time point.
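A minimal sketch of the extraction step follows, assuming that the components are obtained by eigendecomposition of the covariance matrix and that loadings are eigenvectors scaled by the square roots of their eigenvalues. The function name and toy data are hypothetical; the original analysis used a packaged program whose internals are not described here.

```python
import numpy as np

def extract_components(cov, n_factors):
    """Return unrotated loadings, eigenvalues, and proportion of variance."""
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]   # largest eigenvalues first
    eigvals = np.clip(eigvals[order], 0.0, None)    # guard against round-off
    eigvecs = eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)           # (n_timepoints, n_factors)
    explained = eigvals.sum() / np.trace(cov)       # share of total variance
    return loadings, eigvals, explained

# Toy data only: 30 made-up waveforms with 6 time points each.
rng = np.random.default_rng(0)
toy = rng.normal(size=(30, 6))
toy_cov = np.cov(toy, rowvar=False)
loadings, eigvals, explained = extract_components(toy_cov, n_factors=2)
```

Each column of `loadings` holds one coefficient per time point, matching the description of a factor loading as n coefficients.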

Uncorrelating the Factors

The next step in PCA is to rotate these factor loadings

in order to maximize orthogonality. When attempting to

relate underlying components (or factors) to experimental

variables, it is desirable to have each factor as

uncorrelated as possible with other factors. Since the

initial factors extracted from the centroid tend to be

somewhat correlated (due to the sequential nature of the

variables), some type of rotation is necessary in order to

improve orthogonality. Varimax rotation (Kaiser, 1958) is

traditionally used. The result of this procedure is to

concentrate the high loadings for each factor within a given

time range, thus producing distinct AER components.
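The rotation step can be illustrated with a commonly used iterative formulation of the varimax criterion based on the singular value decomposition. This is a sketch under assumptions: it applies raw varimax without Kaiser normalization, and the function name, tolerance, and toy loadings are hypothetical rather than taken from the study.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (n_timepoints, n_factors) loading matrix; returns (rotated, R)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)                        # start from the unrotated solution
    criterion = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Gradient-like target matrix for the varimax criterion.
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ target)
        R = u @ vt                       # nearest orthogonal rotation
        new_criterion = s.sum()
        if new_criterion < criterion * (1.0 + tol):
            break                        # no meaningful improvement
        criterion = new_criterion
    return L @ R, R

# Toy loadings only: 6 time points, 3 factors.
rng = np.random.default_rng(0)
toy_loadings = rng.normal(size=(6, 3))
rotated, R = varimax(toy_loadings)
```

Because `R` is an orthogonal matrix, the rotation redistributes variance among the factors without changing the total variance accounted for at each time point.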

Derivation of the Factor Scores

The final step in PCA is to transform the original

AER's to the new, rotated axes. This transformation is

accomplished by multiplying each original AER by a

coefficient vector derived from the rotated factor loadings.

A number of factor scores, equal to the number of factors,

is the result of this process. These factor scores

represent a measure of the magnitude of a specific factor in

a particular AER. Factor scores (for the factor being

analyzed) can then be averaged over experimental conditions

to yield a mean factor score, which in turn can be utilized


as the dependent variable in an Analysis of Variance. In

this way, the effect of experimental manipulations on

electrocortical activity can be assessed.
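The derivation of factor scores and their averaging over conditions can be sketched as below. The text does not specify how the coefficient vectors were derived, so the least-squares weights used here are an assumption, as are the function names, the toy data, and the toy condition labels.

```python
import numpy as np

def factor_scores(aers, loadings):
    """Project mean-centered AERs onto the factors (least-squares weights)."""
    x = np.asarray(aers, dtype=float)
    centered = x - x.mean(axis=0)                               # subtract centroid
    weights = loadings @ np.linalg.inv(loadings.T @ loadings)   # assumed coefficients
    return centered @ weights                                   # (n_waveforms, n_factors)

def mean_scores_by_condition(scores, labels):
    """Average the factor scores within each level of one variable."""
    labels = np.asarray(labels)
    return {str(lab): scores[labels == lab].mean(axis=0) for lab in np.unique(labels)}

# Toy data only: 24 made-up waveforms, 6 time points, 2 unrotated factors.
rng = np.random.default_rng(1)
toy_aers = rng.normal(size=(24, 6))
cov = np.cov(toy_aers, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top = np.argsort(eigvals)[::-1][:2]
toy_loadings = eigvecs[:, top] * np.sqrt(eigvals[top])

scores = factor_scores(toy_aers, toy_loadings)
consonant = ["b", "d"] * 12              # hypothetical condition labels
means = mean_scores_by_condition(scores, consonant)
```

The resulting mean factor scores per condition are the kind of values that would serve as the dependent variable in each ANOVA.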

In this study, a separate Analysis of Variance (ANOVA)

was calculated for each factor. Mean factor scores were

compared between levels of the independent variables

Consonant, Vowel, Hemisphere and Trial in each ANOVA.

Following this assessment of main effects, variables were

compared in all possible combinations for two-, three-, and

four-way interactions.


Preliminary AER Data Analysis

The electrocortical recording procedure utilized in

this research resulted in a total of 576 separate AER's.

This value was obtained from 12 subjects responding to two

consonants in combination with three vowels from both

hemispheres in four separate trials (12 x 2 x 3 x 2 x 4 =

576). Two examples of the unprocessed AER's are presented

in Figure 3-1. Each waveform is based on approximately 20

repetitions of the syllable /bi/, and each is from a

different subject. The AER's then were normalized, as

described above, and subjected to off-line Principal

Component Analysis (PCA). Finally, the output of this

preliminary statistical procedure was utilized in ten

Analyses of Variance (ANOVA's), and subsequent preplanned

and post hoc comparisons. The PCA and ANOVA procedures were

carried out twice: once on the full data set of AER's, and

a second time on only the AER's associated with the

synthetic and natural syllables. Finally, the perceptual

results of this study were analyzed.
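The count of 576 AER's follows directly from the factorial design, as the short enumeration below illustrates. The vowel labels are placeholders (the three vowels are not spelled out here), and the trial names are paraphrased from the text; both are assumptions made for illustration.

```python
from itertools import product

subjects = range(1, 13)                      # 12 subjects
consonants = ("b", "d")                      # two consonants
vowels = ("V1", "V2", "V3")                  # placeholder labels for three vowels
hemispheres = ("left", "right")
trials = ("synthetic syllables", "natural syllables",
          "speech chirps", "frequency chirps")

# One AER per combination of subject, consonant, vowel, hemisphere, and trial.
design = list(product(subjects, consonants, vowels, hemispheres, trials))
per_subject = len(design) // len(subjects)

print(len(design))    # 576
print(per_subject)    # 48
```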

Figure 3-1. Normalized AER's based on approximately
20 repetitions of the syllable /bi/ from
(a) subject 1 and (b) subject 2. Legend: left hemisphere
and right hemisphere traces. Horizontal axis: time (msec),
0 to 500.

Analysis One: The Full Data Set

The reasons for analyzing the full data set were (in

part, at least) to evaluate the findings of Molfese (1980a)

and Molfese and Schmidt (1983) regarding left hemisphere

differentiation of stop consonants. A further goal was to

provide electrophysiological evidence for the perceptual

changes noted with differences in stimulus expectation

(Schwab, 1981; Nusbaum et al., 1983). Statistical

procedures utilized in investigating these issues were PCA

and ANOVA's, as well as preplanned and post hoc comparisons.

The first step in the PCA was to calculate the

centroid, or average, of all 576 normalized AER's (Dixon,

1981). The centroid is pictured in Figure 3-2. It is

characterized by a small positive peak at 45 ms (P45), a

large negative peak at 120 ms (N120), a large positive peak

at 195 ms (P195), a negative peak at 270 ms (N270), a small

positive peak at 340 ms (P340), followed by a gradual

negative decline asymptoting at 455 ms (N455). This

centroid is very similar in waveshape to the one reported by

Molfese and Schmidt (1983), who showed a P30, N120, P200,

N270, P345, and N450. The main difference occurred in the

final 150 ms of the wave, during which the present study

found a falling configuration while Molfese and Schmidt

(1983) found a level to rising configuration.

Figure 3-2. The centroid, or grand mean auditory
evoked response obtained from the
Principal Components Analysis based
on the full data set (Analysis One).
Horizontal axis: time (msec), 0 to 500.

The next step in the PCA was formation of a 100 x 100

covariance matrix and extraction of the principal components

(or factors). Factors with eigenvalues of one or more were

retained for further analysis (Chapman et al., 1979). This

procedure resulted in 10 factors which accounted for 62.7%

of the variance. Factors then were rotated using a varimax

criterion (Kaiser, 1958) in order to improve orthogonality.

After 14 iterations, the terminal solution was reached. The

rotated factors are pictured in Figure 3-3. These factors,

or component waves, are assumed to underlie the surface

waveshape of the centroid, and to be present to a greater or

lesser extent in each individual AER. Peaks in these factor

waveshapes represent the latency at which a specific factor

affected the centroid, regardless of polarity (Molfese and

Schmidt, 1983). Factor 1 was characterized by a positive

peak at 40 ms, a negative peak at 90 ms, and a major positive

peak at 150 ms. This component influenced the centroid at

P45 and the N120-P195 complex. Factor 2 was characterized

by a positive peak at 75 ms and a small negative peak at 145

ms; it influenced the centroid at the P45-N120 complex.

Factor 3 showed a major peak at 25 ms, a small negative peak

at 85 ms and a positive peak at 120 ms, and also influenced

the P45-N120 complex of the centroid. Factor 4 had several

small peaks throughout its duration and one major peak at

200 ms. This major peak influenced the P195 of the

centroid. In Factor 5, a major positive peak occurred at

330 ms followed by a small positive peak at 440 ms. This

factor probably influenced P340 and the declining latter


Figure 3-3. The ten factors extracted by means of a
Principal Components Analysis based on
the full data set (Analysis One).
(a) Factor 1, (b) Factor 2, (c) Factor 3,
(d) Factor 4, (e) Factor 5, (f) Factor 6,
(g) Factor 7, (h) Factor 8, (i) Factor 9,
(j) Factor 10. Horizontal axis: time (msec), 0 to 500.

part of the centroid. Factor 6 was characterized by several

small peaks, similar to Factor 4, with a major peak at 245

ms. This component probably influenced the P195-N270

portion of the centroid. Factor 7 showed a positive peak at

0 ms, a negative peak at 50 ms, a major positive peak at 115

ms, a small negative peak at 175 ms and a small positive

peak at 220 ms. This factor appeared to have its major

influence at the P0-N120 portion of the centroid. For

Factor 8, a small positive peak at 320 ms followed by a

major positive peak at 420 ms influenced the P340-N455

complex of the centroid. Factor 9 contained a major

positive peak at 290 ms, a small positive peak at 385 ms and

a negative peak at 450 ms and influenced the portion of the

centroid just after N270. And finally, Factor 10 was

characterized by a major positive peak at 375 ms, a negative

peak at 445 ms and a small positive peak at 480 ms,

influencing the final epoch of the centroid.
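The retention rule applied above (keep factors with eigenvalues of one or more, then report the proportion of variance they account for) can be sketched as follows. The eigenvalues in the example are invented for illustration and are not the values obtained in this study; the function name is likewise hypothetical.

```python
def retain_factors(eigenvalues, threshold=1.0):
    """Keep eigenvalues >= threshold; return them with their variance share."""
    ordered = sorted(eigenvalues, reverse=True)
    kept = [ev for ev in ordered if ev >= threshold]
    proportion = sum(kept) / sum(ordered)   # share of total variance retained
    return kept, proportion

# Made-up eigenvalues, for illustration only.
toy_eigenvalues = [4.0, 2.5, 1.0, 0.9, 0.6]
kept, proportion = retain_factors(toy_eigenvalues)
print(len(kept))             # 3
print(round(proportion, 3))  # 0.833
```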

The final step in the PCA was calculation of ten sets

of factor scores for each of the original 576 AER's (based

on the ten extracted factors). Thus, each AER in the data

set was effectively represented by ten factor scores in

place of its original 100 voltage values.

At this point, factor scores for each AER were utilized

as the dependent variables in ten separate ANOVA's (one for

each factor). All possible main effects and interactions

for the independent (classification) variables of Consonant,

Vowel, Hemisphere and Trial were calculated (Dixon, 1981).

In assessing the significance of ANOVA results, a

probability of .05 was chosen, in order to include as many

main effects and interactions as possible while maintaining

a reasonably high level of significance. The .05 level is

appropriate when the data are being explored for significant

trends in new research areas. The .01 level was considered

too stringent, with too great a possibility of rejecting

major effects and interactions (Type II error).

Primary Hypothesis Analysis

The principal question addressed in Analysis One was

whether /b/ and /d/ were differentiated in the left

hemisphere for trials which included both syllable stimuli

and ambiguous stimuli (chirps with speech instructions),

essentially a replication of Molfese (1980a) and Molfese and

Schmidt (1983). Such a finding would support a hypothesis

of left hemisphere involvement in the perception of voiced

stop consonants. In order to test this relationship, the

ten ANOVA's described above were examined for significant

Consonant by Hemisphere by Trial interactions.

The ANOVA of one factor (Factor 9) did indeed reveal a

significant Consonant by Hemisphere by Trial interaction (F

= 3.63, p = .0229, df = 3,33). However, this result is

somewhat ambiguous as the interaction contained 16 mean

factor scores, obtained from two consonants by two

hemispheres by four trials. Although the entire interaction

was found to be significant, it was not apparent which pairs

or combinations of means were significantly different.

Thus, in order to specify significant combinations of mean

factor scores, post hoc testing was necessary.

A t-square Planned Comparison procedure was utilized in

the post hoc analysis of the significant Consonant by

Hemisphere by Trial interaction. Mean factor scores of /b/

in the left hemisphere were averaged over the three trials,

and compared with those associated with /d/. For this test,

a probability level of .01 was chosen, in order to reduce

the possibility of concluding that differences were

significant when in fact they were not (Type I error). This

more conservative level was considered necessary because a

t-square Planned Comparison does not control error rate

simultaneously for multiple comparisons, and thus repeated

tests on the same set of data greatly increase the chances

of a Type I error. Such a statistic is appropriate only for

planned comparisons when ten or fewer comparisons are being

made, at significance levels of .01 or better, according to

Shearer (1982). Results of this comparison revealed that

when /b/ and /d/ were compared in the left hemisphere,

averaged over synthetic syllables, natural syllables and

chirps with speech instructions (speech chirps), differences

between means failed to attain significance at the .01 level

(although F = 7.02, p = .0118). Thus, a hypothesis of left

hemisphere involvement in the perception of voiced stop

consonants was not supported. This result was not

consistent with the findings of the Molfese research for the

consonants /b/ and /g/.
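The reasoning for the stricter criterion can be made concrete with a short calculation. The sketch below assumes independent comparisons, which is a simplification and not a claim about the tests in this study; the function names and the choice of four comparisons are hypothetical. The p values passed to the decision rule are those reported in the text.

```python
def familywise_error(alpha, n_comparisons):
    """Chance of at least one Type I error over independent comparisons."""
    return 1.0 - (1.0 - alpha) ** n_comparisons

def is_significant(p_value, alpha):
    """Simple decision rule: significant when p falls below the criterion."""
    return p_value < alpha

# Four comparisons, chosen only as an example.
print(round(familywise_error(0.05, 4), 4))  # 0.1855
print(round(familywise_error(0.01, 4), 4))  # 0.0394

# Reported left hemisphere /b/ vs. /d/ comparison: p = .0118.
print(is_significant(0.0118, 0.01))  # False
print(is_significant(0.0118, 0.05))  # True
```

Under this simplified view, the .01 criterion keeps the overall risk of a false positive low across several comparisons, at the cost of rejecting a result such as p = .0118 that would pass at .05.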

A similar post hoc comparison between right hemisphere

means was also made for this Consonant by Trial by

Hemisphere interaction. This procedure was carried out in

order to eliminate the possibility of right hemisphere

differentiation of /b/ and /d/, a phenomenon not previously

reported. In this case, differences between mean factor

scores were not significant, as expected (F = .204, p =


Finally, post hoc t-square comparisons of the means in

this interaction were utilized to determine whether patterns

of hemispheric involvement could be changed as a function of

instructions to the subjects (stimulus expectation). In

this procedure, the mean factor score for "low" stimuli

(identical to /b/) in the frequency chirp trial was compared

to the mean factor score for the "high" stimuli (identical

to /d/). Comparisons were made for both the left and the

right hemispheres. Results revealed that differences between

/b/ and /d/ in the left hemisphere were, again, not

significant, as expected (F

= .681, p = .580). When a similar test was made in the

right hemisphere, differences between /b/ and /d/ did not

attain significance at the .01 level (although F = 6.77, p =

.0131). Based on these data, it appeared that changes in

instructions did result in some shift in hemispheric

involvement, although this trend was not significant.

A graphic display of mean factor scores in the four

comparisons for this interaction is presented in Figure 3-4.

The first set of bars illustrates the difference in mean

factor scores for /b/ vs. /d/ in the left hemisphere

averaged over the natural syllable, synthetic syllable and

speech chirp trials. The difference between /b/ and /d/

is substantial, although not significant at the .01 level.

The second set of bars shows mean factor score differences

for the same set of variables in the right hemisphere. As

can be seen, the differences are negligible. Taken

together, these two sets of bars display a definite trend

toward left hemisphere involvement in /b/-/d/

discrimination, although this trend was not statistically


The third set of bars in Figure 3-4 illustrates mean

factor score difference between "low" vs. "high" judgments

in response to chirps with frequency instructions (frequency

chirps) in the left hemisphere. A small difference appears

to occur. The last set of bars shows mean factor scores for

the same discrimination task in the right hemisphere. Here,

a substantial difference between "low" and "high" responses

can be seen, although again, this difference did not attain

significance at the .01 level. The third and fourth sets of

bars in this figure display a trend toward right hemisphere

Figure 3-4. Mean factor scores for /b/ vs. /d/ in the
left and right hemispheres; and "high"
vs. "low" onsets in the left and right
hemispheres. See text for discussion.

involvement in frequency discrimination--although again,

this trend was not significant.

A visual inspection of the "Speech Instructions"

portion of Figure 3-4 and the "Frequency Instructions"

portion reveals a tendency toward differential hemispheric

involvement depending on the instructions to the subjects.

Although neither relationship was significant, speech

instructions appeared to result in greater left hemisphere

differentiation, while frequency instructions yielded

greater right hemisphere differentiation.

Secondary Analyses

Further post hoc analyses were undertaken on the ANOVA

data in order to answer a number of additional questions.

Because the data analysis at this point was exploratory in

nature, Scheffe post hoc comparisons were used in

determining significance of results. Significance level was

set at .05 in order to reduce the possibility of making a

Type II error while keeping the probability of a Type I

error at a reasonable level.

The questions investigated in the secondary analyses

are as follows:

1) Are there significant differences between responses

to /b/ vs. /d/, independent of other variables? Both

Molfese (1980a) and Molfese and Schmidt (1983) found such a

relationship, which they interpreted as a bilateral process