A system for time modification of synthesized speech


Material Information

Title:
A system for time modification of synthesized speech
Physical Description:
ix, 240 leaves : ill. ; 29 cm.
Language:
English
Creator:
White, John McLean, 1960-
Publication Date:

Subjects

Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1995.
Bibliography:
Includes bibliographical references (leaves 233-239).
Statement of Responsibility:
by John McLean White III.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 002052092
notis - AKP0053
oclc - 33448220
System ID:
AA00003613:00001

Full Text

A SYSTEM FOR TIME MODIFICATION OF SYNTHESIZED SPEECH


By

JOHN McLEAN WHITE III

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1995

To Marie Grande White,

the most courageous person I have ever known.

ACKNOWLEDGMENTS


I am forever grateful to my advisor and committee chairman, Dr. Donald G.

Childers. For five long years, he has been an untiring source of guidance, direction, and

knowledge. His dauntless dedication to students like myself is highly commendable.

I am grateful to Dr. Howard Rothman for his insightful observations regarding the

science of speech. He was also instrumental in the development and success of the listening

tests. His willingness to repeatedly participate as the lone speech scientist in a group of

electrical engineers is admirable, and is greatly appreciated.

In addition, I would like to thank Dr. Jose Principe, Dr. Scott Miller, and Dr. Fred

Taylor for serving on my committee. I would also like to thank all of the past and present

members of the Mind-Machine Interaction Research Center for their help and

understanding.

This research was funded by the Mind-Machine Interaction Research Center and

the Audio Engineering Society Educational Foundation.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ............................................... iii

ABSTRACT ................... ..................................... viii

CHAPTERS


1 INTRODUCTION ................................................. 1

1.1 History of Time Modification Methods ........................... 2
1.1.1 Variable-Playback-Rate Method ......................... 2
1.1.2 Sampling Method .................................... 3
1.1.3 Vocoder Methods ..................................... 5
1.1.4 Recent Methods ...................................... 6
1.2 Phonological versus Psychological Testing ........................ 8
1.3 Review of Research ...................................... 11
1.3.1 Quantitative Measures of Time Modification .............. 11
1.3.2 Phonological Tests .................................. 12
1.3.3 Psychological Tests ................................. 14
1.3.3.1 Intelligibility tests and influencing factors ......... 15
1.3.3.2 Comprehensibility tests and influencing factors ..... 17
1.4 Motivation ........................................... 20
1.5 Goals ..................................................... 21
1.6 General System Description ................................... 23
1.7 Chapter Organization ....................................... 25


2 SPEECH ANALYSIS, SEGMENTATION, AND LABELING ............. 26

2.1 Selection of Speech Segment Categories ......................... 27
2.2 Overview of Automatic Segment Detection ...................... 28
2.3 Feature Detection Algorithms--General Development .............. 32
2.3.1 Input Data and V/U/S Pre-Processing .................... 33
2.3.2 Volume Function ................................... 34
2.3.3 Fixed Thresholds and Feature Scores ..................... 36
2.3.4 Automatic Correction Rules ............................ 40
2.3.5 Summary ....................................... 43
2.4 Feature Detection Algorithms-Detailed Descriptions .............. 43
2.4.1 Sonorant Detection ................................... 45
2.4.2 Vowel Detection ..................................... 47
2.4.3 Voiced Consonant Detection ........................... 49
2.4.4 Voice Bar Detection ................................. 50
2.4.5 Formant Tracking ................................... 52


2.4.6 Nasal Detection ..................................... 56
2.4.7 Semivowel Detection ............. ................ 58
2.4.8 Voiced Fricative Detection ............................ 62
2.4.9 Unvoiced Stop and Fricative Detection ................... 63
2.5 Speech Segmentation ..................................... 68
2.5.1 Spectral-Based Boundary Detection and Segmentation ....... 68
2.5.2 V/U/S Boundary Detection ............................ 69
2.5.3 Final Segmentation .................................. 71
2.6 Segment Labeling ........................................... 74
2.7 Manual Modification of Automatic Segmentation and Labeling
Results ..................................... ..... 76
2.7.1 Description of Errors .............. .................. 78
2.7.2 Software and GUI for Manual Modification ............... 81

3 TIME MODIFICATION ALGORITHMS AND USER INTERFACE ........ 88

3.1 The Linear Prediction Coding (LPC) Speech Synthesizer ............ 88
3.2 Time Modification Basics-Frame Skipping and Frame Doubling .... 89
3.3 User-Specified Modification Parameters ........................ 93
3.4 Mapping ............................................. 94
3.5 Time Modification and Synthesis............................. 98
3.6 Glitch Prevention ....................................... 101
3.7 Graphical User Interface (GUI) ............................... 105
3.7.1 Main Window ...................................... 105
3.7.2 Preview Window ................................... 108
3.7.3 Scale Factors Window .............................. 108
3.7.4 Minimum Durations Window ......................... 110
3.7.5 Manual Scale Factors Window ......................... 111
3.7.6 Map Window ..................................... 113
3.7.6.1 Map Display window ....................... 114
3.7.6.2 Map Edit window .......................... 117
3.7.7 Postview Window .................................. 120
3.8 Summary ........................................... 120

4 LISTENING TESTS .......................................... 122

4.1 Word-Length versus Sentence-Length Test Tokens ................ 122
4.2 Pilot Studies .......................................... 124
4.2.1 Quality ......................................... 126
4.2.2 Nasals ......................................... 126
4.2.3 Stops ......................................... 127
4.2.4 Fricatives ....................................... 129
4.3 Development of the Formal Listening Test ...................... 130
4.3.1 Test Tokens: ....................................... 131
4.3.1.1 Type ..................................... 131
4.3.1.2 Duration .................................. 133
4.3.1.3 Position .................................. 138
4.3.1.4 Synthesis and time resolution .................. 143
4.3.2 Test Format ................ ....................... 146

4.3.3 Listeners and Listening Environment ................... 149
4.3.3.1 Type .................................. 149
4.3.3.2 Number .................................. 149
4.3.3.3 Training ................................. 151
4.3.3.4 Screening .................................. 151
4.4 Results of the Formal Listening Test .......................... 153
4.4.1 Perception of the Time-Modified Variations of the
Word "Sue" ......................................... 159
4.4.2 Perception of the Time-Modified Variations of the
Word "Zoo" ......................................... 160
4.4.3 Perception of the Time-Modified Variations of the
Word "Said" ........................................ 161
4.4.4 Perception of the Time-Modified Variations of the
Word "Zed"...................................... 162
4.4.5 Summary of Answers Selected Most Often ............... 163
4.5 Summary ....................................... 166

5 DISCUSSION OF THE FORMAL LISTENING TEST RESULTS ......... 169

5.1 Perception of the Time-Modified /s/ ................ ......... 169
5.1.1 The Word "Sue" ................................... 169
5.1.1.1 Tokens that preserved the beginning of the /s/ ..... 170
5.1.1.2 Tokens that preserved the middle of the /s/ ........ 173
5.1.1.3 Tokens that preserved the end of the /s/ .......... 175
5.1.1.4 Comparison of the results as a function of position 177
5.1.2 The Word "Said" ................................. 178
5.1.2.1 Tokens that preserved the beginning of the /s/ ..... 178
5.1.2.2 Tokens that preserved the middle of the /s/ ........ 180
5.1.2.3 Tokens that preserved the end of the /s/ .......... 181
5.1.2.4 Comparison of the results as a function of position 181
5.1.3 Summary for "Sue" and "Said" ....................... 182
5.2 Perception of the Time-Modified /z/ ........................... 188
5.2.1 The Word "Zoo" ...................................... 188
5.2.1.1 Tokens that preserved the beginning of the /z/ ..... 188
5.2.1.2 Tokens that preserved the middle of the /z/ ........ 192
5.2.1.3 Tokens that preserved the end of the /z/ .......... 193
5.2.1.4 Comparison of the results as a function of position 194
5.2.2 The Word "Zed" ...................................... 195
5.2.2.1 Tokens that preserved the beginning of the /z/ .... 195
5.2.2.2 Tokens that preserved the middle of the /z/ ........ 197
5.2.2.3 Tokens that preserved the end of the /z/ .......... 198
5.2.2.4 Comparison of the results as a function of position 198
5.2.3 Summary for "Zoo" and "Zed" ....................... 199
5.3 General Observations ...................................... 205

6 SUMMARY AND CONCLUSIONS ............................... 207

6.1 Summary ........................................... 207

6.1.1 The Time Modification System ...................... 207
6.1.2 The Listening Tests ................................ 209
6.2 Recommendations for Further Work ........................... 213
6.2.1 Additional Listening Tests ............................ 213
6.2.2 Enhancements to the Time Modification System ........... 214


APPENDICES


A THE DIAGNOSTIC RHYME TEST WORD LIST ..................... 217

B LISTENING TEST INSTRUCTIONS .................... .......... 218

C FORMAL LISTENING TEST RESULTS ........................... 220


REFERENCES ................................................. 233


BIOGRAPHICAL SKETCH ......................................... 240

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
A SYSTEM FOR TIME MODIFICATION OF SYNTHESIZED SPEECH

By
John McLean White III

May, 1995


Chairman: Dr. Donald G. Childers
Major Department: Electrical Engineering

The aim of this research was twofold. The first goal was to create a software-based,

time modification system to independently and automatically modify the durations of the

phonetic segments in a speech signal. The system was intended to be used to create high

quality test tokens for use in speech perception studies. The second goal was to use this

system to investigate the role of duration in perception of the fricatives /s/ and /z/ in

word-initial position in four single-syllable words.

The first portion of the time modification system analyzes the speech signal. The

signal is divided into phoneme-type segments, and each segment is labeled as either vowel,

semivowel, nasal, voice bar, voiced fricative, unvoiced fricative, unvoiced stop, or silent.

These segmentation and labeling algorithms are based primarily on the short-term

frequency distribution of the speech signal.

The second portion of the time modification system invokes a graphical user
interface that allows the user to specify, via slide-bar controls, both the desired time scale

factor and minimum duration for each segment. The user can also specify a weighting

function, or "map," for each segment. The map determines the portion of the segment that

is modified. The resulting time-modified speech is created by a linear predictive coding

speech synthesizer.

The system was used to modify the duration of the initial consonant in the words
"sue," "zoo," "said," and "zed." The duration was adjusted in 10-ms increments, and

ranged from zero ms to the original, unmodified duration (approximately 240 ms). For

each duration, three tokens were created: The first preserved the beginning of the

consonant, the second preserved the middle of the consonant, and the third preserved the

end of the consonant. A total of 270 tokens were created.

Formal listening tests showed that duration strongly affected perception of the

initial consonant. In addition, the portion (i.e. beginning, middle, or end) of the consonant

that was preserved was also shown to affect perception of many of the test tokens.

CHAPTER 1
INTRODUCTION

For certain speech applications, it is desirable to change the rate at which recorded

or synthesized speech is presented to a listener. One example of this is a device that varies

the playback rate of audio books for the blind. This allows the non-sighted listener to

"read" at his or her own speed, independent of the rate at which the recording was originally

made.

One of the driving forces behind research of time-modified speech is the fact that

it has long been known that a human being can comprehend speech at a rate greater than

he or she can produce speech (de Haan, 1982; Foulke and Sticht, 1969; Goldman-Eisler,

1968; Goldstein, 1940). Therefore, a significant time savings results by increasing the rate

of pre-recorded speech in applications such as playback of academic lectures, conference

papers, religious sermons, and archived political speeches, just to name a few. Because of

this difference between the maximum speaking and perception rates, the large majority of

published research has studied speech compression ("speeded up" speech) as opposed to

speech expansion ("slowed down" speech). While there are applications for speech

expansion, these are far fewer in number. The most common are expansion of speech for

the hearing impaired or for foreign language learning.

Time-modified speech also has application as a significant research tool for the

development of test data for use in perceptual studies of both normal and pathological

patients. The durations of different portions of the speech signal are modified in order to

test theories of speech perception from either a psychological or phonological viewpoint.

In addition, time modification can be used to create larger databases for

development of speech recognition systems. Many different variations (in terms of

duration) of a test word or sentence can be systematically created from a single token. The

different variations can then either be used to further train the system, or to perform

controlled tests of the system's ability to correctly detect data different from the training

data.


1.1 History of Time Modification Methods

The methods used to accomplish time modification have evolved through several

stages over the last 40 years. This section provides some history and background into these

methods. Although there are variations in the specific implementations, almost all of the

methods used to date can be assigned to one of four main categories.

1.1.1 Variable-Playback-Rate Method

The variable-playback-rate method is relatively simple, and accomplishes a rate

change by playing back previously recorded speech at a rate different from the original

recording rate. An example of this is playing an LP phonograph record at 45 RPM, instead

of its intended rate, 33 1/3 RPM. A modern digital signal processing (DSP) analogy of

this is digital-to-analog conversion (and appropriate filtering) at a sampling rate different

from the signal's original sampling rate, without interpolation or decimation. Note that for

this technique, the rate change is always accompanied by a linear shift in the frequency

content of the signal. For speech that has been slowed down or speeded up by a factor of

about two or more, this frequency shift leads to undesirable perceptual effects that mask

the identity of the speaker and, in general, cause a decrease in intelligibility (Garvey, 1953a;

Tiffany and Bennett, 1961).
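
As a simple illustration of this technique (not part of the dissertation's system), the following Python sketch writes unmodified samples out under a scaled sampling rate; the file names and speedup value are arbitrary assumptions. Because no interpolation or decimation is performed, the duration and the frequency content scale together by the same factor.

# Variable-playback-rate sketch (illustrative, not the dissertation's code):
# the same samples are written out under a scaled sampling rate, so duration
# and pitch change together by the same factor.  File names are hypothetical.
from scipy.io import wavfile

def change_playback_rate(in_path, out_path, speedup):
    fs, x = wavfile.read(in_path)          # original sampling rate and samples
    new_fs = int(round(fs * speedup))      # e.g. speedup = 1.35 for 45 vs 33 1/3 RPM
    wavfile.write(out_path, new_fs, x)     # no interpolation/decimation: duration
                                           # shrinks and pitch rises by `speedup`

change_playback_rate("speech.wav", "speech_fast.wav", speedup=2.0)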

The variable-playback-rate method was never popular among researchers, mainly

because of the detrimental perceptual effects attributed to the accompanying pitch shift.

However, it was the only method available until about 1950, when the sampling method

was introduced.

1.1.2 Sampling Method

The general class of rate-change techniques known as the "sampling method"

involves the periodic removal or duplication of small segments of recorded speech. The

remaining segments are then spliced together to form the rate-altered speech. The main

advantage of this method is that the frequency content of the resulting speech is not

affected, and as a result, many of the speaker-dependent characteristics are preserved. It

has also been shown that for a variety of different rates, the speech produced by this method

is significantly more intelligible than speech produced by the variable-playback-rate

method (Fletcher, 1929; Garvey, 1951; Lee, 1972).
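
The core of the sampling method can be sketched in a few lines of Python (an illustrative approximation, not any specific author's implementation); the keep and discard intervals are assumed parameters.

# Sampling-method sketch (illustrative): keep `keep_ms` of every
# (keep_ms + discard_ms) of signal and splice the pieces back together,
# leaving the local spectrum (and hence pitch) unchanged.
import numpy as np

def sample_compress(x, fs, keep_ms, discard_ms):
    keep = int(fs * keep_ms / 1000)
    step = keep + int(fs * discard_ms / 1000)
    pieces = [x[i:i + keep] for i in range(0, len(x), step)]
    return np.concatenate(pieces)

# Example: discarding 10 ms of every 30 ms gives a compression ratio of about 2/3.
# y = sample_compress(x, fs=10000, keep_ms=20, discard_ms=10)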

One way of implementing the sampling method is by manually cutting and splicing
magnetic recording tape (Garvey, 1949). The disadvantage of this method is the time

required to perform the task. Another disadvantage is that it is impossible to guarantee

waveform continuity across the splice boundary. This results in audible "pops" and

"clicks," although these clicks can be reduced by cutting the tape at a 45 degree angle

relative to the edge of the tape. This method does, however, have the advantage of allowing

the user to (manually) select the locations and durations of the discarded segments. To

mark the tape, the operator manually passes the tape back and forth across the playback

head of a tape recorder and listens for the starting and stopping points of a syllable. Once

these points are found, they are marked and labeled on the back of the tape with a grease

pencil. This method is flexible, but extremely time consuming. As a result, it is often

impractical for extensive use and is typically used only for "proof of concept."

Today, this "cut and splice" method is typically implemented on a digital computer

and a CRT display. Although the physical inconvenience of the magnetic tape is no longer

present, the problems of manual segment identification and waveform continuity across

the splice boundary still remain.

An automatic time modification method exists that is based on a modified magnetic

tape recorder (Fairbanks et al., 1954). It involves a rotating playback head assembly that

contains four playback heads. As the head assembly rotates, each one of the four playback

heads individually and sequentially contacts the moving magnetic tape. The outputs from

the four heads are wired in parallel. Time compression results from the fact that there are

gaps between each of the four heads. Therefore, as the head assembly rotates, every

segment of the tape that these gaps contact (instead of a playback head) is not reproduced

by the playback head. The spacings in the head assembly are calculated so that as the head

assembly turns, one playback head is always beginning to make contact with the tape just

as the previous playback head is losing contact with the tape. The output of the rotating

head assembly is re-recorded onto a second, conventional, tape recorder. The second tape

then contains the compressed speech.

The initial automatic time modification device introduced by Fairbanks in 1954 has
some major limitations. The biggest problem is that the duration of the segments that are

discarded or duplicated is fixed. This was changed in later adaptations and copies of the

original machine (Lee, 1972; Neuburg, 1978). However, other problems ultimately limit

the fidelity and usefulness of the machine. The first of these problems is the lack of

repeatability of the process. In order to repeat the process of compressing a tape segment,

the user has to know exactly the starting phase of the rotating head assembly with respect

to the beginning of the tape. The second problem is the noise created by the rotating head

assembly's slip rings and distortion due to the heads' misalignment with the moving tape.

Despite the shortcomings of the method, the large majority of published research
on the intelligibility and comprehensibility of compressed speech was done using the

sampling method.

1.1.3 Vocoder Methods

A third class of rate change techniques is accomplished by the use of vocoders

(VOice CODERS). Vocoders were originally designed to reduce the bandwidth

requirements for transmission of a normal voice signal. Their ability to modify the rate of

speech is thought of as a secondary benefit. Of all of the vocoders, the phase vocoder is

the best suited for rate modification (Flanagan and Golden, 1966).

Vocoders implement an analysis-synthesis speech transmission scheme. In the

analysis stage, natural speech is analyzed, typically by a bank of bandpass filters. The

output of each bandpass filter in the bank is coded by one of a variety of different methods,

and this coded information is transmitted across a channel. At the synthesis stage at the

receiving end of the channel, the coded information is decoded, and is used to control a

bank of tuned oscillators. The outputs of the oscillators are then summed to produce

synthesized speech (Rabiner and Schafer, 1978).

Typically, the synthesis oscillators are tuned to the same frequencies as the

bandpass filters in the analysis stage. However, this one-to-one match in tuning is not

strictly required, and if the oscillator frequencies are tuned to multiples of the analysis

stage's bandpass filters, it is possible to implement a modification of the synthesized

speech. For example, the phase vocoder can be used to implement a rate change in the
following two-stage manner: In the first stage, speech is analyzed by a bank of equally

spaced bandpass filters with center frequencies at ωi, for i ∈ {1, 2, 3, ..., N}. The outputs

of the bank of bandpass filters are then used to control a bank of oscillators tuned to center

frequencies (ωi / 2), for i ∈ {1, 2, 3, ..., N}. At this point, the rate of the synthetic speech
is identical to that of the original speech, but the frequency spectrum of the synthetic speech

is shifted down to one-half that of the original speech. The second stage of the process is

to double the playback speed of the speech synthesized by the first stage. The resulting

speech is twice the rate of the original speech, and has the same spectrum as the original

speech.

While vocoders are able to modify the rate of speech, they suffer from the fact that

their analysis-synthesis schemes create unwanted artifacts in the speech signal (Portnoff,

1981). The speech produced by vocoders is often described as sounding artificial or

"buzzy." Another problem previously associated with the use of vocoders for research

applications is that in the 1960s and 1970s, vocoders were relatively expensive and not a

cost-effective option for many researchers.

The literature shows that vocoders were seldom used in experiments on the
intelligibility of rate-altered speech. The reasons for this are probably due to the problems

listed, as well as the fact that vocoders were unavailable when the peak in the research

interest in rate-altered speech occurred (late 1950s and throughout the 1960s). Note,

however, that vocoders have been studied extensively for speech that has not been rate

altered. Many of the low-bit-rate communication schemes in use today employ basic

bandpass filter concepts first introduced in the early vocoders (Jayant, 1990).

1.1.4 Recent Methods


There has been continued interest over the last 10 to 15 years into newer methods
of modifying the rate of speech. While the speech produced by these methods is seldom

tested in formal intelligibility or comprehensibility tests, the methods are being studied due

to their low cost, low computational requirement, and relative ease of implementation.

Note that some of the newer methods are hybrids of older vocoder technology and recent

waveform coding technology.

The simplest new method consists of "a pitch detector followed by an algorithm
that discards (or repeats) pieces of speech equal in length to a pitch period" (Neuburg,

1978). This is a minor variation of the sampling method. The method does not operate

pitch-synchronously, meaning that the beginning of the segment that is either duplicated


or discarded does not occur at the instant of glottal closure. The method relies upon the

fact that for the majority of time, the speech signal does not vary greatly across a single

pitch period (about 10 ms for a male speaker). Therefore, as long as the duration of the

discarded (or repeated) segment is exactly equal to the pitch period, the ear can not discern

any significant distortion from the process. No formal listening tests have been conducted

for this method.
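
A hedged sketch of this idea is given below, assuming the pitch period has already been estimated elsewhere; the pitch detector itself is not shown, and all parameter values are hypothetical.

# Pitch-period-discard sketch (illustrative): pieces exactly one estimated pitch
# period long are dropped at regular intervals.  `period_samples` is assumed to
# come from a separate pitch detector, which is not shown here.
import numpy as np

def discard_pitch_periods(x, period_samples, keep_periods):
    step = (keep_periods + 1) * period_samples   # keep k periods, drop 1, repeat
    pieces = [x[i:i + keep_periods * period_samples] for i in range(0, len(x), step)]
    return np.concatenate(pieces)

# For a male speaker at 10 kHz, period_samples is roughly 100 (about 10 ms);
# keep_periods = 1 gives a 0.5 ratio, keep_periods = 2 gives about 2/3.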

Another speech rate modification method by Malah is similar in principle to the

two-step process implemented by the phase vocoder described in Section 1.1.3 (Malah,

1979). For both of these methods, if the speech rate is modified by an integer scale factor

n, the first step shifts the frequency spectrum by a factor of n, and the second step plays

the frequency-shifted speech at a speed of n times the original rate. Since the second step

of this process is essentially trivial, the success of this method relies on the ability to shift

the frequency spectrum of the speech. Malah developed numerically-efficient algorithms

that can shift the frequency spectrum of speech (without changing the rate). These

algorithms are known as time-domain harmonic scaling (TDHS) algorithms. In most

cases, the algorithms require only one multiplication and two additions per output sample

of speech. The primary problem in this implementation is that rate modification is only

implemented in integer multiples (e.g., 2:1, 3:1). Thus, only a small, finite set of

compression and expansion ratios can be implemented. Malah claims that for his two-step

rate modification scheme, rates of greater than 2:1 are impractical, "due to perceptual

limitations." Because of this limitation to integer multiples, the algorithm is not that

applicable to speech research. Although no formal tests were conducted, the author states

that "Simulation results with a scaling factor of two for different speakers and texts have

been informally judged to be very good ...."

A recent and popular approach for time modification is based upon the short-term
Fourier transform (STFT). The method is composed of three parts. The first part models

implement the rate change. The third part synthesizes the modified speech signal from the

modified STFT parameters (Portnoff, 1981). The method was simulated on a DEC PDP-11

computer. While no formal listening tests were performed, the author claims that the

system "is capable of producing high quality rate-changed speech ... for compression ratios

as high as 3:1 and expansion ratios as high as 4:1." For ratios outside this range, the method

introduces reverberation for expanded speech, and exhibits a "rough" quality for

compressed speech.
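
For illustration only, a compact phase-vocoder-style STFT time-scale routine in Python is sketched below; it follows the analyze-modify-synthesize outline above but is not Portnoff's algorithm, and the FFT size and hop length are arbitrary assumptions.

# STFT-based time-scale modification sketch (illustrative, phase-vocoder style).
import numpy as np

def stft_time_scale(x, ratio, n_fft=1024, hop=256):
    """ratio < 1 compresses (speeds up), ratio > 1 expands (slows down)."""
    win = np.hanning(n_fft)
    # Analysis: short-term Fourier transform at a fixed hop size.
    frames = np.array([np.fft.rfft(win * x[i:i + n_fft])
                       for i in range(0, len(x) - n_fft, hop)])
    # Modification: step through the analysis frames at 1/ratio of the original
    # pace while accumulating phase so the partials remain continuous.
    steps = np.arange(0, len(frames) - 1, 1.0 / ratio)
    bin_freq = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft
    phase = np.angle(frames[0])
    y = np.zeros(len(steps) * hop + n_fft)
    for k, t in enumerate(steps):
        i = int(t)
        dphi = np.angle(frames[i + 1]) - np.angle(frames[i]) - bin_freq
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))      # wrap to [-pi, pi]
        phase += bin_freq + dphi
        frame = np.fft.irfft(np.abs(frames[i]) * np.exp(1j * phase), n_fft)
        y[k * hop:k * hop + n_fft] += win * frame              # overlap-add synthesis
    return y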

Another recent approach models the speech signal by a set of time-varying sine

waves (Quatieri and McAulay, 1992). In terms of the general procedure, this method is

very similar to the STFT approach. The speech is modeled by a set of time-varying

parameters, namely sine wave amplitudes and phases. The algorithm adjusts the speech

parameters, and then re-synthesizes the modified speech by controlling a set of sine wave

generators. Again, no formal listening tests were conducted, and the authors state that for

time-modified speech, "the synthesized speech was generally natural sounding and free of

artifacts."

Perhaps the most interesting point in the Quatieri and McAulay study is that they

experiment with what they call "speech-adaptive time-scale modification." In essence,

they implement rate change by modifying only the voiced portions of speech. In addition,

they measure the "degree" of voicing, and concentrate their time-base modification on the

frames that exhibit the highest degree of voicing. Note that their measurement of the degree

of voicing is based upon how little the harmonic structure varies across multiple

frames-the less the harmonics vary, the higher the "degree of voicing." However, no

formal listening tests were conducted to test their results.


1.2 Phonological versus Psychological Testing

Time modification is a research methodology for studying certain aspects of

speech. Time modification of speech is used in a multitude of tests in several interrelated

research areas. In this study the research on time modification techniques is grouped

according to two different points of view: "psychological testing" and "phonological

testing." These terms categorize the research according to its ultimate purpose.

It is important to emphasize that the definitions given in this study for psychological
and phonological testing are provided solely to divide the large number of applications of

time-modified speech into more manageable categories. Strictly speaking, there is an

overlap of the two definitions. For example, it is incorrect to assume from the following

definition of phonological testing that speech pathologists are unconcerned with the speech

perception process. Consequently, not all of the studies labeled here as psychological tests
were conducted by psychologists-several were conducted by speech pathologists. In

general, both psychologists and speech pathologists examine the perception process,

although each group tends to focus on different parts of the process, or at least approach

it from a different point of view.

Despite the above problems, the studies described as psychological tests in this

report are usually conducted by psychologists. One goal is the investigation of accelerated
learning rates. The ultimate goal, however, is to create an accurate model of the speech

perception process. Researchers often attempt to link the listening test results with higher

cognitive processes, and not surprisingly, the results are often in disagreement.

In general, psychological testing is concerned with issues including (1)
determination of the highest rate at which speech can be presented to a listener and still be

understood, (2) an explanation of why our perception process fails at higher speaking rates,

and (3) the role of short-term and long-term memory in speech perception.

Psychological testing utilizes rate change over relatively large time intervals,
typically sentences or paragraphs. The results are measured in terms of either intelligibility

or comprehensibility. The term "intelligibility" is defined as a measure of the ability to
repeat a short word, phrase, or sentence (Carlson et al., 1979). For example, suppose a

listener is presented with a list of time-compressed words. After each word, he or she is


asked to write down or speak each word that he or she heard. The percentage of correct

responses is the measure of intelligibility. The term "comprehensibility" is defined as the

listener's ability to answer a detailed set of questions about a passage of rate-altered text

(Foulke, 1968; Heiman et al., 1986). The test differs from intelligibility in that the

listener's understanding and comprehension of the material is tested, not just the

intelligibility. Of course, these factors are related. For example, if a single word in a

passage of text is unintelligible, it may (or may not, depending upon the information

content of the word) affect the comprehensibility of the passage.

Psychological testing was conducted extensively in the 1950s and 1960s. Although

a universally accepted model of perception was never created, a wealth of data was

collected regarding the intelligibility and comprehensibility of time-modified words,

sentences, and paragraphs.

In contrast, phonological testing is often conducted by speech pathologists. The

goal is to model human perception of phonemes (or similar segments of speech). The

results are discussed in terms of measurable acoustic features and how their presence (or

absence) affects perception.

Although phonological tests usually measure intelligibility, they differ from

psychological tests in that they relate the intelligibility to acoustic features measured from

the speech signal. For example, the tests are concerned with questions like "how much of

the /s/ must be removed from the word "said" in order to cause the word "zed" to be

perceived," and "what acoustic features are used to distinguish between the phonemes /b/

and /w/?"

An example of a phonological test incorporates an algorithm that removes the

initial portion of the initial consonant /s/ in the word "sit." Multiple tokens of the word are

created by progressively removing longer durations of the initial consonant. The tokens

are then used in listening tests to determine how duration influences perception of the


unvoiced fricative /s/ in a word-initial position in a consonant-vowel-consonant (CVC)

word.
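
A minimal sketch of how such truncation tokens could be generated digitally is shown below, assuming the consonant boundaries are already known (for example, from manual segmentation); the sample values in the closing comment are hypothetical.

# Illustrative token generator: remove progressively longer initial portions of a
# word-initial consonant whose sample boundaries are assumed to be known.
import numpy as np

def make_truncation_tokens(x, fs, cons_start, cons_end, step_ms=10.0):
    step = int(fs * step_ms / 1000)
    tokens = []
    for cut in range(0, cons_end - cons_start + 1, step):
        # Drop the first `cut` samples of the consonant; keep everything else.
        tokens.append(np.concatenate([x[:cons_start], x[cons_start + cut:]]))
    return tokens

# For "sit" at 10 kHz with the /s/ spanning samples 500-2900, this yields one
# token per 10-ms increment of removed frication.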


1.3 Review of Research

The review of the research is divided into two parts according to the type of test that

was performed: psychological or phonological. Although this study is primarily concerned

with phonological testing, review of psychological tests is beneficial since there is some

overlap between the two definitions, and since many factors that influence listening tests

are discussed in the psychological studies. This section begins with a discussion of the

various definitions of the degree of time modification used in various studies.


1.3.1 Quantitative Measures of Time Modification


Researchers are not in agreement regarding a single quantitative measurement (or

definition) of the amount of time compression or expansion, other than specifying the

initial and final word rates in words per minute (w.p.m.).

One definition in use defines the compression ratio (or expansion ratio) as the ratio

of the duration of the resulting time-altered speech to the duration of the original speech.

For example, if ten seconds of speech are compressed to a duration of six seconds, the

compression ratio is 60%, or 0.6. This is the definition used in this study.

A second definition in use defines the compression ratio as the ratio of the duration of the

discarded interval (or added interval, for speech expansion) to the duration of the original

speech. Note that this is the complement of the above definition. Thus, for the same durations

as for the above example, the speech compression ratio is 40%, or 0.4.

Alternatively, the compression or expansion ratio is described in terms of the ratio

of the final to initial word rate, both defined in w.p.m. This, of course, requires that more

than one word is being modified.
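
For clarity, the three conventions can be written out explicitly in a small illustrative helper (not from the dissertation), using the earlier example of ten seconds of speech compressed to six seconds.

# Illustrative comparison of the three definitions discussed above.
def compression_ratios(original_s, modified_s):
    return {
        "this_study": modified_s / original_s,              # 10 s -> 6 s gives 0.6
        "discarded_fraction": 1 - modified_s / original_s,  # second definition: 0.4
        "word_rate_ratio": original_s / modified_s,         # final/initial w.p.m.: ~1.67
    }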

Due to the lack of agreement in the definition of the modification ratio, each
researcher typically states the definition that he or she uses. Although adequate, this creates

problems when trying to compare the results of different reports that use different

definitions. Therefore, for the purpose of comparison, in the following sections all of the

original speech compression ratios cited in the literature have been numerically converted

to conform with the definition of "compression ratio" used in this study.

1.3.2 Phonological Tests


This subsection discusses the various phonological tests that examine the effects of

removing a portion of a word or phoneme. The studies relate the results to measurable

acoustic features, such as duration and frequency distribution.

Cole and Cooper (1975) studied the effects of consonant and vowel duration on the
perception of the voiced-voiceless distinction for two affricates and four fricatives in

word-initial position. Shortening the duration of the word-initial frication in an unvoiced

affricate or unvoiced fricative changed the perceived result from unvoiced to voiced.

However, changing the duration of the following vowel had little effect upon the perceived

result.

Denes (1955) studied the effects of consonant and vowel duration when the

consonant occurred in a word-final position. He created multiple variations of two

single-syllable words using all combinations of four different vowel durations and five

different consonant durations. He concluded that both the consonant duration and the

preceding vowel duration had a significant effect upon the perceived voicing of the final

consonant. Short consonant durations resulted in a voiced percept, while longer consonant

durations resulted in an unvoiced percept. Interestingly, short preceding-vowel durations

resulted in an unvoiced percept of the consonant, while longer preceding-vowel durations

resulted in a voiced percept of the consonant. It was concluded that perception is not

performed on a "phoneme-by-phoneme" basis, and that the attributes of a single phoneme

affect the perception of adjacent phonemes.

Raphael (1972) also studied the effect of preceding-vowel duration upon the

perceived voicing of a word-final consonant. Raphael reported that "regardless of the cues

for voicing or voicelessness used in the synthesis of the final consonant or cluster, listeners

perceived the final segments as voiceless when they were preceded by vowels of short

duration and as voiced when they were preceded by vowels of long duration." This agrees

in part with the results of Denes in that the preceding-vowel duration plays an important

role in the perception of the following consonant. However, this differs from Denes in that

it negates the importance of consonant duration in perception of the voicing of the final

consonant.

Grimm (1966) studied the effect of eliminating different amounts of the initial

portion of the consonant in consonant-vowel syllables. It was reported that "the listeners

were able to detect correct place of articulation more accurately than either voicing or

manner of release as greater amounts of the initial part of a syllable were removed." Both

initial fricatives and initial plosives were examined. The results showed that as the initial

consonant duration was decreased, errors in identification of the initial fricatives occurred

more gradually than errors in the identification of the initial plosives.

Jongman (1989) studied the effect of varying the duration of frication in

consonant-vowel (CV) syllables. His results showed that the required duration of frication

for correct fricative identification varied depending upon the specific fricative. In addition,

the results disagreed with those of Cole and Cooper (1975) in that "subjects do not have

a tendency to identify more fricatives as voiced as the frication duration decreases." No

explanation was given for this disagreement.

Summerfield, Bailey, Seton, and Dorman (1981) investigated the minimum

duration of silence between "s" and "lit" required to hear the word "split." The results

showed that typically, "split" was heard when the silent interval exceeded about 50 ms.


However, this value was not observed to be constant for all test conditions. They reported

that "less silence is required (a) if the intensity fall at the "s" is made more abrupt, or (b)

if the durations of the "s" and "lit" segments are reduced."

The finding that less silence is required if the surrounding segments are shortened

is important. It is an example of the widely believed theory that a listener's perceptual

"thresholds" used in speech perception are not absolute, and vary depending upon multiple

factors including speaking rate and speaking style (Gottfried et al., 1990; Lindblom, 1963;

Miller, 1981; Miller and Baer, 1983; Miller and Liberman, 1979; Summerfield et al., 1981).

1.3.3 Psychological Tests


Due to the large number of studies, as well as the variety of test conditions used in

psychological tests, it is difficult to divide the research into logical units for comparison.

One approach is to classify the research based upon the physical device used to modify the

speech. However, almost all of the quantitative results were obtained in studies that used

the sampling method exclusively. Therefore, a logical division and discussion of the

research based upon method of implementation is inappropriate.

There are some common threads, however. Almost all of the studies measured

either the intelligibility or the comprehensibility of time-modified speech, but usually not

both (see Section 1.2 for definitions). There is also a general trend in the literature to isolate

and focus upon the different factors that affect rate-altered speech (Foulke and Sticht,

1969), which is logical, since the ultimate goal of psychological studies is to model the

abstract levels of the speech perception process. As an example of the research focus on

individual features, one study measured the comprehension of rate-altered speech as a

function of the age of the test subject, while another study measured the intelligibility of

rate-altered speech as a function of the intelligence of the test subject.

For this report, the review of the psychological tests is divided into two subsections:

the studies of intelligibility, and the studies of comprehensibility. The test results, in terms

of the factors that influence the tests, are discussed in each subsection. This approach offers

the advantage of listing all of the factors that must be considered (and possibly eliminated)

in designing a new study.

1.3.3.1 Intelligibility tests and influencing factors

One of the primary test factors that influences intelligibility is the method used to

compress or expand speech. There are unique problems associated with each of the time

modification methods outlined in Section 1.1 of this report. For example, speech

compression implemented by increasing the playback speed suffers from an accompanying

shift in the frequency content. Thus, speech compressed to 50% of its original duration by

the variable playback rate produces different results than speech compressed to 50% of its

original duration by the sampling method (Garvey, 1953a).

Garvey (1953a) studied intelligibility at many different rates using both the
variable playback rate method and the sampling method. He found that the sampling

method produced significantly better intelligibility scores than the variable playback

method for compressed speech. Among his findings was the result that speech compressed

to 0.66 times its original duration had a mean intelligibility of 98.7%, while speech

compressed the same amount by the variable playback rate method had a mean
intelligibility of only 58%. He also observed that speech compressed to 0.4 times its

original duration by the sampling method still had an intelligibility score greater than 90%.

Fletcher studied the effects of altering the speech using the variable-playback-rate

method (Fletcher, 1929). He found that the intelligibility of speech decreased sharply for

expanded or compressed speech as the rate was changed. For expanded speech,

intelligibility decreased to approximately 50% at an expansion rate of 1.5. For compressed

speech, intelligibility decreased to approximately 50% at a compression ratio of about

0.625. He concluded that the primary reason for loss of intelligibility was the frequency

shift, and not the actual speed of the speech.

Another factor that influences intelligibility is the duration of the segment to be

discarded or repeated (these durations are known in the literature as the "discard interval"

and the "repeat interval", respectively). For example, a 50% compression rate can be

accomplished by either deleting 10 ms from every 20 ms segment of speech, or by deleting

80 ms from every 160 ms segment of speech. While these two methods implement the same

overall quantitative amount of compression, they have different perceptual results. This

is because in the second case, the 80 ms segment that is deleted may delete an entire

phoneme, whereas the case of deleting every 10 ms has a much smaller chance of deleting

an entire phoneme (Fairbanks and Kodman, 1957). This factor was studied by several

researchers. Note again that all studies used the sampling method. Garvey compressed

single words by a factor of 0.5 using a variety of different discard interval lengths (Garvey,

1953a). His results showed intelligibility scores of 95.33%, 95.67%, 95.0% and 85.67%

for discard intervals of 40 ms, 60 ms, 80 ms, and 100 ms, respectively. Note that the

intelligibility decreased significantly for the longest discard interval. Fairbanks and

Kodman (Fairbanks and Kodman, 1957) also investigated intelligibility as a function of

the duration of the discard interval. They found that intelligibility decreased dramatically

when the discard interval changed from 80 ms to 160 ms (no discard intervals between

these two values were tested).

Two additional factors that also affect intelligibility are the number of phonemes

in a single word (for a single-word test), and the number of syllables in a simple utterance

(for a simple-utterance test). It has been reported that the intelligibility increased as the

number of phonemes increased for tests involving single words (Foulke and Sticht, 1969).

It has also been reported that the intelligibility increased as the number of syllables

increased (Klumpp and Webster, 1961). Intuitively, this seems reasonable, since as the

number of phonemes and syllables increases, the brain can invoke higher level perceptual

rules ("levels" of perception are defined here in an abstract sense) in an attempt to string

the phonemes together to make the incomplete word or utterance "make sense."


Up to this point, all of the factors described are related to the characteristics of the

speech tokens themselves. There are also "human-based" characteristics that affect
intelligibility. Note, however, that most of the studies of human factors have been

concerned with how these factors affected comprehension, and not intelligibility. Still, a

few tests specifically relate human factors to intelligibility.

It has been reported that adaptation, or learning, occurred when a listener was

subjected to compressed speech. In one study, repeated exposure to compressed words

brought about a small increase in intelligibility (Garvey, 1953b).

The hearing capacity (ability) of test subjects was also shown to affect intelligibility

of time-altered speech. Calearo and Lazzaroni (1957) showed that aged subjects with

hearing impairment showed a sharper decrease in intelligibility scores as the rate of speech

was increased, when compared with subjects with normal hearing capacity.

1.3.3.2 Comprehensibility tests and influencing factors

There have been numerous research studies to examine how comprehension is

affected by different experimental factors. It was stated earlier that the factors that affect

intelligibility may also affect comprehension. Therefore, some of the following factors

were discussed in the previous subsection. The difference between the earlier subsection

and this subsection is that the specific studies described here focus on how the various

factors directly affected comprehension, and not intelligibility.

Quite surprisingly, there is disagreement over whether speech compressed by the
sampling method is more comprehensible than speech compressed by the

variable-rate-playback method. Foulke found no significant difference in the two methods

when presenting compressed speech to blind test subjects (Foulke, 1964). This is

contrasted by the findings of McLain, who found that her test subjects, also blind, scored

better when listening to speech produced by the sampling method than speech produced

by variable rate playback (McLain, 1962).


This disagreement is puzzling, since there is agreement concerning findings that the

method used strongly affects intelligibility. Foulke and Sticht suggested that the difference
between intelligibility and comprehensibility results was due to different cognitive

processes being invoked for the two different tests (Foulke and Sticht, 1969).

Nonetheless, speech produced by the sampling method simply sounds more natural

to the test subject than does speech produced by the variable rate playback method. So

while it is unclear if the method used affects comprehensibility, the sampling method is

almost always chosen due to its more pleasant "sound."

Another fundamental factor that affects comprehension is word rate. Fairbanks,

Guttman, and Miron reported that there was little difference in comprehension at rates of

141 w.p.m., 201 w.p.m., and 282 w.p.m. (Fairbanks et al., 1957a). At 470 w.p.m.,

comprehension declined to 26%, compared with 58% at 282 w.p.m. Another study found

little significant change in the comprehensibility of speech presented within the range of

126 w.p.m. to 175 w.p.m. (Diehl et al., 1959). This is in general agreement with other

published studies that showed a small decline in comprehension as the rate was increased

to about 300 w.p.m., and then a much quicker decrease in comprehension as the rate was

increased above 300 w.p.m. (Foulke, 1968).

Listening difficulty also affects comprehension. Unfortunately, there is no single

measure of listening difficulty, so it is difficult to compare different studies. However, in

general terms, the studies that have been performed agreed and showed that as the word

rate was increased, more difficult selections showed a quicker decline in comprehension

scores than did easier selections (Harwood, 1955).

As with intelligibility, there are several human-based variables that affect

comprehension. A listener's intelligence was shown to influence comprehension scores

at different word rates (Fairbanks et al., 1957b). This study showed that persons with

higher intelligence scored better on comprehension tests as the speech rate was increased.


Another human-based variable that influences comprehension is the test subject's

(silent) reading rate. Goldstein found a positive correlation between reading rate and

comprehension scores for speeded speech (Goldstein, 1940).

Prior learning or exposure to compressed speech also influences comprehension

test results. Orr, Friedman, and Williams found that students that received systematic

practice in listening to compressed speech consistently scored better than unpracticed

students in comprehension tests of speech at rates greater than twice the normal rate (Orr

et al., 1965). At rates less than twice the normal rate, the evidence is not so conclusive.

Friedman conducted an extensive study concerning learning and compressed speech, and

found that comprehension at 325 w.p.m. was no different after one week of practice

listening than comprehension with no practice (Friedman, 1967). Note that he did find that

one week of practice improved comprehension scores at rates greater than 325 w.p.m. This

result that training and practice have different effects at different word rates again suggests

that the processes that are involved in a human's attempt to process speech above 300

w.p.m. may be different than the processes that are involved in processing speech below

300 w.p.m.

One controversial human factor is the visual ability of the listener. The intuitive

belief among the general population is that blind listeners can comprehend speech at a

greater rate than sighted persons. However, this has not been proven conclusively, and is

a matter of debate. Crowley, Lake, and Rathgaber found that in the intermediate grade

levels, sighted students scored better than blind students in comprehension tests (Crowley

et al., 1965). They tested subjects at 175 and 225 w.p.m. On the other hand, for the same

age group of school children, Bixler, Foulke, Amster, and Nolan (Bixler et al., 1961) found

that "blind school children can be given information at a rate commensurate with that

employed by normal children with no loss in comprehension." Their test measured no

significant differences between blind and sighted children at rates up to 275 w.p.m. In


another study, Hartlage (1963) found that there was no significant difference in

comprehension between blind and sighted persons at normal word rates.

While this result is surprising, it has little importance in practical applications. This

is because the average Braille reading rate for blind high school students is only 90 w.p.m.

(Bixler et al., 1961). Thus, from a practical point of view, if blind students listen to speech

at a rate of 125 w.p.m., they are still effectively increasing their "reading rate" by a

substantial amount. Note that a rate of 125 w.p.m. is not considered to be "speeded speech,"

and all of the researchers agree that at this low rate, there is little or no difference in

comprehension between blind and sighted students.


1.4 Motivation

The majority of published research agrees that in order to automatically modify

speech duration, one must periodically remove (or repeat, for expanded speech)

constant-length segments of a speech token at fixed time intervals, with no regard to the

phonemic or acoustic content of the specific segment. This results in a somewhat effective

method of modifying the speech rate, but is crude and ignores the fact that certain portions

of the speech signal carry more information than others. As a result, the user has little

control over the information that is eliminated.

If the user wants to modify only a certain portion of a word or sentence, he or she

must still manually edit the signal to produce the desired test token, a process that involves
"cutting and splicing" the signal. The only difference in this procedure between present

day and the 1950s is that today, this cutting and splicing is done with the help of a waveform

editor that is implemented on a digital computer.

This process of computer-aided cutting and splicing is exactly what is used in many

of the recent phonological studies that selectively eliminate portions of phonemes. The

term "selectively" is defined here to mean that the user can select the segments of the speech

signal to be discarded (or repeated). While the cut and splice method has the advantage


of being precise, it has the disadvantages of being time consuming and tedious, and it requires

suitable software. It also requires that the user mark the segments to be removed or

repeated, and concatenate the remaining segments. In addition, it does not automatically

prevent or smooth any discontinuities that may result from joining two dissimilar

waveforms. Therefore, the sets of test words that are created by this method are difficult

for other researchers to recreate, unless precise documentation is kept during the

development of the test data.

The disadvantages of both the waveform editors and the other previously described

methods used to modify the duration of speech provided the motivation for this study. The

speech research community could benefit from a system that creates high quality,

time-modified test tokens in a quick, convenient, repeatable, and selective manner. Ideally,

the system would eliminate the need for computer-based waveform editors and the

associated manual cutting and pasting processes. In addition, it would be desirable if the

user could control the system with parameters that are closely related to the acoustic

features of the speech signal. This would greatly decrease the training time, and in general,

would make the system easier for the speech researcher to operate. The system might also

inspire new research into time-modified speech, due to the increased levels of efficiency

and flexibility that were previously unavailable in a time modification system.


1.5 Goals

Given the desire for a better time modification system, this study set forth the

following goals: The first goal was to develop and implement a new time modification

system that allows the user to selectively modify certain portions of a speech signal, based

upon the signal's time-varying acoustical composition. In order to aid the speech
researcher, the segments are similar to the set of phoneme types (i.e. vowels, nasals,

semivowels, etc.). To do this, a software tool was created that first analyzes the speech

signal to determine the identity of the different phonetic segments, and then independently








modifies the durations of the phonetic segments according to global parameters specified
by the user. The time modification is done automatically without the use of waveform
editors. In addition, the software is written in the MATLAB programming language, and

can easily be ported at relatively low cost to a wide variety of computing platforms.

The time modification system incorporates a graphical user interface (GUI) that
frees the user from having to remember any complicated command-line syntax. All of the

modification parameters are adjusted by using a mouse to move and select slide-bar and

push-button controls that are displayed in various windows. After the modification

parameters are specified, the resulting time-modified speech is synthesized and played
with the click of a button.

The second goal of this study was to test the time modification system to ensure that
it was capable of creating high quality, synthesized test tokens. To do this, a set of
time-modified speech tokens was created and used in studies of perception in both informal

and formal listening tests. The formal listening test was chosen to be similar to several of

the phonological studies described in Section 1.3.2. By closely approximating published

tests, the ability of the system to create high quality synthesized test tokens was

investigated.

The third goal was to compare the results of the formal listening test with the results
of similar published research. Particular attention was given to the perception of initial

consonants in single-syllable, consonant-vowel (CV) and consonant-vowel-consonant

(CVC) words when different portions of the initial consonant were removed. A

comparison of previous research shows that in each study, typically only one portion of the

initial consonant was removed or modified. For example, one investigator reduced the

duration of a word-initial consonant by preserving the beginning portion of the consonant

(Cole and Cooper, 1975), while another investigator reduced the duration of a word-initial

consonant by preserving the end portion of the consonant (Grimm, 1966). One of the goals








in this study was to determine if there were significant changes in perception when various

portions (i.e. the beginning, middle, or end) of the initial consonants were modified.


1.6 General System Description

This section presents an overview of the system that is used for time modification.

A block diagram of the entire system is shown in Figure 1-1. The three main stages of the

system are (1) the speech analysis, segmentation, and labeling stage, (2) the manual

correction stage (for optional correction of the segmentation and labeling results), and (3)

the segment time modification and synthesis stage. Both the natural speech input signal

and the synthesized speech output signal are sampled-data time-domain signals. The

sampling frequency is fixed at fs = 10 kHz.

The first stage works automatically with no input from the user, other than the
sampled-data input signal. This stage divides the signal into pitch-synchronous frames

(pitch-asynchronous for unvoiced and silent speech) and performs a linear predictive

coding (LPC) analysis for each frame. The frames are then grouped into segments, and

each segment is labeled with the most appropriate phonemic type label (i.e. vowel,

semivowel, etc.). This entire process is accomplished by a series of software programs that

extract the acoustic features from the signal and compare the relative contribution of each

feature to the specific speech segment.

The second stage allows manual correction of the automatic segmentation and labeling

results. This is only required if the automatic segmentation and labeling stage makes

mistakes. Determination of whether or not a mistake has been made is left to the discretion

of the user. In this stage, a set of software programs with a graphical user interface (GUI)

allows the user to display and graphically edit the segment boundaries and labels. The user

adjusts the results by moving sliders and pushing buttons (with a mouse) on the computer

display.













[Figure 1-1. Block diagram of the time modification system. Blocks: Original Speech ->
Speech Analysis, Segmentation, and Labeling -> Labeled Speech Segments -> Manual
Correction of Segmentation and Labeling Results (if required; user interaction via mouse)
-> Edited and Labeled Speech Segments -> Segment Time Modification and Synthesis
(with user-specified parameters) -> Time-Modified Speech.]








The third stage performs the actual time modification process. It also performs the

synthesis of the resulting, time-modified speech. This stage uses a set of software programs

with a graphical user interface (GUI) that allows the user to graphically specify how the

speech signal is modified. Each type of phoneme has its own modification parameters, and

in addition, each segment can also have its own modification parameters independent of

phoneme type, if desired. Once the parameters are all specified, the third stage synthesizes

the time-modified speech using an LPC speech synthesizer.


1.7 Chapter Organization

Chapter 2 details the automatic speech analysis, segmentation, and segment

labeling programs that comprise the first stage in the time modification task. It also

discusses the mistakes made by the automatic programs, and describes a software tool

developed to manually correct any mistakes in the segmentation and labeling results.

Chapter 3 describes the time modification programs and the associated graphical user

interface. Chapter 4 describes the development and results for both the formal and informal

listening tests. Chapter 5 presents a discussion of the formal listening test results. Chapter

6 summarizes the study, and suggests ideas for future research.













CHAPTER 2
SPEECH ANALYSIS, SEGMENTATION, AND LABELING

The time modification system in this study varies the durations of selected

segments of the speech signal. Possible segments include vowels, nasals, unvoiced

fricatives, etc. Each segment duration is modified according to parameters specified by

the user. These parameters apply to either all occurrences of a specific type of segment,

or to a single occurrence of a segment. One example of a parameter that applies to a vowel

is the vowel scale factor, SF_vowel. The vowel scale factor specifies the desired ratio of the

duration of the vowel segment(s) in the time-modified word to the duration of the

corresponding vowel segment(s) in the original, unmodified word.
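For example, under this definition a scale factor of SF_vowel = 0.5 would shorten a 200 ms

vowel in the original word to approximately 100 ms in the time-modified word, while

SF_vowel = 2.0 would lengthen the same vowel to approximately 400 ms.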

Since the speech segments are the basis of the time modification system, it is

important that the segments are accurately detected and identified (for brevity, the

detection and identification processes will hereafter be called detection). To accomplish

this goal, the system designer is faced with one of two choices: manual detection (by hand),

or automatic detection (software). While manual detection provides good results, it is

extremely tedious and time consuming. This limits the usefulness of the overall time

modification system. Automatic detection is quick and relatively "painless," but is more

prone to mistakes than the manual method and requires significant initial development

time.

As a compromise, this study uses automatic detection of the speech segments with

subsequent manual editing to correct errors that may occur in the detection process.

Automatic detection consists of three main steps: (1) speech analysis, (2) segmentation of

the word or sentence into segments of unknown type, and (3) appropriate labeling of these

segments. The manual editing process allows the user to display and edit the automatic








segmentation and labeling results using a set of software programs with a convenient,

easy-to-use, graphical user interface (GUI). The GUI allows the user to insert silent

segments, change segment boundaries, change segment labels, and merge together

adjacent, like segments, all with the click of a workstation mouse.

This chapter describes both the automatic and manual algorithms used to analyze,

segment, and label the speech segments. It begins with a discussion of the selection of

speech segment categories. Next, a brief overview of the automatic segment detection

process is presented with an example of automatic detection of the voiced/unvoiced/silent

(V/U/S) parameter for the word "sue." The example illustrates the methods associated with
automatic detection of a single parameter, or "feature," of speech. The relevance and

application of these methods to the general set of feature detection programs is described.

Each of the automatic feature detection algorithms is then detailed. The segmentation and

labeling processes are also presented in detail. The nature of the mistakes produced by the

automatic algorithms is discussed, and the manual editing system and corresponding GUI

are described.


2.1 Selection of Speech Segment Categories

The complexity of the algorithms for segmentation and labeling in any speech

analysis task depends upon the degree of recognition that must be achieved (Davis and

Mermelstein, 1980). For a given speech sound, algorithms that determine only the

phoneme category will be less complicated than algorithms that determine not only the

category, but the identity of the phoneme as well. Likewise, for a given speech sound,

algorithms that determine the allophonic variation of a particular phoneme can be expected

to be complicated, due to the number of variations of the pronunciation of a single phoneme

that can occur during conversational speech (Klatt and Stevens, 1973). Thus, the

complexity of the segmentation and labeling task is dependent upon both the number and

choice of categories used to subdivide the speech.








There are several possibilities for selecting speech segment categories. In a

top-down paradigm, the simplest choice is to classify speech as voiced, unvoiced, or silent

(V/U/S). The next, more complex choice is to classify each speech sound as a member of

one of the basic phoneme types. Although the exact description and number of different

phoneme categories vary slightly depending upon the school of thought, these categories

usually include vowels, nasals, semivowels, voiced fricatives, voiced stops, unvoiced

fricatives, unvoiced stops, and silence. An even more complex categorization requires

identification of the exact phoneme. This requires matching the segment under

consideration with one of the 47 phonemes in English (Borden and Harris, 1984).

For this study, speech is divided into eight segment categories: vowels,

semivowels, nasals, voiced fricatives, voice bars, unvoiced stops, unvoiced fricatives, and

silence. Overall, this choice is a compromise between the complexity of the segment

recognition algorithms and the resolution of the resulting speech segments. Since

automatic language understanding is not required in this project, recognition of individual

phonemes is not required. This greatly reduces the complexity of the segment recognition

algorithms, since the choices in the matching process are reduced from 47 to eight. In

addition, it is easier to recognize the typically large differences between phoneme

categories than it is to recognize smaller differences between various phonemes of the same

category (Schwartz and Makhoul, 1975).


2.2 Overview of Automatic Segment Detection

Automatic detection of the speech segments is accomplished by a series of software

programs that sequentially analyze the speech signal. These programs are grouped

according to the task they perform in the detection process. The three main tasks are shown

in Figure 2-1 as: (1) speech analysis, (2) segmentation of the word or sentence into

segments of unknown type, and (3) labeling of these segments with the most appropriate
























[Figure 2-1. Block diagram of automatic speech detection. Original speech -> SPEECH
ANALYSIS (LPC and feature detection) -> SEGMENTATION -> LABELING -> labeled
speech segments. The analysis stage produces the LPC data, the V/U/S track, and the vowel,
semivowel, nasal, voice bar, voiced fricative, unvoiced fricative, unvoiced stop, and silence
scores; the segmentation stage produces the segment boundaries.]








segment label. A brief overview of each of these tasks follows, and they are discussed in

detail in later sections of this chapter.

Speech analysis is the most complicated of the three tasks, and is divided into

several steps. A block diagram of the speech analysis task is shown in Figure 2-2. The

initial analysis and decomposition of the speech waveform is derived from a two-pass

method developed by Hu (1993). In the first pass, the sampled waveform is divided

asynchronously into 5 ms frames. A 13th-order, linear predictive coding (LPC) analysis

is performed for each frame, and the residue is processed to determine the glottal closure

points (Hu, 1993). Only the glottal closure indices (GCI) are retained for use in the second

pass of the algorithm. In the second pass, the sampled waveform is again divided into

frames. The frames are chosen pitch asynchronously for unvoiced speech and silence, and

pitch synchronously for voiced speech, using the glottal closure indices as a reference. A
13th-order, LPC analysis is performed for each frame, and the residue, LPC coefficients,

and power are saved for later modification and synthesis. Next, a set of feature detection

algorithms analyzes each frame individually. Each algorithm in the set detects a different

acoustic feature. For example, one of the algorithms detects the presence or absence of

nasals, while another algorithm detects the presence or absence of semivowels. Each

feature detection algorithm uses a combination of fixed thresholds, median filtering, and

empirical rules to calculate the final result, or "feature score."

The second automatic detection task shown in Figure 2-1 is determination of the

time-domain boundaries that separate the segments of the speech signal. This process is

known as segmentation. The boundaries are chosen such that each segment has relatively
stable acoustic properties for the duration of the segment. Segmentation is accomplished

by combining the results of two different algorithms. The first algorithm determines the

changes in the "trend" of the short-term frequency spectra, and the second uses the results

of the voiced / unvoiced / silent (V/U/S) feature detector.








[Figure 2-2. Block diagram of speech analysis.]








The third task in Figure 2-1 is labeling of the segments using the results obtained

from the feature detection algorithms. Examples of labels are vowel, semivowel, and

unvoiced fricative. Labeling is done in two steps. First, the average feature detection

scores are calculated. Next, empirical rules are applied to the average scores to determine

the most appropriate label for each segment.


2.3 Feature Detection Algorithms-General Development

Acoustic feature detection is the search for different (acoustic) features. Examples

of acoustic features include voicing, nasality, and sonorance. While acoustic features are

used to help differentiate between the various segment categories, it is important to realize

that individual acoustic features may not be unique to one particular segment category. For

example, nasality may indicate the presence of a nasal, or it may indicate the presence of

a nasalized vowel. Thus, in this example, one acoustic feature is common to two different

segment categories. This lack of one-to-one correspondence between acoustic features and

segment categories requires that multiple acoustic features be evaluated and weighed when

attempting to match an unknown speech segment with the most appropriate segment label.

Although it is logical to use the term "segment detector" to define an algorithm that

detects one of the eight segment types listed in Section 2.1, this term is misleading, since

it can be confused with the earlier definition of segmentation, which is the task

of dividing the speech signal into segments of unknown type. Therefore, in this study, the

term "feature detector" is used in a broad sense, and implies both an algorithm that detects

a single acoustic feature, as well as an algorithm that detects multiple acoustic features in

order to detect one of the eight segment types listed in Section 2.1.

In general terms, feature detection is achieved by algorithms that examine the

short-term frequency spectra of the speech signal. The spectra are calculated from the LPC

coefficients that are, in turn, calculated during the initial analysis stage for each frame of

the signal. It has been shown that the short-term frequency spectra method is a reliable








technique used in a wide variety of recognition systems (Bush et al., 1983; Glass and Zue,

1986; Glass and Zue, 1988; Klatt, 1977; Leung et al., 1993; McCandless, 1974; Meng and

Zue, 1991; Mermelstein, 1977; Weinstein et al., 1975; Zue et al., 1989).

Each feature detection algorithm utilizes a sequence of processing stages to

calculate the resulting feature score. In many instances, the structure of each of the feature

detection algorithms is similar, although the exact numerical values may differ.

The detection of acoustic features from the speech signal is the most complicated

portion of the analysis, segmentation, and labeling process. Because of the complexity of

the feature detection algorithms, the explanation of the algorithms is broken down into two

sections. In this section, a simple example is given to explain one feature detection

algorithm and its development and implementation. Although the example is for a single

feature detector, it illustrates the general structure of the majority of the feature detectors.

The example also discusses the problems and considerations associated with the set of

feature detection algorithms as a whole. In the next section, the algorithms are detailed

individually, examining the specific equations of the algorithms.

The feature detection algorithms use a combination of methods to produce the final

results. These methods include bandpass filters, fixed thresholds, median filter smoothing,

and empirical pattern recognition rules. These methods are used in a similar manner in each

of the algorithms. The example that follows illustrates how these methods work in a feature

detection algorithm that detects the voiced / unvoiced / silence (V/U/S) feature of speech.

2.3.1 Input Data and V/U/S Pre-Processing

All of the feature detection algorithms require the LPC results as input data. Most

of the feature detectors also require the results from other feature detectors, (specifically

the V/U/S results), as shown in Figure 2-2.

V/U/S classification is different from the other feature detection algorithms in that

a portion of the algorithm is accomplished during the initial LPC analysis algorithm.








During the first pass of the LPC algorithm, the first reflection coefficient is calculated for

each pitch-asynchronous frame. The frame is classified as voiced (V) if the reflection

coefficient is greater than 0.2, and is classified as unvoiced (U) if the reflection coefficient

is less than or equal to 0.2. This threshold was determined empirically by Hu (1993). In

addition, Hu makes no distinction between unvoiced and silent frames. Therefore, all silent

frames are classified as unvoiced. During the second pass of the LPC algorithm, certain

frames are labeled as transitional frames (T). The first voiced frame in an unvoiced-voiced

sequence, and the last voiced frame in a voiced-unvoiced sequence are changed to

transitional frames. Hu's explanation for this is that mistakes may be made in the simple

V/U decision process, so the frames at the transition regions are marked since this is

typically where the mistakes are made. Since it has been observed that the transition frames

are always voiced, all transition frames are converted to voiced frames in this study.

2.3.2 Volume Function


A volume function, V(i), similar to one presented by Weinstein et al. (1975) is

calculated for each frame to determine a quantity analogous to the loudness, or acoustic

volume, of the signal at the output of a hypothetical bandpass filter. This is the first

processing step in the majority of feature detectors. The volume function is normalized

by the number of samples in the frame, and is given by

    V(i) = \frac{1}{N_i} \sum_{m=A}^{B} \left| H_i\left( e^{j\pi m/256} \right) \right|^2                    (2.1)


where i is the current frame index, N_i is the number of samples in frame i, A is the index

of the low cutoff frequency of the bandpass filter, B is the index of the high cutoff frequency

of the bandpass filter, and H_i(e^{j\omega}) is the complex, single-sided, frequency response of the








IIR filter, H_i(z), produced by the LPC coefficients and evaluated at the points

\exp(j\pi m/256), for 0 \le m \le 255. H_i(z) is given by

    H_i(z) = \frac{G(i)}{a_0 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_N z^{-N}}                    (2.2)

where N = 13, a_0 = 1, and G(i) is given by

    G(i) = \sum_{n=s}^{t} r(n)^2                    (2.3)


where r(n) is the value of the LPC residue at sample n, i is the current frame index, s is the

beginning sample number of the current frame, and t is the ending sample number of the

current frame.

The volume function of Equation 2.1 is used extensively in this study, although the

frequency range of the bandpass filter varies depending upon the specific detector. In

addition, many of the feature detection algorithms calculate the ratio of two volume

functions, each with its own frequency range. This compares the energy in one frequency

band to the energy in a second frequency band.

In the majority of feature detectors, median filtering is done to smooth any large,

short-term, fluctuations in the volume function. The fluctuations are caused by a variety

of sources including incorrect GCI determination, incorrect V/U/S classification, and

recording artifacts such as tape hiss and background noise. Although the majority of the

feature detectors use a 5th-order median filter for smoothing, the exact order is given in

the detailed description of each detector. The filter order is determined empirically in each

case.
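As a concrete illustration, the following MATLAB sketch evaluates the volume function of

Equation 2.1 for a single frame directly from its LPC coefficients. The function and argument

names are illustrative rather than taken from the actual system software; the gain G is passed

in explicitly, since the V/U/S detector computes it from the residue (Equation 2.3), while the

ratio-based detectors of Section 2.4 use G = 1.

    function V = volume_function(a, G, A, B, Ni)
    % Illustrative sketch of Equation 2.1 (not the system's actual code).
    %   a   : LPC coefficients [a0 a1 ... aN] for frame i, with a0 = 1
    %   G   : frame gain (Equation 2.3 for the V/U/S detector, 1 otherwise)
    %   A,B : indices of the low and high cutoff frequencies (0 to 255)
    %   Ni  : number of samples in frame i
        m  = A:B;                               % frequency indices in the band
        w  = pi * m / 256;                      % points on the upper half of the unit circle
        Aw = polyval(fliplr(a), exp(-1j*w));    % a0 + a1*exp(-jw) + ... + aN*exp(-jNw)
        H  = G ./ Aw;                           % H_i of Equation 2.2
        V  = sum(abs(H).^2) / Ni;               % band energy, normalized by the frame length
    end

A track of per-frame values produced in this way could then be smoothed with a fifth-order

median filter (for example, with the MATLAB function movmedian) before thresholding, as

described above.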

The V/U/S detector uses a single volume function of Equation 2.1 with the values

A = 17 and B = 255. The lower limit of A = 17 serves to highpass (HP) filter the

frequency response with a cutoff frequency of 312 Hz (the upper limit B = 255








corresponds to one-half the sampling rate, thus a highpass instead of a bandpass filter).

Weinstein claims that the HP filter is needed to reduce the sensitivity to voiced stops, but

experiments show that its primary effect is to reduce low frequency artifacts such as wind

noise and other pop-like sounds caused by non-optimum microphone placement during the

recording process. In general, the volume function is used in the V/U/S detector as a

relatively wide-band integrator that calculates the approximate energy in each frame. The

role of this integrator is discussed in the next section.

A graph of the frequency response of Hi(z) before and after the hypothetical

highpass filter for one pitch period of the vowel portion of the word "sue" spoken by a male

speaker is shown in Figure 2-3. As described in Equation 2.1, the single sided frequency

response of Hi(z) is evaluated at 256 equally spaced points around the upper half of the unit

circle in the z-plane. Although it is not shown on the graph, the value of V(i) for the pitch

period analyzed in Figure 2-3b is 70.62 dB. It is seen from the graph that this is

approximately the average level of the frequency response.

Note also that unlike the other feature detectors, median filtering is not performed

on the V/U/S volume function. This is to ensure that any short-term energy fluctuations,

such as those produced by stops, are not inadvertently smoothed.

2.3.3 Fixed Thresholds and Feature Scores

Each feature detection algorithm calculates a feature score to indicate the presence

of the corresponding acoustic feature in a given frame of speech. The feature score is

typically continuous over the range [0,1], although there are exceptions (several of the

feature scores are discrete, either binary or trinary). In general, the feature score is

calculated by comparing the value of the volume function (or the ratio of two volume

functions) with one or more fixed thresholds. The values of the thresholds are determined

empirically by trial and error during the analysis of approximately 100 words of the

Diagnostic Rhyme Test (DRT) spoken by two male speakers and one female speaker (Voiers, 1983).









[Figure 2-3. H_i(z) used to calculate the V/U/S volume function for one pitch period of the
vowel portion of the word "sue." a) H_i(z) before filtering; b) H_i(z) after filtering.]








In some cases, initial estimates for the thresholds are taken from literature (sources are

listed in discussion of the specific algorithms), and the thresholds are then "fine tuned"

using the DRT speech data.

The empirical determination of these thresholds constitutes a type of "learning"

phase in the algorithm development. This contrasts with one of the initial goals of the automatic

segmentation and labeling process, which is to not require "training" of the algorithms.

However, given the nature and variability of the speech signal, it now seems impossible

to create a set of reliable segmentation and labeling algorithms based upon frequency

distributions (i.e. volume functions) without some type of training, or parameter "tuning."

The advantages and disadvantages of training are obvious. If the training data does

not accurately represent the set of intended users, the algorithms will not function as

expected in practice. If the training data completely represents the set of intended users,

the algorithms will work efficiently and accurately. Since the topic of training of speech

recognition algorithms is beyond the scope of this dissertation, it will be accepted that

training is mandatory, regardless of the particular algorithm.

In general, if two thresholds are used, the feature score for each frame is determined

by

    Feature\_Score(i) =
    \begin{cases}
        1, & \text{if } Vol\_Fcn(i) \ge T_{upper} \\
        0, & \text{if } Vol\_Fcn(i) < T_{lower} \\
        \dfrac{Vol\_Fcn(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le Vol\_Fcn(i) < T_{upper}
    \end{cases}                    (2.4)


for a feature score that increases as the volume function increases. If the feature score

decreases as the volume function increases, the feature score is given by









    Feature\_Score(i) =
    \begin{cases}
        0, & \text{if } Vol\_Fcn(i) \ge T_{upper} \\
        1, & \text{if } Vol\_Fcn(i) < T_{lower} \\
        \dfrac{T_{upper} - Vol\_Fcn(i)}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le Vol\_Fcn(i) < T_{upper}
    \end{cases}                    (2.5)


For both equations, i is the current frame index, T_lower is the fixed lower threshold, T_upper

is the fixed upper threshold, and Vol_Fcn(i) is the volume function (or the ratio of two

volume functions) for the current frame. Both Equation 2.4 and Equation 2.5 are used in

practice. If only one threshold is used to calculate a binary feature score (zero or one), then

either Equation 2.4 or Equation 2.5 is used with T_lower = T_upper.
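A minimal MATLAB sketch of the increasing mapping of Equation 2.4 is given below; the

names are illustrative. The decreasing mapping of Equation 2.5 is simply one minus this value

computed over the same pair of thresholds.

    function s = feature_score(vf, T_low, T_up)
    % Illustrative sketch of Equation 2.4 (not the system's actual code).
    %   vf    : volume function (or ratio of volume functions) for the frame
    %   T_low : fixed lower threshold
    %   T_up  : fixed upper threshold (set T_low = T_up for a binary score)
        if vf >= T_up
            s = 1;
        elseif vf < T_low
            s = 0;
        else
            s = (vf - T_low) / (T_up - T_low);  % linear ramp between the two thresholds
        end
    end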

The original LPC analysis described in Section 2.3.1 distinguishes only between

voiced and non-voiced frames. The non-voiced frames denoted by Hu (1993) as

"unvoiced" are either unvoiced or silent. The procedure used in the V/U/S detector to

classify non-voiced frames as either unvoiced or silent is based upon a single volume

function, the background noise power in the speech signal, and Equation 2.4. First, the

mean and the standard deviation of the background noise power, BNP, are calculated as

    BNP_{mean} = \frac{1}{20} \sum_{n=1}^{20} p(n)                    (2.6)

    BNP_{std\,dev} = \sqrt{ \frac{1}{20} \sum_{n=1}^{20} \left( p(n) - BNP_{mean} \right)^2 }                    (2.7)


where p(n) is the frame power in decibels (dB). It is assumed that the first 100 ms (20

frames) of the speech signal are silence.

The V/U/S volume function for each non-voiced frame is then compared to a

constant threshold, T_U/S, using Equation 2.4 with T_lower = T_upper = T_U/S.

The V/U/S feature score for each non-voiced frame is given as









    VUS\_Score(i) =
    \begin{cases}
        1, & \text{if } 20\log_{10}\{V(i)\} \ge T_{U/S} \\
        0, & \text{if } 20\log_{10}\{V(i)\} < T_{U/S}
    \end{cases}                    (2.8)


where V(i) is calculated from Equation 2.1 with A = 17 and B = 255, and i is the index

of the current frame. If VUS_Score(i) = 1, the frame is classified as unvoiced, and if

VUS_Score(i) = 0, the frame is classified as silent. Note that this method only separates

unvoiced from silent frames. The value of VUS_Score(i) is arbitrarily set equal to two for

all voiced frames. As a result, the V/U/S feature score is different from many of the other

feature scores in two ways: First, it spans the range [0,2] instead of [0,1]. Second, it can

have only one of three discrete values, while most of the other feature scores are

continuous.

It is found (empirically) that the best results are obtained when

    T_{U/S} = BNP_{mean} + k \cdot BNP_{std\,dev}                    (2.9)

where k = 2.0. Obviously, the value used for k is dependent upon the statistical properties

of the background noise in the speech signal. However, the absolute level of the

background noise is compensated for automatically since the value of BNP_mean is

calculated before analysis of each word.
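The following MATLAB sketch summarizes the unvoiced/silent decision of Equations 2.6

through 2.9 for the non-voiced frames. The function and variable names are illustrative

assumptions, not the system's actual identifiers.

    function VUS_Score = vus_classify(p, V_dB, is_voiced)
    % Illustrative sketch of Equations 2.6-2.9 (not the system's actual code).
    %   p         : per-frame power in dB; the first 20 frames (100 ms) are assumed silent
    %   V_dB      : 20*log10 of the V/U/S volume function for each frame
    %   is_voiced : logical vector from the first-pass voiced/unvoiced decision
        k        = 2.0;
        BNP_mean = mean(p(1:20));                     % Equation 2.6
        BNP_std  = std(p(1:20), 1);                   % Equation 2.7 (normalized by 20)
        T_US     = BNP_mean + k * BNP_std;            % Equation 2.9

        VUS_Score = 2 * ones(size(V_dB));             % voiced frames are assigned the value 2
        nv = ~is_voiced;                              % non-voiced frames
        VUS_Score(nv) = double(V_dB(nv) >= T_US);     % 1 = unvoiced, 0 = silent (Equation 2.8)
    end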

Figure 2-4 shows the V/U/S classification for the word "sue" spoken by a male

speaker: a) shows the time-domain speech waveform, b) shows the volume function (in

dB) from Equation 2.1 and the fixed threshold, TU/s, and c) shows the resulting V/U/S

score.

2.3.4 Automatic Correction Rules

The feature scores produced by Equation 2.4 (or Equation 2.5) sometimes require

additional processing to help eliminate false detection of features. The processing is

accomplished by the application of pattern recognition rules. These rules are used


















[Figure 2-4. Voiced / unvoiced / silent classification using the volume function and a single
fixed threshold for the non-voiced portion of the word "sue." a) Time-domain waveform;
b) Volume function and T_U/S threshold (threshold shown as dashed line); c) V/U/S score.]








sparingly, and counteract specific, regularly occurring, incorrect classifications of feature

types. They are developed empirically on an algorithm-by-algorithm basis.

In the case of the V/U/S detector, the unprocessed, or "raw," V/U/S score produced

by Equation 2.8 has the undesirable effect of sometimes oscillating between states during

word onsets and offsets, as well as during transitions between unvoiced and voiced regions.

Since this does not accurately model the human speech process, rules are applied to smooth

the V/U/S score, or "track." Table 2-1 lists the rules.


Table 2-1. Rules to modify initial voiced/unvoiced/silent (V/U/S) results.
           The symbols x and y denote any segment type.

    Rule    Initial Pattern         Requirements for Modification     Final Pattern

    1       V U S                   length V > 100.0 ms,              V V S
                                    length U < 25.1 ms

    2       x S y                   length S < 10.1 ms                x x y

    3       S U V                   length U < 7.5 ms                 S S V

    4       x U y (except S U V)    length U < 10.0 ms                x y y (if x = S),
                                                                      otherwise x x y



The first rule eliminates an incorrect unvoiced classification at the end of a long

voiced segment. Since the energy often decreases quickly during the last few glottal cycles

of a voiced-silent transition, the classification algorithm sometimes (incorrectly) labels

these pitch periods as unvoiced. The second rule smooths out momentary "drop outs" that

occur when the signal level drops below the Tu/s threshold for a brief period of time. The

third rule is similar to the first rule, except that it smooths out the beginning of the segment,

instead of the end. It reclassifies the very short, low energy frame at the beginning of a








silence-unvoiced-voiced transition from unvoiced to silent. The fourth rule smooths out

momentary unvoiced segments of very short duration. This rule does not eliminate the

short noise bursts exhibited by plosives, since it only acts if the unvoiced segment is less

than 10 ms, which is far shorter than the average duration of the plosive burst (Klatt, 1979;

Umeda, 1977).
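As an illustration of how such rules can be applied, the MATLAB sketch below implements

Rule 2 of Table 2-1 on a run-length representation of the V/U/S track. The function names and

the run-length helper are illustrative, and the remaining rules would be handled in the same

manner.

    function labels = apply_rule2(labels, dur_ms)
    % Illustrative sketch of Rule 2 of Table 2-1 (not the system's actual code).
    %   labels : per-frame V/U/S labels as a character row vector, e.g. 'SSUUVVV'
    %   dur_ms : per-frame durations in milliseconds
        [runLab, runStart, runLen] = label_runs(labels);
        for k = 2:numel(runLab)-1
            idx = runStart(k) : runStart(k) + runLen(k) - 1;
            if runLab(k) == 'S' && sum(dur_ms(idx)) < 10.1
                labels(idx) = runLab(k-1);            % x S y  ->  x x y
            end
        end
    end

    function [lab, start, len] = label_runs(x)
    % Run-length encode a label vector.
        change = [true, x(2:end) ~= x(1:end-1)];
        start  = find(change);
        len    = diff([start, numel(x) + 1]);
        lab    = x(start);
    end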

Figure 2-5 shows the V/U/S score for the word "sue" spoken by a male speaker

before and after the application of the pattern recognition rules. The rules reduce the

number of non-silent segments from five to two.

2.3.5 Summary

Each feature detection algorithm is comprised of a similar sequence of processing

stages. In general, the first stage calculates one or more volume functions. The volume

function (or ratio of volume functions) is smoothed by a median filter to remove any

short-term fluctuations. The second stage calculates a feature score, typically over the

range [0,1], by comparing the volume function with one or two fixed thresholds. The third

stage applies pattern recognition rules to correct for any known deficiencies in the

algorithms.

The following section details the individual feature detection algorithms. Each of

the detectors in Figure 2-2 is described, and any differences from the V/U/S example

detector are discussed.


2.4 Feature Detection Algorithms-Detailed Descriptions

This section gives the details of the feature detection algorithms. The algorithms
are, for the most part, similar in form to the V/U/S feature detection algorithms described

in Section 2.3.

















[Figure 2-5. Voiced / unvoiced / silent (V/U/S) classification for the word "sue." Segment
durations are indicated in number of samples. a) Time-domain waveform; b) Before pattern
recognition rules; c) After pattern recognition rules.]








2.4.1 Sonorant Detection

To be classified as sonorant, a frame must be voiced, and must also have a high ratio

of low frequency to high frequency energy. The group of sonorants typically includes

vowels, voice bars, nasals, and semivowels. The non-sonorants include unvoiced

fricatives, unvoiced stops, and strong, voiced fricatives. Weak voiced fricatives are

classified as sonorant if they have a relatively large proportion of low-frequency energy

(Weinstein et al., 1975).

The volume function from Equation 2.1 is calculated for each frame with G = 1,
A = 5, and B = 46. This is termed the low frequency volume function, or LFV. The LFV

is equivalent to a bandpass filter from 98 Hz to 898 Hz. A second volume function from

Equation 2.1 is calculated for each frame with G = 1, A = 189, and B = 255. This is

termed the high frequency volume function, or HFV. The HFV is equivalent to a bandpass
filter from 3691 Hz to 5000 Hz. The sonorant ratio, R(i), is calculated for each frame as

    R(i) = \frac{LFV(i)}{HFV(i)}                    (2.10)

where i is the index of the current frame.

The sonorant ratio is then smoothed by a fifth-order median filter. The smoothed
sonorant ratio is compared to a threshold, Tson, and a binary (zero or one) sonorant score,

SS(i), is calculated for each frame as


    SS(i) =
    \begin{cases}
        0, & \text{if } R(i) < T_{son} \\
        1, & \text{if } R(i) \ge T_{son}
    \end{cases}                    (2.11)


where i is the frame index, and T_son = 10. The threshold T_son is determined empirically.
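In terms of the illustrative helpers sketched in Section 2.3, the sonorant detector can be

outlined in MATLAB as follows; the frame data structures are assumptions, not the system's

actual code.

    function SS = sonorant_score(lpcA, Ni)
    % Illustrative sketch of Equations 2.10-2.11 (not the system's actual code).
    % Uses the volume_function helper sketched in Section 2.3.2, with G = 1.
    %   lpcA : cell array, lpcA{i} = LPC coefficients [a0 ... aN] of frame i
    %   Ni   : vector of frame lengths in samples
        T_son = 10;
        numFrames = numel(lpcA);
        LFV = zeros(1, numFrames);
        HFV = zeros(1, numFrames);
        for i = 1:numFrames
            LFV(i) = volume_function(lpcA{i}, 1,   5,  46, Ni(i));   % roughly   98-898  Hz
            HFV(i) = volume_function(lpcA{i}, 1, 189, 255, Ni(i));   % roughly 3691-5000 Hz
        end
        R  = movmedian(LFV ./ HFV, 5);        % sonorant ratio, smoothed (Equation 2.10)
        SS = double(R >= T_son);              % binary sonorant score (Equation 2.11)
    end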

Figure 2-6 shows the sonorant detection results for the word "sue" spoken by a
male speaker. The threshold Tson = 10 is shown as a dashed line in Figure 2-6b. Note

that the sonorant ratio is nearly zero for the entire duration of the /s/.










[Figure 2-6. Sonorant detection for the word "sue." a) Time-domain waveform;
b) Sonorant ratio and threshold; c) Sonorant score.]








2.4.2 Vowel Detection

Vowel detection is accomplished in a manner similar to that for sonorant detection.

A LFV function from Equation 2.1 is calculated with G = 1, A = 1, and B = 51. The

LFV is equivalent to a bandpass filter from 20 Hz to 996 Hz. A HFV function from

Equation 2.1 is calculated for G = 1, A = 52, and B = 255. The HFV is equivalent to

a bandpass filter from 1016 Hz to 5000 Hz. A vowel ratio, VWL(i), is calculated for each

frame by

    VWL(i) = \frac{LFV(i)}{HFV(i)}                    (2.12)

where i is the frame index. The vowel ratio is then smoothed with a fifth-order median

filter. A vowel score, VWLS(i), within the continuous range [0,1] is calculated for each

frame by comparing the smoothed vowel ratio with two thresholds. The score is given by



    VWLS(i) =
    \begin{cases}
        0, & \text{if } VWL(i) \ge T_{upper} \\
        1, & \text{if } VWL(i) < T_{lower} \\
        \dfrac{T_{upper} - VWL(i)}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le VWL(i) < T_{upper}
    \end{cases}                    (2.13)


where T_upper = 18 and T_lower = 8. The two thresholds are determined empirically.

In a final processing stage, the vowel score is automatically set to zero for all frames
in any vowel segment that is 150 samples (15 ms) or less in length. This helps to reduce

false vowel detection.
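A possible MATLAB sketch of this clean-up step is shown below. It assumes that a frame

belongs to a vowel segment whenever its vowel score is nonzero, and it reuses the illustrative

label_runs helper from Section 2.3.4; neither assumption is taken from the actual system code.

    function VWLS = suppress_short_vowels(VWLS, frame_len)
    % Illustrative sketch of the short-segment suppression (not the system's actual code).
    %   VWLS      : per-frame vowel scores in [0,1]
    %   frame_len : per-frame lengths in samples
        [lab, start, len] = label_runs(double(VWLS > 0));    % assumed segment definition
        for k = 1:numel(lab)
            idx = start(k) : start(k) + len(k) - 1;
            if lab(k) == 1 && sum(frame_len(idx)) <= 150     % 150 samples = 15 ms at 10 kHz
                VWLS(idx) = 0;                               % suppress the spurious vowel segment
            end
        end
    end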

Figure 2-7 shows the vowel detection results for the word "said" spoken by a male
speaker. The two thresholds are shown as dashed lines in Figure 2-7b. Note that the voiced

/d/ exhibits a high vowel score for a short duration. Since there is no voiced stop segment









[Figure 2-7. Vowel detection for the word "said." a) Time-domain waveform;
b) Vowel ratio and thresholds; c) Vowel score.]








category in this study, voiced stops are typically classified as either a vowel-unvoiced stop

sequence, or a vowel-unvoiced fricative sequence.

2.4.3 Voiced Consonant Detection

Voiced consonant detection is accomplished in a manner almost identical to vowel

detection. A LFV function from Equation 2.1 is calculated with G = 1, A = 1, and

B = 51. The LFV is equivalent to a bandpass filter from 20 Hz to 996 Hz. A HFV function

from Equation 2.1 is calculated for G = 1, A = 52, and B = 255. The HFV is equivalent

to a bandpass filter from 1016 Hz to 5000 Hz. These filter values are the same as those used

for vowel detection. A voiced consonant ratio, VC(i), is calculated for each frame as

    VC(i) = \frac{LFV(i)}{HFV(i)}                    (2.14)

where i is the frame index. The voiced consonant ratio is then smoothed with a fifth-order

median filter. A voiced consonant score, VCS(i), within the continuous range [0,1] is

calculated for each frame by comparing the smoothed voiced consonant ratio with two

thresholds. The score is given by

    VCS(i) =
    \begin{cases}
        1, & \text{if } VC(i) \ge T_{upper} \\
        0, & \text{if } VC(i) < T_{lower} \\
        \dfrac{VC(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le VC(i) < T_{upper}
    \end{cases}                    (2.15)


where T_upper = 18 and T_lower = 8. The thresholds are determined empirically.

Note that VCS can be calculated during the VWLS calculation for each frame, since
VCS = 1 - VWLS, provided that the value of VWLS is used before the short segment

(< 15 ms) vowel detection and elimination of Section 2.4.2 is done. This is because the

same filters and thresholds are used for both vowel and voiced consonant detection.








However, calculating VCS directly from VWLS would eliminate the possibility of future

experiments with the filter characteristics and thresholds for voiced consonant detection
independent of the vowel detection algorithm.

Figure 2-8 shows the voiced consonant detection results for the word "said"

spoken by a male speaker. The two thresholds are shown as dashed lines in Figure 2-8b.

The voice bar that occurs before the release of the /d/ is clearly classified as a voiced

consonant. However, the algorithm assigns a score of slightly greater than 0.5 to the initial

portion of the release of the /d/. This shows the difficulty of detecting voiced stops using

a single acoustic feature.

2.4.4 Voice Bar Detection

Voice bar detection is accomplished in a manner similar to both vowel and voiced

consonant detection. A LFV function from Equation 2.1 is calculated with G = 1,

A = 1, and B = 33. The LFV is equivalent to a bandpass filter from 20 Hz to 645 Hz.

A HFV function from Equation 2.1 is calculated for G = 1, A = 34, and B = 255. The

HFV is equivalent to a bandpass filter from 664 Hz to 5000 Hz. A voice bar ratio, VB(i),

is calculated for each frame as

    VB(i) = \frac{LFV(i)}{HFV(i)}                    (2.16)

where i is the frame index. The voice bar ratio is then smoothed with a fifth-order median

filter. A voice bar score, VBS(i), within the continuous range [0,1] is calculated for each

frame by comparing the smoothed voice bar ratio with two thresholds. The score is given

by








[Figure 2-8. Voiced consonant detection for the word "said." a) Time-domain waveform;
b) Voiced consonant ratio and thresholds; c) Voiced consonant score.]









    VBS(i) =
    \begin{cases}
        1, & \text{if } VB(i) \ge T_{upper} \\
        0, & \text{if } VB(i) < T_{lower} \\
        \dfrac{VB(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le VB(i) < T_{upper}
    \end{cases}                    (2.17)


where T_upper = 30 and T_lower = 10. The thresholds are determined empirically.

In a final processing stage, the voice bar score is automatically set to zero for all

frames in any voice bar segment that is 300 samples (30 ms) or less in length. This helps

to reduce false voice bar detection.

Figures 2-9 and 2-10 show voice bar detection for the words "said" and "bond,"

respectively, spoken by a male speaker. The two thresholds are shown as dashed lines in

Figures 2-9b and 2-10b. In Figure 2-9, the voice bar before the release of the /d/ is clearly

detected. Figure 2-10 shows both the voice bar of the initial /b/, and the voice bar

associated with the final /d/, for the word "bond." Also note in Figures 2-9 and 2-10 that

the voice bar associated with the final /d/ is much shorter in the word "bond" than in the

word "said" due to the preceding nasal in "bond."


2.4.5 Formant Tracking

Formant tracking is accomplished in a manner completely different from the

typical detection process. The output of the algorithm is also different from the other

feature detector outputs. Actually, the formant tracking algorithm is a "front-end"

processor for the nasal detection algorithm, since the nasal detector does not use volume

functions, but rather a ratio of the amplitudes of the first two formant frequencies to

determine the nasal feature score.

Formant tracking is accomplished by an algorithm developed by McCandless

(1974). Only a brief description is given, since the algorithm is documented elsewhere.

Only the voiced portions of the speech signal are analyzed to estimate the formant tracks.








[Figure 2-9. Voice bar detection for the word "said." a) Time-domain waveform;
b) Voice bar ratio and thresholds; c) Voice bar score.]

















[Figure 2-10. Voice bar detection for the word "bond." a) Time-domain waveform;
b) Voice bar ratio and thresholds; c) Voice bar score.]








The algorithm attempts to find a best match between the peaks of the frequency

response obtained from the filter produced from the LPC coefficients, and estimates for the

first four formant frequencies. The amplitudes of the first four formant peaks are also

estimated. Initially, the estimates for the four formant frequencies are set to values that are

typical for the male voice (or the female voice, if female speech is being analyzed). For

the male voice, the initial estimates are fl = 320 Hz, f2 = 1440 Hz, f3 = 2760 Hz, and

f4 = 3200 Hz. For the female voice, the initial estimates are fl = 480 Hz,

f2 = 1760 Hz, f3 = 3200 Hz, and f4 = 3520 Hz. The algorithm matches each peak of

the frequency response of the LPC filter with the closest formant frequency estimate. The

estimates for the formant frequencies are updated after each frame of speech is processed,

provided that a match has been made.

In any given frame, if there is no match between the LPC filter peaks and the

formant frequency estimates, the algorithm attempts to increase spectral resolution by

iteratively evaluating the "frequency response" of the LPC filter on a circle in the z-plane

with a radius of less than one. This is done by evaluating the z-transform of the LPC filter

with z = r e^{j\theta}, where r denotes the radius of the circle. The initial radius value is unity,

and is decreased by 0.004 during each iteration until a match is obtained, or until the radius

is less than 0.88. This procedure is able to resolve two closely spaced poles if they are

relatively close to the unit circle. If the radius is reduced to less than 0.88 and a match is

still not found for all of the formant frequencies in the frame, the algorithm reevaluates the

matches it has made for the frame. During the reevaluation, the algorithm either changes

the matches it has made, or assigns a zero value to the formant frequency in question for

that particular frame.
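The radius-reduction step can be sketched in MATLAB as follows. The peak-matching routine

find_peak_match is hypothetical and stands in for the matching logic of the McCandless

algorithm; only the evaluation of the LPC spectrum on a shrinking circle is shown.

    function [matched, fEst] = formant_radius_sweep(a, fEst)
    % Illustrative sketch of the radius-reduction step (not the system's actual code).
    %   a    : LPC coefficients [a0 ... aN] for the current frame
    %   fEst : current estimates of the first four formant frequencies (Hz)
        fs = 10000;
        m  = 0:255;
        w  = pi * m / 256;                        % analysis frequencies (radians)
        r  = 1.0;                                 % start on the unit circle
        matched = false;
        while ~matched && r >= 0.88
            z = r .* exp(1j*w);                   % circle of radius r in the z-plane
            H = 1 ./ polyval(fliplr(a), 1 ./ z);  % peaks sharpen as r moves inside the unit circle
            [matched, fEst] = find_peak_match(abs(H), fEst, fs);   % hypothetical matcher
            r = r - 0.004;                        % shrink the radius for the next attempt
        end
    end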

The final results are smoothed by checking each of the first three formant frequency

tracks individually for any zero values (this is not a part of the McCandless algorithm).

If the formant frequency is zero for either one or two (consecutive) frames, the frequency

and amplitude are linearly interpolated and the zero values are removed.








The first three estimated formant frequency and estimated formant amplitude

tracks for the voiced portion of the word "meat" spoken by a male speaker are shown in

Figure 2-11. Although only the first two formant tracks are used in this study, the first three

formants were retained for future work.

2.4.6 Nasal Detection

Nasal detection is done by comparing the estimated amplitudes of the first two

formants obtained in the McCandless formant tracker. A nasal ratio, N(i), is calculated for

each frame as

    N(i) = \frac{A_2(i)}{A_1(i)}                    (2.18)


where A_1(i) is the estimated first formant amplitude, A_2(i) is the estimated second formant

amplitude, and i is the current frame index (Mermelstein, 1977). The nasal ratio is then

smoothed by a fifth-order median filter. A nasal score, NS(i), within the continuous range

[0,1] is calculated for each frame by comparing the smoothed nasal ratio with two
thresholds. The score is given by

    NS(i) =
    \begin{cases}
        0, & \text{if } N(i) \ge T_{upper} \\
        1, & \text{if } N(i) < T_{lower} \\
        \dfrac{T_{upper} - N(i)}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le N(i) < T_{upper}
    \end{cases}                    (2.19)


where T_upper = 0.20 and T_lower = 0.05. The thresholds are determined empirically.

Additional processing is done by the application of pattern recognition rules to
distinguish nasals from other segment types. First, if a frame has a voice bar score greater

than 0.75, the nasal score for that frame is set to zero. This is because strong voice bars

typically exhibit strong nasal scores. The opposite, however, is not true. The nasal score








[Figure 2-11. McCandless formant tracker for the word "meat." a) Time-domain waveform;
b) Amplitudes of the first three formants; c) Center frequencies of the first three formants.]








is then set to zero if the frame has a zero sonorant score. This is done to prevent

non-sonorant frames from being classified as nasal. Finally, the nasal score for all frames

in any continuous nasal segment that is less than 25 ms in length is set to zero.
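A compact MATLAB sketch of the nasal score and the first two correction rules is given

below; the variable names are illustrative, and the final rule (suppressing nasal segments

shorter than 25 ms) would use the same run-length approach sketched earlier.

    function NS = nasal_score(A1, A2, VBS, SS)
    % Illustrative sketch of Equations 2.18-2.19 with post-processing (not the system's code).
    %   A1, A2 : estimated amplitudes of the first and second formants, per frame
    %   VBS    : voice bar scores;  SS : binary sonorant scores
        T_up = 0.20;  T_low = 0.05;
        N  = movmedian(A2 ./ A1, 5);                         % nasal ratio, smoothed (Eq. 2.18)
        NS = min(max((T_up - N) ./ (T_up - T_low), 0), 1);   % decreasing mapping of Eq. 2.19
        NS(VBS > 0.75) = 0;                                  % strong voice bars are not nasals
        NS(SS == 0)    = 0;                                  % non-sonorant frames are not nasals
    end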

Figure 2-12 shows nasal detection for the word "bond" spoken by a male speaker.

The two thresholds are shown as dashed lines in Figure 2-12b. Note that the nasal score

rises slowly after the transition from the vowel to the nasal. This shows that the

formant-amplitude-based algorithm is not as accurate as the other feature detection

algorithms that are based upon the short-term frequency response. Still, the algorithm is

able to correctly identify the nasal region.

2.4.7 Semivowel Detection

Semivowel detection is inspired by a method developed by Espy-Wilson (1986).

The algorithm deviates slightly from the standard detector, although it uses the volume

functions from Equation 2.1. A LFV function from Equation 2.1 is calculated with G = 1,

A = 1, and B = 20. The LFV is equivalent to a bandpass filter from 20 Hz to 391 Hz.

A HFV function from Equation 2.1 is calculated for G = 1, A = 21, and B = 50. The

HFV is equivalent to a bandpass filter from 410 Hz to 977 Hz. A murmur ratio, MUR(i),

is calculated for each frame as

    MUR(i) = \frac{LFV(i)}{HFV(i)}                    (2.20)

The murmur ratio is then smoothed with a fifth-order median filter. A murmur score,

MS(i), within the continuous range [0,1] is calculated for each frame by comparing the
smoothed murmur ratio with two thresholds. The score is given by


















[Figure 2-12. Nasal detection for the word "bond." a) Time-domain waveform;
b) Nasal ratio and thresholds; c) Nasal score.]












    MS(i) =
    \begin{cases}
        1, & \text{if } MUR(i) \ge T_{upper} \\
        0, & \text{if } MUR(i) < T_{lower} \\
        \dfrac{MUR(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le MUR(i) < T_{upper}
    \end{cases}                    (2.21)

where T_upper = 12 and T_lower = 4. The thresholds are determined empirically. The

semivowel score, SVS(i), is then calculated for each frame as

    SVS(i) = [1 - MS(i)] \, [1 - VBS(i)] \, VCS(i)                    (2.22)

where i is the frame index, VBS(i) is the voice bar score from Section 2.4.4, and VCS(i)

is the voiced consonant score from Section 2.4.3. The value of SVS is limited to the range

[0,1]. If SVS is greater than one, it is set to unity for the frame. Equation 2.22 shows that

if a frame has a high voiced consonant score, a low murmur score, and a low voice bar score,

it will have a high semivowel score.

Additional processing is done to smooth SVS. If the frame has a nasal score greater

than 0.5, SVS is set to zero for that frame. This is because some strong nasals get labeled

as semivowels. In addition, the semivowel scores for all of the frames in any continuous

semivowel segment less than 30 ms in duration are set to zero. This eliminates mislabeling

of short segments that are not semivowels.
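The score combination and its post-processing can be sketched in MATLAB as follows; the

per-frame score vectors are assumed inputs, and the 30 ms short-segment rule would again use

the run-length helper sketched earlier.

    function SVS = semivowel_score(MS, VBS, VCS, NS)
    % Illustrative sketch of Equation 2.22 with post-processing (not the system's code).
    %   MS  : murmur scores;            VBS : voice bar scores
    %   VCS : voiced consonant scores;  NS  : nasal scores
        SVS = (1 - MS) .* (1 - VBS) .* VCS;   % Equation 2.22
        SVS = min(SVS, 1);                    % limit the score to [0,1]
        SVS(NS > 0.5) = 0;                    % strong nasals are not semivowels
    end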

Figure 2-13 shows semivowel detection for the word "wield" spoken by a male

speaker. The algorithm correctly detects the leading /w/ as well as the /1/ near the end of

the word immediately preceding the release of the plosive /d/. Note that listening reveals

that the actual release in this example more closely resembles an unvoiced /t/ rather than

a voiced /d/. Also note that the post-processing does not detect any nasal segments nor any

semivowel segments less than 30 ms long. Therefore, the results for both before and after

post-processing are the same.

















[Figure 2-13. Semivowel detection for the word "wield." a) Time-domain waveform;
b) Semivowel score before post-processing; c) Semivowel score after post-processing.]








2.4.8 Voiced Fricative Detection


The voiced fricative detection algorithm deviates from the standard detector,

although it does calculate feature scores from fixed thresholds. The first step in voiced

fricative detection is to add preemphasis to the frequency response of the filter produced

by the LPC coefficients. In other studies, the typical preemphasis method is to calculate

the first difference of the sampled data waveform before calculating the LPC coefficients.

In this study the first difference is not calculated before the LPC analysis, so the first

difference function is approximated by a weighting function, W, in the frequency domain

given by


    W(e^{j\pi m/256}) = \frac{m}{256}, \qquad 0 \le m \le 255                    (2.23)

The magnitude of the weighting function's frequency response is within 3 dB of the

magnitude of the frequency response of a first-order differentiator for all frequencies in the

filter passband. The preemphasized frequency response for frame i, \hat{H}_i, is

    \hat{H}_i(e^{j\pi m/256}) = W(e^{j\pi m/256}) \, H_i(e^{j\pi m/256}), \qquad 0 \le m \le 255                    (2.24)

where H_i is calculated from Equation 2.2 for frame i with G = 1. The mean frequency

of the preemphasized frequency response, MF(i), is then found for each frame as

    MF(i) = \frac{1}{H_{total}(i)} \sum_{m=0}^{255} \left[ \left| \hat{H}_i(e^{j\pi m/256}) \right| \cdot \frac{m f_s}{512} \right]                    (2.25)


where f_s = 10 kHz, and i is the frame index. H_total(i) is given for frame i as

    H_{total}(i) = \sum_{m=0}^{255} \left| \hat{H}_i(e^{j\pi m/256}) \right|                    (2.26)








MF(i) is then smoothed by a third-order median filter. A high frequency score, HFS(i), is

calculated for each frame as

    HFS(i) =
    \begin{cases}
        1, & \text{if } MF(i) \ge T_{upper} \\
        0, & \text{if } MF(i) < T_{lower} \\
        \dfrac{MF(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le MF(i) < T_{upper}
    \end{cases}                    (2.27)


where T_upper = 3200 Hz and T_lower = 2400 Hz. The thresholds are determined empirically.

The voiced fricative score, VFS(i), is then calculated using HFS(i). If the frame

is voiced and the sonorant score is zero, then VFS(i) = 1 for the frame. This is done

because the frame is voiced and also has a relatively large amount of high frequency energy

(i.e. the frame is non-sonorant). If the frame is voiced and the sonorant score is 1, then

VFS(i) = HFS(i) for the frame. In this case, the voiced fricative score depends solely

upon the frame's high frequency energy distribution. The final step is to set VFS(i) to zero

for all frames in any voiced fricative segment less than 15 ms in duration. This is done to

eliminate false detection of short segments.
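Under the reconstruction of Equations 2.23 through 2.26 given above, the mean frequency of

a frame can be sketched in MATLAB as a preemphasized spectral centroid; the function name

is illustrative.

    function MF = mean_frequency(a)
    % Illustrative sketch of Equations 2.23-2.26 as reconstructed above
    % (not the system's actual code).
    %   a : LPC coefficients [a0 ... aN] for one frame
        fs = 10000;
        m  = 0:255;
        w  = pi * m / 256;
        H  = 1 ./ polyval(fliplr(a), exp(-1j*w));   % LPC spectrum with G = 1 (Equation 2.2)
        Hp = (m / 256) .* abs(H);                   % preemphasized magnitude (Eqs. 2.23-2.24)
        f  = m * fs / 512;                          % frequency in Hz at each spectral index
        MF = sum(Hp .* f) / sum(Hp);                % mean frequency (Eqs. 2.25-2.26)
    end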

Figure 2-14 shows the results of voiced fricative detection for the word "zoo"

spoken by a male speaker. Examination of the spectrogram (not shown) reveals that there

is little high frequency energy at the beginning of the /z/. This explains why the algorithm

does not classify the beginning of the /z/ as a voiced fricative.

2.4.9 Unvoiced Stop and Fricative Detection

If a frame is classified as unvoiced, it is either an unvoiced stop or an unvoiced

fricative. The algorithm used to distinguish between the two segment types differs from

the standard feature detector, and uses both time-based and frequency-based parameters.

First, the mean frequency is calculated for each frame from Equations 2.23 through 2.26.

The mean frequency track is smoothed by a third-order median filter. The high frequency
















[Figure 2-14. Voiced fricative detection for the word "zoo." a) Time-domain waveform;
b) High frequency score; c) Voiced fricative score.]








score is then calculated for each frame from Equation 2.27 with Tupper = 3800 and
Tlower = 2400. The base-ten logarithm of the power, Plog10, is calculated from the initial

LPC analysis results of each frame. Next, all adjacent unvoiced frames are grouped into

segments. For example, the word "sit" has two unvoiced segments, /s/ and /t/, and each

of these unvoiced segments is comprised of multiple, adjacent, unvoiced frames.

The slope of Plog10 over the initial twelve frames (60 ms) of each unvoiced segment
is examined. If the segment is shorter than twelve frames, all of the frames are used. A
first-order (straight-line) least-squares approximation of Plog10 is calculated for each
segment j using the MATLAB function "polyfit"; the slope of this fit is denoted Mseg(j). A
segment slope score, MSseg(j), is computed for each segment from Mseg(j) by

    MSseg(j) = 0,                                          if Mseg(j) >= Tupper
               1,                                          if Mseg(j) <= Tlower        (2.28)
               (Tupper - Mseg(j)) / (Tupper - Tlower),     if Tlower < Mseg(j) < Tupper


where Tupper = 1.0, Tlower = -1.0, and j is the index of the current segment. The "seg"

subscript is included to draw attention to the fact that the slope score is calculated as a single

value for the entire unvoiced segment. All of the frames in a given unvoiced segment are

assigned the same MSseg value. The frame slope score is denoted as MS(i). Thus,
MS(i) = MSseg(j) for each frame i in segment j.
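
The slope-score computation might be sketched in MATLAB as follows; Plog10, segFrames,
and MS are illustrative names, and the negative value used for Tlower follows the
reconstruction discussed above.

    % Sketch of Eq. 2.28 for one unvoiced segment; segFrames holds its frame indices.
    Tupper = 1.0;  Tlower = -1.0;                 % assumed values (see text)
    n = min(12, length(segFrames));               % at most the first twelve frames (60 ms)
    p = polyfit(1:n, Plog10(segFrames(1:n)), 1);  % least-squares straight-line fit
    Mseg  = p(1);                                 % slope of the fit
    MSseg = (Tupper - Mseg) / (Tupper - Tlower);  % linear region of Eq. 2.28
    MSseg = min(max(MSseg, 0), 1);                % 0 if Mseg >= Tupper, 1 if Mseg <= Tlower
    MS(segFrames) = MSseg;                        % every frame in the segment gets the score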

Calculation of the unvoiced stop score, USS(i), takes advantage of the fact that

unvoiced stops are inherently shorter in duration than unvoiced fricatives (Cole and

Cooper, 1975; Klatt, 1979; Umeda, 1977). The unvoiced stop score, USS(i), is given for

each frame by

    USS(i) = Ks * MS(i)                                                           (2.29)


where i is the current frame index, and Ks is given as









    Ks = Gs (1 + (Tstop - Lj) / Tstop),        if Lj < Tstop
         1,                                    if Tstop <= Lj <= Tfric             (2.30)
         [ 1 + (Lj - Tfric) / Tfric ]^(-1),    if Lj > Tfric

where Gs = 8.0, Tstop = 50 ms, Tfric = 80 ms, and Lj denotes the length of the

unvoiced segment j (in milliseconds). The two thresholds and the gain, Gs, are all
determined empirically. The term Ks acts as a duration-dependent scale factor that greatly
amplifies the stop score for unvoiced segments less than 50 ms long. To a lesser degree,
Ks also attenuates the stop score for unvoiced segments greater than 80 ms long. The
unvoiced stop score, USS(i), is then limited to the range [0,1]. If USS(i) is greater than
one for a given frame, it is set to unity for the frame.
The final unvoiced fricative score, UFS(i), is calculated for each frame by

UFS(i) = HFS(i) (2.31)

which is simply the high frequency score for the frame.
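
For illustration, the duration-dependent weighting and the two unvoiced scores might be
computed as below for a single unvoiced segment. Lj (the segment length in ms), MSval,
and HFSval are illustrative names, and the reciprocal form of the long-segment branch
follows the reconstruction of Equation 2.30.

    % Sketch of Eqs. 2.29-2.31 for one unvoiced segment.
    Gs = 8.0;  Tstop = 50;  Tfric = 80;           % empirical gain and duration thresholds (ms)
    if Lj < Tstop
        Ks = Gs * (1 + (Tstop - Lj) / Tstop);     % greatly amplify short segments
    elseif Lj <= Tfric
        Ks = 1;
    else
        Ks = 1 / (1 + (Lj - Tfric) / Tfric);      % mildly attenuate long segments
    end
    USS = min(Ks * MSval, 1);                     % unvoiced stop score, limited to [0, 1]
    UFS = HFSval;                                 % unvoiced fricative score (Eq. 2.31)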
Figure 2-15 shows both the unvoiced stop and the unvoiced fricative scores for the
word "pest" spoken by a male speaker. The /p/ is correctly detected as an unvoiced stop,
the /s/ is correctly detected as an unvoiced fricative, and the /t/ has both a high unvoiced
stop score and a high unvoiced fricative score. This is because of the high mean frequency
of the /t/. Note in Figure 2-15b that the /t/ is incorrectly split into two different segments,
which is an error caused by the V/U/S algorithm. Still, despite the V/U/S error, the /t/ has
a greater unvoiced stop score than unvoiced fricative score, which is desirable.


















Figure 2-15. Unvoiced fricative and stop detection for the word "pest."
a) Time-domain waveform;
b) Unvoiced stop score;
c) Unvoiced fricative score.








2.5 Speech Segmentation

The algorithms described in the previous section focus primarily upon the acoustic

features associated with individual frames. However, in order to segment and label the

speech into the segment categories defined in Section 2.1, the boundaries between the

phoneme segments must be determined. This is done with two algorithms that are

described in the following subsections. The results from the two algorithms are then

combined to determine the final segment boundaries and durations.

2.5.1 Spectral-Based Boundary Detection and Segmentation


The first segmentation algorithm is based upon changes in the short-term frequency

spectra of the speech signal. It uses an algorithm developed by Glass and Zue (1986) that

measures the similarity between a current frame and its neighbors. To do so, the absolute

value of the frequency response of the filter produced by the LPC coefficients from

Equations 2.2 and 2.3 is calculated for each frame. A Euclidean distance measure, D(x,y),
is defined as

    D(x,y) = Sum_{m=0}^{255} | |Hx(e^{jwm})| - |Hy(e^{jwm})| |                    (2.32)

where x is the current frame index, y is a past or future frame index, and Hx(e^{jwm}) is the
single-sided frequency response for frame x evaluated at the points wm = m*pi/256 for
0 <= m <= 255. From Glass and Zue (1986), the decision strategy is to associate the current

frame x with past frames if

    max{ D(x,y), x-4 <= y <= x-2 } < min{ D(x,v), x+2 <= v <= x+4 }               (2.33)


and to associate the current frame x with future frames if









    min{ D(x,y), x-4 <= y <= x-2 } > max{ D(x,v), x+2 <= v <= x+4 }               (2.34)


No association (a "don't care" state) is made if neither of these conditions is met. After

each frame is associated with one of the three states, a segment boundary is determined to

occur whenever the current frame's association changes from the past to the future. The

location of the boundary is at the first sample of the frame where the transition to the future

occurs. Post processing is also done to remove any boundaries that occur in the middle of

silent segments.
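
The association rule might be sketched as follows; Hmag is again assumed to hold the LPC
magnitude responses for all frames, and the boundary search shown handles only the
simple case of an immediate past-to-future transition.

    % Sketch of the Glass and Zue (1986) association rule (Eqs. 2.32-2.34).
    numFrames = size(Hmag, 2);
    D = @(x, y) sum(abs(Hmag(:, x) - Hmag(:, y)));      % distance measure of Eq. 2.32
    assoc = zeros(1, numFrames);                        % -1 = past, +1 = future, 0 = don't care
    for x = 5:numFrames-4
        past   = arrayfun(@(y) D(x, y), x-4:x-2);
        future = arrayfun(@(v) D(x, v), x+2:x+4);
        if max(past) < min(future)
            assoc(x) = -1;                              % associate with past frames (Eq. 2.33)
        elseif min(past) > max(future)
            assoc(x) = +1;                              % associate with future frames (Eq. 2.34)
        end
    end
    bndFrames = find(assoc(1:end-1) == -1 & assoc(2:end) == +1) + 1;  % past-to-future changes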

Figure 2-16 shows the spectral-based boundary detection results for the word

"wield" spoken by a male speaker. The algorithm marks boundaries at the beginning of

the /w/, at the beginning of the relatively stationary portion of the vowel /i/, at the end of

the transition from the vowel to the liquid /l/, and at the beginning of the release of the /d/.

While all of these points are clearly seen from Figure 2-16a as dividing lines between

different parts of the word, the locations may not always agree with the results obtained

by manual parsing. For example, in the transition from the /i/ to the /l/, manual parsing

might put the location of the transition at the middle of the second formant transition

region, instead of at the end of the transition. However, this lack of agreement between

automatic and manual segmentation results can often be attributed to the lack of a

universally accepted method to manually specify the "correct" transition point between

two phonemes.

2.5.2 V/U/S Boundary Detection

The second segmentation algorithm is based upon the voiced / unvoiced / silent

(V/U/S) feature detection algorithm results. The raw V/U/S results are processed using the

pattern recognition rules listed in Table 2-1. Boundaries are determined to occur wherever






















Figure 2-16. Spectral-based boundary detection for the word "wield."
a) Spectrogram;
b) Frame association;
c) Spectral boundaries.








transitions in the V/U/S track occur. The boundary is marked at the first sample of the first

frame of the new segment.
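
As a sketch, with vus a per-frame V/U/S label vector and frameStart(i) the first sample
number of frame i (both illustrative names):

    % Boundaries are placed wherever the V/U/S label changes from one frame to the next.
    changes = find(vus(2:end) ~= vus(1:end-1)) + 1;   % first frame of each new segment
    vusBoundaries = frameStart(changes);              % boundary sample numbers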

Figure 2-17 shows the V/U/S boundary detection results for the word "wield"

spoken by a male speaker. Note that, as discussed in Section 2.4.7, the release of the /d/ more

closely resembles a /t/, and is classified as unvoiced.

2.5.3 Final Segmentation

Results from both the spectral segmentation and the V/U/S segmentation

algorithms are used in the final segmentation process. All boundaries from the V/U/S

algorithm are marked as boundaries in the final result. Any boundary from the

spectral-based boundary detection algorithm that occurs in the middle of a voiced segment

(as determined from the V/U/S results) is also marked as a boundary in the final result,

provided that the boundary occurs at a frame located more than two frames away

from any V-U, U-V, V-S, or S-V boundary. This "two-frame rule" keeps the two

algorithms from marking the same phoneme boundary as two separate, but closely spaced

boundaries.
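
A sketch of the merge follows, assuming vusBndFrames and specBndFrames hold boundary
frame indices and vus the per-frame V/U/S labels (illustrative names).

    % Keep all V/U/S boundaries; keep a spectral boundary only if it falls in a voiced
    % region and is more than two frames from every V/U/S boundary (the two-frame rule).
    finalBnd = vusBndFrames;
    for k = 1:length(specBndFrames)
        f = specBndFrames(k);
        if vus(f) == 'V' && all(abs(f - vusBndFrames) > 2)
            finalBnd(end+1) = f;                      % accept this spectral boundary
        end
    end
    finalBnd = sort(finalBnd);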

Figure 2-18 shows the final segmentation results for the word "wield" as spoken

by a male speaker. Note that the boundaries in both Figure 2-18b and 2-18c are added

together to create the final segmentation results that are shown in a later figure. The two

spectral boundaries from Figure 2-16c at (approximately) sample numbers 3700 and

10,100 have been discarded in Figure 2-18c since they coincide with the V/U/S boundaries

and are eliminated by the two-frame rule.

The inclusion of the spectral segment boundaries creates a type of
"sub-segmentation" that further divides voiced segments into regions of smaller durations.

Examples of this are the semivowel-vowel and the vowel-semivowel transitions shown

in Figure 2-18. Here, the boundaries between the /w/ and the /i/, and the /i/ and the /l/,

divide the voiced region into three distinct parts.


















Figure 2-17. V/U/S boundary detection for the word "wield."
a) Time-domain waveform;
b) V/U/S classification;
c) V/U/S boundaries.










Figure 2-18. Final boundary detection for the word "wield."
a) Time-domain waveform;
b) V/U/S boundaries;
c) Spectral boundaries.








Spectral boundaries that occur in the middle of unvoiced segments are ignored.

This is done to lessen mistakes in subsequent labeling, since the specific token words that

are analyzed in this study do not contain double consonant patterns. Note, however, that

double consonant patterns regularly exist in the English language. Therefore, in future

work, the spectral boundaries that occur in unvoiced segments should be included in the

final segmentation results.


2.6 Segment Labeling

Segment labeling is defined as the task of correctly assigning a label from one of

the eight speech segment categories of Section 2.1 to each (unknown type) segment

produced by the final segmentation algorithm of Section 2.5.3.

The labeling algorithm first examines the V/U/S results for each segment. If the

segment is voiced, it can be labeled as either a vowel, semivowel, nasal, voice bar, or voiced

fricative. If the segment is unvoiced, it can be labeled as either an unvoiced stop or an

unvoiced fricative. If the segment is silent, it can only be labeled as silent. For each

segment, each of the possible feature scores is averaged over the duration of the segment.

For example, if the unknown segment is unvoiced, then USS and UFS will both be

averaged across the frames that comprise the segment. Since the segment in this example

is unvoiced, the average scores for VWLS, NS, SVS, VBS and VFS need not be calculated

for the segment. Likewise, if the segment is voiced, the average scores for VWLS, NS,

SVS, VBS and VFS are all calculated, and the averages for USS and UFS are not. The

average unvoiced stop score, USSmean, is given by

    USSmean(j) = (1 / (b - a + 1)) Sum_{i=a}^{b} USS(i)                           (2.35)








where a is the index of the starting frame in segment j, and b is the index of the final frame
in segment j. The averages for all of the other feature scores are calculated in the same
manner.

Once the average scores are calculated for all of the unknown segments, a first
choice label, L1(j), and a second choice label, L2(j), are selected for each segment j. The

first choice label for each segment is the feature with the highest mean score. The second
choice label for each segment is the feature with the second highest mean score. Reliability

scores are also calculated for L1(j) and L2(j). The reliability score, R1(j), for L1(j) is
defined as the mean score for the first choice label divided by the sum of all of the mean
scores for that segment. For example, if a segment is voiced and L1(j) is nasal, then R1(j)
is given by

    R1(j) = NSmean(j) / [ VWLSmean(j) + SVSmean(j) + NSmean(j) + VBSmean(j) + VFSmean(j) ]   (2.36)

Likewise, if the segment is unvoiced, and L1(j) is an unvoiced fricative, then R1(j) is given

by

    R1(j) = UFSmean(j) / [ UFSmean(j) + USSmean(j) ]                              (2.37)

The reliability score, R2(j), for L2(j) is defined as the mean score for the second choice

label divided by the sum of all the mean scores for that segment.
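
The selection of the two candidate labels might be sketched as follows for a voiced
segment spanning frames a through b; the struct scores and its field names are
illustrative assumptions, not the actual data structures used in this study.

    % Sketch of Eqs. 2.35-2.37 for a voiced segment.
    names = {'VWLS', 'SVS', 'NS', 'VBS', 'VFS'};       % candidate features for a voiced segment
    meanScore = zeros(1, length(names));
    for k = 1:length(names)
        v = scores.(names{k});                         % per-frame score track for this feature
        meanScore(k) = sum(v(a:b)) / (b - a + 1);      % segment average (Eq. 2.35)
    end
    [sorted, order] = sort(meanScore, 'descend');
    L1 = names{order(1)};  R1 = sorted(1) / sum(meanScore);   % first choice and its reliability
    L2 = names{order(2)};  R2 = sorted(2) / sum(meanScore);   % second choice and its reliability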

There are two different cases where R1(j) and R2(j) can be used to override L1(j).
In the first case, if for a given segment j, (1) the preceding segment is a vowel, and (2) the
current segment is not a vowel, and (3) the current segment has a mean vowel score,

VWLSmean, greater than 0.45, and (4) the second choice for the segment, L2(j), is a vowel,

and (5) the reliability of the second choice, R2(j), is greater than 0.35, then L1(j) is changed
to a vowel. This is done because the spectral sub-segmentation algorithm of Section 2.5.1
occasionally divides a single vowel into two or more parts. Often this is because the latter








part of the vowel is actually starting a transition to a following non-vowel and is exhibiting

coarticulation effects. The coarticulation effects may be strong enough to cause the final

portion of the vowel to be classified as a non-vowel. Therefore, if the final vowel segment

meets the conditions listed above, it can be assumed that coarticulation is taking place, and

that the segment is actually a vowel.

The second case where R1(j) and R2(j) are used to possibly override L1(j) is as
follows. First, this rule is invoked only if the preceding rule was not invoked for the current

frame. Then, if (1) the current segment is a vowel, and (2) the current segment has a mean

vowel score, VWLSmean, less than 0.50, and (3) the reliability of the second choice, R2(j),

is greater than 0.10, then L1(j) is changed to the second choice, L2(j). This is done because

the average vowel score is less than 0.5, which implies from Section 2.4.3 that the average

voiced consonant score (which is not calculated since it is not a segment category) is greater

than 0.5. This indicates that the segment is actually some type of voiced consonant.
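
The two override rules might be coded as below; prevIsVowel, VWLSmean, L1, L2, and R2 are
assumed outputs of the labeling step described above (illustrative names).

    % Sketch of the two label-override rules for segment j.
    overridden = false;
    if prevIsVowel && ~strcmp(L1, 'VWLS') && VWLSmean > 0.45 && strcmp(L2, 'VWLS') && R2 > 0.35
        L1 = 'VWLS';                  % rule 1: coarticulated tail of a preceding vowel
        overridden = true;
    end
    if ~overridden && strcmp(L1, 'VWLS') && VWLSmean < 0.50 && R2 > 0.10
        L1 = L2;                      % rule 2: weak vowel score, take the second choice
    end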

Figure 2-19 shows the final segmentation and labeling results for the word "wield"

spoken by a male speaker. The symbol "si" denotes silence, the symbol "SV" denotes

semivowel, the symbol "V" denotes vowel, and the symbol "US" denotes unvoiced stop.

The durations of the individual segments are shown as the number of samples in Figure 2-19b.

The starting sample number of each new segment is shown in Figure 2-19c.

2.7 Manual Modification of Automatic Segmentation and Labeling Results

All speech recognition algorithms make mistakes. The number and nature of the

mistakes depend upon many factors including the variability of human speakers, the choice

of recognition categories, and the recognition algorithms themselves. Since this study's

primary goal is to modify the time sequence of speech, and not to create a totally new speech

recognition scheme, the errors that occur in automatic segmentation and labeling are fixed

manually before the time modification algorithms are invoked.









Figure 2-19. Final segmentation and labeling for the word "wield."
a) Time-domain waveform;
b) Segment labels and segment durations (in number of samples);
c) Segment boundary points (boundary sample number).








This section describes both the nature of the errors and a set of software

programs with a graphical user interface (GUI) created to help the user manually edit and
correct the automatic segmentation and labeling results.

2.7.1 Description of Errors

A variety of different types of errors can occur. Segmentation errors result when

the algorithms pick either an incorrect number of segments, or incorrect locations for the

segment boundaries. A labeling error results when the algorithms pick the wrong label for

a segment. Figure 2-20 shows examples of these types of errors.

In Figure 2-20c, the beginning of the unvoiced fricative is incorrectly detected as

starting at sample number 4014. Examination of the time waveform shows that the actual

beginning of the unvoiced fricative is closer to sample number 3300. This type of error

typically occurs with weak, unvoiced fricatives such as /f/, since the energy level of the /f/

is not much greater than the energy level of the background noise. Strong, unvoiced

fricatives such as /s/ and /ʃ/ do not exhibit this problem.

A second type of error is seen in Figure 2-20b. The vowel is divided into four parts

by the spectral segmentation algorithm. While this is not a problem in itself, it creates the

possibility for labeling errors by requiring that the four parts of the same vowel be labeled

separately.

A third type of error is also seen in Figure 2-20b. An error occurs in the labeling

stage for the third part of the vowel. The third part is labeled as a nasal (N), instead of a

vowel (V). This occurs most often for long, voiced segments that are comprised of

nasalized vowels and/or word-final vowels.

Figure 2-21 shows a different type of error caused by the spectral segmentation
algorithm for the word "veal" spoken by a male speaker. In this case, the boundary between

the /v/ and the /i/ is not detected. As a result, the combined /v/-/i/ segment is classified as









Figure 2-20. Final segmentation and labeling for the word "foo."
a) Time-domain waveform;
b) Segment labels and segment durations (in number of samples);
c) Segment boundary points (boundary sample number).

















Figure 2-21. Spectral segmentation for the word "veal."
a) Spectrogram;
b) Frame association;
c) Spectral boundaries.








a vowel, since the vowel score has the greatest mean value for the segment. This typically

occurs for the weak voiced fricatives, namely /v/ and /th/.

Figure 2-22 also shows the segmentation and labeling results for the word "veal"

spoken by a male speaker. Note that the final portion of the word in Figure 2-22b is

classified as an unvoiced stop. Examination of the spectrogram in Figure 2-21a shows that

there is significant, unvoiced, low frequency noise present at the end of the word. Listening

reveals that the noise is caused by the speaker exhaling after completion of the word.

Therefore, while unexpected, an unvoiced segment does exist at the end of the word, and

it is correctly detected.

2.7.2 Software and GUI for Manual Modification

A set of programs has been created to provide the user with a convenient method
of displaying and modifying the results of the automatic segmentation and labeling

algorithms.

Figures 2-23 and 2-24 show the two primary windows of the graphical user

interface for the modification programs. The Main window is shown in Figure 2-23. This

window allows the user to save or discard any modifications, select what is displayed in

the three graphs of the Display window, and open the three sub-windows for modification

of the segmentation and labeling (S&L) results. The Display window is shown in Figure

2-24. The window is divided into three graphs. The top graph displays either the original

or modified version of the time-domain waveform. Note that the modified version is

created whenever a silent segment is inserted into the word via the Insert Silent Segment

sub-window. The middle graph displays one of six choices: the original or modified

time-domain waveform, the original or modified segment type and duration (T&D) results,

or the original or modified segment boundaries. The third graph displays the same choices

as the second graph (independent of the second graph). The user selects the results





Figure 2-22. Final segmentation and labeling for the word "veal."
a) Time-domain waveform;
b) Segment labels and segment durations (in number of samples);
c) Segment boundary points (boundary sample number).






























Figure 2-23. Main window for manual modification of segmentation and labeling results.


Figure 2-24. Display window for manual modification of segmentation and
labeling results.










displayed in each of the three graphs via the three push-button menus in the "Display"

section of the Main window.

The three sub-windows are shown in Figures 2-25 through 2-27. They are invoked

by the three push-buttons in the bottom left portion of the Main window. The Insert Silent

Segment sub-window is shown in Figure 2-25. This allows the user to insert a silent

segment of variable length at the beginning of any of the segment boundaries. The silent

segment cannot be inserted into the middle of a segment. Note that this feature does not

correct the automatic S&L results, but rather gives the user additional flexibility in creating

test tokens. The Move Segment Boundaries sub-window is shown in Figure 2-26. This

allows the user to move any or all of the boundaries between the segments. The Fix Labels

sub-window allows the user to change any or all of the labels for the segments, and is shown

in Figure 2-27. In all three of the sub-windows, once a parameter is modified, an "Update"

button and a "Cancel" button appear (not shown). These allow the user to either select the

desired parameter(s) and modify the S&L results, or discard the parameter(s) without

modifying the S&L results and start over, if desired. The "OK" button closes the window.

The Insert Silent Segment sub-window is shown in Figure 2-25. The length of the

silent segment is controlled by the top slider, and the slider position is automatically

rounded to the closest 5 ms, since this is the frame length of a silent frame in the original

analysis. The user can adjust the length of the silent segment by moving the bar in the center

of the slider with a mouse, or by clicking on the left or right arrow at either end of the slider.

The rounded duration is displayed in the small box to the left of the slider. The insertion

point of the beginning of the silent segment is controlled by the bottom slider. The user

can adjust the insertion point by moving the bar in the center of the slider with a mouse,

or by clicking on the left or right arrow at either end of the slider. The insertion point and

the corresponding slider position are automatically rounded to the beginning sample

number of the closest segment boundary, and the starting sample number is displayed in

the small box to the left of the slider.
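
The rounding behavior of the two sliders can be summarized by the following sketch; the
slider handles and boundarySamples are illustrative names, not the actual GUI code.

    % Snap the duration to the nearest 5-ms (silent-frame) multiple and the insertion
    % point to the starting sample of the nearest segment boundary.
    durMs    = 5 * round(get(durSlider, 'Value') / 5);
    [~, k]   = min(abs(boundarySamples - get(posSlider, 'Value')));
    insertPt = boundarySamples(k);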























Figure 2-25. Insert Silent Segment sub-window for manual modification of
segmentation and labeling results.


Figure 2-26. Move Segment Boundaries sub-window for manual modification of
segmentation and labeling results.


Figure 2-27. Fix Labels sub-window for manual modification of segmentation and
labeling results.










The Move Segment Boundaries sub-window is shown in Figure 2-26. The total

number of sliders displayed depends upon the number of boundaries in the given word.

There is one slider for each segment boundary. The two segments that surround each

boundary are listed to the far left of the slider. The boundary position is adjusted by moving

the bar in the center of the slider with a mouse, or by clicking on the left or right arrow at

either end of the slider. The slider position is automatically rounded to the starting point

of the nearest frame, and the starting point is displayed in the small box to the left of the

slider. The boundary for each segment is automatically limited to be greater than the

preceding segment boundary, and less than the following segment boundary.

The Fix Labels sub-window is shown in Figure 2-27. The total number of

push-button menus that are displayed depends upon the number of segments in the given

word. There is one push-button menu for each segment. The current label for each segment

is displayed in the push-button. The label can be changed by pushing the push-button and

selecting a new label from the pop-up menu.

An additional feature of the software is the ability to combine or merge like,

adjacent segments into one larger segment of the same type. This is done by the "Merge

Seg's" button in the bottom right corer of the Main window. This could be used, for

example, to combine the two adjacent vowel segments in Figure 2-20. Note that once this

is done, the number of sliders and push-buttons in the sub-windows changes (since the total
number of segments changes), and the sub-windows are updated and redrawn automatically.

The user can use the Display window to compare the "before" and "after" results
while editing the S&L parameters. At any point, the user can discard all of the edits and

start over by pushing the "Discard Changes" button in the Main window. After editing is

finished, the user can save the new S&L results under a different name by selecting the

"Save Changes" push-button. When this button is pushed, a pop-up field appears in the

Main window (not shown) in which the user can type the new name under which the edited







parameters will be saved. This allows the original, unmodified S&L results to be saved

for further reference and editing, if desired.













CHAPTER 3
TIME MODIFICATION ALGORITHMS AND USER INTERFACE

The time modification system in this study allows the user to selectively modify the

durations of the segments that comprise the speech signal. The previous chapter describes
the portion of the system that detects and labels the speech segments, and this chapter

describes the portion of the system that modifies the segment durations and synthesizes the

resulting time-modified speech.

The time modification system is controlled by user-specified parameters that allow
precise control over modification of the speech signal. For each segment, the user can

specify the time scale factor, the minimum desired duration, and the mapping method that

determines the portion of the segment to be altered. The parameter specification is

accomplished via a graphical user interface (GUI) program that is written in MATLAB and

runs on a Sun Microsystems workstation. Once the parameters are specified,
the time-modified speech is synthesized by an LPC speech synthesizer.

This chapter is organized as follows: First, the LPC speech synthesizer is briefly

described. Next, an introduction of the basic method used to modify the speech is given,

along with several examples. The user-specified parameters are then discussed in detail.
The mapping method is also discussed. The time modification algorithm is then presented.

The method used to prevent "glitches" during the synthesis process is described, and the

GUI controls are presented in detail.

3.1 The Linear Prediction Coding (LPC) Speech Synthesizer

From Markel and Gray (1976), the LPC speech synthesizer is comprised of an

all-pole filter described in the z-domain by









    Hi(z) = 1 / Ai(z)                                                             (3.1)


where Ai(z) for frame i of the speech signal is given by

    Ai(z) = a0 + a1 z^(-1) + a2 z^(-2) + ... + aN z^(-N),    a0 = 1               (3.2)

The vector Ai(z) is calculated and stored for each analysis frame. The value N = 13
adequately models the human speech production system (Hu, 1993).
The error signal obtained during the LPC analysis is termed the residue signal, r(n).
The residue signal is used to excite the all-pole filter during synthesis. Let Ri(n) be defined

as the portion of the residue signal obtained during the analysis of frame i. For example,
if frame 2 begins at sample number 101 and ends at sample number 150, then R2(n) is the
1 by 50 vector given by

R2(n) = [ r(101), r(102), r(103), ... r(148), r(149), r(150)] (3.3)

The input to the LPC speech synthesizer for frame i can then be described by the
ordered pair (Ai, Ri), where Ai is a 1 by N vector of LPC coefficients, Ri is a 1 by Mi vector

of the residue signal, and Mi is the length (in number of samples) of frame i. Note that Mi

is not constant in a pitch-synchronous synthesizer, since the frame size usually changes
from pitch period to pitch period.
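
A minimal sketch of the frame-by-frame synthesis is given below; A and R are assumed cell
arrays holding the coefficient vectors [1, a1, ..., aN] and the (row-vector) residue
segments, and the filter state is carried across frames. The actual synthesizer is
pitch-synchronous and includes the glitch-prevention steps described later in this chapter.

    % Sketch of LPC synthesis driven by the ordered pairs (Ai, Ri).
    numFrames = length(R);
    state = zeros(length(A{1}) - 1, 1);               % filter memory, carried across frames
    out   = [];
    for i = 1:numFrames
        [y, state] = filter(1, A{i}, R{i}, state);    % all-pole synthesis, 1/Ai(z) (Eq. 3.1)
        out = [out, y];                               % append the synthesized frame
    end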

3.2 Time Modification Basics-Frame Skipping and Frame Doubling

The basic time modification method used in this study involves either elimination
(for compression) or doubling (for expansion) of the (Ai, Ri) ordered pairs before they are

sent to the LPC synthesizer. This is best illustrated by the following examples. The
examples are also depicted in Figure 3-1. To synthesize the original speech token without
time modification, the ordered pairs (Ai, Ri) are sent to the LPC speech synthesizer for









Figure 3-1. Examples of time modification using an LPC speech synthesizer.
a) Normal rate: no time modification;
b) Fast rate: approximately one-half the original duration;
c) Slow rate: twice the original duration.








i = {1, 2, 3, ..., L-2, L-1, L}, where L is the total number of frames in the original

speech signal. As a result, each frame is synthesized once. This is depicted in Figure 3-la.

To synthesize the token at approximately twice the original speaking rate (one-half the

original duration) the ordered pairs (Ai, Ri) are sent to the synthesizer for
i = {1, 3, 5, ..., L-4, L-2, L}. This method skips every other frame during the

synthesis process. This is depicted in Figure 3-lb. The term approximately is used since

the pitch period (and therefore duration) of each frame is not constant. To synthesize the

token at one-half the original speaking rate (twice the original duration) the ordered pairs

(Ai, Ri), where i = {1, 1, 2, 2, 3, 3, ..., L-2, L-2, L-1, L-1, L, L}, are sent to

the synthesizer. This method synthesizes every frame twice, and is depicted in Figure 3-1c.

The three resulting speech tokens created by these methods for the word "meat" are shown

in Figure 3-2. Note that in each of these example tokens, the silent segment preceding the

unmodified word (from sample number 0 to sample number 3300) is not modified. This

is done only for demonstration purposes to preserve the alignment of the beginnings of the

three synthesized speech tokens in the three graphs.
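
The three examples correspond to the following frame-index sequences (a sketch;
numFrames is an assumed name for the total frame count).

    % Frame-index sequences for the three examples of Figure 3-1; the synthesizer is then
    % driven by the ordered pairs (A{idx(k)}, R{idx(k)}) for k = 1, 2, ...
    L = numFrames;
    idxNormal = 1:L;                                  % every frame once: original duration
    idxFast   = 1:2:L;                                % every other frame: about half the duration
    idxSlow   = reshape([1:L; 1:L], 1, []);           % every frame twice: twice the duration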

Although these examples are simple, they demonstrate the basic method of time
modification of speech used in this study. This method involves the manipulation of the

sequence of (Ai, Ri) ordered pairs used as inputs by the LPC speech synthesizer. Note

however, that these examples do not demonstrate selective time modification, since they

control the information that is removed or doubled in a trivial manner. Selective time

modification is accomplished by exercising greater control (than the previous examples)

over the ordered pairs used for synthesis. In this study, the selection of the ordered pairs

is controlled by multiple user-specified parameters that reflect the phonemic content of

the speech token.