AN ACOUSTIC MODEL OF THE EMOTIONS PERCEIVABLE FROM THE SUPRASEGMENTAL CUES IN SPEECH

By
SONA PATEL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2009
© 2009 Sona Patel
To my parents, whose personal sacrifices have allowed me to be where I am
And to my grandparents
ACKNOWLEDGMENTS

I would like to thank my family, friends, and colleagues who have made these past years memorable for me. I would like to thank Teresa for her company whenever I needed a break and for sharing her home with me. Supraja and Stacie were also very helpful with various aspects of my data analysis and were great company whenever I needed to eat. I would also like to thank Dr. Bonnie Johnson for sharing her positive energy with me. Whether she shared her excitement in my research or her unbelievably funny stories, chats with Bonnie always left a smile on my face.

I would like to thank my committee members, Dr. Mini Shrivastav, Dr. James Harnsberger, Dr. John Rosenbek, and Dr. John Harris, for their contribution to various aspects of this work. They were always willing to take the time to discuss any issue I encountered along the way.

I would especially like to thank Dr. Rahul Shrivastav for his mentorship during the course of my program. His endless patience with my never-ending list of questions, constructive criticism of my writing, and constant encouragement have helped me to develop the necessary skills to be a successful researcher. In addition, his willingness to share his experiences has prepared me for a career in academia beyond the knowledge attainable through a doctoral curriculum. He is an amazing scientist and a wonderful person, and I am grateful to have had the opportunity to complete my graduate degrees under his guidance.

And finally, I would like to thank my husband Akshay, for his company during all the weekends I stayed home working, for all the nights he cooked dinner, for handling my emotions, for his interest in my research, and for being incredibly supportive throughout the course of these five years.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 GENERAL INTRODUCTION
    Expression and Perception of Emotions in Speech
        What are Emotions?
        The Speaker-Listener Model
        Components of Emotion Communication
        Emotions in Speech
    Significance
    Overview of Dissertation

2 EMOTION PERCEPTION FROM SUPRASEGMENTAL INFORMATION
    Background
        Emotion Selection
        Speech Sample Selection
        Task Selection
        Limitations and Overview
    Experiment 1: Hierarchical Clustering of Emotion Terms
        Participants
        Emotion Terms
        Sorting Procedures
        Results
    Experiment 2: Discriminating Emotions from Suprasegmental Cues
        Stimuli
        Listeners
        Procedures
        Signal Detection Analysis
        Listener Reliability
        Absolute Discrimination Threshold
        Clustering Results
    Discussion

3 COMPUTING THE EMOTION DIMENSIONS
    Background
        Multidimensional Nature of Emotionality
        Dimensions of Vocal Emotion Expression
        Overview
    Methods
        Speech Stimuli
        Listeners
        Discrimination Task Procedures
        Signal Detection Analysis
        MDS Analysis
    Results
        Listener Reliability
        ALSCAL Analysis
            Dimensionality
            Perceptual space
            Comparing MDS and HCS results
        Speaker Analysis
            Clustering results
            MDS analysis
    Discussion

4 INDIVIDUAL AND GENDER DIFFERENCES IN LISTENER DIMENSIONS
    Background
        Individual Differences
        Gender Differences
        Overview
    Methods
        Speech Stimuli
        Listeners
        Discrimination Task Procedures
        Signal Detection Analysis
        INDSCAL Analysis
    Results
        Listener Reliability
        INDSCAL Dimensionality
        INDSCAL Group Space
        INDSCAL Subject Weight Space
        Analysis of Gender Differences
    Discussion

5 A MODEL OF EMOTION RECOGNITION BASED ON SUPRASEGMENTAL CUES
    Background
        Acoustic Parameters
            Fundamental frequency
            Intensity
            Voice quality
            Duration
        Classification of Emotion Categories
        Acoustic Correlates of Emotion Categories
        Acoustic Correlates of Emotion Dimensions
        Limitations and Overview
    Experiment 1: Development of an Acoustic Model of Emotion Recognition
        Speech Stimuli
        Measurement of Acoustic Features
            Fundamental frequency
            Intensity
            Duration
            Voice quality
        Preprocessing
        Feature Selection
        Model Classification Procedures
        Two-Dimensional Perceptual Model
        Perceptual Experiment
        Perceptual Data Analysis
        Results
            Perceptual test results
            Acoustic measures
            Reliability of hand measurements
            Dimension models
            Linearity analysis
            Model predictions
    Experiment 2: Evaluating the Model
        Participants
        Stimuli
        Identification Task Procedures
        Classification Procedures
        Results
            Perceptual test results
            Acoustic measures
            Reliability of hand measurements
            Classification of test2 stimuli
    Discussion

6 GENERAL CONCLUSIONS
    Summary of Findings
    Limitations and Future Directions

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

Table

2-1 List of 70 emotion terms used in Experiment 1
2-2 Nineteen clusters of emotion terms formed using hierarchical clustering scheme analysis in Experiment 1
2-3 Matrix of d′ values for each emotion pair
2-4 The number of clusters formed at the d′ values for a series of percent correct scores
3-1 Stimulus coordinates of the 19 emotions in the three-dimensional perceptual space
3-2 Stimulus coordinates of all listener judgments of the 19 emotions arranged in ascending order for each dimension
3-3 The d′ values between each pair of emotions for the five emotion clusters discriminable above the d′ cutoff of 2.73 are shown
4-1 Stimulus coordinates of the four-dimensional INDSCAL group space
4-2 Listener weights on each of the four dimensions of the group space
5-1 Perceptual and algorithm classification accuracy of emotions perceived from the suprasegmental information in speech
5-2 Initial feature sets used in experiments to classify emotions
5-3 Feature sets corresponding to the emotion dimensions
5-4 List of acoustic features analyzed
5-5 Matrix of d′ values for 11 emotions submitted for multidimensional scaling analysis
5-6 Stimulus coordinates of all listener judgments of the 19 emotions arranged in ascending order for each dimension
5-7 Perceptual accuracy for the training set based on all sentences and two exclusionary criteria
5-8 Perceptual accuracy for the test1 set based on all sentences and two exclusionary criteria
5-9 Raw acoustic measurements for the test1 set
5-10 Reliability analysis of manual acoustic measurements, and vowel-to-consonant ratio (VCR) for test1 set by Author and Judge 2
5-11 Regression equations for multiple perceptual models using the training and test1 sets
5-12 Correlation matrix of all acoustic variables submitted to the stepwise regression analysis using the training set samples
5-13 Classification accuracy for the full training set and a reduced set based on an exclusion criterion
5-14 Classification accuracy for the test1 set using the Overall training acoustic model and the Overall test1 acoustic model
5-15 Classification accuracy of the test1 set by two training and four test1 models
5-16 Perceptual accuracy for the test2 set based on all sentences and two exclusionary criteria
5-17 Reliability analysis of manual acoustic measurements for the test set
5-18 Classification accuracy of the Overall training model for the test2 set samples using the k-means algorithm
5-19 Classification accuracy of the Overall training model for the test2 set samples using the kNN algorithm for two values of k
LIST OF FIGURES

Figure

1-1 Factors involved in emotion expression
2-1 Histogram of the number of categories of emotion terms formed by males and females on average
2-2 Scatterplot of the percent correct scores corresponding to the d′ values for each emotion pair
2-3 Dendrogram of HCS analysis using 19 emotion categories
3-1 R-squared and stress measures as a function of the number of dimensions included in the MDS solution
3-2 Three-dimensional ALSCAL solution
3-3 Two-dimensional views of the emotion categories perceivable in SS using each pair of dimensions
3-4 Histogram of the d′ values for each speaker for the five emotion clusters discriminable above the d′ cutoff of 2.73
3-5 Scatterplot of d′ values for Speaker 1 and Speaker 2 for the five emotion clusters discriminable above the d′ cutoff of 2.73
3-6 Hierarchical clustering of each speaker for the five emotion clusters discriminable above the d′ cutoff of 2.73
3-7 R-square and stress curves as a function of the number of dimensions in the MDS solution for both speakers for the five emotion clusters
3-7 Two-dimensional stimulus spaces for emotions greater than the discrimination threshold for Speakers 1 and 2
4-1 R-square and stress values as a function of the number of dimensions in the INDSCAL solution
4-2 Group space. Stimulus configuration for the 4D INDSCAL solution is shown
4-3 Subject weight space
4-4 The spread of the angles formed by the individual vectors and the x-axis are shown in corresponding histograms
4-5 Histogram of the distribution of weirdness indices
4-6 Hierarchical structure of emotion similarity for males and females
5-1 Acoustic measurements of pnorMIN and pnorMAX from the f0 contour
5-2 Acoustic measurement of gtrend from the f0 contour
5-3 Acoustic measurement of normnpks from the f0 contour
5-4 Acoustic measurements of mpkrise and mpkfall from the f0 contour
5-5 Acoustic measurements of iNmin and iNmax from the f0 contour
5-6 Acoustic measurements of attack and dutycyc from the f0 contour
5-7 Acoustic measurements of srtrend from the f0 contour
5-8 Acoustic measurements of m_LTAS from the f0 contour
5-9 R-squared and stress measures as a function of the number of dimensions included in the MDS solution for 11 emotions
5-10 Eleven emotions in a 2D stimulus space according to the perceptual MDS model
5-11 Standardized predicted acoustic values for Speaker 1 and Speaker 2 and perceived MDS values for the training set according to the Overall perceptual model
5-12 Standardized predicted and perceived values according to individual speaker models
5-13 Standardized predicted and perceived values according to the Overall test1 model
5-14 Standardized predicted values according to the test1 set and perceived values according to the Overall training set model
5-15 Standardized acoustic values as a function of the perceived D1 values based on the Overall training set model
5-16 Standardized acoustic values as a function of the perceived Dimension 2 values based on the Overall training set model
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

AN ACOUSTIC MODEL OF THE EMOTIONS PERCEIVABLE FROM THE SUPRASEGMENTAL CUES IN SPEECH

By
Sona Patel
August 2009

Chair: Rahul Shrivastav
Major: Communication Sciences and Disorders

The communication of emotions in speech is a critical part of human communication, and yet, very little is understood about this aspect of speech. The number of emotions that can be perceived in speech may vary according to the amount of information within the signal. For instance, it may be difficult to express the sentence, "I won the state lottery" in an angry way. The semantic information of the sentence may suggest to the listener that the speaker is happy. On the other hand, the statement, "It's going to rain" may be expressed in a sad or happy way depending on the speaker's suprasegmental variations. While the overall goal of this research is to understand how emotions are expressed and perceived in speech, the experiments performed as part of this dissertation sought to understand how emotions are perceived and acoustically represented in speech based on only the suprasegmental information.

To this end, a number of perceptual experiments were performed to understand how listeners perceive emotions in speech from the suprasegmental information. These experiments sought to identify the number of emotions that could be perceived from speech and the perceptual properties of these emotions when listeners were given only the suprasegmental cues. In addition, an experiment was performed to verify that individual or gender differences in perceiving these emotions were small. The results of these experiments were used to develop a preliminary acoustic model of emotion perception from the suprasegmental information in speech.

This acoustic model was unique in a number of ways. First, the model was developed based on perceptual judgments obtained from a discrimination task. While an identification task may be a better representation of everyday listener judgments, this task may be less than optimal for model development because it requires listeners to assign terms to emotions. It is unclear whether listeners make this decision by comparing the actual expression against their impression of a prototypical expression for each emotion term. The use of a discrimination task avoids this potential bias. Furthermore, the use of catch trials in a discrimination task easily lends itself to more sophisticated measures of accuracy using the Theory of Signal Detection. Second, the model proposed here does not require comparison to a speaker's baseline expression (i.e., a neutral emotion). This approach is more practical from an automatic emotion recognition perspective, as a speaker's emotion may be predicted without additional time (for example, useful in data mining) and without any a priori knowledge about the speaker (for example, as needed in military applications).

The validity of the proposed acoustic model was tested using novel sentences from both the speakers used for model development and novel speakers. Accuracy of the model was tested by classifying all samples into the four emotion categories that could be perceived from the suprasegmental information in speech. Results showed that the model performance for the angry category was equivalent to averaged human judgments. Performance for the sad category was moderate, but did not equal listener accuracy.
CHAPTER 1
GENERAL INTRODUCTION

Expression and Perception of Emotions in Speech

The most common method of recognizing someone's emotions may be through his or her tone-of-voice. And yet, for years the role of emotion communication in speech was all but ignored. For example, speech communication technologies sought to recognize the lexical information from speech and synthesize speech intelligibly such that listeners could comprehend the speech with high accuracy. Similarly, disorders of emotion communication were frequently overlooked until the past 2 to 3 decades (e.g., Ross, 1981; Stringer, 1996; Rosenbek et al., 2004; Myers, 1999) because this aspect of communication was seen as insignificant compared to linguistic communication. Even today the right hemisphere, the cerebral hemisphere responsible for emotion communication, is often referred to as the nondominant hemisphere.

However, some scientists argue that the function of emotional information may be much more than supplementary. Current research suggests that emotions may play an important role in learning, memory, and social interactions (LeDoux, 1998; Keltner & Kring, 1998). In fact, the ability to express and perceive emotions from speech may not just affect the quality of social interactions, but in addition, it may affect an individual's ability to communicate in an intelligent way (Tato, 2002). Hence, many technical and clinical applications may benefit from better knowledge of how listeners recognize the emotional state of a speaker in conversational speech. The following subsections provide an overview of emotions, the speech communication process, the components of emotion communication, and emotions in speech.

What are Emotions?

The term "emotion" is commonly used to capture a wide range of experiences from long-term affective states such as personality, attitudes, and moods to short-term affective states such as feelings and full-blown or emergent emotions (Scherer, 1999). Some researchers have defined emotions as brief and highly distinct episodes (Cowie & Cornelius, 2003) and relatively intense affect that are short in duration (Scherer, 2003). Others such as Plutchik (1993, p. 59) have suggested that emotions "have adaptive functions for the individual, need to be inferred from various sources of evidence, are based on specific cognitions, and reveal something of an individual's attitudes and motivations." Oatley and Johnson-Laird (1987) stated that emotions are "communications to oneself and others." The definition of an emotion seems to vary depending on the aspect (e.g., cognitive, physiologic, expressive, etc.) of this multi-faceted phenomenon under investigation. As a result, a large number of emotions are studied and a variety of labels are used to represent those emotions (Section 2.1.1 provides an in-depth discussion of this issue). In the research presented within this manuscript, the term "emotion" is used to refer to a subjective experience that can be externalized in speech.

The Speaker-Listener Model

An understanding of emotion communication in speech begins with the speech communication process. The communication of speech involves the speaker, the transmitted signal, and the listener. The speaker encodes his emotion as part of the acoustic speech signal. For this reason, the speaker is often referred to as the encoder. The speech signal contains acoustic variations that convey linguistic and affective information. The listener decodes the message by attending to the pertinent information and ignoring irrelevant parts. Many studies of emotion communication use both speaker- and listener-centered approaches to study how a speaker's emotions are perceived from the speech signal. This involves a comparison of the speaker's message (i.e., the acoustic signal) to the listener's perception of the message (i.e., perceptual judgments).
In othe r words, an understanding of em otion communication involves the
17 processes of determining and extracting the rele vant acoustic characteristics of emotions in addition to obtaining reliable perceptual judgments of emotions from the same speech samples. Components of Emotion Communication Emotions may be conveyed through different means such as the body, the face, and the voice. It is possible that no si ngle component alone can be used to communicate the myriad of emotions described in the English language. More than likely, different emotions are perceived and expressed through different mechanisms. For example, the bodys autonomic nervous system may best communicate the presence of a fearful stimulus through sweating or an increased heart rate (Kemper, 1987; Cacioppo, Klein, Berntson, & Hatfield, 1993). On the other hand, happiness may be best communicated using the face (Ekman, 1992). In other words, each component individually may be capable of provi ding cues for only a limited set of the total number of emotions that can be communicated. Figure 1-1 depicts the components involved in emotion expression. Emotions can be recognized from the visual or vocal expression. The visual stimulus involves physical expression from the face, the rest of the body, or both the face and body. Physical forms of expression may be voluntary or involuntary. On the other ha nd, the vocal expression is mainly voluntary, although involuntary responses such as scream ing or vocal tremor may occur. The vocal expression can be divided into two parts, the se gmental and suprasegmental components. Each of these components can be used to communicate di fferent types of information. The segmental component can be used to convey syntactic a nd semantic information. Syntactic information includes the sequencing of words to form senten ces, whereas the semantic information is the meaning of these segments of speech. 
The supr asegmental component of expression includes features such as stress, duration, loudness, and pitch that can be used to communicate linguistic
18 information such as sentence type, and speaker ch aracteristics such as em otion, age, and gender. Accordingly, the suprasegmental info rmation is also referred to as the tone-of-voice. Emotions in Speech Listeners use both the segmental and suprasegme ntal information to recognize a speakers emotion. However, many researchers claim that th e primary method of expressing emotions is by varying the suprasegmental features of sp eech (e.g. Duffy, 1995, p. 63). The perception of suprasegmental changes are a result of acoustic variations in speech. Therefore, emotions perceived through suprasegmental changes can be studied through acous tic analysis of the speech signal. This research is commonly referred to as emotions in speech, where the term speech has been loosely used to indicate the supraseg mental aspect of speec h. In this dissertation, the abbreviation SS will be used to refer to the inform ation communicated by the suprasegmental aspect of speech. Suprasegmental variations can be objectively measured through a number of prosodic parameters such as mean speaking rate, mean intensity, mean fundamental frequency, mean segment duration, spectral slope, etc. A relative ly large body of research exists in favor of emotion-specific acoustic patterns (e.g. Banse & Scherer, 1996). Rec ognition accuracy of emotional speech tends to average around 60% (S cherer, 2003); however, accuracy for some emotions is generally higher than others. For example, in the study by Banse and Scherer (1996), hot anger was identified with 82% accuracy as opposed to the 15% accuracy of disgust. Recognition accuracy depends on a number of factors including the emotions tested, the number of emotions being judged, the type of speech used, and the speakers level of experience or training. In addition, low recogniti on accuracy may indicate that listeners cannot perceive the emotion in SS
Significance

Understanding how emotions are expressed and perceived in speech is important for a number of applications. For example, this knowledge can help improve automatic speech recognition (ASR) technology and advance the development of speech understanding systems by providing them with the ability to extract emotional meaning instead of lexical information alone. This knowledge can also be used to generate computational models of stress and anxiety, which in turn have the potential to result in several practical applications such as stress detection in various settings. These include applications such as prioritization of calls placed to doctors, hospitals, and other emergency numbers such as 911, call centers, etc. Governmental agencies such as the FBI and military would benefit from knowing the stress level in persons under investigation. Text-to-speech software that can communicate emotional prosody instead of a monotonic prosodic contour can considerably improve the quality of alternative and augmentative communication devices.

Knowledge of the emotion communication process in speech can also be useful in several clinical applications including rehabilitation of patients who have difficulty expressing or perceiving vocally expressed emotions. A better understanding of the perceptual characteristics of emotional expressions can help clinicians to provide a theoretical basis for the selection of appropriate therapy goals. In addition, this information can help equip clinicians with more reliable devices that can be used to diagnose patients and measure treatment outcome as well. These, in turn, can have a significant impact on the success of various rehabilitative efforts for patients with conditions such as Aprosodia, Schizophrenia, Parkinson's Disease, Huntington's Disease, and Autism Spectrum Disorders.

Hence, the main objective of this dissertation was to further our understanding of emotion perception in SS. While both suprasegmental and segmental information are likely to play a role in cuing emotions, the initial focus was to understand how suprasegmental factors influence the perception of emotions. This information is of particular interest because emotional prosody is often the primary target of treatment in conditions where vocal emotion expression is impaired. Nevertheless, future experiments will also evaluate how emotion communication may be affected by the semantic content of speech.

This dissertation was intended to contribute to the existing literature on emotion perception in speech in a number of ways. The first contribution was to establish a set of emotions that are basic in terms of perception from SS. This was essential since perceptual judgments are the gold standard for all research on emotion perception. Previous literature has studied the set of emotions that are considered basic or are universally perceived from facial expressions or biological responses and not specifically those emotions that may be perceived in speech. In addition, selection of emotions that are perceptually indiscriminable in speech can adversely affect the results of any experimental work on emotion perception. Therefore, knowledge of the emotions that can be perceived in SS forms the foundation of further perceptual and acoustic experimentation on emotions in SS.

Next, this dissertation sought to determine the perceptual mechanisms used to discriminate these emotions in SS, whether these mechanisms differ across individuals, and the distribution of these emotions in a multidimensional space. This has been addressed by other researchers with the semantic differential task and factor analysis. However, the dimensionality of the perceptual space reported in those experiments ranges from two to six dimensions. Instead, a multidimensional scaling approach was used in the present study to obtain estimations of the relationship between emotions in a multidimensional space. The individual differences in perceptual strategies were also examined to establish the validity of averaging acoustic parameters of expressions across listeners to determine the acoustic correlates to the set of emotions that can be perceived in SS.
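As an illustration of how a spatial configuration can be recovered from pairwise dissimilarity judgments, the following minimal sketch implements classical (Torgerson) metric scaling. It is not the scaling procedure used in this dissertation, which is not specified in this section. The function name, the item labels, and the dissimilarity values are all hypothetical and were invented for the example.

```python
import numpy as np


def classical_mds(dissimilarity, n_dims=2):
    """Embed items in n_dims dimensions from a symmetric dissimilarity matrix.

    Classical (Torgerson) scaling: double-center the squared dissimilarities
    and keep the eigenvectors that have the largest eigenvalues.
    """
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]    # indices of the largest ones
    # Negative eigenvalues (non-Euclidean data) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Hypothetical dissimilarities among four items (values invented, not data).
labels = ["emotion_1", "emotion_2", "emotion_3", "emotion_4"]
dissimilarity = np.array([
    [0.0, 2.0, 3.0, 2.5],
    [2.0, 0.0, 2.8, 1.5],
    [3.0, 2.8, 0.0, 1.2],
    [2.5, 1.5, 1.2, 0.0],
])

coords = classical_mds(dissimilarity, n_dims=2)
for label, (x, y) in zip(labels, coords):
    print(f"{label:10s} {x:6.2f} {y:6.2f}")
```

In a sketch of this kind, the number of retained dimensions is chosen by the analyst, which is one reason the dimensionality reported across studies can differ.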
The final contribution of this dissertation was to provide a preliminary acoustic model that can classify an expression as one of the basic emotions that can be perceived in SS. This acoustic model was to be based on perceptual judgments of the emotions that can be perceived in SS. In addition, the model sought to include only those acoustic cues that did not require any baseline information from a speaker such as a neutral emotional expression. The ability of the model to predict the emotions of novel speakers was also evaluated using two basic classification algorithms.

These experiments lay the groundwork for future research to understand how emotions differ according to a number of perceptual and acoustic properties or dimensions, to determine the isolated role of segmental information and the combined role of segmental and suprasegmental information in emotion perception, and to examine how the present perceptual and acoustic model may differ by age, gender, culture, and clinical populations such as those with Aprosodia, depression, and so on.

Overview of Dissertation

The following section describes the layout of this dissertation. Since it was formatted as four manuscripts, the four chapters following this general introduction (Chapters 2 through 5) contain the relevant background information pertaining to the experiments performed within that chapter. These were followed by a chapter describing the general conclusions (Chapter 6). The following specific questions were addressed in separate experiments.

1. Can a minimal set of discrete emotions be determined from clustering a set of 70 emotion terms based on similarity of emotion experiences?
2. How many of these emotions are discriminable in SS?
3. What are the perceptual mechanisms used by listeners to discriminate the minimal set of emotions in SS?
4. Are individual or group (e.g., males and females) differences in the relative importance given to each perceptual dimension normally distributed?
5. Can each emotion dimension be described by a unique set of acoustic features?
6. How well does the acoustic model predict the emotion categories of novel sentences?

First, a series of perceptual experiments was completed to determine which emotions can be perceived in SS and the strategies used by listeners to perceive them. Nonsense sentences were used to examine the role of the suprasegmental information in emotion perception without the influence of the segmental information. Specifically, the first experiment sought to determine a small set of terms that represent relatively distinct emotions to be studied in speech. The next step was to empirically determine which of these emotion categories could be correctly discriminated in speech when only suprasegmental cues were available to listeners (Experiment 2). Once the emotions perceivable in SS were determined, the perceptual strategies used by listeners when discriminating the emotion categories in SS were investigated (Experiment 3). An analysis of individual differences was also performed to verify that the differences in listener strategies were small and normally distributed (Experiment 4). Then, an experiment was performed (Experiment 5) to identify the acoustic characteristics of each emotion dimension and develop an acoustic model to represent a set of speech samples as one of the emotions that are perceivable in SS (based on the results of Experiment 2). In the final experiment (Experiment 6), the acoustic model was evaluated by classifying a set of novel speech stimuli from young-adult speakers to examine whether the model could generalize to novel stimuli.
Figure 1-1. Factors involved in emotion expression. The components for suprasegmental information are mostly based on Shriberg and Kent (1995, p. 98).
[Figure: tree diagram. Node labels: Emotion Perception; Physical Expression (Facial Expression, Body Language; Voluntary, Involuntary; examples include smiling, frowning, jumping, fidgeting, sweating, shivering, blushing, crying, etc.); Vocal Expression (Suprasegmental Information, Segmental Information; Syntax, Semantics, Prosody, Paralinguistics; Intonation, Tempo, Loudness, Voice Quality, Speaking Style, Voice Effects, etc.).]
CHAPTER 2
EMOTION PERCEPTION FROM SUPRASEGMENTAL INFORMATION

Background

The communication of emotions is crucial for building and maintaining interpersonal relationships. A large part of communication rests on how information is presented in addition to what information is presented. Research on emotions in speech is primarily concerned with understanding how information is presented, which is communicated through tone-of-voice or, more formally, suprasegmental cues (referred to as "SS"). This knowledge is useful in a variety of applications such as natural speech modeling and the rehabilitation of populations with emotion communication disorders.

Research to study emotion communication seeks to determine the acoustic features (e.g., fundamental frequency, intensity, speaking rate, etc.) that correspond to the perception of emotions in SS. Typically, the relevant acoustic parameters are determined by their degree of correlation with perceptual judgments of emotions. Unfortunately, the perceptual accuracy rates reported by various experiments are inconsistent across emotions and experiments (Scherer, 1986; Yildirim et al., 2004). This variability may be due to differences in how researchers define emotions or differences in the methods used to obtain perceptual judgments of emotions. Different terms may be selected across experiments to represent a single emotion category, when the terms may in fact represent unique emotion categories. For example, van Bezooijen (1984, p. 85) reported 72% recognition accuracy of "joy" by Dutch adults from 10 emotions, whereas Dellaert, Polzin, and Waibel (1996) reported 88% recognition accuracy of "happy" from four emotions. It is difficult to identify how the accuracy of judgment was influenced by the number of emotion categories tested in these experiments or the use of somewhat different terms to label the emotion category.
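One common way to compare accuracy rates obtained with different numbers of response categories is to rescale each rate against its chance level. The sketch below is an illustration added for this comparison and is not an analysis from the studies cited. It assumes a forced-choice task in which guessing is uniform across the k categories, so that chance equals 1/k; the function names are hypothetical.

```python
def chance_level(n_categories: int) -> float:
    """Chance accuracy for a forced-choice task, assuming uniform guessing."""
    if n_categories < 2:
        raise ValueError("need at least two response categories")
    return 1.0 / n_categories


def chance_corrected(accuracy: float, n_categories: int) -> float:
    """Rescale accuracy so that 0 is chance performance and 1 is perfect."""
    chance = chance_level(n_categories)
    return (accuracy - chance) / (1.0 - chance)


# Accuracy rates quoted in the text: (label, proportion correct, categories).
reported = [
    ("joy, 10 emotions (van Bezooijen, 1984)", 0.72, 10),
    ("happy, 4 emotions (Dellaert et al., 1996)", 0.88, 4),
]

for label, accuracy, k in reported:
    print(f"{label}: chance = {chance_level(k):.2f}, "
          f"corrected = {chance_corrected(accuracy, k):.2f}")
```

A rescaling of this kind addresses only the number of categories. The choice of emotion terms, speakers, listeners, and stimuli remains confounded across the two studies.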
Likewise, emotion perception in speech is not always clearly defined, since information in speech can be communicated through segmental and suprasegmental means. For example, perceptual experiments have been performed using single-word samples (Hammerschmidt & Jurgens, 2007), neutral sentences or phrases (Kienast & Sendlmeier, 2000), or semantically colored sentences such as "I won the lottery" (Murray & Arnott, 1995). It is possible that different emotions may be perceived depending on the information in speech given to the listener.

Experiments also differ in terms of the perceptual tasks used to elicit judgments of emotion in speech. For example, many experiments have used forced choice identification tasks (Banse & Scherer, 1996; Zuckerman, Lipets, Koivumaki, & Rosenthal, 1975), whereas others used an open set identification paradigm (Greasley, Sherrard, & Waterman, 2000). Still others have used a rating scale task such as the semantic differential (Osgood, Suci, & Tannenbaum, 1957). The accuracy rates for the emotions studied may vary depending on the paradigm and the number of emotion categories tested within an experiment.

To better understand how listeners perceive emotions in speech, all factors that may influence the final results need to be adequately controlled or addressed. Hence, these issues must be addressed using a highly controlled experimental design in order to obtain reliable measures of emotion similarity in SS. The following sections describe these topics in further detail.

Emotion Selection

Experiments to study emotion perception in speech attempt to determine the underlying mechanisms that result in the perception of one or more emotions. Although the number of emotions studied in a single experiment ranges from two emotions, such as pain and pleasure (Mowrer, 1960), to 14 or more emotions (Banse & Scherer, 1996), the number of emotion terms used to represent the same emotions is large. This complicates comparisons across experiments, since it is difficult to determine which emotion terms are similar enough to represent a single emotion category for examination in speech. Researchers who assign labels to emotions view them as discrete categories that can be described by a small set of central or basic categories (e.g., Izard, 1977; Ekman, Friesen, & Ellsworth, 1972; Epstein, 1984; Tomkins, 1984; Stein & Oatley, 1992).

Numerous theories have been proposed to support the selection of a set of basic emotions (see Strongman, 1996 for a thorough review), and a number of terms have been used to describe this specific set of emotions ("primary" and "fundamental" are common alternatives). The term "basic" will be used henceforth to describe the set of emotion categories referred to as "primary," "fundamental," and "basic," since these terms represent a set of emotions that are universal according to some criteria. The following review provides only a brief glimpse into this highly controversial and almost philosophical debate (see Ortony & Turner, 1990; Ekman, 1992; Izard, 1992; Panksepp, 1992; Turner & Ortony, 1992, for a review of the different perspectives).

A large number of researchers suggest that emotions are qualified as basic based on biological factors, beginning with James (1884, 1890, p. 449) and Lange (Lange & James, 1922). The James-Lange Theory claimed that emotions are felt as a result of bodily changes in response to a stimulus. For example, this may include smiling in response to a positive stimulus (a response through facial expression) or shaking in response to a frightening situation (an autonomic system response; Cacioppo, Klein, Berntson, & Hatfield, 1993). Kemper (1987) identified four physiologically basic emotions (fear, anger, depression, and satisfaction) based on association with ANS activity. Ekman, Levenson, and Friesen (1983) found autonomic differences in six emotions, although they did not discuss the implications of their findings on whether these emotions were part of a basic emotions set. Panksepp (1992, p. 554) argues that there are four basic emotions arising from separate neurologic systems (the rage, fear, expectancy, and panic systems), which help instigate and mediate many aspects of emotionality.

Other researchers claim that specific factors must be present for an emotion to qualify as basic or for an affective state to be called an emotion. One such factor includes connections with a universal facial expression or facial expressions that can be expressed and perceived by all individuals (Ekman & Oster, 1979; Ekman, 1992; Fridlund, Ekman, & Oster, 1987). In literature reviews conducted in 1972 by Ekman, Friesen, and Ellsworth and 1987 by Fridlund, Ekman, and Oster, overwhelming evidence was found in support of six universal facial expressions: happiness, sadness, anger, fear, surprise, and disgust/contempt. In the following years, some researchers have separated disgust and contempt, resulting in a total of seven basic emotions that are associated with a unique and universal facial expression (Ekman & Heider, 1988). Another factor required by some researchers in selecting emotions as basic is cognitive appraisal (Roseman et al., 1990; Frijda, 1986; Scherer, 1984; Smith & Ellsworth, 1985). Izard and colleagues (1977, 1992) claim that the basic emotions are those that have the following three factors: innate neural substrates, unique facial expressions, and unique feelings. Izard (1977, pp. 83, 85-92) identified the following 10 basic emotions: interest/excitement, enjoyment/joy, surprise/startle, distress/anguish, anger/rage, disgust/revulsion, contempt/scorn, fear/terror, shame/humiliation, and guilt. Epstein (1984, p. 67) claims that basic emotions are complex, organized response disposition[s] to engage in certain classes of biologically adaptive behaviors and are associated with a distinctive state of physiological arousal, a distinctive feeling or affective state, a distinctive state of receptivity to stimulation, and a distinctive pattern of expressive reactions. Some of the emotions suggested as basic according to Epstein (1984) include fighting, fleeing, and expressing affection.

Another group of scientists select the basic emotions according to prototype theory (Rosch, 1973; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). The term "basic" is used to signify a level of abstraction containing the categories for which a concrete representation can be formed. In general, the basic level of abstraction in a hierarchical structure is the level at which the categories carry the most information, possess the highest cue validity, and are most easily distinguished from one another. The basic-level categories are represented by a prototype, i.e., the most common and best representative item or attributes of the class. The prototypes are surrounded by other members of the class that are progressively more dissimilar to the prototype. Membership in an emotion category is based on the degree of similarity with the emotion prototype. According to the prototype approach, hierarchical clustering analysis can be used to determine the basic-level categories. Basic-level categories are of interest for many reasons. For example, basic emotion categories are readily accessed when relevant stimuli are encountered, and they are likely to have simple labels (Mervis & Crisafi, 1982; Rosch et al., 1976).

A handful of studies have followed the prototype approach to determine the emotion terms at the basic level (Fehr & Russell, 1984; Shaver, Schwartz, Kirson, & O'Connor, 1987; Russell, 1997; Fillenbaum & Rapaport, 1971; Scherer, 1984; Alvarado, 1998). These studies assume that emotion terms are qualified to represent emotional experiences (Lutz, 1982). Therefore, a set of basic-level emotion terms would represent basic-level emotional experiences. In one such study, Shaver and colleagues (1987) performed an unconstrained sorting task of 135 emotion terms to determine the hierarchical structure of emotions and the labels that represent basic-level categories. In other words, participants were allowed to sort the 135 terms into as many piles as necessary to represent emotions that are similar and different. No limit was placed on the number of terms allowed in a pile.
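The general idea of turning pile sorts into clusters can be sketched as follows. This is a minimal illustration and not the analysis reported by Shaver and colleagues: the six terms, the three participants' sorts, the use of average linkage, and the function names are all assumptions made for the example. Each pair of terms is scored by the number of participants who placed both terms in the same pile, and clusters are merged from the most frequently co-sorted pairs upward.

```python
from itertools import combinations


def cooccurrence(sorts, terms):
    """Count how many participants placed each pair of terms in the same pile."""
    counts = {frozenset(pair): 0 for pair in combinations(terms, 2)}
    for piles in sorts:                       # one list of piles per participant
        for pile in piles:
            for pair in combinations(sorted(pile), 2):
                counts[frozenset(pair)] += 1
    return counts


def average_linkage(terms, counts, n_participants, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [frozenset([term]) for term in terms]

    def distance(a, b):
        # Mean dissimilarity, where dissimilarity = 1 - proportion co-sorted.
        pairs = [(x, y) for x in a for y in b]
        total = sum(1 - counts[frozenset((x, y))] / n_participants
                    for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda ab: distance(*ab))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical data: three participants sort six terms into piles.
terms = ["joy", "happiness", "anger", "rage", "sadness", "grief"]
sorts = [
    [{"joy", "happiness"}, {"anger", "rage"}, {"sadness", "grief"}],
    [{"joy", "happiness"}, {"anger", "rage", "sadness", "grief"}],
    [{"joy"}, {"happiness"}, {"anger", "rage"}, {"sadness", "grief"}],
]

counts = cooccurrence(sorts, terms)
for cluster in average_linkage(terms, counts, len(sorts), n_clusters=3):
    print(sorted(cluster))
```

In this sketch the number of clusters is fixed in advance; in practice the level at which the hierarchy is cut is itself a judgment made by the analyst.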
Through this process, they designated six emotion categories at the basic level. To find the corresponding labels, they calculated the prototypicality of each term as the difference between the average number of co-occurrences with all words outside that cluster and the average number of co-occurrences with all other words in the same large cluster (p. 1065). Using this procedure, they found the following category prototypes: affection, happiness, amazement, anger, depression, and fear. However, they decided instead to use the terms love, joy, surprise, anger, sadness, and fear because of their frequency of occurrence in the literature.

Alvarado (1998, p. 329) performed a modified version of the experiment by Shaver et al. (1987) because she claimed that Shaver's unconstrained pile-sort task methodology was flawed. She suggested that according to Weller and Romney (1988), an unconstrained sort task does not allow the piles of terms to be summed across participants because participants with more piles will be given more weight than others. She therefore required subjects to divide the group of terms into two piles, and then subdivide each new group into two piles until eight piles were formed (i.e., a divisive clustering procedure; Macmillan & Creelman, 1991). The eight resulting piles were reported as judgmental, show/feel, definitional, controllability, undifferentiated, and unclassifiable. These terms reported to describe the piles were uncommon in the literature. She further concluded that the pile-sort method may be inappropriate for finding a prototypical set of emotion terms.

Clearly, the emotions given basic status vary considerably. Even within the large group of experts who claim that the basic emotions are those that have evolved to have biological significance, many different sets of these emotions exist. In addition, it is not necessary that any of these sets of basic emotions can be perceived in SS. It may be possible that different sets of emotions are basic depending on which aspect of emotion communication is examined. In other words, the emotions that are developmentally or neurologically basic may not be the same emotions that are basic for emotion communication in speech. For example, Yildirim et al.
(2004) performed an experiment to better understand the acoustic characteristics of four emotions (anger, happiness, sadness, and neutral). Four listeners performed a forced choice identification task in which a fifth response choice, "none of the listed," was added. Identification accuracy of 25 samples for each emotion was 82% for angry, 56% for happy, 61% for sad, and 74% for neutral. Although each of these emotions was identified above chance level, recognition of happy was very low. As such, the ability to recognize happy in speech may be questionable, whereas it is decidedly basic in terms of facial expressions and emotional meaning.

Speech Sample Selection

The number of emotions that can be identified from speech may also be dependent on the methods of eliciting or obtaining speech samples. Factors such as the experience level of the speakers, tasks used for eliciting emotional speech, utterance length, and amount of semantic information contained in the utterance may impact the perception of emotions from speech. Indeed, the ecological validity of the emotional expressions obtained from the various methods is still highly controversial. Concern for obtaining realistic samples arose from observations that emotional speech recorded in the laboratory environment seemed either unnaturally weaker than naturally occurring expressions or was highly variable depending on the speaker's interpretation of the terms used (Batliner, Fischer, Huber, Spilker, & Nöth, 2000). To obtain the same intensity or magnitude of emotions from all speakers, researchers have used techniques to induce emotional experiences in speakers. Some induction techniques include presentation of visual stimuli such as movies or pictures (Lane et al., 1997; Baumgartner, Esslen, & Jancke, 2006), presentation of audio stimuli such as music (Skinner, 1935; Juslin & Laukka, 2003; Scherer, 2004), oral reading of emotionally charged passages, or performance of specific tasks (Karlsson et al., 1998; Markel, Bein, & Phillis, 1973; Tolkmitt & Scherer, 1986) such as interaction with an automated system in a Wizard-of-Oz paradigm (Batliner, Fischer, Huber, Spilker, & Nöth, 2003). The underlying assumption in these elicitation tasks is that emotions experienced by the speaker can be externalized in speech. However, the emotions experienced through these situations may vary for each individual, which may be reflected in the speech samples as differences in emotion magnitude.

To avoid these effects, acted speech can be used. It is assumed that the emotions expressed by actors are more intense than the induced emotional speech of laymen in laboratory conditions because actors produce emotional speech in a manner that facilitates recognition (Vogt & Andre, 2005; Banse & Scherer, 1996). Hacker, Batliner, and Nöth (2006) have shown significantly higher classification rates of acted speech (81-93%) over spontaneous speech (43-74%) in identifying a speaker's intended audience, either an automated system or another person (e.g., talking in a car to the automated voice dialing as opposed to talking to a passenger). Perceptual experiments have confirmed that acted speech is representative of everyday emotional speech through listener identification of acted emotions because of the relatively high accuracy in perceptual judgments of acted expressions (Burkhardt, Paeschke, Rolfes, Sendlmeier, & Weiss, 2005; Rillard, Aubergé, & Audibert, 2004; Montero et al., 1998). In addition, acted speech allows for increased control over the stimulus material, since actors can express a small stimulus set in the context of a number of emotions. Actors are trained to experience the emotion requested before expressing it, thereby simulating a natural situation (e.g., the Style of Involvement advocated by Konstantin Stanislavsky and Lee Strasberg; Konjin, 2000). This task is quite difficult for the layman in the laboratory setting, particularly when the sentences are devoid of semantic information.

It is commonly understood that minor deviations between acted emotional speech and natural emotional speech are possible (Scherer, 2003). Some researchers claim that these
32 deviations are not acceptable. Instead, they pref er to obtain natural speech samples from prerecorded media such as televi sion shows, radio interviews, etc. (Douglas-Cowie, Campbell, Cowie, & Roach, 2003). Although these samples are the most ecologically valid, they may be noisy (background noise, multiple speakers talking simultaneously, etc.) and contain excessive variability (text of speech is not controlled, distance between the microphone and speaker may vary across speakers, etc.) due to the lack of control in the recording environment. This is problematic when analyzing the acoustics of thes e speech signals, since many algorithms such as pitch detection or voice quality estimation give less than optimal results for noisy signals. Furthermore, pre-recorded media typically cont ains semantic information that is congruent with the emotion expressed. Evidence from a number of studies has shown that semantic information contributes to the overall percep tion of emotions (Grimshaw, 1998; Schirmer & Kotz, 2003; Alter et al., 1999). For example, Murray and Arnott (1995) conducted a study in which participants were asked to judge the em otions from neutral and emotionally-charged speech samples. Participants were asked to make their judgments using the prosodic information alone. Results showed increased accuracy rates for the stimuli containing segmental information. This indicates that the emotion perception is influenced by sentence context even when participants were instructed to pay attention to only the supras egmental information. Therefore, experiments to recognize emotions from the acoustic aspect of speech may find higher perceptual accuracy rates due to the listen ers ability to integrate both semantic and suprasegmental knowledge, while most algor ithms rely on acoustic measures alone. To avoid the contribution of semantic inform ation to emotion perception, either neutral speech (e.g. Kienast & Sendlmeier, 2000) or pseu do-speech (e.g. 
Banse & Scherer, 1996) can be used. Neutral speech is speech that does not carry any emotional information, such as dates,
numbers, or declarative sentences (Pereira, Watson, & Catherine, 1998). Neutral words such as dates, numbers, and colors can either be spoken in isolation (Pereira, 2000) or in carrier sentences. The argument against word-length stimuli or the use of carrier sentences is that the long-term prosodic contour is removed and in its place, a different stress pattern, unlike that of running speech, is used. Carrier sentences are used to obtain a couple of emotional words by embedding them within complete sentences. However, speakers will often place extra emphasis on the inserted words. To avoid these shortfalls, pseudo-speech can be used (Grandjean et al., 2005; Banziger & Scherer, 2005; Schirmer, Kotz, & Friederici, 2005). Pseudo-speech consists of pronouns, articles, conjugates of the verb "to be," and English pseudo-words, which are words from the English language that have numerous phonemic substitutions such that the words are nonsensical (e.g. "The laycats are wame."). The use of pseudo-words to form sentences or nonsense speech avoids any additional gain in perceptual accuracy due to semantic knowledge.

In summary, the choice of stimuli used for a perceptual experiment may affect the overall findings. Inducing emotions in lay subjects prior to recording their emotional expressions may provide expressions that are more representative of everyday emotional speech. On the other hand, each subject may not experience equal intensities of emotions, which may be seen in the expressions as differences in emotion magnitudes. This may result in decreased naturalness of the samples, according to Banse and Scherer (1996). They suggest that the use of acted speech increases naturalness since actors are trained to experience the emotions through various techniques before expressing them, thereby resulting in intense or full-blown emotional expressions. In addition, it is likely that actors may be more at ease in expressing emotions using unusual samples (e.g.
neutral sentences, pseudo-speech, numbers, etc. as have been used in the
past) as a result of their training. Thus, pseudo-speech expressed by actors may offer the most control and least bias over expressions.

Task Selection

Yet another factor that will impact the emotions that can be perceived in SS is the method of psychophysical testing. Perceptual judgments of emotion expression are usually quantified through identification tasks. Typically, experiments on emotion perception use a closed-set forced-choice identification task and report results as percent-correct scores (Leinonen et al., 1997; Banse & Scherer, 1996). This task assumes that a limited set of emotion terms is sufficient to describe all the emotions in that stimulus set. One shortcoming of the identification task is the necessity of assigning emotion labels to speech samples. In such a closed-set paradigm, accuracy may be inflated due to the forced labeling of a limited number of responses. Listeners may use the process of elimination to select the correct emotion as opposed to actually associating a particular prosody with one of the emotion category choices. To avoid this problem, an open-set design may be used where listeners are asked to provide labels to describe the emotional expressions (Greasley et al., 2000; Fakotakis, 2004). Greasley et al. (2000) demonstrated the advantage of the free-response task by collecting perceptual judgments of emotional expressions using both forced-choice and free-response tasks. Identification accuracy of four negative emotions of the five investigated (anger, disgust, fear, sadness, and happiness) was 54% using the forced-choice task, while the average accuracy obtained using the free-response task was 70%. Unfortunately, this technique may result in a wide range of terms as responses, which may then lower the accuracy of the proposed emotion categories if only specific terms are accepted as correct.
Some researchers suggest that an identification task might be better suited for obtaining emotion judgments so long as the number of emotions being identified is large enough (i.e. 10 or
more) because this situation is more representative of natural situations (Banse & Scherer, 1996). However, identification of a small number of items may result in elimination techniques, essentially making this a discrimination task (Banse & Scherer, 1996). Discrimination experiments are generally performed by presenting emotional expressions one pair at a time. The listener's task is to determine whether the emotions expressed are the same or different. One advantage of this approach is that the experimenter does not have to pre-select or define a set of emotion terms. Another benefit is that discrimination generally enhances the perception of subtle differences between emotions and can be used to detect fine perceptual differences between expressions (Jamieson & Morosan, 1986). Furthermore, discrimination tasks enable calculation of d-prime (d′), a measure of perceptual distance between items as described by the Theory of Signal Detection (Green & Swets, 1966). The parameter d′ is a more accurate descriptor of perceptual distances between emotions than percent correct rates, the typical measure of identification accuracy. The main problem with using percent correct scores as a measure of perceptual accuracy is that the false alarm rate is ignored. The false alarm rate is the proportion of trials on which two sentences expressing the same emotion are incorrectly judged as two different emotions. In other words, it is a measure of listener bias towards judging "different." The parameter d′ uses both the hit rate (the proportion of correct discriminations of two different expressions) and the false alarm rate in quantifying perceptual distance and, as a result, is a better measure of human perception than percent correct.
Limitations and Overview

Existing research on emotion perception in speech has not empirically studied the number of emotional categories that can be perceived from the suprasegmental information in speech. Instead, experiments to study emotions in SS tend to use emotions that are basic in some universal sense (e.g. biologically, prototypically, developmentally, socially, etc.). These studies
assume that all of these emotions may also be perceivable in SS; however, this may not be the case. For example, Buck (1999) claims that jealousy and envy are different emotions, and Ekman (1999) claims that embarrassment, shame, and guilt are different basic emotions. These emotion categories may be too similar to be differentiated in SS. Essentially, there is little empirical evidence to demonstrate that these basic emotion categories can be discriminated in speech, particularly when contextual and semantic information is eliminated.

Hence, research to study the mechanisms underlying the perception of emotions in SS first requires a set of emotions that can be perceived in speech. To determine this set of emotions, a larger set of fairly unique emotions must be obtained in order to increase the likelihood of including all potential emotion categories. The technique used in experiments to select basic-level emotional experiences following the prototype approach is promising for this task. However, previous experiments using this approach have had a number of shortfalls. First, some studies such as Scherer (1984) have selected such a large set of emotion terms (235 terms) that it exceeded the computational ability of any software available at the time. Other experiments may not have used terms that represented a large enough variety of emotions. For example, commonly used emotion terms such as bored, helpless, shy, and suspicious were not included in the data reduction experiment performed by Shaver et al. (1987), even though a large initial list of 135 terms was used.

Furthermore, methodological issues such as the amount of contextual information within the speech stimuli or the parameters used to estimate perceptual accuracy may affect the number of emotion categories that can be perceived in SS. Speech samples containing segmental information may provide additional cues that aid in emotion recognition.
On the other hand, samples devoid of syntactic information (e.g., words or numbers) may not provide enough
information for the recognition of certain emotion categories in speech. In addition, percent correct scores, the standard method of assessing accuracy in forced-choice identification tasks, may not provide a complete measure of listener performance. Moreover, percent correct scores cannot be used to obtain measures of the similarity or perceptual distances between emotion categories, a useful measure when modeling emotion perception in SS.

Before attempting to develop a model for emotion perception, two experiments were performed to determine the minimal set of emotions that can be perceived using the suprasegmental information in American-English speech. The first experiment sought to identify a small subset of emotion terms, each of which may describe a unique emotional experience, to serve as the basis for further research. This method of using emotion words to provide an account of emotional experiences has been validated in previous research (e.g. Lutz, 1982; Scherer, 1984). A large initial vocabulary of 70 terms was used to maximize the likelihood of including all potential emotion categories. Hierarchical clustering analysis allowed us to group all terms that convey a similar emotion and identify the emotion terms that reflect distinct emotion categories. These terms were then used in the second experiment to determine the emotion categories that can be perceived in speech containing only suprasegmental cues. Hence, nonsense sentences composed of pseudo-words were used to prevent possible biases in perception from the semantic information in speech. A discrimination task was performed to determine the emotions that can and cannot be differentiated in SS without requiring participants to assign labels to stimuli. Finally, a hierarchical clustering analysis using d′ was used to group the emotions according to perceived similarity and determine the emotions that can be perceived from the suprasegmental information in SS.
Experiment 1: Hierarchical Clustering of Emotion Terms

A review of the literature on emotion perception revealed that a large number of terms (more than 70) have been used to describe or categorize emotions. Since many of these terms may describe very similar categories, it was necessary to identify a relatively small number of distinct emotion categories in order to evaluate whether any emotions can be perceived in SS. This experiment was unlike those searching for basic emotions since it was not necessary to determine the minimum set of emotional experiences that may be common to everyone. The objective was to determine a set of terms that represent unique emotional experiences. Some redundancy in the resulting set of terms was acceptable, since this would ensure that all unique emotions were included. Therefore, a data-reduction algorithm, a hierarchical clustering scheme (HCS), was used to identify a small set of emotion vocabulary that may be used in the context of speech. This data reduction experiment was necessary to ensure that all relevant emotions were identified, while also minimizing the likelihood of having multiple terms that represent the same (or similar) emotions.

Participants

Nineteen participants (9 males, 10 females) with a mean age of 24 years (range of 18-37 years) were recruited to participate in this study. All participants were native speakers of American English and had normal hearing bilaterally (air-conduction pure-tone threshold below 20 dB HL at 125 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, and 8 kHz) as ensured through a hearing screening. Participants did not report symptoms of depression as determined through a 17-question Hamilton Rating Scale for Depression (HAM-D-17) test, where scores of 0 to 6 were considered to be in the normal range (Hamilton, 1960). In addition, participants did not report taking any medication to treat depression, anxiety, bipolar disorder, or any other mental or emotional health disorders.
Absence of medical treatment suggested an absence of these
disorders and this served as an inclusionary criterion for this experiment. These screening criteria were necessary to avoid biasing judgments of perceived or experienced emotions.

Emotion Terms

The stimuli consisted of 70 emotion terms that were written on note-cards. These terms are listed in Table 2-1. Forty-seven of these terms were those recognized as basic emotions by at least 11 researchers examining the verbal or visual expression of emotions (James, 1884; Arnold, 1960, p. 196; Panksepp, 1982; Plutchik, 1980; Tomkins, 1984; Oatley & Johnson-Laird, 1987; Banse & Scherer, 1996; Lewis & Haviland, 1993; Buck, 1999; Ekman, 1999; Cowie et al., 1999). The remaining 23 terms were selected from other research on emotions, despite the fact that these terms were not identified as being basic. A large number of terms were selected to minimize the likelihood of excluding any emotional category in the current research. These terms were not defined to avoid biasing participants; however, participants were asked to read the list of emotion terms and verify that they knew the meaning of each term before proceeding with the experiment. One participant who did not know the meaning of an emotion term was excluded from further participation, resulting in nine male and nine female participants (mean age: 23 years; range: 18-31 years).

Sorting Procedures

Participants were given a vocabulary of 70 emotions on note-cards. The note-cards were shuffled prior to each test session to randomize the order in which subjects received the note-cards. They were instructed as follows:

Please group the emotion terms into as many categories as needed such that the terms within each category reflect similar emotional experiences. Do not group antonyms of an emotion within a single category. There are no limitations on the number of categories possible or the number of terms that can be grouped within a category.
Once you are satisfied with your categories, designate one term from each category as the modal emotion term of that group. This is the emotion term that is most commonly known and best represents the emotions within that category.
Only one participant performed the task at a time. Once completed, the participant's categories and modal term selections were recorded.

Results

Since no restrictions were placed on the number of groups of emotion terms that could be used, subjects varied considerably in the total number of categories chosen (between 9 and 33 categories). The distribution of the total number of groups formed by males and females is shown in a histogram in Figure 2-1. To determine the emotion terms that represent relatively distinct emotion categories for all participants, these groups of emotion terms were aggregated across participants as follows. The frequency with which each pair of terms was grouped together was recorded in a 70 x 70 confusion matrix. Larger frequencies between two emotion terms indicated that both terms may describe the same or very similar emotion categories, since they were placed into one category by most participants. Thus, these frequencies signified the similarity between emotion terms. These similarity measures were converted to dissimilarity measures by subtracting each value in the matrix from 18 (the total number of participants). Larger dissimilarity values between emotion terms indicated a larger conceived difference in emotion terms, since they were placed in different categories by most participants.

Hierarchical clustering scheme analysis was performed using these dissimilarity measures. The agglomerative clustering method was chosen since the number of items is large. In this method, each item begins as its own cluster (Baird and Noma, 1978, p. 211). The clusters are progressively fused together according to similarity until all items are included in one cluster. As clusters are merged together, increasingly dissimilar clusters are joined. Clusters were joined according to the complete linkage algorithm (maximum method).
This algorithm measures the dissimilarity of clusters as the greatest distance between members of each cluster, resulting in tight clusters of similar objects (Johnson, 1967). Consequently, a trade-off exists between
choosing a level of the HCS containing too many or too few clusters. Too many clusters would defeat the purpose of this data reduction task. Too few clusters may not capture all of the emotion categories.

The optimum number of clusters, in this case 19, was determined by averaging the number of groups created by the 18 participants. The first cluster included the emotions amused, funny, and hysterical. It is interesting that the emotion hysterical was associated with funny and amused more often than with terrified and anxious, although both groupings were seen. Cluster 2 included content, pleased, relieved, relaxed, and satisfied. Cluster 3 included affectionate, compassionate, and love. The fourth cluster contained only respectful and gratitude. Cluster 5 included enthusiastic, excited, happy, hopeful, and joy. Clusters 6, 8, and 17 each consisted of a single emotion term (exhausted, confused, and jealous, respectively). The seventh category included arrogant, confident, courageous, and proud. The ninth cluster included apathetic, bored, and indifferent. Cluster 10 consisted of doubtful, shy, and suspicious. Cluster 11 included embarrassed, guilty, humiliated, and shame. The twelfth cluster was a large cluster that included agony, depressed, despair, disappointed, grief, hurt, pity, and sad. The thirteenth cluster contained amazed, shocked, and surprised. Cluster 14 included the emotion terms curious, desire, and interested. Cluster 15 included aggravated, annoyed, disgusted, displeased, and irritated. Cluster 16 contained angry, contempt, hatred, hostile, rage, and scorn. Cluster 18, another large cluster, included afraid, anxious, distressed, fearful, panicky, scared, terrified, and worried. The final category included both helpless and lonely. The term selected to describe each cluster was the term that was most frequently chosen as the modal term within each cluster.
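The aggregation and clustering steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five terms and three sorts are hypothetical toy data (not the study's 70 terms or 18 participants), the function names `cooccurrence` and `complete_linkage` are invented here, and the original analysis software is not specified in the text. The cut level is taken as the rounded mean number of groups per participant, mirroring the averaging rule described above.

```python
from itertools import combinations

def cooccurrence(sorts, terms):
    """Count how many participants placed each pair of terms in the same group."""
    counts = {frozenset(p): 0 for p in combinations(terms, 2)}
    for groups in sorts:                      # one list of groups per participant
        for group in groups:
            for pair in combinations(sorted(group), 2):
                counts[frozenset(pair)] += 1
    return counts

def complete_linkage(terms, dissim, n_clusters):
    """Agglomerative clustering; cluster distance is the largest pairwise dissimilarity."""
    clusters = [{t} for t in terms]           # every term starts as its own cluster

    def dist(a, b):
        return max(dissim[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > n_clusters:
        # Merge the two clusters whose farthest members are closest together.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy data: 3 participants sorting 5 terms (not the study's data).
terms = ["happy", "joy", "sad", "grief", "angry"]
sorts = [
    [{"happy", "joy"}, {"sad", "grief"}, {"angry"}],
    [{"happy", "joy"}, {"sad", "grief", "angry"}],
    [{"happy", "joy"}, {"sad"}, {"grief"}, {"angry"}],
]
n_participants = len(sorts)
counts = cooccurrence(sorts, terms)
# Similarity -> dissimilarity: subtract each count from the number of participants.
dissim = {pair: n_participants - c for pair, c in counts.items()}
# Cut level: mean number of groups formed per participant, rounded.
k = round(sum(len(groups) for groups in sorts) / n_participants)
clusters = complete_linkage(terms, dissim, k)
```

With real data, `sorts` would hold one list of note-card groups per participant, and the modal term of each resulting cluster would be chosen separately from the participants' modal-term selections.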
The following emotion terms were chosen to represent the 19 relatively unique emotion categories: funny, content, love, respectful, happy, exhausted,
confident, confused, bored, suspicious, embarrassed, sad, surprised, interested, annoyed, angry, jealous, anxious, and lonely. These results are shown in Table 2-2.

Experiment 2: Discriminating Emotions from Suprasegmental Cues

Once a small number of emotion categories were identified for further study, a second experiment was conducted to determine which of these might be perceived from SS. Actors expressed nonsense sentences in the emotional contexts described by these terms. Individuals trained in theater were selected in order to provide the maximum control over the speech stimuli. Furthermore, nonsense sentences were used to prevent possible biases in perception from the semantic information in speech. A set of nonsense sentences was elicited in each of the 19 emotions identified in Experiment 1. Nonsense sentences composed of pseudo-words were used to isolate the contribution of suprasegmental information used to convey a speaker's emotions and thereby ensure that listener judgments of emotions were not biased by the semantic content of the sentences. Listeners were asked to judge the similarity of emotions for every pair of sentences using a same-different paradigm. This paradigm was utilized to avoid labeling or categorizing emotions because individual listeners may infer the same emotion term or label somewhat differently.

Stimuli

Two theater students (1 actor, aged 21; 1 actress, aged 20) were recruited to generate the stimuli for this experiment. Participants were not given examples or directions for expressing the sentences. They were instructed to express the sentences using each of the 19 emotional intonation patterns. The nonsense sentences were designed to resemble English sentences in terms of the syntax, but they were formed with pseudo-words or words created to resemble real words. The sentences include: "My bakey seemer yumbled his fent" (Sentence 1) and "The kay nace is daunching in the rowalk" (Sentence 2).
The order of the emotion categories expressed
was randomized for each participant. Participants were permitted to monitor and reproduce their utterances until they were satisfied. All recordings for each participant were made within a single session. The recordings were obtained using a head-mounted microphone (Audiotechnica, ATM21a) that was connected to a high fidelity, external sound card (Creative E-MU 0202) of a PC. The sound card digitized the samples at 44.1 kHz. The samples were tested to ensure that no recording errors such as signal clipping or inappropriate environmental noises were present. Although the recording levels were adjusted for each sample to avoid clipping, these attenuation levels were recorded and later used to scale the stimuli accordingly. This was done to preserve any differences in overall intensity that may serve as acoustic cues for a particular emotion. This scaling was done using Audition (Adobe, Inc., San Jose, CA). After the recordings were made, each pair of sentences for an emotion was played to the participant, who then selected his or her best expression of each emotion. No further instructions were given. The recorded sentences were then saved as 38 individual files (2 speakers X 19 emotions X 1 best sentence/emotion) for use in the following perceptual task.

Listeners

Sixteen participants (8 male, 8 female) with a mean age of 21 years (range of 18-27 years) were recruited to participate in this study. The eligibility criteria and screening procedures were identical to those described in Experiment 1. In addition, listeners were paid $60 for their participation if they completed the study.

Procedures

Participants judged the emotional stimuli in a same-different discrimination task. In each trial, listeners were presented two stimuli separated by an inter-stimulus interval of 500 ms. Each pair of stimuli contained either two sentences of the same emotion (matched trial) or different
emotions (unmatched trial). Participants were asked to determine whether the two sentences conveyed the same emotion or different emotions by selecting the appropriate button on the computer screen. Judgments were made using software developed in MATLAB 7.1 (Mathworks, Inc., Novi, MI).

A total of 19 emotions were evaluated in this experiment. This resulted in 4 possible combinations of each unmatched pair of emotions (2 speakers X 2 emotions). Each of these was repeated twice, thereby resulting in 8 trials for each emotion pair or 2736 unmatched trials (total number of unmatched pairs for 19 emotions is 19 x 18; 19 x 18 total emotion pairs x 4 combinations of unmatched pairs x 2 repetitions = 2736 trials). Similarly, 4 combinations of each matched pair of emotions were possible (Set 1: 2 same speakers X 1 emotion + Set 2: 2 different speakers X 1 emotion). The emotion pairs in Set 1 were repeated 4 times, resulting in 8 matched pairs for each emotion. In addition, the emotion pairs in Set 2 were repeated 9 times, resulting in 18 matched pairs for each emotion. The pairs in Set 1 were repeated fewer times because these trials contained the same sample in both intervals. While both types of matched trials were included, the Set 1 trials were included to provide an evaluation of the entire stimulus set. The Set 2 trials were assumed to elicit a more appropriate assessment of matched trials from listeners. This resulted in a total of 26 matched pairs for each emotion or 494 matched trials (19 emotions x 26 matched pairs/emotion). Hence, a total of 3230 trials were tested in this experiment (2736 unmatched trials + 494 matched trials). Only fifteen percent of the trials were "same" trials since it was hypothesized that a number of emotions would not be perceivable in speech.
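The trial counts above can be checked with a short calculation. The sketch below simply restates the design arithmetic from the text; the variable names are illustrative and not taken from the original software.

```python
n_emotions = 19
n_speakers = 2

# Unmatched trials: ordered emotion pairs x speaker combinations x repetitions.
ordered_pairs = n_emotions * (n_emotions - 1)     # 19 x 18
speaker_combos = n_speakers * n_speakers          # 4 combinations per pair
unmatched = ordered_pairs * speaker_combos * 2    # each combination repeated twice

# Matched trials per emotion.
set1 = 2 * 4   # same speaker in both intervals: 2 pairs x 4 repetitions
set2 = 2 * 9   # different speakers: 2 orders x 9 repetitions
matched = n_emotions * (set1 + set2)

total = unmatched + matched
share_matched = matched / total                   # proportion of "same" trials
```

Evaluating these expressions gives 2736 unmatched trials, 494 matched trials, and 3230 trials in total, of which roughly 15% are matched.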
Specifically, it was estimated that at least four emotions (exhausted, interested, jealous, and respectful) would not be perceivable in speech, since previous research that examined these emotions has found their perceptual accuracy to be small (Liscombe et al.,
2003), or these emotions have not been studied in speech. Therefore, only 1680 of the 2736 unmatched trials were predicted to be perceived as unmatched trials. A balanced presentation of unmatched and matched trials was used to avoid biasing subjects to respond "same" or "different" (Macmillan and Creelman, 1991, p. 80). Since the total number of perceived unmatched trials was estimated as 1680, the remaining 1056 unmatched trials were predicted to be perceived as matched trials. To create a balanced presentation of unmatched and matched trials, an additional 494 matched trials were presented. Stimuli were presented in random order binaurally at a comfortable loudness level using headphones (Sennheiser HD280Pro) connected to the E-MU 0202 external sound card of a PC. Each listener was tested in 6 sessions of 2 hours each.

Signal Detection Analysis

To estimate the accuracy of listener performance, the frequencies with which each matched stimulus pair was correctly and incorrectly perceived (correct rejections and false alarms, respectively) and each unmatched stimulus pair was correctly and incorrectly perceived (hits and misses, respectively) were summed across all participants. These frequencies were then recorded in 2 x 2 tables for each pair of emotions. The hits and misses were calculated based on the combined judgments of the forwards and backwards presentations of an unmatched pair (e.g. AB and BA). The correct rejections and false alarms were computed as the sum of the matched presentations of both emotion pairs (e.g. AA and BB). These frequency counts were then converted into proportions to calculate the hit rate and false alarm rate. A percent correct analysis reports listener accuracy as the hit rates between emotions without including the false alarm rate. Therefore, this measure may be an inaccurate indicator of listener performance.
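As a rough illustration of the tabulation just described, the sketch below pools same-different responses for one emotion pair into a 2 x 2 table and converts the counts into a hit rate and a false alarm rate. The trial tuples, the function name `sdt_rates`, and the responses are hypothetical and are not the study's data or code.

```python
from collections import Counter

def sdt_rates(trials, a, b):
    """Pool responses for emotions a and b into a 2 x 2 table; return (H, F, table)."""
    table = Counter()
    for first, second, said_different in trials:
        if {first, second} == {a, b}:                  # unmatched trial: AB or BA
            table["hit" if said_different else "miss"] += 1
        elif first == second and first in (a, b):      # matched trial: AA or BB
            table["false_alarm" if said_different else "correct_rejection"] += 1
    hit_rate = table["hit"] / (table["hit"] + table["miss"])
    fa_rate = table["false_alarm"] / (table["false_alarm"] + table["correct_rejection"])
    return hit_rate, fa_rate, table

# Hypothetical responses pooled over listeners: (first emotion, second emotion, said "different").
trials = [
    ("happy", "sad", True), ("sad", "happy", True),
    ("happy", "sad", False), ("sad", "happy", True),
    ("happy", "happy", False), ("sad", "sad", True),
    ("happy", "happy", False), ("sad", "sad", False),
    ("angry", "sad", True),    # belongs to another emotion pair and is ignored here
]
hit_rate, fa_rate, table = sdt_rates(trials, "happy", "sad")
```

Reporting only `hit_rate` corresponds to the percent correct analysis criticized above; keeping `fa_rate` alongside it is what allows a bias-corrected measure to be computed.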
Since the Theory of Signal Detection (Green & Swets, 1966) uses both the hit rate and false alarm rate to estimate the perceived similarity between each pair of emotions as d′, this measure was used in the present experiment to obtain relatively unbiased measures of
emotion similarity. The same-different differencing model was used to calculate d′ (Macmillan and Creelman, 1991, p. 150, Equations 6.4). The differencing model by Macmillan, Kaplan, and Creelman (1977) was selected instead of the independent-observation model because the former is more likely to be used by listeners when given large stimulus sets (Macmillan & Creelman, 1991, p. 148). In the differencing strategy, listeners are assumed to subtract the two observations in a trial and compare the difference to a criterion. If this difference is greater than the criterion, the stimuli are perceived as different. On the other hand, the independent-observation strategy assumes that listeners independently assess the relative likelihood that each sound arose from the first stimulus (S1) or the second stimulus (S2) (Macmillan & Creelman, 1991, p. 148). This results in a large number of likelihood estimations by the listener as the number of stimuli tested increases. The information needed for this calculation is often not available, since the observer usually does not know the size of the stimulus set. The equations used to calculate d′ according to the differencing model are shown below:

$$H = \Phi\left(\frac{d' - k}{\sqrt{2}}\right) + \Phi\left(\frac{-d' - k}{\sqrt{2}}\right) \tag{2-1}$$

$$F = 2\,\Phi\left(\frac{-k}{\sqrt{2}}\right) \tag{2-2}$$

where H is the hit rate, F is the false alarm rate, k is the criterion, and Φ represents the normal cumulative distribution function shown in Table A5.1 in Macmillan and Creelman (1991, p. 318). Equation 2-2 can be manipulated to obtain k using Equation 2-3 below:

$$k = -\sqrt{2}\,\Phi^{-1}\left(\frac{F}{2}\right) \tag{2-3}$$

where Φ⁻¹ is the inverse of the normal cumulative distribution function. D-prime can then be calculated as the value that satisfies Equation 2-1, given H and k, with the limits of d′ as 0 and 6.93. The d′ between each pair of emotions was computed as a measure of the perceptual similarity or perceptual distance between emotions. Smaller d′ values indicate similar emotions (e.g. happy
and surprised are described by a d′ = 0.55). Conversely, larger d′ values between emotions would describe emotions that are dissimilar and easy to discriminate based on the suprasegmental information in speech, such as anxious and lonely (d′ = 5.78). The results are shown in Table 2-3.

Listener Reliability

Inter-listener and intra-listener reliability were measured using Pearson's correlation coefficient in order to evaluate the consistency of listener responses. Inter-listener reliability was computed by performing pair-wise comparisons of each listener's d′ values. The correlations between each pair of listeners were entered into a 16 by 16 similarity matrix. Descriptive statistics were calculated from the lower half of this similarity matrix. The average inter-listener reliability score was 0.66 (standard deviation or SD: 0.08). The correlations ranged from 0.47 to 0.81, all of which fell within 3 standard deviations of the mean. To calculate intra-listener reliability, d′ values were computed for each of the two stimulus presentations for each listener. In other words, one repetition consisted of 8 unmatched pairs for each emotion and 12 matched pairs. Twelve matched pairs were used in the computation instead of the thirteen pairs used in testing in order to divide the stimuli evenly. Recall that the matched pairs were divided into two sets (Set 1: 2 speakers for 2 same emotions or AA and BB + Set 2: 2 different speakers for 2 different emotions or AB and BA) that were presented 8 times and 18 times respectively. Since Set 2 used 9 AB and 9 BA presentations, the reliability was calculated using the first four and following four presentations for each combination, resulting in 4 Set 1 and 8 Set 2 pairs per repetition or 12 matched pairs. The two sets of d′ values were then correlated with each other for all listeners. This resulted in an average intra-listener reliability score of 0.68 (standard deviation or SD: 0.14).
The range of correlations across listeners was 0.36 to 0.85.
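The d′ computation and the inter-listener reliability summary described above can be sketched as follows. This is a hedged illustration, not the original MATLAB analysis: it assumes the standard differencing-model relations from Macmillan and Creelman, H = Φ((d′ − k)/√2) + Φ((−d′ − k)/√2) and F = 2Φ(−k/√2), solves for d′ by bisection between the limits 0 and 6.93 mentioned in the text, and uses invented function names and hypothetical listener values.

```python
from itertools import combinations
from statistics import NormalDist, mean

PHI = NormalDist()      # standard normal: PHI.cdf and PHI.inv_cdf
ROOT2 = 2 ** 0.5

def hit_rate_model(d, k):
    """Hit rate predicted by the (assumed) differencing model for sensitivity d and criterion k."""
    return PHI.cdf((d - k) / ROOT2) + PHI.cdf((-d - k) / ROOT2)

def dprime_differencing(H, F, lo=0.0, hi=6.93, tol=1e-6):
    """Solve the hit-rate equation for d' by bisection on [lo, hi]; requires F > 0."""
    k = -ROOT2 * PHI.inv_cdf(F / 2)        # criterion recovered from the false alarm rate
    if H <= hit_rate_model(lo, k):
        return lo
    if H >= hit_rate_model(hi, k):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if hit_rate_model(mid, k) < H:     # predicted hit rate grows with d'
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def inter_listener_reliability(dprimes_by_listener):
    """Mean Pearson r over all listener pairs (the lower half of the r matrix)."""
    rs = [pearson(a, b) for a, b in combinations(dprimes_by_listener, 2)]
    return mean(rs), rs

# Round trip on hypothetical values: build (H, F) from a known d' and k, then recover d'.
true_d, true_k = 2.0, 1.5
F = 2 * PHI.cdf(-true_k / ROOT2)
H = hit_rate_model(true_d, true_k)
recovered = dprime_differencing(H, F)

# Hypothetical d' vectors for three listeners over the same four emotion pairs.
listeners = [[1.0, 2.0, 3.0, 4.5], [1.2, 2.1, 2.9, 4.0], [0.5, 2.5, 2.0, 5.0]]
mean_r, rs = inter_listener_reliability(listeners)
```

Bisection is used here only because the predicted hit rate increases monotonically with d′ for a positive criterion; a table lookup, as in Macmillan and Creelman's appendix, would serve the same purpose.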
Absolute Discrimination Threshold

An analysis was performed to determine the d′ value that describes the threshold of discrimination. This value can be used as a cutoff to establish the emotions that were perceived as distinct categories. First, a normalized percent correct score was computed for each pair of emotions as follows:

Percent Correct Score = [(Hits + Correct rejections) / (Total number of trials)] x 100    (2-4)

The Hits represented the sum total of correct judgments of unmatched pairs by all listeners and the Correct rejections represented the number of correct judgments of matched pairs by all listeners. Next, the percent correct scores for all emotion pairs were regressed against their corresponding d′ values. Cases that were judged below a d′ = 1.0 (n = 2 cases) were excluded from this analysis. The R-squared values for linear and nonlinear regressions were compared to determine the best fit line to describe the relation between the two measures. In this case, the linear fit provided the highest R-squared value of 0.51. This scatterplot is shown in Figure 2-2.

The d′ scores corresponding to a series of percent correct values were computed based on the regression equation. Then, the number of emotion clusters formed at each clustering level was identified based on the dendrogram. These are shown in Table 2-4. One method of selecting the discrimination threshold might be as the d′ value that corresponded to a normalized percent correct value of 70.7%. This value was based on the transformed up-down procedure by Levitt (1971), which converges at the 70.7% point on the psychometric function. This resulted in a d′ cutoff of 3.56. Although five emotion clusters were formed at this level as seen in the dendrogram, not all of them were discriminable from each other with a d′ of 3.56.
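The mapping from a target percent correct to a d′ cutoff can be sketched as an ordinary least-squares fit followed by inversion of the fitted line. The data points below are hypothetical and the function names are invented for this sketch; the study's actual regression coefficients are not given in the text, so the resulting cutoff is not expected to match the 3.56 reported above.

```python
def percent_correct(hits, correct_rejections, n_trials):
    """Normalized percent correct score for one emotion pair (Equation 2-4)."""
    return 100.0 * (hits + correct_rejections) / n_trials

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept, plus R-squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

def dprime_cutoff(target_pc, slope, intercept):
    """Invert the fitted line to find the d' that corresponds to a percent correct value."""
    return (target_pc - intercept) / slope

# Hypothetical (d', percent correct) values for five emotion pairs, not the study's data.
dprimes = [1.0, 2.0, 3.0, 4.0, 5.0]
pcs = [40.0, 52.0, 61.0, 77.0, 85.0]

slope, intercept, r_squared = fit_line(dprimes, pcs)
cutoff = dprime_cutoff(70.7, slope, intercept)   # d' at the 70.7% point
```

The same inversion applied to other target values (for example 60% or 75%) yields alternative cutoffs, which is how a lower or higher discrimination threshold could be chosen.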
The emotions that were discriminable from each other at this cutoff value included angry, happy, and sad (alternate sets of discriminable emotions were possible by substituting lonely or embarrassed for sad, annoyed for angry, and surprised or funny and anxious for happy). If a larger number of
emotions were desired for further study, a lower cutoff value could be selected. For example, the d′ value that corresponded to a percent correct of 60%, d′ = 2.73, could be selected as the discrimination threshold. Once again, a higher number of clusters were formed at this level (six clusters) than the number of emotions that were discriminable from each other (five emotions: angry, happy, sad, annoyed, and bored). In the present study, the discrimination threshold was selected as 3.9, the d′ value that corresponded to a percent correct of 75%. This resulted in the selection of four emotion categories based on the clustering results: Category 1 (happy), Category 2 (content-confident), Category 3 (angry) and Category 4 (sad). A high discrimination threshold was selected to further investigate the emotions that could be easily perceived in SS.

Clustering Results

An HCS analysis can be used to identify the emotion categories that are well-discriminated in SS by systematically grouping similar emotions. This time, the d′ values served as a measure of dissimilarity between emotions (i.e., larger d′ values signify greater dissimilarity between emotions). Once again, agglomerative clustering was performed using the complete linkage algorithm to group the similar or confusable emotions into categories. The resulting dendrogram or tree structure of clustered items is shown in Figure 2-3. The groups formed at lower clustering levels, such as happy-surprised and jealous-confused, consist of emotions that are highly confusable in SS when listeners must rely on suprasegmental cues alone to discriminate emotions. In contrast, the groups joined at higher clustering levels were more discriminable given this limited information in speech. The results of this clustering procedure show that at least four groups of emotion categories were perceivable in speech even when semantic information is not present (clustering level of 3.90). These four categories were selected because they are formed at the highest clustering levels. The first category included anxious, funny, happy, and surprised. This category
seems to represent the emotions that may be described by the term "happy." The second category may be referred to as "content/confident." It included the emotions annoyed, confident, confused, content, interested, jealous, respectful, and suspicious. The third category consisted of only one emotion, "angry." The final category consisted of sad, bored, lonely, love, embarrassed, and exhausted. The term "sad" may be used to represent this emotion category. The emotions within a category may not be perceptually different from each other; however, listeners were able to differentiate the four emotion categories in SS.

Discussion

With the long-term goal of understanding and modeling the perception of emotions from speech, the outcomes of two experiments are reported. The first experiment involved a data reduction task, with the aim of reducing a large number of emotional terms found in the English vocabulary to a smaller set that may still represent unique emotion categories. This initial data reduction helped in the selection of a small number of emotion terms for further study with speech stimuli. The goal of the second experiment was to evaluate whether the emotional categories selected in Experiment 1 could be discriminated from each other when listeners were provided only the suprasegmental information in speech. This experiment used nonsense speech to isolate the role of suprasegmental information in speech. While both segmental and suprasegmental information are likely to play a role in cuing emotions, our initial focus was to understand how suprasegmental factors alone may influence the perception of emotions. Suprasegmental information was studied first because this is often the primary target of treatment in conditions where vocal emotion expression is impaired. In addition, this information can help improve automatic speech recognition technology and advance the development of speech understanding systems by providing them with the ability to extract emotional meaning
instead of lexical information alone. Future experiments will also evaluate how emotion perception may be affected by the semantic content of speech.

The first experiment used clustering analysis to reduce a set of 70 emotion terms in the English vocabulary to a smaller set for more detailed investigation. Listeners were asked to identify as many unique categories of emotions as they could think of. Note that this experiment was designed to determine the total number of distinct emotion categories that may be discriminable in general and not specifically through speech. It was assumed that the total number of emotion categories perceived through speech may only be equal to or less than this total (this assumption was empirically tested in Experiment 2). Thus, the first experiment was simply a means to select a smaller and more manageable set of emotion terms that could sufficiently represent all emotions that may be perceived in SS. It was therefore acceptable to include a few terms that might represent a single emotion category rather than to attempt to reduce redundancy and inadvertently leave out an emotion category. This goal is in contrast to other studies such as Shaver et al. (1987) and Wierzbicka (1992), which sought to determine the total number of basic or universal terms to describe emotion concepts. These researchers attempted to find a minimal, irreducible set of emotions from all emotion terms in a language (e.g., Shaver et al., 1987, narrowed down a set of over 558 terms to 135 terms for further study, and Scherer, 1984, examined 235 terms). Instead, the purpose of Experiment 1 was to find a small but not necessarily minimal set of emotions that could be further examined to determine which of these may be basic to or minimally perceivable in SS. The results of the hierarchical clustering showed that 19 emotion categories were sufficient to capture a wide range of emotions. The emotion terms chosen to represent each of these emotion categories were the terms that were most frequently selected as the modal emotion.
Selection of the modal emotion term in the present experiment showed that love was the emotion term most frequently chosen as a modal term (by 16 of the 18 participants), followed by happy (15 of 18 participants), exhausted (13 of 18 participants), and sad (12 of 18 participants). The selection of love, happy or joy, and sad as separate categories was confirmed by the literature, as most studies included all three of these emotions as part of their basic set (Shaver et al., 1987; Fehr & Russell, 1984).

Several of these emotion categories were the same as those classified as basic emotions in other studies. For example, the 6 emotions suggested to form the basic level of a hierarchical arrangement of emotion concepts by Shaver et al. (1987) were also found in the results of the present experiment. These emotions include love, joy (happy), surprise, anger, sadness, and fear, assuming anxious and fear represent similar categories. Seven of the 19 emotions were also selected as part of Fehr and Russell's (1984) list of 20 target emotions. These emotion terms were obtained by asking 200 participants to identify as many items that fall under the category "emotion" as possible in one minute. The top 10 emotions common to all participants were selected as target emotions in addition to 10 other emotions chosen by the authors to represent "a broad range of degree of membership" (p. 468). The emotions that overlapped with the present set were happy, angry, sad, anxious, embarrassed, bored, and love. Neither Wierzbicka (1992), Alvarado (1998), nor Lutz (1982) described the basic emotion concepts with commonly used emotion terms, and therefore, their results were not comparable with those of the present study.

One other study sought to empirically determine a basic set of emotion vocabulary for examination in speech. In this study by Cowie et al. (1999), participants were first asked to rate the commonality and psychological simplicity of a set of 45 terms. The initial list was formed mainly from other lists published in the literature. These ratings were used to determine a set of
16 candidates to constitute a basic emotion vocabulary. Then, participants were provided this list separated from the remaining emotion terms and were asked to modify the final list of 16 terms as needed to construct a basic emotion vocabulary. Analysis of the most frequently selected terms revealed a "Basic English Emotion Vocabulary" consisting of 17 terms. Seven of these terms, happy, angry, sad, interested, pleased, relaxed, and worried, were chosen by the majority of subjects. The remaining 10 (affectionate, afraid, content, excited, bored, confident, amused, loving, disappointed, and satisfied) were selected by at least half of the participants. Although the total number of emotion terms selected is approximately the same as the present results, only eight of these terms were the same as those selected in the present experiment: happy, angry, sad, interested, content, confident, bored, and love. The remaining nine terms were in fact included in our initial set of 70 terms, but were not selected to represent unique emotion categories.

The discrepancy in the terms selected by each experiment probably lies in the instructions given to participants. Cowie et al. (1999) included emotion terms that were judged as either common or psychologically simple regardless of whether the terms were similar enough to describe a single emotion category. As a result, emotion terms such as pleased, relaxed, and satisfied were included in their vocabulary (terms that formed the content category in the present experiment), but terms such as funny, respectful, exhausted, confused, suspicious, embarrassed, surprised, annoyed, jealous, anxious, or lonely were not. In the present experiment, participants were instructed to first group the terms that described similar emotional experiences, and then choose the term that best represented each grouping. As a result, only the common emotion terms that described different emotion categories were identified.

Since the goal of Experiment 1 was to determine a set of emotions that represented relatively distinct emotion categories, it was presumed that our set of emotion terms would
include the terms selected by other researchers along with other terms that represent different emotions that were not considered to fall within the minimal, basic set by other researchers. This was indeed the case. Emotion terms representing emotional experiences such as happiness, anger, sadness, love, and anxiety were seen in the literature as well as within the present set. In addition, a number of terms within the minimal set of emotions in Experiment 1 have seldom been recommended as basic in previous research. These terms include annoyed, confused, funny, exhausted, respectful, lonely, and suspicious. This suggests that the data reduction procedure in the present study was able to capture emotions that are not represented by existing lists of basic emotions. These 19 clusters (shown in Table 2-2) can be used to find the emotion terms that are associated with similar emotional experiences. The results of studies using different emotion terms that have been found to represent similar emotional experiences (e.g., excited and happy) can then be compared. Also, the intra-cluster differences in emotions (e.g., worried and anxious) can be evaluated to determine whether these clusters truly represent discrete categories.

These 19 emotion categories were then used in a perceptual task to determine the number of emotions that are discriminable in SS. Listeners judged the emotional expressions of two acting students in a same-different discrimination task. A discrimination task was used to avoid assignment of labels to emotion categories as required by the identification paradigm, thereby avoiding the inflated accuracy rates that may result from elimination strategies. In addition, the discrimination task permitted data analysis using d′, a more objective measure of perception than others such as percent correct scores. Analysis of perception using d′ is important because percent correct analyses do not consider listener response bias (i.e., the number of times that matched trials are perceived incorrectly) and therefore do not provide an accurate description of
listener perception. An analysis of the d′ values suggested that five emotions were discriminable in SS above the threshold of discrimination (d′ = 2.73). Then, clustering analysis was performed using the d′ scores to determine the hierarchical structure of the 19 emotions based on their perceived similarity. The results of the dendrogram showed that a minimum of four emotion categories were perceptually distinct (clustering level of 3.90) in speech when only the suprasegmental information was available to listeners: happy, sad, angry, and confident. In other words, at least four emotion categories can be perceived in SS. It is possible that more than four emotion categories can be perceived considering that five emotions had d′ values greater than the threshold of discrimination. For example, anxious was merged within the happy category and bored was merged within the sad category at high clustering levels (3.67 and 3.42, respectively). However, evidence suggests that these four emotions were easiest for listeners to discriminate on average. In addition, more researchers have selected happy, sad, and angry as basic (5/11, 8/11, 7/11 researchers, respectively) than bored and anxious (3/11 researchers for each; refer to the Emotion terms Section).

Further support for the selection of a minimal set of four emotions in SS is provided by perceptual studies that have found high recognition rates for happy, sad, angry, and confident when they were not studied alongside emotions that were confusable with them. For example, Paulmann and Kotz (2008) reported relatively low recognition accuracy of happy (58%) and pleasant surprise (55%) from an identification task of seven emotions (anger, disgust, fear, happy, neutral, pleasant surprise and sad) when presented using pseudo-speech. On the other hand, Pereira et al. (1998) studied the perception of happy instead of both happy and surprised (emotion set: happiness, sadness, hot anger, cold anger, and neutrality) using semantically neutral sentences and found that listeners were able to recognize 85% of these stimuli. The
results of these studies are in agreement with the outcomes of the present study, since happy and surprised were difficult to discriminate in SS (clustering level of 0.55). It is likely that both happy and surprise represent the same perceptual category in SS and, as a result, the inclusion of two terms to represent one emotion resulted in a random assignment of responses to each category.

Banse and Scherer (1996) investigated the perception of 14 emotions (hot anger, cold anger, panic fear, anxiety, despair, sadness, elation, happiness, interest, boredom, shame, pride, disgust, and contempt) using pseudo-speech formed from phonemes from several Indo-European languages. In this 14-item forced choice experiment, listeners were able to achieve high identification accuracy of the four emotion categories perceivable in SS. For example, the accuracy of anger was 88% when the confusions of hot anger and cold anger were both counted as being correct. The accuracy of sad was 73% after including confusions with despair. The accuracy of interested (referred to as confident/content in the present study) was 75%. Although the recognition of happy was well above chance (54% correct after including confusions with elation), it may be increased by including the confusions with other emotions (e.g., anxiety) that fell within the happy cluster in the present study. In addition, the low accuracy of emotions such as shame (22%), pride (43%), and disgust (15%) supports the findings of Experiment 1, since these emotions were not selected as representative of unique emotion categories.

Some limitations of the current stimulus set may affect the number of emotions perceived in SS. For example, this number may be dependent upon the number of speakers in the stimulus set. An assessment of perception based on a small number of speakers may not generalize to all speakers. Secondly, it was assumed that listeners would be biased towards responding "same" for both matched and unmatched trials. As a result, fewer matched trials were presented to provide an approximately equal number of matched and unmatched trials. However, results showed that
the false alarm rates were relatively high, which suggests that a larger than expected number of unmatched trials were perceived as having different emotions. Therefore, future discrimination experiments should contain an equivalent number of matched and unmatched trials.

The results of this experiment can be extended by using the dimensional approach. In this approach, emotions are described as continuous properties along a number of dimensions. This technique is useful for finding the underlying perceptual mechanisms used by listeners in discriminating these 19 emotions in SS. The dimensional approach in conjunction with the category approach can provide additional insight into whether these four emotion categories are indeed perceptually distinct in SS. Moreover, this joint approach will enable measurement of emotion magnitudes in addition to emotion type. Assessment of emotion magnitudes has great clinical significance in that it will allow clinicians to objectively measure treatment outcome in terms of the improvement in the quality of patients' productions, our eventual goal.
Table 2-1. List of 70 emotion terms used in Experiment 1.

Emotions List
Affectionate    Confused       Enthusiastic   Hurt          Relaxed
Afraid          Contempt       Excited        Hysterical    Relieved
Aggravated      Content        Exhausted      Indifferent   Respectful
Agony           Courageous     Fearful        Interested    Sad
Amazed          Curious        Funny          Irritated     Satisfied
Amused          Depressed      Gratitude      Jealous       Scared
Angry           Desire         Grief          Joyful        Scorn
Annoyed         Despair        Guilty         Lonely        Shame
Anxious         Disappointed   Happy          Love          Shocked
Apathetic       Disgusted      Hatred         Panicky       Shy
Arrogant        Displeased     Helpless       Pity          Surprised
Bored           Distressed     Hopeful        Pleased       Suspicious
Compassionate   Doubtful       Hostile        Proud         Terrified
Confident       Embarrassed    Humiliated     Rage          Worried

Table 2-2. Nineteen clusters of emotion terms formed using hierarchical clustering scheme analysis in Experiment 1. The modal terms for each cluster are also shown.

Cluster   Modal Term    Remaining Emotion Terms Within Each Category
1         Funny         Amused, Hysterical
2         Content       Relaxed, Pleased, Relieved, Satisfied
3         Love          Affectionate, Compassionate
4         Respectful    Gratitude
5         Happy         Enthusiastic, Excited, Hopeful, Joy
6         Exhausted
7         Confident     Arrogant, Courageous, Proud
8         Confused
9         Bored         Apathetic, Indifferent
10        Suspicious    Doubtful, Shy
11        Embarrassed   Guilty, Humiliated, Shame
12        Sad           Agony, Disappointed, Depressed, Despair, Grief, Hurt, Pity
13        Surprised     Shocked, Amazed
14        Interested    Curious, Desire
15        Annoyed       Aggravated, Displeased, Disgusted, Irritated
16        Angry         Contempt, Hatred, Hostile, Rage, Scorn
17        Jealous
18        Anxious       Terrified, Distressed, Worried, Panicky, Scared, Afraid, Fearful
19        Lonely        Helpless
Table 2-3. Matrix of d′ values for each emotion pair (ang = angry; ann = annoyed; anx = anxious; bor = bored; cfi = confident; cfu = confused; cot = content; emb = embarrassed; exh = exhausted; fun = funny; hap = happy; int = interested; jea = jealous; lon = lonely; lov = love; res = respectful; sad = sad; sur = surprised; sus = suspicious).

ang ann anx bor cfi cfu cot emb exh fun hap int jea lon lov res sad sur sus
angry 0.0
annoyed 3.0 0.0
anxious 22.214.171.124
bored 126.96.36.199.0
confident 2.4 1.7 3.0 3.6 0.0
confused 4.0 2.2 3.3 3.3 1.8 0.0
content 188.8.131.52.92.12.00.0
embarrassed 4.73.35.02.184.108.40.206.0
exhausted 220.127.116.11.18.104.22.168.00.0
funny 22.214.171.124.62.02.72.64.63.50.0
happy 126.96.36.199.188.8.131.52.184.108.40.206
interested 220.127.116.11.18.104.22.168.22.214.171.124.0
jealous 126.96.36.199.188.8.131.52.52.03.03.32.40.0
lonely 184.108.40.206.220.127.116.11.51.94.55.04.33.40.0
love 18.104.22.168.22.214.171.124.126.96.36.199.33.02.30.0
respectful 2.01.72.63.01.51.71.32.188.8.131.52.184.108.40.206.0
sad 220.127.116.11.18.104.22.168.22.214.171.124.53.01.81.52.30.0
surprised 4.02.92.04.126.96.36.199.188.8.131.52.02.94.63.32.74.00.0
suspicious 3.72.44.01.184.108.40.206.220.127.116.11.61.93.03.21.18.104.22.168
Table 2-4. The number of clusters formed at the d′ values for a series of percent correct scores.

Percent Correct   d′ score   No. of Clusters
30%               0.40       19
35%               0.79       18
40%               1.18       17
45%               1.57       13
50%               1.95       11
55%               2.34       9
60%               2.73       6
65%               3.12       6
70%               3.51       5
75%               3.90       4
80%               4.29       3
85%               4.67       3
90%               5.06       3
95%               5.45       2
[Figure 2-1: histogram; x-axis: Number of Categories; y-axis: Frequency; series: Females, Males]
Figure 2-1. Histogram of the number of categories of emotion terms formed by males and females on average.

[Figure 2-2: scatterplot with linear fit y = 0.0767x - 1.9299, R-squared = 0.5099; x-axis: Percent Correct; y-axis: d-prime]
Figure 2-2. Scatterplot of the percent correct scores corresponding to the d′ values for each emotion pair.
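The percent correct score of Equation 2-4 and the mapping from percent correct to d′ through the linear fit of Figure 2-2 can be sketched as follows. The function names and the trial counts are hypothetical. The slope and intercept are the values printed in Figure 2-2, with the intercept taken as negative because that is the only sign consistent with Table 2-4. Values computed this way are close to, but not identical with, the entries in Table 2-4 (about 3.82 rather than 3.90 at 75%), so the table remains the reference for the thresholds used in the text.

```python
def percent_correct(hits, correct_rejections, total_trials):
    """Equation 2-4: percentage of trials that were judged correctly."""
    return (hits + correct_rejections) / total_trials * 100.0


def dprime_from_percent(pc, slope=0.0767, intercept=-1.9299):
    """d' predicted by the linear fit d' = slope * pc + intercept."""
    return slope * pc + intercept


def percent_from_dprime(dprime, slope=0.0767, intercept=-1.9299):
    """Invert the linear fit: percent correct predicted for a given d'."""
    return (dprime - intercept) / slope


# Hypothetical counts for one emotion pair, pooled over all listeners.
pc = percent_correct(hits=90, correct_rejections=60, total_trials=200)

# d' cutoff predicted by the fit for a 75% correct criterion.
threshold_75 = dprime_from_percent(75.0)
```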
[Figure 2-3: dendrogram; x-axis: Emotions (leaf order: hap, sur, fun, anx, ann, cfu, jea, cfi, res, sus, cot, int, ang, bor, emb, lon, exh, lov, sad); y-axis: Clustering Level; category labels: "Happy", "Content", "Angry", "Sad"]
Figure 2-3. Dendrogram of HCS analysis using 19 emotion categories. The four categories that are perceivable in SS are shown (ang = angry; ann = annoyed; anx = anxious; bor = bored; cfi = confident; cfu = confused; cot = content; emb = embarrassed; exh = exhausted; fun = funny; hap = happy; int = interested; jea = jealous; lon = lonely; lov = love; res = respectful; sad = sad; sur = surprised; sus = suspicious).
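The agglomerative, complete linkage clustering that produces a dendrogram of this kind, and the effect of cutting it at a chosen clustering level, can be sketched as follows. This is a minimal illustration rather than the software used in the study. The function names are hypothetical, and the five-emotion dissimilarity matrix is invented for the example; only the happy-surprised value of 0.55 is taken from the text.

```python
def complete_linkage(labels, dist):
    """Agglomerative clustering with complete linkage.

    labels: item names; dist: symmetric matrix of dissimilarities (e.g. d').
    Returns the merge history as (cluster_a, cluster_b, level) tuples, where
    level is the largest dissimilarity between a member of a and a member of b.
    """
    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                level = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or level < best[0]:
                    best = (level, i, j)
        level, i, j = best
        merges.append((
            tuple(labels[k] for k in clusters[i]),
            tuple(labels[k] for k in clusters[j]),
            level,
        ))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


def clusters_at(labels, dist, cutoff):
    """Groups that remain separate when merging stops at the cutoff level."""
    groups = [{name} for name in labels]
    for a, b, level in complete_linkage(labels, dist):
        if level > cutoff:
            break  # complete-linkage merge levels never decrease
        groups = [g for g in groups if not (g & set(a) or g & set(b))]
        groups.append(set(a) | set(b))
    return groups


# Hypothetical dissimilarities; only happy-surprised (0.55) comes from the text.
labels = ["happy", "surprised", "sad", "bored", "angry"]
dist = [
    [0.00, 0.55, 4.8, 4.5, 4.1],
    [0.55, 0.00, 5.0, 4.6, 4.3],
    [4.8, 5.0, 0.0, 1.5, 4.2],
    [4.5, 4.6, 1.5, 0.0, 4.0],
    [4.1, 4.3, 4.2, 4.0, 0.0],
]
merge_levels = [level for _, _, level in complete_linkage(labels, dist)]
groups = clusters_at(labels, dist, cutoff=3.9)
```

With this invented matrix, cutting at 3.9 leaves three groups: happy with surprised, sad with bored, and angry on its own. A lower cutoff leaves more groups, which mirrors how the number of clusters in Table 2-4 grows as the d′ threshold decreases.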
CHAPTER 3
COMPUTING THE EMOTION DIMENSIONS

Background

The overall goal of this research is to further our understanding of emotion communication in speech. To achieve this goal, two experiments were performed in Chapter 2 to understand the relationship among emotions based on their perceived similarity when listeners were provided only the suprasegmental information in American-English speech (SS). While clustering analysis was useful for obtaining the hierarchical structure of 19 discrete emotion categories, this analysis did not explain how listeners make judgments of emotion similarity. In other words, the perceptual strategies used by listeners when making judgments of similarity in SS were still unclear. These perceptual properties can be viewed as varying along a number of dimensions. The emotions can be arranged in a multidimensional space according to their locations on each of these dimensions. For example, a group of cities can be arranged in a two-dimensional space based on their physical location (a North-South dimension and an East-West dimension). This process can be applied to perceptual distances based upon perceived emotion similarity as well.

Although researchers have investigated the dimensionality of emotions in SS, a number of unanswered questions remain. Most importantly, the minimum number of dimensions that are necessary to describe the emotions that can be perceived in SS is still under debate. In addition, the qualities described by each dimension differ across studies. This may be because the number of dimensions is dependent on the number of emotions studied, and this number varies considerably across studies. Hence, the present experiment sought to determine the perceptual characteristics used by listeners in discriminating 19 emotions in SS. This was achieved using a multidimensional
scaling (MDS) procedure. MDS can be used to determine the number of dimensions needed to accurately represent the perceptual distances between emotions. The dimensional approach provides a way of describing emotions according to the magnitude of their properties on each underlying dimension. MDS analysis complements the HCS analysis performed in Chapter 2 by representing the emotion clusters in a multidimensional space. These two analyses together provide a comprehensive description of the perceptual relations among emotion categories. In addition, MDS can provide insight into the perceptual and acoustic factors that influence listeners' perception of emotions in SS. The following sections provide a review of the dimensional approach and how it has been used to describe emotions.

Multidimensional Nature of Emotionality

Emotions can be described as either a set of discrete (i.e., basic) categories or as continuous along a number of perceptual dimensions. The category approach seeks to classify each utterance into a distinct category. The dimensional technique, on the other hand, is used to describe how different utterances vary along a set of constituent dimensions. Emotions are differentiated from each other by their magnitude on one or more dimensions. The dimensions are sets of perceptual properties that vary along a bipolar continuum. In this approach, the number of perceptual dimensions necessary to differentiate between the emotions is determined first, followed by the perceptual and acoustic properties of these dimensions. The dimensional approach is a simplified way of explaining differences among emotion-related phenomena such as the relations among emotion terms or emotion meaning (Davitz, 1964; Scherer, 1984; Fontaine, Scherer, Roesch, & Ellsworth, 2007; Averill, 1975), facial expressions (Ekman, 1972; Russell, 1997; Abelson & Sermat, 1962), moods (Watson & Tellegen, 1985; Frijda, 1986), emotional memory (Bradley, 1994), and emotional experience (Schlosberg, 1941).
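The multidimensional scaling idea introduced in the Background, recovering a spatial arrangement of items from nothing but their pairwise distances, can be illustrated with the city-map analogy. The sketch below uses classical (Torgerson) MDS, a simple metric variant chosen here for illustration; it is an assumption and not necessarily the MDS procedure used in the present experiments. The function names, the power-iteration eigen-solver, and the four-point distance matrix are all hypothetical.

```python
import random
from math import sqrt


def classical_mds(dist, n_dims=2, n_iter=500, seed=0):
    """Classical (Torgerson) MDS: return coordinates whose Euclidean
    distances approximate the given symmetric distance matrix."""
    n = len(dist)
    # Double-center the squared distances to obtain the matrix B.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * n_dims for _ in range(n)]
    for d in range(n_dims):
        # Power iteration for the largest remaining eigenpair of B.
        v = [rng.random() for _ in range(n)]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
        scale = sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate B so that the next pass finds the next eigenvector.
        b = [[b[i][j] - eigval * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords


def euclidean(p, q):
    """Euclidean distance between two coordinate lists."""
    return sqrt(sum((a - c) ** 2 for a, c in zip(p, q)))


# Hypothetical distances among four "cities" at the corners of a 4 x 3 rectangle.
city_dist = [
    [0.0, 4.0, 5.0, 3.0],
    [4.0, 0.0, 3.0, 5.0],
    [5.0, 3.0, 0.0, 4.0],
    [3.0, 5.0, 4.0, 0.0],
]
city_xy = classical_mds(city_dist, n_dims=2)
```

The recovered coordinates reproduce the input distances, although the map may be rotated or mirrored relative to the true layout. This is why the dimensions of an MDS solution have to be interpreted afterwards, for example in terms of perceptual or acoustic properties.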
Although between two and six dimensions have been reported (Bradley, 1994: 3 dimensions; Green & Cliff, 1975: 2 dimensions; Smith & Ellsworth, 1985: 6 dimensions; Davitz, 1969: 4 dimensions; Mehrabian & Russell, 1974: 3 dimensions), many of these studies claim that two independent, bipolar dimensions, activity and valence, can be used to describe most emotion-related phenomena (Schlosberg, 1941; Scherer, 1984; Watson & Tellegen, 1985; Larsen & Diener, 1992; Yik, Russell, & Feldman Barrett, 1999). In other words, these researchers claim that two dimensions can sufficiently represent the important properties for distinguishing between emotions. The activity dimension (also known as activation or arousal) separates items according to their degree of arousal or intensity (Schlosberg, 1954; Block, 1957; Davitz, 1969; Bush, 1973). Items on this dimension span from a highly alert and excited state to relaxed and calm (Russell & Feldman Barrett, 1999). The valence dimension (also known as pleasantness, evaluation, or hedonic tone) divides items that have positive or rewarding characteristics from those having negative or painful ones (Davitz, 1969; Schlosberg, 1954; Block, 1957). These dimensions are most commonly found to account for the most variance in different emotion-related phenomena.

Russell and Feldman Barrett (1999) argued that a two-dimensional construct is necessary to explain core affect, that is, elementary affective states that vary in intensity and are always present. The dimensions they suggest include pleasure (how well one is doing) and activation (sense of mobilization or energy). Higher dimensions were interpreted as outside of the scope of core affect. In other words, they may be useful in distinguishing between emotions that are subordinate to core affect. A graph of the affective feelings of 535 individuals at an arbitrary moment in their day revealed that most two-dimensional combinations of core affect occur, thereby illustrating the necessity of two dimensions. Nevertheless, a number of researchers have
determined that three or more dimensions are necessary to explain differences in emotion-related phenomena. Some of the additional dimensions suggested include potency or power (Fontaine et al., 2007; Osgood, 1969), unpredictability (Fontaine et al., 2007), and dominance or confidence (Russell & Mehrabian, 1977). In 1957, Osgood and colleagues (Osgood et al., 1957) introduced the perceptual task known as the semantic differential as a method of obtaining similarity judgments. In their approach, participants rated the amount of a quality on a number of bipolar scales that characterize one or more dimensions. This approach was used to obtain ratings on a number of scales to estimate differences in word meaning. Multiple factor analyses of these ratings were performed, and results revealed three factors (evaluation, potency, and activity) to describe the universal aspects of meaning.

Mehrabian and Russell (1974, p. 216) found that these same three dimensions (renamed to pleasure, arousal, and dominance) can be used to describe a number of emotional phenomena such as physiological reactions. They performed an experiment to determine the number and type of semantic differential scales to represent each of these dimensions. Results of a principal components analysis showed that six scales for each factor could adequately describe each dimension or factor. The scales found to correlate highest with the pleasure dimension included happy-unhappy, pleased-annoyed, satisfied-unsatisfied, contented-melancholic, hopeful-despairing, and relaxed-bored. The six scales for the arousal dimension included stimulated-relaxed, excited-calm, frenzied-sluggish, jittery-dull, wide-awake-sleepy, and aroused-unaroused. The scales for dominance included controlling-controlled, influential-influenced, in control-cared-for, important-awed, dominant-submissive, and autonomous-guided. These scales have since been used to rate the magnitude of emotion phenomena on these three dimensions by
67 a number of studies (Havlena & Holbrook, 1986; Russell, Weiss, & Mendelsohn, 1989; Bradley & Lang, 1994). The relevant emotion dimensions can be determ ined using the semantic differential task or a number of other analysis techniques such as principal components analysis, multidimensional scaling, or factor analysis. For ex ample, Fontaine et al. (2007) us ed the dimensional approach to study perceived similarity in emotion words in three languages (English, French, and Dutch). A semantic differential task was employed to obtain ratings of 24 emotion terms using 144 scales across six components: appraisals of events, psychophysiological changes, motor expressions, action tendencies, subjective experiences, and em otion regulation. Principal components analysis was then used to find the dimensions that accounted for the greatest variance, which was determined to be evaluation (degree of pleasan tness), potency (degree of control), activation (degree of activity or excitement), and unpredicta bility (R-square of 75.4%). The study by Church, Katigbak, Reyes, a nd Jensen (1998) provided yet another multidimensional solution to explain the differences in emotion terms. In this study, participants sorted 150 emotion-related, Filipi no adjectives into piles. The frequencies of groupings were used as a measure of emotion similarity and were submitted to a multidimensional scaling analysis. Results showed a large improvement in R-square for the addition of a sixth dimension (an additional 11%), and as a result, the sixdimensional solution was reported as most appropriate. The dimensions were described as 1) pleasantness, 2) negative emotions with uncertainty about outcomes-negative emotions with discontent about known outcomes, 3) arousal, 4) negative unstable-ne gative without hope, 5) sa d-angry, and 6) dominance. 

Dimensions of Vocal Emotion Expression

A similarly high variability has been shown in reports of the underlying dimensions that describe emotions expressed in speech. Many of the early studies in the 1960s and 1970s used the
semantic differential technique by Osgood et al. (1957) to obtain judgments of emotion similarity (Uldall, 1960; Davitz, 1964; Huttar, 1968; Green & Cliff, 1975). In these experiments, emotional expressions were rated in terms of their dimensional properties and acoustic properties. The two sets of ratings were then correlated with each other to evaluate the number of dimensions necessary to describe differences among the emotions and, in addition, to determine the acoustic descriptions of each dimension.

For example, Davitz (1964) investigated the number of dimensions required to account for differences in 14 emotions. Listeners rated two semantically neutral sentences on scales of loudness, pitch level, timbre, and speech tempo. Then these sentences were embedded into emotional paragraphs to help induce the emotion in the listener. Listeners rated these paragraphs using the semantic differential technique for three dimensions: valence, strength, and activity. The acoustic ratings were then correlated with the dimension ratings. The correlations were significant for the activity dimension, but not for the valence or strength dimensions.

Huttar (1968) examined differences in the perception of one male speaker's emotions using a number of semantic differential scales. Twelve seven-point scales were used to assess the degree of emotion (bored-interested, excited-calm, passive-active, and strong-weak), the specific type of emotion (afraid-bold, confident-timid, sure-unsure, angry-pleased, and happy-sad), and prosodic features (high-low, loud-soft, and fast-slow). The ratings for each of these three groups were correlated with each other to determine the emotional and acoustic characteristics of the expressions. Results showed that the degree of perceived emotion was positively correlated with fundamental frequency (f0) range and intensity range.
Also, some significant correlations between the acoustic measures and the slow-fast scale were found; however, these did not
include the acoustic measurement of total duration. In other words, measures of speed other than total duration were necessary to describe changes in emotions based on a slow-fast scale.

Uldall (1960) also used the semantic differential technique to investigate attitude types expressed through intonation patterns. Sixteen different intonation patterns were superimposed upon four synthetically generated sentences differing in type (statement, command, yes-no question, and question-word question). The intonation patterns were designed to vary in terms of the range of f0, the direction at the end of the sentence, the f0 contour shape (monotonic or otherwise), and the treatment of weak or unstressed syllables. Listeners judged the intonation patterns using 10 scales (bored-interested, polite-rude, timid-confident, sincere-insincere, tense-relaxed, disapproving-approving, deferential-arrogant, impatient-patient, emphatic-unemphatic, and agreeable-disagreeable). Spearman rank correlations were calculated between each pair of scales, and then a factor analysis was performed on these correlations. Three factors were shown to account for most of the variance in judgments: pleasant-unpleasant (50%), interest-lack of interest (20%), and authoritative-submissive (8-13%).

Pereira (2000) used the same semantic differential technique to determine whether three dimensions (arousal, pleasure, and power) were necessary to describe differences among five emotional states (happiness, sadness, hot anger, cold anger, and a neutral state) using semantically neutral speech. Listeners were asked to rate the expressions using two scales for each of three dimensions. The mean dimension ratings were then correlated with four acoustic measures: mean f0, range of f0, mean RMS energy, and duration. Results were significant for mean and range of f0 and mean RMS energy for the arousal dimension (both male and female speakers) and for the power dimension (male speaker only).
This suggests that the single
dimension of arousal explains differences between the emotions; however, no conclusion about the number of necessary dimensions was drawn.

Besides a correlation analysis of semantic differential judgments, factor analysis and MDS can be used to examine participants' ratings of emotion similarity using a number of scales. In one such study by Green and Cliff (1975), participants were presented speech samples of eleven emotions in a pair-wise fashion. The speech samples consisted of 8-second segments of alphabet recitations expressed by one male acting student. Listeners rated the degree of similarity between the two samples on a scale of 1 to 9, and then using seven rating scales. Multidimensional scaling was used to determine the smallest number of dimensions that listeners used to differentiate between emotions. In addition, a principal components factor analysis of the seven rating scales was performed to supplement the MDS results. MDS suggested that a three-dimensional solution was optimal (pleasant-unpleasant, excitement, and yielding-resisting), although the results of the principal components factor analysis suggested that only two dimensions were relevant based on two important factors, pleasantness and thinness. As a result, the authors recommended a two-dimensional model (including the pleasantness and excitement dimensions) to explain differences between emotions in speech.

Between one and four dimensions are reported as necessary to differentiate among a small number (around 4 to 20) of emotions in SS (Davitz, 1964: 1 dimension; Banse & Scherer, 1996: 3 dimensions, 14 emotions; Cowie et al., 2000: 2 dimensions, 20 emotions). Some experts such as Pakosz (1983) believe that speech provides emotional information on only the activation dimension. In this study, listeners evaluated expressions using multiple semantic differential scales.
Although multiple scales were used to obtain judgments for multiple dimensions, these scales for each dimension were carefully constructed to differ in one dimension only. He
suggested that the scales used by other researchers may have spanned multiple dimensions. The outcomes from his study resulted in the selection of only one dimension, the activation dimension, to explain differences in emotions in speech. He suggested that this may have occurred since his method of emotion assessment was in contrast with studies like Uldall (1960), who used scales that may have unintentionally varied across multiple dimensions. Pakosz's results confirmed the outcomes of other experiments by Davitz (1964), Huttar (1968), and Fonagy (1978), which led him to suggest that suprasegmental information can provide emotional information on only the activation dimension. He further suggested that the combination of suprasegmental and segmental information is required in order to truly perceive an emotion.

Most experts agree that the activation or arousal dimension is necessary for interpreting intonation. For instance, Pittam, Gallois, and Callan (1990, p. 178) stated that arousal is inevitably present to some degree in all voices, normal or pathological. The acoustic cues corresponding to the arousal dimension are also relatively consistent across experiments (Cowie et al., 2001; Kappas, Hess, & Scherer, 1991; Trouvain & Barry, 2000). Still, many scientists argue that both the arousal and valence dimensions describe essential properties of emotions in SS (Bachorowski, 1999; Cowie et al., 2000; Tato, Santos, Kompe, & Pardo, 2002). For example, Cowie and colleagues (2000) developed a software program known as Feeltrace based on a two-dimensional model that includes the activation (ordinate) and evaluation (abscissa) dimensions. In this representation, based on Schlosberg's (1941) circular model of facial expressions, emotion categories were modeled as circular regions in the two-dimensional space.
The circumference of the circle represented full-blown or intense emotional states (e.g., excited, serene, depressed, and terrified) that decreased in intensity as they approached the origin (e.g., content, pleased, bored, and sad).

Tato et al. (2002) used a two-step classification approach to show that the differences in five emotions (angry, happy, sad, bored, and neutral) could be described according to prosodic features (activation dimension) and quality features (pleasure dimension). Fourteen German speakers (nonactors) expressed commands after being induced with one of the five emotional states by performing some context action. Separate neural network classifiers were used for each dimension. First, linear regression was used to reduce the parameters to a smaller set of 10 relevant parameters. Then speaker-dependent and speaker-independent classifications of the emotions were performed for each dimension. The speaker-dependent classifier for the arousal dimension provided a recognition rate of 83.7%, whereas the speaker-dependent classifier for the pleasure dimension was able to distinguish happy from angry at a rate of 73.5% and bored from sad at 66%. The speaker-independent classification predictably resulted in lower recognition rates for arousal (77%) and pleasure (60%). Yet, the perceptual and acoustic correlates for the valence dimension have been inconsistent in the literature (Scherer, 1986; Leinonen, Hiltunen, Linnankoski, & Laakso, 1997; Paeschke & Sendlmeier, 2000; Trouvain & Barry, 2000; Millot & Brand, 2001).

Still, some researchers claim that two dimensions are not sufficient to differentiate among certain emotions (Larsen & Diener, 1992; Laukka & Juslin, 2005; Schröder, Cowie, Douglas-Cowie, Westerdijk, & Gielen, 2001). Other dimensions such as potency and intensity have been suggested, although most of the perceptual and acoustic characteristics of these dimensions seem to overlap with the arousal dimension (Ohala, 1983; Tusing & Dillard, 2000; Pereira, 2000). For example, Schröder et al. (2001) investigated the acoustic correlates of the activation, evaluation, and power dimensions.
Perceptual ratings of spontaneous emotional speech (Belfast Naturalistic Emotion Database; Douglas-Cowie, Cowie, & Schröder, 2000) on these three dimensions were obtained.
Eighty-five percent of these samples were made by female speakers in TV recordings of chat shows and religious programs as well as interviews recorded in studios. The dimensional ratings were then correlated with the acoustic variables. Consistent with the literature, the highest number of correlations was found for the activation dimension. The features that were significantly correlated with at least one dimension were used in a stepwise regression to determine the acoustic patterns of happy, afraid, angry, sad, and neutral on the three dimensions. The active end of the activation dimension was found to correspond to higher f0 mean and range, longer phrases, shorter pauses, larger and faster f0 rises and falls, increased intensity, and a flatter spectral slope. The negative end of the evaluation dimension was associated with longer pauses, faster f0 falls, increased intensity, and more prominent intensity maxima. The power dimension was characterized by lower f0 mean. Other gender-dependent correlations were found (e.g., the f0 rises and falls were less steep in female speakers, the f0 falls had a smaller magnitude, and the intensity was reduced).

The component process model of emotions by Scherer (1986) also required three independent dimensions. In this paper, he further developed his model by predicting the acoustic changes that should occur as a result of different voice types (narrow-wide, lax-tense, full-thin) that vary along three response dimensions: hedonic valence, activation, and power. These voice types are a summarized account of the stimulus evaluation checks performed by an organism (i.e., a series of scans of external and internal stimuli that continuously occur in sequential order). To validate his predictions, he compared them with the outcomes of other acoustic studies in the literature.
The predictions were highly consistent with the literature for joy/elation, sadness/dejection, fear/terror, and rage/hot anger, and moderately consistent with the literature for displeasure/disgust, worry/anxiety, boredom/indifference, and shame/guilt. However, most of
the similarities between the component process model and the acoustic cues in the literature were for the activation dimension (lax-tense voice type) because parameters associated with the other dimensions had not been systematically studied.

Pittam, Gallois, and Callan (1990) performed two experiments to determine the relation between the long-term spectrum of speech and three emotion dimensions: control, arousal, and pleasure. Australian, British, and Italian speakers of English expressed three passages of different emotional scenarios. In the first experiment, principal components analysis was used to determine whether the long-term spectrum could be used to differentiate among emotional passages along the three dimensions. In the second experiment, a multivariate analysis of variance (MANOVA) was used to show a spectral differentiation among emotions that corresponded with the three dimensions. Results showed that the control dimension was correlated with the 0-250 Hz band and the 4-10 kHz band. Energy in the 2-2.5 kHz range was associated with the arousal and pleasure dimensions. A rise in energy in this range was connected to an increase in arousal, but a decrease of energy in this range signaled an increase in pleasure. Nevertheless, the connection between part of the spectrum and the arousal and control dimensions would indicate that at least two dimensions were necessary to distinguish between the emotions presented in the passages.

Overview

A number of studies suggest that only two to three dimensions are required to capture the most relevant properties of vocally expressed emotion categories. However, there is still considerable disagreement in both the number and nature of these dimensions. There are a number of issues that may have caused this variability. First, the number of dimensions may vary based on the number and type of emotion categories tested.
Second, the emotion categories tested usually reflect the emotions considered basic in terms of facial expressions or other such
literature. The results of Chapter 2 suggested that not all of these emotions can be perceived in speech when only suprasegmental information is available to listeners. This might explain why two to three dimensions have been reported to sufficiently explain differences in emotions. It is also possible that two to three dimensions have been found to describe a subset of the emotions that can be perceived in SS even though more dimensions are necessary to describe the differences among all emotions that can be perceived in SS. Since previous dimensional structures were determined using categories that may not reflect what is perceivable in speech, it is necessary to recompute the dimensionality.

The purpose of the present experiment was to determine the number of perceptual dimensions used by listeners to discriminate the emotions that can be perceived in SS (reported in Chapter 2). To this end, a multidimensional scaling analysis was applied to determine the configuration of the 19 emotions in a multidimensional space that best approximated the obtained proximity measures. This technique has been successfully used in the past to determine the underlying dimensions of other perceptual phenomena such as voice quality (Kempster, Kistler, & Hillenbrand, 1991; Shrivastav, 2005), naturalness of synthetic speech (Mayo, Clark, & King, 2005), and emotional responses to music (Bigand, Vieillard, Madurell, Marozeau, & Dacquet, 2005). In addition, Green and Cliff (1975) used multidimensional scaling to determine the necessary dimensions to differentiate among 11 emotions in speech. However, the authors' use of speech samples consisting of segments of alphabet recitations may have resulted in expressions that contained patterns of emotional and linguistic prosody that are unnatural to everyday speech. In the present study, MDS was used to determine the number of dimensions needed to accurately represent the perceptual distances between emotions using sentence-length
stimuli. In addition, the four emotion categories found in Chapter 2 to be perceivable in SS were examined further in the multidimensional space to validate whether they are indeed distinct.

Methods

Previously (in Chapter 2), an experiment was performed to determine a set of terms to describe unique emotion categories. Results suggested that the following 19 emotion terms may represent sufficiently distinct emotion categories: funny, content, lonely, respectful, embarrassed, confident, annoyed, exhausted, confused, bored, suspicious, sad, surprised, interested, angry, anxious, jealous, happy, and love. The experimental procedures for collecting perceptual judgments of these 19 emotions in SS and computing measures of perceptual distances were previously described in Chapter 2. A brief summary of this information is provided below. These data were used in the present experiment to determine the perceptual dimensions used by listeners in discriminating the 19 emotions in SS. A multidimensional scaling analysis was performed to determine the relevant perceptual dimensions to describe all emotions. In addition, hierarchical clustering analyses and MDS analyses were performed separately for each speaker to determine whether individual differences in stimulus elicitation might contribute to differences in emotion perception.

Speech Stimuli

Two students with acting training (1 male, 1 female) were asked to repeat two nonsense sentences in 19 emotional contexts. Nonsense sentences were used to retain only the suprasegmental and syntactic information in the speech signal. No further examples or directions were provided. A head-mounted microphone (Audio-Technica ATM21a) connected to a high-fidelity external sound card (Creative E-MU 0202) was used to obtain the recordings. After all recordings were made, each speaker selected his or her best production of each emotion.
They were saved as 38 individual files (2 speakers X 19 emotions X 1 best sentence/emotion) for use
in the following perceptual task. The overall differences in amplitudes were maintained by documenting the recording levels of each sample and then amplifying the samples appropriately after the recording session using Adobe Audition software.

Listeners

Eight male and eight female participants (16 participants; mean age = 21 years) participated in the listening task. As part of the eligibility criteria, participants were native speakers of American English, had normal hearing bilaterally (air-conduction pure-tone threshold below 20 dB HL at 125 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, and 8 kHz), and had normal emotional health (final score between 0 and 6 on the 17-question Hamilton Rating Scale for Depression test; Hamilton, 1960). In addition, participants did not report being on medication to treat depression, anxiety, bipolar disorder, or any other mental or emotional health disorders. Everyone screened was eligible to take part in the experiment and was paid $60 for completing the study.

Discrimination Task Procedures

Listeners performed a same-different discrimination task using all possible combinations of the speakers and emotions. In each trial, listeners were presented two emotional speech samples separated by an inter-stimulus interval of 500 ms. The stimulus pairs contained either two sentences having the same emotion (a matched trial) or two different emotions (an unmatched trial). The listener's task was to determine whether the two sentences presented in each trial conveyed a matched or unmatched pair by selecting the "same" or "different" button on the computer screen. The software used to present stimuli and obtain responses was developed in MATLAB 7.1 (Mathworks, Inc.). Each listener discriminated all possible combinations of emotion pairs at least twice, resulting in the presentation of 16 unmatched trials and 26 matched trials for every pair of emotions. All 3,230 stimulus pairs were randomly
presented binaurally at a comfortable loudness level using headphones (Sennheiser HD280Pro) that were connected to an E-MU 0202 external sound card of a PC.

Signal Detection Analysis

The proportions correct and incorrect for all trials were then calculated as part of the formula for d′, the Theory of Signal Detection (TSD) measure of discriminability (Green & Swets, 1966). This parameter essentially represents the perceptual distance between emotions. The d′ values were calculated for each pair of emotions using the hit rates (proportion correct of unmatched trials) and false alarm rates (proportion incorrect of matched trials). These values were calculated according to the differencing model (Macmillan and Creelman, 1991, p. 150, Equations 6.4) for each pair of emotions and were then entered into a 19 by 19 similarity matrix. The d′ values were then calculated for each pair of emotions for each speaker and were entered into two 19 by 19 similarity matrices for use in an analysis by speaker. For this analysis, the trials that included both speakers were discarded.

MDS Analysis

Since d′ represents the perceptual distance between two emotions, it is possible to use these distances to form a map depicting the relations between emotions in a perceptual space. However, the dimensionality of this perceptual space was unclear from the d′ analysis alone. Therefore, a multidimensional scaling (MDS) analysis using the ALSCAL algorithm was performed to determine the number of dimensions needed to represent all of the perceptual distances among emotion categories (Young, Takane, & Lewyckyj, 1980). The ALSCAL algorithm used the d′ values as measures of proximity to determine the locations of the various emotion categories within a Euclidean space. The Euclidean distances between emotions were computed as the square root of the sum of the squared distances between emotions on each dimension (Baird & Noma, p. 181, Eqn. 10.3). Large d′ values corresponded to emotions that
were farther apart on one or more dimensions. The perceptual space was constructed through a shrinking and stretching of the original proximity measures to obtain a nonmetric (ordinal-level) solution with minimum dimensionality that satisfied the requirement of monotonicity (Baird & Noma, p. 186-187; Kruskal, 1964). Since the d′ values were calculated using the averaged perceptual judgments across all listeners, the perceptual spaces represented listener judgments on average. This analysis was performed for the three 19 X 19 d′ matrices (Overall, Speaker 1, and Speaker 2).

The SPSS statistical software package (SPSS, Inc., Chicago, IL) was used to compute six ALSCAL solutions differing in dimensionality (one through six dimensions) using each of the three d′ matrices. To determine the optimal number of dimensions, the R-squared (R²) and stress coefficients were calculated for each of the six solutions. The R² gives the proportion of variance accounted for by the MDS solution. Similarly, the stress parameter specifies how well the MDS representation of the perceptual distance measures corresponds to the actual d′ scores. Smaller stress values indicate a better mapping of the distances. Together, these two measures can be used to estimate the solution that sufficiently explains the variance in the data without placing too much stress on the model.

Results

Listener Reliability

Inter-listener and intra-listener reliability were previously computed in Chapter 2. Briefly, the average inter-listener reliability score, in terms of the Pearson's Correlation Coefficient, was 0.66 (standard deviation or SD: 0.08). The correlations ranged from 0.47 to 0.81. These scores fell within 3 standard deviations of the mean. The average intra-listener reliability score was 0.68 (SD: 0.14; range: 0.36 to 0.85).
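
As an illustration of the Euclidean distance, stress, and R² measures described in the MDS Analysis section above, the following is a minimal sketch rather than the SPSS ALSCAL procedure itself. It assumes that stress is Kruskal's stress formula 1 and that R² is the squared correlation between the fitted distances and the monotonically transformed proximities (disparities). The function names, the three-point configuration, and the disparity values are hypothetical and are not data from this study.

```python
import math

def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pairwise_distances(coords):
    """Distances for every unordered pair (i < j) of configuration points."""
    n = len(coords)
    return [euclidean(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]

def stress1(distances, disparities):
    """Kruskal's stress formula 1 (assumed here): sqrt(sum (d - dhat)^2 / sum d^2)."""
    num = sum((d - h) ** 2 for d, h in zip(distances, disparities))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)

def r_squared(distances, disparities):
    """Squared Pearson correlation between distances and disparities."""
    n = len(distances)
    md = sum(distances) / n
    mh = sum(disparities) / n
    cov = sum((d - md) * (h - mh) for d, h in zip(distances, disparities))
    var_d = sum((d - md) ** 2 for d in distances)
    var_h = sum((h - mh) ** 2 for h in disparities)
    return cov ** 2 / (var_d * var_h)

# Hypothetical 2D configuration of three items (not coordinates from Table 3-1).
config = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
dists = pairwise_distances(config)   # pairs (0,1), (0,2), (1,2)

# Hypothetical disparities for the same three pairs.
disps = [3.5, 4.0, 5.0]

print("stress:", stress1(dists, disps))
print("R-squared:", r_squared(dists, disps))
```

A lower stress and a higher R² both indicate that the configuration reproduces the proximities more closely, which is why the two measures are read together when choosing the number of dimensions.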

ALSCAL Analysis

The MDS analysis using the ALSCAL procedure was performed in two steps. First, the optimum dimensionality of the perceptual space was computed. Then, an analysis of the perceptual space was performed to determine the properties of the dimensions and the relations among emotions along each dimension.

Dimensionality

The R² and stress values are shown as a function of the number of dimensions of the MDS solution in Figure 3-1. These results showed that the increase in the R² after the second dimension is small (0.04, 0.03, and 0.02 after the addition of the third, fourth, and fifth dimensions, respectively). The elbow of the stress curve, the visual bend in the curve, also occurred at the 2D solution (stress = 0.24). Typically, these results would suggest that the 2D solution is the optimal model. However, since this model will serve as the basis for future acoustic experiments, higher R² and lower stress measures are desirable. The amount of variance accounted for by the two-dimensional (2D) solution was 0.86, but the addition of the third dimension increased the R² to 0.90. Therefore, the three-dimensional (3D) solution was selected, since it accounted for 90% of the variance in perceptual judgments and also provided a tolerable stress level of 0.12.

Perceptual space

A 3D ALSCAL solution indicated that three distinct perceptual dimensions were sufficient to describe the differences among the 19 emotion categories. The coordinates of each emotion in the multidimensional space are shown in Table 3-1. Each dimension corresponded to a unique set of suprasegmental cues that listeners used to differentiate among the 19 emotion categories. By arranging the emotion categories according to their location on each dimension as shown in Table 3-2, the emotion categories that were perceptually separated according to each dimension
could be determined. Emotions that were farther apart on a dimension were easier for listeners to discriminate. For the purposes of the present study, emotions that were separated by at least 85% of the range for each dimension were considered to be well-separated by that dimension. This value was arbitrary and was simply used to identify the emotions that were equally well-separated on both ends of the continuum. By examining the emotions that were well-separated on each dimension, it was possible to hypothesize which acoustic properties corresponded to each dimension. These predictions were empirically tested in a subsequent experiment (see Chapter 5).

In general, Dimension 1 separated the sad cluster from the happy cluster. In particular, the properties of Dimension 1 were used to discriminate lonely and embarrassed (the low end of the continuum) from anxious and surprised (the high end of the continuum). The emotions at the high end of the continuum were louder and faster than the low end and were characterized by frequent and large pitch changes. The emotions at the low end had more of a smoothly sloping f0 contour that decreased with time. Therefore, the high end may be described by an increasing gross f0 trend, a peaky f0 contour (i.e., frequent changes in the direction of the f0 contour along with a larger rate of change in f0), steep f0 contour peaks, a faster speaking rate, and greater global f0 measures (such as minimum, maximum, and range of f0). The second dimension separated emotions such as angry (low end) from love and happy (high end). Informal listening suggested that this dimension varies primarily according to the vocal quality of the speaker, in addition to a staccato vs. legato prosody. The high end may be associated with breathier voices with a legato prosodic contour, whereas the low end may be characterized by high tension and a prosodic contour spoken in staccato.
The acoustic cues that are predicted to describe this dimension include measures used to describe voice quality, such as spectral tilt, the cepstral peak
prominence, and the vowel-to-consonant ratio. To quantify the staccato-like emphasis of syllables, attack time of the pitch or intensity contour may be a useful measure. The third dimension separated anxious from funny. This dimension may be used to differentiate between positive and negative hesitations or frequency modulations of the syllable rate. This dimension may be described by variations in overall speaking rate or the speaking rate throughout the sentence (speaking rate trend), and pause proportions.

Comparing MDS and HCS results

The results of the HCS analyses (Chapter 2) showed that some of the emotion categories were indeed perceptually discriminable in SS, as shown by the large distances between these categories. On the other hand, certain other emotion terms were difficult to discriminate from each other due to their perceptual similarity and were clustered together. In order to examine the differences between groups of emotions, the results from the clustering analysis in Chapter 2 were applied to the MDS stimulus space. The clustering analysis revealed that at least four emotion categories could be perceived in SS. These four broad categories were plotted in a 3D space as shown in Figure 3-2 to understand the relative locations of these emotion clusters in a multidimensional space. To clearly see which emotion clusters or categories were separated by each dimension, three 2D graphs of each pair of dimensions are shown in Figure 3-3. Graphs A and B demonstrated that Dimension 1 was used to discriminate between Cluster 1 (the happy category) and Cluster 4 (the sad category) emotions. In particular, it is clear that anxious and surprised were perceptually distinct from lonely and embarrassed, as indicated by the relative location of these emotions on the x-axis dimension. Graphs A and C revealed the utility of Dimension 2 in distinguishing between Cluster 3 (angry) and some of the emotions from Clusters 1 and 4.
An interesting result was that Dimension 3 was used to separate the emotions anxious from funny, two emotions within Cluster 1 (happy), as shown in Graphs B and C.
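
The 85%-of-range criterion used above to identify well-separated emotions on a dimension can be sketched as follows. This is only an illustration under stated assumptions: the function name `well_separated_pairs` and the coordinates are hypothetical placeholders chosen for the example, not the values reported in Table 3-1.

```python
def well_separated_pairs(coords, fraction=0.85):
    """Return emotion pairs whose separation on one dimension is at least
    `fraction` of that dimension's full range (max minus min coordinate)."""
    values = list(coords.values())
    threshold = fraction * (max(values) - min(values))
    # Order the emotions from the low end to the high end of the dimension.
    names = sorted(coords, key=coords.get)
    pairs = []
    for i, low in enumerate(names):
        for high in names[i + 1:]:
            if coords[high] - coords[low] >= threshold:
                pairs.append((low, high))
    return pairs

# Hypothetical Dimension 1 coordinates (illustrative only).
dimension1 = {
    "lonely": -2.0,
    "embarrassed": -1.8,
    "bored": -0.5,
    "happy": 0.6,
    "anxious": 1.7,
    "surprised": 2.0,
}

for low, high in well_separated_pairs(dimension1):
    print(low, "vs.", high)
```

With these placeholder values the range is 4.0, so only pairs at least 3.4 units apart are flagged, which picks out the emotions at the two ends of the continuum.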

Speaker Analysis

The ALSCAL model was developed based on listener judgments of samples from two speakers. However, if the speakers varied considerably in their expressions, then a single model based on two speakers may not adequately represent listener perception of emotions. Thus, an analysis was performed to determine whether the emotions expressed by two different speakers were perceived similarly. For this analysis, the five emotions that were differentiated with d′ values greater than the discrimination threshold (happy, sad, angry, annoyed, and bored; see Chapter 2) were used. First, the d′ values for the 10 combinations of emotion pairs resulting from the five emotions selected were determined for each speaker. These are shown in Table 3-3. Then, histograms of the d′ values were formed for each speaker (see Figure 3-4). It can be seen that the emotion pairs from Speaker 1 were easier to discriminate than those from Speaker 2. The mean d′ values for Speaker 1 and Speaker 2 were 6.44 and 5.64, respectively. A correlation analysis was performed to determine the level of association between the perceptual distances or d′ values of both speakers. Neither the Spearman's rho (ρ = 0.345, p > 0.05) nor the Pearson's Correlation Coefficient (r = 0.454, p > 0.05) met significance. Then, the d′ values between emotions for Speaker 1 and Speaker 2 were graphed in a scatterplot (shown in Figure 3-5). Linear regression showed an R-square value of 0.21 between the two speakers. The slope of 0.545 indicated that the samples from Speaker 1 were easier to perceive than the samples from Speaker 2. This relationship was significantly influenced by the sad-annoyed, sad-bored, and sad-happy pairs. Removal of these three pairs resulted in an increased R-squared of 0.82 (also shown in Figure 3-5) and significant Pearson's (0.907, p = 0.005) and Spearman's (0.982, p < 0.01) correlations.
Hierarchical clustering analysis and multidimensional scaling analysis were then performed to determine whether listeners perceived the samples from each speaker differently.
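The correlation and regression figures reported above can be checked against the Speaker 1 and Speaker 2 d′ values listed in Table 3-3. The following is a minimal sketch in plain Python rather than the statistical package used for the original analysis, which is not named here. The function names are illustrative, and the use of average ranks for tied values in the Spearman calculation is an assumption.

```python
from math import sqrt

# d' values copied from Table 3-3, one entry per emotion pair.
PAIRS = ["AG-AO", "AG-BO", "AG-HA", "AG-SA", "AO-BO",
         "AO-HA", "AO-SA", "BO-HA", "BO-SA", "HA-SA"]
SPEAKER1 = [5.017, 6.398, 6.860, 6.860, 5.425, 6.840, 6.860, 6.840, 6.487, 6.860]
SPEAKER2 = [4.901, 5.680, 6.717, 6.840, 5.215, 5.946, 4.848, 6.465, 4.816, 4.944]


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks (assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


# Pairs reported as strongly influencing the relationship.
SAD_PAIRS = {"AO-SA", "BO-SA", "HA-SA"}
keep = [i for i, pair in enumerate(PAIRS) if pair not in SAD_PAIRS]
S1_SUBSET = [SPEAKER1[i] for i in keep]
S2_SUBSET = [SPEAKER2[i] for i in keep]

slope_all, intercept_all = fit_line(SPEAKER1, SPEAKER2)
slope_subset, intercept_subset = fit_line(S1_SUBSET, S2_SUBSET)

print("mean d':", round(mean(SPEAKER1), 2), round(mean(SPEAKER2), 2))
print("all pairs: r =", round(pearson(SPEAKER1, SPEAKER2), 3),
      "rho =", round(spearman(SPEAKER1, SPEAKER2), 3),
      "slope =", round(slope_all, 3))
print("without sad pairs: r =", round(pearson(S1_SUBSET, S2_SUBSET), 3),
      "slope =", round(slope_subset, 3))
```

Squaring the Pearson coefficient gives the R-squared of the corresponding simple linear regression, so the same two calls cover both the correlation and the regression results.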
Clustering results

A hierarchical clustering analysis was completed using the d′ matrix for each speaker to compare the stimuli from each speaker. The dendrograms are shown in Figure 3-6. For Speaker 1, happy and sad were equally well-discriminated from each other and the remaining three emotions. The most similar emotions were angry and annoyed. Similarly, for Speaker 2, happy was highly differentiated from angry/annoyed and sad/bored. However, the emotions bored and sad were relatively more difficult to discriminate from each other for Speaker 2. Angry and annoyed may be two emotion terms that describe very similar emotion categories. While the large cluster that includes both of these emotions may be easy to perceive by most listeners, distinctions between these emotions may be easier to perceive for some speakers than for others. A similar case may occur for bored and sad. These emotions were easier to distinguish using the samples from Speaker 1 than Speaker 2. This suggests that listener perception of three categories (angry, happy, and sad) was relatively similar across the two speakers, but that minor differences in the production of bored and annoyed may remain.

MDS analysis

A multidimensional scaling (MDS) analysis using the ALSCAL procedure was performed using the d′ matrices of five emotions for each speaker. Only one- and two-dimensional solutions were possible due to the small number of cases. The R-squared and stress curves for both solutions were computed for each speaker (Figure 3-7). A two-dimensional (2D) solution was appropriate for both speaker models (R-squared: 1.00 for Speaker 1 and 0.97 for Speaker 2; stress: 0.00 for Speaker 1 and 0.05 for Speaker 2). Note that the R-squared values were high because of the small number of items for each MDS solution.
The stimulus coordinates for each dimension of each speaker's stimulus space were correlated with each other to determine the relation between the emotions along each dimension in the multidimensional space. The Spearman correlations were significant for Dimensions 1 and 2 (rho = 1.0, p < 0.01 and rho = 0.90, p < 0.05, respectively). This suggests that the expressions by both speakers were described by similar perceptual dimensions. The graphs in Figure 3-8 illustrate the clear separation of the emotion clusters in the stimulus spaces for both speakers. Dimension 1 separated happy and sad from angry in both speakers. Dimension 2 separated happy from bored and sad for both speakers. These results suggest that the expressions by both speakers were evaluated similarly by listeners, and that individual differences in production across the two speakers were relatively small. If such results hold across a larger number of speakers, then a model for emotion perception based on the utterances of a small number of speakers may generalize well, at least when the number of emotion categories being evaluated is small.

Discussion

Models of emotion perception have attempted to identify the acoustic correlates of a number of emotion categories or emotion dimensions. In the category approach, a set of emotion categories is perceptually evaluated and then analyzed to determine the corresponding acoustic properties. However, the emotion categories studied were not necessarily those that were easily perceived from the suprasegmental information in speech (Banse & Scherer, 1996; Cowie et al., 2000). This has resulted in acoustic models with moderate to poor performance in predicting some emotions in SS. The dimensional view of emotions suggests that emotions can be described by sets of perceptual and acoustic properties or dimensions. Each emotion is represented in a multidimensional space according to its magnitude on each dimension.
Although a number of studies have explored the perceptual dimensions that underlie emotions in speech, there has not been a consensus on the number of dimensions necessary to describe the emotions that can be perceived in speech or the perceptual properties of these dimensions (Pakosz, 1983; Banse & Scherer, 1996; Cowie et al., 2000). One possible cause is variability in the number and type of emotion categories tested across experiments. For example, if the emotions selected do not include the total number of emotions that can be perceived in SS, the size of the obtained multidimensional space may be smaller than the size of the actual space. Other studies (Uldall, 1960; Huttar, 1968; Pereira, 2000) have assumed that the emotions in SS must be described by one or more dimensions from a standard set (evaluation, activation, and power). However, emotions are often summarized according to these dimensions even though these terms may not best describe the properties of each dimension. This may result in insignificant correlations between the speech samples and these dimensions and findings of only one or two significant dimensions (Davitz, 1964; Fonagy, 1978; Pakosz, 1983).

The present experiment attempted to determine the perceptual dimensions that listeners use when discriminating between 19 emotions in SS. These emotion terms were previously identified to represent a variety of unique emotion categories and likely include the emotion categories that can be perceived in SS. A multidimensional scaling analysis was used to determine the number of dimensions needed to perceptually differentiate between these emotions. This technique was useful because it did not require any assumptions about the nature of these dimensions. MDS provided the location of each emotion in the multidimensional space based on the perceptual distances or d′ values between emotions. Results suggested that three dimensions were necessary to account for 90% of the variance in d′ values across 19 emotion terms. The emotions that were easily discriminated from each other were highly separated on a specific dimension. On the other hand, emotions that were not well discriminated were placed closest to the origin of the resulting stimulus space.
For example, emotion terms such as angry, love, happy, and anxious were observed to occupy the extreme ends of the stimulus space, whereas emotions such as respectful, confused, content, jealous, and suspicious were observed to be closest to the origin in the 3D space.

The dimensions of the stimulus space roughly corresponded to arousal, valence, and confidence, as described in the literature (Mehrabian & Russell, 1974; Green & Cliff, 1975). However, these labels may not be accurate representations for these dimensions. For example, although Dimension 2 separated the negative emotion angry from the positive emotions love and happy, it failed to separate sad from the positive emotions. This suggests that Dimension 2 may not simply reflect valence (i.e., positive vs. negative); rather, it may be based on some other shared quality between specific emotion categories (in this case, angry vs. love/happy/sad). Since the underlying shared quality for each dimension is not obvious, a unique descriptive label has not been assigned to each dimension. Instead, these dimensions are simply referred to by their index number. Subsequent research (described in Chapter 5) attempted to identify the acoustic cues for each of these dimensions and formally evaluated whether these dimensions correlated with descriptive terms such as arousal or confidence, as has been reported in previous literature.

Further analysis was performed to determine how well the three-dimensional model could separate the four emotion categories that were determined in Chapter 2 to be discriminable in SS. Results showed that the first two dimensions were able to clearly separate the four emotion categories. In addition, Dimension 3 was used to distinguish between anxious and funny, two emotions within the happy category. This suggests that it might be possible to subdivide the happy category into two emotions. The clustering results reported in Chapter 2 support this possibility, since anxious separated from the happy-surprised-funny cluster at a clustering level of 3.67. This clustering level was slightly lower than the cutoff of 3.90.
However, since these two groups are separated by d′ values lower than the cutoff, only four emotion categories are assumed to be discriminable in SS at the present time.

The ALSCAL algorithm was used to determine the multidimensional perceptual space to describe differences between emotions for the average listener. Two limitations of this study may have affected the results: individual differences in speakers and individual differences in listeners. The first was addressed here by performing an analysis by speaker. Only two speakers were used for model development, since a large number of emotions were studied. It was necessary to obtain some sense of the differences in emotional expression across these two speakers. Correlation analysis, clustering analysis, and MDS were performed on the five most easily discriminable emotions expressed by each speaker. Although clustering indicated that the emotions happy, sad, and angry were perceived similarly, there was a discrepancy in the grouping of bored and annoyed. However, the MDS analysis indicated a clear separation of these emotions in a 2D space, suggesting that these differences may not affect the prediction of four emotion categories from SS. More speakers may be necessary to accurately account for speaker differences in other emotion categories, especially those that were formed at lower clustering levels.

The second limitation of this experiment was the aggregation of judgments across listeners before they were submitted to the ALSCAL analysis. This procedure may have hidden individual differences across listeners. Before attempting to develop an acoustic model of emotions in SS using MDS, it is necessary to determine the extent of individual variations in listener perception of emotions. This was addressed in the following chapter.
Table 3-1. Stimulus coordinates of the 19 emotions in the three-dimensional perceptual space.

Emotion        Dimension 1   Dimension 2   Dimension 3
angry             -0.728         2.714         0.525
annoyed           -0.292         1.198        -0.764
anxious           -2.035        -0.384        -1.447
bored              1.776         1.078         0.103
confident         -1.003         0.618         0.037
confused          -0.315        -0.149        -0.350
content           -0.191        -0.629        -0.229
embarrassed        2.006        -0.012        -0.755
exhausted          1.180        -0.044         0.883
funny             -1.251        -0.739         1.417
happy             -1.704        -1.289         0.489
interested        -0.791        -0.676        -0.197
jealous            0.156         0.714         0.355
lonely             2.299        -0.490         0.298
love               0.909        -1.343         0.555
respectful        -0.197         0.352        -0.143
sad                1.213        -1.037        -0.781
surprised         -1.731        -0.458         0.509
suspicious         0.698         0.575        -0.504
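The observation in the Discussion that poorly discriminated emotions lie closest to the origin of the stimulus space can be checked directly from the Table 3-1 coordinates. The sketch below ranks the 19 emotions by their Euclidean distance from the origin. It is an illustrative check only; the function and variable names are not part of the original analysis.

```python
from math import sqrt

# Stimulus coordinates copied from Table 3-1 (Dimension 1, 2, 3).
COORDS = {
    "angry": (-0.728, 2.714, 0.525),
    "annoyed": (-0.292, 1.198, -0.764),
    "anxious": (-2.035, -0.384, -1.447),
    "bored": (1.776, 1.078, 0.103),
    "confident": (-1.003, 0.618, 0.037),
    "confused": (-0.315, -0.149, -0.350),
    "content": (-0.191, -0.629, -0.229),
    "embarrassed": (2.006, -0.012, -0.755),
    "exhausted": (1.180, -0.044, 0.883),
    "funny": (-1.251, -0.739, 1.417),
    "happy": (-1.704, -1.289, 0.489),
    "interested": (-0.791, -0.676, -0.197),
    "jealous": (0.156, 0.714, 0.355),
    "lonely": (2.299, -0.490, 0.298),
    "love": (0.909, -1.343, 0.555),
    "respectful": (-0.197, 0.352, -0.143),
    "sad": (1.213, -1.037, -0.781),
    "surprised": (-1.731, -0.458, 0.509),
    "suspicious": (0.698, 0.575, -0.504),
}


def distance_from_origin(point):
    """Euclidean distance of a 3D stimulus coordinate from (0, 0, 0)."""
    return sqrt(sum(c * c for c in point))


# Emotions ordered from nearest to farthest from the origin.
ranked = sorted(COORDS, key=lambda name: distance_from_origin(COORDS[name]))

for name in ranked:
    print(f"{name:<12} {distance_from_origin(COORDS[name]):.3f}")
```

With these coordinates, the five emotions nearest the origin are the same five named in the Discussion (respectful, confused, content, jealous, and suspicious), and angry lies farthest from it.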
Table 3-2. Stimulus coordinates of all listener judgments of the 19 emotions arranged in ascending order for each dimension (AG = angry; AO = annoyed; AX = anxious; BO = bored; CI = confident; CU = confused; CE = content; EM = embarrassed; EX = exhausted; FU = funny; HA = happy; IN = interested; JE = jealous; LN = lonely; LV = love; RE = respectful; SA = sad; SR = surprised; and SS = suspicious).

Dimension 1      Dimension 2      Dimension 3
AX*  -2.04       LV*  -1.34       AX*  -1.45
SR*  -1.73       HA*  -1.29       SA   -0.78
HA   -1.70       SA   -1.04       AO   -0.76
FU   -1.25       FU   -0.74       EM   -0.75
CI   -1.00       IN   -0.68       SS   -0.50
IN   -0.79       CE   -0.63       CU   -0.35
AG   -0.73       LN   -0.49       CE   -0.23
CU   -0.32       SR   -0.46       IN   -0.20
AO   -0.29       AX   -0.38       RE   -0.14
RE   -0.20       CU   -0.15       CI    0.04
CE   -0.19       EX   -0.04       BO    0.10
JE    0.16       EM   -0.01       LN    0.30
SS    0.70       RE    0.35       JE    0.36
LV    0.91       SS    0.58       HA    0.49
EX    1.18       CI    0.62       SR    0.51
SA    1.21       JE    0.71       AG    0.52
BO    1.78       BO    1.08       LV    0.55
EM*   2.01       AO    1.20       EX    0.88
LN*   2.30       AG*   2.71       FU*   1.42

* These emotions were well-separated on each dimension.
Table 3-3. The d′ values between each pair of emotions for the five emotion clusters discriminable above the d′ cutoff of 2.73 are shown for Speaker 1, Speaker 2, and the averaged data (AG = angry; AO = annoyed; BO = bored; HA = happy; SA = sad).

Emotion Pairs   Speaker 1   Speaker 2   Average
AG-AO             5.017       4.901      2.993
AG-BO             6.398       5.680      4.141
AG-HA             6.860       6.717      5.153
AG-SA             6.860       6.840      5.575
AO-BO             5.425       5.215      3.164
AO-HA             6.840       5.946      3.863
AO-SA             6.860       4.848      3.444
BO-HA             6.840       6.465      4.734
BO-SA             6.487       4.816      3.311
HA-SA             6.860       4.944      3.806
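The per-speaker d′ values in Table 3-3 are the input to the hierarchical clustering described under Clustering results. The sketch below runs a small agglomerative clustering on each speaker's values. The linkage rule is not stated in this chapter, so average linkage is used here purely as an assumption, and later merge levels may differ from the dendrograms in Figure 3-6. The function name `average_linkage` is illustrative.

```python
from itertools import combinations

LABELS = ["AG", "AO", "BO", "HA", "SA"]
PAIRS = [("AG", "AO"), ("AG", "BO"), ("AG", "HA"), ("AG", "SA"), ("AO", "BO"),
         ("AO", "HA"), ("AO", "SA"), ("BO", "HA"), ("BO", "SA"), ("HA", "SA")]

# d' values copied from Table 3-3, in the same order as PAIRS.
D_SPEAKER1 = dict(zip(PAIRS, [5.017, 6.398, 6.860, 6.860, 5.425,
                              6.840, 6.860, 6.840, 6.487, 6.860]))
D_SPEAKER2 = dict(zip(PAIRS, [4.901, 5.680, 6.717, 6.840, 5.215,
                              5.946, 4.848, 6.465, 4.816, 4.944]))


def average_linkage(labels, pair_distance):
    """Agglomerative clustering with average linkage (an assumed linkage rule).

    Returns the merges in order as (cluster_a, cluster_b, level) tuples,
    where each cluster is a tuple of labels.
    """
    def lookup(a, b):
        return pair_distance[(a, b)] if (a, b) in pair_distance else pair_distance[(b, a)]

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for ca, cb in combinations(clusters, 2):
            # Mean d' over every cross-cluster pair of emotions.
            level = sum(lookup(a, b) for a in ca for b in cb) / (len(ca) * len(cb))
            if best is None or level < best[2]:
                best = (ca, cb, level)
        ca, cb, level = best
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca + cb]
        merges.append((ca, cb, level))
    return merges


merges_speaker1 = average_linkage(LABELS, D_SPEAKER1)
merges_speaker2 = average_linkage(LABELS, D_SPEAKER2)

for name, merges in (("Speaker 1", merges_speaker1), ("Speaker 2", merges_speaker2)):
    print(name)
    for ca, cb, level in merges:
        print("  ", "+".join(ca), "|", "+".join(cb), f"at {level:.3f}")
```

Under this assumption the first merge is angry with annoyed for Speaker 1 and bored with sad for Speaker 2, which matches the description of the most similar emotions for each speaker.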
[Plot: x-axis, No. of Dimensions (1 to 6); y-axis, 0 to 1; series, Stress and R-square.]
Figure 3-1. R-squared and stress measures as a function of the number of dimensions included in the MDS solution.

[3D scatterplot: axes, Dimension 1, Dimension 2, and Dimension 3; points labeled with the emotion abbreviations.]
Figure 3-2. Three-dimensional ALSCAL solution. The angry cluster is shown as a green square. The emotions that form the sad cluster are in red triangles, the happy cluster are black stars, and the content/confident cluster are blue circles (AG = angry; AO = annoyed; AX = anxious; BO = bored; CI = confident; CU = confused; CE = content; EM = embarrassed; EX = exhausted; FU = funny; HA = happy; IN = interested; JE = jealous; LN = lonely; LV = love; RE = respectful; SA = sad; SR = surprised; SS = suspicious).

[Scatterplots: panels A, B, and C; points labeled with the emotion abbreviations.]
Figure 3-3. Two-dimensional views of the emotion categories perceivable in SS using each pair of dimensions: A) Dimension 1 vs. Dimension 2, B) Dimension 1 vs. Dimension 3, and C) Dimension 2 vs. Dimension 3 (AG = angry; AO = annoyed; AX = anxious; BO = bored; CI = confident; CU = confused; CE = content; EM = embarrassed; EX = exhausted; FU = funny; HA = happy; IN = interested; JE = jealous; LN = lonely; LV = love; RE = respectful; SA = sad; SR = surprised; SS = suspicious).

[Histogram: x-axis, d-primes; y-axis, Frequency (0 to 7); series, Speaker 1 and Speaker 2.]
Figure 3-4. Histogram of the d′ values for each speaker for the five emotion clusters discriminable above the d′ cutoff of 2.73.

[Scatterplots: x-axis, Speaker 1; y-axis, Speaker 2. Panel A: y = 0.5451x + 2.1242, R² = 0.2059. Panel B: y = 0.8719x + 0.4558, R² = 0.8217.]
Figure 3-5. Scatterplot of d′ values for Speaker 1 and Speaker 2 for the five emotion clusters discriminable above the d′ cutoff of 2.73. A) Linear regression using all d′ values, with the sad-annoyed, sad-bored, and sad-happy emotion pairs represented with X markers, and B) linear regression with these three pairs removed from the analysis.

[Dendrograms: y-axis, Clustering Level (3 to 7); x-axis, Emotions. Panel A (Speaker 1) leaf order: AG, AO, BO, HA, SA. Panel B (Speaker 2) leaf order: BO, SA, HA, AG, AO.]
Figure 3-6. Hierarchical clustering of each speaker for the five emotion clusters discriminable above the d′ cutoff of 2.73. Results shown for A) Speaker 1 and B) Speaker 2 (AG = angry; AO = annoyed; AX = anxious; BO = bored; CI = confident; CU = confused; CE = content; EM = embarrassed; EX = exhausted; FU = funny; HA = happy; IN = interested; JE = jealous; LN = lonely; LV = love; RE = respectful; SA = sad; SR = surprised; SS = suspicious).

[Plots: x-axis, No. of Dimensions (1 to 3); panel A y-axis, R-Square; panel B y-axis, Stress; series, Spk 1 and Spk 2.]
Figure 3-7. R-square and stress curves as a function of the number of dimensions in the MDS solution for Speaker 1 and Speaker 2 for the five emotion clusters discriminable above the d′ cutoff of 2.73. Results shown for the A) R-square and B) stress curves.

[Scatterplots: panels A and B.]
Figure 3-8. Two-dimensional stimulus spaces for emotions greater than the discrimination threshold for each speaker. Results for A) Speaker 1 and B) Speaker 2.
CHAPTER 4
INDIVIDUAL AND GENDER DIFFERENCES IN LISTENER DIMENSIONS

Background

Listener accuracy in identifying an emotion from a set of 4 to 20 emotion categories when using suprasegmental cues is typically around 60% (Scherer, 1989; Bezooijen, 1984). These measures are well above chance, suggesting that naive listeners are able to recognize emotions from speech signals when only the suprasegmental information is present. And yet, the recognition rates vary substantially across emotions and experiments. For example, the average recognition accuracy of the 14 emotions studied by Banse and Scherer (1996) was 51%, whereas the average identification accuracy of four emotions studied by Yildirim et al. (2004) was 68.3%. It is possible that the inconsistency in findings is partly due to an inherent variability in how individuals perceive emotions.

In our previous experiments (Chapter 2), a same-different discrimination task was conducted to obtain reliable measures of similarity among 19 emotions in SS. The results from a multidimensional scaling analysis using the ALSCAL algorithm (Chapter 3) showed that three dimensions were necessary to accurately represent the perceptual distances between emotions when data from 16 listeners were averaged. However, it is possible that averaging listener data concealed systematic variations between groups or large differences across individuals. In other words, listeners may vary in terms of the perceptual strategies used to discriminate emotions. Both the magnitude of the listener difference and the distribution of listener variability must be considered. Listener variations must be small and normally distributed if judgments from multiple listeners are to be used to develop a single model for emotion perception that generalizes to all listeners. Hence, the purpose of the present study was to determine whether listeners use similar perceptual strategies when judging the emotion expressed in speech independently of the semantic information. This was achieved through a multidimensional scaling procedure using the INDSCAL algorithm. The following sections review the literature on individual and gender differences in emotions and emotions in SS.

Individual Differences

Research has shown significant differences in how individuals experience and communicate emotions (Nowicki & Duke, 1994; Rafaeli, Rogers, & Revelle, 2007; Bachorowski & Braaten, 1994). In fact, this variability has been seen in about every aspect of emotions. For example, Rafaeli and colleagues (2007) performed a series of studies to investigate the differences in how individuals experience mixed emotional states (i.e., affect synchrony). Participants were assessed on their personality type and tendencies to focus on the valence dimension (degree of pleasantness) over the arousal dimension (degree of activity or excitement) by self-report measures of their mood intensities over a five-day period. Results showed that individuals differ considerably in their experiences of affect synchrony and dimensional focus.

Individuals have also been shown to differ in their experiences and expression of empathy (Hogan, 1969; Mehrabian & Epstein, 1972). Hatfield, Cacioppo, and Rapson (1993) suggested that an individual's ability to mimic or experience emotional contagion may influence his or her responses to the emotional expressions of others, an ability that is important in the experience of empathy. As a result, a substantial amount of work has been dedicated to the measurement of an individual's experience of empathy (e.g., Hogan, 1969; Mehrabian & Epstein, 1972; Davis, 1980).

Many researchers claim that emotions are experienced differently by individuals due to differences in personality (Kring, Smith, & Neale, 1994; Feldman Barrett & Niedenthal, 2004). For example, Larsen and Diener (1987) investigated the differences in the intensity of emotions experienced by a number of participants.
They found that individual differences in affect intensity were related to differences in personality. Gross, John, and Richards (2000) performed two experiments to determine whether an individual's personality will impact his experiences of emotion and hence, his overall (facial and vocal) expressions of emotion. They found that emotional responses of highly expressive individuals were dependent on the emotions experienced for negative emotions.

Much of the research on vocal emotion expression maintains that speakers or encoders differ tremendously in their manner of expression (Banse & Scherer, 1996). As a result, many researchers stress the importance of using a large number of speakers in acoustic experiments (Scherer, 2003). A small number of encoders may affect the accuracy of acoustic measurements within the experiment and the precision of results across experiments (Sobin & Alpert, 1999). Listener perception of emotions, on the other hand, does not seem to be significantly affected by the variability in encoders.

Listeners appear to use similar perceptual strategies in recognizing emotions from SS, but this area is relatively unexplored (Greenberg, Shibuya, Tsuzaki, Kato, & Sagisaka, 2007). For example, Greenberg et al. (2007) performed an experiment to determine the role of fundamental frequency in the perception of the Japanese word n. First, an experiment was conducted to determine the emotional words that could describe the various ways n could be expressed. This resulted in 26 emotional terms that were used in the main experiment. Twelve versions of the word n were expressed by one speaker with three levels of mean f0 height and four different f0 patterns (rising, falling, gradual fall, and rise/fall). Five listeners judged these stimuli using an 8-point rating scale in terms of how well each sample was described by the 26 emotion terms (0 = not at all and 7 = very much).
An individual differences multidimensional scaling (INDSCAL) analysis of these judgments was then performed to determine the dimensionality of these judgments and whether it was appropriate to use the average listener response to represent these judgments. A three-dimensional solution was selected for analysis. It was reported that the listeners weighted each of the three dimensions similarly, thereby allowing calculation of the average score for each of the 26 emotion words.

Gender Differences

In addition to examining individual differences, experiments to understand emotional phenomena often explore the possibility of gender differences. A number of studies have been performed to investigate gender differences in emotional expressions through the face, body, and voice. The overwhelming evidence from a number of emotion-related behaviors such as facial electromyography or EMGs (Lang, Greenwald, Bradley, & Hamm, 1993; Schwartz, Brown, & Ahern, 1980), facial expressions (Hall, 1978; Kirouac & Dore, 1985; Biele & Grabowska, 2006), smiling tendencies (LaFrance, Hecht, & Paluck, 2003; Halberstadt, Hayes, & Pike, 1988; Riggio & Friedman, 1986), skin conductance responses (Notarius & Johnson, 1982), changes in heart rate, and self-reports of expression (Allen & Haccoun, 1976; Balswick & Avertt, 1977; Gross & John, 1995; Kring et al., 1994) suggests that women are more expressive than men.

A number of researchers report emotion-specific gender differences. For example, Mandal and Palchoudhury (1985) have shown that women are more adept at recognizing sadness from the face, whereas men are more skilled at identifying anger. Biele and Grabowska (2006) found similar results. Males perceived a higher intensity of anger in dynamic facial expressions than women. Wallbott (1988) found no difference between males' and females' facial expressions of happiness; however, he did show that women express fear and sadness better than men. He also found that men express anger better than women and suggested that this may be due to differences in social display rules of emotion for men and women. Davis (1980) found significant sex differences on four measures of empathy.
In addition, studies have shown women may have an increased predisposition to synchronize with the emotional expressions of another person (i.e., emotional contagion; Doherty, Orimoto, Singelis, Hatfield, & Hebb, 1995), which may indicate that women are more perceptive of emotion communication signals. Other studies have reported that men and women seem to experience the same emotion, but women are simply more apt to express their emotions through the face and other autonomic system behavior as measured through skin conductance responses (Kring & Gordon, 1998).

Schirmer and colleagues (Schirmer, Kotz, & Friederici, 2002; Schirmer & Kotz, 2003), among others, have confirmed these results with emotion expression through speech. They found that women integrated emotional prosody and emotionality of words earlier than men. The results of Fujita, Harper, and Wiens (1980) are illustrative of the reports of gender differences in vocally expressed emotions: females tend to encode emotions in SS more reliably than males (Mehrabian, 1972; Zuckerman, Lipets, Hall, & Rosenthal, 1975). In addition, their expressions are rated as more effective than those of males (Bonebright, Thompson, & Leger, 1996). Zuckerman et al. (1975) examined the sex differences in encoders and decoders (i.e., listeners). Forty participants expressed the sentence, I have to leave now, with an appropriate prosody that corresponded to nine different emotions. Three practice emotions were expressed first (bewilderment, suffering, and determination) followed by six test emotions (anger, happiness, sadness, fear, surprise, and disgust) that were expressed in random order. Then, 72 listeners judged the expressions using a forced-choice identification task (subjects were divided into two groups and each group judged the expressions from 20 speakers). ANOVAs were performed to determine the effect of speaker sex and listener sex on identification accuracy.
Results showed that females were significantly better at encoding and decoding emotions compared to males, as indicated by high identification accuracy of female speakers' expressions and high recognition rates overall by female listeners.
A number of studies have confirmed the results of Zuckerman et al. (1975) in reporting that women are better at decoding or perceiving emotions from speech than men (Webb & VanDevere, 1985; Rosenthal, Hall, DiMatteo, Rogers, & Archer, 1979; Fujita, Harper, & Wiens, 1980). In a review of the literature in 1978, Hall examined the direction of any gender effect, the effect size, and significance level of 75 studies (55 visual modality only, 13 auditory modality only, and 7 audio-visual). She reported that evidence supports a female advantage in decoding emotion from the three modalities. A female advantage was found for 84% of the 61 studies that demonstrated an effect. The average effect size for the combined audio-visual modalities was 1.02, which was significantly greater than for the visual-only studies (0.32) and auditory-only studies (0.18).

Bonebright et al. (1996) found that females performed better than males at identifying fear and sadness, but males were better at perceiving anger from the vocal expression. Both of these findings replicate the findings in the literature on facial expressions (Wallbott, 1988). Toivanen and colleagues (2005) examined listener perception of four emotions in speech including a neutral expression and found that women had moderately higher discrimination accuracy across the four emotions than men for male speakers (average of 79% as opposed to 74%) and female speakers (81% as opposed to 75%).

Rymarczyk and Grabowska (2007) conducted a perceptual experiment to determine if there were any gender differences in individuals with frontal, temporo-parietal, or subcortical right hemisphere brain damage and age-matched individuals. No conclusions can be drawn regarding the differences between healthy men and women because perceptual accuracy was near ceiling. On the other hand, women with frontal lobe damage and men with subcortical injuries had difficulty understanding the speaker's prosody. The authors suggested that this may indicate differential localizations of affective prosody based on the sex of the individual.

Very little research has examined the perceptual strategies used by listeners when identifying the emotion from the suprasegmental information in speech. Reardon and Amatea (1973) used the semantic differential technique to obtain listener ratings of vocally expressed emotions on three dimensions including general evaluation, social control (i.e., social appropriateness of the expression), and activity. Three scales were used for each dimension. These included pleasant-unpleasant, beautiful-ugly, and sociable-unsociable for the general evaluation dimension; stable-changeable, sober-drunk, and reputable-disreputable for the social control dimension; and fast-slow, sharp-blunt, and excitable-calm for the activity dimension. Six emotions (love, happiness, indifference, sadness, fear, and anger) were expressed by six male and six female speakers. Then, 39 male and 39 female listeners rated the emotions according to the speaker's feelings expressed in his or her tone of voice. Results showed a significant interaction between listener sex and emotion on at least one of the three dimensions for all emotions except fear. On the whole, females rated the emotions as more socially controlled than males. In addition, a significant interaction was found between speaker sex and emotion for all dimensions. Females' expressions were perceived as less controlled and stable than the male expressions. On the other hand, males were rated more negatively and as significantly less active than women. For example, love and sadness were perceived by women as highly active, and anger was perceived more negatively by women. As previously mentioned, this may be due to differences in male-female socio-cultural display rules.
However, it is still unclear whether females are simply better at decoding a speaker's emotion from SS or whether this ability is dependent on the emotion expressed.
Overview

Large inter-personal differences in emotion expression are clearly present. Nevertheless, as shown in Chapter 2, listeners are able to understand at least four emotions from a speaker's voice despite these differences. Previously, a multidimensional scaling (MDS) analysis was used to determine the perceptual dimensions used by a group of listeners when discriminating 19 emotions from SS. Results showed that listeners used three emotion dimensions to discriminate among the 19 emotion categories under study. It was assumed that variations in listener judgments were normally distributed, thus allowing an analysis across all listeners. However, these variations may be large when listeners are asked to judge multidimensional qualities because listeners may use different criteria to weight each dimension. Also, these variations may not be normally distributed. If individual differences in the perception of emotions show a multimodal distribution, then a single model of emotion perception that is based on average data from multiple listeners may not generalize to the population.

Hence, the purpose of the present experiment was to evaluate the distribution of individual differences in the perceptual strategies used to discriminate emotions in SS. This was accomplished using the individual differences multidimensional scaling or INDSCAL algorithm. This algorithm was used to compute the weight or the level of importance given to each dimension by each listener (Carroll & Chang, 1970; Arabie, Carroll, & DeSarbo, 1987) and develop a common stimulus space for all listeners based on the listener weights. The listener weights were then analyzed using descriptive statistics to uncover any differences in perceptual strategies used to discriminate emotions in SS. Knowledge of whether all listeners use similar perceptual strategies for discriminating a speaker's emotions from SS has implications for the modeling of emotional speech.
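The role of the listener weights can be illustrated with the weighted Euclidean distance that underlies the INDSCAL model (Carroll & Chang, 1970), in which each listener's squared distance between two stimuli is the sum over dimensions of that listener's weight times the squared coordinate difference. The sketch below is a hedged illustration only: the two coordinates are borrowed from the ALSCAL solution in Table 3-1 rather than from an INDSCAL common space, and both sets of listener weights are hypothetical.

```python
from math import sqrt


def weighted_distance(x_i, x_j, weights):
    """INDSCAL-style distance: sqrt(sum_a w_a * (x_ia - x_ja)^2)."""
    return sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x_i, x_j)))


# Coordinates for two emotions, reused from Table 3-1 for illustration.
ANGRY = (-0.728, 2.714, 0.525)
HAPPY = (-1.704, -1.289, 0.489)

# Hypothetical listeners: A weights all dimensions equally,
# B gives little importance to Dimension 2.
LISTENER_A = (1.0, 1.0, 1.0)
LISTENER_B = (1.0, 0.25, 1.0)

dist_a = weighted_distance(ANGRY, HAPPY, LISTENER_A)
dist_b = weighted_distance(ANGRY, HAPPY, LISTENER_B)

print(f"listener A: {dist_a:.3f}")
print(f"listener B: {dist_b:.3f}")
```

With equal weights the expression reduces to the ordinary Euclidean distance. Because angry and happy differ mainly on Dimension 2 in Table 3-1, a listener who down-weights that dimension would perceive the pair as much closer together, which is the kind of difference in perceptual strategy the listener weights are meant to expose.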
If the individual variations are normally distributed, data from several listeners may be averaged to form a common model for emotion perception.
Alternatively, a bimodal or multimodal distribution of individual variations in perceptual strategies would indicate that it is necessary to generate a separate model of emotion perception for each group of listeners.

Methods

The present study investigated the individual differences in the perception of 19 emotions in SS. Perceptual judgments of these 19 emotions in SS were obtained in a previous experiment (Chapter 2). Since the perceptual data collection is an essential part of this experiment, a summary of the procedures used to collect these data is presented below. However, the reader is referred back to Chapter 2 for the complete methodological details.

Speech Stimuli

A data reduction task was performed in Chapter 2 to select the emotion terms to study in speech. Through a clustering procedure, 19 emotion terms were selected: funny, content, lonely, respectful, embarrassed, confident, annoyed, exhausted, confused, bored, suspicious, sad, surprised, interested, angry, anxious, jealous, happy, and love. These 19 emotions were expressed by two theater students (1 actor, 1 actress) using two nonsense sentences. After all recordings were made, each speaker selected his or her best production of each emotion. This resulted in 38 individual files (2 speakers × 19 emotions × 1 best sentence per emotion). The speech samples were obtained using a head-mounted microphone (Audiotechnica, ATM21a) and a high-fidelity external sound card (Creative E-MU 0202). Because the recording levels were adjusted for each sample to avoid clipping, the samples were later amplified using Adobe Audition software to maintain the overall amplitudes.

Listeners

Sixteen participants, eight male and eight female (mean age = 23 years), participated in the perceptual experiment. All participants were native speakers of American English. Participants
were also given a hearing screening to ensure normal hearing bilaterally (air-conduction pure-tone thresholds below 20 dB HL at 125 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, and 8 kHz). To ensure that participants had normal emotional health, they were asked to respond to the 17-question Hamilton Rating Scale for Depression (HAM-D-17), given verbally (scores of 0 to 6 are considered normal; Hamilton, 1960). Participants were also asked whether they were taking or would be taking medication to treat depression, anxiety, bipolar disorder, or any other mental or emotional health disorders.

Discrimination Task Procedures

Participants judged the emotional stimuli through a same-different discrimination task. Listeners were presented two sentences in each trial, separated by an inter-stimulus interval of 500 ms. Participants were asked to determine whether the emotions expressed within the two sentences were the same (a matched trial) or different (an unmatched trial) by selecting the appropriate button on the computer screen using software developed in MATLAB 7.1 (Mathworks, Inc.). All stimulus pairs were presented in a random order. Listeners heard the samples binaurally at a comfortable loudness level using headphones (Sennheiser HD280Pro) that were connected to the E-MU 0202 external sound card of a PC.

Signal Detection Analysis

Unlike the analysis performed in Chapter 3, each listener's judgments were analyzed separately. Listener performance was then estimated using d′, the Theory of Signal Detection measure of sensitivity (Green & Swets, 1966). This parameter was chosen because it uses both the hit rate (the proportion of correct discriminations of an unmatched stimulus pair) and the false alarm rate (the proportion of incorrect discriminations of a matched stimulus pair) to describe listener performance. The d′ values for each listener were calculated between each pair of emotions using the differencing model described in Chapter 2 (Macmillan
and Creelman, 1991, p. 150, Equation 6.4). A 19 by 19 matrix of d′ values was computed for each participant.

INDSCAL Analysis

The perceived distances (i.e., d′) were submitted to a three-way multidimensional scaling procedure using the INDSCAL algorithm (Carroll & Chang, 1970) in SPSS (ver. 11.5; SPSS, Inc., Chicago, IL). A Euclidean metric was used, assuming distances were ordinal-level. The goal of the INDSCAL analysis is to determine whether listeners differ in their perceptual strategies for estimating emotions in SS. For a specified number of dimensions, INDSCAL attempts to find the optimal stimulus space that is common to all listeners by scaling the stimulus space according to each listener's judgments. This stimulus space is referred to as the group space. INDSCAL describes the differential importance given to each dimension by each listener by calculating the listener weights for each dimension of the group space. Individual differences are represented through listener weights for each dimension using the following formula (Carroll & Chang, 1970, Eqn. 2, p. 284):

$$d_{ij}^{(k)} = \sqrt{\sum_{r=1}^{R} w_{kr}\,(x_{ir} - x_{jr})^{2}} \qquad \text{(4-1)}$$

This formula represents the weighted distance between objects i and j for subject k. R indicates the total number of dimensions in the solution, w_kr is the weight given to the rth dimension by subject k, and the quantity x_ir − x_jr represents the distance between points x_i and x_j on the rth dimension. Each listener's weights are represented by a vector in the subject weight space that extends from the origin to each point (referred to as individual vectors). Individual differences are reflected by differences in the angle and magnitude of individual vectors in the subject weight space. The magnitude of an individual vector is approximately equal to the amount of variance explained in the data (Carroll & Chang,
1970, p. 289 & 297). Therefore, longer vectors signify a larger amount of variance accounted for by the model. The appropriate number of dimensions for the group space was determined using the customary measures of model fit, R-square (R²) and Kruskal's Stress Formula-1 (stress). R-square is the amount of variance accounted for, or the goodness of fit. The stress value describes the amount of error in the MDS solution's reproduction of the perceptual distances between emotions (i.e., the d′ values). A suitable INDSCAL solution is typically selected by identifying the number of dimensions that results in an "elbow" on the stress curve, i.e., the point after which the stress curve reduces its rate of decline.

Results

Listener Reliability

Inter-listener and intra-listener reliability were previously reported in Chapter 2. In short, the average inter-listener reliability score, measured as the Pearson correlation among listeners, was 0.66 (standard deviation or SD: 0.08; range: 0.47 to 0.81). The average intra-listener reliability score was 0.68 (SD: 0.14; range: 0.36 to 0.85).

INDSCAL Dimensionality

The appropriate number of dimensions was selected as the smallest dimensional model that minimized stress and still accounted for a large amount of variance in the data. The R² and stress measures were plotted as a function of the number of dimensions, as shown in Figure 4-1. The elbow of the stress curve appeared at the fourth dimension (stress = 0.19). A sufficient R² was selected as the point after which the R² curve reached a plateau, indicating that only a small amount of variance would be explained by adding an additional dimension to the MDS solution. The plateau in the R² curve also occurred after the fourth dimension (R² of 0.56). Both the increase in R² and the decrease in stress beyond the fourth dimension were small (less than
0.03). Therefore, the four-dimensional solution was an acceptable representation of the present data.

INDSCAL Group Space

The coordinates of the emotions in the four-dimensional group space are shown in Table 4-1. The stimulus configuration of the emotions in the four-dimensional space can be viewed from six two-dimensional plots, as shown in Figure 4-2. The spread of emotions in the group space suggests that some emotions are well-separated from other emotions. Specifically, the emotions that were separated by more than 85% of the range for each dimension were assumed to be well-separated using the properties of that dimension. The two sets of emotions that were easy for listeners to discriminate using Dimension 1 were happy and surprised versus lonely and embarrassed. Informal listening suggested that the speaking rate of the emotions at the low end (lonely and embarrassed) was much slower than that of the emotions at the high end (happy and surprised). Therefore, Dimension 1 may separate emotions primarily based on the speaking rate. Dimension 2 separated angry from lonely. These emotions differed in the staccato-legato quality of the prosodic contour and possibly in steep peaks or rise-falls in the fundamental frequency (f0) contour. Annoyed was clearly discriminated from love on Dimension 3, which may roughly correspond to gradual changes in f0 over the entire sentence (i.e., gross trend in f0). Listeners used the fourth dimension to differentiate between the emotions anxious and bored. The high end (anxious) was observed to differ from the low end (bored) based on the frequency and duration of pauses (more frequent and shorter at the high end) and possibly vocal quality (breathier at the low end).

INDSCAL Subject Weight Space

The individual listener weights on each of these dimensions show the relative importance given to each dimension by each listener when making decisions of emotion similarity. The four-dimensional listener weight space is shown using six graphs in Figure 4-3. The differences in listener weights can be specified by the angles between individual vectors and the x-axis (Carroll & Chang, 1970, p. 292). A visual analysis of Figure 4-3 demonstrated that the angular separation between any two listeners is small. Listener 2 was the only participant that was observed to be separated from the cluster of listeners on the graphs of Dimensions 1 and 2, Dimensions 1 and 4, and Dimensions 1 and 3. However, this listener was separated from the group by the magnitude of its individual vector on the first two graphs. Recall that short vectors indicate that a smaller amount of variance is explained in the data by those dimensions. Since an angular separation of this listener's vector from the cluster of listeners was apparent on only the graph of Dimensions 1 and 3, this suggests that Listener 2 employed a similar weighting scheme for Dimensions 1, 2, and 4 as other listeners. However, Listener 2 gave more weight to Dimension 3 (weight of 0.47) than to Dimension 1 (weight of 0.37). The distributions of the angles formed by the individual vectors and the x-axis are shown in six histograms in Figure 4-4. Each histogram corresponds to one of the graphs in Figure 4-3. To determine whether the histograms represented a normal distribution of angles, a Kolmogorov-Smirnov D test was performed in SPSS. This test was used to determine whether the distributions of the data were significantly different from the standard normal curve. No significant results were found (Dimensions 1 & 2: Z = 0.849, p >> 0.05; Dimensions 1 & 3: Z = 0.679, p >> 0.05; Dimensions 1 & 4: Z = 0.514, p >> 0.05; Dimensions 3 & 2: Z = 0.612, p >> 0.05; Dimensions 4 & 3: Z = 0.419, p >> 0.05; Dimensions 4 & 2: Z = 0.975, p >> 0.05). This suggests that the listener variations in dimension weighting were normally distributed.
In addition, the average angle was 54.4 degrees (SD: 9.5 degrees), which indicated that the angular distance from the individual vectors to the midline (a vector at 45 degrees) is moderately small for
all six pairs of dimensions. The distance from the midline represents a listener's differential weighting of the x-axis or y-axis dimension. Angles larger than 45 degrees indicated that listeners gave more weight to the y-axis dimension. Although this difference is small, the average listener weight for Dimension 1 (0.56) is larger than those for the other three dimensions (0.39, 0.32, and 0.30, respectively). This is expected because the INDSCAL solution selects the first dimension as the one that accounts for the most variance in the data. Results also show a similar weighting for Dimensions 2, 3, and 4 on average. Individual variations from the group space were also examined by an analysis of the weirdness indices. The weirdness index is a measure of an individual's deviation from the average group weight on all dimensions. Extreme deviations from the group weight, such as weighting a single dimension only, result in a weirdness index of 1. The distribution of weirdness indices is shown in the histogram in Figure 4-5. The distribution of weirdness was skewed to the right as expected (i.e., the indices tend to be low), indicating that the individual weights are generally similar to the average group space.

Analysis of Gender Differences

Visual analysis of the subject weight space in Figure 4-3 demonstrated that the difference in male and female listener weights on each pair of dimensions is small. The female and male listener weights were not clearly separated from each other on any dimension. To confirm these results, statistical analysis was performed using the Mann-Whitney U test. This nonparametric, rank-based test was used to compare two independent groups. A Mann-Whitney U test was performed for each of the four dimensions to determine whether male and female listeners differ in the weights assigned to each dimension. No significant differences were found (Dimension 1: U = 24, p > 0.05; Dimension 2: U = 18, p > 0.05; Dimension 3: U = 28, p > 0.05; Dimension
4: U = 29, p > 0.05), indicating that listener sex does not affect the strategies used to discriminate emotions in SS. These results suggest that listeners use the same dimensions or perceptual strategies when discriminating 19 emotions in SS. To identify any differences in the ease of discriminability of these emotions by gender, separate hierarchical clustering analyses were performed for males and females. The resulting dendrograms are shown in Figure 4-6. By examining different clustering levels, the similarities and differences in the hierarchical structures can be seen. For example, results suggest that four emotions were discriminable by both males and females at a clustering level of 4.0. These results support the selection of four emotion categories from the hierarchical clustering results of all listeners (in Chapter 2). However, at clustering levels below 3.0, different numbers of emotion categories are formed. This may be because the emotions clustered below this level cannot be perceived well in SS. Alternatively, it is possible that gender differences exist in the perception of non-basic emotions. An independent samples t-test was performed to determine whether listeners differed in their mean d′ scores by sex. Results were significant, t(2734) = -3.779, p < 0.01, with a mean d′ of 3.00 for female listeners (SD 1.12) and 3.17 for male listeners (SD 1.17). This difference in mean d′ scores suggests that listeners at some level differ in their perception of emotions in SS; however, these differences may be small for the top four clusters (happy, content-confident, angry, and sad).

Discussion

Models of emotion perception have assumed that individuals perceive emotions similarly. At the same time, scientists have suggested that one of the limitations in generating models of emotion perception is the large individual differences in the perception and expression of emotions.
In addition, gender differences have been reported in the study of various emotion-related phenomena such as facial expressions (Biele & Grabowska, 2006), reliability of encoding emotions in speech (Sobin & Alpert, 1999), and the percent correct accuracy in discriminating emotions (Toivanen, Väyrynen, & Seppänen, 2005). In order to develop a model of emotion perception that can be generalized to all listeners, it was necessary to first determine whether individual differences in the perception of emotions in SS were normally distributed. A normal distribution of individual variations would validate the development of an acoustic model of emotion perception based on averaged listener judgments. The present experiment examined the individual differences in discriminating 19 emotions in SS. This is in contrast to experiments such as those of Zuckerman et al. (1975) and Bonebright et al. (1996), who sought to determine gender differences in the identification accuracy of a set of emotions. Instead, the purpose of this experiment was to identify differences in the perceptual dimensions used by listeners when differentiating emotions in SS. This study was similar to the experiment by Greenberg et al. (2007), who performed an INDSCAL analysis of listener ratings of 12 samples of the Japanese word "n" that varied according to mean f0 height and f0 contour pattern type (rising, falling, gradual fall, and rise/fall). Greenberg et al. (2007) attempted to identify the perceived differences in the emotions of these expressions resulting from the acoustic variations and whether all listeners perceived these acoustic changes similarly. The present study also performed an INDSCAL analysis of listener judgments of 19 emotions; however, the nature of these dimensions was not predefined, as was necessary in the rating scale task in Greenberg et al. (2007). In addition, sentence-length speech samples were used to enhance the ecological validity of the perceptual judgments of these expressions.
First, each listener's perceptual judgments of these 19 emotions were described using d′, a parameter that represents the perceptual distance between two emotions. An INDSCAL analysis
was performed using each listener's set of d′ values. Results suggested that a four-dimensional solution was needed to represent the group space, the stimulus space that was common to all listeners. Since the group space was formed according to each listener's dimension weights, it was predicted that these dimensions may not exactly represent the ALSCAL dimensions, i.e., the solution that was formed from the averaged judgments across listeners. The first two dimensions of the INDSCAL group space were similar to the first two dimensions of the 3D ALSCAL solution determined in Chapter 2. Dimension 1 separated the sad and happy clusters, and Dimension 2 separated angry from the sad and happy clusters. The third INDSCAL dimension separated annoyed from love. The fourth dimension somewhat resembled the third ALSCAL dimension and separated anxious from bored, suspicious, and funny. The similarities between the INDSCAL and ALSCAL solutions suggest that listeners perceive emotions similarly and that a model based on the average data may apply to a larger population with a reasonable degree of accuracy. To better understand the inter-listener differences, the listener weight space was formed. The listener weight space described the amount of importance given to each dimension by each listener. Therefore, points in the listener weight space represented each listener's dimension weights. All listener weights were grouped together on each of the six 2D graphs (Figure 4-3). In addition, histograms of the angular distance between the listener weights and the x-axis (Figure 4-4) suggested that listener differences were normally distributed. This was examined using a Kolmogorov-Smirnov Z test for normality and a Mann-Whitney U test for a difference between the sexes.
These tests revealed no significant differences at the α = 0.05 level, suggesting that no gender differences were present and that the differences in listener weights were normally distributed. Furthermore, the results of separate clustering
analyses of a group of male and a group of female listeners supported a similar model of emotion perception for some emotions at high clustering levels. Both male and female listeners were able to perceive four emotions (akin to happy, angry, sad, and confident) in SS at a high clustering level of 4.1. With the exception of confident, these emotion categories have been frequently examined in the literature on emotions (Greasley et al., 2000; Burkhardt et al., 2005). The emotions happy, angry, and sad were identified as basic in the facial expressions literature (Fridlund et al., 1987), according to prototype theory (Shaver et al., 1987), and according to biological behaviors (Izard, 1977). These emotion categories were also those identified in Chapter 2 to serve as the basic emotions in SS, providing further support for across-listener recognition of these emotions in SS. Together, these results support the use of a single model to describe the perception of four emotion categories based on the average listener. Nevertheless, some literature has demonstrated the presence of gender differences in the processing of happy and angry emotional prosody, such as the cortical locations involved in processing congruent and incongruent emotional prosody and word valence (Schirmer, Zysset, Kotz, & von Cramon, 2004) and the time at which emotional prosody is used during word processing (Schirmer et al., 2002). In addition, a number of studies have shown gender differences in the perception and expression of emotions in speech (Hall, 1978; Toivanen et al., 2005). However, few experiments (Rymarczyk & Grabowska, 2007; Greenberg et al., 2007) have examined individual or gender differences in the perceptual dimensions used by listeners when distinguishing between emotions in SS.
It should be noted that outcomes of the present study did not reject the possibility of gender differences in the perception of the non-basic emotion categories from SS, even though both sexes use similar perceptual dimensions. The potential for gender differences in listener perception of specific emotions may be seen from the
significant results of an independent samples t-test comparison of the mean d′ scores for males and females. However, the nonsignificant results of the Mann-Whitney U test and the finding that the top four emotion clusters were similar for both sexes suggest that gender differences in the listener strategies may be small, especially when discriminating between the top four emotion categories (happy, content-confident, angry, and sad). In summary, the results of an INDSCAL analysis of individual differences as performed in the present study revealed that listeners employed a similar weighting system across dimensions. This suggested that the individual differences in the perceptual dimensions used to discriminate between emotions in SS were small and normally distributed. Furthermore, hierarchical clustering analyses of male and female judgments revealed that both sexes perceived the same four unique emotional categories at a high clustering level. Therefore, a single acoustic model can be developed to describe emotion perception of these four emotions in SS for most listeners.
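As a supplement to the signal detection analysis summarized in the Methods, the following is a minimal sketch of how d′ might be recovered from a hit rate and a false alarm rate under the differencing rule for a same-different task. It assumes the textbook form of the differencing model, in which a listener responds "different" when the absolute difference between the two observations exceeds a criterion k; whether this matches the cited Equation 6.4 term for term is an assumption, and the function names are hypothetical rather than taken from the software used in this study.

```python
from math import sqrt
from statistics import NormalDist

_PHI = NormalDist()  # standard normal distribution


def hit_rate(d_prime, k):
    """P("different" | unmatched pair) under the differencing rule."""
    return (_PHI.cdf((d_prime - k) / sqrt(2))
            + _PHI.cdf((-d_prime - k) / sqrt(2)))


def d_prime_differencing(hit, false_alarm, upper=10.0, tol=1e-6):
    """Estimate d' from hit and false alarm rates (both strictly in 0..1).

    Step 1: the false alarm rate fixes the criterion k, because
            P("different" | matched pair) = 2 * Phi(-k / sqrt(2)).
    Step 2: bisection finds the d' whose predicted hit rate equals `hit`.
    """
    if hit <= false_alarm:
        return 0.0  # no measurable sensitivity
    k = -sqrt(2) * _PHI.inv_cdf(false_alarm / 2)
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if hit_rate(mid, k) < hit:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical rates for one emotion pair and one listener.
print(round(d_prime_differencing(0.86, 0.48), 2))
```

Rates of exactly 0 or 1 would need a correction before being passed in, since the inverse normal is undefined there; the sketch leaves that choice open.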
Table 4-1. Stimulus coordinates of the four-dimensional INDSCAL group space. The 19 emotions are arranged in ascending order for each dimension (AG = angry; AO = annoyed; AX = anxious; BO = bored; CI = confident; CU = confused; CE = content; EM = embarrassed; EX = exhausted; FU = funny; HA = happy; IN = interested; JE = jealous; LN = lonely; LV = love; RE = respectful; SA = sad; SR = surprised; and SS = suspicious).

Dimension 1    Dimension 2    Dimension 3    Dimension 4
LN*  -1.60     AG*  -2.89     AO*  -2.19     BO*  -1.84
EM*  -1.59     AO   -1.20     IN   -1.43     SS   -1.43
SA   -1.02     RE   -1.10     CU   -1.38     FU   -1.24
EX   -0.97     CI   -0.94     JE   -0.90     JE   -0.75
LV   -0.74     JE   -0.78     SS   -0.76     CI   -0.75
BO   -0.65     AX   -0.55     CE   -0.69     IN   -0.52
AG   -0.43     SR    0.21     RE   -0.44     CU   -0.48
SS   -0.36     BO    0.21     EM   -0.25     AG   -0.45
JE   -0.31     EX    0.22     CI   -0.06     EX    0.05
AO   -0.16     LV    0.26     SA    0.11     AO    0.10
RE   -0.05     CU    0.31     SR    0.39     HA    0.10
CU    0.12     SS    0.34     AX    0.48     LN    0.21
CI    0.44     CE    0.43     AG    0.56     RE    0.35
CE    0.53     HA    0.63     HA    0.84     SR    0.36
IN    0.77     FU    0.67     LN    0.87     EM    0.48
AX    1.06     SA    0.72     FU    1.03     CE    0.91
FU    1.39     IN    0.82     BO    1.11     LV    0.97
SR*   1.72     EM    1.13     EX    1.23     SA    1.61
HA*   1.85     LN*   1.51     LV*   1.48     AX*   2.31

* These emotions were well-separated on each dimension.
Table 4-2. Listener weights on each of the four dimensions of the group space. In the original table the listeners are grouped by sex (row groups labeled Males and Females).

Listener    D1      D2      D3      D4
1           0.59    0.35    0.37    0.36
2           0.34    0.45    0.47    0.21
3           0.52    0.38    0.44    0.26
4           0.61    0.42    0.32    0.28
5           0.62    0.39    0.30    0.36
6           0.46    0.52    0.25    0.28
7           0.52    0.48    0.21    0.36
8           0.60    0.28    0.27    0.34
9           0.54    0.31    0.37    0.25
10          0.46    0.41    0.29    0.42
11          0.65    0.38    0.27    0.34
12          0.59    0.38    0.26    0.38
13          0.66    0.39    0.28    0.26
14          0.49    0.34    0.32    0.30
15          0.59    0.38    0.32    0.17
16          0.69    0.37    0.32    0.25
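To make Equation 4-1 concrete, the sketch below applies it to values transcribed from Tables 4-1 and 4-2. It assumes that the tabled listener weights play the role of w_kr directly; SPSS may report weights on a different scale (for example, as square roots of w_kr), so the distances are illustrative only. The column means are also computed as a check against the average weights quoted in the Results. All function and variable names are hypothetical.

```python
import math

# Listener weights transcribed from Table 4-2
# (rows: listeners 1-16; columns: Dimensions 1-4).
WEIGHTS = [
    (0.59, 0.35, 0.37, 0.36), (0.34, 0.45, 0.47, 0.21),
    (0.52, 0.38, 0.44, 0.26), (0.61, 0.42, 0.32, 0.28),
    (0.62, 0.39, 0.30, 0.36), (0.46, 0.52, 0.25, 0.28),
    (0.52, 0.48, 0.21, 0.36), (0.60, 0.28, 0.27, 0.34),
    (0.54, 0.31, 0.37, 0.25), (0.46, 0.41, 0.29, 0.42),
    (0.65, 0.38, 0.27, 0.34), (0.59, 0.38, 0.26, 0.38),
    (0.66, 0.39, 0.28, 0.26), (0.49, 0.34, 0.32, 0.30),
    (0.59, 0.38, 0.32, 0.17), (0.69, 0.37, 0.32, 0.25),
]

# Group-space coordinates (Dimensions 1-4) transcribed from Table 4-1.
HAPPY = (1.85, 0.63, 0.84, 0.10)
LONELY = (-1.60, 1.51, 0.87, 0.21)


def weighted_distance(x_i, x_j, w_k):
    """Equation 4-1: weighted Euclidean distance for one subject."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(w_k, x_i, x_j)))


def mean_weights(weights):
    """Average weight per dimension across listeners."""
    n = len(weights)
    return [sum(row[r] for row in weights) / n
            for r in range(len(weights[0]))]


print([round(m, 2) for m in mean_weights(WEIGHTS)])
for listener in (1, 2):
    d = weighted_distance(HAPPY, LONELY, WEIGHTS[listener - 1])
    print(listener, round(d, 2))
```

Because happy and lonely differ mostly on Dimension 1, a listener who weights that dimension heavily (Listener 1) is modeled as perceiving the pair as farther apart than a listener who weights it lightly (Listener 2).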
Figure 4-1. R-square and stress values as a function of the number of dimensions in the INDSCAL solution. (x-axis: No. of Dimensions, 2 to 6; y-axis: 0 to 1; series: Stress, R-square.)

Figure 4-2. Group space. Stimulus configuration for the 4D INDSCAL solution is shown. Each symbol represents an emotion: AG = angry; AO = annoyed; AX = anxious; BO = bored; CI = confident; CU = confused; CT = content; EM = embarrassed; EX = exhausted; FU = funny; HA = happy; IN = interested; JE = jealous; LN = lonely; LV = love; RE = respectful; SA = sad; SR = surprised; SS = suspicious.
Figure 4-3. Subject weight space. Markers represent each listener's weights (males are shown with a blue x and females are shown with a pink filled o). (Six panels, each with axes from 0 to 0.8 and listeners numbered 1 to 16: Dimensions 1 and 2; Dimensions 1 and 3; Dimensions 1 and 4; Dimensions 3 and 4; Dimensions 2 and 3; Dimensions 2 and 4.)
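The group comparison behind Figure 4-3 can be sketched with a direct computation of the Mann-Whitney U statistic for Dimension 1. The split of Table 4-2 into listeners 1 to 8 and 9 to 16 is an assumption about how the two sex groups are ordered, and no claim is made here about which group is male or female. Because the tabled weights are rounded to two decimals, the statistic need not match the reported value exactly. The function name is hypothetical.

```python
# Dimension 1 weights from Table 4-2.
# ASSUMPTION: listeners 1-8 and 9-16 correspond to the two sex groups.
GROUP_A = [0.59, 0.34, 0.52, 0.61, 0.62, 0.46, 0.52, 0.60]
GROUP_B = [0.54, 0.46, 0.65, 0.59, 0.66, 0.49, 0.59, 0.69]


def mann_whitney_u(a, b):
    """Return (U_a, U_b); a tie between groups counts one half."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


u_a, u_b = mann_whitney_u(GROUP_A, GROUP_B)
u_stat = min(u_a, u_b)  # the smaller U is the one conventionally reported
print(u_a, u_b, u_stat)
```

A p-value would additionally require the exact or tabled null distribution of U for two groups of eight, which is omitted from this sketch.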
Figure 4-4. The spread of the angles formed by the individual vectors and the x-axis is shown in corresponding histograms.
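The angle analysis summarized in Figure 4-4 can be approximated as follows. The sketch converts each listener's pair of weights into an angle from the x-axis and computes a one-sample Kolmogorov-Smirnov statistic against a normal distribution with the sample's own mean and SD, together with Z = sqrt(n) × D, which is assumed to be the form of the Z reported by SPSS. Which dimension lies on the x-axis of each panel is not recoverable from the text, so the axis assignment is an assumption; swapping the axes maps each angle θ to 90° − θ and leaves D unchanged. All names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

# Weights for Dimensions 1 and 2, transcribed from Table 4-2.
W_D1 = [0.59, 0.34, 0.52, 0.61, 0.62, 0.46, 0.52, 0.60,
        0.54, 0.46, 0.65, 0.59, 0.66, 0.49, 0.59, 0.69]
W_D2 = [0.35, 0.45, 0.38, 0.42, 0.39, 0.52, 0.48, 0.28,
        0.31, 0.41, 0.38, 0.38, 0.39, 0.34, 0.38, 0.37]


def angles_deg(x_weights, y_weights):
    """Angle of each individual vector from the x-axis, in degrees."""
    return [math.degrees(math.atan2(y, x))
            for x, y in zip(x_weights, y_weights)]


def ks_normal(sample):
    """One-sample K-S statistic D against a fitted normal; returns (D, Z)."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        cdf = fitted.cdf(x)
        # Compare the fitted CDF with the empirical CDF just above
        # and just below each observation.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d, math.sqrt(n) * d


# ASSUMPTION: Dimension 2 on the x-axis, Dimension 1 on the y-axis.
angles = angles_deg(W_D2, W_D1)
d_stat, z_stat = ks_normal(angles)
print(round(mean(angles), 1), round(d_stat, 3), round(z_stat, 3))
```

Turning Z into a p-value requires the Kolmogorov distribution, which is not included here.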
Figure 4-5. Histogram of the distribution of weirdness indices. (x-axis: Weirdness Indices; y-axis: Frequency, 0 to 7.)
Figure 4-6. Hierarchical structure of emotion similarity for males (in blue) and females (in pink). (Two dendrograms with clustering levels from 0.5 to 5.5, each labeling four top-level clusters: "Happy", "Angry", "Sad", and "Confident". Leaf order in the first panel: hap, sur, fun, anx, ang, ann, jea, res, cot, int, cfi, sus, cfu, bor, emb, lon, sad, exh, lov. Leaf order in the second panel: hap, sur, fun, cot, int, ann, cfi, res, cfu, jea, anx, ang, bor, sus, emb, lon, exh, lov, sad.)
CHAPTER 5
A MODEL OF EMOTION RECOGNITION BASED ON SUPRASEGMENTAL CUES

Background

Although human-machine interaction systems exist, the quality of the interaction is not equivalent to that between two people. This may be due to limitations of the speech generated by the machine, which often sounds unnatural. In addition, machine systems may be able to recognize the literal meaning of words and perform actions accordingly, but they cannot understand the entire human message that is expressed through suprasegmental cues as well as semantic information. For example, a person may speak with urgency in his voice during a 911 call or with hostility in his voice after being charged a late fee when a bill was paid on time. In both instances, the ability of an automated system to recognize the emotion of the speaker and respond with speech containing the appropriate emotion would significantly improve the quality of these interactions. Further, a better understanding of how emotions are communicated in speech would help to improve the rehabilitation of people with emotion communication disorders, since the communication of prosody is often a primary target for treatment (Rosenbek et al., 2004). For instance, this knowledge can be used to provide a theoretical basis for the selection of emotions for use in therapy. In addition, the order in which emotion categories are addressed in therapy can also be determined from an understanding of which emotions can be easily perceived and expressed in speech when only suprasegmental information is present. Despite the efforts to determine the prosodic or acoustic characteristics of emotional speech signals, most researchers report significantly higher accuracy rates for listener perception of certain emotions in SS than for software classifiers based on acoustic measures. For example, the average perceptual and machine classification accuracy of four emotions reported by Yildirim et al.
(2004) was similar (68% and 67%, respectively). However, listener perception of angry
(82%) was significantly better than machine classification of this emotion (54%). Similar results were reported by Toivanen et al. (2006) for tests of five emotions (perceptual accuracy of 38% and machine classification of 25%). The discrepancy between human perception and machine recognition is often attributed to individual differences in speaker characteristics such as those resulting from age, gender, physiological characteristics (e.g., vocal fold and vocal tract characteristics), and personality type (e.g., softly spoken or boisterous and loud); situational contexts such as socio-cultural mannerisms; and the intensity of emotional experiences (Larsen & Diener, 1987; Gross et al., 2000). Despite the large differences in the speech acoustic signal arising from such factors, the results of Chapter 2 suggest that listeners are able to discriminate four emotions well in SS. It was also shown in Chapter 4 that the individual differences in perceptual strategies used by listeners when discriminating these emotions were small (at least for the four basic emotions). Since four emotions were perceived well, this suggests that most listeners were able to attend to the relevant acoustic cues to identify a speaker's emotions. The discrepancy between the high perceptual accuracy and low machine classification accuracy for certain emotions may be due to 1) limitations with the perceptual data collection process (such as computing accuracy in percent correct, which may result in artificially high perceptual accuracy), 2) an inadequate or incorrect acoustic feature selection process, or 3) an inability of the existing algorithms to classify the emotional speech. For example, the inclusion of too few or too many acoustic parameters for classification purposes may reduce classification accuracy.
Classification of emotions based on too many features adds complexity to the classification process, which may result in additional variability in the classification results. On the other hand, use of too few parameters may result in the omission of important features. The purpose of the present experiments was to develop a preliminary acoustic model that describes
the emotions that can be perceived in SS. Specifically, this model was to include only relative acoustic measures, or cues that are computed without the use of a speaker's baseline neutral emotional expression. It was predicted that a model derived from relatively unbiased measures of discrimination accuracy in addition to the use of certain novel acoustic measures would result in an improvement in classification accuracy of these emotions. The following sections first review the experiments that have examined various acoustic features of emotions and then provide a summary of the predicted acoustic correlates to each emotion. The subsequent sections review the smaller body of literature that has examined the acoustic properties of the emotion dimensions.

Acoustic Parameters

Previous research to find the acoustic correlates of emotions expressed in speech has examined four main groups of cues: fundamental frequency (f0), vocal intensity, voice quality, and duration. The following sections describe the parameters investigated within each of these four sets and their predictive relationship with various emotions.

Fundamental frequency

The set of f0-related cues is commonly measured by first generating the f0 contour for the entire utterance under study. From the contour, a number of utterance-level or static measures can be computed, such as mean f0, maximum or max f0, minimum or min f0, range of f0, and standard deviation or SD of f0 (henceforth referred to as the five global parameters), and the number of peaks in the f0 contour. These measures are averages that are calculated across the entire utterance. Once the global parameters have been computed for all emotions under investigation, these measurements for a speaker are typically compared to the speaker's neutral emotion. This comparison essentially normalizes the global f0 measures, since the absolute f0 differs depending on the vocal fold characteristics of each speaker. These normalized changes in
f0 can then be compared across speakers and emotions. In one such study, Scherer (1986) predicted increases and decreases in a number of acoustic parameters relative to a neutral emotion. These predictions were of vocal changes that may occur depending on the outcome of a number of stimulus evaluation checks or SECs. The SECs included checks for novelty, intrinsic pleasantness, goal/need significance, coping potential, and norm/self compatibility. For example, enjoyment/happiness can be described as low novelty, high pleasantness, medium relevance, consistent expectation, high conduciveness, very low urgency, high adjustment, and high external and internal norm compatibility. These predictions were later tested in an acoustic analysis of a number of emotional expressions in an experiment by Banse and Scherer (1996). However, classification of novel utterances based on these predictions has not been performed.

In recent studies such as the experiments performed by Forsell, Elenius, and Laukka (2007) and Yildirim et al. (2004), time-varying or dynamic f0 patterns were measured in addition to global measures. These included measures such as the gross f0 trend, the f0 contour shape (patterns in f0 changes over time), and the derivative of the f0 contour. For example, Greenberg et al. (2007) examined the role of mean f0 and f0 contour shape in the emotional perception of the Japanese word "n." A large number of conversational samples (n = 6,271) of "n" were selected from a database of 23,648 samples spoken by one female. The k-means method for vector quantization was employed to categorize the samples based on f0 height (i.e., f0 mean) and f0 shape (rise, gradual fall, fall, or rise & fall). Results showed that f0 height was useful in distinguishing between positive and negative emotions, and f0 contour shape was useful in establishing differences between confident and doubtful expressions and between allowable and unacceptable expressions.
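As a rough illustration of the five global f0 parameters and of the comparison against a speaker's neutral expression discussed in this section, the following Python sketch may be helpful. The function names, the use of semitones for the relative measure, and the sample f0 values are assumptions made for illustration only; they are not taken from any of the studies reviewed here.

```python
import math
import statistics


def global_f0_parameters(f0_hz):
    """Return the five global parameters of a voiced-frame f0 contour (Hz)."""
    if len(f0_hz) < 2:
        raise ValueError("need at least two voiced frames")
    return {
        "mean": statistics.fmean(f0_hz),
        "max": max(f0_hz),
        "min": min(f0_hz),
        "range": max(f0_hz) - min(f0_hz),
        "sd": statistics.stdev(f0_hz),  # sample standard deviation
    }


def semitones_re_neutral(value_hz, neutral_hz):
    """Express an f0 value relative to a neutral reference, in semitones.

    The semitone scale is an assumed choice of relative unit.
    """
    return 12.0 * math.log2(value_hz / neutral_hz)


# Hypothetical voiced-frame f0 values (Hz) for one speaker.
neutral = [110.0, 112.0, 108.0, 111.0, 109.0]
angry = [150.0, 180.0, 165.0, 140.0, 175.0]

neutral_stats = global_f0_parameters(neutral)
angry_stats = global_f0_parameters(angry)
shift = semitones_re_neutral(angry_stats["mean"], neutral_stats["mean"])
print(f"mean f0 shift re neutral: {shift:.2f} semitones")
```

Expressing the change as a ratio rather than a difference in Hz is one way to make values comparable across speakers with different absolute f0, though it still requires a neutral sample from each speaker.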
In another experiment, Auberge, Audibert, and Rilliard (2004) analyzed f0 contour patterns to determine whether unique patterns exist for each emotion. One male speaker expressed authentic expressions using interactive software to elicit emotional expressions. After this task, the subject was debriefed and asked to express the stimuli (color names) once more, resulting in acted expressions. The speaker (an actor) labeled his samples according to the emotions he felt during the recording tasks, which resulted in 12 emotional labels for the acted expressions (satisfaction, positive surprise, positive concentration, worried, anxiety, deception, neutral, joy, sadness, hot anger, disgust, and fear) and 10 labels for the authentic expressions (confidence, positive concentration, joy/surprise, joy, negative concentration, deception/surprise, anxiety, anxiety/fear, weariness, nothing). Global f0 measures, 11 spectral measures, and f0 contours were calculated on the vowel segments of all stimuli. The contours were smoothed by averaging 32 ms frames, shifted by 10 ms. Results showed that the f0 level (the difference between the attack and f0 mean) of acted contours was higher than that of the authentic ones. In addition, the duration of vowels was much longer for acted speech. None of the global f0 measures reached significance in the analysis of variance (ANOVA) tests even though the mean f0 and variability clearly differed between some emotions, such as between acted satisfaction and the remaining emotions and between acted fear and acted positive surprise. These results indicate that an f0 contour analysis may be useful in differentiating emotions.

Paeschke (2004) investigated the role of global f0 trend in predicting seven emotions: happiness, anger, anxiety, sadness, disgust, boredom, and neutral. Global f0 trend refers to the gross change in f0 (increasing, decreasing) across the entire utterance.
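One common way to quantify a gross f0 trend of this kind is the slope of a straight line fitted to the contour. The sketch below is a minimal, standard-library Python example of an ordinary least-squares slope in semitones per second. The function name, the 100 Hz reference for the semitone conversion, and the example contour are assumptions for illustration and do not reproduce any particular study's procedure.

```python
import math


def f0_trend_semitones_per_sec(times_s, f0_hz, ref_hz=100.0):
    """Least-squares slope of f0 (semitones re ref_hz) against time (s)."""
    if len(times_s) != len(f0_hz) or len(times_s) < 2:
        raise ValueError("need two or more paired time/f0 samples")
    # Convert Hz to semitones so the slope is speaker-independent in scale.
    st = [12.0 * math.log2(f / ref_hz) for f in f0_hz]
    t_mean = sum(times_s) / len(times_s)
    st_mean = sum(st) / len(st)
    sxx = sum((t - t_mean) ** 2 for t in times_s)
    if sxx == 0.0:
        raise ValueError("time values must not all be identical")
    sxy = sum((t - t_mean) * (s - st_mean) for t, s in zip(times_s, st))
    return sxy / sxx


# Hypothetical contour that falls by one octave over one second.
times = [0.0, 0.25, 0.5, 0.75, 1.0]
contour = [200.0 * 2.0 ** (-t) for t in times]
slope = f0_trend_semitones_per_sec(times, contour)
print(f"f0 trend: {slope:.1f} semitones/sec")
```

A negative slope corresponds to a falling trend and a positive slope to a rising one; the choice of reference frequency shifts the intercept but does not affect the slope.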
Stimuli were taken from a previously recorded database of 20 utterances (10 were single-phrase statements, and 10 were two-phrase statements) spoken with seven emotional states (fear, disgust, happiness, boredom, sadness, hot anger, and neutral) by seven German actors (Paeschke, Kienast, & Sendlmeier, 1999). Stimuli that were correctly identified by sixteen of twenty listeners were used in further analysis. Global f0 trends were measured as the gradient of the linear regression. A one-way ANOVA showed significant differences between the f0 trends for the neutral emotion and the remaining six emotions. In addition, the slope of the f0 trend for boredom was significantly less than the f0 trend for neutral (-4 semitones/sec as opposed to -3 semitones/sec, respectively). The f0 trend for sadness was small (0.84 semitones/sec) but displayed the least variability across utterances. The variability in global f0 trend was too high for anxiety, anger, or happiness to make this cue useful for characterizing these three emotions.

In summary, dynamic f0 measures have been shown to facilitate perceptual differentiation between emotions. While some studies have not found global f0 cues to significantly correlate with any of the emotions under study, others claim that these cues can be used to differentiate between some emotions. However, it is not clear whether listeners can estimate a speaker's neutral emotion for use as a comparison for all other emotions. In a sense, it is not clear whether this calculation resembles human perception. Nevertheless, experiments to identify the acoustics of emotion categories may benefit from investigating both sets (static and dynamic) of f0 features.

Intensity

Intensity, a measure of the amount of energy in the signal, has also been quantified using the five global measures (mean, max, min, range, and SD). For example, Yildirim et al.
(2004) computed duration features (including utterance, inter-word silence, vowel, voiced, and unvoiced region durations), the five global measurements on the smoothed (3-point median filter) f0 and formant frequency contours, RMS energy, and the spectral balance of vowel sounds in an investigation of the acoustic characteristics of three emotions (anger, happiness, sadness) with respect to a neutral emotion. Discriminant analysis was used to determine the effectiveness of each cue. Using the single best acoustic parameter, RMS energy, the algorithm was able to correctly classify 55.4% of the samples. When the RMS energy, f0, duration, and spectral balance were used, accuracy was similar to human performance at 67%. Closed-set identification accuracy of 100 samples by four listeners was 68%.

Petrushin (1999) suggested that energy may be a necessary parameter for distinguishing between emotions. A set of 43 global and time-varying acoustic features was investigated. The RELIEF-F algorithm (a machine learning algorithm that makes use of 1 to 12 nearest neighbors or reference samples for emotion classification) was used to select the most important features. Fourteen features were chosen, including f0 max, f0 standard deviation, f0 range, f0 mean, first formant bandwidth mean (BW1), second formant bandwidth mean (BW2), standard deviation of energy, speaking rate, f0 slope, F1 max, energy max, energy range, F2 range, and F1 range. Four classification techniques were compared, each of which resulted in approximately 65% recognition across emotions. This suggests that standard deviation and range of energy may be important for distinguishing between five emotions (normal, happy, angry, sad, and afraid).

In contrast to the experiments described above, a few researchers have found intensity measures to contribute little information for emotion classification. For example, Liscombe et al. (2003) investigated the classification of 11 emotions (confident, encouraging, friendly, happy, interested, angry, anxious, bored, frustrated, sad, and neutral) spoken by two male and two female actors. The stimuli consisted of 1,760 four-syllable numbers and dates.
Twelve features were automatically extracted from the samples, including the global f0 and energy features, the ratio of all voiced samples above the center of the f0 range to those below the center (f0 above), and the ratio of voiced samples to total segments (VTT). Also, six hand-labeled features were computed, including the mean length of syllables, spectral tilt of the vowel with nuclear stress (STV nuclear stress), spectral tilt of the vowel with highest amplitude (STV highest amplitude), type of nuclear accent, type of intonational contour, and type of phrase accent and boundary tone. Results showed that none of the global energy features were the best predictors for any emotion. The f0 features were the best predictors of angry, confident, happy, sad, interested, and frustrated, while the spectral tilt was best for anxious, bored, and friendly, and equally good as f0 for angry.

In summary, the significance of intensity-related cues is unclear. A few studies such as Yildirim et al. (2004) have found results that support the need for intensity measures to distinguish between a small number of emotions. Many other experiments find at least one parameter of the set of global intensity cues to be relevant, though it is typically not the most important cue (Petrushin, 1999; Hammerschmidt & Jurgens, 2007). A small number find intensity cues to have little significance in emotion classification. It is possible that intensity changes in speech are correlated with changes in f0 contours, thereby adding little new information for identifying emotions. However, further investigation is necessary to support this notion.

Voice quality

Many researchers believe that voice quality is an important factor for distinguishing emotions (Burkhardt and Sendlmeier, 2000; Roach, 2000; Murray and Arnott, 1995; Cowie et al., 2001; Gobl, Bennett, & Chasaide, 2002). However, voice quality is not easy to quantify. As Gobl and Chasaide (2003, p. 191) stated, "Much of what we know about the mapping of voice quality to affect has come in the form of received wisdom, based on impressionistic phonetic observations." No standard method of evaluating voice quality exists in the emotions literature.
Some of the previously studied parameters to describe voice quality include jitter or fundamental frequency perturbations (Williams & Stevens, 1969; Auberge et al., 2004; Bachorowski & Owren, 1995; Toivanen et al., 2006) and shimmer or amplitude perturbations (Bachorowski & Owren, 1995; Toivanen et al., 2006). For example, Bachorowski & Owren (1995) examined changes in jitter, shimmer, and f0 as participants experienced positive and negative emotions. These emotions were induced by providing fixed levels of positive (success) feedback, the reward case, and negative (failure) feedback, the punishment case, during a difficult lexical decision task. Recordings of the phrase "Test n test" were made prior to beginning the task as well as during the task, where n was essentially a counter of the trial number. Acoustic measurements were made on the middle 60% of the /ɛ/ vowel in "test." Results of a MANOVA for females revealed significant differences in baseline and test conditions for all three acoustic features. Females who had high positive scores on the Emotional Intensity Scale (EIS; an assessment of individual differences in perceived emotional intensity) had high jitter, shimmer, and f0 in the reward condition. Results for males overall showed significant differences in f0 only. However, males who had highly negative scores on the EIS showed higher f0, jitter, and shimmer measurements in the punishment case. It should be noted that these recordings were made using cassettes, which have been shown to affect the accuracy of perturbation measures (Doherty & Shipp, 1988).

Another acoustic measure often used to quantify changes in voice quality is the spectral slope or some measure of how intensity level varies across different frequency bands. The relation between such spectral energy distribution and emotions has been investigated in a number of studies (Banse & Scherer, 1996; Toivanen et al., 2006; Leinonen, Hiltunen, Linnankoski, & Laakso, 1997). Leinonen et al. (1997) performed an acoustic analysis of Finnish expressions of the word [saara] in 10 emotional contexts (see Table 5-1).
Analysis of the spectral energy distribution among [aa] segments was reported using self-organized spectral feature maps. Visual analysis suggested that the commanding and angry expressions have relatively more energy at the main formants (1 kHz) than the naming models. In contrast, frightened samples had more energy at the f0 and less energy in the region of the main formants than angry samples. Sad was differentiated by emphasis of the lower end of the spectrum, and the spectrum of admiring was flat and noisy, which may be due to breathy phonation. Toivanen et al. (2006) also examined the perception of emotions from a vowel taken from Finnish speech spoken in five emotional contexts: neutral, sadness, joy, anger, and tenderness. The alpha ratio was used as a measure of spectral energy distribution (calculated by subtracting the sound pressure level or SPL in the 50 Hz to 1 kHz range from the SPL in the 1 to 5 kHz range). Results of the kNN classification process suggested that duration and alpha ratio were the best features for emotion classification (followed by signal-to-noise ratio, normalized amplitude quotient or NAQ, and mean jitter).

Two other cues for voice quality include the spectral slope and harmonic-to-noise ratio (HNR). Several researchers have measured spectral slope, a measure of how quickly energy decreases as a function of frequency (e.g., Laukkanen, Vilkman, Alku, & Oksanen, 1997; Gordon & Ladefoged, 2001; Williams & Stevens, 1972), as well as HNR, a measure of the relative noise level in speech (Cahn, 1990; Klasmeyer and Sendlmeier, 1995; Hammarberg et al., 1980; Alter et al., 2000). Some studies have found significant correlations between these parameters and the perception of emotions (Schroder et al., 2001; Auberge et al., 2004). However, the strength of the associations is often weak if a relation is even present (Pereira, 2000). This inconsistency may be due to a lack of sound theoretical motivation for the use of these measures. The selection of these measures may have been driven by computational
simplicity or ease of accessibility in speech analysis software. In contrast, some researchers have looked at other factors that have been shown to influence the perceived quality of voice, such as phonation type (e.g., modal, breathy, rough, harsh, strained, sharp, and falsetto; Gobl & Chasaide, 2003; Banziger & Scherer, 2003). For instance, Burkhardt and Sendlmeier (2000) investigated the role of phonation type and other acoustic features to describe voice quality using synthetically generated speech. The features that were systematically varied included speech rate, mean f0, f0 range, phonation type (modal, falsetto, breathy, creaky, or tense), and vowel precision. A formant synthesizer was used to generate the acoustic changes in an emotionally neutral German sentence. Modifications to phonation type were performed by adjusting the KLSyn parameters, formant bandwidths, and spectral notches. For example, a voice in the falsetto register was implemented by a pitch shift and by introducing an irregularity to the pitch. The creaky voice could be assigned to all voiced phonemes or only to the first half of vowels after an unvoiced/voiced transition. Creaky voice could be generated as a harsh version (with a short open-glottis phase) or a breathy version. The stimuli were divided into three listening tests and randomly presented once to each listener (Test 1: intonation features and phonation types; Test 2: phonation types and segmental features; Test 3: intonation and segmental features). For each task, thirty German listeners judged the emotion of these stimuli using a forced-choice identification task with response options that translated to neutral, fear, anger, joy, sadness, or boredom. The results, in terms of the phonation type only, showed that fearful and sad utterances were associated with falsetto voice. Sadness was also associated with a breathy voice. Bored expressions were characterized by breathy or creaky voice. Although joy was not well recognized, the phonation types that provided increased accuracy were modal or tense phonation.
Tense phonation was clearly associated with angry expressions. Scherer (1986) and Laukkanen et al. (1996) have also suggested that sadness may be characterized by a breathy voice.

In summary, a number of cues to measure vocal quality have been investigated, such as jitter, shimmer, spectral energy distribution, spectral slope, harmonic-to-noise ratio, and more recently, estimates of phonation type. However, the results of some studies are confounded by limitations of the speech analysis equipment available at that time (e.g., jitter measures in Bachorowski & Owren, 1995). Results of the experiments examining phonation type are promising. Several studies suggest that anger is associated with a tense voice and sadness is characterized by a breathy voice, possibly in falsetto. It is noteworthy that many of these experiments were not performed using samples in American English, and many stimuli were not sentence length.

Duration

A variety of temporal measures have been studied, such as speech rate (Forsell et al., 2007), mean vowel and fricative durations (Leinonen et al., 1997), mean pause durations (Cahn, 1990), and mean sentence length. For example, Cahn (1990) performed a perceptual test to evaluate listener perception of synthetically generated emotions that varied according to four sets of parameters: fundamental frequency, duration, voice quality (included intensity), and articulation. Six emotions were tested (angry, disgusted, glad, sad, scared, and surprised) using five neutral sentences. The average recognition accuracy was 46%, discounting the high recognition accuracy of sad (91%). Since sad was well recognized, it was suggested that the acoustic parameters manipulated to produce a sad expression may correspond well with the actual perception of sad voices. These parameters included a low amount of pause discontinuity, hesitation pauses (the number of pauses within a syntactic or semantic unit), and a low speech rate in addition to high pitch discontinuity and high breathiness.
In the experiment by Leinonen et al. (1997), an acoustic analysis was performed on expressions of the Finnish word [saara]. This word was expressed in 10 emotional contexts (see Table 5-1). Results showed that the duration of [ra] varied significantly less across the emotional contexts than the duration of [saa], except for frightened, in which case the duration of [ra] was significantly greater than [aa]. Samples of emotions other than naming (akin to a neutral emotion) and commanding were of longer duration. In addition, sadness included a small lengthening of syllable duration. The phonation duration was also shorter for naming than for sad, content, admiring, scornful, and astonished. On a side note, some of the emotions investigated, such as content, admiring, and pleading, were not differentiated by any of the parameters investigated.

Williams and Stevens (1972) performed an acoustic analysis on the speech of actors that was elicited in a play specially designed for this purpose. In addition, the acoustic cues of an announcer during a real-life traumatic event were computed. Analysis of both the natural and acted speech showed that the mean rate of articulation (syllables per second) of sorrow was less than half of the speech rate for the remaining three emotions: anger, fear, and neutral. The durations of anger expressions were typically longer due to a slower articulation rate; however, this was not consistent for all speakers.

Overall, duration measurements of emotional expressions range from long segments, such as the entire utterance, to smaller segments, such as vowel, fricative, and pause durations. Results seem to indicate that duration measurements over smaller segments are more informative than mean sentence length, especially since the mean sentence length cannot be compared across sentences of different texts. In addition, speech rate in syllables per second may differentiate certain emotions, such as sad, from other emotions.
Classification of Emotion Categories

A number of different classification techniques have been used to evaluate the emotion of an expression based on these acoustic cues. In 2003, Park and Sim used a dynamic recurrent neural network approach to classify four emotions (angry, laugh, surprise, and neutral) using 22 sentences. A simulator was used to recognize each of the four f0 patterns. The raw difference method provided an accuracy of 45.5%, whereas the penalty rule evaluation method gave an overall accuracy of 81.8% when only f0 pattern was used. The high classification accuracy when given the f0 contour pattern suggests that f0 pattern may play an important role in emotion recognition. In 1996, Dellaert et al. proposed an alternate method of measuring f0 features by first using a cubic spline function to smooth the f0 contours of their sentence-length speech samples. They examined two sets of features for predicting four emotions (happy, sad, anger, and fear). Feature Set A was a small set that included the five global measures of f0 for voiced portions of the speech signal, as well as speaking rate and spectral slope of the entire utterance (based on voiced segments). Measurements for Feature Set B were made on the f0 contour smoothed using cubic splines. These included the five global measures previously described, the same five measures on the derivative of the smoothed contour, mean, min, and max of the voiced segments, mean positive and negative derivative, and rhythm-related cues (speaking rate, average length between voiced regions, ratio of the number of maxima to the sum total of the maxima and minima, ratio of the number of upslopes to the number of slopes, and slope of maxima). They achieved the best classification accuracy with Feature Set B (79.5% correct as opposed to 67.5% for Feature Set A) using a classification process similar to forward feature selection (termed cooperative composition). This rate was still less than their estimation of human performance (1 male speaker judging 200 samples made by 1 speaker) at 82% correct.
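To make a few of the derivative-based and rhythm-related cues listed above more concrete, the following Python sketch computes the mean positive and negative derivative, the ratio of maxima to the total number of maxima and minima, and the ratio of upslopes to all slopes from an f0 contour. It is only a sketch under stated assumptions: a centered moving average is used as a simple stand-in for cubic spline smoothing, and the function names, the 10 ms frame step, and the example contour are hypothetical rather than drawn from the original study.

```python
def moving_average(values, width=3):
    """Centered moving average; a simple stand-in for spline smoothing."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def contour_shape_features(f0_hz, frame_step_s=0.01):
    """Derivative and rhythm-style cues from an f0 contour (Hz)."""
    # First difference approximates the derivative in Hz per second.
    deriv = [(b - a) / frame_step_s for a, b in zip(f0_hz, f0_hz[1:])]
    ups = [d for d in deriv if d > 0]
    downs = [d for d in deriv if d < 0]
    # Interior local extrema are sign changes of the derivative.
    n_max = sum(1 for a, b in zip(deriv, deriv[1:]) if a > 0 and b < 0)
    n_min = sum(1 for a, b in zip(deriv, deriv[1:]) if a < 0 and b > 0)
    n_slopes = len(ups) + len(downs)
    n_extrema = n_max + n_min
    return {
        "mean_pos_deriv": sum(ups) / len(ups) if ups else 0.0,
        "mean_neg_deriv": sum(downs) / len(downs) if downs else 0.0,
        "max_ratio": n_max / n_extrema if n_extrema else 0.0,
        "upslope_ratio": len(ups) / n_slopes if n_slopes else 0.0,
    }


# Hypothetical contour (Hz) sampled every 10 ms: rise, fall, rise.
contour = [100.0, 110.0, 120.0, 110.0, 100.0, 110.0, 120.0]
features = contour_shape_features(contour)
print(features)
```

In practice the contour would be smoothed first (for example with `moving_average` or a spline fit) so that frame-to-frame measurement noise does not create spurious maxima and minima.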
The experiments previously described, such as Yildirim et al. (2004) and Petrushin (1999), used different classification techniques. For instance, Yildirim et al. (2004) used discriminant analysis to identify the critical acoustic features necessary to distinguish between three emotions (anger, happiness, sadness) with respect to a neutral emotion. Through this analysis, the authors were able to select a small set of relevant parameters from a larger set of approximately 16 parameters. Although use of all available features resulted in the highest classification accuracy (67%), classification using RMS energy alone still provided 55.4% accuracy. Petrushin (1999) compared four different classification algorithms (k-nearest neighbors (kNN), neural networks (NN), ensembles of neural network classifiers, and set of experts) to sort five emotions (normal, angry, happy, sad, and afraid). The kNN algorithm reached 55% accuracy using the top eight features of the fourteen selected (f0 max, f0 SD, f0 range, f0 mean, first formant bandwidth mean (BW1), second formant bandwidth mean (BW2), SD of energy, speaking rate, f0 slope, F1 max, energy max, energy range, F2 range, and F1 range), although anger improved to 65% accuracy with all fourteen features. The neural networks classifier achieved 65% accuracy on average across the five emotions when using the top eight features. The ensembles of neural network classifiers technique provided approximately 66% accuracy for the eight-feature set. The final classifier, referred to as the set of experts, was based on the idea of having separate neural networks to recognize each emotion. Recognition using the results of all experts provided 60% accuracy. Although the ensembles of neural network classifiers provided the highest accuracy, the single neural network technique provided similar results on average.
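For readers unfamiliar with the k-nearest neighbors approach mentioned above, a minimal Python sketch follows. It labels a new sample by majority vote among the k closest training samples in feature space. The function name, the choice of Euclidean distance, the value k = 3, and the pre-scaled feature values are all assumptions made for illustration; they do not reproduce the feature set or settings of any study reviewed here.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs, and distances are
    Euclidean, so features are assumed to be on comparable scales.
    """
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the number of samples")
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, pre-scaled features: (f0 mean, SD of energy, speaking rate).
train = [
    ((0.9, 0.8, 0.7), "angry"),
    ((0.8, 0.9, 0.8), "angry"),
    ((0.2, 0.1, 0.2), "sad"),
    ((0.1, 0.2, 0.3), "sad"),
    ((0.5, 0.4, 0.5), "normal"),
    ((0.5, 0.5, 0.4), "normal"),
]
predicted = knn_classify(train, (0.85, 0.85, 0.75))
print(predicted)
```

Because the vote depends directly on distances, the result is sensitive to how the acoustic features are scaled and to which features are included, which is one reason feature selection matters for this kind of classifier.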
Despite the significant associations between acoustic factors and emotion categories, algorithms to predict a variety of emotion categories based on these relations have not reached the accuracy achieved by human listeners. Table 5-1 shows the perceptual accuracy and computer recognition accuracy in identification of emotions reported by seven researchers. On average, six emotions are tested, although the number ranges from four to 10. Some emotions, such as angry and sad, have been investigated in the majority of experiments. Various terms that may describe happy are also common (e.g., happy and joy). On the other hand, many emotions, such as scornful, pleading, frustrated, and tenderness, are rarely explored. The average perceptual recognition accuracy of these emotions is around 60% (range of 26% to 96%), a figure consistent with a review of the literature in 1989 by Scherer. Accuracy was not always higher for emotions that were commonly studied. For example, angry was typically the easiest emotion for listeners to recognize (74% accuracy on average), whereas sad was recognized only 56.4% of the time on average. Some studies have reported high accuracy for infrequently studied emotions such as astonished (72%), confident (75%), and interested (70%). This variability in listener perception may depend on the type of stimuli used. For example, emotion perception from vowels may be difficult for listeners due to the short segment size and lack of syntactic structure, as seen in the low recognition rates in Toivanen et al. (2006). It is also possible that listener performance varies depending on the language studied.

Machine classification accuracy of emotion recognition in SS has not always followed listener perception. For example, the average machine classification accuracy was similar to listener perception of four emotions in Yildirim et al. (2004). However, the classification accuracy of angry was less than listener perception, while classification accuracy of the remaining three emotions (happy, sad, and neutral) was greater than listener perception.
In contrast, classification accuracy was better than perceptual accuracy of five emotions in Toivanen et al. (2006) and for some of the emotions in Liscombe et al. (2003). These differences in computer emotion recognition accuracy may be due to the acoustic feature set used or the classification algorithm. Certain emotions may have been difficult to classify since they were not easy to perceive in SS.

Acoustic Correlates of Emotion Categories

The acoustic features that characterize the emotions investigated in SS are shown in Table 5-2. This table also provides information regarding the initial acoustic feature set examined, the classification algorithm, and the significant acoustic parameters for each emotion. Although the emotions and initial acoustic feature sets investigated vary significantly across studies, it is possible to form an approximation of the relevant acoustic cues for each emotion. These are described below for some of the commonly studied emotions such as anger, sadness, happy, fear, and bored. Most of the descriptions are relative to the acoustics of a neutral emotion.

First, the emotion of disgust deserves mention. Disgust is an emotion that has been continuously studied by many researchers even though it has not been well recognized by listeners. The inconsistency in findings of acoustic correlates to disgust may be due to its low perceptual accuracy. It is possible that listeners may not be able to identify disgust from speech that does not contain semantic information. Since disgust was not identified as a unique emotion category in Chapter 2, the present discussion will not provide an analysis of this emotion.

On the other hand, anger is an emotion that has been easily identified by listeners and classifiers alike. It has been characterized as having a faster rate of speech compared with a neutral emotion, possibly due to low hesitation pauses (the number of pauses within a syntactic or semantic unit) and fluent pauses (the number of pauses between syntactic or semantic units; Cahn, 1990). In addition, angry speech is typically expressed with a higher mean f0 and intensity and a larger f0 range.
It is possible that these features better describe cold anger (annoyance, irritation) than hot anger (rage). Cold anger may be related to the degree of overshoot of the target formant frequencies (Burkhardt & Sendlmeier, 2000). Hot anger is typically characterized as having many stressed syllables (Burkhardt & Sendlmeier, 2000). The pitch changes may be abrupt on stressed syllables (Murray & Arnott, 1993). Anger is also characterized by tense phonation (Burkhardt & Sendlmeier, 2000).

A commonly reported characteristic of sadness is a slow rate of speech. In addition, sadness may be expressed with lower intensity, a breathier quality, and slurred articulation (Scherer, 1986; Burkhardt & Sendlmeier, 2000). This emotion has also been described by a relative emphasis of the lower end of the spectrum. Typically, sadness is expressed using a lower mean f0, narrower f0 range, and downward f0 inflections. Irregular f0 patterns may be apparent as well. However, sadness may also be expressed using a raised f0 contour with phonation in the falsetto register. This may be a difference in male-female expression of sadness or simply two equally representative methods of expressing sadness.

Happiness has been described as having a broader range of f0, a high mean f0, and smooth, upward inflections in the f0 contour (Murray & Arnott, 1993). The speech rate is variable, with some reports of a faster speech rate and others of a slightly slower speech rate. However, there may be fewer inter-word pauses (Yildirim et al., 2004). Phonation is typically in the modal register. The voice quality of happy expressions has been described as breathy by some and tense by others. The mean intensity is typically high. Furthermore, precise articulation seems to aid in the perception of a happy expression. Burkhardt and Sendlmeier (2000) showed that high first and second formant frequencies may also cue the perception of happy expressions.

Fearful expressions can be characterized by a high mean pitch, typically in the falsetto register, and f0 variability. Leinonen et al.
(1997) reported that frightened samples had the highest mean f0. The speech rate is considerably faster than neutral. Cahn (1990) suggested that scared could be described by a low number of fluent pauses, a high number of hesitation pauses, and a high stress frequency (ratio of stressed to stressable words). In addition, fearful utterances may have irregular voicing. Fearful expressions may be differentiated from angry by more energy at the f0 than in the region of the main formants around 1 kHz (Leinonen et al., 1997).

Bored is not as commonly studied as angry, happy, sad, and fearful. Acoustic analysis suggests that bored may be characterized by a low mean f0, possibly in the pulse register, and a narrow f0 range. The rate of speech in bored expressions is typically slow. Vowel articulation may be imprecise (vowel target undershoot). These expressions may also be characterized by increased breathiness.

In summary, evidence exists to confirm the notion that emotions can be perceived from suprasegmental variations in speech, which are transmitted through acoustic features such as fundamental frequency, intensity, duration of speech segments, and voice quality. Previous experiments have selected a large set of acoustic parameters to be measured from the speech signal and have found significant correlations with perceptual judgments of the emotional speech samples. However, these results are not completely convincing. Synthesis of emotional speech based on these parameters has still failed to provide natural-sounding emotional speech. This may be due to a limited acoustic feature set or unreliable perceptual data.

Acoustic Correlates of Emotion Dimensions

Instead of examining the acoustic characteristics of each emotion category, some researchers attempt to determine the set of acoustic properties or dimensions that are common to all emotions. The emotion categories can then be described by their magnitude on each of these dimensions. A discussion of the experiments to find the perceptual dimensions of emotions was provided in Chapter 3.
Briefly, the findings showed that the number of dimensions selected to describe a set of emotions was not consistent across experiments. Between two and four
dimensions are suggested for explaining the differences between the emotions under study; however, this number depends on the number of emotions under investigation. As a result, experiments to study the acoustic characteristics of these dimensions have shown some inconsistency in findings as well. Nevertheless, some general trends are apparent.

The acoustic correlates to the first dimension are similar across experiments. This dimension has typically been described as the level of arousal or emotional energy in the expression (Schlosberg, 1954; Russell & Feldman Barrett, 1999). It is typically referred to as activation or arousal (Davitz, 1969). In the majority of experiments, the second dimension separates the negative emotions from positive emotions (Block, 1957). This dimension is commonly referred to as valence, evaluation, hedonic tone, or pleasantness (Davitz, 1969). Other dimensions reported include potency or power (Fontaine et al., 2007; Osgood, 1969), unpredictability (Fontaine et al., 2007), dominance or confidence (Russell & Mehrabian, 1977), and intensity (Laukka, Juslin, & Bresin, 2005). However, these dimensions are not consistently part of a model of emotional expression.

The acoustic correlates of each of these dimensions can be determined by first extracting the acoustic cues of the speech samples. These measures are then correlated with perceptual judgments of the emotions on each dimension (Schroder et al., 2001; Davitz, 1964; Huttar, 1968; Uldall, 1960). The acoustic cues that were significantly correlated with each dimension can be used in further regression analysis (multiple or stepwise) to determine a set of features that can differentiate between the emotions on a dimension (Schroder et al., 2001; Banse & Scherer, 1996; Tato et al., 2002; Juslin & Laukka, 2001). For example, Schroder et al. (2001) investigated the acoustic correlates to the activation, evaluation, and power dimensions.
Perceptual ratings of the spontaneous emotional speech were obtained and then correlated with the acoustic variables
(a description of the stimuli was provided earlier). Results showed that the highest correlations were found for the activation dimension. The features that were significantly correlated with at least one dimension were used in a stepwise regression to determine the acoustic patterns of happy, afraid, angry, sad, and neutral on the three dimensions. The active end of the activation dimension was found to correspond to higher f0 mean and range, longer phrases, shorter pauses, larger and faster f0 rises and falls, increased intensity, and a flatter spectral slope. The negative end of the evaluation dimension was associated with longer pauses, faster f0 falls, increased intensity, and more prominent intensity maxima. The power dimension was characterized by lower f0 mean.

In an experiment by Liscombe et al. (2003), 11 emotions were divided on the valence dimension (five positive: confident, encouraging, friendly, happy, interested; five negative: angry, anxious, bored, frustrated, and sad; and a neutral emotion). The stimuli consisted of four-syllable numbers and dates. Twelve features were automatically extracted from the samples. Results showed that the spectral tilt of vowels with the highest amplitude (STV) could be used to distinguish between two sets of emotions according to the valence dimension (friendly, happy, and encouraging from angry and frustrated). In addition, f0, RMS amplitude, and speaking rate were useful in separating emotions on the activation dimension. These results were also reported by Cowie and Cornelius (2003). The RIPPER model was used with a binary classification procedure to sort the speech samples into correct or incorrect detection. Accuracy using the automatic features, the hand-labeled features, or their combination was moderate at 75%.

The acoustic cues that have been suggested to correspond to each of these dimensions are shown in Table 5-3.
These results suggest that emotions on the high end of the activation dimension are associated with a high mean f0, faster speaking rate, increased mean intensity,
high f0 variability, and increased high frequency energy. This dimension may be used to separate emotions such as happy, angry, and fear from others such as sadness (Tato et al., 2002). The best feature set for distinguishing between emotions on the activation dimension seems to be the mean and variability of f0 together with speaking rate.

The acoustic cues to the valence dimension have not been consistent in the literature. In fact, a number of studies have failed to determine the acoustic characteristics of the valence dimension (Pereira, 2000; Davitz, 1964), although this may be due to use of a small feature set (Schroder, 2003). Tato et al. (2002) suggested that the lack of correlates to the valence dimension may be due to understudied voice quality features in the dimensional approach. These authors suggested that the prosodic features (f0 features, intensity, duration, etc.) may best describe the activation dimension, while quality features may best characterize the valence dimension. However, this hypothesis has not been tested further. On the whole, emotions high in valence may be described as having less high frequency energy and more f0 variability than the emotions on the low end. These cues may be used to differentiate between emotions such as happy and angry (Tato et al., 2002).

Ohala (1983) suggested that the vocal expression of emotions is typically used to communicate the presence or absence of power and dominance. However, Schroder (2003) suggested that the acoustic characteristics of the power dimension may overlap with the activation dimension, suggesting that this dimension may be redundant. From Table 5-3, the most stable results appear for mean and range of intensity. This suggests that highly potent emotions or emotions high in power can be characterized by a high mean intensity and large range of intensity. Confidence, power, and intensity have been suggested to represent the same dimension.
This dimension separates the strong emotions from the weak emotions. Greenberg et al. (2007) suggested that the f0 contour shape may play a role in differentiating emotions based on confidence. Finally, the intensity dimension was characterized as separating
emotions based on mean f0 and high frequency energy, similar to the power and activation dimensions.

In summary, strong evidence suggests that emotions can be described according to sets of properties that describe one or more dimensions. The acoustic cues to the first dimension, arousal, are the most robust. Although the second dimension can clearly separate positive and negative emotions perceptually, the acoustic cues to this dimension are inconsistent. This may be due to the gap in the literature in examining vocal quality parameters. In the previous section, the sad category was described as differing from the other emotions in terms of breathiness and a falsetto voice. Anger was reportedly characterized by increased vocal tension. These cues may be relevant to the valence dimension. It is interesting that anger and sad, both negative emotions, were characterized by tension and breathiness, respectively. This may suggest that the definition of the valence dimension as separating positive and negative emotions may be too vague. This dimension may separate certain negative and positive emotions rather than all negative and positive emotions.

Limitations and Overview

The acoustic features investigated in previous research have been shown to correlate with some of the emotion dimensions; however, these measures alone have failed to discriminate all emotions on a dimension with a high degree of sensitivity and specificity (Juslin et al., 2005). Evidence exists to support the idea of emotion-specific acoustic patterns for emotion categories such as angry, sad, and happy, as well as the activation dimension. It is interesting then that the acoustic cues to the valence dimension remain elusive. One explanation for the difficulty in finding the acoustic properties of the valence dimension is that the definition of this dimension is a reduced account of the emotion properties corresponding to it.
In other words, the term valence is a label that approximately describes the emotions separated by this dimension.
Perceptual experiments that obtain ratings of expressions in terms of a valence or pleasantness dimension may not represent information that can be used acoustically to separate emotions. Instead, measures of perceptual distances between emotions obtained using a discrimination test can be used to obtain the arrangement of emotions on each of the dimensions without introducing any bias due to labeling the emotion properties or dimensions. In addition, the discrimination task enables calculation of perceptual distances using d′ instead of percent correct. This is advantageous since d′ is a measure that accounts for the false alarm rate whereas percent correct scores do not. As a result, percent correct scores may overestimate listener perception.

One of the main limitations of current research is its reliance on a speaker's neutral emotion. To compare global measures of f0 or intensity across speakers, researchers evaluate the acoustic characteristics of each speaker's emotions relative to that speaker's neutral emotion (e.g., Lee et al., 2002; Sobin & Alpert, 1999; Banse & Scherer, 1996). Speech produced with a neutral emotion is assumed to have no emotional content (Ramamohan and Dandapat, 2006; Loveland et al., 1997; Nandur, 2003). This procedure is necessary to aggregate calculations across speakers due to the large individual differences in physical and acoustic characteristics among speakers. For example, it is not possible to say that an absolute mean f0 of 200 Hz would signify the expression of a particular emotion since f0 varies significantly across speakers. Using a neutral emotion as a comparison tool may ease the computational analyses; however, the ability to detect the emotion of an utterance without the use of each speaker's neutral emotion has broader applications, such as real-time emotion detection from speech.
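The advantage of d′ over percent correct noted above can be illustrated with a short sketch. It uses the simplest yes/no form, d′ = z(hit rate) − z(false-alarm rate); this is an assumption made for illustration, since a same-different discrimination design may call for a different decision model. The function names, the log-linear correction, and the listener counts are all hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Yes/no sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    # Log-linear correction (an assumption of this sketch): adding 0.5 to each
    # count keeps rates away from 0 and 1, where the z-transform is infinite.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


def percent_hits(hits, misses):
    """Percent of 'different' trials answered correctly; ignores false alarms."""
    return 100.0 * hits / (hits + misses)


# Two hypothetical listeners with the same hit rate but different false-alarm rates.
careful = d_prime(hits=45, misses=5, false_alarms=5, correct_rejections=45)
liberal = d_prime(hits=45, misses=5, false_alarms=40, correct_rejections=10)
```

Both hypothetical listeners score 90% on the "different" trials, yet the listener who also responds "different" on most "same" trials receives a much smaller d′, which is the sense in which percent correct can overestimate perception.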
Other limitations that may influence the classification accuracy of emotions include the acoustic feature selection process or the existing algorithms to classify the emotional speech. For
example, the inclusion of too few or too many acoustic parameters as a result of the feature selection process may reduce classification accuracy. In addition, many of these experiments have been performed using stimuli of different lengths, degrees of semantic information, and languages. Many of the studies have been performed using German speech and a handful of recent experiments have used Finnish speech. However, no clear evidence suggests that the patterns of emotional expressions in these languages are similar to those of American English speech.

Therefore, the present experiments were performed to develop an acoustic model that describes the four emotion categories that can be perceived in SS using the emotion dimensions approach. Since perceptual judgments serve as the gold standard for model accuracy, an experiment was performed to obtain reliable and relatively unbiased measures of the perceptual distances between emotions in a multidimensional space (Chapter 2). These judgments served as the reference for model development. A three-dimensional (3D) space was selected to describe the differences between 19 emotions in SS. The acoustic correlates to each of these dimensions were hypothesized in Chapter 3. The first of the present two experiments sought to empirically evaluate those predictions. To this end, a number of acoustic features were extracted from the speech samples. This feature set was novel in that the acoustic measures were not dependent on any baseline measure from the speaker. The relevant acoustic parameters that corresponded to each perceptual emotion dimension were selected using a feature selection process consisting of stepwise regressions. This procedure was then used to locate the emotions in a multidimensional acoustic space.
Finally, the emotional speech samples were classified using two algorithms. In the second experiment, a novel set of speech samples was collected to evaluate the performance of the acoustic model. Although the 3D MDS solution (i.e., the perceptual model)
was developed based on 19 emotions, the results of the perceptual data analyses (d′ analysis and hierarchical clustering analysis) suggested that many of the 19 emotions may be difficult to discriminate in SS. Therefore, the model was developed using the 19 emotions, but evaluated using a reduced set of emotions, namely the 11 emotion categories formed at a clustering level of 1.95. Furthermore, these emotions were classified into one of the four emotion categories perceivable in SS as determined in Chapter 2. The model predictions were compared to perceptual estimates of these novel samples to evaluate the performance of the model. It was predicted that a model derived from relatively unbiased measures of perception, in addition to the use of certain novel acoustic measures, would result in an improvement in classification accuracy of these emotions. A better understanding of how emotions are expressed and perceived in SS can directly benefit a number of technical and clinical applications.

Experiment 1: Development of an Acoustic Model of Emotion Recognition

Previously, it was determined that 19 emotion categories could be described by their magnitude on three dimensions. In this experiment it was hypothesized that each dimension of the perceptual MDS model corresponded to a set of acoustic cues. The purpose of the present experiment was to determine these cues. Previous research has mainly examined the correlations of global measures to describe emotions, such as the mean and standard deviation of f0 and intensity, spectral slope, formant bandwidth, and overall duration or speaking rate. Although some of these parameters have been shown to correlate with the perception of emotions, these measures alone have failed to discriminate all emotions with a high degree of sensitivity and specificity. In addition, measurement of global features such as the mean f0 and mean intensity requires a reference for comparison.
Typically, a speaker's neutral emotional expression has been used for this purpose. However, it is unclear whether this process is representative of listener perception of emotions. Therefore, the acoustic feature set computed in the present
experiment consisted of measures that did not require a reference emotion for each speaker. A feature selection process was then implemented to identify the set of acoustic cues that corresponded to each of the perceptual dimensions. Finally, the speech samples used in model development and additional samples from the same speakers were classified using two simple classifiers to test the performance of the model.

Speech Stimuli

The speech samples were obtained from one male and one female speaker as described in Chapter 2. Briefly, these speakers expressed two nonsense sentences in 19 emotional contexts, resulting in 76 samples (19 emotions × 2 speakers × 2 sentences). However, only a single sentence from each speaker was used in the perceptual experiments and analyses in earlier chapters. Therefore, only these 38 sentences (19 emotions × 2 speakers × 1 sentence) were used in model development (the training set).

To evaluate the performance of the acoustic model in predicting the emotions of both speakers and sentences, two stimulus sets were used. Each of these sets contained expressions in only 11 emotions, since it was determined that all 19 emotions may not be easily perceived in SS. The first set resembled the training set; however, only 11 emotions were included. This set consisted of 22 samples (11 emotions × 2 speakers × 1 sentence) and will be referred to as trclass. This set was mainly used to determine how closely the acoustic space matched the perceptual space and to assess the performance of the classifiers. A second set was formed using both sentences from each speaker to determine whether model accuracy was greater for one speaker or sentence over another. This resulted in the selection of 44 samples (11 emotions × 2 speakers × 2 sentences) for use in the preliminary test of model generalization (referred to as the first test set or test1 set).
Measurement of Acoustic Features

The acoustic features examined in this experiment can be divided into four main groups: fundamental frequency cues, intensity cues, duration measurements, and voice quality cues. Some of the acoustic features studied here were unique to the study of emotions in SS. Other cues that have been previously investigated in the literature, such as speaking rate and f0, were calculated using novel algorithms. A list of the acoustic cues investigated in the present experiment is shown in Table 5-4. Many of these acoustic parameters were estimated by dividing the speech signal into small time segments or windows. This process was used to capture the dynamic changes in the acoustic parameters in the form of contours. It is often convenient, if not necessary, to smooth the contours before extracting features from them. As a result, a preprocessing step was required prior to computing some acoustic features. Most calculations were performed using MATLAB 7.0 (Mathworks, Inc.). However, algorithms to automatically compute some acoustic features are not without error. Therefore, although an attempt was made to automate the entire process, some acoustic measures were computed manually. The following sections describe the methods for calculating each of the acoustic cues.

Fundamental frequency

Williams and Stevens (1972) stated that the f0 contour may provide the clearest indication of the emotional state of a talker. Therefore, a number of static and dynamic parameters based on the fundamental frequency were calculated. To obtain these measurements, the f0 contour was computed using the SWIPE algorithm (Camacho, 2007). SWIPE estimates the f0 by computing a pitch strength measure for each candidate pitch within a desired range and selecting the one with highest strength.
Pitch strength is determined as the similarity between the input and the spectrum of a signal with maximum pitch strength, where similarity is defined as the cosine of the angle between the square roots of their magnitudes. It is assumed that a signal with maximum
pitch strength is a harmonic signal with a prime number of harmonics, whose components have amplitudes that decay according to 1/frequency. Unlike other algorithms that use a fixed window size, SWIPE uses a window size that makes the square root of the spectrum of a harmonic signal resemble a half-wave rectified cosine. Therefore, the strength of the pitch can be approximated by computing the cosine of the angle between the square root of the spectrum and a harmonically decaying cosine. An extra feature of SWIPE is the frequency scale used to compute the spectrum. Unlike FFT-based algorithms that use linearly spaced frequency bins, SWIPE uses frequency bins uniformly distributed on the ERB scale. The SWIPE algorithm was selected since it was shown to perform significantly better than other algorithms for normal speech (Camacho, 2007).

Once the f0 contours were computed using SWIPE, they were smoothed and corrected prior to making any measurements (correction and smoothing procedures are described in the following Preprocessing section). The pitch minimum and maximum were then computed from the final pitch contours. To normalize the maxima and minima, these measures were computed as the absolute maximum minus the mean (referred to as pnorMAX for normalized pitch maximum) and the mean minus the absolute minimum (referred to as pnorMIN for normalized pitch minimum). This is shown in Figure 5-1.

A number of dynamic measurements were also made using the contours. It was predicted that dynamic information may be more informative than static information on some occasions. For example, to measure the changes in f0 variability over time, a single measure of the standard deviation of f0 may not be appropriate. Samples with the same mean and standard deviation of f0 may have different global maxima and minima or f0 contour shapes. As a result, listeners may be attending to these temporal changes in f0 rather than the gross f0 variability. Therefore, the gross
trend (gtrend) was estimated from the utterance. An algorithm was developed to estimate the gross pitch contour trend across an utterance (a window of approximately 4 s) using linear regressions. Five points were selected from the f0 contour of each voiced segment (the first and last samples, and the samples at 25%, 50%, and 75% of the segment duration). A linear regression was performed using these points from all voiced segments. The slope of this line was obtained as a measure of the gross f0 trend. This calculation is illustrated in Figure 5-2.

In addition, f0 contour shape may play a role in emotion perception. The contour shape may be quantified by the number of peaks in the f0 contour. For example, emotions at opposite ends of Dimension 1 such as surprised and lonely may differ in terms of the number of increases followed by decreases in the f0 contours (i.e., peaks). In order to determine the number of f0 peaks, the f0 contour was first smoothed considerably. Then, a cutoff frequency was determined. The number of zero-crossings at the cutoff frequency was used to identify peaks. Pairs of crossings that were increasing and decreasing were classified as peaks. This procedure is shown in Figure 5-3. The number of peaks in the f0 contour within the sentence was then computed. The normalized number of f0 peaks (normnpks) parameter was computed as the number of peaks in the f0 contour divided by the number of syllables within the sentence, since longer sentences may result in more peaks (the method of computing the number of syllables is described in the Duration section below).

Another method used to assess the f0 contour shape was to measure the steepness of f0 peaks. This was calculated as the mean rising slope and mean falling slope of the peak.
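The gross trend and peak-count measures described above can be sketched as follows. This is a minimal illustration, not the dissertation's MATLAB implementation: it assumes the f0 contour is a list with unvoiced frames set to zero, that the cutoff frequency is supplied by the caller (the text does not state how it was chosen), and that the contour passed to the peak counter has already been smoothed. All function names are hypothetical.

```python
def voiced_segments(f0):
    """Return inclusive (start, end) index pairs for runs of nonzero f0."""
    segments, start = [], None
    for i, value in enumerate(f0):
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(f0) - 1))
    return segments


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx  # undefined if all x values coincide


def gross_trend(times, f0):
    """gtrend: regression slope through five points from each voiced segment."""
    xs, ys = [], []
    for start, end in voiced_segments(f0):
        # First sample, 25%, 50%, 75% of the segment, and last sample.
        for q in (0.0, 0.25, 0.5, 0.75, 1.0):
            i = start + round(q * (end - start))
            xs.append(times[i])
            ys.append(f0[i])
    return least_squares_slope(xs, ys)


def count_peaks(contour, cutoff):
    """Count upward crossings of the cutoff that are followed by a downward one."""
    peaks, above = 0, False
    for value in contour:
        if value > cutoff and not above:
            above = True          # upward crossing opens a candidate peak
        elif value <= cutoff and above:
            above = False         # downward crossing closes it
            peaks += 1
    return peaks                  # a contour ending above the cutoff is not counted


def normalized_peak_count(contour, cutoff, n_syllables):
    """normnpks: number of f0 peaks per syllable."""
    return count_peaks(contour, cutoff) / n_syllables
```

For a contour whose voiced frames rise by 10 Hz every 10 ms, `gross_trend` returns a slope of about 1000 Hz/s regardless of where the unvoiced gaps fall, which is the property that makes the five-point sampling insensitive to segment length.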
The rising slope (mpkrise) was computed as the difference between the maximum peak frequency and the zero-crossing frequency, divided by the difference between the zero-crossing time prior to the peak and the time at which the peak occurred (i.e., the time period of the peak
frequency or the peak time). Similarly, the falling slope (mpkfall) was computed as the difference between the maximum peak frequency and the zero-crossing frequency, divided by the difference between the peak time and the zero-crossing time following the peak. The computation of these two cues is shown in Figure 5-4. These parameters were normalized by the speaking rate since fast speech rates can result in steeper peaks. The formulas for these parameters are as follows:

$$\mathrm{peak}_{rise} = \frac{(f_{\text{peak max}} - f_{\text{zero-crossing}}) \,/\, (t_{\text{peak max}} - t_{\text{zero-crossing}})}{\text{speaking rate}} \qquad (5\text{-}1)$$

$$\mathrm{peak}_{fall} = \frac{(f_{\text{peak max}} - f_{\text{zero-crossing}}) \,/\, (t_{\text{zero-crossing}} - t_{\text{peak max}})}{\text{speaking rate}} \qquad (5\text{-}2)$$

The peakrise and peakfall were computed for all peaks and averaged to form the final parameters mpkrise and mpkfall.

In summary, the novel cues investigated in the present experiment include fundamental frequency as measured using SWIPE, normnpks, and the two measures of steepness of the f0 contour peaks (mpkrise and mpkfall). These cues may provide better classification of emotions in SS since they attempt to capture the temporal changes in f0 from an improved estimation of f0. Although some emotions may be described by global measures or gross trends in the f0 contour, others may be dependent on within-sentence variations.

Intensity

Intensity is essentially a measure of the energy in the speech signal. The intensity of each speech sample was computed for 20 ms windows with a 50% overlap. In each window, the root mean squared (RMS) amplitude was determined and then converted to decibels (dB) using the following formula:

$$\text{Intensity (dB)} = 20 \log_{10} \left[\operatorname{mean}(amp^2)\right]^{1/2} \qquad (5\text{-}3)$$

The parameter amp refers to the amplitude of each sample within a window. This formula was used to compute the intensity contour of each signal. The global minimum and maximum were
extracted from the smoothed RMS energy contour (smoothing procedures are described in the following Preprocessing section). The intensity minimum and maximum were normalized for each sentence by computing the absolute maximum minus the mean (referred to as iNmax for normalized intensity maximum) and the mean minus the absolute minimum (referred to as iNmin for normalized intensity minimum). This is shown in Figure 5-5.

In addition, the duty cycle and attack of the intensity contour were computed as an average across measurements from the three highest peaks. The duty cycle (dutycyc) was computed by dividing the rise time of the peak by the total duration of the peak. The attack (attack) was computed as the intensity difference for the rise time of the peak divided by the rise time of the peak. The normalized attack (Nattack) was computed by dividing the attack by the total duration of the peak, since peaks of shorter duration would have faster rise times. Another normalization was performed by dividing the attack by the duty cycle (normattack). This was performed to normalize the attack time as it may be affected by the speaking rate and peak duration. These cues were included as measures of the staccato-like emphasis of syllables used to discriminate between the emotions at the angry end of D2. These cues have not been frequently examined in the literature, possibly because they are not easy to compute from commonly used software displays such as Praat. The computations of attack and dutycyc are shown in Figure 5-6.

Duration

Speaking rate (i.e., rate of articulation or tempo) was used as a measure of duration. It was calculated as the number of syllables per second. Due to limitations in syllable-boundary detection algorithms, an estimation of the syllables was made using the intensity contour. This was possible because all English syllables form peaks in the intensity contour. The peaks are areas of higher energy, which typically result from vowels.
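Because both the intensity cues and the syllable estimate rest on the same short-time intensity contour, a brief sketch of that contour and the cues derived from it may be useful. It follows Equation 5-3 with 20 ms windows and 50% overlap, then computes the normalized extremes and the peak-shape cues as defined in the text. The function names, the small floor that avoids taking the logarithm of zero, and the way a peak's onset and offset are passed in are assumptions of this sketch; the text defines peak boundaries only by reference to Figure 5-6.

```python
import math


def intensity_contour_db(samples, sample_rate, window_s=0.020, overlap=0.5):
    """RMS level in dB for successive windows (Equation 5-3)."""
    win = max(1, int(round(window_s * sample_rate)))
    hop = max(1, int(round(win * (1.0 - overlap))))
    contour = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        rms = math.sqrt(sum(x * x for x in frame) / win)
        # Floor the RMS value so silent frames do not produce log10(0).
        contour.append(20.0 * math.log10(max(rms, 1e-12)))
    return contour


def normalized_extremes(contour):
    """Return (iNmax, iNmin): maximum minus mean, and mean minus minimum."""
    mean = sum(contour) / len(contour)
    return max(contour) - mean, mean - min(contour)


def peak_shape_cues(t_onset, t_peak, t_offset, db_onset, db_peak):
    """Duty cycle and attack measures for one intensity peak."""
    rise_time = t_peak - t_onset
    total_duration = t_offset - t_onset
    dutycyc = rise_time / total_duration
    attack = (db_peak - db_onset) / rise_time
    return {
        "dutycyc": dutycyc,
        "attack": attack,
        "Nattack": attack / total_duration,  # attack normalized by peak duration
        "normattack": attack / dutycyc,      # attack normalized by duty cycle
    }
```

In the text these peak-shape cues are averaged over the three highest peaks of each contour; the sketch computes them for a single peak.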
Since all syllables contain vowels,
they can be represented by peaks in the intensity contour. The rate of speech can then be calculated from the number of peaks in the intensity contour. This algorithm is similar to the one proposed by de Jong and Wempe (2009), who attempted to count syllables using intensity on the decibel scale and voiced/unvoiced sound detection. However, the algorithm used in this study computed the intensity contour on the linear scale in order to preserve the large range of values between peaks and valleys. The intensity contour was first smoothed using a 7-point median filter, followed by a 7-point moving average filter. This successive filtering was observed to smooth the signal significantly, but still preserve the peaks and valleys. Then, a peak-picking algorithm was applied. The peak-picking algorithm selected peaks based on the number of reversals in the intensity contour, provided that the peaks were greater than a threshold value. Therefore, the speaking rate (srate) was the number of peaks in the intensity contour divided by the total speech sample duration.

In addition, the number of peaks in a certain window was calculated across the signal to form a speaking rate contour, or an estimate of the change in speaking rate over time. The window size and shift size were selected based on the average number of syllables per second. Evidence suggests that young adults typically express between three and five syllables per second (Laver, 1994). The window size, 0.50 seconds, was selected to include approximately two syllables. The shift size chosen was one half of the window size, or 0.25 seconds. These measurements were used to form a contour of the number of syllables per window. The slope of the best-fit linear regression equation through these points was used as an estimate of the change in speaking rate over time, or the speaking rate trend (srtrend). This calculation is shown in Figure 5-7.
The speaking rate trend was computed as a dynamic measure of the speaking rate for a sentence. This measurement resembled the f0 contour and intensity contours.
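The speaking rate measures above can be sketched in a few functions: the 7-point median and moving-average smoothing, a reversal-based peak picker with a threshold, srate as peaks per second, and srtrend as the regression slope of peak counts in 0.50 s windows shifted by 0.25 s. The edge handling of the filters, the exact reversal test, and the use of window start times as the regression abscissa are assumptions of this sketch, and all names are hypothetical.

```python
from statistics import median


def median_filter(x, width=7):
    """Running median; windows are truncated at the edges (an assumption)."""
    half = width // 2
    return [median(x[max(0, i - half):i + half + 1]) for i in range(len(x))]


def moving_average(x, width=7):
    """Running mean with the same edge handling as median_filter."""
    half = width // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out


def pick_peaks(contour, threshold):
    """Indices where the contour reverses from rising to falling above threshold."""
    return [i for i in range(1, len(contour) - 1)
            if contour[i - 1] < contour[i] >= contour[i + 1]
            and contour[i] > threshold]


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def speaking_rate(peak_times, duration_s):
    """srate: intensity peaks (syllable estimates) per second."""
    return len(peak_times) / duration_s


def speaking_rate_trend(peak_times, duration_s, window_s=0.5, shift_s=0.25):
    """srtrend: slope of the peaks-per-window contour over time."""
    starts, counts = [], []
    k = 0
    while k * shift_s + window_s <= duration_s:
        start = k * shift_s
        starts.append(start)
        counts.append(sum(1 for t in peak_times if start <= t < start + window_s))
        k += 1
    return least_squares_slope(starts, counts)
```

A typical call chain would smooth a linear-scale intensity contour with `moving_average(median_filter(contour))`, convert the indices returned by `pick_peaks` to times using the frame rate, and pass those times to `speaking_rate` and `speaking_rate_trend`.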
In addition, the vowel-to-consonant ratio (VCR) was computed as the ratio of total vowel duration to the total consonant duration within each sample. The vowel and consonant durations were measured manually by segmenting the vowels and consonants within each sample using Audition software (Adobe, Inc.). Then, Matlab (v.7.1, Mathworks, Inc.) was used to compute the VCR for each sample. The pause proportion (the total pause duration within a sentence relative to the total sentence duration, or PP) was also measured manually using Audition. A pause was defined as a non-speech silence longer than 50 ms. Since silences prior to stops were considered speech-related silences, these were not considered pauses unless the silence segment was extremely long (i.e., greater than 100 ms). Audible breaths or sighs occurring in otherwise silent segments were included as silent regions as these were non-speech segments used in prolonging the sentence. A subset of the hand measurements was obtained a second time by another individual in order to perform a reliability analysis. The method of calculating speaking rate and the parameter srtrend have not been previously examined in the literature on emotions in speech.

Voice quality

Many experiments suggest that anger can be described by a tense or harsh voice (Scherer, 1986; Burkhardt & Sendlmeier, 2000; Gobl and Chasaide, 2003). Therefore, parameters used to quantify high vocal tension or low vocal tension (related to breathiness) may be useful in describing Dimension 2. One such parameter is the spectral slope. Spectral slope may be useful as an approximation of strain or tension (Schroder, 2003, p. 109) since the spectral slope of tense voices is shallower than that for relaxed voices. Spectral slope was computed on two vowels common to all sentences. These include /aI/ within a stressed syllable and /i/ within an unstressed syllable. The spectral slope was measured using two methods.
In the first method, the alpha ratio was computed (aratio and aratio2). This is a measure of the relative amount of
low frequency energy to high frequency energy within a vowel. To calculate the alpha ratio of a vowel, the long term averaged spectrum (LTAS) of the vowel was first computed. The LTAS was computed by averaging 1024-point Hanning windows of the entire vowel. Then, the total RMS power within the 1 kHz to 5 kHz band was subtracted from the total RMS power in the 50 Hz to 1 kHz band. An alternate method for computing alpha ratio was to compute the mean RMS power within the 1 kHz to 5 kHz band and subtract it from the mean RMS power in the 50 Hz to 1 kHz band (maratio and maratio2). The second method for measuring spectral slope was by finding the slope of the line that fit the spectral peaks in the LTAS of the vowels (m_LTAS and m_LTAS2). A peak-picking algorithm was used to determine the peaks in the LTAS. Linear regression was then performed using these peak points from 50 Hz to 5 kHz. The slope of the linear regression line was used as the second measure of the spectral slope. This calculation is shown in Figure 5-8.

The cepstral peak prominence (CPP) was computed as a measure of breathiness using the executable developed by Hillenbrand and Houde (1996). CPP determines the periodicity of harmonics in the spectral domain. Higher values would suggest greater periodicity and less noise, and therefore less breathiness (Heman-Ackah et al., 2003).

Preprocessing

Before features were extracted from the f0 and intensity contours, a few preprocessing steps were performed. Fundamental frequency extraction algorithms have a certain degree of error resulting from an estimation of these values for unvoiced sounds. This can result in discontinuities in the contour (Moore, Cohn, & Katz, 1994; Reed, Buder, & Kent, 1992). As a result, manual correction or smoothing is often required to improve the accuracy of measurements from the f0 contour. The intensity contour was smoothed as well to enable easier peak-picking from the contour.
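As an aside on the alpha ratio defined in the Voice quality section, the two variants (total and mean band power) can be sketched as below. The sketch takes an already computed LTAS as input and assumes that band powers are converted to decibels before subtraction; the text does not state the scale on which the subtraction was performed, so this is an interpretation rather than the documented method. Function and parameter names are hypothetical.

```python
import math


def band_level_db(freqs, power, lo_hz, hi_hz, use_mean=False):
    """Level in dB of the total (or mean) linear power in [lo_hz, hi_hz)."""
    band = [p for f, p in zip(freqs, power) if lo_hz <= f < hi_hz]
    value = sum(band) / len(band) if use_mean else sum(band)
    return 10.0 * math.log10(value)


def alpha_ratio(freqs, power, use_mean=False):
    """Low-band (50 Hz to 1 kHz) level minus high-band (1 to 5 kHz) level."""
    low = band_level_db(freqs, power, 50.0, 1000.0, use_mean)
    high = band_level_db(freqs, power, 1000.0, 5000.0, use_mean)
    return low - high
```

Under this convention a steeper spectral slope (relatively less high frequency energy, as expected for relaxed or breathy voices) yields a larger alpha ratio, while a tense voice with a shallower slope yields a smaller one.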
A median filter was used for smoothing both the intensity and f0 contours. The output of the filter was computed by selecting a window containing an odd number
of samples, sorting the samples, and then computing the median value of the window (Restrepo & Chacon, 1994). The median value was the output of the filter. The window was then shifted forward by a single sample and the procedure was repeated. Both the f0 contour and the intensity contour were filtered using a five-point median filter with a forward shift of one sample.

Before the f0 contour was filtered, a few steps were taken to attempt to remove any discontinuities in the contour. First, any value below 50 Hz was forced to zero. Although male fundamental frequencies can reach 40 Hz, values below 50 Hz were frequently in error. Segments below 50 Hz were compared with the waveform to verify that these values were errors in f0 calculation and not, in fact, the actual f0. Second, some discontinuities occurred at the beginning or end of a period of voicing and were typically preceded or followed by a short section of incorrect values. To remove these errors, two successive samples in a window that differed by 50 Hz or more were marked, since this typically indicated a discontinuity. These samples were compared to the mean f0 of the sentence. If the first marked sample was greater than or less than the mean by 50 Hz, then all samples of the voiced segment prior to and including this sample were forced to zero. Alternately, if the second marked sample was greater than or less than the mean by 50 Hz, then this sample was forced to zero. The first marked sample was then compared with each following sample until the difference no longer exceeded 50 Hz.

Feature Selection

Once the acoustic cues were computed for each sample, a feature selection process was required to determine the cues that corresponded to the emotions on each dimension. Feature selection is the process of systematically selecting the best acoustic features along a dimension, i.e., the features that explain the most variance in the data.
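The contour smoothing described in the Preprocessing section (forcing f0 values below 50 Hz to zero, then applying a five-point median filter with a one-sample shift) can be sketched as follows. The function names are hypothetical, and the treatment of the contour edges (repeating the first and last values so that every window holds five samples) is an assumption, since the text does not say how the ends of the contour were handled. The discontinuity-marking step is not included.

```python
from statistics import median

def floor_low_f0(f0_contour, floor_hz=50):
    """Force any f0 value below floor_hz to zero (treated as an f0 error)."""
    return [0.0 if value < floor_hz else value for value in f0_contour]

def median_filter(contour, width=5):
    """Sliding median with an odd window, shifted forward one sample at a time.

    Assumption: the contour is padded by repeating its edge values.
    """
    if width % 2 == 0:
        raise ValueError("the window must contain an odd number of samples")
    if not contour:
        return []
    half = width // 2
    padded = [contour[0]] * half + list(contour) + [contour[-1]] * half
    # Each output sample is the median of the window centred on that sample.
    return [median(padded[i:i + width]) for i in range(len(contour))]
```

In this sketch an isolated spike, such as a single 400 Hz value inside a stretch of 100 Hz values, is replaced by the surrounding value, which is the behaviour that makes the median filter useful for removing short f0 tracking errors.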
The feature selection approach used in this experiment involved a linear regression analysis. SPSS was used to compute stepwise linear
regressions to select the set of acoustic measures (independent variables) that best explained the emotion properties for each dimension (dependent variable). Stepwise regressions were used to find the acoustic cues that accounted for a significant amount of the variance among stimuli on each dimension. A mixture of the forward and backward selection models was used, in which the independent variable that explained the most variance in the dependent variable was selected first, followed by the independent variable that explained the most of the residual variance. At each step, the independent variables that were significant at the 0.05 level were included in the model (entry criteria p 0.28) and predictors that were no longer significant were removed (removal criteria p 0.29). The optimal feature set included the minimum set of acoustic features that were needed to explain the perceptual changes relevant for each dimension. The relations between the acoustic features and the dimension models were summarized in regression equations.

Since this analysis assumed that only a linear relationship exists between the acoustic parameters and the emotion dimensions, scatterplots were used to confirm the linearity of the relevant acoustic measures with the emotion dimensions. Parameters that were nonlinearly related to the dimensions were transformed as necessary to obtain a linear relation. The final regression equations are referred to as the acoustic dimension models and formed the preliminary acoustic model of emotion perception in SS.

To determine whether an acoustic model based on a single sentence or speaker was better able to represent perception, the feature selection process was performed multiple times using different perceptual models. For the training set, separate perceptual MDS models were developed for each speaker (Speaker 1, Speaker 2) in addition to the overall model based on all samples.
For the test1 set, separate perceptual MDS models were developed for each speaker
(Speaker 1, Speaker 2), each sentence (Sentence 1, Sentence 2), and each sentence by each speaker (Speaker 1 Sentence 1, Speaker 1 Sentence 2, Speaker 2 Sentence 1, Speaker 2 Sentence 2), in addition to the overall model based on all samples from both speakers.

Model Classification Procedures

The acoustic dimension models were then used to classify the samples within the trclass and test1 sets. The acoustic location of each sample was computed based on its acoustic parameters and the dimension models. The speech samples were classified into one of four emotion categories using the k-means algorithm. The emotions that comprised each of the four emotion categories were previously determined in the hierarchical clustering analysis in Chapter 2. These included Clusters or Categories 1 through 4 or happy, content-confident, angry, and sad. The labels for these categories were selected as the terms most frequently chosen as the modal emotion term by participants in Chapter 2. The label "sad" was the only exception. The term "sad" was used instead of "love," since this term is more commonly used in most studies and may be easier to conceptualize than "love."

The k-means algorithm classified each test sample as the emotion category closest to that sample. To compute the distance between the test sample and each emotion category, it was necessary to determine the center point of each category. These points acted as the optimal acoustic representation of each emotion category and were based on the training set samples. Each of the four center points was computed by averaging the acoustic coordinates across all training set samples within each emotion category. For example, the center point for Category 2 (angry) was calculated as an average of the coordinates of the two angry samples. On the other hand, the coordinates for the center of Category 1 (sad) were computed as an average of the two samples for bored, embarrassed, lonely, exhausted, love, and sad.
Similarly, the center point for happy or Category 3 was computed using the samples from happy, surprised, funny, and anxious,
and Category 4 (content/confident) was computed using the samples from annoyed, confused, jealous, confident, respectful, suspicious, content, and interested.

The distances between the test set sample (from either the trclass or test1 set) and each of the four center points were calculated using the Euclidean distance formula as follows. First, the 3D coordinates of the test sample and the center point of an emotion category were subtracted to determine distances on each dimension. Then, these distances were squared and summed together. Finally, the square root of this number was calculated as the emotion distance (ED). This is summarized in Equation 5-4 below.

$$ED = \left[(\Delta\,\text{Dimension 1})^2 + (\Delta\,\text{Dimension 2})^2 + (\Delta\,\text{Dimension 3})^2\right]^{1/2} \tag{5-4}$$

For each sample, the ED between the test point and each of the four center emotion category locations was computed. The test sample was classified as the emotion category that was closest to the test sample (the category for which the ED was minimal).

The model's accuracy in emotion predictions was calculated as percent correct scores and d′ scores. Percent correct scores (i.e., the hit rate) were calculated as the number of times that all emotions within an emotion category were correctly classified as that category. For example, the percent correct for Category 1 (sad) included the bored, embarrassed, exhausted, and sad samples that were correctly classified as Category 1 (sad). However, it was previously suggested that the percent correct score may not be a suitable measure of accuracy, since this measure does not account for the false alarm rate. In this case, the false alarm rate was the number of times that all emotions not belonging to a particular emotion category were classified as that category. For example, the false alarm rate for Category 1 (sad) was the number of times that angry, annoyed, anxious, confident, confused, content, and happy were incorrectly classified as Category 1 (sad).
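The classification step described above (category center points averaged from the training samples, followed by assignment to the category with the smallest emotion distance) can be sketched as follows. As described, the centers are fixed from the training set, so the sketch performs a single nearest-center assignment rather than an iterative k-means update. The function names and the coordinates used in the usage example are hypothetical.

```python
import math

def centroid(points):
    """Average the coordinates of the training samples in one category."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def emotion_distance(sample, center):
    """Euclidean distance (ED): root of the summed squared differences."""
    return math.sqrt(sum((s - c) ** 2 for s, c in zip(sample, center)))

def classify(sample, centers):
    """Return the label of the category whose center has the minimal ED."""
    return min(centers, key=lambda label: emotion_distance(sample, centers[label]))

# Usage with made-up 3D coordinates (illustrative values only).
training = {
    "angry": [(2.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    "sad": [(-3.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
}
centers = {label: centroid(points) for label, points in training.items()}
predicted = classify((1.0, 0.0, 0.0), centers)
```

The same functions work unchanged for a 2D acoustic space, because the distance is summed over however many coordinates each point carries.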
Therefore, the parameter d′ was used in addition to
percent correct scores as a measure of model performance, since this measure accounts for the false alarm rate in addition to the hit rate.

Two-Dimensional Perceptual Model

Preliminary results suggested that the outcomes of the feature selection process might have been biased by noise since many of the 19 emotions were not easy for listeners to perceive. Therefore, the entire analysis reported in this chapter was completed using 11 emotions (the emotions formed at a clustering level of 1.95). To obtain the overall model representing the new training set, an MDS analysis using the ALSCAL procedure was performed on the 11 emotions (the d′ matrix for these emotions is shown in Table 5-5). Since the new training set was equivalent to the trclass set, these will henceforth be referred to as the training set. Analysis of the R-squared and stress measures as a function of the dimensionality of the stimulus space revealed that a 2D solution was optimal instead of a 3D solution as previously determined (R-squared and stress are shown in Figure 5-9). Therefore, the 2D solution was adopted for model development and testing. The locations of the emotions in the 2D stimulus space are shown in Figure 5-10, and the actual MDS coordinates for each emotion are shown in Table 5-6. These dimensions were very similar to the original MDS dimensions.

Since both dimensions of the new perceptual model closely resembled the original dimensions, the original acoustic predictions were still expected to apply. Dimension 1 separated the happy and sad clusters, particularly anxious from embarrassed. As previously predicted in Chapter 3, this dimension may separate emotions according to the gross f0 trend, rise and/or fall time of the f0 contour peaks, and speaking rate. Dimension 2 separated angry from sad, potentially due to voice quality (e.g., mean CPP and spectral slope), emphasis (attack time), and the vowel-to-consonant ratio.
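The d′ measure used alongside percent correct combines the hit rate and the false alarm rate. The sketch below assumes the standard signal detection definition, d′ = z(hit rate) − z(false alarm rate), where z is the inverse of the standard normal cumulative distribution. This section does not state the exact formula or how perfect scores were handled, so the clamping of rates to 1/(2N) and 1 − 1/(2N) is an assumption, and the function name is hypothetical.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false alarm rate) from raw response counts.

    Assumption: rates of exactly 0 or 1 are clamped to 1/(2N) and
    1 - 1/(2N) so that the inverse normal transform stays finite.
    """
    def clamped_rate(count, total):
        lowest = 1 / (2 * total)
        return min(max(count / total, lowest), 1 - lowest)

    hit_rate = clamped_rate(hits, hits + misses)
    fa_rate = clamped_rate(false_alarms, false_alarms + correct_rejections)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)
```

Under this definition, equal hit and false alarm rates give a d′ of zero regardless of how high the percent correct is, which is why the measure is reported in addition to the hit rate.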
The classification procedures were modified accordingly to include the reduced training set. The four emotion categories forming the training set now consisted of the same emotions as the test sets. Category 1 (sad) included bored, embarrassed, exhausted, and sad. Category 2 (angry) was still based on only the emotion angry. Category 3 (happy) consisted of happy and anxious, and Category 4 (content/confident) included annoyed, confused, confident, and content.

Perceptual Experiment

Perceptual judgments of one sentence expressed in 19 emotional contexts by two speakers were obtained in Chapter 2 using a discrimination task. Although two sentences were expressed by both speakers, only one sentence from each speaker was used for model development in order to obtain each speaker's best expression. This permitted an assessment of a large number of emotions at the cost of a limited number of speakers. However, an analysis by sentence was necessary to ensure that both sentences were perceived equally well in SS. This required an extra perceptual test in which both sentences expressed by both speakers were evaluated by listeners. Thus, the test1 set sentences were evaluated along with additional speakers in an 11-item identification task described in Experiment 2. Perceptual estimates of the speech samples within only the training and test1 sets are summarized here to compare the classification results of the model to listener perception.

Perceptual Data Analysis

Although an 11-item identification task was used, responses for emotions within each of the four emotion categories were aggregated and reported in terms of accuracy per emotion category. This procedure was performed to parallel the automatic classification procedure. In addition, this method enables future assessment of perception for a larger set of emotion categories (e.g., 6, 11, or 19).
Identification accuracy of the emotions was assessed in terms of percent correct and d′. These computations were equivalent to those made for calculating model
performance using the k-means classifier. Percent correct scores were calculated as the number of times that an emotion was correctly identified as any emotion within a category. For example, correct judgments for Category 1 (happy) included happy judged as happy or anxious, and anxious judged as anxious or happy. Similarly, bored samples judged as bored, embarrassed, exhausted, or sad (i.e., the emotions comprising the same category as bored) were among the judgments accepted as correct for that category. In addition, the d′ scores were computed as a measure of listener performance that normalizes the percent correct scores by the false alarm rates (i.e., the number of times that any emotion from three emotion categories was incorrectly identified as the fourth emotion category).

Results

The validity of the model was tested by comparing the perceptual and acoustic spaces of the training set samples. Similar acoustic spaces would suggest that the acoustic cues selected to describe the emotions are representative of listener perception. This analysis was completed for each speaker to determine whether a particular speaker better described listener perception than an averaged model. An additional test of validity was performed by classifying the emotions of the training set samples into four emotion categories. Two basic classification algorithms were implemented, since the goal of this experiment was to develop an appropriate model of emotion perception instead of the optimal emotion classification algorithm. The classification results were then compared to listener accuracy to estimate model performance relative to listener perception.

The ability of the model to generalize to novel sentences by the same speakers was analyzed by comparing the perceptual space of the training set samples with the acoustic space of the test1 set samples. In addition, the test1 set samples were also classified into four emotion categories.
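The aggregation of 11-item identification responses into category-level hit and false alarm rates, as described in the Perceptual Data Analysis section, can be sketched as follows. The emotion-to-category mapping follows the reduced training set described earlier in this chapter; categories are keyed by label rather than by number. The function name and the judgment pairs in the test data are hypothetical, and the sketch assumes that each category has at least one target and one non-target judgment.

```python
# Emotion-to-category mapping for the 11 emotions of the reduced set.
CATEGORY_OF = {
    "bored": "sad", "embarrassed": "sad", "exhausted": "sad", "sad": "sad",
    "angry": "angry",
    "happy": "happy", "anxious": "happy",
    "annoyed": "content-confident", "confused": "content-confident",
    "confident": "content-confident", "content": "content-confident",
}

def category_rates(judgments, category):
    """Return (hit rate, false alarm rate) for one emotion category.

    judgments is a list of (intended emotion, judged emotion) pairs.
    A within-category confusion counts as a hit.
    """
    hits = misses = false_alarms = correct_rejections = 0
    for intended, judged in judgments:
        is_target = CATEGORY_OF[intended] == category
        chose_target = CATEGORY_OF[judged] == category
        if is_target and chose_target:
            hits += 1
        elif is_target:
            misses += 1
        elif chose_target:
            false_alarms += 1
        else:
            correct_rejections += 1
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return hit_rate, fa_rate
```

With this scheme a bored sample judged as sad counts toward the hit rate of the sad category, whereas an angry sample judged as sad counts toward that category's false alarm rate.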
To confirm that the classification results were not influenced by the speaker model or the linguistic prosody of the sentence, these samples were classified according to
multiple speaker and sentence models. Specifically, five models were developed and tested (two speaker models, two sentence models, and one averaged model). The results are reported in this section.

Perceptual test results

Perceptual judgments of the training and test1 sets were obtained from an 11-item identification task. Accuracy for the training set was calculated after including within-category confusions for each speaker and across both speakers. Since some samples were not perceived above chance level (1/11 or 0.09), two methods were employed for dropping samples from the analysis. In the first procedure, samples identified at or below chance level were dropped. For the training set, only the content sample by Speaker 1 was dropped, since listeners correctly judged this sample as content only nine percent of the time. However, this analysis did not account for within-cluster confusions. In certain circumstances, such as when the sample was confused with other emotions within the same emotion cluster, the low accuracy could be overlooked. Similarly, some sentences may have been recognized with above-chance accuracy, but were more frequently categorized as an incorrect emotion category. Therefore, a second analysis was performed based on the emotion cluster containing the highest frequency of judgments. Samples that were not judged as the correct emotion cluster after the appropriate confusions were aggregated were excluded. The basis for this exclusion is that these samples were not valid representations of the intended emotion. Accordingly, the bored and content samples were dropped from Speaker 1 and the confident and exhausted samples were dropped from Speaker 2.

Results are shown in Table 5-7. When all sentences were included in the analysis, accuracy was at d′ of 2.06 (83%) for Category 1 (happy), 1.26 (63%) for Category 2 (content-confident), 3.20 (92%) for Category 3 (angry), and 2.17 (68%) for Category 4 (sad).
After dropping the sentence perceived at chance level, Category 2 improved to 1.43
(70%). After the second exclusion criterion was implemented, Category 2 improved to 1.84 (74%) and Category 4 improved to 2.17 (77%). It is clear that the expressions from Categories 1 and 3 were substantially easier to recognize from the samples from Speaker 1 (2.84 and 3.95, respectively, as opposed to 1.74 and 3.11). Speaker 1 samples from Category 4 were also better recognized than Speaker 2 samples. This pattern was apparent through analyses using exclusion criteria as well. On the other hand, Speaker 2 samples for Category 2 were identified with equal accuracy as the Speaker 1 samples.

To perform an analysis by sentence, accuracy for the test1 set was computed for each speaker, each sentence, and across both speakers and sentences. Reanalysis using the same two exclusionary criteria was also implemented. Results are shown in Table 5-8. In the analysis of all sentences, differences in the accuracy perceived for the two sentences were small (difference in d′ of less than 0.18) for all categories. The reanalysis using only the Above Chance Sentences did not change this difference. However, the reanalysis using the Correct Category Sentences resulted in an increase in these sentence differences, in favor of Sentence 2. Nevertheless, since a small sample was used and the difference in d′ scores was small (less than 0.42), it is not clear whether a true sentence effect is present.

Acoustic measures

The acoustic features were computed for the training and test1 set samples using the procedures described above. Most features were computed automatically in Matlab (v.7.0), although a number of features were automatically computed using hand-measured vowels, consonants, and pauses. The raw acoustic measures are shown in Table 5-9.

Reliability of hand measurements

Acoustic measurements of speech performed manually are often subject to high variability. This may result in measurements that are not replicable, thereby invalidating results based on
these measurements. Since the true value of such measurements is often subjective, it is necessary for the measurements to be repeatable. To confirm that the hand measurements of pause, vowel, and consonant durations made in the present experiment were reliable, a colleague (Judge 2) was asked to perform these measurements on a subset of the stimuli. Recall that the training set and test1 set stimuli were formed from a subset of the total stimuli recorded for model development (19 emotions X 2 speakers X 2 sentences or 76 stimuli). Measurements made on 20% of these speech samples, or 16 sentences, were repeated. The reliability of the total pause duration, vowel duration for the stressed and unstressed vowels used in the computation of measures such as spectral slope, and the vowel-to-consonant ratio (VCR) was determined. Measurements made by the author and Judge 2 were correlated using Pearson's correlation coefficient. The pause duration, both vowel duration measures, and the VCR were highly correlated (0.99, 0.99, 0.93, and 0.95, respectively), suggesting that the hand measurements were reliable. Results are shown in Table 5-10.

Dimension models

To develop an acoustic model of emotion perception in SS, it was necessary to perform a feature selection process to determine the acoustic features that corresponded to each dimension of each perceptual model. Twelve two-dimensional perceptual models were developed. These included an overall model and two speaker models using the training set and an overall model, two speaker models, two sentence models, and four sentence-by-speaker models using the test1 set samples. Stepwise regressions were used to determine the acoustic features that were significantly related to the dimensions for each perceptual model. The significant predictors and their coefficients are summarized in regression equations shown in Table 5-11.
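The inter-judge reliability check described above, in which the two judges' hand measurements were correlated with Pearson's correlation coefficient, can be sketched as follows. The function name is hypothetical and the duration values in the usage example are invented for illustration; they are not the measurements reported in Table 5-10.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Usage with made-up pause durations (seconds) from two judges.
author_pauses = [0.12, 0.30, 0.05, 0.41, 0.22]
judge2_pauses = [0.13, 0.29, 0.06, 0.40, 0.24]
reliability = pearson_r(author_pauses, judge2_pauses)
```

A coefficient close to 1 indicates that the two sets of measurements rise and fall together, although it does not by itself show that the two judges agree in absolute value.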
These equations formed the acoustic model and were used to describe each speech sample in a 2D acoustic space. The acoustic model that described the Overall training set model included the parameters
aratio2, srate, and pnorMIN for Dimension 1 (parameter abbreviations are outlined in Table 5-4). These cues were predicted to correspond to Dimension 1 because this dimension separated emotions according to energy or activation. Dimension 2 was described by normattack (normalized attack time of the intensity contour) and normpnorMIN (normalized minimum pitch, normalized by speaking rate) since Dimension 2 seemed to perceptually separate angry from the rest of the emotions by a staccato-like prosody. Interestingly, these cues were not the same as those used to describe the overall model of the test1 set. Instead of pnorMIN and aratio2 for Dimension 1, iNmax (normalized intensity maximum), pnorMAX (normalized pitch maximum), and dutycyc (duty cycle of the intensity contour) were included in the model. Dimension 2 included srate, mpkrise (mean f0 peak rise time), and srtrend (speaking rate trend).

It is possible that cues such as pnorMIN and pnorMAX were highly correlated and thus substituted for one another with little loss of information. To determine whether the inclusion of different acoustic features across the various models was due to the inclusion of highly correlated parameters in the initial feature set, a correlation matrix of the acoustic features was formed by computing the Pearson's correlation among all acoustic parameters using the training set samples. This matrix is shown in Table 5-12. Results showed that normattack was highly correlated with attack and Nattack (0.99 and 0.96, respectively), attack was highly correlated with Nattack (0.98), and normpnorMIN was highly correlated with pnorMIN (0.91). This suggests that some parameters can be substituted for one another (such as pnorMIN and pnorMAX), while others cannot (such as aratio2 and iNmax).

To determine how closely the acoustic space represented the perceptual space, the predicted acoustic values and the perceived MDS values were plotted in the 2D space.
However, the MDS coordinates for the perceptual space are somewhat arbitrary. As a result, a
normalization procedure was required. The perceived MDS values and each speaker's predicted acoustic values for all 11 emotions of the training set were converted into standard scores (z-scores) and then graphed using the Overall model (shown in Figure 5-11) and the two speaker models (shown in Figure 5-12). From these figures it is clear that the individual speaker models better represented their corresponding perceptual models than the Overall model. Nevertheless, the Speaker 2 acoustic model did not perform as well at representing the Speaker 1 samples for emotions such as happy, anxious, angry, exhausted, sad, and confused. The Speaker 1 model was able to separate Category 3 (angry) very well from the remaining emotions based on Dimension 2. Most of the samples for Category 4 (sad) matched the perceptual model based on Dimension 1, except the sad sample from Speaker 2. In addition, the Speaker 2 samples for happy, anxious, embarrassed, content, confused, and angry were far from the perceptual model values. In other words, the individual speaker models resulted in a better acoustic representation of the samples from the respective speaker; however, these models were not able to generalize as well to the remaining speaker. Therefore, the Overall model may be a more generalizable representation of perception, as this model was able to place most samples from both speakers in the correct ballpark of the perceptual model.

The predicted and perceived values were also computed for the test1 set using the Overall perceptual model formed from the test1 set. Since this set contained two samples from each speaker, the acoustic predictions for each speaker using the Overall test1 set model are shown separately in Figure 5-13. These results were then compared to the predicted values for the test1 set obtained using the Overall training set model (shown in Figure 5-14).
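The normalization described above, in which the perceived MDS values and the predicted acoustic values were converted into z-scores before being plotted together, can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev

def z_scores(values):
    """Convert one dimension's values to standard scores.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    centre = mean(values)
    spread = stdev(values)
    return [(value - centre) / spread for value in values]

def standardize_points(points):
    """Standardize each dimension of a list of coordinate tuples separately."""
    columns = list(zip(*points))           # one tuple per dimension
    z_columns = [z_scores(list(col)) for col in columns]
    return [tuple(row) for row in zip(*z_columns)]
```

Because each dimension is rescaled to zero mean and unit standard deviation, perceptual and acoustic coordinates that were originally on different, somewhat arbitrary scales can be compared in the same 2D plot.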
The predicted values obtained using the Overall training set model seemed to better match the perceived values, particularly for the Speaker 2 samples. Specifically, Categories 3 and 4 (angry and sad) were
closer to the perceptual MDS locations of the Overall training set model; however, the better model was not evident through visual analysis. In order to identify the better model, these samples were classified into separate emotion categories. Results are reported in the Model predictions section below.

Linearity analysis

In order to validate the assumption of a linear relation between the acoustic cues included in the model and the perceptual model, scatterplots were formed using the perceived values obtained from the Overall perceptual model based on the training set and the corresponding predicted acoustic values. These are shown in Figure 5-15 for Dimension 1 and Figure 5-16 for Dimension 2. Although these graphs depict a high amount of variability (R-squared values ranging from 0.347 to 0.722 for Dimension 1 and 0.007 to 0.417 for Dimension 2), these relationships were best represented as linear. Therefore, the use of stepwise regressions as a feature selection procedure using the non-transformed, relevant acoustic parameters was validated.

Model predictions

The acoustic model was first evaluated by visually comparing how closely the predicted acoustic values matched the perceived MDS values in a 2D space. Another method that was used to assess model accuracy was to classify the samples into the four emotion categories (happy, content-confident, angry, and sad). Classification was performed using the three acoustic models for the training set and the nine acoustic models for the test1 set. The k-means algorithm was used as an estimate of model performance. Accuracy was calculated for each of the four emotion categories in terms of percent correct and d′.

Results for the training set are reported in Table 5-13. Classification was performed for all samples, samples by Speaker 1 only, and samples by Speaker 2 only using three acoustic models (the Overall, Speaker 1, and Speaker 2 models).
On the whole, the Overall model resulted in the best compromise in classification performance for
both speakers. This model performed best at classifying all samples and better than the Speaker 2 model at classifying the samples from Speaker 2. Performance for Categories 1, 2, and 4 for the samples from Speaker 1 only was not as good as the Speaker 1 model (d′ scores of 3.8, 1.74, and 3.25 as opposed to 5.15 for all categories for the Speaker 1 model). However, the Speaker 1 model was not as accurate on the whole as the Overall model. The Speaker 2 model was almost as good as the Overall model for classification of all samples. However, accuracy was lower for Categories 3 and 4 than the Overall model. These results suggest that the Overall model was the best of the three models. This model was equally good at classifying Category 3 (angry) for samples from both speakers, but better at classifying Categories 1, 2, and 4 (happy, content-confident, and sad) for the Speaker 2 samples.

In order to determine how closely these results matched listener performance, the accuracy rates of the Overall model were compared to the accuracy of perceptual judgments (shown in Table 5-7). The Overall acoustic model was better (in percent correct and d′ scores) at classifying all samples from the training set into four categories than listeners. These results were apparent for all four categories and for each speaker. While the use of exclusion criteria improved the resulting listener accuracy, performance of the acoustic model was still better than listener perception for both the Above Chance Sentences and Correct Category Sentences analyses.

The test1 set was also classified into four emotion categories using the k-means algorithm. Classification was first performed for all samples, samples by Speaker 1 only, samples by Speaker 2 only, samples expressed using Sentence 1 only, and samples expressed using Sentence 2 only according to the Overall test1 set model and the Overall training set model. Results are shown in Table 5-14.
The performance of the Overall training set model was better than the Overall
test1 set model for all emotion categories. While the percent correct rates were comparable for Categories 1 and 4 (happy and sad), a comparison of the d′ scores revealed higher false alarm rates and thus lower d′ scores for the Overall test1 set model across all emotion categories. The accuracy of the Overall test1 model was consistently worse than listeners for all samples and for the individual speaker samples. In contrast, the Overall training set model was better than listeners at classifying three of four emotions in terms of d′ scores (Category 3 had a slightly smaller d′ of 2.63 compared to listeners at 2.85).

Consistent with the classification results for all samples, the Overall training model was generally better than the Overall test1 set model at classifying samples from both speakers. However, differences in classification accuracy were apparent by speaker for the Overall training set model. This model was better able to classify the samples from Speaker 2 than Speaker 1 with the only exception of Category 4 (sad). In contrast, the Overall test1 set model was better at classifying Categories 2 and 3 (content-confident and angry) for the Speaker 1 samples and Categories 1 and 4 (happy and sad) for the Speaker 2 samples. Neither of these patterns was representative of listener perception, as listeners were better at recognizing the Speaker 1 samples from all emotion categories. Listeners were in fact better than the Overall training set model at identifying Categories 1, 2, and 3 from Speaker 1. However, the Overall training set model's accuracy for the Speaker 2 samples was much better than listeners across all emotion categories.

No clear difference in performance by sentence was apparent for the Overall training set model. Categories 1 and 3 (happy and angry) were easier to classify from the Sentence 2 samples, but Category 4 (sad) was the reverse case.
On the other hand, the Sentence 2 samples were easier to classify for Categories 1, 3, and 4 according to the Overall test1 set model. The
Overall training set model matched the pattern of listener perception (shown in Table 5-8 for the test1 set) for the two sentences better than the Overall test1 set model. Category 3 was the only discrepancy, in which Sentence 2 was better recognized by the Overall training set model, but Sentence 1 was slightly easier for listeners to recognize. In addition, classification accuracy was generally higher than listener perception. Since the differences in classification and perceptual accuracy between the two sentences were generally small and varied by category, it is likely that these are not due to a sentence effect. These differences may be random variability or a result of the slightly stronger speaker difference.

A final test was performed to evaluate whether any single speaker or sentence model was better than the Overall training set model at classifying the four emotion categories. Classification was performed using the two training set speaker models and the four test1 set speaker and sentence models for all samples, samples by Speaker 1 only, samples by Speaker 2 only, Sentence 1 samples, and Sentence 2 samples. Results are shown in Table 5-15. In general, the two training set speaker models were better at classification than the test1 set models. These models performed similarly in classifying all samples. The Sentence 2 test1 model was the only model that came close to outperforming any of the training set models. This model's classification accuracy was better than all training set models for Categories 1 and 2 (happy and content-confident). However, it was not better than the Overall training set model or listener perception for Categories 3 and 4 (angry and sad). Therefore, the model that performed best overall was the Overall training set model. This model will be used in further testing.

Experiment 2: Evaluating the Model

The purpose of this second experiment was to test the ability of the acoustic model to generalize to novel samples.
This was achieved by testing the model's accuracy in classifying expressions from novel speakers. Two nonsense sentences used in previous experiments and one novel nonsense sentence were expressed in 11 emotional contexts by 10 additional speakers. These samples were described in an acoustic space using the models developed in Experiment 1. The novel tokens were classified into four emotion categories (happy, sad, angry, and confident) using two classification algorithms. Classification was limited to four emotion categories since these emotions were well-discriminated in SS. These category labels were the terms most frequently chosen as the modal emotion term by participants in the pile-sort task described in Chapter 2, except sad (the more commonly used term in the literature). These samples were also evaluated in a perceptual identification test, which served as the reference for comparison of classification accuracy. In both cases, accuracy was measured in d′ scores. A high agreement between classification and listener accuracy would confirm the validity of the perceptual-acoustic model developed in Experiment 1.

Participants

A total of 21 individuals were recruited to participate in this study. Ten participants (5 males, 5 females) served as the speakers. Their speech was used to develop the stimulus set. The remaining 11 participants were naïve listeners (1 male, 10 females) who participated in the listening test. The eligibility criteria and screening procedures were identical to those described in Chapter 2.

Stimuli

Ten participants expressed three nonsense sentences in 11 emotional contexts while being recorded. Two nonsense sentences were the same as those used in model development. The final sentence was a novel nonsense sentence ("The borelips are leeming at the waketowns"). Participants were instructed to express the sentences using each of the following emotions: happy, anxious, annoyed, confused, confident, content, angry, bored, exhausted, embarrassed, and sad. The elicitation procedures were the same as those used in Chapter 2. All recordings for
each participant were obtained within a single session. These sentences were saved as 330 individual files (10 speakers X 11 emotions X 3 sentences) for use in the following perceptual task and model testing. This set will henceforth be referred to as the test2 set. The stimuli evaluated in the perceptual test included the 330 samples (10 speakers X 11 emotions X 3 sentences) from the test2 set and the 44 samples from the training set (2 speakers X 11 emotions X 2 sentences). This resulted in a total of 374 samples.

Identification Task Procedures

A perceptual task was performed in order to develop a reference to gauge classification accuracy. Participants were asked to identify the emotion expressed by each speech sample using an 11-item, closed-set identification task. In each trial, one sample was presented binaurally at a comfortable loudness level using a high-fidelity sound card and headphones (Sennheiser HD280Pro). The 11 emotions were listed in the previous section. All stimuli were randomly presented 10 times, resulting in 3740 trials (374 samples X 10 repetitions). Participants responded by selecting the appropriate button shown on the computer screen using a computer mouse. Judgments were made using software developed in MATLAB (version 7.1; Mathworks, Inc.). The experiment took between 6.5 and 8 hours of test time and was completed in 4 sessions. The number of times each sample was correctly and incorrectly identified was entered into a similarity matrix to determine the accuracy of classification and the confusions. Identification accuracy of emotion type was calculated in terms of percent correct and d′. The data analysis procedures were described in Experiment 1.

Classification Procedures

To assess how well the acoustic model represents listener perception, each sample was classified into one of four emotion categories. Classification was performed using two algorithms, the k-means and the k-nearest neighbor (kNN) algorithms.
The ability of the acoustic model to predict the emotions of each sample was measured using percent correct and d′ scores. These results were compared to listener accuracy for these samples to evaluate the performance of the acoustic model relative to human listeners.

The classification procedures for the k-means algorithm were described previously in Experiment 1. Briefly, this algorithm classified a test sample as the emotion category closest to that sample. The proximity of the test sample to the emotion category was determined by computing a center point of each emotion category. The kNN algorithm classified a test sample as the emotion category belonging to the majority of its k nearest samples. The samples used as a comparison were the samples included in the development of the acoustic model (i.e., the reference samples). It was necessary to calculate the distance between the test sample and each reference sample to determine the nearest samples. The distances between all samples were computed using Equation 5-4. The k closest samples were analyzed further for k = 1 and 3. For k = 1, the emotion category of the test sample was selected as the category of the closest reference sample. For k = 3, the category of the test sample was chosen as the emotion category represented by the majority of the three closest reference samples. Once again, accuracy in emotion category predictions was calculated as percent correct and d′ scores.

Results

In Experiment 1, acoustic models of emotion perception were developed. The optimal model was determined to be the Overall training set model. The present experiment investigated the ability of the Overall training set model to acoustically represent the emotions from 10 unfamiliar speakers. This was evaluated using two classification algorithms. Samples from 11 emotions were classified into four emotion categories. The results were compared to listener perception and are described below.
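
The two classification rules summarized in the Classification Procedures can be illustrated with a minimal Python sketch. It is not the software used in this study. It assumes a Euclidean distance as a stand-in for Equation 5-4 (which is not reproduced in this section), and the reference coordinates, the test sample, and all function names are hypothetical.

```python
import math
from collections import Counter

def distance(a, b):
    # Euclidean distance: an ASSUMED stand-in for Equation 5-4.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def category_centers(reference):
    # Center point (mean coordinate) of the reference samples in each category.
    grouped = {}
    for point, label in reference:
        grouped.setdefault(label, []).append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in grouped.items()
    }

def classify_by_center(sample, centers):
    # Center-based rule described for the k-means classifier:
    # choose the category whose center point is closest to the sample.
    return min(centers, key=lambda label: distance(sample, centers[label]))

def classify_knn(sample, reference, k):
    # kNN rule: majority category among the k closest reference samples.
    nearest = sorted(reference, key=lambda item: distance(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # Equal counts keep first-seen order, so a tie goes to the nearer neighbor.
    return votes.most_common(1)[0][0]

# HYPOTHETICAL reference samples in a 2D acoustic space (made-up coordinates).
reference = [
    ((1.0, 1.0), "happy"), ((1.2, 0.8), "happy"),
    ((-1.0, 1.0), "content-confident"), ((-1.2, 0.9), "content-confident"),
    ((1.0, -1.0), "angry"), ((0.9, -1.2), "angry"),
    ((-1.0, -1.0), "sad"), ((-0.8, -1.1), "sad"),
]
test_sample = (0.8, -0.9)  # hypothetical novel token

centers = category_centers(reference)
center_label = classify_by_center(test_sample, centers)
knn1_label = classify_knn(test_sample, reference, k=1)
knn3_label = classify_knn(test_sample, reference, k=3)
print(center_label, knn1_label, knn3_label)
```

In this sketch the tie-breaking rule for k = 3 (fall back to the nearest neighbor when all three neighbors disagree) is a design choice of the example; the text does not state how such ties were handled.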
Perceptual test results

All speech samples within the test2 set were evaluated by listeners in an 11-item identification task. Accuracy was calculated by including confusions within the four emotion categories. As described in the previous experiment, accuracy in terms of percent correct scores and d′ scores was computed using three procedures. First, the entire test2 set was analyzed. The remaining two procedures involved exclusion criteria for removing samples from the analysis. The first of these eliminated samples that were perceived at chance level or less based on the percent correct identification of 11 emotions. Accordingly, 55 (16.5%) samples were discarded from this analysis. The second exclusion criterion involved dropping samples that were misclassified after the within-category confusions were calculated and summed across all listeners. This resulted in the removal of 88 (26.7%) samples, which included some but not all of the samples dropped using the first exclusion rule. Results are shown in Table 5-16.

When all sentences were included in the analysis, accuracy was 46% for Category 1 (happy), 75% for Category 2 (content-confident), 40% for Category 3 (angry), and 67% for Category 4 (sad). After dropping the sentences perceived at chance level, all categories improved to 52%, 76%, 47%, and 73%, respectively. After the second exclusion criterion was implemented, all categories improved to 72%, 79%, 61%, and 79%, respectively. In general, Categories 2 and 4 were easier to recognize. However, the recognition accuracy of Category 1 was similar to the accuracy of Categories 2 and 4 after the second exclusion criterion was implemented. In addition, the mean recognition accuracy of female speakers' samples was greater than that of male speakers' samples (shown in Figure 5-11). The most effective speakers in expressing all four emotion categories were female Speakers 3 and 4. No single sentence was better recognized on average across all speakers. These results served as a baseline reference for the comparison of model performance.
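
As an illustration of how percent correct and d′ can be derived from a category-level confusion matrix, the following Python sketch may be helpful. The procedure actually used here was described in Experiment 1 and is not reproduced in this section, so the sketch assumes the standard signal-detection definition d′ = z(hit rate) − z(false-alarm rate). The clamping of extreme rates is a common convention adopted only for the example, and all counts and names are hypothetical.

```python
from statistics import NormalDist

def category_scores(confusion, target):
    """Return (percent correct, d') for one emotion category.

    confusion[presented][responded] holds response counts.
    ASSUMPTION: d' = z(hit rate) - z(false-alarm rate).
    """
    others = [label for label in confusion if label != target]
    signal_trials = sum(confusion[target].values())
    hits = confusion[target][target]
    noise_trials = sum(sum(confusion[label].values()) for label in others)
    false_alarms = sum(confusion[label][target] for label in others)

    def clamp(rate, n):
        # Keep rates away from 0 and 1 so the z-transform stays finite
        # (an assumed 1/(2N) correction, not taken from the text).
        return min(max(rate, 1 / (2 * n)), 1 - 1 / (2 * n))

    z = NormalDist().inv_cdf
    hit_rate = clamp(hits / signal_trials, signal_trials)
    fa_rate = clamp(false_alarms / noise_trials, noise_trials)
    return 100 * hits / signal_trials, z(hit_rate) - z(fa_rate)

# HYPOTHETICAL response counts (rows: presented category, columns: response).
confusion = {
    "happy": {"happy": 60, "content-confident": 30, "angry": 5, "sad": 5},
    "content-confident": {"happy": 10, "content-confident": 80, "angry": 5, "sad": 5},
    "angry": {"happy": 10, "content-confident": 10, "angry": 70, "sad": 10},
    "sad": {"happy": 0, "content-confident": 15, "angry": 5, "sad": 80},
}

angry_percent, angry_dprime = category_scores(confusion, "angry")
print(round(angry_percent, 1), round(angry_dprime, 2))
```

Unlike percent correct, d′ also penalizes a category that is chosen too often for samples of other categories, which is one reason the two measures can rank categories differently.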
Acoustic measures

The necessary acoustic features were computed for the test2 set samples according to each acoustic model. Most features were computed automatically in MATLAB (v.7.0), although a number of features were automatically computed using hand-measured vowels and consonants.

Reliability of hand measurements

It was necessary to compute reliability on a subset of the hand measurements used in computing acoustic parameters of the test set to confirm that these measurements were replicable. In contrast to the training and test1 sets, pause duration was not measured as part of the test2 set, since it was not determined to be a necessary cue. Hence, reliability was calculated only on the hand measurements that were necessary for computation of the acoustic parameters included in the model. These were vowel duration for the stressed vowel (Vowel 1) and the unstressed vowel (Vowel 2). The same colleague who performed the reliability measurements for the training and test1 sets (Judge 2) was asked to perform these measurements on a subset of the stimuli. Recall that the test2 set included 330 samples (11 emotions X 10 speakers X 3 sentences). Measurements were repeated for 20% of each speaker's samples, or 7 sentences per speaker. This resulted in a total of 70 samples, which is slightly more than 20% of the total test set sample size. Measurements made by the author and Judge 2 were correlated using Pearson's correlation coefficient. Both vowel duration measures were highly correlated (0.97 and 0.92, respectively), suggesting that the hand measurements were reliable. Results are shown in Table 5-17.

Classification of test2 stimuli

To test the generalization capability of the Overall training set acoustic model, the test2 set stimuli were classified into four emotion categories using the k-means and kNN algorithms. Classification accuracy was reported in percent correct and d′ scores for all samples, each
of the 10 speakers, and each of the three sentences. Results of the k-means classification are shown in Table 5-18, and the results of the kNN classification for k = 1 and 3 are shown in Table 5-19.

The Overall training set acoustic model was equivalent to listener performance for Category 3 (angry) when tested with the k-means algorithm for all samples. For the remaining emotion categories, all three algorithms showed lower accuracy for the acoustic model than for listeners. However, the general trend in accuracy was mostly preserved. Category 3 (angry) was most accurately recognized and classified, followed by Categories 4, 1, and 2 (sad, happy, and content-confident), respectively. The k-means algorithm resulted in better classification accuracy than the kNN classifiers for Categories 3 and 4 (angry and sad), but the kNN (k = 1) classifier had better classification accuracy for Categories 1 and 2 (happy and content-confident). However, classification accuracy for Categories 1 and 2 was much lower than listener accuracy. In essence, performance of the kNN classifier with k = 1 was similar to that of the k-means classifier. However, the k-means classifier was more accurate relative to listener perception than the kNN classifier.

Classification accuracy was reported for the samples from each speaker as well. Samples from Speakers 3, 4, and 5 (all female speakers) were the most accurately classified and the easiest for listeners to recognize. In fact, with the exception of Category 1 (happy), the mean k-means and kNN (k = 1) d′ scores for female speakers were much greater than the mean d′ for male speakers. The male-female difference for Category 1 was trivial. Classification accuracy was best for Speaker 4. Performance using the k-means and kNN (k = 1) classifiers was better than listener performance for two emotion categories, but worse for the other two. Still, classification accuracy for Speaker 4 was better than the listener accuracy computed over all samples.
Similarly, k-means classification accuracy for Speakers 6 and 7 and kNN (k = 1) classification accuracy for Speakers 1 and 7 were better than listener accuracy for Categories 1 and 3 (happy and angry), but lower for Categories 2 and 4 (content-confident and sad). It can be concluded that the acoustic model worked relatively well in representing the emotions of the most effective speakers, but was not representative of listener results for the speakers who were not as effective.

An analysis by sentence was performed to determine whether the Overall training set acoustic model was better able to acoustically represent a specific sentence. Accuracy for all classifiers across emotion categories was lowest for Sentence 3, the novel sentence. This trend was representative of listener perception. However, the magnitude of the difference was more substantial for the classifiers than for listeners. Accuracy for Categories 3 and 4 (angry and sad) was better than for the remaining categories for all sentences and classifiers. This was in agreement with the high accuracy for Categories 3 and 4 seen in the all-samples classification results. Since no clear sentence advantage was seen between Sentences 1 and 2 and the low classification accuracy of Sentence 3 was supported by lower perceptual accuracy of this sentence, the results suggest that the acoustic model did not favor one sentence over the others.

Discussion

A number of researchers have sought to determine the acoustic signature of emotions in speech by using the dimensional approach (Schroeder et al., 2001; Davitz, 1964; Huttar, 1968; Tato et al., 2002). However, the dimensional approach has suffered from a number of limitations. First, researchers have not agreed on the number of dimensions that are necessary to describe emotions in SS. Techniques to determine the number of dimensions include correlations, regressions, and semantic differential tasks, but these have resulted in a large range of dimensions. Second, reports of the acoustic cues that correlate with each dimension have been inconsistent.
While much of the literature has agreed on the acoustic properties of the first dimension, which is typically activation (speaking rate, high mean f0, high f0 variability, and high mean intensity), the remaining dimensions have much variability. Part of this variability may be a result of differences in the stimulus type investigated. Stimuli used in the literature have varied according to the utterance length, the amount of contextual information provided, and the language of the utterance. For instance, Juslin and Laukka (2005) investigated the acoustic correlates of four emotion dimensions using short Swedish phrases and found that the high end of the activation dimension was described by a high mean f0 and f0 max and a large f0 SD. Positive valence corresponded to low mean f0 and low f0 floor. The potency dimension was described by a large f0 SD and low f0 floor, and the emotion intensity dimension correlated with jitter in addition to the cues that corresponded with activation. On the other hand, Schroeder et al. (2001) investigated the acoustic correlates of two dimensions using spontaneous British English speech from TV and radio programs and found that the activation dimension correlated with a higher f0 mean and range, longer phrases, shorter pauses, larger and faster f0 rises and falls, increased intensity, and a flatter spectral slope. The valence dimension corresponded with longer pauses, faster f0 falls, increased intensity, and more prominent intensity maxima. Finally, the set of acoustic cues studied in many experiments may have been limited. For example, Liscombe et al. (2003) used a set of acoustic cues that did not include speaking rate or any dynamic f0 measures. Lee et al. (2002) used a set of acoustic cues that did not include any duration or voice quality measures. While some of these experiments found significant associations between the acoustic cues within their feature set and the perceptual dimensions, it is possible that other features better describe the dimensions.
Hence, two experiments were performed to develop and test an acoustic model of emotions in SS. While the general objectives of the experiments reported in this chapter were similar to a handful of studies (e.g., Juslin & Laukka, 2001; Yildirim et al., 2004; Liscombe et al., 2003), these experiments differed from the literature in the methods used to overcome some of the common limitations. The specific aim of the first experiment was to develop an acoustic model of emotions in SS based on discrimination judgments and without the use of a speaker's baseline. Since the reference for assessing emotion expressivity in SS is listener judgments, the acoustic model developed in Experiment 1 was based on the discrimination data obtained in Chapter 2. This model was based on discrimination judgments, since a same-different discrimination task avoids requiring listeners to assign labels to emotion samples. While an identification task may be more representative of listener perception, this task assesses how well listeners can associate prosodic patterns (i.e., emotions in SS) with their corresponding labels instead of how different any two prosodic patterns are to listeners. Furthermore, judgments in an identification task may be subjectively influenced by each individual's definition of the emotion terms. A discrimination task may be better for model development, since this task attempts to determine subtle perceptual differences between items. Hence, a multidimensional perceptual model of emotions in SS was developed based on listener discrimination judgments of 19 emotions (reported in Chapter 3).

A variety of acoustic features were measured from the training set samples. These included cues related to fundamental frequency, intensity, duration, and voice quality (summarized in Table 5-4). This feature set was unique because none of the cues required normalization to the speaker characteristics. Most studies require a speaker normalization that is typically performed by computing the acoustic cues relative to each speaker's neutral emotion. The need for this normalization limits the applications of an acoustic model of emotion perception in SS because obtaining a neutral expression is often impractical.
Therefore, the present study sought to develop an acoustic model of emotions that did not require a speaker's baseline measures. The acoustic features were computed relative to other features or other segments within the sentence.

Once computed, these acoustic measures were used in a feature selection process based on stepwise regressions to select the most relevant acoustic cues to each dimension. However, preliminary results did not yield any acoustic correlates of the second dimension. This was considered as a possible outcome, since even listeners had difficulty discriminating all 19 emotions in SS. To remove the variability contributed to the perceptual model by the emotions that were difficult to perceive in SS, the perceptual model was redeveloped using a reduced set of emotions. These categories were identified based on the HCS results in Chapter 2. In particular, the 11 clusters formed at a clustering level of 1.95 were selected, instead of the 19 emotions at a clustering level of 0.0.

The results of the new feature selection for the training set samples (i.e., the Overall training set model) showed that srate (speaking rate), aratio2 (alpha ratio of the unstressed vowel), and pnorMIN (normalized pitch minimum) corresponded to Dimension 1, and normpnorMIN (normalized pitch minimum by speaking rate) and normattack (normalized attack time) were associated with Dimension 2. The pnorMIN and srate features were among those hypothesized in Chapter 3 to correspond to Dimension 1 because this dimension separated emotions according to articulation rate and the magnitude of f0 contour changes. Both of these measures have been reported in the literature as corresponding with Dimension 1 (Scherer & Oshinsky, 1977; Davitz, 1964), considering that pnorMIN was a method of measuring the range of f0. The inclusion of the aratio2 feature is unusual. Computations of voice quality are typically performed on stressed vowels, to obtain a longer and less variable sample. However, this variability may be important in emotion differentiation.
The acoustic features predicted to correspond to Dimension 2 included some measure of the attack time of the intensity contour peaks, as hypothesized in Chapter 3. The feature normattack normalized the attack time to the duty cycle of the peak, thereby accounting for the changes in attack time due to the syllable duration. In addition, the normpnorMIN cue was significant, and represents a measure of the range of f0 relative to the speaking rate. Since this dimension was not clearly valence or a separation of positive and negative emotions, it was not possible to truly compare results with the literature. Nevertheless, cues such as speaking rate (Scherer & Oshinsky, 1977) and f0 range or variability (Scherer & Oshinsky, 1977; Uldall, 1960) have been reported for the valence dimension.

To test the acoustic model, the emotion samples within the training set were acoustically represented in a 2D space according to the Overall training set model. But first, it was necessary to convert each speaker's samples to z-scores. This was required because the regression equations were based on the MDS coordinates, which have arbitrary units. The samples were then classified into four emotion categories. These four categories were the four clusters determined in Chapter 2 to be perceivable in SS. Results of the k-means classification revealed near 100% accuracy across the four emotion categories. These results were better than listener judgments of the training set samples obtained using an identification task. Near-perfect performance was expected, since the Overall training set model was developed based on these samples.

To test whether the acoustic model generalized to novel utterances of the same two speakers, this model was used to classify the samples within the test1 set. Results showed that classification accuracy was lower for the test1 set samples compared to the training set samples. However, this pattern mimicked listener performance as well. Furthermore, classification accuracy of all samples was greater than listener accuracy (Category 3 of the test1 set was the only exception, with a 0.22 difference in d′ scores).

The feature selection process was performed multiple times using different perceptual models.
The purpose of this procedure was to determine whether an acoustic model based on a single sentence or speaker was better able to represent perception. For both the training and test1 sets, separate perceptual MDS models were developed for each speaker. In addition, perceptual MDS models were developed for each sentence for the test1 set. Results showed that classification accuracy of both the training set and test1 set samples was best for the Overall training set model. Since the training set was used for model development, it was expected that performance would be higher for this model than for the test1 set models. In addition, the Overall training set model provided approximately equal results in classifying the emotions for both sentences. However, accuracy for the individual speaker samples varied. The samples from Speaker 2 were easier to classify for the test1 and training set samples. This contradicted listener performance, as listeners found the samples from Speaker 1 much easier to identify. In terms of the different speaker and sentence models, the Speaker 2 training set model was better than the Speaker 1 training set model at classifying the training set samples for three of the four emotion categories. This model was equivalent to the Speaker 2 test1 set model but worse than the Sentence 2 test1 set model at classifying the test1 set samples. While the Sentence 2 test1 set model performed similarly to the Overall training set model, the latter was better at classifying Categories 3 and 4 (angry and sad) while the former was better at classifying Categories 1 and 2 (happy and content-confident). The pattern exhibited by the Overall training set model was consistent with listener judgments and was therefore used in further model testing performed in Experiment 2.

While the objective of the first experiment was to develop an acoustic model of emotions in SS, the aim of the second experiment was to test the validity of the model by evaluating how well it was able to classify the emotions of novel speakers.
Ten novel speakers expressed one novel and two previously used nonsense sentences in 11 emotions (i.e., the test2 set). These samples were then acoustically represented using the Overall training set model. The kNN classification algorithm (for k = 1 and 3) was used in addition to the k-means algorithm to evaluate model performance. Results showed that classification accuracy of all samples of the test2 set was not as good as accuracy for the training and test1 sets. These results occurred regardless of the classification algorithm, although the k-means algorithm performed better than both kNN methods. Listener identification accuracy was also much worse than for the training and test1 sets. This suggests that the low classification accuracy for the test2 set may in part be due to reduced effectiveness of the speakers.

The acoustic model was almost equal to listener accuracy for Category 3 (angry) using the k-means classifier (difference of 0.04). In fact, Category 3 (angry) was the easiest emotion to classify and recognize for all three sample sets. The next highest in classification and recognition accuracy for all sets was Category 4 (sad). The only exception was classification accuracy for the training set samples. Accuracy of Category 4 was less than Category 1; however, this discrepancy may have been due to the small sample size (one Category 4 sample was misclassified out of four samples).

The high perceptual accuracy for angry samples has been reported in the literature. For instance, Yildirim et al. (2004) found that angry was recognized with 82% accuracy out of four emotions (plus an "other" category). Petrushin (1999) found that angry was recognized with 72% accuracy out of five emotions. On the other hand, classification accuracy of angry has typically been equal to or less than perceptual accuracy. Yildirim et al. (2004) found that angry was classified with 54% accuracy out of four emotions using discriminant analysis. Toivanen et al.
(2006) found that angry was classified with 25% accuracy compared to 38% recognition out of five emotions using kNN classification. Similarly, recognition accuracy of sad has typically been high. For example, Dellaert et al. (1996) found that sad was recognized with 80% accuracy out of four emotions. Petrushin (1999) found that sad was recognized with 68% accuracy out of five emotions. Classification accuracy of sad has also been high. Petrushin (1999) found that sad was classified with 73-81% accuracy out of five emotions using multiple classification algorithms (kNN, neural networks, ensembles of neural network classifiers, and set of experts). Yildirim et al. (2004) found that sad was perceived with 61% accuracy but classified with 73% accuracy.

While Categories 1 and 2 (happy and content-confident) had lower recognition accuracy than Categories 3 and 4 (angry and sad) for the samples from all sets, classification accuracy for these categories for the test2 set samples was much lower than listener accuracy. Reports of recognition accuracy of happy have been mixed, but classification accuracy has generally been high. For instance, Liscombe et al. (2003) found that happy samples were ranked highly as happy with 57% accuracy out of 10 emotions and classified with 80% accuracy out of 10 emotions using the RIPPER model with a binary classification procedure. Yildirim et al. (2004) found that happy was recognized with 56% accuracy out of four emotions (plus an "other" category) and classified with 61% accuracy out of four emotions using discriminant analysis. Based on the literature, classification accuracy of Category 1 (happy) was expected to be higher than reported. It was possible that samples from this category were confused with Category 2 (content-confident), since these categories were clustered together at a lower level than Category 1 (happy) with Categories 3 and 4 (angry and sad). Therefore, an analysis was performed to determine whether this low accuracy was due to an inability of the acoustic model to represent this category or whether these samples were confused with Category 2 (content-confident).
When the samples classified as Category 2 were included as correct classification of Category 1 (happy) samples, accuracy increased to 75% correct or a d′ of 1.6127. This accuracy was higher than listener accuracy. This suggested that the low classification accuracy of happy may be due to inadequate representations of these speakers.

Accuracy of the final category of content-confident has been mixed. Liscombe et al. (2003) found 75% perceptual and classification accuracy of confident (algorithm: RIPPER model with binary classification procedure) out of 10 emotions. Toivanen et al. (2006) found 50% recognition accuracy and 72% kNN classification accuracy of a neutral emotion out of five emotions. Petrushin (1999) found 66% recognition and 55-65% classification of a normal emotion.

Classification results of the test2 set were also reported by sentence and speaker. Both classification and recognition results showed similar performance for Sentences 1 and 2. This matched the sentence analysis of the training and test1 sets. However, classification accuracy of Sentence 3 was much lower than for Sentences 1 and 2 for all emotion categories. While listener accuracy of Sentence 3 was also lower than for Sentences 1 and 2 for all emotion categories, the reduction in performance was greater for the classifiers. In other words, the Overall training set acoustic model was better able to represent the sentences used in model development. However, it was not clear whether the model is dependent on the sentence text, or the novel sentence was simply harder to express emotionally.

The analysis by speaker revealed clear differences in the classification of different speakers. This raises the concern of whether the level of training of the speakers used in model development and testing affected model performance. The speakers used in model development were acting students at the University of Florida who had completed at least two full years within their program. Individuals with acting training were used in model development, since previous research has suggested that actors may have the ability to produce full-blown emotions or
emotions with greater magnitudes than persons without any formal training (Banse and Scherer, 1996). Based on the suggestions by Banse and Scherer (1996), it was assumed that the expressions from actors would be more natural than the expressions from individuals without acting training. Although several techniques have been proposed to elicit expressions with greater naturalness from non-trained individuals (these emotion induction techniques, such as induction through music, reading of emotional passages, and games, were previously described in Chapter 2), it is not clear whether these expressions are as natural or high in magnitude as those expressed by actors.

Since the model would be applied to individuals without acting training, it was assumed that the use of non-actors would provide a test of the ecological validity of the model. Therefore, the speakers used in model testing had no formal acting training in college (female Speakers 1 and 2 and male Speaker 9 had participated in theater during their high school years; female Speaker 5 was a first-year theater major with one semester completed in the program). Results showed that the performance of the model in representing the emotions of novel speakers without acting training was not as good as performance in representing the two speakers used in model development with acting training. This suggests one of two conclusions. Either a model based on acted speech simply does not apply well to the speech of individuals without acting training, or the model could generalize to only the highly effective speakers regardless of acting training. To understand which of these two scenarios was more representative of the results, listener performance was compared across stimulus sets. For the test2 set, listener identification performance was best for Speakers 3 and 4. Listener accuracy for these speakers was comparable to accuracy for the training set speakers.
Listeners perceived three of four categories of the Speaker 4 test2 set samples and all four categories of the Speaker 3 test2 set samples better than the Speaker 2 training set samples. Also, listeners perceived two of four categories of the Speaker 4 test2 set samples and one of four categories of the Speaker 3 test2 set samples better than the Speaker 1 training set samples. This indicates that it was possible for listeners to perceive the speech samples from some individuals without acting training (e.g., Speakers 3 and 4) as well as those from individuals with acting training. However, the speech samples from the actors were perceived more accurately than those of most of the speakers (8 of 10) without acting training. These results were consistent with the literature that suggests that actors may express emotions in a manner that facilitates recognition. It is conceivable that the samples from individuals without acting training were low in magnitude due to the unnatural recording setting and were therefore more difficult for listeners to recognize. However, the differences in the perceived magnitudes of the samples were not explicitly tested in this experiment.

An analysis of the model and listener accuracy by speaker showed that the model performed well at predicting the expressions of Speakers 3 and 4, two female speakers without acting training. Classification accuracy for Speaker 4 was better than listener accuracy for Categories 3 and 4 (angry and sad), and model accuracy for Speaker 3 was better than listener accuracy for Category 3. In fact, classification accuracy was greater than the listener accuracy computed over all samples of the test2 set for all categories for Speaker 4 and for three of four categories for Speaker 3. Since the accuracy of the model was high for the speakers that were perceived best, this suggests that the model was in fact able to generalize to individuals without acting training. Specifically, the model was able to represent the expressions of effective speakers, but it was poor at representing the emotional samples of speakers who were moderately effective.
In other words, the sensitivity of the model was not as good as that of listeners. These results also suggest that most speakers without acting training were not as effective in expressing the 11 emotions in SS, especially in a laboratory setup.

In summary, an acoustic model was developed based on discrimination judgments of emotional samples by two speakers. While 19 emotions were obtained and used in the perceptual test, only 11 emotions were used in model development. Inclusion of the remaining eight emotions seemed to add variability to the model, due to their low discrimination accuracy in SS. Due to the potential for large speaker differences in expression (as confirmed by the results of this study), acted speech was used. The Overall training set acoustic model was developed based on a single sentence by two actors and outperformed other speaker and sentence models that included additional sentences by the same speakers. It is possible that these additional models were not able to accurately represent the samples because they were based on identification judgments instead of discrimination judgments, but this was not tested in the present study.

While the performance of the Overall training set acoustic model was better than that of listeners for the training and test1 sets, this model had a couple of limitations. First, certain features used in the model were computed on vowels that were segmented by hand offline. To truly automate this model, it is necessary to develop an algorithm to automatically isolate stressed and unstressed vowels from a speech sample. Second, it was necessary to normalize the samples from each speaker by converting them to z-scores. This normalization did not negate the purpose of this study, which was to develop an acoustic model based on acoustic features that were not dependent on a speaker's baseline. However, it did hinder the overall goal, which was to develop a speaker-independent method of predicting emotions in SS. It is also important to keep in mind that two basic classification algorithms were used.
The use of more complex algorithms may potentially improve upon the classification accuracy.
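The per-speaker z-score normalization mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name `zscore_by_speaker` is hypothetical, and the use of the population standard deviation is an assumption (the sample standard deviation would be an equally plausible choice). The example values are speaking rates (srate) for four test1 samples listed in Table 5-9.

```python
from statistics import mean, pstdev


def zscore_by_speaker(samples):
    """Convert (speaker_id, value) pairs to z-scores computed within each speaker.

    Assumption: the population standard deviation is used; the source does not
    say which estimator was applied.
    """
    by_speaker = {}
    for speaker, value in samples:
        by_speaker.setdefault(speaker, []).append(value)

    # Mean and standard deviation of the feature for each speaker separately.
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    z_scores = []
    for speaker, value in samples:
        mu, sd = stats[speaker]
        # A speaker with no variation gets 0.0 rather than a division by zero.
        z_scores.append(0.0 if sd == 0 else (value - mu) / sd)
    return z_scores


# Speaking rate (srate) of four samples from Table 5-9.
srate = [("Spk1", 2.303), ("Spk1", 3.624), ("Spk2", 4.177), ("Spk2", 3.862)]
for (speaker, raw), z in zip(srate, zscore_by_speaker(srate)):
    print(speaker, raw, round(z, 3))
```

After this step each speaker's values have zero mean and unit variance, which removes the speaker's baseline but also means the speaker's other samples must be available before a new sample can be scored.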
The results of the test of model generalization showed that the model was able to classify angry with high accuracy relative to listeners. This suggested that the acoustic cues used to differentiate angry from the remaining emotions, i.e., the acoustic cues to Dimension 2, were more robust than those previously used to describe this dimension in the literature. This is an important finding, since the ability to differentiate angry from other emotions is necessary in a number of applications. Therefore, the results presented here suggest that an acoustic model based on perceptual judgments of nonsensical speech from two actors could sufficiently differentiate between anger and the remaining emotions in SS when expressed by non-trained individuals.
Table 5-1. Perceptual and algorithm classification accuracy of emotions perceived from the suprasegmental information in speech.

Dallaert, Polzin, & Waibel (1996)
Language/Stimuli: American English sentences
Emotions studied    Perceptual accuracy
Angry               96%
Fear                64%
Happy               88%
Sad                 80%
Classification accuracy: n/a
Average accuracy: 79.5% algorithm; 82% perception

Yildirim et al. (2004)
Language/Stimuli: American English neutral sentences
Emotions studied    Perceptual accuracy    Classification accuracy
Angry               82%                    54%
Happy               56%                    61%
Sad                 61%                    73%
Neutral             74%                    80%
Average accuracy: 67% algorithm; 68.3% perception

Toivanen et al. (2006)
Language/Stimuli: Finnish vowels
Emotions studied    Perceptual accuracy    Classification accuracy
Angry               38%                    25%
Joy                 28%                    57%
Sad                 43%                    50%
Tenderness          29%                    63%
Neutral             50%                    72%
Average accuracy: 53.4% algorithm; 37.5% perception

Hammerschmidt & Jurgens (2007)
Language/Stimuli: German speakers expressing 1 word (Anna)
Emotions studied: Rage/hot anger; Despair/lamentation; Contempt/disgust; Joyful surprise; Voluptuous enjoyment/sensual satisfaction; Affection/tenderness
Perceptual accuracy: Range of 43% (contempt/disgust) to 87% (rage/hot anger)
Classification accuracy: n/a
Average accuracy: 77.4% algorithm; n/a perception

Leinonen et al. (1997)
Language/Stimuli: Finnish word [saara]
Emotions studied    Perceptual accuracy
Naming              48%
Commanding          75%
Angry               70%
Scornful            52%
Content             26%
Admiring            34%
Pleading            37%
Sad                 36%
Frightened          64%
Astonished          72%
Classification accuracy: n/a
Average accuracy: n/a algorithm; 51.4% perception

Petrushin (1999)
Language/Stimuli: American English, four neutral sentences
Emotions studied    Perceptual accuracy    Classification accuracy
Normal              66%                    55-65%
Happiness           61%                    61-69%
Anger               72%                    73-81%
Sadness             68%                    73-83%
Fear                50%                    35-53%
Average accuracy: 65% algorithm; 63.4% perception

Liscombe et al. (2003)
Language/Stimuli: American English four-syllable dates and numbers
Emotions studied    Perceptual accuracy    Classification accuracy
Angry               69%                    72%
Confident           75%                    75%
Happy               57%                    80%
Interested          70%                    74%
Encouraging         52%                    73%
Sad                 62%                    80%
Anxious             56%                    72%
Bored               66%                    79%
Friendly            59%                    74%
Frustrated          59%                    74%
Average accuracy: 75.3% algorithm; 62.5% perception
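The "Average accuracy" entries in Table 5-1 appear to be unweighted means of the per-emotion accuracies. The short check below reproduces the two Liscombe et al. (2003) averages from the per-emotion values in the table. The helper name `mean_accuracy` is introduced here for illustration only.

```python
def mean_accuracy(percentages):
    """Unweighted mean of per-emotion accuracies, in percent."""
    return sum(percentages) / len(percentages)


# Per-emotion values for Liscombe et al. (2003), in the order listed in Table 5-1.
liscombe_perception = [69, 75, 57, 70, 52, 62, 56, 66, 59, 59]
liscombe_algorithm = [72, 75, 80, 74, 73, 80, 72, 79, 74, 74]

print(round(mean_accuracy(liscombe_perception), 1))  # 62.5
print(round(mean_accuracy(liscombe_algorithm), 1))   # 75.3
```

Because the mean is unweighted, studies with many emotion categories and studies with few are summarized on the same scale even though chance performance differs between them.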
Table 5-2. Initial feature sets used in experiments to classify emotions. The best features that provided the most accurate classification accuracy are also shown.

Dallaert et al. (1996)
Emotions studied: Angry, Happy, Sad, Fear
Classification results (best features): Cues from Feature Set B: Max and median f0 and mean positive derivative of regions where f0 is increasing
Initial feature set:
  Feature Set A: 5 global f0 measures, spectral slope and speaking rate
  Feature Set B: speaking rate, mean length between voiced regions, number of maxima / number of (minima + maxima), number of upslopes / number of slopes, slope of maxima, 5 global measures of smoothed f0 contour and its derivative (median instead of mean), mean min and mean max voiced segment, mean positive and negative derivative across voiced segments
Method of classification: kNN with cooperative composition; speaker dependent

Yildirim et al. (2004)
Emotions studied: Angry, Happy, Sad, Neutral
Classification results (best features): Energy was single most important cue but all cues combined achieved highest accuracy
Initial feature set: 5 global f0 parameters plus median f0 for the whole utterance, RMS energy, Mean utterance, vowel, inter-word silence, voiced region, and unvoiced region durations, speaking rate (phonemes/sec), spectral balance, acoustic likelihood
Method of classification: HMMs, Discriminant analysis

Toivanen et al. (2006)
Emotions studied: Angry, Joy, Sad, Tenderness, Neutral
Classification results (best features): duration, alpha ratio, signal-to-noise ratio, NAQ, and average jitter
Initial feature set: Alpha ratio, duration, std. deviation of jitter, mean jitter, shimmer, signal-to-noise ratio (SNR), normalized amplitude quotient (NAQ), mean jitter std. deviation jitter, SNR shimmer
Method of classification: k-Nearest Neighbor classifier (kNN); speaker independent

Hammerschmidt & Jurgens (2007)
Emotions studied: Rage/hot anger, Despair/lamentation, Contempt/disgust, Joyful surprise, Voluptuous enjoyment/sensual satisfaction, Affection/tenderness
Classification results (best features): amplitude, DFA2 mean, DFB1 local mod, duration, mean range, and HNR mean
Initial feature set: amplitude; duration; noise; f0 mean; f0 max; peak frequency (PF) mean; PF max; PF/f0 coeff mean; HNR mean; range mean; DFA2 mean (distribution of frequency amplitudes across the spectrum, irrespective of their tonal/noisy character); DFB1 local mod (small, local modulations of first dominant frequency band); PF amp max; PF max loc.
Method of classification: stepwise discriminant function analysis

Williams & Stevens (1972)
Emotions studied: Anger, Fear, Sorrow, Neutral
Classification results (best features): High mean f0 (half octave above neutral), High f0 range Vowels in syllables with high intensity had highest f0, longer duration, slower syllabic rate, High 1st formants Mean f0 less than anger; close to neutral, sporadic points of high f0 (peaks in the contour) Voicing irregularities sometimes, longer duration than anger and neutral More precise articulation Lower mean f0 than neutral Narrow f0 range, longer duration from longer vowels, consonants, and pauses Slower syllabic rate, voicing irregularity (voiced sounds can be whispered) Little noise or irregularity between formants or in high-frequency regions, imprecise consonants, especially when in unstressed syllables Relatively shorter duration
Initial feature set: n/a
Method of classification: Visual analysis; All acoustic comparisons with neutral

Petrushin (1999)
Emotions studied: Normal, Happiness, Anger, Sadness, Fear
Classification results (best features): f0 max, std. deviation, range, and mean; BW1 and BW2 mean, energy standard deviation, and speaking rate
Initial feature set: f0 energy, speaking rate, first three formants (F1, F2, and F3) and their bandwidths (BW1, BW2, and BW3), Global measures for all features except speaking rate, f0 slope, relative voiced energy
Method of classification: Neural networks

Liscombe et al. (2003)
Emotions studied: Angry, Confident, Happy, Interested, Encouraging, Sad, Anxious, Bored, Friendly, Frustrated
Classification results (best features): f0 *, RMS *, TILT *, VTT f0 range, f0 mean f0 MIN f0 STDV VTT f0 MAX STV highest amplitude STV highest amplitude STV nuclear stress f0 MAX
Initial feature set: Global f0 and energy cues, f0 above, VTT, syllable length, STV highest amplitude, STV with nuclear stress, type of nuclear accent, intonational contour type, phase accent and boundary case type
Method of classification: RIPPER model (machine learning program) using binary classification
Table 5-3. Feature sets corresponding to the emotion dimensions.

Dimension: Activation
Experiment:
  Tischer (1993); Breitenstein et al. (2001); Pereira (2000); Davitz (1964); Levin & Lord (1975); Scherer & Oshinsky (1977); Schroder et al. (2001)
  Tischer (1993); Apple et al. (1979); Kehrein (2002); Breitenstein et al. (2001); Davitz (1964); Scherer & Oshinsky (1977)
  Tischer (1993); Pereira (2000); Davitz (1964); Huttar (1968); Schroder et al. (2001)
  Tischer (1993)
  Schroder et al. (2001)
  Tischer (1993); Trouvain & Barry (2000); Schroder et al. (2001)
  Huttar (1968)
  Breitenstein et al. (2001); Pereira (2000); Scherer & Oshinsky (1977); Schroder et al. (2001); Uldall, 1960
  Davitz (1964)
  Scherer & Oshinsky (1977); Pittam et al. (1990); Schroder et al. (2001)
Acoustic Cues: High mean f0 Fast speech rate Increased mean intensity late intensity peaks, intensity increase during a sense unit, slope of f0 rises between syllable maxima longer phrases, larger and faster f0 rises and falls, and a flatter spectral slope Shorter pauses f0 range f0 variability blazing timbre Increased high-frequency energy

Dimension: Confidence
Experiment:
  Greenberg et al. (2007)
Acoustic Cues: f0 shape

Dimension: Intensity
Experiment:
  Juslin & Laukka (2001)
Acoustic Cues: Mean f0 high frequency energy

Dimension: Valence
Experiment:
  Scherer & Oshinsky (1977); Schroder et al. (2001)
  Scherer & Oshinsky (1977); Pittam et al. (1990)
  Green & Cliff (1975)
  Tischer (1993)
  Greenberg et al. (2007)
  Scherer & Oshinsky (1977); Scherer & Oshinsky (1977); Scherer (1974); Scherer & Oshinsky (1977); Uldall (1960)
Acoustic Cues: Faster speaking rate, Shorter pauses, increased intensity less high frequency energy warm voice quality Longer vowel duration, High mean f0 (f0 height) Low mean f0 large f0 range f0 variability

Dimension: Power
Experiment:
  Schroder (2003); Harrigan et al. (2001)
  Apple et al. (1979); Tusing & Dillard (2000)
  Schroder (2003)
  Schroder (2003); Tusing & Dillard (2000)
  Harrigan et al. (1989); Kehrein (2002); Pereira (2000); Scherer (1974); Tusing & Dillard (2000)
  Schroder et al. (2001); Scherer & Oshinsky (1977)
  Tischer (1993); Apple et al. (1979); Schroder et al. (2001)
  Tischer (1993)
  Kehrein (2002); Pereira (2000); Uldall (1968)
Acoustic Cues: high tempo, slow speech rate more high-frequency energy, short or few pauses, steep f0 slope) large intensity range, high intensity high mean f0 low mean f0 vowel duration f0 variability
Table 5-4. List of acoustic features analyzed.

Feature Set                  Acoustic Cues                                                                                  Abbreviation
Fundamental frequency (f0)   Global normalized f0 max                                                                       pnorMAX
                             Global normalized f0 min                                                                       pnorMIN
                             Gross f0 trend                                                                                 gtrend
                             Normalized number of f0 contour peaks                                                          normnpks
                             Steepness of f0 contour peaks: Peak rise time                                                  mpkrise
                             Steepness of f0 contour peaks: Peak fall time                                                  mpkfall
Intensity                    Normalized Minimum                                                                             iNmin
                             Normalized Maximum                                                                             iNmax
                             Attack of syllables in contour                                                                 attack
                             Normalized attack (attack / peak duration)                                                     Nattack
                             Normalized attack (attack / dutycyc)                                                           normattack
                             Duty cycle of syllables in contour                                                             dutycyc
Duration                     Speaking rate                                                                                  srate
                             Vowel to consonant ratio                                                                       VCR
                             Pause Proportion                                                                               PP
                             Speaking rate trend                                                                            strend
Voice quality                Breathiness: Cepstral peak prominence                                                          MeanCPP
                             Spectral tilt: alpha ratio of stressed vowel (summed)                                          aratio
                             Spectral tilt: mean alpha ratio of stressed vowel                                              maratio
                             Spectral tilt: regression through the long-term averaged spectrum of stressed vowel            mLTAS
                             Spectral tilt: regression through the long-term averaged spectrum of unstressed vowel          mLTAS2
                             Spectral tilt: mean alpha ratio of unstressed vowel                                            maratio2
                             Spectral tilt: alpha ratio of unstressed vowel (summed)                                        aratio2
Table 5-5. Matrix of d' values for 11 emotions (AG = angry; AO = annoyed; AX = anxious; BO = bored; CI = confident; CU = confused; CE = content; EM = embarrassed; EX = exhausted; HA = happy; SA = sad) submitted for multidimensional scaling analysis.

      AG    AO    AX    BO    CI    CU    CE    EM    EX    HA    SA
AG    0.00  2.99  4.49  4.14  2.41  4.01  4.38  4.67  3.86  5.15  5.58
AO    2.99  0.00  3.45  3.16  1.75  2.20  2.49  3.26  3.08  3.86  3.44
AX    4.49  3.45  0.00  5.34  3.02  3.31  2.11  4.96  4.63  2.69  3.53
BO    4.14  3.16  5.34  0.00  3.62  3.31  2.90  2.70  2.68  4.73  3.31
CI    2.41  1.75  3.02  3.62  0.00  1.83  2.09  3.59  3.48  2.30  3.41
CU    4.01  2.20  3.31  3.31  1.83  0.00  1.97  3.05  2.85  2.71  2.83
CE    4.38  2.49  2.11  2.90  2.09  1.97  0.00  2.93  2.47  2.32  3.09
EM    4.67  3.26  4.96  2.70  3.59  3.05  2.93  0.00  2.01  5.37  1.60
EX    3.86  3.08  4.63  2.68  3.48  2.85  2.47  2.01  0.00  3.63  2.22
HA    5.15  3.86  2.69  4.73  2.30  2.71  2.32  5.37  3.63  0.00  3.81
SA    5.58  3.44  3.53  3.31  3.41  2.83  3.09  1.60  2.22  3.81  0.00

Table 5-6. Stimulus coordinates of all listener judgments of the 19 emotions arranged in ascending order for each dimension (AG = angry; AO = annoyed; AX = anxious; BO = bored; CI = confident; CU = confused; CE = content; EM = embarrassed; EX = exhausted; HA = happy; SA = sad).

Dimension 1       Dimension 2
AX   -1.75        AG   -2.16
HA   -1.65        AO   -0.90
CI   -0.91        CI   -0.57
AG   -0.36        BO   -0.29
CE   -0.20        EX    0.18
CU   -0.16        CE    0.37
AO    0.22        AX    0.38
SA    0.77        CU    0.39
EX    1.06        EM    0.52
BO    1.49        HA    0.79
EM    1.50        SA    1.30
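To make the two-dimensional stimulus space of Table 5-6 concrete, the sketch below looks up the emotion whose coordinates lie closest to a given (Dimension 1, Dimension 2) point. This is only an illustration of working with the coordinates: the function name `nearest_emotion` is hypothetical, Euclidean distance is an assumption, and the classifiers actually used in the study operate on four emotion categories and are described elsewhere.

```python
import math

# (Dimension 1, Dimension 2) coordinates transcribed from Table 5-6.
COORDS = {
    "AG": (-0.36, -2.16), "AO": (0.22, -0.90), "AX": (-1.75, 0.38),
    "BO": (1.49, -0.29), "CI": (-0.91, -0.57), "CU": (-0.16, 0.39),
    "CE": (-0.20, 0.37), "EM": (1.50, 0.52), "EX": (1.06, 0.18),
    "HA": (-1.65, 0.79), "SA": (0.77, 1.30),
}


def nearest_emotion(d1, d2):
    """Return the emotion label whose Table 5-6 coordinates are closest to (d1, d2)."""
    return min(COORDS, key=lambda label: math.dist((d1, d2), COORDS[label]))


print(nearest_emotion(-0.4, -2.0))  # a point near the angry (AG) coordinates
print(nearest_emotion(1.4, 0.6))    # a point near the embarrassed (EM) coordinates
```

Note how close some emotions sit in this space (for example CU and CE), which is one reason a category-level grouping is more practical than an emotion-level lookup.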
Table 5-7. Perceptual accuracy for the training set based on all sentences and two exclusionary criteria.

                                     Percent Correct         d-prime
                                     H     C     A     S     H     C     A     S
All Sentences               Spk1     97%   56%   99%   69%   2.84  1.28  3.95  2.50
                            Spk2     70%   71%   85%   66%   1.74  1.28  3.11  1.95
                            ALL      83%   63%   92%   68%   2.06  1.26  3.20  2.17
Above Chance Sentences      Spk1     97%   68%   99%   69%   3.23  1.61  3.90  2.51
                            Spk2     70%   71%   85%   66%   1.74  1.28  3.11  1.95
                            ALL      83%   70%   92%   68%   2.26  1.43  3.18  2.16
Correct Category Sentences  Spk1     97%   68%   99%   80%   220.127.116.112.86
                            Spk2     70%   80%   85%   74%   2.04  1.74  3.04  2.15
                            ALL      83%   74%   92%   77%   2.37  1.84  3.13  2.44

H = Category 1 or Happy, C = Category 2 or Content-Confident, A = Category 3 or Angry, and S = Category 4 or Sad; Spk = Speaker Number; Sent = Sentence Number
Table 5-8. Perceptual accuracy for the test1 set based on all sentences and two exclusionary criteria.

                                           Percent Correct         d-prime
                                           H     C     A     S     H     C     A     S
All Sentences               Spk1           90%   59%   94%   66%   2.34  1.28  3.06  2.23
                            Spk2           63%   73%   81%   51%   1.52  1.08  2.88  1.56
                            Sent1          77%   59%   88%   70%   1.77  1.16  2.95  1.98
                            Sent2          77%   73%   87%   47%   1.92  1.17  2.77  1.89
                            Spk1, Sent1    83%   56%   99%   69%   1.95  1.28  3.94  2.03
                            Spk1, Sent2    97%   63%   89%   63%   2.99  1.28  2.65  2.81
                            Spk2, Sent1    70%   62%   77%   71%   1.64  1.07  2.78  1.94
                            Spk2, Sent2    57%   83%   85%   32%   1.41  1.20  3.01  1.23
                            ALL            77%   66%   88%   59%   1.84  1.14  2.85  1.87
Above Chance Sentences      Spk1           90%   69%   94%   66%   2.52  1.53  3.22  2.19
                            Spk2           63%   73%   81%   64%   1.65  1.18  2.88  1.89
                            Sent1          77%   65%   88%   70%   1.95  1.31  2.93  1.96
                            Sent2          77%   77%   87%   59%   2.06  1.42  3.01  2.14
                            Spk1, Sent1    83%   68%   99%   69%   2.33  1.61  3.90  1.99
                            Spk1, Sent2    97%   69%   89%   63%   2.97  1.46  3.05  2.75
                            Spk2, Sent1    70%   62%   77%   71%   1.64  1.07  2.78  1.94
                            Spk2, Sent2    57%   83%   85%   50%   1.85  1.32  3.00  1.72
                            ALL            77%   71%   88%   65%   2.00  1.36  2.96  2.02
Correct Category Sentences  Spk1           90%   75%   94%   84%   2.69  2.38  3.22  2.75
                            Spk2           63%   77%   81%   68%   1.83  1.38  2.84  1.99
                            Sent1          77%   68%   88%   75%   2.06  1.55  2.90  2.09
                            Sent2          77%   84%   87%   79%   2.37  1.97  3.05  2.71
                            Spk1, Sent1    83%   68%   99%   80%   2.26  2.13  3.88  2.33
                            Spk1, Sent2    97%   85%   89%   90%   3.47  2.83  3.15  3.75
                            Spk2, Sent1    70%   68%   77%   71%   1.90  1.24  2.74  1.91
                            Spk2, Sent2    57%   83%   85%   56%   0.83  1.42  2.96  1.86
                            ALL            77%   76%   88%   76%   2.18  1.74  2.96  2.33

H = Category 1 or Happy, C = Category 2 or Content-Confident, A = Category 3 or Angry, and S = Category 4 or Sad; Spk = Speaker Number; Sent = Sentence Number
Table 5-9. Raw acoustic measurements for the test1 set.

Sample        MeanCPP  pp     vcr    srate  srtrend  gtrend  normnpks  mpkrise
Spk1 angr s1  13.70  0.166  0.678  2.303  -0.064  -0.003  0.400  214.915
Spk1 angr s2  13.52  0.281  0.980  2.249  0.018  -0.045  0.100  771.503
Spk1 anno s1  12.91  0.000  0.690  3.538  0.095  -0.045  0.222  404.744
Spk1 anno s2  13.78  0.076  0.989  2.721  -0.009  -0.058  0.111  165.405
Spk1 anxi s1  12.50  0.072  0.710  4.030  -0.036  -0.044  0.333  183.278
Spk1 anxi s2  13.03  0.064  1.118  3.587  0.179  -0.038  0.333  185.626
Spk1 bore s1  16.16  0.061  1.053  2.481  0.028  -0.028  0.111  58.801
Spk1 bore s2  15.02  0.150  1.144  1.902  0.042  -0.027  0.111  131.659
Spk1 cofi s1  13.44  0.032  0.778  3.151  -0.030  -0.080  0.333  428.533
Spk1 cofi s2  13.91  0.000  1.170  3.255  0.150  -0.057  0.100  135.325
Spk1 cofu s1  12.95  0.218  0.756  2.423  0.028  -0.022  0.100  46.949
Spk1 cofu s2  13.31  0.121  1.233  2.867  -0.027  0.045  0.111  148.614
Spk1 cote s1  13.73  0.000  0.727  4.209  0.000  -0.020  0.111  245.031
Spk1 cote s2  13.33  0.000  1.216  3.896  0.286  -0.064  0.111  366.683
Spk1 emba s1  13.62  0.199  0.675  2.212  -0.097  -0.049  0.222  143.666
Spk1 emba s2  15.11  0.094  1.046  3.015  0.103  -0.043  0.111  84.515
Spk1 exha s1  14.36  0.027  0.556  2.466  -0.060  -0.027  0.222  366.554
Spk1 exha s2  15.19  0.046  1.208  2.573  0.039  -0.029  0.111  103.206
Spk1 happ s1  13.04  0.000  0.770  3.624  -0.083  -0.046  0.222  159.580
Spk1 happ s2  12.97  0.000  1.398  3.570  0.274  -0.036  0.222  315.480
Spk1 sadd s1  14.13  0.076  0.897  2.523  -0.014  -0.008  0.333  132.500
Spk1 sadd s2  13.78  0.117  1.414  2.344  0.082  -0.043  0.500  199.908
Spk2 angr s1  13.04  0.000  0.610  4.177  0.179  -0.046  0.222  92.303
Spk2 angr s2  14.20  0.000  1.439  3.481  -0.060  -0.036  0.111  77.147
Spk2 anno s1  13.85  0.000  0.777  3.780  -0.036  -0.073  0.222  248.902
Spk2 anno s2  14.57  0.000  1.283  3.414  -0.100  -0.032  0.111  82.188
Spk2 anxi s1  13.43  0.000  0.874  4.307  0.000  -0.059  0.333  250.884
Spk2 anxi s2  14.69  0.000  1.083  3.703  0.000  -0.013  0.222  47.667
Spk2 bore s1  16.16  0.000  0.955  3.211  -0.117  -0.027  0.222  195.008
Spk2 bore s2  16.51  0.000  1.466  3.044  -0.109  -0.017  0.111  40.981
Spk2 cofi s1  13.64  0.000  0.883  3.408  -0.050  -0.017  0.333  207.956
Spk2 cofi s2  15.42  0.000  1.337  3.466  -0.133  -0.048  0.111  99.627
Spk2 cofu s1  12.77  0.000  0.608  4.075  -0.107  -0.036  0.222  96.972
Spk2 cofu s2  13.44  0.000  1.249  3.774  -0.048  -0.032  0.111  161.657
Spk2 cote s1  14.33  0.000  0.850  3.736  0.000  -0.028  0.111  169.782
Spk2 cote s2  15.37  0.000  1.060  3.406  0.033  -0.047  0.111  126.633
Spk2 emba s1  15.28  0.000  0.792  3.616  0.000  -0.030  0.222  57.012
Spk2 emba s2  14.41  0.000  1.043  3.333  -0.100  -0.011  0.222  77.453
Spk2 exha s1  13.41  0.000  0.682  3.896  -0.036  -0.041  0.222  40.080
Spk2 exha s2  14.07  0.018  1.114  3.155  -0.127  -0.035  0.222  66.827
Spk2 happ s1  13.49  0.000  0.802  3.862  0.179  0.023  0.222  104.097
Spk2 happ s2  13.94  0.000  1.390  3.904  0.036  0.025  0.111  302.083
Spk2 sadd s1  13.92  0.000  0.629  3.747  -0.179  -0.020  0.333  139.151
Spk2 sadd s2  14.87  0.000  1.308  3.568  0.060  -0.001  0.222  84.732
Table 5-9. Continued

Sample        mpkfall  iNmin  iNmax  pnorMAX  pnorMIN  normpnormin  aratio  aratio2
Spk1 angr s1  207.129  28.136  24.090  71.884  90.193  39.167  6731.0  6312.2
Spk1 angr s2  865.588  32.947  28.109  179.687  63.916  28.416  6664.7  5783.4
Spk1 anno s1  176.754  23.630  15.059  174.806  88.933  25.138  6364.4  5545.3
Spk1 anno s2  132.324  27.892  21.197  165.080  93.889  34.508  5744.3  5196.5
Spk1 anxi s1  125.729  21.532  17.133  95.438  98.775  24.511  6290.5  5281.6
Spk1 anxi s2  186.512  30.755  19.555  122.143  121.313  33.821  5838.0  5873.8
Spk1 bore s1  246.799  24.758  15.416  77.916  60.103  24.224  5551.3  5017.0
Spk1 bore s2  180.003  25.919  19.605  82.308  55.058  28.941  5849.2  4724.4
Spk1 cofi s1  117.831  24.189  17.472  103.910  140.693  44.649  6756.9  6015.2
Spk1 cofi s2  235.039  28.292  22.839  159.589  109.150  33.529  6433.6  5972.4
Spk1 cofu s1  128.675  31.911  23.789  119.357  121.818  50.285  6292.9  5624.2
Spk1 cofu s2  212.533  29.387  20.247  136.253  129.120  45.034  5958.0  5558.3
Spk1 cote s1  168.102  17.196  12.430  111.122  96.220  22.860  6222.8  5565.2
Spk1 cote s2  462.862  21.520  13.381  217.786  114.237  29.325  5586.6  4696.1
Spk1 emba s1  86.908  25.558  22.175  88.616  84.257  38.095  6344.6  5304.4
Spk1 emba s2  58.453  30.162  16.368  82.139  69.333  22.999  5906.2  5102.7
Spk1 exha s1  192.757  23.543  17.073  69.524  57.356  23.260  6241.1  6022.4
Spk1 exha s2  203.743  42.675  21.390  121.466  67.151  26.095  5790.1  5352.3
Spk1 happ s1  96.888  23.536  17.723  90.856  129.985  35.872  6607.3  5974.6
Spk1 happ s2  342.248  26.022  16.219  216.943  165.450  46.344  6463.8  5818.5
Spk1 sadd s1  262.730  28.102  17.526  85.083  90.508  35.875  6245.2  5413.4
Spk1 sadd s2  307.593  30.016  20.702  89.018  90.760  38.718  5157.9  5275.9
Spk2 angr s1  68.815  26.775  19.489  58.520  59.734  14.302  6551.8  5999.6
Spk2 angr s2  41.673  17.365  16.801  66.911  51.985  14.932  5994.5  6011.5
Spk2 anno s1  183.671  21.891  18.724  104.277  60.465  15.998  6260.6  5347.1
Spk2 anno s2  30.564  17.599  12.401  66.889  54.537  15.976  5657.2  5716.4
Spk2 anxi s1  109.484  24.012  14.717  86.149  87.309  20.273  5847.9  5454.6
Spk2 anxi s2  95.384  25.190  13.376  45.493  64.645  17.457  5867.9  4988.5
Spk2 bore s1  40.699  25.227  14.542  60.329  47.330  14.742  5974.1  5614.9
Spk2 bore s2  99.100  25.231  12.451  64.318  47.097  15.474  5949.7  5188.2
Spk2 cofi s1  176.276  21.955  17.579  108.227  60.122  17.640  5995.3  5400.3
Spk2 cofi s2  192.685  22.661  11.849  121.331  65.270  18.834  6169.6  5496.4
Spk2 cofu s1  74.038  20.895  16.920  119.406  59.695  14.650  5733.6  5228.7
Spk2 cofu s2  195.890  26.852  11.788  124.485  53.543  14.187  5601.0  5459.7
Spk2 cote s1  101.869  23.240  14.470  78.687  54.629  14.624  5693.7  5422.0
Spk2 cote s2  81.211  23.436  10.752  104.754  69.143  20.299  5600.8  5462.5
Spk2 emba s1  33.307  20.447  13.885  38.865  45.818  12.670  5632.9  4903.6
Spk2 emba s2  42.167  17.886  11.438  37.551  53.239  15.974  5828.6  5505.4
Spk2 exha s1  61.693  24.087  13.801  39.357  48.368  12.415  5713.9  4971.0
Spk2 exha s2  95.455  25.011  12.656  57.072  40.652  12.885  5615.4  5120.3
Spk2 happ s1  143.556  24.276  16.312  87.301  82.088  21.253  5968.1  5930.5
Spk2 happ s2  258.144  18.007  12.740  104.999  48.080  12.317  5560.7  5417.3
Spk2 sadd s1  102.523  19.467  14.126  41.608  66.367  17.711  4973.0  5098.1
Spk2 sadd s2  65.154  27.825  13.987  35.114  53.916  15.109  5364.9  5159.1
Table 5-9. Continued

Sample        maratio  maratio2  m_LTAS  m_LTAS2  attack  nattack  dutycyc  normattack
Spk1 angr s1  -6.851  -10.949  -0.00176  -0.00808  2.196  13.631  0.497  4.416
Spk1 angr s2  -7.908  -13.747  -0.00405  -0.00542  1.738  8.361  0.393  4.424
Spk1 anno s1  -5.440  -15.254  -0.00456  -0.00639  0.834  5.157  0.445  1.873
Spk1 anno s2  -8.806  -11.325  -0.00562  -0.00341  0.770  4.177  0.411  1.874
Spk1 anxi s1  -4.036  -14.879  -0.00266  -0.00520  0.500  3.166  0.518  0.965
Spk1 anxi s2  -11.919  -10.436  -0.00590  -0.00350  0.917  4.633  0.439  2.090
Spk1 bore s1  -10.049  -18.532  -0.00413  -0.00930  0.352  1.312  0.315  1.115
Spk1 bore s2  -12.296  -19.295  -0.00350  -0.00749  0.285  0.982  0.371  0.769
Spk1 cofi s1  -5.385  -8.340  -0.00412  -0.00615  1.644  8.841  0.385  4.271
Spk1 cofi s2  -6.110  -12.368  -0.00352  -0.00652  1.948  12.365  0.297  6.551
Spk1 cofu s1  -8.804  -12.183  -0.00638  -0.00873  1.372  9.335  0.426  3.221
Spk1 cofu s2  -10.361  -13.466  -0.00544  -0.00778  1.052  5.199  0.424  2.485
Spk1 cote s1  -6.237  -13.222  -0.00280  -0.00681  0.541  3.732  0.479  1.131
Spk1 cote s2  -14.323  -19.727  -0.00829  -0.00560  0.423  1.897  0.337  1.252
Spk1 emba s1  -4.781  -15.465  -0.00435  -0.00853  0.873  4.477  0.395  2.209
Spk1 emba s2  -10.106  -12.822  -0.00820  -0.00650  0.616  2.636  0.359  1.715
Spk1 exha s1  -4.103  -10.541  -0.00136  -0.00535  0.570  2.850  0.457  1.248
Spk1 exha s2  -9.383  -14.134  -0.00582  -0.00924  0.813  2.465  0.314  2.587
Spk1 happ s1  -5.663  -9.295  -0.00403  -0.00556  1.474  7.669  0.511  2.884
Spk1 happ s2  -6.722  -11.579  -0.00296  -0.00555  1.251  9.458  0.571  2.193
Spk1 sadd s1  -3.385  -10.358  -0.00255  -0.00512  0.543  1.912  0.389  1.395
Spk1 sadd s2  -12.578  -10.739  -0.00662  -0.00517  0.680  2.710  0.392  1.733
Spk2 angr s1  -9.905  -16.042  -0.00590  -0.00512  1.965  13.854  0.454  4.333
Spk2 angr s2  -16.010  -10.931  -0.00568  -0.00781  1.498  6.836  0.274  5.457
Spk2 anno s1  -7.975  -20.075  -0.00459  -0.00851  0.936  6.642  0.410  2.285
Spk2 anno s2  -14.461  -15.179  -0.00776  -0.00637  0.750  3.477  0.379  1.979
Spk2 anxi s1  -11.411  -17.115  -0.00774  -0.00667  1.091  6.670  0.389  2.805
Spk2 anxi s2  -13.166  -16.948  -0.00808  -0.00589  0.894  4.892  0.386  2.317
Spk2 bore s1  -14.381  -19.130  -0.00820  -0.00707  0.553  3.375  0.513  1.077
Spk2 bore s2  -15.393  -21.515  -0.00670  -0.00701  0.541  1.798  0.352  1.539
Spk2 cofi s1  -11.963  -19.133  -0.00679  -0.00596  1.057  6.437  0.449  2.353
Spk2 cofi s2  -11.755  -15.347  -0.00546  -0.00473  0.784  3.981  0.374  2.099
Spk2 cofu s1  -14.212  -17.855  -0.00587  -0.00499  0.673  4.121  0.373  1.802
Spk2 cofu s2  -19.791  -17.775  -0.01070  -0.00623  0.377  1.679  0.357  1.055
Spk2 cote s1  -16.088  -19.046  -0.00882  -0.00776  0.625  3.928  0.522  1.196
Spk2 cote s2  -17.512  -17.155  -0.00663  -0.00767  0.577  2.603  0.311  1.853
Spk2 emba s1  -16.699  -22.315  -0.00648  -0.00739  0.567  3.540  0.402  1.410
Spk2 emba s2  -16.768  -19.330  -0.00535  -0.00704  0.416  1.668  0.403  1.034
Spk2 exha s1  -14.785  -21.255  -0.00784  -0.00705  0.502  3.051  0.474  1.061
Spk2 exha s2  -18.859  -19.389  -0.00853  -0.00830  0.384  1.687  0.421  0.914
Spk2 happ s1  -11.462  -13.383  -0.00781  -0.00624  1.130  6.623  0.324  3.490
Spk2 happ s2  -17.944  -17.064  -0.00794  -0.00612  0.674  3.337  0.347  1.942
Spk2 sadd s1  -12.349  -18.556  -0.00819  -0.00719  0.465  3.193  0.476  0.977
Spk2 sadd s2  -21.077  -18.903  -0.00725  -0.00656  0.393  1.632  0.448  0.876
Table 5-10. Reliability analysis of manual acoustic measurements (pause duration, duration of a stressed vowel or Vowel 1, and duration of an unstressed vowel or Vowel 2 in seconds), and vowel-to-consonant ratio (VCR) for test1 set by Author and Judge 2.

                   Pause Duration (s)   Vowel 1 (s)        Vowel 2 (s)        VCR
Sample             Author  Judge 2      Author  Judge 2    Author  Judge 2    Author  Judge 2
Talker1_angr_s1    0.17    0.15         0.13    0.12       0.25    0.26       0.68    0.71
Talker1_anxi_s2    0.06    0.05         0.17    0.17       0.06    0.05       1.12    1.29
Talker1_cofi_s2    0.00    0.00         0.17    0.17       0.07    0.08       1.17    1.12
Talker1_exha_s2    0.05    0.04         0.20    0.19       0.11    0.12       1.21    1.16
Talker1_happ_s2    0.00    0.00         0.17    0.16       0.06    0.06       1.40    1.46
Talker1_lone_s1    0.19    0.18         0.14    0.15       0.08    0.10       0.75    0.97
Talker1_resp_s2    0.00    0.00         0.15    0.15       0.11    0.08       1.11    1.00
Talker1_susp_s1    0.19    0.19         0.12    0.12       0.08    0.10       0.68    0.63
Talker2_anno_s1    0.00    0.00         0.14    0.13       0.10    0.11       0.78    0.81
Talker2_anxi_s2    0.00    0.00         0.20    0.19       0.10    0.07       1.08    1.14
Talker2_bore_s2    0.00    0.00         0.31    0.31       0.12    0.11       1.47    1.54
Talker2_exha_s1    0.00    0.00         0.16    0.17       0.07    0.08       0.68    0.72
Talker2_happ_s1    0.00    0.00         0.14    0.13       0.07    0.07       0.80    0.83
Talker2_jeal_s2    0.00    0.00         0.20    0.21       0.06    0.06       1.14    1.17
Talker2_sadd_s2    0.00    0.00         0.19    0.19       0.11    0.06       1.31    1.21
Talker2_susp_s2    0.00    0.00         0.28    0.26       0.08    0.08       1.24    1.31

Pearson's Correlation: 0.999 0.948 0.988 0.927

Talker1_angr_s1 is the first angry sentence by Talker 1.
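The inter-judge reliability figures in Table 5-10 are Pearson correlations between the Author's and Judge 2's measurements. The sketch below computes Pearson's r for the pause duration columns, using the two-decimal values printed in the table. The function name `pearson_r` is introduced here for illustration. Because the inputs are rounded, the result can differ slightly from the coefficient reported in the table.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Pause durations (s) from Table 5-10: eight Talker 1 rows, then eight Talker 2 rows.
author = [0.17, 0.06, 0.00, 0.05, 0.00, 0.19, 0.00, 0.19] + [0.00] * 8
judge2 = [0.15, 0.05, 0.00, 0.04, 0.00, 0.18, 0.00, 0.19] + [0.00] * 8

print(round(pearson_r(author, judge2), 3))
```

Many of the pause values are zero for both judges, so the coefficient is driven largely by the handful of samples that contain a pause at all.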
Table 5-11. Regression equations for multiple perceptual models using the training and test1 sets.

Regression Equation
D1  -0.002*aratio2 -0.768*srate -0.026*pnorMIN +13.87
D2  -0.887*normattack +0.132*normpnorMIN -1.421

D1  -0.001*aratio +0.983*srate +0.256*Nattack +4.828*normnpks +2.298
D2  -2.066*attack +0.031*pnorMIN +0.097*iNmax -2.832

D1  -2.025*VCR -0.006*mpkfall -0.071*pnorMIN +6.943
D2  -0.662*normattack +0.049*pnorMIN -0.008*mpkrise -0.369

D1  -0.238*iNmax -1.523*srate -0.02*pnorMAX +14.961*dutycyc +4.83
D2  -1.584*srate +0.013*mpkrise12.185*srtrend-12.185

D1  0.265*iNmax -7.097*dutycyc +0.028*pnorMAX +0.807*MeanCPP 16.651
D2  0.036*normpnorMIN +7.477*PP -524.541*m_LTAS +0.159*maratio2 2.061

D1  0.249*iNmax +14.257*dutycyc -0.011*pnorMAX -0.071*pnorMIN 6.687
D2  -0.464*iNmax +0.014*MeanCPP +7.06*normnpks +7.594*srtrend 2.614*srate -14.805

D1  0.178*iNmin -1.677*srate +0.025*pnorMAX -0.028*pnorMIN +1.446
D2  -0.003*aratio -3.289*VCR 0.007*mpkfall +0.008*pnorMAX +22.475

D1  4.802*srtrend -0.044*pnorMIN -0.013*pnorMAX +4.721
D2  -7.038*srtrend +0.017*pnorMAX -1.47*srate +0.201*normattack +2.542

D1  -0.336*maratio +0.008*mpkrise +0.206*iNmin -0.122*maratio2 -10.306
D2  -0.006*mpkrise -15.768*dutycyc -0.879*MeanCPP -0.013*pnorMIN +21.423

D1  -6.68*normnpks +0.221*iNmax 0.002*aratio +270.486*m_LTAS +10.171
D2  -28.454*gtrend +0.504*maratio2 -0.038*pnorMIN -0.193*iNmin 736.463*mLTAS2 -0.992*MeanCPP +24.581

D1  -0.034*pnorMAX -8.336*srtrend +0.002*aratio -2.086*VCR -5.438
D2  -0.334*maratio -0.184*iNmin +0.925*srate +0.008*pnorMAX -4.197

D1  -0.304*maratio2 -591.928*m_LTAS2 +0.139*normpnorMIN -11.395
D2  298.412*m_LTAS +7.784*VCR 0.007*mpkfall +156.11*PP +0.091*pnorMIN -0.002*aratio -1.884

Row labels: TRAINING; TEST1; Spk2; Spk1; Overall; Overall; Spk1; Spk2; Sent1; Sent2; Spk1, Sent1; Spk1, Sent2; Spk2, Sent1; Spk2, Sent2
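As an illustration of how a pair of these regression equations maps acoustic measurements onto the two perceptual dimensions, the sketch below evaluates the first D1/D2 pair listed in Table 5-11. The function name `predict_dimensions` and the input values are hypothetical and chosen only to show the arithmetic. Whether the equations expect raw or speaker-normalized feature values is not stated in this table, so the inputs should not be read as realistic measurements.

```python
def predict_dimensions(features):
    """Evaluate the first D1/D2 regression pair listed in Table 5-11.

    `features` maps the abbreviations of Table 5-4 to numeric values.
    """
    d1 = (-0.002 * features["aratio2"]
          - 0.768 * features["srate"]
          - 0.026 * features["pnorMIN"]
          + 13.87)
    d2 = (-0.887 * features["normattack"]
          + 0.132 * features["normpnorMIN"]
          - 1.421)
    return d1, d2


# Hypothetical feature values, for illustration only.
example = {
    "aratio2": 5500.0,
    "srate": 3.0,
    "pnorMIN": 80.0,
    "normattack": 2.0,
    "normpnorMIN": 25.0,
}

d1, d2 = predict_dimensions(example)
print(round(d1, 3), round(d2, 3))  # -1.514 0.105
```

The resulting (D1, D2) point can then be handed to a classifier that operates in the two-dimensional perceptual space.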
Table 5-12. Correlation matrix of all acoustic variables submitted to the stepwise regression analysis using the training set samples. Each column number corresponds to the acoustic cues listed by row.

Columns 1-12
1 normattack 1.00 .52 -.45 .26 .44 -.01 .75 .04 .17 .04 .28 .17
2 normpnormin .52 1.00 -.64 .04 .42 .34 .18.104.22.168 .28 .12
3 MeanCPP -.45 -.64 1.00 -.01 -.05 -.32 -.40 -.50 -.44 .13 -.31 -.42
4 pp .26 .04 -.01 1.00 .08 .40 .62 -.57 -.39 .16 -.16 -.71
5 vcr .44 .42 -.05 .08 1.00 .45 .22.214.171.124 -.22 -.25
6 iNmin -.01 .34 -.32 .40 .45 1.00 .31 -.03 .23 .12 -.04 -.27
7 iNmax .75 .35 -.126.96.36.199 1.00 -.31 -.07 -.03 .39 -.06
8 srate .04 .38 -.50 -.57 .12 -.03 -.31 1.00 .75 -.28 .09 .50
9 srtrend .17 .58 -.44 -.39 .51 .23 -.07 .75 1.00 .11 -.07 .26
10 gtrend .04 .28 .13 .16 .51 .12 -.03 -.28 .11 1.00 -.15 -.59
11 normnpks .28 .28 -.31 -.16 -.22 -.04 .39 .09 -.07 -.15 1.00 .37
12 mpkrise .17 .12 -.42 -.71 -.25 -.27 -.06 .50 .26 -.59 .37 1.00
13 mpkfall .01 .49 -.42 -.52 .35 .30 -.188.8.131.52 .11 .35
14 pnorMAX .19 .55 -.61 -.39 .27 .20 -.06 .74 .81 -.15 -.20 .51
15 pnorMIN .38 .91 -.68 -.184.108.40.206 .220.127.116.11 .31
16 aratio .76 .34 -.44 -.09 .11 -.18.104.22.168 -.24 .10 .57
17 aratio2 .82 .45 -.43 -.09 .68 .04 .38 .22.214.171.124 .26
18 maratio .03 .12 -.26 -.54 -.52 -.46 .05 .27 .12 -.18 .57 .62
19 maratio2 .68 .60 -.55 -.05 .54 .08 .30 .126.96.36.199 .13
20 m_LTAS .25 -.30 .16 -.38 -.25 -.87 -.04 -.04 -.14 .02 .23 .32
21 m_LTAS2 .03 .56 -.58 -.44 .09 .26 -.09 .60 .46 -.02 .63 .43
22 attack .99 .57 -.49 .18 .44 -.01 .74 .09 .24 .04 .33 .24
23 nattack .96 .64 -.57 .12 .46 .04 .70 .19 .39 .05 .25 .29
24 dutycyc -.36 .09 -.15 -.65 -.15 .04 -.188.8.131.52 .37 .45

Table 5-12. Continued

Columns 13-24
1 normattack .01 .19 .38 .76 .82 .03 .68 .25 .03 .99 .96 -.36
2 normpnormin .184.108.40.206 .45 .12 .60 -.220.127.116.11 .09
3 MeanCPP -.42 -.61 -.68 -.44 -.43 -.26 -.55 .16 -.58 -.49 -.57 -.15
4 pp -.52 -.39 -.26 -.09 -.09 -.54 -.05 -.38 -.44 .18 .12 -.65
5 vcr .18.104.22.168 .68 -.52 .54 -.25 .09 .44 .46 -.15
6 iNmin .30 .20 .26 -.26 .04 -.46 .08 -.87 .26 -.01 .04 .04
7 iNmax -.11 -.06 .11 .50 .38 .05 .30 -.04 -.09 .74 .70 -.27
8 srate .22.214.171.124 .18 .27 .21 -.04 .60 .09 .19 .18
9 srtrend .126.96.36.199 .34 .12 .25 -.188.8.131.52 .32
10 gtrend .37 -.15 .11 -.24 .24 -.18 .42 .02 -.02 .04 .05 .15
11 normnpks .11 -.184.108.40.206 .220.127.116.11 .33 .25 .37
12 mpkrise .18.104.22.168 .22.214.171.124 .126.96.36.199
13 mpkfall 1.00 .64 .57 .09 .33 .38 .39 -.11 .50 .09 .22 .68
14 pnorMAX .64 1.00 .74 .45 .30 .23 .18 -.188.8.131.52 .22
15 pnorMIN .57 .74 1.00 .29 .42 .21 .54 -.184.108.40.206 .20
16 aratio .09 .45 .29 1.00 .63 .32 .35 .41 -.08 .79 .83 -.07
17 aratio220.127.116.11 .63 1.00 -.08 .86 .26 .22 .83 .82 -.08
18 maratio .18.104.22.168 -.08 1.00 -.01 .53 .30 .10 .15 .60
19 maratio22.214.171.124 .35 .86 -.01 1.00 .17 .49 .69 .66 .02
20 m_LTAS -.11 -.21 -.126.96.36.199 .17 1.00 -.188.8.131.52
21 m_LTAS184.108.40.206 -.08 .22 .30 .49 -.17 1.00 .10 .12 .53
22 attack .09 .25 .45 .79 .220.127.116.11 .10 1.00 .98 -.27
23 nattack .18.104.22.168 .22.214.171.124 .12 .98 1.00 -.19
24 dutycyc .68 .22 .20 -.07 -.08 .60 .02 .13 .53 -.27 -.19 1.00
Table 5-13. Classification accuracy for the full training set (All Sentences) and a reduced set based on an exclusion criterion (Correct Category Sentences). Classification is reported for all samples, samples by Speaker 1 only, and samples by Speaker 2 only based on three acoustic models.

                                              Percent Correct            d-prime
                                              H     C     A     S       H     C     A     S
All Sentences
  Overall Model              Spk1 samples     100%  75%   100%  75%     3.80  1.74  5.15  3.25
                             Spk2 samples     100%  100%  100%  100%    5.15  5.15  5.15  5.15
                             All samples      100%  88%   100%  88%     4.17  2.62  5.15  3.73
  Speaker 1 Model            Spk1 samples     100%  100%  100%  100%    5.15  5.15  5.15  5.15
                             Spk2 samples     50%   50%   100%  50%     1.22  0.57  5.15  0.57
                             All samples      75%   75%   100%  75%     2.27  1.74  5.15  1.74
  Speaker 2 Model            Spk1 samples     100%  100%  100%  50%     5.15  3.64  3.86  2.58
                             Spk2 samples     100%  75%   100%  100%    3.80  3.25  5.15  5.15
                             All samples      100%  88%   100%  75%     4.17  2.62  4.22  3.25
Correct Category Sentences
  Overall Model              Spk1 samples     100%  33%   100%  100%    3.64  2.15  3.73  5.15
                             Spk2 samples     100%  100%  100%  100%    5.15  5.15  5.15  5.15
                             All samples      100%  67%   100%  100%    4.04  3.01  4.11  5.15
  Speaker 1 Model            Spk1 samples     100%  100%  100%  100%    5.15  5.15  5.15  5.15
                             Spk2 samples     50%   33%   100%  33%     1.07  0.00  5.15  0.00
                             All samples      75%   67%   100%  67%     2.14  1.40  5.15  1.40
  Speaker 2 Model            Spk1 samples     50%   67%   100%  33%     2.58  0.86  3.25  2.15
                             Spk2 samples     100%  100%  100%  100%    5.15  5.15  5.15  5.15
                             All samples      75%   83%   100%  67%     3.25  1.93  3.73  3.01

H = Category 1 or Happy, C = Category 2 or Content-Confident, A = Category 3 or Angry, and S = Category 4 or Sad; Spk = Speaker Number; Sent = Sentence Number
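The d-prime columns in these tables express sensitivity as the difference between the z-transformed hit rate and false alarm rate. The sketch below shows the standard computation. The function name `d_prime` and the clipping of rates to the range 0.005 to 0.995 are assumptions: the clipping rule was chosen here because it yields a ceiling of 5.15, which matches the largest d-prime values in the classification tables, but the procedure actually used is not stated in this section.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate, floor=0.005):
    """Sensitivity index d' = z(hit rate) - z(false alarm rate).

    Rates of exactly 0 or 1 have no finite z-score, so they are clipped to
    [floor, 1 - floor]. The default floor of 0.005 is an assumption.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF

    def clip(p):
        return min(max(p, floor), 1.0 - floor)

    return z(clip(hit_rate)) - z(clip(false_alarm_rate))


print(round(d_prime(1.0, 0.0), 2))  # perfect performance under this clipping: 5.15
print(round(d_prime(0.5, 0.5), 2))  # chance performance: 0.0
```

Unlike percent correct, d-prime is lowered by false alarms, which is why two cells with the same percent correct in the table can carry different d-prime values.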
215 Table 5-14. Classification accuracy for the test1 set using the Overall training acoustic model and the Overall test1 acoustic model. HCASHCAS Spk1 samples75%63%50%88%2.271.111.642.62 Spk2 samples50%100%100%75%2.583.375.152.14 Sent1 samples75%88%50%75%2.271.722.583.25 Sent2 samples50%75%100%88%2.581.744.222.22 All samples63%81%75%81%2.231.682.632.35 Spk1 samples50%38%50%75%0.761.151.641.24 Spk2 samples50%25%0%88%1.220.39-1.902.22 Sent1 samples25%25%0%88%0.290.79-1.541.52 Sent2 samples75%38%50%75%1.640.751.042.14 All samples50%31%25%81%0.970.750.361.68Overall Training Modeld -prime Percent CorrectOverall Test1 Model H = Category 1 or Happy, C = Category 2 or Content-Confident, A = Category 3 or angry, and S = Category 4 or Sad; Spk = Speak er Number ; Sent = Sentence Number
216 Table 5-15. Classification accuracy of the test1 set by two training and four test1 models. HCASHCAS Spk1 samples75%63%50%88%1.901.781.642.22 Spk2 samples25%38%100%50%0.55-0.145.150.57 Sent1 samples50%88%100%63%1.221.725.152.89 Sent2 samples50%13%50%75%1.22-0.361.640.85 All samples50%50%75%69%1.220.672.631.28 Spk1 samples50%50%50%75%1.220.572.581.47 Spk2 samples50%50%100%75%1.220.794.221.74 Sent1 samples50%63%50%88%2.581.112.581.72 Sent2 samples50%38%100%63%0.722.214.171.124 All samples50%50%75%75%1.220.672.631.60 Spk1 samples50%13%50%25%0.76-0.580.670.12 Spk2 samples0%25%50%0%-2.29-0.110.39-1.11 Sent1 samples50%13%0%13%0.14-0.36-1.73-0.36 Sent2 samples0%25%100%13%-1.61-0.312.830.31 All samples25%19%50%13%-0.17-0.320.52-0.08 Spk1 samples25%75%50%88%0.551.471.642.62 Spk2 samples75%13%100%88%1.64-0.083.612.62 Sent1 samples50%38%50%88%1.590.750.842.22 Sent2 samples50%50%100%88%0.760.795.153.73 All samples50%44%75%88%1.090.761.962.62 Spk1 samples50%38%100%38%1.220.053.610.75 Spk2 samples25%38%100%38%0.9126.96.36.199 Sent1 samples75%13%100%63%1.90-0.363.611.11 Sent2 samples0%63%100%13%-0.980.503.100.31 All samples38%38%100%38%1.060.153.330.75 Spk1 samples100%63%100%88%3.542.894.223.73 Spk2 samples75%88%50%50%3.252.621.040.79 Sent1 samples100%63%50%63%3.801.781.281.39 Sent2 samples75%88%100%75%2.273.733.862.14 All samples88%75%75%69%2.532.481.961.73Training Set Models Test1 Set Models Speaker 1 Model Speaker 2 Modeld -prime Percent CorrectSpeaker 1 Model Sentence 2 Model Speaker 2 Model Sentence 1 Model H = Category 1 or Happy, C = Category 2 or Content-Confident, A = Category 3 or angry, and S = Category 4 or Sad; Spk = Speak er Number ; Sent = Sentence Number.
Table 5-16. Perceptual accuracy for the test2 set based on all sentences and two exclusionary criteria.

All Sentences
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TF1            49%   61%   26%   79%     1.34   0.75   1.80   1.79
TF2            52%   86%   49%   55%     1.63   1.29   2.41   1.80
TF3            82%   78%   74%   78%     2.31   1.98   3.16   1.98
TF4            86%   73%   70%   92%     2.36   1.91   2.63   3.19
TF5            42%   84%   20%   80%     1.49   1.47   1.65   2.06
TM6            12%   83%   35%   48%     0.25   0.91   1.64   1.31
TM7            22%   64%   44%   71%     0.92   0.57   1.71   1.60
TM8            42%   72%   48%   38%     1.16   0.71   1.52   0.96
TM9            29%   83%   30%   56%     1.23   0.95   1.77   1.48
TM10           39%   62%    4%   70%     1.05   0.63   0.95   1.34
Sent1          48%   72%   41%   73%     1.48   1.09   1.92   1.69
Sent2          42%   79%   42%   65%     1.31   1.12   1.89   1.76
Sent3          47%   73%   37%   62%     1.28   0.92   1.80   1.55
TF1, Sent1     48%   51%   49%   85%     1.48   0.78   2.24   1.70
TF1, Sent2     31%   75%   22%   73%     1.30   0.86   1.58   1.78
TF1, Sent3     68%   56%    8%   78%     1.47   0.70   1.52   2.05
TF2, Sent1     63%   82%   55%   66%     2.05   1.34   2.44   1.95
TF2, Sent2     54%   90%   55%   53%     1.68   1.53   2.58   1.83
TF2, Sent3     39%   86%   37%   45%     1.20   1.03   2.22   1.66
TF3, Sent1     62%   80%   76%   85%     1.85   2.06   3.16   2.12
TF3, Sent2     95%   84%   61%   81%     3.28   2.12   2.82   2.31
TF3, Sent3     90%   70%   84%   67%     2.36   1.81   3.59   1.61
TF4, Sent1     83%   75%   67%   95%     2.21   2.08   2.85   3.33
TF4, Sent2     89%   75%   79%   88%     2.48   1.93   3.04   3.24
TF4, Sent3     86%   69%   65%   92%     2.41   1.73   2.20   3.16
TF5, Sent1     36%   79%    9%   77%     1.52   1.11   1.57   1.80
TF5, Sent2     33%   83%   39%   83%     1.11   1.57   1.88   2.27
TF5, Sent3     56%   89%   12%   79%     1.86   1.78   1.72   2.14
TM6, Sent1     20%   85%   13%   56%     0.57   1.05   1.04   1.51
TM6, Sent2     11%   80%   59%   50%     0.25   0.91   2.30   1.19
TM6, Sent3      5%   84%   33%   40%    -0.23   0.79   1.43   1.26
TM7, Sent1     27%   69%   47%   72%     1.08   0.77   1.73   1.71
TM7, Sent2     19%   74%   27%   71%     1.02   0.77   1.33   1.73
TM7, Sent3     20%   50%   58%   71%     0.70   0.21   2.06   1.39
TM8, Sent1     53%   65%   67%   42%     1.42   0.65   2.04   0.93
TM8, Sent2     34%   73%   45%   36%     0.94   0.66   1.29   0.99
TM8, Sent3     40%   79%   33%   37%     1.11   0.86   1.27   0.98
Table 5-16 Continued

All Sentences
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TM9, Sent1     47%   81%   29%   70%     1.86   1.27   1.78   1.62
TM9, Sent2     20%   84%   28%   50%     1.05   0.77   1.66   1.45
TM9, Sent3     20%   84%   34%   48%     0.74   0.83   1.87   1.46
TM10, Sent1    38%   58%    0%   78%     1.14   0.64   -inf   1.46
TM10, Sent2    36%   69%    5%   69%     0.75   0.79    Inf   1.68
TM10, Sent3    44%   59%    5%   61%     1.31   0.47   1.31   0.98
ALL            46%   75%   40%   67%     1.36   1.04   1.87   1.65

Above Chance Sentences
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TF1            59%   63%   35%   79%     1.70   0.99   2.02   1.77
TF2            52%   86%   49%   62%     1.59   1.38   2.43   1.98
TF3            82%   78%   74%   78%     2.34   1.97   3.15   1.99
TF4            86%   77%   70%   94%     2.52   2.08   2.60   3.30
TF5            50%   81%   25%   83%     1.67   1.54   1.78   2.10
TM6            23%   83%   35%   57%     0.91   1.10   1.50   1.51
TM7            25%   64%   44%   75%     0.99   0.63   1.68   1.74
TM8            43%   72%   48%   45%     1.20   0.81   1.44   1.18
TM9            34%   83%   30%   69%     1.43   1.05   1.88   1.83
TM10           39%   66%   N/A   75%     1.29   0.85    N/A   1.51
Sent1          57%   73%   50%   77%     1.72   1.29   2.12   1.86
Sent2          46%   80%   46%   72%     1.60   1.25   1.94   1.94
Sent3          54%   75%   44%   70%     1.52   1.14   1.94   1.80
TF1, Sent1     55%   49%   49%   85%     1.67   0.83   2.18   1.73
TF1, Sent2     45%   75%   22%   73%     1.64   0.99   1.58   1.75
TF1, Sent3     76%   62%   N/A   78%     1.90   1.18    N/A   1.94
TF2, Sent1     63%   82%   55%   82%     1.99   1.57   2.48   2.46
TF2, Sent2     54%   90%   55%   54%     1.62   1.57   2.54   1.85
TF2, Sent3     39%   86%   37%   48%     1.19   1.07   2.32   1.76
TF3, Sent1     62%   80%   76%   85%     1.85   2.06   3.16   2.12
TF3, Sent2     95%   84%   61%   84%     3.54   2.07   2.79   2.43
TF3, Sent3     90%   70%   84%   67%     2.36   1.81   3.59   1.61
TF4, Sent1     83%   69%   67%   95%     2.19   1.93   2.90   3.28
TF4, Sent2     89%   96%   79%   89%     3.59   3.03   3.02   3.28
TF4, Sent3     86%   69%   65%   96%     2.36   1.78   2.16   3.47
TF5, Sent1     71%   79%   N/A   77%     2.39   1.66    N/A   1.73
TF5, Sent2     33%   78%   39%   83%     1.06   1.38   1.84   2.19
TF5, Sent3     56%   85%   12%   89%     1.89   1.68   1.65   2.47
TM6, Sent1     25%   85%   13%   56%     0.77   1.19   1.05   1.45
TM6, Sent2     21%   80%   59%   64%     1.25   0.94   2.08   1.58
Table 5-16 Continued

Above Chance Sentences
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TM6, Sent3     N/A   84%   33%   56%      N/A 188.8.131.52
TM7, Sent1     47%   69%   47%   92%     1.52   1.10   1.58   2.66
TM7, Sent2     19%   74%   27%   71%     1.02   0.77   1.33   1.73
TM7, Sent3     20%   50%   58%   71%     0.70   0.21   2.06   1.39
TM8, Sent1     66%   65%   67%   52%     1.83   0.83   1.88   1.30
TM8, Sent2     34%   73%   45%   44%     0.93   0.80   1.24   1.21
TM8, Sent3     40%   79%   33%   41%     1.15   0.92   1.22   1.09
TM9, Sent1     47%   81%   29%   68%     1.81   1.21   1.81   1.56
TM9, Sent2     20%   84%   28%   71%     1.13   0.82   1.78   2.00
TM9, Sent3     36%   84%   34%   70%     1.34   1.09   2.06   2.07
TM10, Sent1    38%   58%   N/A   74%     1.35   0.71    N/A   1.19
TM10, Sent2    36%   68%   N/A   85%     1.04   0.84    N/A   2.04
TM10, Sent3    44%   76%   N/A   68%     1.52   1.08    N/A   1.47
ALL            52%   76%   47%   73%     2.27   1.68   2.33   2.12

Correct Category Sentences
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TF1            58%   73%   49%   79%     2.06   1.36   2.39   1.73
TF2            60%   86%   55%   66%     1.83   1.53   2.59   2.06
TF3            92%   78%   74%   83%     2.92   2.03   3.13   2.24
TF4            86%   89%   70%   92%     3.20   2.56   2.59   3.14
TF5            71%   84%   N/A   82%     2.35   1.99    N/A   2.16
TM6            N/A   83%   59%   64%      N/A   1.47   2.17   1.58
TM7            N/A   68%   53%   91%      N/A   1.60   1.77   2.43
TM8            55%   72%   50%   62%     1.60   1.01   1.37   1.74
TM9            75%   83%   N/A   71%     2.58   1.56    N/A   1.81
TM10           65%   76%   N/A   83%     1.83   1.71    N/A   2.13
Sent1          66%   77%   60%   85%     2.11   1.70   2.34   2.22
Sent2          72%   81%   64%   75%     2.32   1.65   2.36   2.06
Sent3          77%   79%   60%   79%     2.40   1.68   2.31   2.15
TF1, Sent1     48%   64%   49%   85%     1.9184.108.40.206
TF1, Sent2     N/A   75%   N/A   73%      N/A   1.37    N/A   1.52
TF1, Sent3     68%   77%   N/A   78%     2.31   1.56    N/A   1.87
TF2, Sent1     63%   82%   55%   82%     1.99   1.57   2.48   2.46
TF2, Sent2     54%   90%   55%   53%     1.68   1.53   2.58   1.83
TF2, Sent3     67%   86%   N/A   68%     1.90   1.63    N/A   2.18
TF3, Sent1     91%   80%   76%   85%     2.8220.127.116.11
TF3, Sent2     95%   84%   61%   81%     3.28   2.12   2.82   2.31
TF3, Sent3     90%   70%   84%   81%     2.65   1.89   3.55   2.06
TF4, Sent1     83%   92%   67%   95%     3.01   2.81   2.81   3.27
Table 5-16 Continued

Correct Category Sentences
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TF4, Sent2     89%   96%   79%   88%     3.59   2.99   3.03   3.20
TF4, Sent3     86%   81%   65%   92%     18.104.22.168.11
TF5, Sent1     71%   79%   N/A   85%     2.36   1.93    N/A   2.04
TF5, Sent2     50%   83%   N/A   83%     1.71   1.88    N/A   2.31
TF5, Sent3     91%   89%   N/A   79%     3.26   2.22    N/A   2.24
TM6, Sent1     N/A   85%   N/A   66%      N/A   1.93    N/A   1.56
TM6, Sent2     N/A   80%   59%   62%      N/A 22.214.171.124
TM6, Sent3     N/A   84%   N/A   63%      N/A   1.39    N/A   1.70
TM7, Sent1     N/A   69%   47%   93%      N/A   1.53   1.58   2.60
TM7, Sent2     N/A   74%   N/A   90%      N/A   2.08    N/A   2.28
TM7, Sent3     N/A   58%   58%   91%      N/A   1.25   1.84   2.43
TM8, Sent1     53%   65%   67%   65%     1.60   0.89   1.88   1.51
TM8, Sent2     51%   73%   N/A   66%     1.58   1.05    N/A   2.09
TM8, Sent3     64%   79%   33%   55%     1.69   1.09   1.12   1.86
TM9, Sent1     75%   81%   N/A   80%     2.70   1.70    N/A   2.03
TM9, Sent2     N/A   84%   N/A   64%      N/A   1.39    N/A   1.53
TM9, Sent3     N/A   84%   N/A   70%      N/A   1.52    N/A   1.92
TM10, Sent1    N/A   73%   N/A   92%      N/A   2.11    N/A   2.50
TM10, Sent2    N/A   80%   N/A   80%      N/A   1.86    N/A   2.73
TM10, Sent3    65%   74%   N/A   75%     2.05   1.41    N/A   1.66
ALL            72%   79%   61%   79%     1.36   1.04   1.87   1.65

H = Category 1 or Happy, C = Category 2 or Content-Confident, A = Category 3 or Angry, and S = Category 4 or Sad; TF = Female Talker Number; TM = Male Talker Number; Sent = Sentence Number; N/A = Scores not available; these samples were dropped.
221 Table 5-17. Reliability analysis of manual acousti c measurements (stresse d and unstressed vowel durations in seconds) for the test set (e .g. Talker1_angr_s1: first angry sentence by Talker 1). AJAJAJAJ Talker1_angr_s10.080.090.030.04Talker6_anno_s10.050.080.050.07 Talker1_anxi_s10.050.060.030.06Talker6_anxi_s30.120.120.030.03 Talker1_cofi_s10.070.080.050.06Talker6_cofi_s126.96.36.199.08 Talker1_cofu_s188.8.131.52.16Talker6_cofu_s30.160.160.080.10 Talker1_cote_s30.150.150.060.06Talker6_cote_s20.080.090.050.03 Talker1_emba_s184.108.40.206.13Talker6_emba_s10.060.070.040.04 Talker1_exha_s220.127.116.11.10Talker6_sadd_s20.090.090.050.05 Talker2_anno_s10.070.080.050.05Talker7_angr_s20.090.110.040.04 Talker2_bore_s20.070.080.040.06Talker7_anno_s20.090.090.070.06 Talker2_cofi_s30.120.120.060.06Talker7_bore_s20.070.080.040.03 Talker2_cofu_s20.090.110.080.07Talker7_cofu_s10.060.070.040.07 Talker2_cofu_s18.104.22.1680.24Talker7_emba_s20.080.100.040.06 Talker2_emba_s20.110.120.090.07Talker7_happ_s30.090.100.050.06 Talker2_exha_s20.100.120.060.06Talker7_sadd_s20.110.110.060.05 Talker3_anno_s20.120.140.070.08Talker8_angr_s20.090.120.030.06 Talker3_anxi_s30.120.140.060.06Talker8_anno_s20.090.090.040.05 Talker3_cofi_s30.120.130.060.05Talker8_anxi_s20.080.110.040.06 Talker3_exha_s20.140.170.080.07Talker8_cofi_s20.100.120.060.06 Talker3_happ_s20.120.120.090.08Talker8_cofu_s20.130.110.070.07 Talker3_sadd_s10.110.090.070.09Talker8_emba_s10.090.090.040.05 Talker3_sadd_s30.190.190.070.07Talker8_happ_s10.060.090.050.06 Talker4_angr_s10.080.090.050.05Talker9_bore_s10.060.070.050.09 Talker4_angr_s30.160.160.090.09Talker9_cofu_s10.040.060.040.07 Talker4_anxi_s30.100.110.060.07Talker9_cote_s20.080.100.070.07 Talker4_bore_s10.090.090.060.07Talker9_emba_s30.090.110.040.06 Talker4_cofi_s10.070.070.040.05Talker9_happ_s10.060.070.060.05 Talker4_cofu_s20.120.130.070.07Talker9_sadd_s10.060.070.060.07 Talker4_exha_s22.214.171.124.10Talker9_sadd_s30.140.150.030.04 
Talker5_angr_s20.150.180.050.08Talker10_angr_s10.060.090.040.06 Talker5_bore_s10.080.100.130.13Talker10_angr_s30.110.120.030.06 Talker5_cofi_s126.96.36.199.12Talker10_anno_s20.100.130.050.08 Talker5_cofu_s10.100.110.060.07Talker10_cofi_s10.060.070.040.06 Talker5_cofu_s188.8.131.52.12Talker10_cote_s20.120.130.080.08 Talker5_cote_s20.140.150.060.06Talker10_exha_s20.110.120.040.06 Talker5_emba_s30.170.180.070.08Talker10_happ_s30.100.100.060.06 0.9710.919 Pearson's Correlation Coefficient Vowel 1 (s)Vowel 2 (s)Vowel 1 (s)Vowel 2 (s)
Table 5-18. Classification accuracy of the Overall training model for the test2 set samples using the k-means algorithm.

k-means algorithm
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TF1 samples    17%   42%   67%   50%    -0.32   0.09   2.26   1.07
TF2 samples     0%   33%   33%   83%    -1.93   0.14   1.40   1.84
TF3 samples    50%   58%  100%   58%     1.45   1.09   5.15   0.64
TF4 samples    67%   75%   67%   92%     1.48   1.74   3.01   3.96
TF5 samples    17%   75%   33%   67%     0.08   1.11   1.07   2.10
TM6 samples    33%   17%  100%   50%     0.79  -0.54   4.08   0.30
TM7 samples    50%   42%   67%   58%     1.22   0.36   1.93   0.92
TM8 samples     0%   50%    0%   50%    -1.93   0.18  -0.74   0.88
TM9 samples    33%   50%   33%   33%     0.47   0.30   0.85   0.45
TM10 samples   33%   33%   33%   75%     0.4184.108.40.206
Sent1 samples  25%   48%   80%   75%     0.55   0.59   2.72   1.37
Sent2 samples  35%   50%   50%   68%     0.54   0.70   1.75   1.30
Sent3 samples  30%   45%   30%   43%     0.20   0.09   1.12   0.82
All samples    30%   48%   53%   62%     0.41   0.45   1.83   1.14

H = Category 1 or Happy, C = Category 2 or Content-Confident, A = Category 3 or Angry, and S = Category 4 or Sad; TF = Female Talker number; TM = Male Talker number; Sent = Sentence number.
Table 5-19. Classification accuracy of the Overall training model for the test2 set samples using the kNN algorithm for two values of k.

kNN algorithm with k = 1
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TF1 samples    33%   58%   67%   58%     0.33   0.78   3.01   1.28
TF2 samples    17%   58%    1%   58%    -0.20   0.27   0.00   1.52
TF3 samples    67%   58%   67%   67%     1.88   1.09   3.01   1.00
TF4 samples    83%   75%   67%   92%     2.41   1.74   3.01   3.05
TF5 samples    17%   67%   33%   50%    -0.20   0.86   1.07   1.31
TM6 samples    17%   33%   67%   42%    -0.07   0.00   1.93   0.22
TM7 samples    67%   42%   67%   50%     1.48   0.36   1.93   0.88
TM8 samples    17%   50%    1%   42%    -0.20   0.30  -0.74   0.36
TM9 samples    17%   50%    1%   42%    -0.07   0.06  -1.07   0.67
TM10 samples   33%   33%   33%   67%     0.33   0.14   2.15   1.00
Sent1 samples  40%   53%   60%   75%     0.91   0.81   2.58   1.37
Sent2 samples  35%   60%   40%   63%     0.50   0.86   1.63   1.32
Sent3 samples  35%   45%   20%   33%     0.38  -0.02   0.80   0.44
All samples    37%   53%   40%   57%     0.58   0.53   1.63   1.03

kNN algorithm with k = 3
                 Percent Correct               d-prime
                 H     C     A     S        H      C      A      S
TF1 samples    17%   55%   33%   55%     0.03  -0.01   2.15   1.40
TF2 samples     1%   45%    1%   82%    -1.87   0.14   0.00   1.94
TF3 samples    67%   73%  100%   58%     2.18   1.28   5.15   1.02
TF4 samples    50%   80%    1%   92%     1.41   1.41   0.00   3.00
TF5 samples     1%   91%   33%   42%     -1.3220.127.116.11
TM6 samples     1%   25%   50%   42%    -1.15  -1.06   1.83   0.17
TM7 samples    50%   58%   33%   40%     1.41   0.14   2.15   0.62
TM8 samples    17%   40%    1%   42%    -0.26  -0.19   0.00   0.42
TM9 samples    17%   58%    1%   33%     0.08   0.03  -0.74   0.45
TM10 samples   17%   55%   33%   42%     0.23  -0.07   2.15   0.63
Sent1 samples  37%   56%   44%   62%     0.97   0.35   2.44   1.18
Sent2 samples  15%   61%   11%   62%     0.26   0.32   1.36   1.15
Sent3 samples  20%   56%   20%   35%     0.03   0.05   1.21   0.67
All samples    24%   58%   25%   53%     0.41   0.24   1.78   0.99

H = Category 1 or Happy, C = Category 2 or Content-Confident, A = Category 3 or Angry, and S = Category 4 or Sad; TF = Female Talker number; TM = Male Talker number; Sent = Sentence number.
[Figure: three panels of the f0 contour; x-axis: Smoothed Sample Number; y-axis: Frequency (Hz); labeled quantities: Mean, MAX, MIN, pnorMAX, pnorMIN]
Figure 5-1. Acoustic measurements of pnorMIN and pnorMAX from the f0 contour.

[Figure: f0 contour; x-axis: Smoothed Sample Number; y-axis: Frequency (Hz); annotations: Gross f0 trend "gtrend"; 5 points from each voiced segment of f0 contour]
Figure 5-2. Acoustic measurement of gtrend from the f0 contour.
[Figure: f0 contour; x-axis: Smoothed Sample Number; y-axis: Frequency (Hz); annotations: Zero-Crossings to indicate a peak; Zero-Crossing Line: 50% Max; Mean; f0 contour]
Figure 5-3. Acoustic measurement of normnpks from the f0 contour.

[Figure: enlarged section of the f0 contour; x-axis: Smoothed Sample Number; y-axis: Frequency (Hz); annotations: Zero-Crossings to indicate a peak; f0 contour; Peak rise time; Peak fall time; Change in frequency from peak to zero-crossing]
Figure 5-4. Acoustic measurements of mpkrise and mpkfall from the f0 contour.
[Figure: intensity contour; x-axis: Smoothed Sample Number; y-axis: Intensity (dB); labeled quantities: Mean, Max, Min, iNmin, iNmax]
Figure 5-5. Acoustic measurements of iNmin and iNmax from the f0 contour.

[Figure: intensity contour; x-axis: Smoothed Sample Number; y-axis: Intensity (dB); annotations: Peak; 80% down from peak; Valley; Rise time; Peak duration; Intensity rise; attack = intensity rise / rise time; dutycyc = rise time / peak duration; Nattack = intensity rise / peak duration]
Figure 5-6. Acoustic measurements of attack and dutycyc from the f0 contour.
[Figure: x-axis: Window Number; y-axis: No. of Syllables; annotations: srtrend = slope of regression line; Window size = 500 ms; Shift size = 250 ms]
Figure 5-7. Acoustic measurements of srtrend from the f0 contour.

[Figure: x-axis: Frequency (Hz); y-axis: Intensity (dB); annotation: m_LTAS = slope of linear regression through peaks from 50 Hz to 5 kHz]
Figure 5-8. Acoustic measurements of m_LTAS from the f0 contour.
[Figure: x-axis: No. of Dimensions; plotted series: R-square, Stress]
Figure 5-9. R-squared and stress measures as a function of the number of dimensions included in the MDS solution for 11 emotions.

Figure 5-10. Eleven emotions in a 2D stimulus space according to the perceptual MDS model.
[Figure: scatterplot; x-axis: Dimension 1; y-axis: Dimension 2; points labeled by emotion (AG, AO, AX, BO, CI, CU, CE, EM, EX, HA, SA), with a suffix of 1 or 2 on the predicted values]
Figure 5-11. Standardized predicted acoustic values for Speaker 1 (open circles and numbered 1) and Speaker 2 (open squares and numbered 2) and perceived MDS values (stars) for the training set according to the Overall training model. Colors represent the four emotion categories (Category 1 or happy in blue, Category 2 or content-confident in green, Category 3 or angry in red, and Category 4 or sad in black).
[Figure: two scatterplots, panels A and B; x-axis: Dimension 1; y-axis: Dimension 2; points labeled by emotion (AG, AO, AX, BO, CI, CU, CE, EM, EX, HA, SA), with a suffix of 1 or 2 on the predicted values]
Figure 5-12. Standardized predicted and perceived values according to individual speaker models. Predicted acoustic values for Speaker 1 (open circles and numbered 1) and Speaker 2 (open squares and numbered 2) and perceived MDS values (stars) for the training set according to the A) Speaker 1 and B) Speaker 2 training models (Category 1 or happy in blue, Category 2 or content-confident in green, Category 3 or angry in red, and Category 4 or sad in black).
[Figure: two scatterplots, panels A and B; x-axis: Dimension 1; y-axis: Dimension 2; points labeled by emotion (AG, AO, AX, BO, CI, CU, CE, EM, EX, HA, SA), with a suffix of 1 or 2 on the predicted values]
Figure 5-13. Standardized predicted and perceived values according to the Overall test1 model. Predicted acoustic values for A) Speaker 1 (open circles) and B) Speaker 2 (open squares) and perceived MDS values (stars) for the test1 set. Numbers represent the sentence number and colors represent the four emotion categories (Category 1 or happy in blue, Category 2 or content-confident in green, Category 3 or angry in red, and Category 4 or sad in black).
[Figure: two scatterplots, panels A and B; x-axis: Dimension 1; y-axis: Dimension 2; points labeled by emotion (AG, AO, AX, BO, CI, CU, CE, EM, EX, HA, SA), with a suffix of 1 or 2 on the predicted values]
Figure 5-14. Standardized predicted values according to the test1 set model and perceived values according to the Overall training set model. Predicted acoustic values for A) Speaker 1 (open circles) and B) Speaker 2 (open squares) for the Overall test1 set and perceived MDS values (stars) for the Overall training set model. Numbers represent the sentence number (Category 1 or happy in blue, Category 2 or content-confident in green, Category 3 or angry in red, and Category 4 or sad in black).
[Figure: three scatterplots, panels A to C; x-axis: Dimension 1; y-axes: A) Alpha ratio (unstressed vowel), B) Speaking rate, C) Normalized pitch minimum; points labeled by emotion and speaker number; legend: Speaker 1, Speaker 2]
Figure 5-15. Standardized acoustic values as a function of the perceived D1 values based on the Overall training set model. Scatterplots for A) alpha ratio, B) speaking rate, and C) normalized pitch minimum. Markers represent speakers and lines represent linear regressions (Speaker 1: red circles, solid line, Speaker 2: black squares, dashed line).
[Figure: two scatterplots, panels A and B; x-axis: Dimension 2; y-axes: A) Normalized attack time of intensity contour, B) Normalized pitch minimum by speaking rate; points labeled by emotion and speaker number; legend: Speaker 1, Speaker 2]
Figure 5-16. Standardized acoustic values as a function of the perceived Dimension 2 values based on the Overall training set model. Scatterplots for A) normalized attack time of intensity contour and B) normalized pitch minimum by speaking rate. Markers represent speakers and lines represent linear regressions (Speaker 1: red circles, solid line, Speaker 2: black squares, dashed line).
CHAPTER 6
GENERAL CONCLUSIONS

Summary of Findings

Research within the past few decades has seen an increase in the number of experiments to identify the acoustic cues that correspond to the dimensions of emotionality. However, reports of the cues that describe these dimensions, particularly the valence dimension, have been inconsistent. An understanding of the acoustic basis of these dimensions could potentially be used to develop a model of emotion magnitude in SS. This information is important for a number of clinical applications, including the rehabilitation of patients who have trouble expressing their emotions in SS. A better understanding of how acoustic changes in speech convey emotions can give clinicians better assessment tools, provide a theoretical basis for the selection of appropriate therapy goals, and support the development of appropriate outcome measures. Hence, the purpose of this research was to understand how emotions are perceived from the suprasegmental information in speech and to develop a preliminary acoustic model based on the dimensions approach.

The experiments presented here differed from previous research in that 1) the model was developed based on the emotions that can be perceived in SS and 2) the model consisted of acoustic measurements that did not require normalization to each speaker's neutral emotion. Furthermore, these experiments used nonsense speech in order to isolate the role of suprasegmental and paralinguistic information from segmental and semantic factors. While both of these types of information play a role in cueing emotions, the initial focus was to understand how suprasegmental factors influence the perception of emotions. The decision was made to study suprasegmental information first because this is often the primary target of treatment in conditions such as aprosodia. Nevertheless, future experiments will also evaluate how these effects are influenced by the semantic content of speech.
The objective of the first four of six experiments was to identify what emotions listeners can perceive in SS and how listeners perceive them. Since little empirical evidence exists to demonstrate which emotions can be directly expressed and perceived through speech, the first two experiments were performed to identify the emotions that can be perceived in SS. This group of emotions was assumed to be a subset of all emotion categories. Typically, researchers have assumed that the complete set of emotion categories is represented by the set of basic emotions. However, the number of emotions considered basic and the degree of similarity among the basic emotions are not clear. Hence, the first experiment provided an alternative to using a set of basic emotions for study in SS. This data reduction experiment was necessary to determine a small set of terms that represent relatively distinct emotions, i.e., to minimize the likelihood of selecting multiple terms that convey the same (or similar) emotions. In this experiment, listeners were given 70 emotion terms to categorize based on similarity. A hierarchical clustering scheme was used to determine the similarity between emotion terms based on the number of times they were grouped together. This resulted in the selection of a unique set of 19 emotion categories for further investigation in speech: funny, content, love, respectful, happy, exhausted, confident, confused, bored, suspicious, embarrassed, sad, surprised, interested, annoyed, angry, anxious, jealous, and lonely.

Once an initial set of emotion terms was chosen, the next experiment was performed to empirically determine whether these emotion categories could be correctly discriminated in speech when only suprasegmental cues were available to listeners. The stimuli consisted of two nonsense sentences expressed by two actors in each of the 19 emotion categories.
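As a rough illustration of the data reduction step described above, the sketch below turns sorting data (which terms each listener placed in the same group) into co-occurrence counts and then merges terms agglomeratively. The listener groupings, the six-term vocabulary, the function names, and the use of average linkage are all assumptions made for illustration; the linkage rule actually used is not stated in this summary.

```python
from itertools import combinations


def cooccurrence(sorts):
    """Count how many listeners placed each pair of terms in the same group."""
    counts = {}
    for groups in sorts:                      # one listener's sort = a list of groups
        for group in groups:
            for a, b in combinations(sorted(group), 2):
                counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def cluster(terms, sorts, n_clusters):
    """Agglomerative clustering with average linkage (an assumed linkage rule)."""
    counts = cooccurrence(sorts)
    n_listeners = len(sorts)

    def dist(a, b):
        # Terms grouped together often are close: 1 - proportion of listeners.
        key = (a, b) if a < b else (b, a)
        return 1.0 - counts.get(key, 0) / n_listeners

    clusters = [[t] for t in terms]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            d = sum(dist(a, b) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical sorting data from three listeners (not taken from the study).
sorts = [
    [["happy", "funny"], ["sad", "lonely"], ["angry", "annoyed"]],
    [["happy", "funny"], ["sad", "lonely", "annoyed"], ["angry"]],
    [["happy"], ["funny"], ["sad", "lonely"], ["angry", "annoyed"]],
]
terms = ["happy", "funny", "sad", "lonely", "angry", "annoyed"]
result = cluster(terms, sorts, 3)
```

Stopping the merging at different numbers of clusters is one way to inspect the hierarchy at several levels of granularity.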
Nonsense sentences were used to preserve the suprasegmental features without including the segmental information. These speech samples served as stimuli within a listening task to examine which
emotion pairs are discriminable in SS. Sixteen listeners heard the emotional sentences in pairs and were asked to judge whether the emotions expressed in both utterances were the same or different. The number of times each pair of emotions was judged as same and different across all listeners was converted to d′, the signal detection theory measure of perceptual distance. Once again, HCS was used to analyze the perceived similarity among the 19 emotion categories, based on the d′ scores. Four main clusters emerged from this process: Category 1 or happy, Category 2 or content-confident, Category 3 or angry, and Category 4 or sad. These unique emotion categories formed the minimal set of basic emotions perceivable in SS.

To develop an acoustic model of emotions in SS based on the dimensional approach, it was necessary to understand the perceptual strategies or dimensions that may underlie listeners' ability to classify nonsense speech into specific emotional categories. In other words, the goal of this experiment was to determine the number of perceptual dimensions that can be used to describe each emotion category and the perceptual properties of these dimensions. A multidimensional scaling (MDS) procedure using the ALSCAL algorithm was used to determine both the number of perceptual dimensions necessary to describe the perceived distances between the 19 emotions and the coordinates of each emotion in the multidimensional space. A three-dimensional solution was found to account for 90% of the variance in perceptual data and was therefore chosen as the perceptual model. These three dimensions corresponded roughly, but not entirely, to activation, valence, and confidence. Since it was not necessary to confine the differences in emotions to these dimension labels, they were referred to as Dimension 1, Dimension 2, and Dimension 3.

An acoustic model could now be developed based on the perceptual model.
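To make the d′ measure used in the discrimination experiment concrete, the following minimal sketch computes d′ in its simplest yes/no form, z(hit rate) minus z(false-alarm rate). Here a hit is taken to be a "different" response to a pair of different emotions and a false alarm a "different" response to a pair of identical emotions. This form, the clipping constant, the function name, and the counts are assumptions made for illustration; same-different designs are often analyzed with other decision models, and the exact computation used in these experiments is not restated in this chapter.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections, clip=0.005):
    """d' = z(hit rate) - z(false-alarm rate), with rates kept away from 0 and 1."""
    z = NormalDist().inv_cdf                  # inverse of the standard normal CDF
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    # Clipping (an assumed convention) avoids infinite z-scores at 0% or 100%.
    hit_rate = min(max(hit_rate, clip), 1.0 - clip)
    fa_rate = min(max(fa_rate, clip), 1.0 - clip)
    return z(hit_rate) - z(fa_rate)


# Hypothetical response counts for one emotion pair pooled over 16 listeners.
example = d_prime(hits=14, misses=2, false_alarms=3, correct_rejections=13)

# Perfect performance reaches the ceiling imposed by the clipping constant.
ceiling = d_prime(hits=16, misses=0, false_alarms=0, correct_rejections=16)
```

With a clipping value of 0.005 the ceiling is about 5.15; without clipping, perfect or zero rates would give infinite values.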
However, this model would assume that all listeners perceive emotions similarly. To validate this assumption, it
was necessary to first confirm that the importance attributed to each dimension does not differ across individual listeners or between the sexes. Hence, the purpose of the fourth experiment was to examine whether any individual or gender differences exist. Individual similarity matrices of perceptual distances were calculated for each listener and then submitted to an MDS analysis using the individual differences scaling (INDSCAL) algorithm. It was observed from the listener weight space that males and females did not differ in the importance given to each dimension. A further analysis of the listener weight space using histograms revealed unimodal distributions of weights for each pair of dimensions, thereby validating the use of a single model of emotion perception as determined in Experiment 3.

The final two of six experiments were aimed at developing and testing an acoustic model of emotions in SS based on the perceptual model. The acoustic model was developed in Experiment 5 using a relatively small but unique set of acoustic features. These features could be measured from each speech sample directly, without the use of a speaker's neutral-emotion (or no-emotion) sample. Stepwise regressions were used to determine the acoustic features that best described each emotion dimension. This model was based on a 2D perceptual model formed from 11 emotions, since the remaining emotions were not easily perceived in SS. The performance of the model was evaluated by classifying the samples as one of the four emotion categories that were easily perceived in SS. The k-means classification algorithm was used. The classified samples included the set used in model development (the training set) and a set of novel sentences from the speakers used in model development (the test1 set). Results showed that classification accuracy of all samples of both stimulus sets was better than listener identification accuracy.
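One simple way to picture the classification step is as a nearest-centroid rule in the two-dimensional space: each sample is assigned to the emotion category whose center lies closest to the sample's predicted (Dimension 1, Dimension 2) coordinates. The sketch below only illustrates that idea. The centroid coordinates, the sample coordinates, and the function name are hypothetical, and the code is not the k-means procedure as implemented in Experiment 5.

```python
import math


def nearest_centroid(point, centroids):
    """Return the label of the centroid closest (Euclidean distance) to a 2D point."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


# Hypothetical category centers in the (Dimension 1, Dimension 2) space.
centroids = {
    "happy": (1.2, 0.4),
    "content-confident": (0.3, 0.6),
    "angry": (0.5, -1.8),
    "sad": (-1.4, 0.5),
}

# Hypothetical predicted coordinates for three samples, with the intended category.
samples = [
    ((1.0, 0.2), "happy"),
    ((-1.1, 0.9), "sad"),
    ((0.6, -0.3), "angry"),   # a weakly expressed sample lying far from its category center
]

predicted = [nearest_centroid(xy, centroids) for xy, _ in samples]
accuracy = sum(p == intended for p, (_, intended) in zip(predicted, samples)) / len(samples)
```

In this toy case the third sample is assigned to the wrong category, which is the kind of error that percent-correct and d′ scores summarize.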
In addition, accuracy varied by speaker, but not by sentence. Therefore, it was concluded that the acoustic model based on all of the training set
samples (the Overall training set model) adequately represented novel samples by the speakers used in model development.

To further test the ability of the acoustic model to generalize to novel sentences and speakers, an additional experiment was performed. In the sixth and final experiment, 10 novel speakers expressed three nonsense sentences (two were previously used) in 11 emotional contexts. These samples were acoustically represented using the Overall training set model and classified using the k-means and k-nearest neighbor (kNN) algorithms. Classification accuracy was slightly higher for the k-means algorithm than for the kNN (k = 1) algorithm and much higher than for the kNN (k = 3) algorithm. However, classification accuracy equaled listener accuracy for only Category 3 (angry). While classification accuracy for Category 4 (sad) was high, it was not equivalent to listener accuracy. Classification of the remaining two categories was poor. As seen with the training and test1 set results, speaker differences were present. The model performed best at classifying samples from Speaker 4, the most effective speaker according to listeners. In addition, the model was worse at representing samples from the novel sentence. However, speakers were less effective at expressing this sentence emotionally than the remaining two sentences. Taken together, these results suggest that the acoustic model could represent the emotions of effective speakers, particularly for Categories 3 and 4. However, the model was not as accurate in representing the emotional samples of speakers who were moderately effective.

In summary, a number of conclusions may be drawn from the experiments conducted within this dissertation. First, a minimum of four emotion categories can be perceived in speech when listeners are given only the suprasegmental cues.
These four categories roughly corresponded to happy (Category 1), content-confident (Category 2), angry (Category 3), and sad (Category 4). The results of this study also identified the hierarchical structure of
emotions according to their perceived similarity (shown in Figure 2-3). It is possible to refer to this dendrogram or the table of the number of clusters formed at a series of d′ values (Table 2-4) to investigate whether more than four emotions are perceivable from SS.

These four emotion categories were perceivable in the speech samples of individuals both with and without acting training. While this study did not determine the ecological validity of using the speech of actors or even the speech of non-actors obtained in a laboratory setting, it was possible to compare the accuracy of listener performance for these two sample sets. In other words, this study did not investigate whether the speech of actors was a realistic representation of everyday speech, nor did it examine the extent to which the samples from non-actors corresponded to everyday speech. However, a comparison of the perceptual identification results for these two sample sets showed that listeners were able to identify the speech of actors with better accuracy than that of most non-actors. Therefore, it is likely that effective speakers, whether actors or non-actors, express emotions similarly.

Next, a multidimensional scaling analysis of listener discrimination judgments of the emotions showed that three or fewer dimensions are needed to represent the emotions perceivable in SS. In the present study, three dimensions were suggested to represent the differences in 19 emotions even though a 2D solution may have been sufficient. Thereafter, the optimal MDS solution was computed for a reduced set of 11 emotions, since many of the 19 emotions were not easily perceived in SS. A 2D model was clearly optimal. In this model, the first dimension separated the happy and sad clusters, particularly anxious from embarrassed, and the second dimension separated angry from the remaining emotions, particularly sad. The benefit of adding a third dimension to the model is not clear.
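The question of how many dimensions to retain is usually judged from fit measures such as stress and R-squared, computed for solutions of increasing dimensionality. The sketch below shows one hedged way to compute two such measures for a given configuration: Kruskal's stress formula 1 and the squared correlation between the perceptual distances and the distances in the model space. The coordinates and distances are hypothetical, no monotone transformation of the dissimilarities is applied, and the fit criterion that ALSCAL itself optimizes is defined on squared distances, so this is an illustration rather than a reproduction of the analysis.

```python
import math
from itertools import combinations


def fit_measures(coords, dissim):
    """Return (stress-1, R^2) comparing model distances with observed dissimilarities."""
    labels = sorted(coords)
    pairs = list(combinations(labels, 2))
    d_model = [math.dist(coords[a], coords[b]) for a, b in pairs]
    d_obs = [dissim[(a, b)] for a, b in pairs]

    # Kruskal's stress formula 1, using the raw dissimilarities as disparities.
    stress = math.sqrt(
        sum((m - o) ** 2 for m, o in zip(d_model, d_obs))
        / sum(m ** 2 for m in d_model)
    )

    # Squared Pearson correlation between observed and model distances.
    n = len(pairs)
    mean_m = sum(d_model) / n
    mean_o = sum(d_obs) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(d_model, d_obs))
    var_m = sum((m - mean_m) ** 2 for m in d_model)
    var_o = sum((o - mean_o) ** 2 for o in d_obs)
    r_squared = cov ** 2 / (var_m * var_o)
    return stress, r_squared


# Hypothetical 2D coordinates and perceptual distances for three emotions.
coords = {"angry": (0.0, 0.0), "happy": (3.0, 0.0), "sad": (0.0, 4.0)}
dissim = {("angry", "happy"): 3.0, ("angry", "sad"): 4.0, ("happy", "sad"): 5.0}
stress, r_squared = fit_measures(coords, dissim)
```

In this toy case the configuration reproduces the distances exactly, so stress is zero and R-squared is one. Comparing such values across one-, two-, and three-dimensional solutions is what a plot of fit against the number of dimensions summarizes.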
Additional multidimensional scaling analysis using the INDSCAL model suggested that individual differences in the listener strategies for discriminating 19 emotions in speech were small and normally distributed. Accordingly, no gender differences in the listener strategies were seen. However, these results did not rule out the possibility of gender differences in the perception of certain emotion categories from SS, as suggested by the significant differences in mean d′ scores between male and female listeners.

Finally, stepwise regressions were used to develop an acoustic model based on the speech of actors, using acoustic measures that were not based on individual speakers' neutral expressions. In this model, D1 corresponded to a combination of the parameters speaking rate (srate), normalized pitch minimum (pnorMIN), and alpha ratio of the unstressed vowel (aratio2), whereas D2 included the normalized attack time of the intensity contour (normattack) and the normalized minimum pitch, normalized by speaking rate (normpnorMIN). This acoustic model based on the acted speech samples (the Overall training model) was able to represent the samples used in model development and a novel sample from each of the training set speakers. Furthermore, the model was able to generalize to the speech of non-actors who were effective speakers according to listeners, particularly for Categories 3 and 4. The high accuracy for Category 3 (anger) suggests that the two cues, normalized attack time and normalized pitch minimum (relative to speaking rate), were a very good representation of Dimension 2 (similar to valence). This finding is important for dimensional models, considering that the acoustic cues to the valence dimension have been elusive.

Limitations and Future Directions

A number of limitations may have affected the accuracy of results. First, a small number of speakers were used in model development.
Speakers were shown to be variable in their effectiveness, and the effectiveness of the model is only as good as the effectiveness of the speakers. The high perceptual discrimination d′ scores of the two speakers used in model development suggested that they were both very effective. Nevertheless, a model formed from a small number of speakers may not be representative of speakers on average. The use of a larger number of speakers was not possible in the present study because it was necessary to perform a discrimination task of 19 items. For each additional speaker, a minimum of 1,444 stimuli (19 × 19 stimuli × 2 for forwards and backwards presentation × 2 repetitions × number of speakers) would need to be tested. For example, to test 10 speakers (14,440 stimuli), this would result in approximately 40 hours of test time. Still, further model development may incorporate additional speakers with fewer than 19 emotions.

Similarly, results showed that listener accuracy for the samples from 8 of 10 speakers without acting training was not as good as the accuracy for acted speech samples. In order to truly test model performance, a large number of effective speakers is needed. Future work should use alternative methods for obtaining expressions from individuals without acting training. Since the long-term goal of this work is to extend the model to predict the emotions of different populations (e.g., patients with aprosodia and individuals with autism), the development of techniques for obtaining realistic expressions from these populations is of critical importance.

Next, hand measurement and normalization by speaker resulted in a model that was neither completely automatic nor speaker-independent. Algorithms to quantify voice quality are typically performed on vowels. While it is possible to isolate voiced segments from unvoiced segments of speech (e.g., based on the amount of low-frequency energy in the signal, as used in the present study), voiced sounds include both vowels and consonants. Presently, automatic segmentation of vowels from running speech still results in some degree of error.
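The idea of separating voiced from unvoiced segments by the amount of low-frequency energy can be sketched as follows. This section does not describe the detector that was actually used, so everything specific here is an assumption made for illustration: the one-pole low-pass filter, the 500 Hz cutoff, the 0.5 decision threshold, the 20 ms frame, and the function names.

```python
import math


def low_frequency_ratio(frame, sample_rate, cutoff_hz=500.0):
    """Fraction of a frame's energy that survives a one-pole low-pass filter."""
    # One-pole low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    total_energy = sum(x * x for x in frame)
    if total_energy == 0.0:
        return 0.0  # a silent frame carries no low-frequency energy
    y = 0.0
    low_energy = 0.0
    for x in frame:
        y += alpha * (x - y)
        low_energy += y * y
    return low_energy / total_energy


def is_voiced(frame, sample_rate, ratio_threshold=0.5):
    """Label a frame voiced when most of its energy lies at low frequencies."""
    return low_frequency_ratio(frame, sample_rate) >= ratio_threshold


# Synthetic 20 ms test frames (hypothetical stand-ins for speech).
SAMPLE_RATE = 16000
FRAME_LEN = 320
low_tone = [math.sin(2 * math.pi * 150 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]
high_tone = [math.sin(2 * math.pi * 4000 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]

print(is_voiced(low_tone, SAMPLE_RATE), is_voiced(high_tone, SAMPLE_RATE))
```

A detector of this kind flags any frame dominated by low-frequency energy, so voiced consonants pass as readily as vowels. That is the limitation noted above: voicing detection alone does not isolate vowels.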
To avoid these segmentation errors in tests of model performance, the vowel-to-consonant ratio and the two measures of spectral slope were computed on one stressed and one unstressed vowel that were segmented manually offline. In addition, the pause proportions were computed by hand, since it was difficult to distinguish true silences from the silences preceding word-initial stop consonants. To truly automate this model, it is necessary to develop an algorithm that automatically isolates stressed and unstressed vowels from a speech sample and distinguishes between true silences and the silent portions of consonants.

Finally, the model was still speaker-dependent because it was necessary to normalize the samples from each speaker by converting them to z-scores. This normalization was required because the regression equations were formed from the MDS coordinates of each emotion (and the acoustic parameters), and these values were arbitrary. As a result, the acoustic representations of the samples in the multidimensional space were not absolute locations. Future work will investigate methods to avoid the limitation imposed by the need for such normalization in order to form a speaker-independent model of emotions in speech.
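The per-speaker z-score normalization described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name, the speaker labels, and the speaking-rate values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def zscore_by_speaker(samples):
    """Convert (speaker_id, value) pairs to z-scores computed within each speaker."""
    by_speaker = {}
    for speaker, value in samples:
        by_speaker.setdefault(speaker, []).append(value)

    # Mean and population standard deviation for each speaker separately.
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    z_scores = []
    for speaker, value in samples:
        mu, sigma = stats[speaker]
        # A speaker with no variation gets z = 0 to avoid division by zero.
        z_scores.append(0.0 if sigma == 0 else (value - mu) / sigma)
    return z_scores


# Hypothetical speaking rates (syllables per second) for two speakers.
samples = [
    ("speaker_A", 3.0), ("speaker_A", 4.0), ("speaker_A", 5.0),
    ("speaker_B", 5.0), ("speaker_B", 6.0), ("speaker_B", 7.0),
]
z = zscore_by_speaker(samples)
print([round(v, 3) for v in z])
```

The two hypothetical speakers differ in overall rate, yet their z-scores are identical because each value is expressed relative to that speaker's own mean and spread. This is why the resulting locations in the multidimensional space are relative rather than absolute, and why samples from a new speaker cannot be placed without first estimating that speaker's statistics.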
244 LIST OF REFERENCES Abelson, R.P., Sermat V. ( 1962 ). Multidimensional scaling of facial expressions, Journal of Experimental Psychology 63 546-554. Allen, J. G., Haccoun, D. M. ( 1976 ). Sex differences in emotionality: A multidimensional approach, Human Relations 29 711-722. Alter, K., Rank, E., Kotz, S.A., Pfeifer, E., Besson, M., Friederici, A.D., Matiasek, J. ( 1999 ). On the relations of semantic and acoustic properties of emotions, in Proc. of the XIVth International Congress of Phonetic Sciences, San Francisco, pp. 2121-2124. Alvarado, N. ( 1998 ). A reconsideration of the structure of the emotion lexicon, Motivation and Emotion 22 (4), 329-344. Arabie, P., Carroll, J. D., DeSarbo, W. S. ( 1987 ). Three-way scaling an d clustering, in Sage University Paper Series on Quantitative Applications in the Social Sciences 65 (Sage, Newbury Park, CA). Arai, T. ( 2006 ). Cue parsing between nasality and br eathiness in speech perception, Acoust. Sci. & Tech. 27 (5), 298-301. Arnold, M.B. ( 1960 ). Emotion and Personality: Vol 1 (Columbia University Press, New York). Auberg, V., Audibert, N., Rilliard, A. ( 2004 ). "Acoustic morphology of expressive speech: what about contours?" in Proc. of the Speech Prosody, Nara, Japan, March 23-26, pp. 201-204. Averill, J. R. ( 1975 ). A semantic atlas of emotional c oncepts, JSAS Catalogue of Selected Documents in Psychology 5 330, (Ms. No. 421). Bachorowski J-A. ( 1999 ). Vocal expression and perception of emotion, Curr. Dir, Psychol. Sci. 8 53. Bachorowski, J.-A., Braaten, E. B. ( 1994 ). Emotional intensity: Me asurement and theoretical implications, Personality and Individual Differences 17 (2), 191. Bachorowski, J.A., Owren, M. ( 1995 ). Vocal expression of emotion: acoustic properties of speech are associated with emotional inte nsity and context, Psychological Science 6 (4), 219224. Baird, J.C., Noma, E. ( 1978 ). Fundamentals of Scaling and Psychophysics (John Wiley & Sons, Inc., New York). Balswick, J., Avertt, C. 
( 1977 ). Differences in expressiveness: Gender, interpersonal orientation, and perceived pare ntal expressiveness as cont ributing factors, Journal of Marriage and Family 39 121-127.
245 Banse, R., Scherer, K.R. ( 1996 ). Acoustic profiles in vocal emotion expression, Journal of Personality and Social Psychology 70 (3), 614-636. Banziger, T., Scherer, K.R. ( 2005 ). The role of intonation in emotional expressions, Speech Communication 46 252-267. Banziger, T., Scherer, K.R. ( 2003 ). A study of perceived vocal f eatures in emotional speech, in Proc. of the VOQUAL'03, 169-172. Batliner, A., Fischer, K., Huber, R., Spilker, J., Nth, E. ( 2000 ). Desperately seeking emotions: actors, wizards, and human beings, in Proc of the ISCA-Workshop on Speech and Emotion, Newcastle, Northern Irela nd, UK, September 5-7, 195-200. Batliner, A., Fischer, K., Huber, R., Spilker, J., Noth, E. ( 2003 ). How to find trouble in communication, Speech Commun. 40 117-143. Baumgartner, T., Esslen, M., Jncke, L. ( 2006 ). From emotion perception to emotion experience: Emotions evoked by pictures a nd classical music, Int. J. Psychophysiol. 60 (1), 34-43. Biele C., Grabowska, A. ( 2006 ). Sex differences in perception of emotion intensity in dynamic and static facial expres sions, Exp Brain Res. 171 1. Bigand, E., Vieillard, S., Madurell, F ., Marozeau, J., and Dacquet, A. ( 2005 ). Multidimensional scaling of emotional responses to music: The effect of musical expertise and of the duration of the excerpts, Cognition & Emotion 19 1113-1139. Block, J. ( 1957 ). Studies in the phenomenology of emo tions, Journal of Abnormal and Social Psychology 54 358-363. Bonebright, T. L., Thompson, J. L., Leger, D.W. ( 1996 ). Gender stereotype s in the expression and perception of vocal affect, Se x Roles: A Journal of Research 34 (5/6), 429-445. Bradley, M. ( 1994 ). Emotional memory: A dimensional anal ysis, in an Goozen, S. H. M., van de Poll, N. E., and Sergeant, J. A., (Eds.), Emotions: Essays on Emotion Theory (Lawrence Erlbaum, Hillsdale, NJ), pp. 97. Bradley, M. M., Lang, P. J. ( 1994 ). 
Measuring emotion: The se lf-assessment manikin and the semantic differential, Journal of Behavi or Therapy and Experimental Psychiatry 25 (1), 4959. Buck, R. ( 1999 ). The biological affects: A typology, Psychol. Rev. 106 (2), 301-336. Burkhardt, F., Paeschke, A., Rolfes M., Sendlmeier, W., Weiss, B. ( 2005 ). A database of German emotional speech, in Proc. of the of INTERSPEECH, Lisbon, Portugal, pp. 1517-1520.
246 Bush, Lynn, E. ( 1973 ). Individual differences MDS of adj ectives denoting feelings, Journal of Personality and Social Psychology 25 50-57. Cacioppo, J. T., Klein, D. J., Berntson, G. C., Hatficld, E. ( 1993 ). The psychophysiology of emotion, in M. Lewis & J. M. Haviland (Eds.), Handbook of Emotions (Guilford Press, New York), pp. 119-142. Cahn, J. ( 1990 ). The generation of affect in synthesized speech, Journal of the American Voice I/O Society 8 1-19. Camacho, A., ( 2007 ). SWIPE: A sawtooth waveform insp ired pitch estimator for speech and music, Doctoral dissertation, Univ ersity of Florida, Florida. Carroll, J.D., Chang, J.J. ( 1970 ). Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckar t-Young decomposition, Psychometrika 35 (3), 283319. Chu, W.C. ( 2004 ). Speech coding algorithms: foundati on and evolution of standardized coders, (John Wiley & S ons, New Jersey), pp. 34. Church, A. T., Katigbak, M. S., Reyes, J. A. S., Jensen, S. M. ( 1998 ). Language and organisation of Filipino emotion concepts: Comparing emotion concepts and dimensions across cultures, Cognition & Emotion 12(1) 63. Cowie, R., Cornelius, R. ( 2003 ). Describing the emotional states that are expressed in speech, Speech Communication 40 5-32. Cowie, R., Douglas-Cowie, E., Appolloni, B., Taylor, J., Romano, A., Fellenz, W. ( 1999 ). What a neural net needs to know about emotion wo rds, in Proc. of the Circuits, Systems, Communications and Computers, At hens, Greece, July 4-8, pp. 5311-5316. Cowie, R.; Douglas-Cowie, E.; Savvidou, S.; McMahon, E.; Sawey, M.; Schrder, M. ( 2000 ). FEELTRACE: An instrument for recording pe rceived emotion in real time, in ISCA Workshop on Speech and Emotion, Belfast. Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; G. Taylor, J. ( 2001 ). Emotion recognition in human-computer interaction, IEEE Signal Processing Magazine 18 32-80. Davis, MH. ( 1980 ). 
A multidimensional approach to indivi dual differences in empathy, JSAS Catalog of Selected Documents in Psychology 10 (4), 85. Dallaert, F., Polzin, T., Waibel, A. ( 1996 ). Recognizing emotion in speech, in Proc. of the International Conference on Spoken Language Processing, Philadelphia, PA, pp. 1970-1973. Davitz, Joel R. ( 1969 ). The Language of Emotion, (Academic Press, New York).
247 Davitz, Joel R. ( 1964 ). The Communication of Emotional Meaning (McGraw-Hill, New York), pp. 101. De Jong, N.H., Wempe, T. ( 2009 ). Praat script to detect syllabl e nuclei and measure speech rate automatically, Behavior Research Methods 41 (2), 385-390. Doherty, E. T., Shipp, T. ( 1988 ). Tape recorder effects on jitter and shimmer extraction, Journal of Speech and Hearing Research 31 485-490. Doherty, R.W., Orimoto, L., Singelis, T.M., Hatfield, E., Hebb, J. ( 1995 ). Emotional contagion: gender and occupational differe nces, Psychol. Women Q. 19 355 371. Douglas-Cowie, E., Campbell N., Cowie R., Roach P. ( 2003 ). Emotional speech: towards a new generation of databases, Speech Commun. 40 33-60. Douglas-Cowie, E., Cowie, R., Schrder, M. ( 2000 ). A new emotion database: Considerations, sources and scope, ISCA Workshop on Speech & Emotion, Northern Ireland, pp. 39-44. Duffy, J. R. ( 1995 ). Motor Speech Disorders: Substrates, Differential Diagnosis, and Management (St. Louis, Mosby). Ekman, P. ( 1992 ). Are there basic emoti ons, Psychological Review 99 (3), 550-553. Ekman, P. ( 1999 ). Basic emotions, in Handbook of Cognition Emotion in T. Dalgleish and M. Power (Eds.), (John Wiley & Sons Ltd., New York). Ekman, P. ( 1972 ). Universals and cultural di fferences in facial expressi ons of emotions, in J. K. Cole (Ed.), Nebraska Symposium on Motiv ation (Vol. 19), Lincoln, (University of Nebraska Press). Ekman, P., Friesen, W. V., Ellsworth, P. ( 1972 ). Emotion in the Human Face: Guidelines for Research and an Integration of Findings (Pergamon, New York). Ekman, P., Heider, K. G. ( 1988 ). The universality of a contem pt expression: A replication, Motiv. Emotion 12 303-308. Ekman, P., Levenson, R. W., & Friesen. W. V. ( 1983 ). Autonomic nervous system activity distinguishes between emotions, Science 221 1208-1210. Ekman, P, Oster, H. ( 1979 ). Facial expressions of emo tion, Annual Reviews in Psychology 30 (1), 527-34. Epstein, S. ( 1984 ). 
Controversial issues in emotion theory, in P. Shaver. (Ed.) Review of Personality and Social Psychology: Vol. 5 (Sage, Beverly Hills, CA), pp. 64-88. Fakotakis, N. ( 2004 ). Corpus design, recording and phone tic analysis of Greek emotional database, LREC, Lisbon, Portugal.
248 Fehr, B., Russell, J. A. ( 1984 ). Concept of emotion viewed from a prototype perspective, J. Exp. Psychol. Gen. 113 464-486. Feldman Barrett., Niendenthal, P.M. ( 2004 ). Valence focus and the perc eption of facial affect, Emotion 4 266-274. Fillenbaum, S., & Rapaport, A. ( 1971 ). Structures in the Subjective Lexicon (Academic Press, New York). Fonagy, I. ( 1978 ). A new method of investigating th e perception of prosodic features, Language and Speech 21 34-49. Fontaine, J., Scherer, K., Roesch, E., Ellsworth, P. ( 2007 ). The world of emotions is not twodimensional, Psychological Science 18 (12), 1050-1057. Forsell, M., Elenius, K., and Laukka, P. ( 2007 ). Acoustic correlates of frustration in spontaneous speech, Speech, Music and Hearin g. Quarterly Progress and Status Report 50 37-40. Fridlund, A., Ekman, P., Oster, H. ( 1987 ). Facial expressions of em otion, in A. Siegman & S. Feldstein (Eds.), Nonverbal Behavior and Communication (Erlbaum, Hillsdale, NJ), pp. 143224. Frijda, N. H. ( 1986 ). The Emotions (Cambridge University Press, Cambridge, England). Fujita, B. N., Harper, R. G., & Wiens, A. N. ( 1980 ). Encoding-decoding of nonverbal emotional messages: Sex differences in spontaneous and enacted expressions, Journal of Nonverbal Behavior 4 (3), 131-145. Gobl, C.; N Chasaide, A. ( 2003 ). The role of voice quality in communicating emotion, mood and attitude, Speech Communication 40 189-212. Gobl, C.; Bennett, E.; N Chasaide, A. ( 2002 ). Expressive synthesi s: how crucial is voice quality, in Proc. of the IEEE Workshop on Sp eech Synthesis, Santa Monica, California, paper 52 1-4. Gordon, M., and Ladefoged, P. ( 2001 ). Phonation types: a crosslinguistic overview, Journal of Phonetics 29 383-406. Grandjean, D., Sander, D., Pourtois, G., Schwartz S., Seghier, M.L., Scherer, K.R., Vuilleumier, P. ( 2005 ). The voices of wrath: brain responses to angry prosody in meaningless speech, Nat, Neurosci. 8 145-146. Greasley, P., Sherrard, C., Waterman, M. 
( 2000 ). Emotion in language and speech: Methodological issues in natura listic approaches, Lang. Speech 43(4), 355. Green, DM, Swets, J.A. ( 1966 ). Signal Detection Theory and Psychophysics (Wiley & Sons, Inc., New York).
249 Green, R. S. and Cliff, N. ( 1975 ). Multidimensional comparisons of structures of vocally and facially expressed emotion, Perception & Psychophysics 17 (5), 429. Greenberg, Y, Shibuya, N, Tsuzaki, M, Kato, H, Sagisaka, Y. ( 2007 ). Analysis on paralinguistic prosody control in perceptual impression space using multiple dimensional scaling, Speech Communication, doi:10.1016/j.specom. 2007.10.006. Grimshaw, G. M. ( 1998 ). Integration and inte rference in the cerebral hemispheres: Relations with hemispheric specialization, Brain Cognition 36 108. Gross, J. J., John, O. P. ( 1995 ). Facets of emotional expressiv ity: Three self-repo rt factors and their correlates, Personalit y and Individual Differences 19 555-568. Gross, J. J., O. P. John, Richards. ( 2000 ). The dissociation of emotion expression from emotion experience: A personality perspe ctive, Pers Soc Psychol Bull. 26 (6), 712-726. Hacker, C., Batliner, A., Noth, E. ( 2006 ). Are you looking at me, are you talking with me multimodal classification of the focus of attentio n, in Proc. of the Te xt, Speech and Dialogue (TSD 2006), LNAI 4188, Springer, Berlin, Heidelberg, pp. 581-588. Halberstadt, A. G., Hayes, C. W., Pike, K. M. ( 1988 ). Gender, and gender differences in smiling and communication consistency, Sex Roles 19 589-603. Hall, J. A. ( 1978 ). Gender effects in decoding nonverb al cues, Psychological Bulletin 85 845 857. Hamilton, M. ( 1960 ). A rating scale for depression, J Neurol Neurosurg Psychiatry 23 (1), 5662. Hammarberg, B, Fritzell, B, Gauf fin, J, Sundberg, J, Wedin, L. ( 1980 ). Perceptual and acoustic correlates of abnormal voice qualit ies, Acta Oto-Laryngologica 90 441. Hammerschmidt, K., Jrgens, U. ( 2007 ). Acoustical correlates of affective prosody, J. Voice 21 (5), 531-540. Hanson HM. ( 1997 ). Glottal characteristics of female speakers: acoustic correlates, J Acoust SocAm. 101 (1), 466. Hatfield, C., Rapson. ( 1993 ). 
Emotional contagion, Current Directions in Psychological Science 2 96-99. Havlena, W. J., Holbrook, M. B. ( 1986 ). The varieties of consum ption experience: Comparing two typologies of emotion in consumer beha vior, The Journal of Consumer Research 13 (3), 394-404. Heman-Ackah, Y. D., Heuer, R. J., Michael, D. D., Ostrowski, R., Horman, M., Baroody, M. M., Hillenbrand, J., and Sataloff, R. T. ( 2003 ). Cepstral peak prominence: a more reliable measure of dysphonia, The Annals of Otology, Rhinology, and Laryngology 112 324.
250 Hillenbrand, J., Houde, R. A. ( 1996 ). Acoustic correlates of breathy vocal quality: Dysphonic voices and continuous speech, Journal of Speech and Hearing Research 39 311-321. Hogan, R. ( 1969 ). Development of an empathy scale, Journal of Consulting and Clinical Psychology 33 307-316. Huttar, G. L. ( 1968 ). Relations between prosodic variab les and emotions in normal American English utterances, Journal of Speech and Hearing Research 11 481. Izard, CE. ( 1992 ). Basic emotions, relations among emo tions, and emotion-cognition relations. Psychological Review 99 (3), 561-565. Izard, C. E. ( 1977 ). Human Emotions (Plenum Press, New York). James, W. ( 1884 ). What is an emotion? Mind 9 (34), 188-205. James, W. ( 1890/2007 ). The Principles of Psychology Vol. 2 (Cosimo, Inc., New York, NY), pp. 449. Jamieson, D.G., Morosan, D.E. ( 1986 ). Training non-native spe ech contrasts in adults: Acquisition of the English / / / / contrast by francophones, Percept. Psychophys. 40 (4), 205-215. Johnson, S. ( 1967 ). Hierarchical clustering schemes, Psychometrika 32 (3), 241-254. Juslin, P. N., & Laukka, P. ( 2001 ). Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal ex pression of emotion, Emotion 1 381-412. Juslin, P.N., Laukka, P. ( 2003 ). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychol. Bull. 129 (5), 770. Laukka, P., Juslin, P. N., & Bresin, R. ( 2005 ). A dimensional approach to vocal expression of emotion, Cognition and Emotion 19 633-653. Kappas A, Hess U, Scherer KR. ( 1991 ). Voice and emotion, in R.S. Feldman, & B. Rim (Eds.), Fundamentals of Nonverbal Behavior (Cambridge University Press, Cambridge and New York), pp. 200-238. Karlsson, I., Banziger, T., Dankovicova, J., John stone, T., Lindberg, J., Melin, H., Nolan, F., Scherer, K.R. ( 1998 ). Speaker verification with elic ited speaking styles in the VeriVox project, in Proc. 
of the Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C), Avignon, France, pp. 207-210. Keltner, D. and A. M. Kring ( 1998 ). Emotion, social function, and psychopathology, Review of General Psychology 2 320-342. Kemper, T. ( 1987 ). How many emotions are there? Wedding the social and the autonomic components, American Journal of Sociology 93 263-289.
251 Kempster, G., Kistler, D., & Hillenbrand, J. ( 1991 ). Multidimensional scaling analysis of dysphonia in two speaker groups, Journa l of Speech and Hearing Research 34 534. Kienast, M., Sendlmeier, W.F. ( 2000 ). Acoustical analysis of spectral and temporal changes in emotional speech, in Proc. of the ISCA-Workshop on Speech and Emotion, Newcastle, Northern Ireland, UK, September 5-7, pp. 92-97. Kirouac, G.,&Dore, F.Y. ( 1985 ). Accuracy of the judgment of f acial expression of emotions as a function of sex and level of educa tion, Journal of Nonverbal Behaviour 9 (1), 3. Klasmeyer, G., Sendlmeier, W.F. ( 1995) Objective voice parameters to characterize the emotional content in speech, in Proc. of the XIIIth Internationa l Congress of Phonetic Sciences, Stockholm Vol. 1 pp. 182. Konijn, E. A. ( 2000 ). Acting Emotions (Amsterdam University Press, Amsterdam). Kring, A. M., Gordon, A. H. ( 1998 ). Sex differences in emotion: expression, experience, and physiology, J Pers Soc Psychol 74 (3), 686-703. Kring, Ann M., Smith, David A., Neale, John M. ( 1994 ). Individual differences in dispositional expressiveness: Development and validation of th e emotional expressivity scale, Journal of Personality and Social Psychology 66 (5), 934-949. LaFrance, M., Hecht, M.A., Paluck, E.L. ( 2003 ). The contingent smile: a meta-analysis of sex differences in smiling, Psychol. Bull. 129 305 334. Lane, RD, Reiman, EM, Bradley, MM, Lang, PJ Ahern, GL, Davidson, RJ, Schwartz, GE. ( 1997 ). Neuroanatomical correlates of pleas ant and unpleasant emotion, Neuropsychologia 35 (l l), 1437-1444. Lang, P. J., Greenwald, M. K., Bradley, M. M., & Hamm, A. O. ( 1993 ). Looking at pictures: Affective, facial, visceral, and be havioral reactions, Psychophysiology 30 261-273. Lange, C.G., James, W. ( 1922 ). The Emotions (Williams & Wilkins Co., New York, NY), pp. 449. Larsen, R. J., Diener, E. ( 1987 ). 
Affect intensity as an individual difference characteristic: A review, Journal of Re search in Personality 21 (1), 1. Larsen, R. J., Diener, E. ( 1992 ). Promises and problems with the circumplex model of emotion, in M. S. Clark (Ed.), Review of Personality and Social Psychology Vol. 13 (Sage, Newbury Park, CA), pp. 25. Laukka, P. ( 2004 ). Vocal expression of emotion: Disc rete-emotions and dimensional accounts, Doctoral dissertation, Uppsala Un iversity, (Acta Universitatis Up saliensis, Uppsala, Sweden), 141 pp. 1-80.
252 Laukka, P., Juslin, P., Bresin. ( 2005 ). A dimensional approach to vocal expression of emotion, Cognition & Emotion 19 (5), 633-653. Laukkanen, A.-M., Vilkman, E., Alku, P., Oksanen, H. ( 1996 ). Physical variation related to stress and emotionally state: a pr eliminary study, Journal of Phonetics 24 313. Laukkanen, A.-M., Vilkman, E., Alku, P., Oksanen, H. ( 1997 ). On the perception of emotions in speech: the role of voice quality, Scandina vian Journal of Logopedics, Phoniatrics and Vocology 22 157. LeDoux, J. ( 1998 ). The Emotional Brain: The Mysterious Underpinnings of Emotional Life (Simon & Schuster, New York). Lee, C. M., Narayanan, S. S., and Pieraccini, R. ( 2002 ). Classifying emotions in humanmachine spoken dialogs, in IEEE Intl Conference on Multimedia and Expo pp. 737740. Leinonen L, Hiltunen T, Linnankoski I, Laakso M-L. ( 1997 ). Expression of emotionalmotivational connotations with a one-wor d utterance, J. Acoust. Soc. Am. 102 1853. Levitt, H. ( 1971 ). Transformed up-down methods in ps ychoacoustics, J. Acoust. Soc. Am. 49 (2), 467-477. Lewis, M., Haviland, J. (Eds.). ( 1993 ). Handbook of Emotions (Guilford Press, New York). Liscombe, J., Venditti, J., Hirschberg, J. ( 2003 ). Classifying subject ratin gs of emotional speech using acoustic features, in EUROSPEECH-2003, 725-728. Loveland, K. A., TunaliKotoski, B., Chen, Y. R., Ortegon, J., Pearson, D. A., Brelsford, K. A., and Gibbs, M. C. ( 1997 ). Emotion recognition in autism: Ve rbal and nonverbal information, Development and psychopathology 9 579-593. Lutz, C. ( 1982 ). The domain of emotion words on Ifaluk, Am. Ethnol. 9 (1), 113-128. Macmillan, N.A., Creelman, C.D. ( 1991 ). Detection Theory: A User's Guide (Cambridge University Press, Cambridge). Mandal, M. K., Palchoudhury, S. ( 1985 ). Perceptual skill in decodi ng facial affect, Perceptual and Motor Skills 60 (1), 96. Markel, N.N., Bein, M.F., Phillis, J.A., ( 1973 ). The relation between words and tone of voice, Lang. 
Speech 16 15-21. Mayo, C, Clark, RAJ, King, S. ( 2005 ). Multidimensional scaling of listener responses to synthetic speech, in INTERSPEECH-2005, 1725-1728. Mehrabian, A. ( 1972 ). Nonverbal Communication (Aldine-Atherton, Chicago). Mehrabian, A., Epstein, N. ( 1972 ). A measure of emotional em pathy, Journal of Personality 40 525-543.
253 Mehrabian, A., Russell, J. A. ( 1974 ). An Approach to Environmental Psychology (Cambridge, Mass. MIT Press). Milenkovic, P. ( 2001 ). TF32, (University of Wisconsin, Madison, WI). Mervis, C. B., & Crisati, M. A. ( 1982 ). Order of acquisition of subordinate-, basic-, and superordinate-level categories, Child Development 53 258-266. Millot J-L, Brand G. ( 2001 ). Effects of pleasant and unpleasant ambient odors on human voice pitch, Neurosci. Lett. 297 61. Montero L.M., Gutierrez-Arriola J ., Palazuelos S., Enrquez E., Aguilera S., Pardo J. M. ( 1998 ). Emotional speech synthesis: From speech database to TTS, in Proc. ICSLP. Moore, C.A., Cohn, J.F., Katz, G.S. ( 1994 ). Quantitative descrip tion and differentiation of fundamental frequency contours, Computer Speech and Language 8 1-20. Mowrer, O.H. ( 1960 ). Learning Theory and Behavior (Wiley, New York). Murray, I.; Arnott, J.L. ( 1993 ). Towards the Simulation of emotion in Synthetic Speech: A review of the Lite rature on Human Vocal Emotion, J. Acoust. Soc. Am. 1097-1108. Murray, I., Arnott, J.L. ( 1995 ). Implementation and testing of a system for producing emotionby-rule in synthetic speech, Speech Commun. 16 (4), 369-390. Myers, P.S. ( 1999 ). Right hemisphere damage: Disorders of Communication and Cognition (Singular Publishing, San Diego, CA). Nandur, V. ( 2003 ). Performance of gale using semantically neutral sentences, Masters Thesis, University of Florida. Notarious, C. I., & Johnson, J. S. ( 1982 ). Emotional expression in husbands and wives, Journal of Marriage and th e Family 483-489. Nowicki, S., Duke, M. ( 1994 ). Individual differences in the nonverbal communicat ion of affect: The diagnostic analysis of nonverbal accuracy scale, Journal of Nonverbal Behavior 18 (1), 9. Oatley, K., Johnson-Laird, P. N. ( 1987 ). Towards a cognitive theory of emotion, Cognition and Emotion 1 29-50. Ohala, J.J. ( 1983 ). Cross-language use of pitch: An ethological view, Phonetica 40 1-18. Ortony, A., Turner, J.T. 
( 1990 ). What's basic about basic emotions? Psychol. Rev. 97 (3), 315331. Osgood, C. E. ( 1969 ). On the whys and wherefores of E, P, and A, Journal of Personality and Social Psychology 12 194-199.
254 Osgood, C. E., Suci, G. J., Tannenbaum, P. H. ( 1957 ). The Measurement of Meaning (University of Illinois Press, Urbana, USA). Paeschke, A. ( 2004 ). Global trend of fundamental freque ncy in emotional speech, in Proc. of the Speech Prosody, 671-674. Paeschke A, Kienast, M., Sendlmeier, W. F. ( 1999) F0-contours in emotional speech, in Proc. of the ICPhS 14 :2, 929-932. Paeschke A, Sendlmeier W. F. ( 2000 ). Prosodic characteris tics of emotional speech: measurements of fundamental freque ncy movements, in Proc. of the SpeechEmotion-2000 75. Pakosz, M. ( 1983 ). Attitudinal judgments in intonation: some evidence for a theory, J. Psycholinguist Res. 12 311. Panksepp, J. A. ( 1982 ). Toward a general psychobiological theory of emotions, Behav. Brain Sci. 5 407-467. Panksepp, J. A. ( 1992 ). Critical role for affective neuros cience in resolving what is basic about basic emotions, Psychol. Rev. 99 (3), 554-560. Park, C. H., & Sim, K. B. ( 2003 ). Emotion recognition and acous tic analysis from speech signal, in Proc. of the In ternational Joint C onference on Neural Networks, pp. 25942598. Paulmann, S., Kotz, S.A. ( 2008 ). An ERP investigation on the temporal dynamics of emotional prosody and emotional semantics in pseudoand lexical-sentence c ontext, Brain Lang. 105 (1), 59-69. Pereira, C. ( 2000 ). Dimensions of emotional meaning in speech, Tutorial and Research Workshop on Speech and Emotion, in Proc. of the Speech-Emotion-2000, 25-28. Pereira, C. ( 1998 ). Some acoustic characteristics of emoti on, in Proc. of the Fifth International Conference on Spoken Language Pr ocessing (ICSLP-1998), paper 68. Petrushin, V. A. ( 1999 ). Emotion in speech: Recognition an d application to call centers, in Proc. of the Conference on Artificial Neural Networks in Engineering. St. Louis, MO, November, pp. 7-10. Pittam J, Gallois C, Callan V. ( 1990 ). The long-term spectrum and perceived emotion, Speech Communication 9 (3), 177. Plutchik, R. ( 1980 ). 
A general psychoevolutionary theory of emotion, in R. Plutchik & H. Kellerman Emotion: Theory, Research, and Experience Theories of Emotion, Vol. 1 (Academic Press, New York), pp. 3-31.
Plutchik, R. (1993). Emotions and their vicissitudes: Emotions and psychopathology, in M. Lewis & J. M. Haviland (Eds.), Handbook of Emotions (Guilford Press, New York), pp. 53-66.
Rafaeli, E., Rogers, G. M., Revelle, W. (2007). Affective synchrony: Individual differences in mixed emotions, Pers. Soc. Psychol. Bull. 33 (7), 915-932.
Ramamohan, S., Dandapat, S. (2006). Sinusoidal model-based analysis and classification of stressed speech, IEEE Trans. on Audio, Speech, and Language Processing 14 (3), 737-746.
Reardon, R., Amatea, E. (1973). The meaning of vocal emotional expressions: Sex differences for listeners and speakers, International Journal of Social Psychiatry 19 (3-4), 214-219.
Reed, C., Buder, E. H., Kent, R. D. (1992). Speech analysis systems: An evaluation, Journal of Speech and Hearing Research 35, 314-332.
Restrepo, A., Chacon, L. (1994). A smoothing property of the median filter, IEEE Transactions on Signal Processing 42, 1553-1555.
Riggio, R. E., Friedman, H. S. (1986). Impression formation: The role of expressive behavior, Journal of Personality and Social Psychology 50, 421-427.
Rilliard, A., Aubergé, V., Audibert, N. (2004). Evaluating an authentic audio-visual expressive speech corpus, in Proc. of the LREC, Lisbon, Portugal, pp. 175-178.
Roach, P. (2000). Techniques for the phonetic description of emotional speech, in Proc. of the ISCA Workshop on Speech and Emotion (pp. 53), Northern Ireland.
Rosch, E. (1973). On the internal structure of perceptual and semantic categories, in T. E. Moore (Ed.), Cognitive Development and the Acquisition of Language (Academic Press, New York), pp. 111-144.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories, Cognitive Psychology 8, 382-439.
Roseman, I. J., Spindel, M. S., and Jose, P. E. (1990). Appraisals of emotion-eliciting events: Testing a theory of discrete emotions, Journal of Personality and Social Psychology 59, 899-915.
Rosenbek, J. C., Crucian, G. P., Leon, S. A., Hieber, B., Rodriguez, A. D., Holiway, B., Ketterson, T. U., Ciampitti, M. Z., Heilman, K. M., Gonzalez-Rothi, L. J. (2004). Novel treatments for expressive aprosodia: A phase I investigation of cognitive linguistic and imitative interventions, J. Int. Neuropsych. Soc. 10 (5), 786.
Rosenthal, R., Hall, J. A., DiMatteo, M. R., Rogers, P. L., & Archer, D. (1979). Sensitivity to Nonverbal Communication: The PONS Test (The Johns Hopkins University Press, Baltimore).
Ross, F. D. (1981). The aprosodias: functional-anatomical organization of the affective components of language in the right hemisphere, Archives of Neurology 38, 561-569.
Russell, J. A., Feldman Barrett, L. (1999). Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant, J. Personal. Soc. Psychol. 76, 805.
Russell, J. A. (1997). Reading emotion from and into faces: Resurrecting a dimensional-contextual perspective, in J. A. Russell & J. M. Fernandez-Dols (Eds.), The Psychology of Facial Expression (Cambridge University Press, New York), pp. 295-320.
Russell, J. A., & Mehrabian, A. (1977). Evidence for a three-factor theory of emotions, Journal of Research in Personality 11, 273-294.
Russell, J. A., Weiss, A., Mendelsohn, G. A. (1989). Affect grid: A single-item scale of pleasure and arousal, Journal of Personality and Social Psychology 57 (3), 493-502.
Rymarczyk, K., Grabowska, A. (2007). Sex differences in brain control of prosody, Neuropsychologia 45 (5), 921-930.
Scherer, K. R. (1999). Appraisal theory, in T. Dalgleish & M. J. Power (Eds.), Handbook of Cognition and Emotion (John Wiley & Sons, England; New York), pp. 637-663.
Scherer, K. R. (1984). Emotion as a multicomponent process: A model and some cross-cultural data, in P. Shaver (Ed.), Review of Personality and Social Psychology Vol. 5 (Sage, Beverly Hills, CA), pp. 64-88.
Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research, Psychol. Bull. 99 (2), 143-165.
Scherer, K. R. (1989). Vocal correlates of emotion, in Wagner, H., Manstead, A. (Eds.), Handbook of Psychophysiology: Emotion and Social Behavior (Wiley, London), pp. 165.
Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms, Speech Communication 40, 227-256.
Scherer, K. R. (2004). Which emotions can be induced by music? What are the underlying mechanisms? And how can we measure them? Journal of New Music Research 33 (3), 239-251.
Schirmer, A., Kotz, S. (2003). ERP evidence for a sex-specific Stroop effect in emotional speech, J. Cognitive Neurosci. 15 (8), 1135-1148.
Schirmer, A., Kotz, S. A., Friederici, A. D. (2002). Sex differentiates the role of emotional prosody during word processing, Cognitive Brain Research 14 (2), 228-233.
Schirmer, A., Kotz, S. A., & Friederici, A. D. (2005). On the role of attention for the processing of emotions in speech: Sex differences revisited, Cognitive Brain Research 24, 442-452.
Schirmer, A., Zysset, S., Kotz, S. A., von Cramon, D. Y. (2004). Gender differences in the activation of inferior frontal cortex during emotional speech perception, NeuroImage 21, 1114-1123.
Schlosberg, H. (1941). A scale for the judgement of facial expressions, Journal of Experimental Psychology 29, 497.
Schlosberg, H. (1954). Three dimensions of emotion, Psychological Review 61 (2), 81-88.
Schröder, M. (2003). Experimental study of affect bursts, Speech Communication 40 (1-2), 99-116.
Schröder, M., Cowie, R., Douglas-Cowie, E., Westerdijk, M., Gielen, S. (2001). Acoustic correlates of emotion dimensions in view of speech synthesis, in Proc. of the Eurospeech 2001, Aalborg, 1, pp. 87.
Schwartz, G. E., Brown, S. L., & Ahern, G. L. (1980). Facial muscle patterning and subjective experience during affective imagery, Psychophysiology 17, 75-82.
Shaver, P., Schwartz, J., Kirson, D., O'Connor, C. (1987). Emotion knowledge: Further exploration of a prototype approach, J. Pers. Soc. Psychol. 52 (6), 1061-1086.
Shriberg, L. D., Kent, R. D. (1995). Clinical Phonetics, 2nd Ed. (Allyn & Bacon, Massachusetts), pp. 98.
Shrivastav, R., Sapienza, C., Nandur, V. (2005). Application of psychometric theory to the measurement of voice quality using rating scales, J. Speech Lang. Hear. Res. 48, 323.
Skinner, E. R. (1935). A calibrated recording and analysis of the pitch, force and quality of vocal tones expressing happiness and sadness, Speech Monographs 2 (1), 81-137.
Smith, C. A., Ellsworth, P. C. (1985). Patterns of cognitive appraisal in emotion, J. Pers. Soc. Psychol. 48, 813-838.
Sobin, C., Alpert, M. (1999). Emotion in speech: the acoustic attributes of fear, anger, sadness, and joy, J. Psycholinguist. Res. 28 (4), 347-65.
Stein, N. L., Oatley, K. (1992). Basic emotions: theory and measurement, Cognition and Emotion 6 (3-4), 161-168.
Stringer, A. Y. (1996). Treatment of motor aprosodia with pitch biofeedback and expression modeling, Brain Inj. 10 (8), 583-590.
Strongman, K. T. (1996). The Psychology of Emotion: Theories of Emotion in Perspective (Fourth Ed.) (John Wiley and Sons, Chichester).
Tato, R., Santos, R., Kompe, R., Pardo, J. M. (2002). Emotional space improves emotion recognition, in Proc. of the ICSLP-2002, 2029-2032.
Toivanen, J., Väyrynen, E., Seppänen, T. (2005). Gender differences in the ability to discriminate emotional content from speech, in Proc. of the FONETIK, Dept. of Linguistics, Göteborg University, 119-122.
Toivanen, J., Waaramaa, T., Alku, P., Laukkanen, A. M., Seppänen, T., Väyrynen, E., Airas, M. (2006). Emotions in [a]: a perceptual and acoustic study, Logoped. Phoniatr. Vocol. 31, 43-48.
Tolkmitt, F. J., Scherer, K. R. (1986). Effects of experimentally induced stress on vocal parameters, J. Exp. Psychol.: Hum. Percept. Perform. 12, 302.
Tomkins, S. S. (1984). Affect theory, in Approaches to Emotion, edited by K. S. Scherer & P. Ekman (Erlbaum, Hillsdale, NJ), pp. 163-195.
Trouvain, J., and Barry, W. J. (2000). The prosody of excitement in horse race commentaries, in SpeechEmotion-2000 (Northern Ireland), pp. 86-91.
Turner, T. J., Ortony, A. (1992). Basic emotions: Can conflicting criteria converge, Psychological Review 99 (3), 566-571.
Tusing, K. J., Dillard, J. P. (2000). The sounds of dominance. Vocal precursors of perceived dominance during interpersonal influence, Human Communication Research 26, 148-171.
Uldall, E. (1960). Attitudinal meanings conveyed by intonation contours, Language and Speech 3, 223.
Van Bezooijen, R. (1984). Characteristics and Recognizability of Vocal Expressions of Emotion (Foris Pubns, Dordrecht, The Netherlands).
Vogt, T., Andre, E. (2005). Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition, in Proc. of the IEEE International Conference on Multimedia and Expo (ICME), The Netherlands, pp. 474-477.
Wallbott, H. G. (1988). Big girls don't frown, big boys don't cry: Gender differences of professional actors in communicating emotion via facial expression, Journal of Nonverbal Behavior 12, 98-106.
Watson, D., Tellegen, A. (1985). Toward a consensual structure of mood, Psychological Bulletin 98, 219.
Webb, T. E., VanDevere, C. A. (1985). Sex differences in the expression of depression: A developmental interaction effect, Sex Roles 12, 91-95.
Weller, S., & Romney, A. K. (1988). Systematic Data Collection (Sage, Beverly Hills, CA).
Wierzbicka, A. (1992). Defining emotion concepts, Cognitive Science 16 (4), 539-581.
Williams, C. E., Stevens, K. N. (1969). On determining the emotional state of pilots during flight: An exploratory study, Aerospace Med. 40, 1369-1372.
Williams, C. E., Stevens, K. N. (1972). Emotions and speech: some acoustical correlates, J. Acoust. Soc. Am. 52, 1238.
Yik, M. S. M., Russell, J. A., Feldman-Barrett, L. (1999). Interpretation of faces: a cross-cultural study of a prediction from Fridlund's theory, Cogn. Emot. 13, 93.
Yildirim, S., Murtaza, B., Lee, C. M., Kazemzadeh, A., Busso, C., Deng, Z., Lee, S., Narayanan, S. (2004). An acoustic study of emotions expressed in speech, in Proc. of the ICSLP, Jeju Island, Korea, October 4-8, pp. 2193-2196.
Young, F. W., Takane, Y., Lewyckyj, R. (1980). ALSCAL: A multidimensional scaling package with several individual differences options, The American Statistician 34 (2), 117-118.
Zuckerman, M., Lipets, M. S., Hall, J. A., & Rosenthal, R. (1975). Encoding and decoding nonverbal cues of emotion, Journal of Personality and Social Psychology 32, 1068-1076.
BIOGRAPHICAL SKETCH
Sona Patel was born in 1982 in Champaign, Illinois. The elder of two children, she grew up in Homer, IL, graduating from Heritage High School as valedictorian in 2000. She earned her B.S. in Electrical Engineering from Boston University in 2004. Upon graduation, Sona began pursuing her graduate studies at the University of Florida in August 2004. She completed her Master's in Communication Sciences and Disorders in the fall of 2008. Upon completion of her doctoral studies in the Communication Sciences and Disorders Department in 2009, she will begin a post-doctoral research position with Dr. Klaus Scherer at the Center for Affective Studies in Geneva, Switzerland.