The Impact of Signal Temporal Structure on Auditory Sensitivity and Its Application to Audio Dynamic Range Control


Material Information

Title:
The Impact of Signal Temporal Structure on Auditory Sensitivity and Its Application to Audio Dynamic Range Control
Physical Description:
1 online resource (100 p.)
Language:
english
Creator:
Yang, Qing
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Harris, John G
Committee Members:
Wu, Dapeng
Holmes, Alice E
Shrivastav, Rahul

Subjects

Subjects / Keywords:
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre:
Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
The purpose of this dissertation is to investigate the impact of signal temporal structure on human auditory sensitivity and its application to audio dynamic range control. This dissertation is organized as follows. First, the relation between the signal temporal structure and human auditory sensitivity is systematically studied with subjective listening tests. The subjective results show that the human auditory system is more sensitive to transient signals than to steady signals given the same energy. Inspired by the impact of signal temporal structure on auditory sensitivity, a higher-order spectro-temporal integration model is developed to better predict the audibility thresholds of non-stationary sounds. This higher-order integration model is shown to outperform the existing energy-based and loudness-based audibility prediction models on our experimental data. This model can be extended to provide improved standards for the determination of hearing impairment. We propose the use of a dynamic range control (DRC) algorithm for hearing protection. To further improve the conventional DRC algorithm, the level estimation in the traditional DRC framework is extended from second order to higher order. The objective evaluation results show that higher-order DRC algorithms perform best for moderate-size analysis windows.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Qing Yang.
Thesis:
Thesis (Ph.D.)--University of Florida, 2011.
Local:
Adviser: Harris, John G.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2013-08-31

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2011
System ID:
UFE0042877:00001




Full Text

PAGE 1

THE IMPACT OF SIGNAL TEMPORAL STRUCTURE ON AUDITORY SENSITIVITY AND ITS APPLICATION TO AUDIO DYNAMIC RANGE CONTROL

By

QING YANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

PAGE 2

2011 Qing Yang

PAGE 3

To my dear parents, Mr. Shuntian Yang and Ms. Suizhen Fu

PAGE 4

ACKNOWLEDGMENTS

I would like to express my greatest gratitude to my advisor, Dr. John G. Harris, for his inspiration, guidance, and support throughout my entire Ph.D. studies. I would like to extend my thanks to my supervisory committee members, Dr. Alice Holmes, Dr. Dapeng Wu, and Dr. Rahul Shrivastav, for their insightful comments and suggestions. I greatly appreciate the help of all the volunteers for their participation in the subjective listening tests. I am also very thankful to all the students and staff in the Computational NeuroEngineering Laboratory (CNEL) who have helped, encouraged, inspired, and supported me in my research and life over the past five years. Last but not least, I would like to give my most sincere thanks to my parents, Mr. Shuntian Yang and Ms. Suizhen Fu, and my husband, ChunMing Tang, for their endless love and support.

PAGE 5

TABLE OF CONTENTS

                                                                          page

ACKNOWLEDGMENTS .................................................................. 4
LIST OF TABLES ................................................................... 7
LIST OF FIGURES .................................................................. 8
LIST OF ABBREVIATIONS ........................................................... 10
ABSTRACT ........................................................................ 11

CHAPTER

1 INTRODUCTION ................................................................. 13

2 TEMPORAL AND SPECTRAL INTEGRATION FOR AUDITION .............................. 17
  2.1 Temporal Integration ..................................................... 17
    2.1.1 Masking Experiments .................................................. 17
    2.1.2 Classical Theory: Energy Integrator .................................. 19
    2.1.3 Detectability Index (d') ............................................. 20
    2.1.4 Recent Model: Multiple Looks ......................................... 22
    2.1.5 Simulation Results and Discussions ................................... 25
  2.2 Spectral Integration ..................................................... 26
    2.2.1 Critical Band Theory ................................................. 26
    2.2.2 Experiments on Complex Signals ....................................... 28
    2.2.3 Multi-Band Energy Detector ........................................... 29
    2.2.4 Independent Threshold Model .......................................... 31
  2.3 Non-Stationary Sounds .................................................... 32
    2.3.1 Loudness ............................................................. 33
    2.3.2 Audibility Prediction Based on Time-Varying Partial Loudness ........ 34
  2.4 Summary .................................................................. 38

3 SUBJECTIVE EXPERIMENTS ON NON-STATIONARY SOUNDS ............................. 40
  3.1 Signal Synthesis ......................................................... 40
    3.1.1 PDF (Probability Density Function) Based Phase Manipulation ......... 41
    3.1.2 Key-Fowle-Haggarty Phase ............................................. 42
  3.2 Subjective Listening Experiments ......................................... 45
    3.2.1 Test Signal Analysis ................................................. 45
    3.2.2 Experimental Paradigm ................................................ 46
    3.2.3 Discussion ........................................................... 46
  3.3 Summary .................................................................. 50

PAGE 6

4 A HIGHER-ORDER SPECTRO-TEMPORAL INTEGRATION MODEL FOR AUDIBILITY
  PREDICTION .................................................................. 51
  4.1 Basic Model .............................................................. 51
    4.1.1 Model Description .................................................... 51
      4.1.1.1 Front-end auditory filter bank ................................... 52
      4.1.1.2 Non-linear processing ............................................ 53
      4.1.1.3 Low-pass filtering ............................................... 53
      4.1.1.4 Decision making .................................................. 54
    4.1.2 Model Evaluation and Discussion ...................................... 55
  4.2 Model Extension I ........................................................ 57
    4.2.1 Extended Model with Spectral Integration Based on Probability
          Summation ............................................................ 58
    4.2.2 Model Evaluation and Discussion ...................................... 59
  4.3 Model Extension II ....................................................... 61
    4.3.1 Analogy between Audibility Prediction and Hearing Loss Prediction ... 61
      4.3.1.1 Experiments on stationary sounds ................................. 62
      4.3.1.2 Inadequacy of existing safety standards for non-stationary
              sounds ........................................................... 64
      4.3.1.3 Hearing risk associated with signals of different time
              structures ....................................................... 66
    4.3.2 Extended Model Suggested for Hearing Loss Prediction ................. 68
  4.4 Summary .................................................................. 70

5 HIGHER-ORDER LEVEL ESTIMATION FOR AUDIO DYNAMIC RANGE CONTROL ............... 71
  5.1 Dynamic Range Control for Hearing Protection ............................. 71
  5.2 High-Order Dynamic Range Control ......................................... 73
    5.2.1 Model Description .................................................... 73
    5.2.2 Evaluation and Results ............................................... 77
  5.3 Summary .................................................................. 87

6 CONCLUSION .................................................................. 89

APPENDIX: DERIVATION FOR KEY-FOWLE-HAGGARTY PHASE SOLUTION FOR WAVEFORM
MANIPULATION .................................................................. 91

LIST OF REFERENCES ............................................................ 96

BIOGRAPHICAL SKETCH .......................................................... 100

PAGE 7

LIST OF TABLES

Table                                                                     page

3-1  Average subjective thresholds for single-/three-band signals normalized
     by the threshold of a 1 kHz tone ......................................... 49
4-1  Prediction errors for three-band audio signals in decibels ............... 56
4-2  Average prediction error in decibels (dB) for different beta parameters .. 61
5-1  Psychoacoustic measures used in the PEAQ basic version and the
     corresponding index used in Table 5-2 .................................... 83
5-2  Normalized psychoacoustic measures for a given dynamic range change of
     1 dB with different window sizes ......................................... 84

PAGE 8

LIST OF FIGURES

Figure                                                                    page

2-1  Just audible levels of 1 kHz tone bursts as a function of tone duration .. 18
2-2  Block diagram of the energy integrator ................................... 19
2-3  Illustration of the detectability index d' ............................... 21
2-4  Weighting function for the multiple-look model with a time constant of
     325 ms ................................................................... 24
2-5  Predicted signal thresholds from three temporal integration models ....... 25
2-6  Block diagram of Moore and Glasberg's audibility prediction model ........ 35
2-7  Multi-resolution spectrum in Moore and Glasberg's loudness model ......... 36
2-8  Illustration of different segmentation schemes ........................... 37
2-9  Illustration of averaging loudness across windows ........................ 37
3-1  Block diagram of the PDF-based phase algorithm ........................... 41
3-2  Block diagram of the KFH phase manipulation system ....................... 43
3-3  Synthetic signals from the KFH phase algorithm ........................... 44
3-4  Power spectral densities (PSD) of single-band and three-band signals ..... 47
3-5  Time-domain plots of single-/three-band signals .......................... 48
3-6  Bar plot of subjective thresholds ........................................ 49
4-1  Block diagram of the higher-order spectro-temporal integration model ..... 52
4-2  Average prediction error for three-band signals in decibels .............. 56
4-3  Block diagram of the extended higher-order integration model for
     audibility prediction .................................................... 60
4-4  Block diagram of the higher-order integration model for predicting
     hearing loss ............................................................. 69
4-5  Block diagram of the higher-order integration model for predicting
     hearing risk ............................................................. 69
5-1  Block diagram of the dynamic range control system ........................ 73
5-2  Illustration of the input-output mapping function for the DRC algorithm .. 75

PAGE 9

5-3  Block diagram of ITU-R BS.1387-1 perceptual evaluation of audio quality .. 78
5-4  Operational curves of average dynamic range reduction and objective
     audio quality ............................................................ 80
5-5  Operational curves with the best compromise of dynamic range reduction
     and audio quality ........................................................ 87

PAGE 10

LIST OF ABBREVIATIONS

2AFC       Two-alternative forced choice experiment
ATS        Asymptotic threshold shift
DRC        Dynamic range control
ERB        Equivalent rectangular bandwidth
KFH phase  Key-Fowle-Haggarty phase
ODG        Objective difference grade, output from the PEAQ algorithm
PDF        Probability density function
PEAQ       Perceptual evaluation of audio quality
PMP        Personal media player
PSD        Power spectral density
PTS        Permanent threshold shift
RMS        Root-mean-squared value
SNR        Signal-to-noise ratio
SPL        Sound pressure level
TK point   Threshold kneepoint
TTS        Temporary threshold shift
TWA        Time-weighted average

PAGE 11

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

THE IMPACT OF SIGNAL TEMPORAL STRUCTURE ON AUDITORY SENSITIVITY AND ITS APPLICATION TO AUDIO DYNAMIC RANGE CONTROL

By

Qing Yang

August 2011

Chair: John G. Harris
Major: Electrical and Computer Engineering

The purpose of this dissertation is to investigate the impact of signal temporal structure on human auditory sensitivity and its application to audio dynamic range control. This dissertation is organized as follows. First, the relation between the signal temporal structure and human auditory sensitivity is systematically studied with subjective listening tests. The subjective results show that the human auditory system is more sensitive to transient signals than to steady signals given the same energy. Inspired by the impact of signal temporal structure on auditory sensitivity, a higher-order spectro-temporal integration model is developed to better predict the audibility thresholds of non-stationary sounds. This higher-order integration model is shown to outperform the existing energy-based and loudness-based audibility prediction models on our experimental data. This model can be extended to provide improved standards for the determination of hearing impairment. We propose the use of a dynamic range control (DRC) algorithm for hearing protection. To further improve the conventional DRC algorithm, the level estimation in the traditional DRC framework is extended from second order to higher order. The

PAGE 12

objective evaluation results show that higher-order DRC algorithms perform best for moderate-size analysis windows.

PAGE 13

CHAPTER 1
INTRODUCTION

The purpose of this study is to investigate how the temporal structure of sound impacts auditory sensitivity, and its application to audio dynamic range control. Auditory sensitivity has been studied intensively in the area of psychoacoustics, but most of these studies have focused on human auditory sensitivity to stationary sounds, such as pure tones and octave-band noise. The energy detector is the dominant approach to modeling auditory sensitivity to these stationary sounds. However, the sounds we deal with in our daily life are mostly non-stationary, e.g., speech and audio signals. One of the most important differences between stationary and non-stationary sounds is that the temporal statistics, including signal mean and variance, vary over time for non-stationary sounds. Given the temporal differences between stationary and non-stationary sounds, the impact of the temporal structure on human auditory sensitivity to non-stationary sounds is a particular research interest in this study.

To find out how the signal temporal structure affects human auditory sensitivity to non-stationary sounds, independent of spectral differences, a family of signals with identical power spectra but very different temporal structures is synthesized using phase manipulation. The auditory sensitivity to these synthetic signals is measured based on the audibility thresholds, the lowest signal-to-noise ratios at which the human subjects can just detect the sounds in a given noise environment. The subjective responses in this experiment show that the audibility thresholds are lower for transient signals than for steady signals given the same power spectra, which indicates that the human auditory system is more sensitive to transient signals than to steady signals given the same energy (power spectra). The results from this subjective experiment cannot be

PAGE 14

explained with the conventional energy-based model for predicting the auditory sensitivity to stationary sounds.

The new finding in our subjective auditory sensitivity experiment leads to an improved model for predicting the audibility thresholds of non-stationary sounds. Predicting the audibility of audio signals is a crucial task in many acoustical applications. Engineers working on noise monitoring and control would be interested in predicting whether the aircraft noise in an airport or the rock music from an outdoor stadium would bother the residents in the neighborhood. Designers of cell phone ringtones or audio alarms would like to know which audio signals are more audible and can therefore better alert people. Audibility is also a big concern in the design of hearing aids. Hearing aid fitting is based on an audiogram tested with stationary pure tones, typically with a range of frequencies from 0.5 to 8 kHz. Of course, what hearing-impaired people are really interested in is being able to hear non-stationary speech and music signals using their hearing aids. The mismatch between the audiogram test signals and the real-world signals causes significant difficulty in the hearing aid fitting process. In another related area, audibility prediction has further implications for hearing loss prediction.

In order to better predict the audibility thresholds of non-stationary sounds, there are two challenges that need to be addressed: auditory temporal integration and spectral integration. First of all, since the signal temporal structure plays an important role in determining the human auditory sensitivity to non-stationary sounds, the existing energy-based temporal integration models developed from stationary sounds are not adequate for real-world sounds. A computational model that takes the temporal structure into consideration is required for the audibility prediction of non-stationary

PAGE 15

sounds. In addition, conventional models are designed for narrowband signals such as pure tones and octave-band noise, and are therefore only concerned with the spectral integration within a single auditory frequency channel. A reasonable spectral integration scheme for wideband signals is also required to handle the spectral information from multiple frequency bands in determining the audibility of real-world sounds. Given these two considerations, a higher-order spectro-temporal integration model is proposed to predict the audibility of non-stationary sounds, and it is shown that this higher-order integration model outperforms all the conventional models based on our experimental data.

With the enhanced understanding of the auditory sensitivity to non-stationary sounds, an application of the auditory sensitivity model is exemplified in an audio dynamic range controller using higher-order level estimation. Dynamic range control (DRC) has been widely used in the design of hearing aids, radio and TV broadcasting, teleconferencing, and other acoustical applications. As modern personal media players (PMPs), with mass storage capacities, long battery life, and high output levels, become more and more popular, music-induced hearing loss is becoming more of a social and clinical problem. Listeners often set volume levels based on the intelligibility or detectability of the softest sounds in the audio signals. For audio signals with wide dynamic range, at a given volume level, when the softest sounds are adequately audible, the loudest sounds might be overwhelmingly intense. As shown in previous studies, the loudest transient signals likely cause the most damage to the auditory system (Hamernik and Qiu 2001, Strasser et al. 1999). Therefore, a delicate DRC

PAGE 16

algorithm that balances dynamic range reduction and perceptual concerns would be beneficial to protect the hearing of music listeners.

As audio signals are mostly used for entertainment purposes, a primary concern in reducing the dynamic range of audio signals is the consequent quality degradation. A desirable dynamic range controller should provide the optimal balance of both dynamic range reduction and audio quality. In other words, for a given dynamic range reduction, the designed DRC should offer the best audio quality for fidelity requirements; for a given audio quality requirement, the designed DRC should offer the most reduction in signal dynamic range for hearing protection purposes. The best compromise between the dynamic range reduction and the audio quality is realized in our proposed DRC using higher-order level estimation.

The remainder of this dissertation is organized into five chapters. Chapter 2 reviews the existing studies on auditory temporal and spectral integration. Chapter 3 explores how the temporal structure impacts the auditory sensitivity in human subjective experiments. Chapter 4 proposes a higher-order spectro-temporal integration model for audibility prediction and compares it with the existing models using our experimental data. Possible extensions of the proposed higher-order integration model are also summarized. Chapter 5 discusses a practical application of the auditory sensitivity model to audio dynamic range control. Chapter 6 concludes the dissertation.

PAGE 17

CHAPTER 2
TEMPORAL AND SPECTRAL INTEGRATION FOR AUDITION

In this chapter, existing studies of auditory temporal integration and spectral integration will be reviewed, including the most important psychoacoustic experiments, classical theories, and computational models for both stationary and non-stationary sounds. Relevant model simulations and their implications will be discussed as well.

2.1 Temporal Integration

Auditory temporal integration often refers to the ability of the auditory system to integrate information across time to improve the detectability of a signal. In this section, conventional experiments and models that explain the auditory temporal integration process are introduced in detail.

2.1.1 Masking Experiments

In a typical auditory masking experiment, the subjects are asked whether or not they hear the target signal in the stimulus in a simple detection task, or in which stimulus they hear the target signal given multiple choices. The signal levels are then adjusted according to the subjective responses. The lowest signal levels at which the subjects can just detect the target signals in quiet or in noise are defined to be the absolute audibility thresholds or the masked audibility thresholds, respectively.

As Figure 2-1 illustrates, both absolute and masked thresholds depend on the duration of the test sinusoidal tone. The absolute threshold, also known as the threshold in quiet, is shown by the dotted line, and thresholds for tone bursts masked by uniform masking noise of 60 and 40 dB sound pressure level (SPL) are indicated by the solid and dashed lines, respectively. In spite of the different masking noise levels, both absolute and masked thresholds as a function of tone duration are parallel curves in Figure 2-1. The dependence on

PAGE 18

duration shows a constant test tone threshold for durations longer than 200 ms, corresponding to that of long-lasting sounds. For durations shorter than 200 ms, both the absolute and the masked thresholds increase with decreasing duration at a rate of 10 dB per decade. This behavior can be described by assuming that the human auditory system integrates the sound intensity over a period of 200 ms (Fastl and Zwicker 2006).

Figure 2-1. Just audible levels of 1 kHz tone bursts as a function of tone duration in the quiet condition (dotted line) and masked by uniform masking noise of given levels (solid line for noise of 60 dB and dashed line for noise of 40 dB). Adapted from Fastl and Zwicker's book Psychoacoustics, published in 2006.
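The constant-energy account of the data in Figure 2-1 admits a simple rule of thumb: below the integration time, intensity times duration is constant at threshold, so shortening a tone by one decade raises its threshold by 10 dB. A minimal sketch of this rule (the 200 ms integration time follows the text; the function name and interface are illustrative):

```python
import math

def threshold_shift_db(duration_ms, t_int_ms=200.0):
    """Threshold elevation (dB) relative to a long-duration tone,
    assuming perfect energy integration over t_int_ms.
    Constant energy at threshold: I * t = const => 10*log10(t_int/t)."""
    if duration_ms >= t_int_ms:
        return 0.0  # flat region of the curve in Figure 2-1
    return 10.0 * math.log10(t_int_ms / duration_ms)

print(threshold_shift_db(200.0))  # 0.0 dB (at the integration time)
print(threshold_shift_db(20.0))   # 10.0 dB (one decade shorter)
print(threshold_shift_db(2.0))    # 20.0 dB (two decades shorter)
```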

PAGE 19

2.1.2 Classical Theory: Energy Integrator

The simplest explanation of the intensity-duration tradeoff is an integration process. An energy integrator is typically composed of three parts: a critical band filter, a rectifier, and an integrator, as shown in Figure 2-2. The critical band filter ensures only the signal energy from the same critical band is integrated. The rectifier is usually a square-law device that provides a quantity whose average is monotonically related to the signal intensity. The integrator provides the basis for the final decision.

Figure 2-2. Block diagram of the energy integrator: the signal s(t) passes through a critical band filter, a square-law rectifier producing x(t) = s^2(t), and an integrator producing y(t).

Suppose x(t) is the input to the integration process. If we weight the input by a function h(t) and integrate the weighted values, we obtain the output y(t) as the following convolution:

    y(t) = \int_{-\infty}^{t} h(t - \tau) \, x(\tau) \, d\tau    (2-1)

The simplest form for h(t) is the rectangular window,

    h(t) = \begin{cases} 1, & 0 \le t \le T \\ 0, & \text{otherwise} \end{cases}    (2-2)
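A minimal numerical sketch of the integrator in Figure 2-2 (the critical-band filter stage is omitted for brevity, and the sampling rate, window length, and use of the maximum output as the decision statistic are illustrative choices, not the dissertation's implementation):

```python
import numpy as np

def energy_integrator(x, fs, window_s=0.2, window="rect"):
    """Square-law rectifier followed by a sliding integration
    window (Eq. 2-1); returns the peak integrator output."""
    if window == "rect":
        n = int(window_s * fs)
        h = np.ones(n)                        # rectangular window, Eq. 2-2
    else:
        n = int(5 * window_s * fs)            # truncate after 5 time constants
        t = np.arange(n) / fs
        h = np.exp(-t / window_s)             # exponential window, Eq. 2-3
    y = np.convolve(x ** 2, h)[: len(x)] / fs  # discrete version of Eq. 2-1
    return y.max()                             # statistic for the decision stage

fs = 16000
t = np.arange(int(0.5 * fs)) / fs
tone = np.sin(2 * np.pi * 1000 * t)            # 0.5 s, 1 kHz tone
# Mean power 0.5 integrated over a 0.2 s window -> about 0.1
print(energy_integrator(tone, fs))
```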

PAGE 20

This rectangular weighting function simply integrates all of the x(t) occurring within time T. To better model the intensity-duration tradeoff, Munson (1947), Plomp and Bouman (1959), and Zwislocki (1960) introduced an exponential function for h(t) in their seminal papers:

    h(t) = \begin{cases} e^{-t/\tau}, & t \ge 0 \\ 0, & \text{otherwise} \end{cases}    (2-3)

This integrator weights past time by an amount e^{-t/\tau}, so that recent events carry more weight, and events more than 3\tau in the past are greatly attenuated and have little impact on the present output. This choice of h(t) results in a simple first-order lowpass filter (Eddins and Green 1995).

The energy integrator predicts the signal audibility threshold solely from the signal energy. It works well for narrowband stationary signals such as pure tones. However, for non-stationary signals, such as speech and audio, there are strong suggestions that the auditory sensitivity to impulsive signals is in part related to the peak levels (Price and Wansack 1985, Erdreich 1985). As a result, such energy-based audibility prediction underestimates the audibility of many transient or impulsive sounds.

2.1.3 Detectability Index (d')

The field of signal detection theory can be used to quantify the detectability of a signal. The detectability of a signal depends both on the separation and the spread of the noise-alone and signal-plus-noise curves, as shown in Figure 2-3. Detection is made easier either by increasing the separation (stronger signal) or by decreasing the spread

PAGE 21

(less noise). In either case, there is less overlap between the probability-of-occurrence curves.

Figure 2-3. Illustration of the detectability index d': the noise-alone density f(x|n) ~ N(\mu_0, \sigma) and the signal-plus-noise density f(x|s+n) ~ N(\mu_1, \sigma) are separated by \mu_1 - \mu_0.

The most widely used measure for signal detectability is the detectability index d':

    d' = \text{separation} / \text{spread} = (\mu_1 - \mu_0) / \sigma    (2-4)

Assuming we have two hypotheses,

    H_0: x(i) = n(i)             (noise)
    H_1: x(i) = s(i) + n(i)      (signal + noise)    (2-5)

and the conditional density functions for these two hypotheses are Gaussian distributions N(\mu_0, \sigma) and N(\mu_1, \sigma), respectively. Let T(x) = \sum_{i=1}^{N} x^2(i); we have

PAGE 22

    T(x)/\sigma^2 \sim \chi^2_N \text{ under } H_0, \qquad T(x)/\sigma^2 \sim {\chi'}^2_N(\lambda) \text{ under } H_1    (2-6)

where \lambda = \sum_{i=1}^{N} s^2(i)/\sigma^2 = E_s/\sigma^2. Using the properties of the chi-squared distribution, we have

    \mu_0 = N    (2-7)

    \mu_1 = N + \lambda    (2-8)

    \sigma_0^2 = 2N    (2-9)

    \sigma_1^2 = 2N + 4\lambda    (2-10)

Assuming the signal power is very weak and there are enough samples, we have

    \sigma_1^2 = 4\lambda + 2N \approx 2N = \sigma_0^2    (2-11)

    d'^2 = \left( \frac{\mu_1 - \mu_0}{\sigma_0} \right)^2 = \frac{\lambda^2}{2N} = \frac{(E_s/\sigma^2)^2}{2N}    (2-12)

From equation 2-12, it can be seen that d' is actually an estimate of the signal-to-noise ratio. Since d' is monotonically related to the expected percentage correct, it is frequently employed in signal detection (Green and Swets 1974, Kay 1998).

2.1.4 Recent Model: Multiple Looks

To account for temporal integration, both the rectangular integrator and the exponential integrator use a long window or time constant, typically hundreds of milliseconds. On the other hand, models proposed to explain temporal resolution, such as modulation detection, gap detection, and certain temporal aspects of non-simultaneous masking, usually assume much shorter integration windows.
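Before developing the multiple-look idea further, the weak-signal result of Section 2.1.3, d'^2 = \lambda^2 / (2N), can be checked with a quick Monte Carlo simulation of the energy detector. The signal shape, sample count, seed, and number of trials below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, trials = 500, 1.0, 5000
s = 0.1 * np.sin(2 * np.pi * np.arange(N) / 20)    # weak deterministic signal
lam = np.sum(s ** 2) / sigma ** 2                  # lambda = Es / sigma^2

noise = rng.normal(0.0, sigma, (trials, N))
T0 = np.sum(noise ** 2, axis=1)                    # T(x) under H0 (Eq. 2-5)
T1 = np.sum((noise + s) ** 2, axis=1)              # T(x) under H1

d_empirical = (T1.mean() - T0.mean()) / T0.std()   # separation / spread
d_theory = lam / np.sqrt(2 * N)                    # Eq. 2-12

print(d_empirical, d_theory)                       # the two should nearly agree
```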

PAGE 23

To solve this resolution-integration discrepancy, Viemeister and Wakefield proposed a multiple-look model for temporal integration, which assumes that long-term integration does not occur; instead, the listeners can take multiple looks during a long-duration signal and combine the information from all looks optimally to detect a signal (Viemeister and Wakefield 1991).

Assume that the temporal window for each look is a 3-ms rectangular window and that successive windows are contiguous and non-overlapping. Also assume that the window samples are mutually independent and optimally combined. According to Green's multiple observation theory (Green and Swets 1974), the overall detectability for n looks is the square root of the sum of the squares of the individual looks d'_i:

    d'_n = \sqrt{ \sum_{i=1}^{n} (d'_i)^2 }    (2-13)

Furthermore, assume that

    d'_i = k_i I    (2-14)

where I is the signal intensity and k_i is the weight for the i-th look. In order to fit the integration data for pure tones in quiet shown in Figure 2-1,

    I_T = \frac{I_\infty}{1 - e^{-t/T_c}}    (2-15)

where t is the signal duration, I_\infty is the threshold for very long duration signals, and T_c is the time constant used to fit the data, Viemeister and Wakefield derived a formula for the weights

    k_i = \kappa \, e^{-i\Delta/T_c} \left( 1 - e^{-\Delta/T_c} \right)    (2-16)

where \Delta is the duration of the temporal window and \kappa at threshold is a constant given by


\beta = 2\left(\Delta/T_c\right)\left(d'/I_\infty\right)^2    (2-17)

Figure 2-4. Weighting function for the multiple-look model with a time constant of 325 ms.

The weighting functions for the multiple-look model are plotted in Figure 2-4. The solid line represents the weighting function derived from the exponential integrator in Equation 2-15, and the dash-dotted line represents the weighting function derived from the rectangular-window integrator, I_T = C/t for t \le T_c and I_T = C/T_c for t > T_c. From Figure 2-4, we can see that the weighting functions derived from the two integration functions are significantly different, even though the integration functions themselves look similar. Since this weighting function is derived from pure-tone temporal integration data, it might be applicable only to signals with flat envelopes. The weighting function for nonstationary signals, especially transients, could be considerably different. Moreover, the temporal integration rule for nonstationary signals is not yet clear.


Without this prior knowledge, a general weighting function that works for an arbitrary nonstationary signal is hard to derive.

2.1.5 Simulation Results and Discussions

To compare the audibility prediction models discussed in the previous sections, Figure 2-5 illustrates the intensity-duration trade-offs for a 1 kHz pure tone for three integration models: the rectangular-window energy integrator, the exponential-window energy integrator and the multiple-look model using an exponential window (plotted by solid, dashed and dotted lines, respectively). The window size/time constant for both energy integrators is 200 ms. All three models fit the 10 dB/decade intensity-duration trade-off very well. The thresholds of the exponential-window energy integrator and the multiple-look model are around 3 dB different from that of the rectangular-window energy integrator, but they converge at around 1 second.

Figure 2-5. Predicted signal thresholds from three temporal integration models
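For reference, the rectangular- and exponential-window threshold curves of Figure 2-5 can be sketched as follows (the constant C and the 200-ms value are illustrative; the absolute levels depend on the threshold criterion, only the shapes matter here):

```python
import math

TC = 0.2  # 200-ms window size / time constant, as in Figure 2-5
C = 1.0   # illustrative threshold-energy constant

def thresh_rect_db(t):
    """Rectangular-window energy integrator: threshold from constant I*t up to TC."""
    return 10.0 * math.log10(C / min(t, TC))

def thresh_exp_db(t):
    """Exponential-window energy integrator: threshold from I*TC*(1 - exp(-t/TC))."""
    return 10.0 * math.log10(C / (TC * (1.0 - math.exp(-t / TC))))
```

Below T_c the rectangular curve falls exactly 10 dB per decade of duration, the two curves nearly coincide at very short durations, and both flatten to the same asymptote for long durations, as in the figure.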


Since all three models are developed from narrowband stationary signals, the goal of all these models is to predict the 10 dB/decade intensity-duration trade-off for narrowband signals with flat envelopes. Signals such as speech and audio, however, are both nonstationary and wideband. First of all, the 10 dB/decade rule does not necessarily hold for nonstationary signals. Solely integrating the signal power is not sufficient to predict the audibility threshold for nonstationary sounds, especially for music signals, which have a wide dynamic range. What is the appropriate quantity to integrate in the time domain is a key topic that will be explored in the later parts of this dissertation. On the other hand, real-world signals are rarely confined within a single critical band. How to generalize a narrowband temporal integration model to a two-dimensional spectro-temporal surface for wideband signals is another key problem discussed in this dissertation.

2.2 Spectral Integration

Auditory spectral integration often refers to the ability of the auditory system to sum information across a range of frequencies to improve the detectability of a signal in a masking experiment. Two levels of spectral integration are discussed in this section: within a critical band and across critical bands.

2.2.1 Critical Band Theory

Fletcher first measured the threshold for detecting a sinusoidal signal as a function of the bandwidth of a band-pass noise masker. The threshold of the signal increased at first as the noise bandwidth increased. As soon as the noise bandwidth reached the critical bandwidth, the detection threshold started to flatten off (Fletcher 1940). These observations can be explained by the signal processing in the peripheral auditory system. The basilar membrane in the cochlea can be represented as a series


of band-pass filters, called auditory filters. The detection of the signal is thought to be governed by the auditory filter that is centered on the signal. When the power ratio between the tone and the masker at the output of this filter exceeds a certain criterion value, the tone is assumed to be detectable. As long as the noise bandwidth is less than the auditory filter bandwidth, an increase in the noise bandwidth results in more noise passing through the auditory filter, which leads to an increase in the detection threshold. Once the noise bandwidth exceeds the auditory filter bandwidth, the added noise power is rejected by the band-pass auditory filter and thus has no impact on the detection threshold. Fletcher's classical critical band theory suggests that signals in each critical band are processed independently. Signal powers are integrated only when they fall in the same critical band. Later on, Moore and Glasberg proposed the equivalent rectangular bandwidth (ERB) based on their notched-noise masking experiments. The ERB is said to provide a more accurate approximation of the auditory frequency response than the critical bandwidth. The values of the ERB are close to the critical band scale but tend to be smaller at low frequencies than the values of the critical bandwidth (Moore 2003). The empirically fit equation describing the value of the ERB as a function of center frequency f (in kHz) is (Glasberg and Moore 1990)

ERB(f) = 24.7\,(4.37 f + 1)    (2-18)

A formula relating the number of ERBs to frequency f (in kHz) is given as (Glasberg and Moore 1990)

\text{Number of ERBs}(f) = 21.4 \log_{10}(4.37 f + 1)    (2-19)
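Equations 2-18 and 2-19 translate directly into code; a minimal sketch:

```python
import math

def erb_hz(f_khz):
    """ERB in Hz at center frequency f in kHz (Equation 2-18)."""
    return 24.7 * (4.37 * f_khz + 1.0)

def erb_number(f_khz):
    """Number of ERBs below frequency f in kHz (Equation 2-19)."""
    return 21.4 * math.log10(4.37 * f_khz + 1.0)
```

For example, at 1 kHz the ERB is 24.7 · 5.37 ≈ 133 Hz, and erb_number(1.0) ≈ 15.6; the bandwidth grows with center frequency, which is why the ERB scale is a natural spacing for an auditory filter bank.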


This ERB scale is used to design the front-end auditory filter bank in our proposed high-order audibility prediction model that will be explained in Chapter 4.

2.2.2 Experiments on Complex Signals

Gassler measured the detection threshold for complex signals consisting of evenly spaced sinusoids. As the number of frequency components in a sound was increased, the threshold, specified in terms of total energy, remained constant until the overall spacing of the tones reached the critical bandwidth. Thereafter the threshold increased by about 3 dB per doubling of bandwidth. These results indicated that the energies of the individual components in a complex sound sum in the detection of that sound, provided that all components lie within the same critical band. When the components are distributed over more than one critical band, the detection is based on the single critical band that gives the highest detectability (Malmierca and Irvine 2005). However, other results indicate that the simultaneous presence of signal energy in different critical bands aids auditory detection. Green showed that psychometric functions for complex sounds composed of 12 or 16 equally detectable components were largely parallel to those for a single tone; the thresholds for the complex tones, in terms of the level per component, were about 6 dB below the thresholds for the single components in isolation (Green 1958). Buus et al. compared the masked threshold of a 450-ms, 18-tone complex relative to a single pure tone. Their results showed that the threshold for an 18-tone complex, in terms of level per tone, is consistently lower than for pure tones, and the level decrease followed the 10\log_{10}(n) dB rule, where n is the number of frequency components (Buus et al. 1986). Van den Brink and Houtgast further pointed out that the masked threshold of a broadband brief signal (typically 10 ms or less)


decreases even faster as the signal bandwidth increases. Instead of 10\log_{10}(n) dB, the detection threshold for brief compound signals, in terms of level per component, is lowered by 16\log_{10}(n) dB, with n being the number of 1/3-octave bands equally excited by the input signal (Van den Brink and Houtgast 1990, Malmierca and Irvine 2005). How the auditory system improves signal detection given wideband information is the key to generalizing to more sophisticated real-world signals such as speech and audio. Clearly, it is not because of the integration of the signal powers across frequencies. Classical critical band theory has clarified that only the signal powers that fall into the same critical band are integrated, and that signals in different critical bands are processed independently. To explain the improved detectability of wideband signals, there are two models researchers generally support, the multi-band energy detector and the independent threshold model, which are introduced in the next two sections.

2.2.3 Multi-Band Energy Detector

Green proposed a multi-band energy detector to explain the improved masked threshold of a multi-component signal with respect to the threshold of each of the single-component stimuli (Green 1958). This model postulates that the energy within each auditory channel is transformed into a Gaussian-distributed decision variable x_i \sim N(\mu_i, \sigma_i). The detectability d_i in each channel is proportional to its signal intensity. It also assumes that the detection of a wideband signal depends on a weighted sum of the decision variables of the channels:

y = \sum_{i=1}^{n} w_i x_i    (2-20)


It further assumes the human auditory system can optimally combine the information across channels and make a final decision. Green proved that, to make the detection optimal, the weights must be inversely proportional to the variance and proportional to the detectability in each channel, i.e.,

w_i \propto d_i / \sigma_i    (2-21)

The resulting detectability of a wideband signal is

d_n = \sqrt{\sum_{i=1}^{n} d_i^2}    (2-22)

With Green's multi-band energy detection model, it is easy to explain the improved detectability of wideband signals. Suppose a wideband signal is equally detectable in all the critical bands; the overall detectability is then d_n = \sqrt{n}\, d_i. Since d_i is proportional to the signal intensity, the threshold corresponding to some fixed level of performance (e.g., d' = 1.16 to track a percentage correct of 79%) for the n-component signal will be

L_n = L_1 - 10\log_{10}(n) \text{ dB}    (2-23)

where L_1 is the threshold for a single component and L_n is the level per component of a complex signal at threshold. Despite the convenience of explaining the detectability improvement of wideband signals using this multi-band energy detector, it does not seem physiologically plausible to assume that there is interaction among different critical bands below threshold, especially among distant critical bands.
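Equations 2-22 and 2-23 can be sketched in a few lines; for example, for the 18-tone complex of Buus et al., the predicted per-component threshold drop is 10 log10(18) ≈ 12.6 dB:

```python
import math

def combined_detectability(d_per_band):
    """Equation 2-22: root-sum-of-squares combination across bands."""
    return math.sqrt(sum(d * d for d in d_per_band))

def level_drop_db(n):
    """Equation 2-23: per-component threshold improvement for n equal bands."""
    return 10.0 * math.log10(n)
```

With four bands each at d' = 1.16, the combined detectability is 2.32, i.e., twice the single-band value; doubling the number of equally excited bands lowers the per-component threshold by about 3 dB.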


2.2.4 Independent Threshold Model

The independent threshold model was proposed by Schafer and Gales (Schafer and Gales 1949) and used by Scholl (1961) to explain the different psychometric functions for detection of narrowband and wideband signals (Buus et al. 1986). According to this model, an increase in detectability resulting from multiple channels occurs not because their information is integrated to form a single basis for decision, but rather because each additional channel presents another independent detection opportunity. In other words, it is probability summation that causes the detectability improvement for wideband signals. This model also assumes statistically independent observations in each auditory channel. The rule used to combine decisions is to make a positive response when any one of the several observations is positive. Under this rule, the probability of detection based on n observations is

p_n = 1 - \prod_{i=1}^{n}(1 - p_i)    (2-24)

If the individual probabilities are all equal, this formula becomes

p_n = 1 - (1 - p_i)^n    (2-25)

To treat the forced-choice task with this model, it may be noted that the probability of a correct response on a single trial is

P(C) = p + \frac{1}{m}(1 - p)    (2-26)

where m is the number of intervals in a trial. Hence, the expression for P_n(C) can be written by substituting p_n from Equation 2-24 for p in Equation 2-26, with the result that


P_n(C) = 1 - \prod_{i=1}^{n}(1 - p_i) + \frac{1}{m}\prod_{i=1}^{n}(1 - p_i)    (2-27)

Scholl showed that the summation of probabilities could give rise to steeper psychometric functions for wideband than for narrowband signals. However, this finding is critically dependent on the assumed form of the psychometric function. Buus et al. derived a possible psychometric function for this model which could predict the 10\log_{10}(n) dB improvement in detectability for an n-component signal:

P(C) = 0.5\left(2 - 0.5^{\,10^{0.2\,SL}}\right)    (2-28)

where P(C) is the probability of a correct response in a two-alternative forced-choice (2AFC) task and SL is the sensation level in dB (Buus et al. 1986). In terms of physiology, the place coding theory says that signal information is encoded in the responses of the auditory nerve fibers, which discharge at a rate driven by the movement of the basilar membrane at each location. Auditory nerve fibers at different locations fire independently, and the ones that discharge at the greatest rate determine the signal audibility threshold. Also, in terms of psychoacoustics, critical band theory specifies that only the signal powers in the same critical band are integrated together and that different critical bands function independently. Hence, the independent threshold model is consistent with our existing knowledge in both physiology and psychoacoustics, which is why we prefer to embrace it.

2.3 Non-Stationary Sounds

The existing temporal integration models, including the energy integrator and the multiple-look model, have focused on stationary signals. Some studies suggest that nonstationary sounds can be characterized by their crest factors and kurtosis (Erdreich


1986, Hamernik and Qiu 2001). The crest factor, which is the ratio of the peak level to the root-mean-square (rms) value, is sensitive only to the single largest peak of a signal and thus is not a good predictor (Erdreich 1986). Kurtosis is defined as the fourth cumulant divided by the square of the second cumulant and is often used as a measure of the peakedness of the signal probability distribution. Kurtosis is correlated with auditory sensitivity to impulsive noise (Hamernik and Qiu 2001); however, it is not clear how to quantify its influence explicitly in a prediction model. More recently, Moore and Glasberg proposed an audibility prediction model for time-varying sounds based on loudness measurements. Psychoacoustic basics related to loudness and the loudness-based audibility prediction model are introduced in the next subsections.

2.3.1 Loudness

Loudness is a psychological description of the magnitude of auditory sensation (Fletcher and Munson 1933), which varies directly with the sound intensity, frequency and bandwidth.

Frequency dependence: Equal-loudness contours illustrate the frequency dependence of the loudness of pure tones. Based on equal-loudness contours, pure tones of different frequencies at the same intensity do not necessarily produce the same loudness. Also, the contours are flatter at higher loudness levels than at lower loudness levels, which is the result of a steeper function relating loudness to intensity at low frequencies than at medium frequencies.

Level dependence: Loudness as a function of signal intensity follows Stevens' power law,

N = k I^{0.3}    (2-29)


where N is loudness, I is intensity and k is a constant depending on the listener and the unit employed (Stevens 1955).

Bandwidth dependence: Zwicker et al. found in their experiment that, when the total energy was held constant, as the bandwidth of a signal increased, its loudness stayed roughly the same until the bandwidth exceeded the critical bandwidth. Zwicker et al. explained that, if loudness in each critical band followed Stevens' law, then distributing the intensity of a sound across n critical bands would result in an increase in total loudness, because n\,k(I/n)^{0.3} > k I^{0.3} when n > 1. This is Zwicker's theory of critical bandwidth for loudness summation (Zwicker et al. 1957).

Loudness scale: Stevens was the first to introduce the sone as the unit of loudness, where 1 sone was defined as the loudness of a 1 kHz pure tone with a sound pressure level of 40 dB. The loudness of other sounds is determined by comparison with this reference, the 1 kHz 40 dB tone. For example, a sound judged by listeners to be twice as loud as the reference would have a loudness of 2 sones (Moore 1995).

2.3.2 Audibility Prediction Based on Time-Varying Partial Loudness

Glasberg and Moore proposed an audibility prediction model for time-varying sounds based on the average short-term partial loudness (STPL). Partial loudness is defined as the loudness of a target sound presented in the context of other sounds, which is different from loudness in absolute quiet. The stages of this model are summarized as follows and illustrated in Figure 2-6 (Moore and Glasberg 2005):

1. A given signal and noise go through a finite impulse response filter representing the signal transmission of the outer and middle ear.

2. Calculation of a multi-resolution spectrum using the fast Fourier transform (FFT) for both signal and noise. To give adequate spectral resolution at low frequencies combined with adequate temporal resolution at high frequencies, six FFTs are


calculated in parallel, with longer signal segments for low frequencies and shorter signal segments for high frequencies, as illustrated in Figure 2-7. The windowed segments are zero-padded and use a 2048-point FFT. All FFTs are updated at 1-ms intervals.

3. Calculation of excitation patterns from the physical spectra of signal and noise.

4. Transformation of excitation patterns to specific partial loudness.

5. Calculation of instantaneous loudness by integrating specific loudness across frequencies.

6. Calculation of short-term partial loudness by low-pass filtering instantaneous loudness along the time axis.

7. Averaging the short-term partial loudness over the entire duration of the given signal to determine the signal audibility.

Figure 2-6. Block diagram of Moore and Glasberg's audibility prediction model

There are two drawbacks in the loudness-based audibility prediction model: the spectrum calculation and the loudness-averaging process, which are shaded in Figure 2-6. Even though this model calculates loudness based on the multi-resolution spectrum, in each auditory frequency band signals are still processed by a window of fixed size. This fixed-size window is not necessarily the ideal segmentation for some transient signals. As an example, shown in Figure 2-8, if the transient signal is


segmented into two fixed-size windows, as illustrated by the dotted lines, neither of the windows is audible and both will be given a loudness value close to zero sones. After low-pass filtering and averaging, the average short-term loudness will remain close to zero and the signal will be judged as not audible. However, if a different segmentation is used, as illustrated by the solid line, the audibility decision might turn around, because the signal in one of the windows is assessed as audible and the overall signal has a better chance of exceeding the audibility threshold.

Figure 2-7. Multi-resolution spectrum in Moore and Glasberg's loudness model

For some impulsive signals, such as percussion music, the signal may last a long time but its energy is highly concentrated in only a very brief time interval. The audibility of the signal is actually caused by only a few frames that have high enough loudness. Even if the signal is actually audible to the subjects, when counting its STPL over time, most of the frames may have loudness close to zero and the resulting average short-term loudness may become smaller than the predetermined reference


(Figure 2-9). Hence, using the average STPL over the duration of the signal tends to understate the impact of transients on the signal audibility.

Figure 2-8. Illustration of different segmentation schemes

Figure 2-9. Illustration of averaging loudness across windows
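The dilution caused by averaging can be illustrated numerically. The loudness track below is hypothetical (not measured data): most frames are nearly silent and a few percussion hits are well above a hypothetical reference, yet the average lands below it:

```python
# Hypothetical short-term loudness track (sones): 96 near-silent frames
# plus 4 audible percussion hits.
track = [0.01] * 96 + [2.0] * 4

avg = sum(track) / len(track)   # average STPL over the whole signal
peak = max(track)               # loudness of the frames that are heard
reference = 0.5                 # hypothetical audibility reference (sones)

# peak >> reference, so the hits are clearly audible, yet avg < reference:
# the averaging rule would declare the whole signal inaudible.
```

Here avg ≈ 0.09 sones, far below the reference, even though the four hit frames sit at 2.0 sones, four times the reference: exactly the failure mode described above.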


Moreover, studies by Willera et al. suggest that the most efficient spectro-temporal integration for the auditory system is confined to a narrow time window and critical bandwidth (Willera et al. 1990). Summing the specific loudness over all ERB bands, and averaging the low-pass-filtered instantaneous loudness over the duration of the signal, in effect give equal weights to the loudness of all ERB bands and all time frames. This scheme is not consistent with the spectro-temporal integration efficiency of the auditory system.

2.4 Summary

In this chapter, we have reviewed temporal integration and spectral integration for audition. For temporal integration, both the energy detector and the multiple-look model are derived from narrowband stationary pure-tone psychoacoustic data and aim to model the 10 dB/decade intensity-duration trade-off. However, the integration rule for wideband nonstationary signals such as speech and audio is not necessarily the same, so an energy-based metric may not be sufficient to predict the audibility thresholds of real-world wideband nonstationary signals. Future chapters in this dissertation will explore the most appropriate quantity to integrate in the time domain. As for auditory spectral integration, it is commonly agreed that signal powers in the same critical band are integrated. Both the multi-band energy detector and the independent threshold model can explain the improved detectability of wideband signals. However, the multi-band energy detector is built on the assumption that the auditory system can combine information optimally to make a final audibility decision for a wideband signal, and it is still under debate whether across-frequency spectral integration is physiologically plausible. On the other hand, the independent threshold model postulates that our auditory system processes signals in different critical bands independently


and that the improved detectability of wideband signals results from the probability summation of multiple bands, which is consistent with our existing knowledge of critical bands and place coding. As a result, we use the independent threshold model as the basis for our more sophisticated spectro-temporal audibility prediction model for wideband nonstationary signals.


CHAPTER 3
SUBJECTIVE EXPERIMENTS ON NONSTATIONARY SOUNDS

This chapter explores how the temporal structure of sounds impacts human auditory sensitivity. The design of the subjective experiment, including the test signal synthesis and the experiment paradigm, is discussed in detail. Observations from the subjective listening test show that, in spite of identical power spectra, the more impulsive the signal, the lower the subjective audibility threshold, and the more detectable the signal is to the human auditory system.

3.1 Signal Synthesis

As mentioned in Chapter 2, the signal energy is not sufficient to characterize nonstationary signals. There is strong evidence that the human auditory system is more sensitive to transient signals than to stationary signals of the same energy (Hamernik et al. 2001, Price and Wansack 1985, Erdreich 1986). In order to verify this concept, we generated a family of signals that have the same windowed Fourier spectra but different temporal envelopes, and used these generated signals in subjective listening tests. By factoring out the spectral differences, we can clearly see how the temporal differences alone affect human audibility thresholds. The key step of the experiment is to construct temporal waveforms from the Fourier magnitude and phase spectra. The solution is derived from the Fourier transform equation. In the Fourier transform pair between the time and frequency domains, the temporal envelope is controlled by the spectral magnitude and phase. Since we do not want to change the magnitude spectrum, only the phase is left with which to manipulate the temporal envelope. By maintaining the original signal magnitude spectrum, replacing the phase spectrum with designed phases, and applying the inverse


discrete Fourier transform (IDFT), a given signal's temporal envelope can be changed over each selected time window without introducing any difference in the signal magnitude spectrum. The phase solutions for our signal synthesis are discussed in the next two sections.

3.1.1 PDF (Probability Density Function) Based Phase Manipulation

Hsueh and Hamernik proposed a pdf-based phase algorithm for random noise synthesis, as shown in Figure 3-1. By keeping the overall magnitude spectrum and changing the phase probability distribution (e.g., its standard deviation) within certain selected frequency bands (on-band), peaks in the random waveform can be constructed from the phase-manipulated frequency bands. In this way, an entire family of signals can be produced having the same magnitude spectra but temporal statistical characteristics (e.g., skewness and kurtosis) that vary along the continuum from Gaussian through non-Gaussian to purely impulsive (Hsueh and Hamernik 1990).

Figure 3-1. Block diagram of the pdf-based phase algorithm
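A minimal sketch of the pdf-based phase idea (our own illustration, not Hsueh and Hamernik's implementation; the band edges, frame size and seed are arbitrary, and a slow O(N²) IDFT keeps the example dependency-free): two signals share the same on-band magnitude spectrum, but drawing the phases from a narrow distribution concentrates the waveform into peaks and raises its kurtosis.

```python
import cmath
import math
import random

N = 128

def real_idft(spec):
    """Naive inverse DFT returning the real part (O(N^2), fine for N=128)."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def kurtosis(x):
    """Fourth central moment over squared second central moment."""
    m = sum(x) / len(x)
    c2 = sum((v - m) ** 2 for v in x) / len(x)
    c4 = sum((v - m) ** 4 for v in x) / len(x)
    return c4 / (c2 * c2)

def synth(phase_spread, seed=3):
    """Flat magnitude on bins 10..40; phase ~ Uniform(-spread, +spread)."""
    rng = random.Random(seed)
    spec = [0j] * N
    for k in range(10, 41):
        phi = rng.uniform(-phase_spread, phase_spread)
        spec[k] = cmath.exp(1j * phi)
        spec[N - k] = cmath.exp(-1j * phi)  # Hermitian symmetry: real output
    return real_idft(spec)

gaussian_like = synth(math.pi)  # fully random phases: noise-like waveform
impulsive = synth(0.05)         # nearly aligned phases: peaky waveform
```

The two waveforms have identical magnitude spectra (and hence identical energy, by Parseval's theorem), but the aligned-phase version has a far higher kurtosis, mirroring the Gaussian-to-impulsive continuum described above.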


This pdf-based phase algorithm is able to synthesize random noise with different temporal structures and a given power spectrum for auditory sensitivity experiments (Hamernik and Qiu 2001, Yang and Harris 2009); however, when it is applied to audio synthesis frame by frame, it creates significant artifacts in the processed audio signals.

3.1.2 Key-Fowle-Haggarty Phase

Quatieri and McAulay proposed a peak-to-rms (root mean square) reduction technique for speech signals based on a sinusoidal model (Quatieri and McAulay 1991). We modified Quatieri and McAulay's method to manipulate the audio signal temporal waveforms without introducing any spectral differences. By maintaining the original signal magnitude spectrum, replacing the phase spectrum with the Key-Fowle-Haggarty (KFH) phase, and applying the inverse discrete Fourier transform (IDFT), a given signal envelope can be optimally flattened over each selected time window. This technique does not introduce any difference into the signal magnitude spectrum and assumes that the duration-bandwidth product is large enough (Fowle 1964). The diagram of our signal synthesis is shown in Figure 3-2. The filtering procedure before phase manipulation selects the single or multiple critical band(s) to be included in the synthesized signals. The KFH phase can be computed as

\theta_{KFH}(\omega) = -L \int_0^{\omega}\!\!\int_0^{\sigma} M^2(\nu)\, d\nu\, d\sigma    (3-1)

where M^2 is the signal power spectrum normalized by the signal energy and L is the frame size in samples. The KFH phase solution was originally designed to reduce the dynamic range of radar signals without changing the signal spectrum. Since the goal is to minimize the


dynamic range of the temporal envelope, the ideal case is to make the temporal envelope a constant level; in other words, the dynamic range is zero. Given the magnitude spectrum, this constant temporal envelope can be derived using Parseval's theorem. The phase problem can then be simplified to finding a spectral phase, as a function of the signal magnitude spectrum, through the Fourier transform equation under the constraint of a given constant temporal envelope. With a series of mathematical manipulations, including stationary phase theory and a Taylor series expansion, an approximate solution was found by Key et al. at MIT Lincoln Labs (Fowle 1964). The detailed mathematical derivation of the KFH phase for manipulation of signal temporal structure can be found in the Appendix.

Figure 3-2. Block diagram of the KFH phase manipulation system

It can be seen from Equation 3-1 that the KFH phase is a function of the signal power spectrum and the selected window size. Since the magnitude spectrum is determined by the original signal, the peak-to-rms reduction for the entire signal depends on the window length used in the phase manipulation. Figure 3-3 shows a set of signals synthesized using the KFH phase algorithm. The original signal shown in the upper left


panel is an exponentially decaying 1 kHz sinusoid repeated with a period of 100 ms, and the remaining panels show phase-manipulated signals with different window sizes. As the window size increases from 25 ms to 100 ms, the signal temporal envelope becomes flatter. When the window size exceeds 100 ms, the period of the original signal, the signal envelopes are not flattened further. As a result, for periodic or quasi-periodic signals (e.g., voiced speech), choosing the pitch period as the window size L in Equation 3-1 will optimally flatten the overall signal. For nonperiodic signals, we can treat the overall signal as one period: the longer the window size used for the KFH phase manipulation, the flatter the signal envelope will be.

Figure 3-3. Synthetic signals from the KFH phase algorithm
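A discrete sketch of the KFH construction (our own simplified discretization of Equation 3-1, not the exact implementation used in this work): the double integral becomes a double cumulative sum of the energy-normalized power spectrum, and the resulting group delay spreads the frame energy across the window. Applied to a single impulse, whose magnitude spectrum is flat, the KFH phase yields a chirp with a much lower peak-to-rms ratio.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def kfh_flatten(x):
    """Keep |X(k)|, replace the phase by a discretized KFH phase."""
    n = len(x)
    spec = dft(x)
    half = n // 2
    power = [abs(spec[k]) ** 2 for k in range(half + 1)]
    total = sum(power)
    # group delay = L * cumulative normalized power (first integral) ...
    delay, acc = [], 0.0
    for p in power:
        acc += p / total
        delay.append(n * acc)
    # ... and phase = negative cumulative group delay (second integral)
    out = [0j] * n
    phase = 0.0
    for k in range(half + 1):
        phase -= 2.0 * math.pi * delay[k] / n
        out[k] = abs(spec[k]) * cmath.exp(1j * phase)
        if 0 < k < half:
            out[n - k] = out[k].conjugate()  # Hermitian symmetry
    return idft(out)

def crest_db(x):
    """Peak-to-rms ratio in dB."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return 20.0 * math.log10(max(abs(v) for v in x) / rms)
```

For a 128-sample unit impulse, crest_db is about 21 dB before manipulation; after kfh_flatten the energy is smeared into a chirp spanning the frame and the crest factor drops by well over 8 dB, illustrating the peak-to-rms reduction the KFH phase is designed for.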


3.2 Subjective Listening Experiments

3.2.1 Test Signal Analysis

We study two special cases: single-critical-band (1 kHz, 2 kHz and 4 kHz) and three-critical-band (1 kHz + 2 kHz + 4 kHz) audio signals. Two wideband music clips of typical Cuban percussion instruments were selected: the conga and the bongo drum. Each clip is about 2 seconds long and sampled at 32 kHz. For the single-band case, these two signals are first ERB (equivalent rectangular bandwidth) filtered with center frequencies of 1 kHz, 2 kHz and 4 kHz and fed into the phase manipulation system. The single-band signals are segmented into 50%-overlapped frames with durations of 64 ms and 512 ms. For the three-band case, three ERB-filtered, single-band, phase-manipulated signals are summed into one three-band signal. Figure 3-4 illustrates the power spectral densities (PSD) of the single- and three-band signals estimated using the Welch method (8-ms rectangular window with 50% overlap). The blue line shows the ERB-filtered audio without phase manipulation, the red line shows the phase-manipulated signal with a window size of 64 ms, and the green line shows the phase-manipulated signal with a window size of 512 ms. The root-mean-square (RMS) values of all three signals are normalized to unity. Theoretically, the processed audio and the input signal should have identical spectra window by window; overall, there are some slight differences after connecting all the frames into one signal, due to the different window size used in the power spectral density estimation. However, the spectral differences are too small to cause a difference in audibility thresholds. On the other hand, Figure 3-5 shows the large temporal differences among the processed signals. Even though these signals have unity root-mean-square power, they have


very different temporal structures. The peak-to-RMS ratio differences are as large as 8.59 dB for the narrowband signals and 10.8 dB for the three-band signals.

3.2.2 Experimental Paradigm

The audio stimuli are played binaurally with added white noise through headphones (Sennheiser HD 251) to subjects with normal hearing. The number of subjects is extended from one (Yang and Harris 2010) to four. Each signal is tested five times and the results are averaged. Levitt's adaptive two-alternative forced choice (2AFC) paradigm was used to determine the masked signal-to-noise ratios (SNR) at threshold for all stimuli (Levitt 1971). There are two intervals in each trial: white noise alone, and signal plus the same noise. Three consecutive correct responses step the signal level down and one incorrect response steps it up; hence, thresholds are determined at the 79% correct point. Twelve reversals are used for each test. The step size is initially set to 5 dB, reduced to 2 dB after the first four reversals, and to 1 dB after the first six reversals. The first four reversals are discarded and the following eight reversals are averaged to calculate the masked SNR at the audibility threshold.

3.2.3 Discussion

The normalized masked SNRs in decibels for the single- and three-band signals are shown in Table 3-1 and Figure 3-6. From this table and figure, we can see that the eight groups of signals have almost the same spectra within each group, but show 4 dB (conga, 1 kHz) to 11 dB (conga, 4 kHz) differences in audibility thresholds. Since the spectra are nearly identical, the factors that cause these differences in audibility thresholds must be temporal. In each of the eight groups, the original signal is the most impulsive signal (and has the lowest threshold) while the phase-manipulated signal with a window size of 512 ms is the flattest (and has the highest threshold), indicating


that the auditory system is more sensitive to impulsive signals than to flat signals with the same power spectrum.

Figure 3-4. Power spectral densities (PSD) of A) single-band signals and B) three-band signals (original, 64 ms window, and 512 ms window).
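The adaptive 2AFC procedure of Section 3.2.2 can be sketched as follows. The simulated listener, its logistic psychometric function, and all parameter values other than the step-size and reversal rules are illustrative assumptions:

```python
import math, random

def staircase(prob_correct, start_snr=20.0, n_reversals=12):
    """Levitt 3-down-1-up adaptive track. Step sizes follow the text:
    5 dB, then 2 dB after 4 reversals, then 1 dB after 6 reversals.
    The first 4 reversals are discarded and the last 8 averaged."""
    snr, correct_run, direction = start_snr, 0, -1
    reversals = []
    while len(reversals) < n_reversals:
        step = 5.0 if len(reversals) < 4 else (2.0 if len(reversals) < 6 else 1.0)
        if random.random() < prob_correct(snr):
            correct_run += 1
            if correct_run == 3:            # three correct -> make it harder
                move, correct_run = -1, 0
            else:
                move = 0
        else:
            correct_run, move = 0, 1        # one wrong -> make it easier
        if move != 0 and move != direction:  # direction change = reversal
            reversals.append(snr)
            direction = move
        snr += move * step
    return sum(reversals[4:]) / 8.0

# hypothetical listener: logistic psychometric function, chance level 50%
def listener(snr):
    return 0.5 + 0.5 / (1.0 + math.exp(-(snr + 5.0)))

random.seed(1)
runs = [staircase(listener) for _ in range(200)]
estimate = sum(runs) / len(runs)
print(round(estimate, 1))
```

A 3-down-1-up rule converges on the level where the per-trial probability of three consecutive correct responses is 0.5, i.e. p^3 = 0.5, giving p ≈ 0.794, which is the 79% point cited in the text.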


Figure 3-5. Time-domain plots of the single-/three-band signals whose PSDs are shown in Figure 3-4. A) Single-band signals (peak/RMS: original 28.44 dB; 64 ms window 23.33 dB; 512 ms window 19.85 dB). B) Three-band signals (peak/RMS: original 27.66 dB; 64 ms window 22.37 dB; 512 ms window 16.86 dB).


Table 3-1. Average subjective thresholds for single-/three-band signals, normalized by the threshold of a 1 kHz tone

Instrument  Window (ms)  1 kHz (dB)  2 kHz (dB)  4 kHz (dB)  3-band (dB)
Conga       Original     -10.00      -10.25      -9.25       -8.00
Conga       64           -7.00       -6.00       -6.50       -4.00
Conga       512          -6.00       -2.50        1.75        0.00
Bongo       Original     -4.00       -3.00       -2.25       -1.75
Bongo       64           -2.00       -1.25       -0.50       -0.50
Bongo       512           1.00        2.00        2.50        4.75

Figure 3-6. Bar plot of the subjective thresholds listed in Table 3-1, normalized by the threshold of a 1 kHz tone. A) Subjective thresholds for conga signals. B) Subjective thresholds for bongo signals.


3.3 Summary

The goal of this chapter is to determine how signal temporal structure impacts human auditory sensitivity. To exclude spectral differences and observe how temporal structure alone affects signal audibility, two phase manipulation approaches are considered for test signal synthesis. By changing the standard deviation (STD) of the phase distribution, Hsueh and Hamernik's pdf-based approach is effective in generating a family of noise signals of different impulsiveness given a preselected power spectrum. However, when applied to audio signal synthesis, this approach introduces many artifacts and severely degrades the perceptual quality of the audio. On the other hand, by changing the manipulation window size, the KFH phase approach can modify the signal temporal structure without modifying the signal power spectrum, while retaining the perceptual quality of the synthetic audio at an acceptable level.

Signals synthesized with the KFH phase algorithm were tested in a standard two-alternative forced choice (2AFC) subjective experiment. The subjective responses show that, given nearly identical power spectra, the more impulsive the signal, the lower its audibility threshold. In other words, the human auditory system is more sensitive to impulsive signals than to steady signals of the same power spectrum, which cannot be explained by the conventional second-order energy detector. To better predict the audibility of nonstationary signals, a higher-order integration model is proposed in the next chapter.


CHAPTER 4
A HIGHER-ORDER SPECTRO-TEMPORAL INTEGRATION MODEL FOR AUDIBILITY PREDICTION

In this chapter, a higher-order model to determine the audibility of audio signals is presented. Previous models have been energy based (second order) and adequate only for stationary, narrowband signals. Music, speech, and other audio signals are nonstationary and wideband, so traditional energy-based models poorly predict the audibility of these sounds. The predictions from the higher-order model are compared with actual subjective listening tests to show that the higher-order, wideband technique outperforms previous models. Possible extensions of our higher-order integration model are proposed as well.

4.1 Basic Model

4.1.1 Model Description

The goal of our model is to predict the lowest signal-to-noise ratio (SNR) at which human subjects can detect a stimulus in the presence of a white noise masker. The signal can be wideband or narrowband, stationary or transient. The block diagram of the model is shown in Figure 4-1. The model consists of a front-end auditory filter bank, followed by a nonlinear operator whose output is low-pass filtered and thresholded to determine the signal audibility. The most important part of this model is the nonlinear block shaded in blue. In conventional energy detectors, this nonlinear block is typically a second-order operator, i.e., x^2, while in our proposed model it is replaced by a higher-order operator, x^n with n > 2.


Figure 4-1. Block diagram of the higher-order spectro-temporal integration model (ERB filter bank, nonlinearity x^n, low-pass filters, n-th root, and per-band threshold comparators TH_i feeding an OR decision).

4.1.1.1 Front-end auditory filter bank

In this model, an unknown stimulus first goes through an ERB (equivalent rectangular bandwidth) filter bank, which aims to mimic human auditory frequency selectivity. Based on the critical band theory, critical bands are independent auditory processing channels, i.e., only spectral components that fall into the same critical band are processed together. The ERB scale recommended by Moore and Glasberg is close to the critical band scale and better fits the data from notched-noise masking experiments. The empirically fitted equation for the ERB is

ERB(f) = 24.7 (4.37 f/1000 + 1)    (4-1)

where f is the center frequency in hertz. The ERB scale is defined as the number of ERBs below each frequency:

number of ERBs = 21.4 log10(4.37 f/1000 + 1)    (4-2)
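Equations 4-1 and 4-2 can be evaluated directly; the following small sketch (function names are ours) computes both quantities:

```python
import math

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth (Hz) at centre frequency f (Hz),
    after Glasberg and Moore: ERB(f) = 24.7 (4.37 f/1000 + 1)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_number(f_hz):
    """ERB-rate scale: number of ERBs below frequency f (Hz):
    21.4 log10(4.37 f/1000 + 1)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

print(round(erb_bandwidth(1000.0), 1))  # 132.6 Hz at 1 kHz
print(round(erb_number(1000.0), 1))     # 15.6 ERBs below 1 kHz
```

A 1 kHz auditory filter is thus roughly 133 Hz wide, which is why the 1 kHz, 2 kHz, and 4 kHz test bands of Chapter 3 fall into well-separated channels.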


The ERB filter bank in the front end of the model provides a relatively accurate approximation of the auditory frequency response.

4.1.1.2 Nonlinear processing

The nonlinear block x^n in Figure 4-1 is typically a rectifier or a square operator, i.e., n = 2. However, by analyzing the subjective thresholds of the audio signals in Section 3.2, we have shown that these energy-based audibility models do not accurately predict the audibility of transient stimuli. We therefore replace the second-order processing with higher-order processing, x^n with n > 2. Intuitively, this gives more emphasis to the transient (larger) parts of the signal. In Section 4.1.2, by considering orders from two to ten, we find the values of n that best predict audibility.

4.1.1.3 Low-pass filtering

As is well known, the human auditory system cannot track the very fine structure of a stimulus. Auditory temporal integration is traditionally modeled as a combination of a square-law rectifier and a low-pass filter controlled by a time constant, which is also known as leaky integration. The low-pass filter can be formulated as

S'(n+1) = α S(n+1) + (1 − α) S'(n)    (4-3)

where S'(n) is the output of the low-pass filter at time n, and S(n) is the n-th output from the nonlinear block. α is a constant related to the low-pass filter time constant T_c, and can be calculated as

α = 1 − exp(−T_i / T_c)    (4-4)

where T_i is the time interval between successive values.


The time constant T_c is chosen to match the temporal integration limits of the auditory system: 200 ms for an attack (S(n+1) ≥ S'(n)) and 500 ms for a release (S(n+1) < S'(n)).

4.1.1.4 Decision making

Finally, we need to decide whether the observation is signal plus noise or noise alone:

H0: x(k) = n(k)
H1: x(k) = s(k) + n(k)

where s(k) denotes the signal samples, n(k) the noise masker samples, and x(k) the observation samples. The output of each band is compared with a predetermined frequency-dependent threshold TH_i, as shown in Figure 4-1, which is found from the subjective thresholds of the narrowband audio signals. To determine the threshold TH_i in Figure 4-1, we first calibrate the narrowband audio signals at their just-audible levels, as given in Table 3-1; we then feed each calibrated narrowband signal to the model and obtain its n-th root output. The n-th root outputs of the narrowband audio signals with the same center frequency are averaged to set the predetermined threshold TH_i for the corresponding ERB band. Finally, the model outputs a binary decision for each band: if the stimulus is audible in any ERB band, then the overall stimulus is audible. The lowest SNR at which a stimulus is audible in the white noise masker is the masked audibility threshold for that stimulus.
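The per-band processing chain (nonlinearity, asymmetric leaky integration per equations 4-3 and 4-4, n-th root) can be sketched as follows. The function name, parameter names, and the per-sample attack/release switch are our reading of the text, not the author's implementation:

```python
import numpy as np

def leaky_integrate(x, order=4, fs=32000, t_attack=0.2, t_release=0.5):
    """Higher-order temporal integration for one auditory band: raise the
    rectified signal to the n-th power, then smooth with a leaky integrator
    whose time constant is 200 ms for attacks and 500 ms for releases
    (alpha = 1 - exp(-Ti/Tc), equation 4-4). Returns the n-th root so that
    outputs are comparable across orders."""
    ti = 1.0 / fs
    a_att = 1.0 - np.exp(-ti / t_attack)
    a_rel = 1.0 - np.exp(-ti / t_release)
    s = np.abs(x) ** order
    out = np.empty_like(s)
    state = 0.0
    for i, v in enumerate(s):
        alpha = a_att if v > state else a_rel   # attack vs release
        state += alpha * (v - state)            # equation 4-3
        out[i] = state
    return out ** (1.0 / order)

# an impulsive burst drives the 4th-order integrator harder than a
# steady tone of the same RMS power
fs = 32000
t = np.arange(fs // 2) / fs
steady = np.sin(2 * np.pi * 1000 * t)
burst = steady * (t < 0.05)
burst *= np.sqrt(np.mean(steady**2) / np.mean(burst**2))  # equalize RMS
print(leaky_integrate(burst).max() > leaky_integrate(steady).max())
```

This is exactly the behavior the subjective data call for: with equal power, the concentrated burst produces the larger integrator output and so crosses the band threshold TH_i at a lower SNR.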


4.1.2 Model Evaluation and Discussion

Six three-band audio signals were fed into the higher-order integration model shown in Figure 4-1 and compared with the frequency-dependent thresholds TH_i. The prediction errors of both the loudness-based models and the higher-order models are listed in Table 4-1. The prediction error is calculated as the predicted threshold minus the actual subjective threshold listed in Table 3-1. In Table 4-1, the audio signals C1-C3 and B1-B3 represent the two groups of three-band signals, conga and bongo, in three cases: the signal without phase manipulation, and the phase-manipulated signals with window sizes of 64 ms and 512 ms, respectively. The final column of the table shows the average absolute prediction error over all six signals for each model. The average prediction errors are also plotted in Figure 4-2. The models compared in Table 4-1 are as follows.

Loud-1 denotes the original loudness-based model proposed by Glasberg and Moore (Glasberg and Moore 2005).

Loud-2 denotes a modified loudness-based model in which, instead of averaging short-term loudness over the duration of the signal, instantaneous loudness is low-pass filtered with the same time constants as our higher-order model (200 ms for an attack and 500 ms for a release), and the output of the low-pass filter is compared with a threshold preset by a 1 kHz tone at threshold.

HO-2 to HO-10 denote our higher-order spectro-temporal integration model shown in Figure 4-1, using orders from two to ten.


HO-Inf represents the higher-order integration model in the limit of infinite order. Let x(i) be the signal samples, N the total number of samples, and M the maximum absolute sample value. Then

lim_{n→∞} ( Σ_{i=1}^{N} |x(i)|^n )^{1/n} = M lim_{n→∞} ( Σ_{i=1}^{N} (|x(i)|/M)^n )^{1/n} = M    (4-5)

since each ratio |x(i)|/M is at most one and at least one of them equals one, so the bracketed sum lies between 1 and N and its n-th root tends to one. From this simple derivation, we can see that the integration model with infinite order is essentially a signal peak detector.

Table 4-1. Prediction errors for three-band audio signals in decibels

Model    C1    C2    C3    B1    B2    B3    Average
Loud-1   3.00  2.00  1.00  1.75  1.50  0.75  1.67
Loud-2   2.00  1.00  2.00  0.25  0.50  1.75  1.25
HO-2     3.50  0.50  2.50  1.75  0.00  3.75  2.00
HO-3     1.00  0.50  1.50  0.75  0.50  2.75  1.17
HO-4     0.00  0.00  1.00  0.25  1.00  2.25  0.75
HO-5     0.50  0.00  1.00  0.75  1.50  1.75  0.92
HO-6     0.50  0.00  1.00  0.75  1.50  1.75  0.92
HO-7     1.00  0.00  0.50  1.25  1.50  1.75  1.00
HO-8     1.00  0.00  0.50  1.25  2.00  1.75  1.08
HO-9     1.00  0.00  0.50  1.25  2.00  1.75  1.08
HO-10    1.00  0.00  0.50  1.75  2.00  1.75  1.17
HO-Inf   1.00  0.00  0.50  2.25  2.50  1.75  1.33

Figure 4-2. Average prediction error for three-band signals in decibels (Loud-1, Loud-2, and HO-2 through HO-Inf).
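The limit in equation 4-5 is easy to check numerically: as the order n grows, the n-th order statistic converges to the signal peak. A small sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)   # arbitrary test signal

def ho_stat(x, n):
    """n-th order statistic (sum |x(i)|^n)^(1/n), as in equation 4-5."""
    return np.sum(np.abs(x) ** n) ** (1.0 / n)

peak = np.max(np.abs(x))
# as the order grows, the statistic shrinks toward the signal peak
for n in (2, 4, 10, 50, 200):
    print(n, round(float(ho_stat(x, n) / peak), 3))
```

At n = 2 the statistic is the usual energy (RMS-like) measure, far above the peak for a long signal; by n = 200 the ratio to the peak is essentially one, which is why HO-Inf behaves as a pure peak detector in Table 4-1.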


The observations from Table 4-1 and Figure 4-2 are as follows.

Loudness-based models versus higher-order models: From Figure 4-2 we can see that, in terms of average prediction error, the higher-order models HO-3 to HO-10 outperform the two loudness-based models, Loud-1 and Loud-2. Overall, on our experimental data the fourth-order model is the best audibility predictor and the second-order model is the worst.

Loudness-based models: For the two loudness-based models, Loud-1 and Loud-2, simply replacing the average of the short-term loudness with the maximum of the low-pass-filtered instantaneous loudness decreases the average prediction error by 0.42 dB and the individual prediction error for the most impulsive signal, C1, by 5 dB. This indicates that averaging loudness over the whole duration of a signal seriously underestimates the audibility of an impulsive signal whose energy is highly concentrated in a brief time interval.

Models of all orders: Among the higher-order models HO-2 to HO-Inf, the fourth-order model outperforms the others. When the order equals two, the predictor acts as a single-band energy detector with limited memory. As the order increases, the predictor emphasizes the signal peaks more strongly. It turns out that a moderate order between the second order and the infinite order (peak detection) best predicts all the tested audio signals, which indicates that both the signal energy and the signal peaks contribute to signal audibility.

4.2 Model Extension I

As mentioned in Chapter 2, the complex tone experiments (Green 1958, Buus et al. 1986, Van den Brink and Houtgast 1990) show that equally detectable signal components in isolated critical bands improve the detectability of the overall signal.


However, our basic model in the previous section predicts audibility based on the most detectable critical band and hence cannot explain the decreased audibility thresholds of complex tones. To overcome this problem, an extended model that incorporates auditory spectral integration is proposed in this section.

4.2.1 Extended Model with Spectral Integration Based on Probability Summation

As stated in Chapter 2, the independent threshold model assumes statistically independent observations in each auditory channel and makes decisions based on probability summation across auditory channels. Compared with the multiband integration model, it offers a better balance between explaining the improved detectability of wideband signals and remaining consistent with the critical band theory. As a result, the independent threshold spectral integration model is incorporated into our extended audibility prediction model. The decision rule used in the independent threshold model is to make a positive response when any one of the several observations is positive. Under this rule, the probability of detection based on observations in M critical bands is

P = 1 − Π_{i=1}^{M} (1 − p_i)    (4-6)

where p_i denotes the detection probability in the i-th critical band. The psychometric function proposed by Green and Luce (Green and Luce 1975) is most commonly used to calculate the auditory detection probability for each critical band:

p = 1 − exp(−α I^β)    (4-7)

where I represents the signal intensity, and α and β are both free parameters.


Let I_TH and p_TH be the signal intensity and detection probability at threshold for a given critical band, and let I_i and p_i be an arbitrary intensity and the corresponding detection probability for the same signal within the same critical band. We have

p_TH = 1 − exp(−α I_TH^β)    (4-8)

p_i = 1 − exp(−α I_i^β)    (4-9)

From equation (4-8), we have

α = −ln(1 − p_TH) / I_TH^β    (4-10)

Substituting equation (4-10) into equation (4-9) gives the individual probability p_i as a function of the signal intensity I_i, the threshold intensity I_TH, the detection probability at threshold p_TH for the same critical band, and the preselected parameter β:

p_i = 1 − exp( (I_i / I_TH)^β ln(1 − p_TH) )    (4-11)

where the frequency-dependent I_TH and p_TH can be determined from the single-band signals at threshold. The free parameter β is set to the value that best fits the experimental data; its value controls how much emphasis is put on the most detectable auditory frequency channels. The overall extended model is shown in Figure 4-3.

4.2.2 Model Evaluation and Discussion

Values of β = 2, 4, and 6 are used to test the extended model shown in Figure 4-3. The prediction errors on our experimental data are listed in Table 4-2. By putting more emphasis on the most detectable auditory frequency channels and attenuating the


contribution of the least detectable auditory frequency channels, β = 6 performs best in predicting the audibility thresholds of all the tested audio signals.

Figure 4-3. Block diagram of the extended higher-order integration model for audibility prediction (higher-order temporal integration per ERB band, followed by spectral integration of the per-band detection probabilities p_1, ..., p_M).

By comparing Table 4-1 and Table 4-2, the extended model with β = 6 performs very close to the basic model. In this evaluation, since the tested audio signals have power distributed very differently across the auditory frequency channels, the extended model with β = 6 essentially reduces to the basic model, making the audibility decision solely from the most detectable auditory channel for the tested audio signals.
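Equations 4-6 and 4-11 can be exercised directly. In the sketch below, β = 6 comes from the text, while p_TH = 0.794 (the 2AFC tracking point) and the function names are illustrative assumptions:

```python
import math

def band_probability(i_ratio, p_th=0.794, beta=6.0):
    """Detection probability in one critical band (equation 4-11):
    p_i = 1 - exp((I_i / I_TH)^beta * ln(1 - p_TH)).
    `i_ratio` is the intensity ratio I_i / I_TH for that band."""
    return 1.0 - math.exp((i_ratio ** beta) * math.log(1.0 - p_th))

def overall_probability(i_ratios, p_th=0.794, beta=6.0):
    """Probability summation across independent bands (equation 4-6):
    P = 1 - prod(1 - p_i)."""
    q = 1.0
    for r in i_ratios:
        q *= 1.0 - band_probability(r, p_th, beta)
    return 1.0 - q

# a band exactly at its threshold detects with probability p_TH ...
print(round(band_probability(1.0), 3))               # 0.794
# ... and three such bands together detect more often than one alone
print(round(overall_probability([1.0, 1.0, 1.0]), 3))  # 0.991
```

This is the behavior the complex-tone experiments require: equally detectable components in separate bands raise the overall detection probability, while a large β lets a single dominant band carry the decision, recovering the basic model.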


The advantage of this extended model lies in predicting the decreased audibility thresholds of signals that have equivalent or comparable detectability in multiple auditory channels. The value of β can be optimized with more subjective data.

Table 4-2. Average prediction error in decibels (dB) for different beta parameters

Model   β = 2   β = 4   β = 6
HO-2    2.75    2.08    2.00
HO-3    2.58    1.08    1.17
HO-4    2.58    0.92    0.75
HO-5    2.67    1.08    1.08
HO-6    2.67    1.17    0.92
HO-7    2.67    1.17    1.08
HO-8    2.58    1.17    1.08
HO-9    2.58    1.25    1.08
HO-10   2.58    1.25    1.25

4.3 Model Extension II

4.3.1 Analogy between Audibility Prediction and Hearing Loss Prediction

The human auditory system has evolved so that it can detect sounds with displacements in the subangstrom range while, at its upper limit, it can faithfully encode sounds over a dynamic range of 120 dB SPL (Henderson and Hamernik 1995). Incoming sound travels down the external ear (pinna and meatus) and causes the tympanic membrane to vibrate. These vibrations are transmitted through the middle ear by three ossicles to the oval window, the opening of the spiral-shaped cochlea. When the oval window is set in motion by the incoming sound, a pressure difference is applied across the basilar membrane (BM) and causes the BM to move. As the BM moves up and down, a shearing motion is created between the BM and the tectorial membrane. As a result, the hairs at the top of the outer hair cells are displaced, which leads to the excitation of the inner hair cells and, in turn, to the generation of action potentials in the neurons of the auditory nerve (Moore 2003).


However, this process cannot continue indefinitely. Overstimulation can cause cellular fatigue in the auditory system due to a change in the metabolic processes of the inner ear, resulting in a temporary threshold shift (TTS). With greater exposure to sound, the auditory periphery, or cochlea, progressively deteriorates, resulting in mechanical damage to the auditory system. The damage caused by such intense noise is pervasive and affects virtually all of the cellular subsystems of the inner ear (sensory cells, nerve endings, and vascular supply). Sounds such as gunfire and certain industrial impacts with peak levels greater than 125 dB are especially hazardous to the cochlea and very likely to cause a permanent threshold shift (PTS) (Henderson and Hamernik 1995).

Although the mechanisms underlying the audibility of weak sounds and the temporary or permanent threshold shifts caused by moderate to intense sounds are biologically very different, they do share some common characteristics: level dependence, frequency dependence, leaky integration, and the temporal effects discussed in the following sections.

4.3.1.1 Experiments on stationary sounds

Frequency-place theory and half-octave shift: The frequency-place theory proposed by von Bekesy states that the traveling wave forms a displacement envelope on the basilar membrane that peaks at different places along the cochlea according to the excitation frequency (Bekesy 1960). As a consequence of the place theory, if the stimulus, a pure tone or narrowband noise, is low in level, audibility is usually restricted to a narrow frequency range near the stimulus frequency. At high exposure levels, the signal energy at the center frequency and the temporary threshold shift (TTS) are also correlated, except that the TTS spreads predominantly toward the high frequencies and


the maximum hearing loss typically shifts to a point one-half octave above the center frequency of the stimulus (Quaranta et al. 1998).

Level-dependent asymptotic threshold shift (ATS): For long exposures, e.g., 16-24 hours, temporary threshold shifts (TTS) increase for about 8 h and then reach a plateau called the asymptotic threshold shift (ATS), which is sometimes interpreted as the upper value of the PTS that can occur from the same exposure sustained for years (Quaranta et al. 1998). As with audibility thresholds, there is a clear relation between the sound level and the ATS. Systematic experiments on TTS by Mills et al. (1979) showed that, in the frequency region of greatest loss, the ATS increased about 1.7 dB for every 1 dB increase in noise level above a critical level, which can be described as

ATS = 1.7 (OBL − C)    (5-1)

where OBL is the octave-band noise level in decibels and C is the frequency-dependent critical level (Mills et al. 1979, Melnick 1991).

Exponential growth and recovery of TTS: One of the most intensive studies of TTS was conducted by Mills and his colleagues. In their study, 60 human subjects were divided into 8 groups and exposed in a diffuse sound field for 16-24 h to an octave-band noise centered at 4, 2, 1, or 0.5 kHz. Sound pressure levels were varied from 75 dB to 88 dB across exposure occasions. Temporary threshold shifts (TTS) increased for about 8 h and then reached a plateau or asymptote. After termination of the exposure, recovery to within 5 dB of pre-exposure thresholds was achieved within 24 h or less. Recovery can be described by a simple exponential function with a time constant of 7.1 h. Mills et al. also summarized the relation between TTS and exposure


duration as simple exponential functions using their own data as well as previously reported TTS data (Mills et al. 1979). Similar to the case of soft sounds, the growth and recovery of the TTS caused by moderate- to high-level octave-band noise is also a temporal integration process: TTS can be characterized by leaky integrators with different time constants for growth and recovery, as in the case of audibility. Given the analogy between audibility and TTS in level dependence, frequency dependence, and temporal integration for stationary sounds, we believe that our higher-order audibility prediction model could be extended to TTS prediction for nonstationary sounds using a leaky integrator with different time constants for growth and recovery.

4.3.1.2 Inadequacy of existing safety standards for nonstationary sounds

The data sets used to determine the degree of hearing loss caused by noise were collected in the late 1960s and early 1970s in predominantly white, adult male populations exposed to industrial noise. These data were instrumental in developing standards (ISO 1999, ANSI S3.44-1996) describing the relationship between noise exposure and noise-induced permanent threshold shift (NIPTS), regulations by the Occupational Safety and Health Administration (OSHA 1983), and safety recommendations by the National Institute for Occupational Safety and Health (NIOSH 1998). The current understanding from these data is that a maximum exposure of 85 dB, A-weighted (dBA), for an 8-hour daily exposure over a working lifetime of 40 years results in roughly 8% of exposed persons having a hearing handicap, owing to the wide variability in susceptibility to noise-induced hearing loss across individuals (Fligor 2006).

Exchange rate of exposure level and duration: As mentioned in Chapter 2, for stationary signals shorter than 200 ms, there is a clear duration-intensity trade-off for


audibility threshold and sound duration. Risk assessment for hearing loss in current standards is also equalized using a similar exchange rate: the increment (or decrement) in decibels that requires halving (or doubling) the allowable exposure time. The exchange rate allows a time-weighted average (TWA) exposure level to be determined, so that the risk of noise-induced permanent threshold shift (NIPTS) can be compared despite different durations and levels of exposure. Currently, OSHA limits the maximum permissible exposure to 90 dBA for an 8-hour TWA, using a 5-dB exchange rate. While not exceeding this permissible exposure limit, workers exposed to 85 dBA, 8-hour TWA, must be enrolled in a hearing conservation program, including annual audiometry, use of hearing protection devices, and education on the risks and prevention of NIHL. This 85 dBA, 8-hour TWA (with the 5-dB exchange rate) is considered the action level by OSHA (OSHA 1983). In contrast, NIOSH recommends an exposure limit of 85 dBA for an 8-hour TWA using a 3-dB exchange rate (NIOSH 1998). Most developed western countries limit worker exposure to 85 dBA, 8-hour TWA, using the 3-dB exchange rate, and the 3-dB exchange rate is exactly the equivalent-energy calculation (Fligor 2006).

Energy-based PTS prediction in the current standard (ISO 1990:1999): Similar to the energy detectors used for audibility threshold prediction, the measure of exposure to noise for a population at risk is the averaged A-weighted sound exposure (time-integrated squared sound pressure) E_{A,T}, and the related equivalent continuous A-weighted sound pressure level L_{Aeq,T}, over an average working day (assumed to be of 8 h duration), for a given number of years of exposure; both are second-order, energy-based measures restricted to individual octave bands.
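The exchange-rate rule described above can be written as a one-line formula: the permissible duration halves for every exchange-rate increment above the criterion level. A small sketch (function and parameter names are ours):

```python
def permissible_hours(level_dba, criterion_db=85.0, exchange_db=3.0, base_hours=8.0):
    """Allowed daily exposure under an exchange-rate rule: halve the time
    for every `exchange_db` increase above the criterion level. With a
    3-dB rate this is the equal-energy rule recommended by NIOSH; OSHA
    uses a 90 dBA criterion with a 5-dB rate."""
    return base_hours / 2.0 ** ((level_dba - criterion_db) / exchange_db)

print(permissible_hours(88.0))                                       # 4.0 h (NIOSH)
print(permissible_hours(100.0, criterion_db=90.0, exchange_db=5.0))  # 2.0 h (OSHA)
```

The comparison makes the difference between the two rules concrete: at high levels the 3-dB (equal-energy) rule shortens the allowed exposure far faster than OSHA's 5-dB rule.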


The prediction method presented in the ISO standard is based primarily on data collected with essentially broadband, steady, nontonal noise. For tonal noise and/or impulsive/impact noise, the standard only suggests that they are about as harmful as a steady nontonal noise approximately 5 dB higher in level (ISO 1990:1999). This guidance does not take the specific signal time structure into consideration and offers very limited help for practical use.

4.3.1.3 Hearing risk associated with signals of different time structures

As verified in Chapter 3, for nonstationary sounds of equivalent energy, the more impulsive the signal, the lower its audibility threshold, and thus the easier it is for the human auditory system to detect. There is growing evidence that hearing loss induced by intense sounds is also affected by the signal temporal structure (Strasser et al. 1999, Hamernik et al. 2001 and 2003): for signals of the same energy above some critical level, the more impulsive the signal, the more damaging it is to the human auditory system.

TTS experiments on human subjects: To disclose the actual physiological responses to exposures that varied in temporal structure and semantic quality, Strasser et al. conducted a series of tests in which the physiological costs associated with the various exposures were measured audiometrically. In this study, 10 subjects with normal hearing participated in a test series with four exposures (white noise, industrial noise, heavy metal music, and classical music), all presented at the same level, 94 dB(A), for one hour. The physiological responses to the four exposures were assessed by the integrated restitution TTS (IRTTS), computed as the integral of the regression function TTS(t) from 2 min after the exposure to the point t. The IRTTS is a numeric value for


the total threshold shift (in dB x min) that has to be "paid" by the hearing, in physiological cost, for the exposure. The results showed that the industrial noise, with an IRTTS value of 631 dB-min relative to the 424 dB-min measured for white noise, brought about an increase of approximately 50% in total physiological cost. Heavy metal music was also associated with a very high physiological cost (637 dB-min), while classical music was accompanied by the lowest physiological cost (160 dB-min). These results indicate that time structure does affect noise-induced hearing risk and that the energy-equivalent approach can lead to dangerously wrong assessments of hearing risk (Strasser et al. 1999).

PTS experiments on animal subjects: The most systematic studies of hearing loss induced by noise signals of equivalent energy were conducted by Hamernik and his colleagues. In their studies, groups of chinchillas were exposed to either a Gaussian noise or one of several non-Gaussian noises at 100 dB(A) SPL. All exposures had the same total energy and approximately the same flat spectrum, but their statistical properties were varied to yield a series of exposure conditions ranging across a continuum from Gaussian through various non-Gaussian conditions to pure impact-noise exposures. Trauma, as measured by asymptotic and permanent threshold shifts (ATS, PTS) and by sensory cell loss, was greater for all of the non-Gaussian exposure conditions. In their chinchilla model, PTS and outer hair cell loss were monotonically related to the signal kurtosis β(t). In addition, the frequency-specific OHC loss produced by the non-Gaussian noise exposures was well correlated with the frequency-specific kurtosis β(f). In conclusion, despite having the same signal energy and spectra, the more transient the signal, the more serious the trauma that developed in the animal subjects (Hamernik et al. 2001 and 2003).
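The kurtosis statistic used by Hamernik and colleagues to quantify impulsiveness is simply the normalized fourth moment, which ties these exposure studies back to the higher-order (fourth-order) integration of Section 4.1. A minimal sketch (the sparse "impulsive" signal is a synthetic stand-in, not their stimulus):

```python
import numpy as np

def kurtosis(x):
    """Sample kurtosis E[(x - mu)^4] / sigma^4: about 3 for Gaussian
    signals, much larger for impulsive ("peaky") signals."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    return float(np.mean((x - mu) ** 4) / var ** 2)

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100000)
# same samples, but gated so that only ~1% survive: rare large bursts
impulsive = gaussian * (rng.uniform(size=100000) < 0.01)

print(round(kurtosis(gaussian), 1))            # close to 3
print(kurtosis(impulsive) > kurtosis(gaussian))
```

For a Bernoulli-gated Gaussian with survival probability q, the kurtosis is approximately 3/q, so gating to 1% pushes it near 300: the same energy delivered in rare bursts is flagged immediately by the fourth moment, mirroring the trauma results above.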


To sum up, for stationary sounds, audibility and hearing loss both reflect a temporal integration process in individual octave bands, with exponential growth and recovery. Existing prediction models for audibility and hearing loss are both based on the equivalent-energy concept: the same energy causes the same detectability or damage. For nonstationary sounds, both audibility and hearing loss are affected not only by the signal energy but also by the signal temporal structure.

4.3.2 Extended Model Suggested for Hearing Loss Prediction

Given the similarity between audibility and hearing loss prediction for stationary sounds, extending our higher-order integration model to hearing loss prediction can potentially lead to more accurate predictions of the hearing loss induced by nonstationary signals. A possible framework to predict the lowest signal level that causes hearing loss for a given audio signal is illustrated in Figure 4-4. The main differences between this hearing loss prediction model in Figure 4-4 and the audibility prediction model in Figure 4-1 lie in the design of the low-pass filter and the predetermined thresholds for the individual auditory channels, which are shaded in Figure 4-4. As mentioned in the previous section, compared with audibility, temporary or permanent hearing loss takes much longer to build up; hence, a temporal integration process with a longer time constant needs to be embedded in the design of the low-pass filter. In addition, the predetermined thresholds for each auditory channel need to be modified based on experimental hearing loss data. Given more subjective data, the order of the higher-order temporal integration can also be optimized for hearing loss prediction.


Figure 4-4. Block diagram of the higher-order integration model for predicting the threshold of a given audio signal at which possible hearing loss will be induced (per-channel chain: ERB filter, |x|^n, LPF, (x)^{1/n}, comparison against threshold TH_i, and an OR decision across channels).

Figure 4-5. Block diagram of the higher-order integration model for predicting the possible hearing risk induced by a given audio signal at a fixed signal level (per-channel chain: ERB filter, |x|^n, LPF, (x)^{1/n}, mapping function F_i(·), and per-channel hearing loss HL_i).


Alternatively, a similar framework can also predict the potential hearing risk induced by a given audio signal at a preselected level. A possible modified model is illustrated in Figure 4-5. In this model, the output of the higher-order integrator is fed into a predetermined function F_i(·), and the potential hearing loss for each critical band induced by the given audio signal is calculated accordingly. With sufficient subjective data, the relation between the output of the higher-order integration and the consequent hearing loss can be formulated for each critical band.

4.4 Summary

We propose a higher-order spectro-temporal integration model for predicting the audibility thresholds of audio signals. By using higher-order temporal integration in individual ERB bands and making the decision on the most audible ERB band, our basic prediction model outperforms existing loudness-based and energy-based models. With more data, we will determine the precise order of the model. For better generalization of our basic model, a spectral integration process based on probability summation across all auditory channels is incorporated into the basic model. The prediction performance of this extended model is shown to be very close to that of our basic model on our experimental data. The analogy between audibility and hearing loss prediction has been summarized. Given the similarity shared by audibility and hearing loss prediction, two hearing loss prediction models extended from our basic higher-order integration model have been proposed. The refinement and validation of these two models require more subjective data.


CHAPTER 5
HIGHER-ORDER LEVEL ESTIMATION FOR AUDIO DYNAMIC RANGE CONTROL

In Chapter 2, the human auditory system has been shown to be more sensitive to transient signals than stationary signals given the same energy. The conventional dynamic range control (DRC) algorithm is based on second-order level estimates (i.e., energy or root-mean-squared value). Since the second-order estimate cannot adequately characterize the auditory perception of nonstationary audio signals, the conventional second-order DRC algorithm is extended using a higher-order level estimate in this study. The perceptual quality and the dynamic range reduction effectiveness are evaluated for both second-order and higher-order DRC algorithms. Evaluation results show that higher-order DRC algorithms with moderate-size analysis windows offer the best balance of perceptual quality and dynamic range reduction.

5.1 Dynamic Range Control for Hearing Protection

Dynamic range control (DRC) has been widely used in the design of hearing aids, radio and TV broadcasting, teleconferencing, and other acoustical applications. As modern personal media players (PMPs) with mass storage capacities, long battery life, and high output levels become more and more popular, music-induced hearing loss is becoming more of a social and clinical problem. Listeners often set volume levels based on the intelligibility or detectability of the softest sounds in the audio signals. For audio signals with wide dynamic range at a given volume level, when the softest sounds are adequately audible, the loudest sounds might be overwhelmingly intense. Studies have shown that the loudest and most transient sounds cause the most hearing risk. In order to determine the actual physiological responses to exposures that varied with respect to their time structures, Strasser and his colleagues tested ten normal-hearing human


subjects with a series of four exposures (white noise, industrial noise, heavy metal music, and classical music), all characterized by the same level of 94 dB(A). The experimental data showed that the heavy metal music induced the largest integrated temporary threshold shifts in the human subjects, followed by the industrial noise, white noise, and classical music. These results indicate that acoustical signals rich in intense transients induce the most hearing risk at a given sound level (Strasser et al. 1999). Hamernik and his colleagues conducted the most systematic studies on hearing loss induced by noises of equivalent energy. In their studies, groups of chinchillas were exposed to a family of noise exposures that had the same sound level of 100 dB(A) and approximately the same flat spectra but very different time structures. Their results showed that the more transient the signal was, the more serious the trauma that developed in the animal subjects (Hamernik and Qiu 2001). Since intense transient signals are more damaging to the auditory system at a given sound pressure level, a careful DRC algorithm that balances dynamic range reduction against perceptual concerns would be beneficial for protecting the hearing of music listeners. As recorded audio signals are mostly used for entertainment purposes, a primary concern in reducing the dynamic range of audio signals is the consequent quality degradation. Conventional DRC algorithms based on second-order measurements (root-mean-squared value or energy) normally provide a tradeoff between perceptual quality and dynamic range reduction. This study will extend the conventional second-order DRC using a higher-order level estimate and further investigate the impact of temporal structure on the optimal order selection for a given audio signal.


In the rest of this chapter, our higher-order DRC system is first introduced briefly. The objective audio quality and the dynamic range reduction for both our proposed approach and the existing approach are then evaluated. The relation between the optimal DRC order and the temporal structure of a given audio signal is investigated as well. Finally, Section 5.3 summarizes the results and offers suggestions for future work.

5.2 Higher-Order Dynamic Range Control

5.2.1 Model Description

There are generally two types of topologies for DRC systems: feedback and feedforward. Feedback systems use the output signal to control the gain. Since the signal levels must reach the system output before the compensation gains can be generated, feedback systems allow overshoots to occur at the output and thus do not handle transients well. Feedforward systems overcome this limitation by using the input signal levels to control the gains and better suppress the transients. Feedforward designs also have an advantage in system stability (Schneider and Brennan 1997). Hence, the feedforward topology is chosen for our DRC system. The block diagram of the complete feedforward DRC system used in this study is shown in Figure 5-1.

Figure 5-1. Block diagram of the dynamic range control system (audio frame → integrator → low-pass filter → max → gain → output frame).

Pitch and timbre are the most important aspects of music perception. Variations in pitch are central to our experience of melody, harmony, and key. Timbre allows us to


distinguish one instrument from another. The physical variables that contribute to our experience of timbre include the spectral energy distribution (i.e., bandwidth and concentration), the temporal envelope, and the transient components of a tone. Multiple-frequency-channel DRC is widely used for speech signals. However, music requires that the balance between the lower-frequency fundamental energy and the higher-frequency harmonic energy remain intact to achieve optimal sound quality. An imbalance in the amplification of low- and high-frequency channels will always affect timbre, may also lead to problems for musical pitch perception, and may ultimately distort the intent of the musician (Chasin 2003). As a result, single-frequency-channel processing is more common for music because it preserves timbre: the short-term spectral distribution is unaltered. Some studies attempted to design multi-channel DRC systems for music signals to achieve more flexibility in frequency equalization, but they still needed to ensure that adjacent channel controllers were set similarly (Schmidt and Rutledge 1996). For simplicity, we chose to use a single frequency band for our DRC system. The input/output mapping function is usually determined by three aspects: the compression ratio, the threshold kneepoint (TK) where the DRC algorithm starts, and the smoothness of the transition between different linear segments. To simplify the system, we use only the TK point to control the output dynamic range; in other words, audio frames with levels higher than the TK point are reduced to a fixed level, otherwise they are linearly amplified. The simplified input-output mapping function is shown in Figure 5-2. To calculate the TK point, we first construct a histogram of the audio levels across all the frames, obtain the cumulative density function (CDF), and pick the level at a certain cumulative probability as the TK point. The higher the selected


cumulative probability, the less the signal is compressed, which normally leads to better perceptual quality and less dynamic range reduction.

Figure 5-2. Illustration of the input-output mapping function for the DRC algorithm in this study (linear region below the threshold kneepoint (TK); compressed region above it).

The key part of our DRC design is the level estimation in the integrator block for each frame, shaded in Figure 5-1. Conventional dynamic range controllers estimate the compensation gains based on a second-order measurement: the root-mean-squared (RMS) value or energy. Our study in Chapter 3 has shown that the human auditory system is more sensitive to impulsive signals than stationary signals given the same energy. There is also growing evidence that hearing loss induced by intense sounds is affected by the signal temporal structure (Hamernik and Qiu 2001). These results cannot be adequately explained by energy-based (second-order) models. On the other hand, our higher-order spectro-temporal integration model outperforms the existing second-order models in predicting the audibility of nonstationary audio signals (Chapter 4), and higher-order statistics such as kurtosis have been shown to be correlated with the hearing loss induced by nonstationary noise (Hamernik and Qiu


2001). Inspired by the application of higher-order processing in auditory modeling of nonstationary signals, we replaced the second-order level estimate in the conventional DRC with a higher-order level estimate,

S(k) = [ Σ_{i=1}^{M} x²(i) ]^{1/2}  →  S(k) = [ Σ_{i=1}^{M} |x(i)|^N ]^{1/N}    (5-1)

where S(k) is the instantaneous level for the k-th frame, x(i) is the i-th sample in a given frame, M is the number of samples in each frame, and N (N > 2) is the higher order of the nonlinear operation. The instantaneous levels across frames are then smoothed by a first-order low-pass filter as follows:

S̄(n) = (1 − α) S̄(n−1) + α S(n)    (5-2)

where S̄(n) is the smoothed output of the low-pass filter for the n-th frame, and α is a constant related to the low-pass filter time constant T_C = 5 ms:

α = 1 − exp(−T_i / T_C)    (5-3)

where T_i is the time interval between the start points of successive frames. The final estimated level for the n-th frame is determined by the maximum of the smoothed level S̄(n) and the instantaneous level S(n) for each value of n:

L(n) = max( S̄(n), S(n) )    (5-4)

which ensures that the estimated levels across frames are smoothed for relatively stationary frames while impulsive frames can still be quickly suppressed. The hope is that the intense transient parts of the audio signals can be further reduced by the higher-order nonlinear processing while the degradation in perceptual quality is minimized. The objective quality evaluation is addressed in Section 5.2.2.
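The level estimator of Equations 5-1 through 5-4 and the CDF-based TK selection can be sketched directly. This is a minimal sketch: the frame parameters in the usage below are illustrative, and `tk_point` uses a percentile as a stand-in for the histogram/CDF lookup described above.

```python
# Sketch of the frame-level estimator (Eqs. 5-1 to 5-4) and the TK point
# chosen at a target cumulative probability of the level distribution.
import numpy as np

def frame_levels(x, frame, hop, order=4.0):
    """Instantaneous higher-order level S(k) per frame (Eq. 5-1)."""
    starts = range(0, len(x) - frame + 1, hop)
    if np.isinf(order):                       # order -> infinity: peak level
        return np.array([np.max(np.abs(x[s:s + frame])) for s in starts])
    return np.array([np.sum(np.abs(x[s:s + frame]) ** order) ** (1.0 / order)
                     for s in starts])

def smoothed_levels(S, hop_seconds, tc=0.005):
    """Eqs. 5-2 to 5-4: first-order smoothing, then max with S(n)."""
    alpha = 1.0 - np.exp(-hop_seconds / tc)   # Eq. 5-3, T_C = 5 ms default
    Sbar = np.zeros_like(S)
    prev = 0.0
    for n, s in enumerate(S):
        prev = (1.0 - alpha) * prev + alpha * s   # Eq. 5-2
        Sbar[n] = prev
    return np.maximum(Sbar, S)                    # Eq. 5-4

def tk_point(levels, cum_prob=0.9):
    """TK point: the level at a chosen cumulative probability."""
    return np.percentile(levels, 100.0 * cum_prob)
```

The infinite-order branch reduces to a per-frame peak detector, consistent with the "order = inf" curves evaluated later in this chapter.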


5.2.2 Evaluation and Results

Forty-five audio recordings were selected from the European Broadcasting Union (EBU) sound quality assessment materials (SQAM) (European Broadcasting Union 2008), which cover a variety of audio contents such as single instruments, vocals, solo instruments, vocal and orchestra, orchestra, and pop music. All the audio recordings were originally sampled at 44100 Hz and converted to a 48000 Hz sample rate. The audio signals are first segmented into 50%-overlapped frames with a predetermined frame size ranging from 128 samples/frame (2.7 ms/frame) to 4096 samples/frame (85.3 ms/frame). The signal levels for each frame are calculated using orders N = 2, 4, 6, and infinity as shown in Equation 5-1 and are then low-pass filtered. To calculate the threshold kneepoint (TK), the corresponding cumulative probability functions of the signal levels across all frames are obtained through their histograms. The TK point is varied from the signal level at 10% cumulative probability to the signal level at 95% cumulative probability with a spacing of 5%. The processed audio signals are normalized to the same overall root-mean-squared value as the corresponding unprocessed audio signals. The peak-to-RMS ratio reduction relative to the unprocessed signal is calculated every 2 seconds and averaged. The perceptual quality of the processed audio signals is evaluated by the basic version of the ITU standard objective measure of perceived audio quality, a.k.a. PEAQ (International Telecommunication Union 2001, Kabal 2002). The basic version of this standard used in this study (Figure 5-3) calculates a number of psychoacoustical measures, including the signal bandwidth, noise loudness, noise-to-mask ratio, modulation difference, detection probability, and harmonic structure of the error, based on the excitation patterns from auditory preprocessing. All of these psychoacoustical


measures are input to a trained neural network with a single hidden layer of three nodes to give a measure of the quality difference between the unprocessed signal and the DRC-processed signal. The objective difference grades (ODG) output by the neural network range from 0 to −4, where 0 corresponds to an imperceptible impairment and −4 to an impairment judged as very annoying.

Figure 5-3. Block diagram of ITU-R BS.1387-1 perceptual evaluation of audio quality (PEAQ): the original and compressed signals pass through auditory preprocessing to excitation patterns, from which psychoacoustical measures (signal bandwidth, noise loudness, noise-to-mask ratio, modulation difference, detection probability, harmonic structure of the error) feed a cognitive model that outputs the objective difference grade (ODG).

The dynamic range reduction of each processed audio signal is interpolated for objective difference grades from −0.1 to −1.5 with a spacing of 0.05. The interpolated dynamic range reduction values of all the audio signals for a given analysis window size are averaged, and an operational curve of average dynamic range reduction versus objective difference grade is generated for each analysis window size. The reason why objective quality scores of −0.1 to −1.5 are of interest in this study is that DRCs using long windows tend to operate in the region of high quality and low dynamic range reduction. For DRCs using really long windows, the low-quality region of the operational curves is heavily dependent on extrapolation and thus not very reliable. To compare


the operational curves of dynamic range reduction and audio quality across window sizes, the objective difference grade range of −0.1 to −1.5 is a shared operational region of DRCs using different window sizes. In addition, processed audio signals with objective difference grades in this range are generally considered to have imperceptible to slightly perceptible degradation relative to the unprocessed signals, which is necessary for audio entertainment. The operational curves of average dynamic range reduction and objective audio quality are shown in Figure 5-4. In each subplot, for all four types of DRC algorithms, from second order to infinite order, the objective perceptual quality generally decreases as the signal dynamic range decreases. When the dynamic range is reduced severely, the audio quality is seriously degraded; when the dynamic range is changed very little, the audio quality is almost the same as the original. Between these two extremes, the leftmost DRC offers the best balance of dynamic range reduction and objective perceptual quality compared to the other DRC algorithms. The interpretation of Figure 5-4 is that, given the same objective perceptual quality, the signals processed by the leftmost DRC have a larger dynamic range reduction than the others, which can better protect the hearing of audio listeners. On the other hand, given the same dynamic range reduction between the input and output, the signals processed by the leftmost DRC offer better perceptual quality than the other signals, which can better satisfy the fidelity requirement.
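The dynamic-range axis of these curves can be computed as sketched below: the peak-to-RMS ratio in dB over 2 s blocks, averaged, and then differenced between processed and unprocessed signals. The block length follows the text; the clipped test signal in the usage is illustrative, not one of the SQAM items.

```python
# Sketch of the dynamic-range metric: average peak-to-RMS ratio (dB) over
# consecutive 2 s blocks, compared between processed and original signals.
import numpy as np

def peak_to_rms_db(x, fs, block_seconds=2.0):
    """Average peak-to-RMS ratio (dB) over consecutive blocks."""
    n = int(block_seconds * fs)
    ratios = []
    for s in range(0, len(x) - n + 1, n):
        b = x[s:s + n]
        rms = np.sqrt(np.mean(b ** 2))
        if rms > 0:
            ratios.append(20.0 * np.log10(np.max(np.abs(b)) / rms))
    return float(np.mean(ratios))

def dynamic_range_change(processed, original, fs):
    """Negative values mean the dynamic range was reduced."""
    return peak_to_rms_db(processed, fs) - peak_to_rms_db(original, fs)
```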


Figure 5-4. Operational curves of average dynamic range reduction and objective audio quality for given analysis window sizes (DRC orders 2, 4, 6, and infinity). A) 2.7 ms/frame. B) 5.3 ms/frame. C) 10.7 ms/frame. D) 21.3 ms/frame. E) 42.7 ms/frame. F) 85.3 ms/frame.


Figure 5-4. Continued (C: 10.7 ms/frame; D: 21.3 ms/frame).


Figure 5-4. Continued (E: 42.7 ms/frame; F: 85.3 ms/frame).

Depending on the window size used for DRC segmentation, the optimal order to achieve the best balance of dynamic range reduction and objective audio quality can differ. As shown in Figure 5-4(A), given a window size of 2.7 ms/frame, the leftmost curve, which achieves the best balance between dynamic range


reduction and audio quality, is the one processed by the second-order DRC algorithm (blue line). As the window size increases, the optimal order for DRC level estimation also shifts higher. With window sizes of 5.3 ms/frame and 10.7 ms/frame, audio signals processed by the fourth-order (green line) and sixth-order (red line) DRC outperform the signals processed by the second-order DRC in balancing dynamic range reduction and audio quality (Figure 5-4(B) and (C)). When the window size is raised to 21.3 ms/frame and above, the signal processed by the infinite-order DRC (a signal peak detector) surpasses the signals processed by the fourth- and sixth-order DRCs and becomes the best in the joint evaluation of dynamic range reduction and audio quality, as illustrated in Figure 5-4(D) and (E). Table 5-2 lists the impact of window size and DRC order on the individual psychoacoustic measures used in PEAQ at a given dynamic range reduction of 1 dB. These psychoacoustic measures include noise-to-mask ratio, windowed modulation difference, average block distortion, harmonic structure of the error, average modulation difference, distortion loudness, maximum filtered probability of detection, and relatively disturbed frames. All the measure values are normalized to [0, 1] for convenient comparison. The corresponding indices are explained in Table 5-1.

Table 5-1. Psychoacoustic measures used in the PEAQ basic version and the corresponding indices used in Table 5-2

Index  Psychoacoustic measure
M1     Noise-to-mask ratio
M2     Windowed modulation difference
M3     Average block distortion
M4     Harmonic structure of the error
M5     Average modulation difference 1
M6     Average modulation difference 2
M7     Distortion loudness
M8     Maximum filtered probability of detection
M9     Relatively disturbed frames


Table 5-2. Normalized psychoacoustic measures for a given dynamic range reduction of 1 dB with different window sizes

Window size = 2.7 ms/frame
Order  M1      M2      M3      M4      M5      M6      M7      M8      M9
2      0.3687  0.1045  0.6394  0.0194  0.1753  0.0288  0.0404  0.9959  0.3187
4      0.3699  0.1015  0.6327  0.0230  0.1730  0.0287  0.0448  0.9971  0.3073
6      0.3781  0.1040  0.6372  0.0247  0.1784  0.0297  0.0482  0.9960  0.3124
Inf    0.4001  0.1140  0.6559  0.02956 0.1991  0.0335  0.0569  0.9960  0.3327
Best   2       4       4       2       4       4       2       2       4

Window size = 5.3 ms/frame
Order  M1      M2      M3      M4      M5      M6      M7      M8      M9
2      0.3478  0.1355  0.6990  0.0154  0.2231  0.0373  0.0304  0.9913  0.3187
4      0.3267  0.1320  0.6872  0.0175  0.2142  0.0375  0.0314  0.9960  0.2812
6      0.3300  0.1373  0.6854  0.0191  0.2226  0.0398  0.0335  0.9956  0.2853
Inf    0.3556  0.1550  0.7008  0.0213  0.2568  0.0486  0.0402  0.9963  0.3181
Best   4       4       6       2       4       2       2       2       4

Window size = 10.7 ms/frame
Order  M1      M2      M3      M4      M5      M6      M7      M8      M9
2      0.3731  0.1475  0.7329  0.0117  0.2452  0.0402  0.0238  0.9893  0.3330
4      0.3372  0.1383  0.7159  0.0131  0.2306  0.0396  0.0223  0.9682  0.2923
6      0.3315  0.1404  0.7098  0.0133  0.2350  0.0413  0.0228  0.9801  0.2877
Inf    0.3281  0.1453  0.7069  0.0142  0.2514  0.0471  0.0243  0.9731  0.2858
Best   Inf     4       Inf     2       4       4       4       4       Inf

Window size = 21.3 ms/frame
Order  M1      M2      M3      M4      M5      M6      M7      M8      M9
2      0.4368  0.1814  0.7786  0.0086  0.2614  0.0453  0.0232  0.9960  0.3615
4      0.3547  0.1362  0.7228  0.0129  0.2167  0.0360  0.0196  0.9849  0.2835
6      0.3388  0.1317  0.7075  0.0137  0.2132  0.0358  0.0194  0.9797  0.2600
Inf    0.3279  0.1374  0.6993  0.0160  0.2276  0.0407  0.0209  0.9671  0.2416
Best   Inf     6       Inf     2       6       6       6       Inf     Inf

Window size = 42.7 ms/frame
Order  M1      M2      M3      M4      M5      M6      M7      M8      M9
2      0.4713  0.1794  0.7975  0.0073  0.2558  0.0440  0.0217  0.9995  0.3783
4      0.3702  0.1278  0.7399  0.0118  0.2124  0.0329  0.0184  0.9976  0.2885
6      0.3423  0.1229  0.7182  0.0133  0.2046  0.0322  0.0180  0.9917  0.2547
Inf    0.3156  0.1308  0.6912  0.0153  0.2210  0.0377  0.0201  0.9916  0.2365
Best   Inf     6       Inf     2       6       6       6       Inf     Inf


Table 5-2. Continued

Window size = 85.3 ms/frame
Order  M1      M2      M3      M4      M5      M6      M7      M8      M9
2      0.5125  0.1649  0.8301  0.0046  0.2385  0.0579  0.0403  1.0000  0.4848
4      0.4314  0.1255  0.7988  0.0071  0.1927  0.0302  0.0159  1.0000  0.3766
6      0.3854  0.1089  0.7741  0.0085  0.1789  0.0271  0.0147  1.0000  0.3189
Inf    0.3499  0.1070  0.7385  0.0111  0.1750  0.0275  0.0148  0.9918  0.2657
Best   Inf     Inf     Inf     2       Inf     6       6       Inf     Inf

Since the dynamic range reduction is fixed at 1 dB, the optimal DRC order for an individual psychoacoustic measure with a given window size is determined by the lowest measure value; these optima are listed in the last row of each block of Table 5-2. As can be seen in Table 5-2, as the window size increases, the optimal DRC orders for the individual psychoacoustic measures generally increase, except for the harmonic structure of the error (M4). This is consistent with the relation between the window size and the optimal DRC order for audio quality and dynamic range reduction shown in Figure 5-4. In general, the lower the measure value, the less impact on audio quality degradation, so the consistency between the individual psychoacoustic measures and the overall audio quality scores is not surprising. This relation between the window size and the optimal order can be explained by the signal stationarity within the window duration. For short windows of a few milliseconds, the signal stays relatively stationary and the signal dynamic range is limited within each window; hence, the second-order measure is sufficient to characterize the signal in each window. When the window size increases, more transient components are likely to be included within each window and the signal becomes more and more dynamic; as a result, a higher order is needed to measure the nonstationary signal within each window.
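The stationarity argument can be checked numerically. The sketch below uses a click-plus-noise test signal (illustrative, not one of the SQAM items) and measures the average dB gap between the peak (infinite-order) level and the RMS-style level across frames; the gap is small when frames are short enough to be locally stationary and grows for long frames that pool transients with noise, which is where the higher orders pay off.

```python
# Demonstration: how far infinite-order and second-order frame levels
# disagree as a function of frame size, for a signal with sparse clicks.
import numpy as np

def level(frame, order):
    """Higher-order frame level; order = inf reduces to the peak."""
    if np.isinf(order):
        return np.max(np.abs(frame))
    return np.sum(np.abs(frame) ** order) ** (1.0 / order)

def order_disagreement(x, frame):
    """Mean dB gap between peak and RMS levels across non-overlapped frames."""
    gaps = []
    for s in range(0, len(x) - frame + 1, frame):
        b = x[s:s + frame]
        rms = level(b, 2.0) / np.sqrt(len(b))   # normalize sum-square to RMS
        peak = level(b, np.inf)
        if rms > 0:
            gaps.append(20.0 * np.log10(peak / rms))
    return float(np.mean(gaps))
```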


To compare the operational curves of dynamic range reduction and objective audio quality across different window sizes, the best operational curve for each window size in Figure 5-4 is selected and reproduced in Figure 5-5. As can be seen from Figure 5-5, neither the shortest signal segmentations (e.g., 2.7 ms/frame and 5.3 ms/frame) nor the longest signal segmentation (e.g., 85.3 ms/frame) gives the best joint performance of dynamic range reduction and audio quality. The DRC algorithms with very short window sizes are effective in reducing the dynamic range but tend to degrade the audio quality severely; on the other hand, the DRC algorithms using extremely long window sizes are good at preserving the audio quality but are limited in compressing the dynamic range. The most desirable window sizes for DRC signal segmentation are from 10.7 ms/frame to 42.7 ms/frame, and the corresponding optimal DRC orders range from sixth order to infinite order. In general, for window sizes from 10 ms/frame to 50 ms/frame, audio signals can be very nonstationary. For these window sizes, higher-order level estimation is necessary to optimize the tradeoff between dynamic range reduction and audio fidelity. In practice, given specifications on affordable delay time, buffer sizes, and computation capacity, there may be occasions when window sizes are fixed for DRC processing and a suboptimal window size must be used. In this case, the DRC level estimation order should be chosen based on the average signal temporal structure within the given analysis window. The ITU objective measure of perceived audio quality, PEAQ, is specifically designed for reliably evaluating the perceptual quality of audio codecs and may not be optimal for this DRC application. The psychoacoustical aspects considered in this


algorithm are targeted at the artifacts introduced by bit-rate reduction in audio encoding and decoding systems. When PEAQ is applied to audio dynamic range controllers, it gives relevant difference grades for processed signals and original signals, which is valuable for a preliminary study but could be biased relative to subjective evaluation. More subjective experiments are required to show that these objective results are consistent with subjective responses.

Figure 5-5. Operational curves with the best compromise of dynamic range reduction and audio quality for window sizes from 2.7 ms/frame to 85.3 ms/frame (2.7 ms/frame: order 2; 5.3 ms/frame: order 4; 10.7 ms/frame: order 6; 21.3, 42.7, and 85.3 ms/frame: infinite order).

5.3 Summary

The conventional second-order DRC is extended to a higher-order DRC algorithm. By examining a variety of audio signals using level estimates of different orders, we find that the optimal level estimate for the DRC algorithm, the one achieving the best balance of perceptual quality and dynamic range reduction, depends on the signal temporal structure within the predetermined analysis windows: the longer the analysis window and the more transient the signal in each window, the higher the order of DRC that is


preferred. Higher-order DRC with a moderate window size gives the best performance in general. Further experiments are required to show that the objective results presented in this chapter are consistent with human subjective tests.


CHAPTER 6
CONCLUSION

The aim of this dissertation is to study the impact of temporal structure on human auditory sensitivity to nonstationary sounds and its application to audio dynamic range control. There are three major contributions, given as follows. First, a systematic subjective listening test has been designed and conducted to investigate how temporal structure alone affects human auditory detection of nonstationary sounds, independent of spectral differences. The subjective experimental data confirm that the human auditory system is more sensitive to transient signals than steady signals given the same energy, which gives a strong counterexample against the equivalent-energy theory of auditory temporal integration. Since these temporal effects impact the auditory sensitivity to nonstationary sounds, a higher-order spectro-temporal integration model that emphasizes the transient parts relative to the steady parts of an audio signal is proposed to predict the audibility thresholds of nonstationary sounds. By performing higher-order temporal integration in each critical band and making the decision based on the most audible band, the proposed higher-order integration model outperforms the existing energy-based and loudness-based audibility prediction models on our human subjective data. We expect that this higher-order methodology can be used to develop improved standards for determining the hearing risks of transient sounds. We propose the use of dynamic range control (DRC) to reduce the risk of music-induced hearing loss while listening to music. Inspired by the success of the higher-order model in predicting the audibility of nonstationary sounds, a DRC algorithm using higher-order level estimation is proposed for both


hearing protection and audio fidelity. Based on the objective evaluation, the optimal order for the DRC algorithm to achieve the best balance of dynamic range reduction and audio quality depends on the signal temporal structure within a given analysis window. In general, for moderate-sized temporal windows, the higher-order DRC algorithms provide the optimal tradeoff between dynamic range reduction and signal fidelity. Further subjective experiments are needed to confirm that the objective results are consistent with subjective evaluations.


APPENDIX
DERIVATION OF THE KEY-FOWLE-HAGGARTY PHASE SOLUTION FOR WAVEFORM MANIPULATION
(Key et al. 1961, Fowle 1964, Quatieri and McAulay 1991)

Let x(t) be the given signal and x̂(t) be its Hilbert transform. Then the analytic signal representation r(t) is given by

r(t) = x(t) + j x̂(t)    (A-1)

Let |r(t)| be the envelope and φ(t) the phase of r(t); the analytic signal can be written as

r(t) = |r(t)| exp(jφ(t))    (A-2)

which has a Fourier transform

M(ω) = |M(ω)| exp(jθ(ω))    (A-3)

The signal design problem can be stated as follows: given a time-domain envelope |r(t)| and a frequency-domain spectral magnitude |M(ω)|, find the phase φ(t) in time and θ(ω) in frequency such that the following Fourier transform relation is satisfied:

|r(t)| exp(jφ(t)) = F⁻¹{ |M(ω)| exp(jθ(ω)) }    (A-4)

where F⁻¹ denotes the inverse Fourier transform. Hence

r(t) = (1/2π) ∫ |M(ω)| exp(jθ(ω)) exp(jωt) dω = (1/2π) ∫ |M(ω)| exp(j(ωt + θ(ω))) dω    (A-5)

According to the principle of stationary phase, the integral of a rapidly oscillating function has little value except in regions where the phase is stationary, i.e., where the derivative of the phase is zero. In (A-5) we have a stationary point where

PAGE 92

92 0 )) ( t ( d d (A 6) Let us represent the value of which satisfies (A 6) by Then we have t ) ( (A 7) Under the assumption that for each value of time t, there is only one stationary point, (A 5) t hen becomes d ))) ( t ( j exp( ) ( M 2 1 ) t ( r (A 8) Following the method of stationary phase, the phase function in (A 5) is expanded in a 3term Taylors series about the stationary point and 2) ( 2 ) ( ) )]( ( t [ )] ( t [ ) ( t (A 9) The second term is zero based on (A 7). Then r(t) is given by d ] ) ( 2 ) ( j exp[ )]} ( t [ j exp{ ) ( M 2 1 ) t ( r2 (A 10) Setting x and ) ( j 12 (A 10) becomes dx ] 2 x exp[ )]} ( t [ j exp{ ) ( M 2 1 ) t ( r2 2 (A 11) Th e statement that the phase is dispersive at infers that ) ( is large and 2 is therefore small. We assume 2 is small enough to cause the entire area under the Gaussian funct ion in (A 11) to be obtained within the limits even though is small. Hence, we have d ] ) (2 ) ( j exp[ )]} ( t [ j exp{ ) ( M 2 1 ) t ( r2 (A 12)

PAGE 93

93 The integral may be evaluated as follows, 2 dx ) x exp(2 2 (A 13) where ) ( ) 4 j exp( (A 14) We take + associated with 0 ) ( with 0 ) ( Thus we obtain ) 4 ) ( t ( j exp ) ( ) ( M ) t ( r (A 15) The modulus |r(t)| is ) ( ) ( M ) t ( r (A 16) To make |r(t)| constant, inspection of (A 16 ) shows that we must set 2) ( M c ) ( (A 17) We differentiate (A 7) with respect to to obtain d dt ) ( (A 18) Then we square both sides of (A 16) and use (A 18) to get d ) ( M dt ) t ( r2 2 (A 19) We integrate (A 19) and have

PAGE 94

94 d ) ( M dt ) t ( r2 t 2 (A 20) Let A ) t ( r and A is constant, we have d )( M t A dt A2 2 t 2 (A 21) It follows 2 2A / )d ) ( M ( t (A 22) By substituting (A 22) in to (A 7), we have 2 2A / ) d ) ( M ( ) ( (A 23) Integrate both side of (A 17), we have d ) ( M c ) (2 (A 24) By comparing (A 23) and (A 24), we have 2 A 1 c (A 25) Parsevals theory requires that d ) ( M dt ) t ( r2 T 0 2 (A 26) Because for analytic signal r(t), we have ) 0 ( 0 ) ( M (A 26) can be rewritten as d ) ( M dt ) t ( r0 2 T 0 2 (A 27)

PAGE 95

95 By substituting A ) t ( r in to (A 27) we have d ) ( M dt A0 2 T 0 2 (A 28) d ) ( M T A0 2 2 (A 29) T / ) d ) ( M ( A0 2 2 (A 30) Combining (A 30) and (A 25), we have 0 2d) ( M T c (A 31) By substituting (A 31) into (A 17), we have 0 2 2d ) ( M )( M T ) ( (A 32) => d d ) ( M T d ) ( M d d ) ( M T ) (0 0 2 0 2 0 0 2 (A 33) where 0 2 2 2d ) ( M ) ( M ) ( M


LIST OF REFERENCES

American National Standards Institute (1996). ANSI S3.44-1996 American National Standard Determination of Occupational Noise Exposure and Estimation of Noise-Induced Hearing Impairment.

Buus, S., E. Schorer, et al. (1986). "Decision rules in detection of simple and complex tones." The Journal of the Acoustical Society of America 80(6): 1646-1657.

Chasin, M. (2003). "Music and hearing aids." Hearing Journal 56(7): 36, 38, 40-41.

Chasin, M. (2006). "Hearing Aids for Musicians." Hearing Review.

Eddins, D. A. and D. M. Green (1995). Temporal integration and temporal resolution. In Hearing. San Diego, CA: Academic Press: 207-242.

Erdreich, J. (1985). "Distribution based definition of impulse noise." The Journal of the Acoustical Society of America 77(S1): S19.

European Broadcasting Union (2001). Recommendation ITU-R BS.1387-1: Method for objective measurements of perceived audio quality.

European Broadcasting Union (2008). Sound Quality Assessment Material recordings for subjective tests.

Fastl, H. and E. Zwicker (2006). Psychoacoustics: Facts and Models. New York: Springer-Verlag.

Fletcher, H. (1940). "Auditory Patterns." Reviews of Modern Physics 12: 47.

Fletcher, H. and W. A. Munson (1933). "Loudness of a Complex Tone, Its Definition, Measurement and Calculation." The Journal of the Acoustical Society of America 5(1): 65.

Fligor, B. J. (2009). "Risk for Noise-Induced Hearing Loss From Use of Portable Media Players: A Summary of Evidence Through 2008." Perspectives on Audiology 5(1): 10-20.

Fowle, E. (1964). "The design of FM pulse compression signals." IEEE Transactions on Information Theory 10(1): 61-67.

Glasberg, B. R. and B. C. J. Moore (1990). "Derivation of auditory filter shapes from notched-noise data." Hearing Research 47: 103-138.

Glasberg, B. R. and B. C. J. Moore (2005). "Development and evaluation of a model for predicting the audibility of time-varying sounds in the presence of background sounds." Journal of the Audio Engineering Society 53: 906-918.

Green, D. M. (1958). "Detection of Multiple Component Signals in Noise." The Journal of the Acoustical Society of America 30(10): 904-911.

Green, D. M. and R. D. Luce (1975). "Parallel psychometric functions from a set of independent detectors." Psychological Review 82(6): 483-486.

Green, D. M. and J. A. Swets (1974). Signal Detection Theory and Psychophysics. Oxford, England: Robert E. Krieger.

Hamernik, R. P. and W. Qiu (2001). "Energy-independent factors influencing noise-induced hearing loss in the chinchilla model." The Journal of the Acoustical Society of America 110(6): 3163-3168.

Hamernik, R. P., W. Qiu, et al. (2003). "The effects of the amplitude distribution of equal energy exposures on noise-induced hearing loss: the kurtosis metric." The Journal of the Acoustical Society of America 114(1): 386-395.

Henderson, D. and R. P. Hamernik (1995). "Biologic bases of noise-induced hearing loss." Occupational Medicine 10(3): 513-534.

Hsueh, K. D. and R. P. Hamernik (1990). "A generalized approach to random noise synthesis: Theory and computer simulation." The Journal of the Acoustical Society of America 87(3): 1207-1217.

International Organization for Standardization (1999). ISO 1999:1990 Acoustics - Determination of occupational noise exposure and estimation of noise-induced hearing impairment.

Kabal, P. (2002). An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality.

Kay, S. M. (1998). Fundamentals of Statistical Signal Processing, Volume 2: Detection Theory. Prentice Hall PTR.

Levitt, H. (1971). "Transformed Up-Down Methods in Psychoacoustics." The Journal of the Acoustical Society of America 49(2B): 467-477.

Malmierca, M. and D. Irvine, Eds. (2005). Auditory Spectral Processing. Academic Press.

Melnick, W. (1991). "Human temporary threshold shift (TTS) and damage risk." The Journal of the Acoustical Society of America 90(1): 147-154.

Mills, J. H., R. M. Gilbert, et al. (1979). "Temporary threshold shifts in humans exposed to octave bands of noise for 16 to 24 hours." The Journal of the Acoustical Society of America 65(5): 1238-1248.

Moore, B. C. J., Ed. (1995). Hearing (Handbook of Perception and Cognition). Academic Press.

Moore, B. C. J. (2003). An Introduction to the Psychology of Hearing (5th ed.). San Diego, CA: Academic Press.

National Institute for Occupational Safety and Health (1998). NIOSH 98-126 Criteria for a Recommended Standard: Occupational Noise Exposure.

Occupational Safety & Health Administration (1986). OSHA 1910.95 Occupational noise exposure.

Price, G. R. and S. Wansack (1985). "A test of predicted maximum susceptibility to impulse noise." The Journal of the Acoustical Society of America 77(S1): S82.

Quaranta, A., P. Portalatini, et al. (1998). "Temporary and permanent threshold shift: an overview." Scandinavian Audiology Supplement 48: 75-86.

Quatieri, T. F. and R. J. McAulay (1991). "Peak-to-RMS reduction of speech based on a sinusoidal model." IEEE Transactions on Signal Processing 39(2): 273-288.

Schafer, T. H. and R. S. Gales (1949). "Auditory Masking of Multiple Tones by Random Noise." The Journal of the Acoustical Society of America 21(4): 392-397.

Schmidt, J. C. and J. C. Rutledge (1996). Multichannel dynamic range compression for music signals. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Schneider, T. and R. Brennan (1997). A multichannel compression strategy for a digital hearing aid. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Stevens, S. S. (1955). "The Measurement of Loudness." The Journal of the Acoustical Society of America 27(5): 815-829.

Strasser, H., H. Irle, et al. (1999). "Physiological cost of energy-equivalent exposures to white noise, industrial noise, heavy metal music, and classical music." Noise Control Engineering Journal 47(5): 187-197.

van den Brink, W. A. and T. Houtgast (1990). "Efficient across-frequency integration in short-signal detection." The Journal of the Acoustical Society of America 87(1): 284-291.

van den Brink, W. A. C. and T. Houtgast (1990). "Spectrotemporal integration in signal detection." The Journal of the Acoustical Society of America 88(4): 1703-1711.

Viemeister, N. F. and G. H. Wakefield (1991). "Temporal integration and multiple looks." The Journal of the Acoustical Society of America 90(2): 858-865.

von Békésy, G. (1960). Experiments in Hearing. Oxford, England: McGraw-Hill.

Ward, W. D., E. M. Cushing, et al. (1976). "Effective quiet and moderate TTS: Implications for noise exposure standards." The Journal of the Acoustical Society of America 59(1): 160-165.

Yang, Q. and J. G. Harris (2009). An audibility model for non-stationary sounds. The American Speech-Language-Hearing Association (ASHA) Annual Convention 2009.

Yang, Q. and J. G. Harris (2010). A higher-order spectro-temporal integration model for predicting signal audibility. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Zwicker, E., G. Flottorp, et al. (1957). "Critical band width in loudness summation." The Journal of the Acoustical Society of America 29: 548-557.


BIOGRAPHICAL SKETCH

Qing Yang was born in Guangzhou, the largest city in South China, and grew up in Shenzhen, an immediate neighbor of Hong Kong. Qing received her bachelor's and master's degrees in electrical engineering from South China University of Technology (SCUT) in 2002 and 2005, respectively. She started her doctoral program at the University of Florida in 2005 and joined the Computational NeuroEngineering Laboratory (CNEL) in 2006. Since 2006, she has been working as a research assistant at CNEL under the guidance of Dr. John G. Harris. Her research interests include auditory perception, speech/audio signal processing, loudness measurement and control, dynamic range control, and perceptual evaluation of sound quality. She worked as a software engineering intern at the iDEN advanced technology group, Motorola Mobile Devices, Plantation, Florida, in the fall semester of 2006, optimizing and implementing a speech enhancement algorithm for Motorola mobile devices.