Determining Which Acoustic Features Contribute Most to Speech Intelligibility

John-Paul Hosom, Alexander Kain, Akiko Kusumoto
hosom@cslu.ogi.edu
Center for Spoken Language Understanding (CSLU)
OGI School of Science & Engineering
Oregon Health & Science University (OHSU)

(image from http://www.ph.tn.tudelft.nl/~vanwijk/Athens2004/niceones/images.html)

Outline
1. Introduction
2. Background: Speaking Styles
3. Background: Acoustic Features
4. Background: Prior Work on Clear Speech
5. Objectives of Current Study
6. Methods
7. Results
8. Conclusion

1. Introduction
Motivation #1:
• Difficult to understand speech in noise, especially with a hearing impairment.
• When people speak clearly, speech becomes more intelligible.
• Automatic enhancement of speech could be used in next-generation hearing aids.
• Attempts to modify speech by computer to improve intelligibility not yet very successful.
• Need to understand which parts of signal should be modified and how to modify them.

Motivation #2:
• Even best current model for computer speech recognition does not provide sufficiently accurate results.
• Current research applies new mathematical techniques to this model, but techniques are generally not motivated by studies of human speech perception.
• A better understanding of how acoustic features contribute to speech intelligibility could guide research on improving computer speech recognition.

Research Objective:
To identify the relative contribution of acoustic features to intelligibility by examining conversational and clear speech.

Long-Term Goals:
• Accurately predict speech intelligibility from acoustic features,
• Integrate most effective features into computer speech-recognition models,
• Develop novel signal-processing algorithms for hearing aids.

2. Speaking Styles: Production
“Conversational speech” and “clear speech” easily produced with simple instructions to speakers.
• Conversational (CNV) speech: “read text conversationally as in daily communication.”
• Clear (CLR) speech: “read text clearly as if talking to a hearing-impaired listener.”

2. Speaking Styles: Perception
• To compare CNV and CLR speech intelligibility, same sentences read in both styles, then listened to by group of subjects.
• Intelligibility measured as the percentage of sentences that are correctly recognized by listener.
• CLR speech increases intelligibility for a variety of:
  - Listeners (young listeners, elderly listeners)
  - Speech materials (meaningful sentences, nonsense syllables)
  - Noise conditions (white noise, multi-talker babble noise)

3. Acoustic Features: Representations
• Acoustic Features
  - Duration (length of each distinct sound)
  - Energy
  - Pitch
  - Spectrum (spectrogram)
  - Formants
  - Residual (power spectrum without formants)

3. Acoustic Features: Waveform
• Time-Domain Waveform
“Sound originates from the motion or vibration of an object. This motion is impressed upon the [air] as a pattern of changes in pressure.” [Moore, p. 2]
[Figure: waveform for word “two”; amplitude vs. time (msec).]

3. Acoustic Features: Energy
• Energy
Energy is proportional to the square of the pressure variation. Log scale is used to reflect human perception.
$$E(t) = 10 \log_{10}\left(\frac{1}{N}\sum_{n=t}^{t+N-1} x_n^2\right)$$
x_n = waveform sample x at time point n
N = number of time samples
[Figure: waveform and its energy.]

3. Acoustic Features: Pitch
• Pitch
Pitch (F0) is rate of vibration of vocal folds.
F0 (fundamental frequency) = 1 / periodicity
[Figure: Speech Production Apparatus, showing nasal tract, vocal tract, tongue, and vocal folds (larynx) (from Olive, p. 23); airflow through vocal folds, amplitude vs. time.]
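
As a rough illustration of the log-energy measure on the Energy slide, the sketch below assumes the slide's formula is 10·log10 of the mean squared sample value over N samples starting at time point t. The function name `log_energy_db` and the small `floor` value that guards against log10(0) on silent frames are illustrative assumptions and do not come from the slides.

```python
import math


def log_energy_db(samples, t, n, floor=1e-12):
    """Log energy (dB) of the n samples starting at index t.

    Assumed form: 10 * log10((1/N) * sum of x_n^2).
    `floor` is a hypothetical guard against log10(0) for silent frames.
    """
    frame = samples[t:t + n]
    if n <= 0 or len(frame) != n:
        raise ValueError("frame must contain exactly n samples")
    mean_square = sum(x * x for x in frame) / n
    return 10.0 * math.log10(max(mean_square, floor))


# Two synthetic frames (not speech data): same shape, tenfold amplitude gap.
loud = [1.0, -1.0] * 8
quiet = [0.1, -0.1] * 8
print(log_energy_db(loud, 0, 16))
print(log_energy_db(quiet, 0, 16))
```

Because energy depends on the square of the amplitude, a tenfold drop in amplitude lowers the value by about 20 dB, which is why the log scale compresses large pressure differences into a range closer to perceived loudness.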
[Figure: example waveforms with pitch = 117 Hz and pitch = 83 Hz; amplitude vs. time (msec).]

3. Acoustic Features: Spectrum
• Phoneme: Abstract representation of basic unit of speech (“cat”: /k æ t/).
• Spectrum: What makes one phoneme, /ɑ/, sound different from another phoneme, /i/? Different shapes of the vocal tract: /ɑ/ is produced with the tongue low and in the back of the mouth; /i/ with tongue high and toward the front.

• Source of speech is pulses of air from vocal folds.
• This source is filtered by vocal tract “tube”.
• Speech waveform is result of filtered source signal.
• Different shapes of tube create different filters, different resonant frequencies, different phonemes.
[Figure: vocal-tract shapes for /ɑ/ and /i/ (from Ladefoged, p. 58-59).]

Resonant frequencies identified by frequency analysis of speech signal. Fourier Transform expresses a signal in terms of signal strength at different frequencies:
$$X(f) = \int_{t=-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt = \int_{t=-\infty}^{\infty} x(t)\left[\cos(2\pi f t) - j \sin(2\pi f t)\right] dt$$
$$e^{-j\theta} = \cos(\theta) - j \sin(\theta)$$
$$S(f) = 10 \log_{10}\left(|X(f)|^2\right)$$

The time-domain waveform and power spectrum can be plotted like this (/ɑ/):
[Figure: time-domain amplitude (top) and spectral power in dB (10 to 90 dB) vs. frequency (0 to 4000 Hz) (bottom), with F0 = 95 Hz marked.]

The resonant frequencies, or formants, are clearly different for vowels /ɑ/ and /i/. Spectral envelope is important for phoneme identity (envelope = general spectral shape, no harmonics).
[Figure: spectra with envelopes for /ɑ/ and /i/, 0 to 4 kHz.]

3. Acoustic Features: Formants
Formants (dependent on vocal-tract shape) are independent of pitch (rate of vocal-fold vibration).
[Figure: spectra of /ɑ/ with F0 = 80 Hz and with F0 = 160 Hz, 0 to 4 kHz.]

• Formants specified by frequency, and numbered in order of increasing frequency. For /ɑ/, F1 = 710 Hz, F2 = 1100 Hz.
• F1, F2, and sometimes F3 often sufficient for identifying vowels.
• For vowels, sound source is air pushed through vibrating vocal folds. Source waveform is filtered by vocal-tract shape. Formants correspond to these filters.
• Digital model of a formant can be implemented using an infinite-impulse response (IIR) filter.

[Figure: formant frequencies (averages for English), F1 to F3 in Hz on a 0 to 3500 Hz axis, for the vowels iy, ih, eh, ae, ah, aa, uh, uw (from Ladefoged, p. 193).]

3. Acoustic Features: Coarticulation
[Figure: spectrogram (frequency vs. time) of “you are”: /j u ɑ r/.]

3. Acoustic Features: Vowel Neutralization
When speech is uttered quickly, or is not clearly enunciated, formants shift toward neutral vowel.
[Figure: from van Bergem 1993, p. 8.]

4. Prior Work: Acoustics of Clear Speech
1. Pitch (F0): more variation, higher average.
2. Energy: Consonant-vowel (CV) energy ratio increases for stops (/p/, /t/, /k/, /b/, /d/, /g/).
3. Pauses: Longer in duration and more frequent.
4. Phoneme and sentence duration: longer.
• However, correlation between a characteristic of an acoustic feature and intelligibility does not mean the characteristic causes increased intelligibility.
• For example, fast speech can be just as intelligible as slow speech; longer sentence duration not a cause of increased intelligibility.

4. Prior Work: Speech Modification
• Lengthen phoneme durations [e.g. Uchanski 1996].
• Insert pauses at phrase boundaries or word boundaries [e.g. Gordon-Salant 1997; Liu 2006].
• Amplify consonant energy in consonant-vowel (CV) contexts [Gordon-Salant 1986; Hazan 1998].
Positive results at sentence level reported in only one case, using extreme modification (Hazan 1998, 4.2% improvement).

5. Objectives: Background
Summary of Current State:
• CLR speech intelligibility higher than CNV speech.
• Speech has acoustic features that interact in complex ways.
• Correlation between acoustic features and intelligibility has been shown, but causation not demonstrated.
• Signal modification of CNV speech shows little or no intelligibility improvement.
• Reason for inability to dramatically improve CNV speech intelligibility not known.

5. Objectives of Current Study
1. To validate that CLR speech is more intelligible than CNV speech for our speech material.
2. To process CNV speech so that intelligibility is significantly closer to CLR speech. We propose a hybridization algorithm that creates “hybrid” (HYB) speech using features from both CNV and CLR speech.
3. To determine acoustic features of CLR speech that cause increased intelligibility.

6. Methods: Hybridization Algorithm
Hybridization:
• Input: parallel recordings of a sentence spoken in both CNV and CLR styles.
• Signal processing replaces certain acoustic features from CNV speech with those of CLR speech.
• Output: synthetic speech signal.
• Uses Pitch-Synchronous Overlap Add (PSOLA) for pitch and/or duration modification [Moulines and Charpentier, 1990].

6. Methods: Hybridization with PSOLA
• Pitch Modification: alter distance between glottal pulses (scale 2.0 → raise pitch).
• Duration Modification: duplicate or eliminate glottal pulses (scale 2.0 → lengthen duration).
[Figure: original CNV speech signal and modified signals, with glottal pulses labeled a to d and spans labeled 33 ms, 66 ms, and 25 ms.]

6. Methods: Hybridization Algorithm
Stage 1: Database Preparation
• For both CNV speech and CLR speech: phoneme labelling, pitch marking, voicing, placement of auxiliary marks.
• Phoneme alignment between CLR and CNV speech.
• Parallelization between CLR and CNV speech (features P, N).
For each sentence (CLR and CNV recordings):
• Manually label phoneme identity and locations.
• Match phonemes in CLR and CNV recordings.
• Identify location of each glottal pulse.

Stages 2 and 3: Feature Analysis and Selection
• Extract acoustic features: Spectrum (S), F0 (F), Long-term Energy (E), Phoneme Duration (D).
• For each feature, select from CNV or CLR for generating speech waveform (hybrid configuration).

Stage 4: Waveform Synthesis
• Use PSOLA to generate waveform using selected features with spectrum at each glottal pulse.
• Output is HYB speech, named according to features taken from CLR speech, e.g. CLR-D.

6. Methods: Speech Corpus
• Public database of sentences, syntactically and semantically valid.
  - Ex: His shirt was clean but one button was gone.
  - 5 keywords (underlined on the slide) for measuring intelligibility.
  - Long enough to test effects of prosodic features (combination of duration, energy, pitch).
  - Short enough to minimize memory effects.
• One male speaker read text material with both CNV and CLR speaking styles.

6. Methods: Perceptual Test
For each listener:
1. Audiometric test (to ensure normal hearing),
2. Find optimal noise level for this listener,
3. Measure intelligibility of CLR, CNV, and HYB speech.
For finding optimal noise levels and measuring intelligibility, the listener’s task is to repeat the sentence aloud.

6. Methods: Finding Optimal Noise Level
• Total energy of each sentence normalized (65 dBA).
• To avoid “ceiling effect,” sentences played with background noise (12-speaker babble noise).
• To normalize performance differences between listeners, noise set to a specific level for each listener.
• Noise level set so that each listener correctly identifies CNV sentences 50% of the time.
[Figure: axis labeled “decreasing noise level”.]

6. Methods: Measuring Intelligibility
• 48 sentences per subject
• Correct response for sentence when at least 4 of 5 keywords correctly repeated by listener.
$$\text{Intelligibility (\%)} = \frac{\text{\# of sentences correctly identified}}{\text{\# of sentences presented}} \times 100$$

6. Methods: Listeners
• Subjects:
  - 12 listeners with normal hearing
  - age 19 to 40 (mean 29.17)
  - Average noise level -0.24 dB SNR
• Significance Testing
  - Paired t-test with p < 0.05

6. Methods: Features
• Energy and pitch always taken from CNV speech.
• Test importance of other two acoustic features:
  - spectrum (for phoneme identity)
  - duration (for syntactic parsing)
• Test co-dependence of spectrum and duration.
[Figure: feature hierarchy. Speech Waveform divides into Spectrum (Residual, Formants) and Prosody (Duration, Energy, Pitch).]

6. Methods: Stimuli
• Conditions:
1. CNV Original
2. HYB Speech, CLR-Dur
3. HYB Speech, CLR-Spec
4. HYB Speech, CLR-DurSpec
5. CLR Original

7. Results
Mean Intelligibility (%); * = significant difference, compared with CNV.

Condition      Mean intelligibility (%)   Difference from CNV (%, absolute)
CNV            64
CLR-Dur        74                         10
CLR-Spec       75                         11 *
CLR-DurSpec    82                         18 *
CLR            89                         25 *

8. Conclusion
Results of Objectives:
1. To validate that CLR speech is more intelligible than CNV speech. Confirmed: 25% absolute difference (significant).
2. To process CNV speech so that intelligibility is significantly closer to CLR speech. Confirmed: 18% absolute improvement (significant).
3. To determine acoustic features of CLR speech that cause increased intelligibility. Spectrum and combination of Spectrum and Duration are effective. Duration alone almost significant.

8. Conclusion
Conclusions:
1. The single acoustic feature that yields greatest intelligibility improvement is the spectrum, but it contributes less than half of possible improvement.
2. Duration alone yields improvements almost as good as spectrum alone.
(Prior work indicates, however, that total sentence duration and pause patterns are not important for intelligibility.)
3. The combination of duration and spectrum does not quite yield the intelligibility of CLR speech; further work to determine if difference due to (a) pitch, (b) energy, (c) signal-processing artifacts.

8. Conclusion
Long-Term Goals:
• Identify more specific features that contribute to speech intelligibility and their degree of contribution,
• Evaluate different speakers and listener groups,
• Accurately predict speech intelligibility from acoustics,
• Integrate most effective features into signal-processing and speech-recognition algorithms.
[Figure: feature hierarchy. Speech Waveform divides into Spectrum (Residual, Formants) and Prosody (Duration, Energy, Pitch).]

Thank you!
CSLU will have job opening(s) in Summer/Fall 2007 for phonetic transcription, syntactic labeling. If interested, please e-mail hosom@cslu.ogi.edu or roark@cslu.ogi.edu.
(also, special thanks to my dog, Nayan…)
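
The scoring rule from the Measuring Intelligibility slide (a sentence counts as correct when at least 4 of its 5 keywords are repeated, and intelligibility is the percentage of presented sentences scored correct) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, keyword matching is simplified to case-insensitive whole-word comparison, the five keywords chosen for the example sentence are a guess because the underlining is lost in this transcript, and the listener responses are invented for the example.

```python
import string


def count_keyword_hits(keywords, response):
    """Count how many keywords appear in the listener's repeated sentence."""
    # Lowercase and strip punctuation so "gone." matches "gone".
    words = {w.strip(string.punctuation).lower() for w in response.split()}
    return sum(1 for k in keywords if k.lower() in words)


def sentence_correct(keywords, response, min_hits=4):
    """A sentence is correct when at least min_hits keywords are repeated."""
    return count_keyword_hits(keywords, response) >= min_hits


def intelligibility_percent(trials, min_hits=4):
    """trials: list of (keywords, response) pairs, one per sentence presented."""
    if not trials:
        raise ValueError("no sentences presented")
    correct = sum(sentence_correct(k, r, min_hits) for k, r in trials)
    return 100.0 * correct / len(trials)


# Hypothetical keyword choice for the example sentence from the corpus slide.
keywords = ["shirt", "clean", "one", "button", "gone"]
# Hypothetical listener responses, invented for illustration only.
trials = [
    (keywords, "His shirt was clean but one button was gone."),  # 5 keywords
    (keywords, "His shirt was green and the button was on."),    # 2 keywords
]
print(intelligibility_percent(trials))
```

Scoring whole sentences with a 4-of-5 threshold, rather than averaging keyword hit rates, matches the slide's definition; a keyword-level score would be a different measure.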
