National Taipei University of Technology
Professor: Yuan-Fu Liao
Overview
Introduction
Microphone Array and ASR Integration
Noise - Phase Error Filtering
Maximum Likelihood-based Integration
Minimum Classification Error-like Integration
Reverberation - Subband Filtering-and-Sum
Maximum Likelihood-based Integration
Minimum Classification Error-based Integration
Summary
Traditional Beamforming+ASR
Pipeline: first enhance the speech with a beamformer, then
feed the result into the recognizer
Bridge the Gap between Array and Speech Recognizer
Take advantage of the available a priori knowledge, i.e., the
underlying recognition model
Feed the recognizer output directly back to the microphone
array
References
Noise - dual-microphone phase error filtering
Shi, G., Aarabi, P. and Jiang, H., “Phase-Based Dual-Microphone Speech Enhancement Using
A Prior Speech Model,” IEEE Trans. Audio Speech Lang. Process., 15:109-118, 2007.
C. Kim, K. Kumar, B. Raj, and R. M. Stern, “Signal separation for robust speech recognition
based on phase difference information obtained in the frequency domain,” in INTERSPEECH 2009, pp. 2495-2498, 2009.
Hsien-Cheng Liao, Yuan-Fu Liao and Chin-Hui Lee, “Maximum Confidence Measure Based
Interaural Phase Difference Estimation for Noise Masking in Dual-Microphone Robust Speech
Recognition,” INTERSPEECH 2011.
Reverberation - subband filtering-and-sum
M.L. Seltzer, B. Raj, R.M. Stern, “Likelihood-maximizing beamforming for robust hands-free
speech recognition,” IEEE Trans. Speech, and Audio Processing, vol. 12, no. 5, pp. 489–498,
Sep. 2004.
M.L. Seltzer, R.M. Stern, “Subband likelihood-maximizing beamforming for speech
recognition in reverberant environments,” IEEE Trans. Speech, and Audio Processing, vol. 14,
no. 6, pp. 2109–2121, Nov. 2006.
Yuan-Fu Liao, I-Yun Xu, “Subband minimum classification error beamforming for speech
recognition in reverberant environments,” ICASSP 2010.
Signal Modeling (ITD)
Sampling rate: 8000 Hz
[Figure: mic1 and mic2 are spaced 0.05 m apart; a sound source at angle Φ produces a path difference of 0.05 × cos(Φ) m between the microphones, which gives the interaural time delay Δt]
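The geometry above can be turned into numbers with a minimal sketch. The spacing (0.05 m) and the sampling rate (8000 Hz) come from the slide; the speed of sound (343 m/s) and the far-field assumption are not stated there and are assumptions made only for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value (not given on the slide)
MIC_SPACING = 0.05      # m, from the slide
SAMPLE_RATE = 8000      # Hz, from the slide


def itd_seconds(phi_deg, spacing=MIC_SPACING, c=SPEED_OF_SOUND):
    """Far-field time delay: path difference spacing*cos(phi) divided by c."""
    return spacing * math.cos(math.radians(phi_deg)) / c


def itd_samples(phi_deg, fs=SAMPLE_RATE):
    """The same delay expressed in samples at sampling rate fs."""
    return itd_seconds(phi_deg) * fs


if __name__ == "__main__":
    for phi in (0, 30, 60, 90):
        print(phi, itd_seconds(phi), itd_samples(phi))
```

Under these assumptions the largest possible delay (Φ = 0°) is only a little more than one sample at 8000 Hz, which is one reason to estimate the delay from the phase difference in the frequency domain rather than from sample shifts.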
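Binary Masking
[Figure: the micL and micR signals each pass through an FFT; a masking stage compares the ITD of each time-frequency bin against a threshold (ITD > / ITD <) to keep the speaker and remove the interference]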
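A minimal sketch of the masking step for one frame pair follows. It assumes that the target speaker produces an ITD close to zero, so bins with |ITD| at or below a threshold `tau` are kept; the direction of the comparison, the function name, and the parameter names are illustrative assumptions, since the original figure is not recoverable.

```python
import numpy as np


def itd_binary_mask(frame_l, frame_r, fs, tau):
    """Return (mask, itd) for one pair of equal-length frames.

    itd[k] is the inter-channel phase difference at bin k divided by the
    angular frequency of that bin. mask[k] is True where |itd[k]| <= tau.
    """
    spec_l = np.fft.rfft(frame_l)
    spec_r = np.fft.rfft(frame_r)
    freqs = np.fft.rfftfreq(len(frame_l), d=1.0 / fs)

    # Phase of left minus phase of right, wrapped to (-pi, pi].
    phase_diff = np.angle(spec_l * np.conj(spec_r))

    itd = np.zeros_like(freqs)
    nonzero = freqs > 0                      # skip DC, where the ITD is undefined
    itd[nonzero] = phase_diff[nonzero] / (2.0 * np.pi * freqs[nonzero])

    mask = np.abs(itd) <= tau                # assumption: target has near-zero ITD
    return mask, itd


if __name__ == "__main__":
    fs, n = 8000, 256
    t = np.arange(n) / fs
    left = np.sin(2 * np.pi * 1000 * t)
    right = np.sin(2 * np.pi * 1000 * (t - 1e-4))   # right channel lags by 100 us
    mask, itd = itd_binary_mask(left, right, fs, tau=5e-5)
    print(itd[32], mask[32])                         # bin 32 corresponds to 1000 Hz
```

The masked spectrum would then be `spec * mask` before feature extraction; how the retained bins are turned into recognizer features is not specified on the slide.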
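Optimal τ estimation
[Flowchart: left microphone signal and right microphone signal → short-time Fourier transform → interaural time difference computation module → feature vector computation module → speech recognition (at least one pass) using the speech command models and model N+1 → X-score computation module → X-score output. If the X-score is not the maximum (no), the automatic threshold adjustment module updates the threshold input and the loop repeats; if it is the maximum (yes), the recognition result and threshold are output.]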
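The loop in the flowchart can be sketched as a search for the threshold that maximizes the recognizer's score. The slide does not say how the automatic threshold adjustment module proposes new thresholds, so the exhaustive search below is a stand-in, and `score_fn` is a hypothetical callable that would mask the signal with a given τ, run recognition, and return the X-score.

```python
def select_threshold(candidates, score_fn):
    """Return (best_tau, best_score) over the candidate thresholds.

    score_fn is a hypothetical stand-in for: mask with tau, extract
    features, decode, and compute the X-score of the result.
    """
    best_tau, best_score = None, float("-inf")
    for tau in candidates:
        score = score_fn(tau)
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score


if __name__ == "__main__":
    # Toy score with a single peak at tau = 2, for illustration only.
    print(select_threshold([1, 2, 3], lambda tau: -(tau - 2) ** 2))
```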
Testing Database
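Recording environment for the re-recorded dual-microphone audio
Anechoic chamber: 5 × 4 × 3 m³
Microphone position: center of the anechoic chamber
Microphone spacing: 5 cm
Microphone height: 1 m
Distance from the target source to the center of the microphone pair: 30 cm
Babble noise source angles: 30° and 60°
Test corpus
50 commands (e.g., "forward", "backward" …)
11 speakers (6 males & 5 females)
547 utterances in total
Noise added artificially
SNR: 0, 6, 12, 18 dB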
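Adding noise at a target SNR can be sketched as below. This is a generic recipe, not necessarily the one used to build the test set; it assumes the SNR is defined over the whole utterance from mean signal power, and that the noise recording is at least as long as the speech.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]  # assumes noise is long enough
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(8000)   # placeholder for a speech waveform
    babble = rng.standard_normal(8000)   # placeholder for babble noise
    for snr in (0, 6, 12, 18):           # SNR conditions listed on the slide
        noisy = mix_at_snr(speech, babble, snr)
        print(snr, noisy.shape)
```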
Recognition Model
Training Data
MAT2000 DB4
Feature
25 ms/frame without overlap
13 dimensions (8 cepstra, 4 delta cepstra, delta C0)
Recognition Model
100 2-state RCD Initials + 38 2-state CI Finals
2 Gaussian mixture components per state
Performance of online τ estimation
[Figure: two panels, one for the 30° and one for the 60° babble noise angle, each with an axis in dB]
Reverberation - Subband Filtering-and-Sum
• Introduction
• Maximum Likelihood-based Integration
• Minimum Classification Error-based Integration
Introduction
Reverberant Model
Noise Free Model in Time Domain
Speech Reverberation - Time Domain
Speech Reverberation - Frequency Domain
[Figure: clean speech and noisy speech]
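The time-domain and frequency-domain views of reverberation can be illustrated with a small sketch: the reverberant signal is the clean signal convolved with a room impulse response, and over the whole signal this is a multiplication of spectra. The impulse response below is a synthetic toy (exponentially decaying noise with a direct path), not a measured one, and all names are illustrative assumptions.

```python
import numpy as np


def synthetic_rir(t60, fs, rng):
    """Toy room impulse response whose envelope falls by 60 dB after t60 seconds."""
    n = int(round(t60 * fs))
    t = np.arange(n) / fs
    decay = 10.0 ** (-3.0 * t / t60)     # amplitude factor 1e-3 (-60 dB) at t = t60
    h = rng.standard_normal(n) * decay
    h[0] = 1.0                           # direct path
    return h


def reverberate(clean, h):
    """Time-domain model: reverberant signal = clean signal convolved with h."""
    return np.convolve(clean, h)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = synthetic_rir(0.3, 8000, rng)
    x = rng.standard_normal(1000)        # placeholder for clean speech
    y = reverberate(x, h)
    # Frequency-domain model: with zero padding to the full output length,
    # the spectrum of y is the product of the spectra of x and h.
    n = len(y)
    print(np.allclose(np.fft.rfft(y), np.fft.rfft(x, n) * np.fft.rfft(h, n)))
```

When the impulse response is much longer than a short analysis frame, this simple product no longer holds frame by frame, which is a common motivation for filtering within subbands.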
Basic idea of LiMaBeam
Iterative procedure, utterance-based:
Do beamforming
Decode the utterance
Given the most likely HMM state sequence, optimize the beamformer
parameters for this sequence
Stop when the likelihood has converged
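The alternation above can be sketched with a deliberately simplified toy. Here the "beamformer" is one weight per channel applied to a scalar feature, each "HMM state" is a unit-variance Gaussian with a fixed mean, decoding picks the best state per frame independently (no transitions), and the parameter update is a least-squares fit. None of these simplifications come from the slides; the actual method operates on filter parameters and recognizer features with Viterbi decoding.

```python
import numpy as np


def limabeam_toy(X, state_means, w0, max_iter=50, tol=1e-9):
    """Toy alternation. X: (T, M) channel features, state_means: (K,), w0: (M,).

    Returns (weights, state sequence, log-likelihood history). The
    log-likelihood is given up to an additive constant (unit variances).
    """
    X = np.asarray(X, dtype=float)
    state_means = np.asarray(state_means, dtype=float)
    w = np.asarray(w0, dtype=float)
    history, prev = [], -np.inf
    for _ in range(max_iter):
        y = X @ w                                            # 1. do beamforming
        dist = (y[:, None] - state_means[None, :]) ** 2
        states = np.argmin(dist, axis=1)                     # 2. decode (best state per frame)
        target = state_means[states]
        w, *_ = np.linalg.lstsq(X, target, rcond=None)       # 3. optimize for this sequence
        loglik = -0.5 * np.sum((X @ w - target) ** 2)
        history.append(loglik)
        if loglik - prev < tol:                              # 4. stop when converged
            break
        prev = loglik
    return w, states, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 4))
    w, states, history = limabeam_toy(X, [-1.0, 0.0, 1.0], np.full(4, 0.25))
    print(len(history), history[-1])
```

In this toy, each step can only keep or improve the fit, so the log-likelihood history does not decrease, which mirrors the convergence check in the last step of the procedure.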
Subband Likelihood-Maximizing Beamforming
Formulation
Subband Minimum Classification Error Beamforming
MCE Criterion
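The formulas on this slide are not recoverable, so the sketch below shows only the commonly used generic form of the MCE criterion, not necessarily the exact formulation of the talk. A misclassification measure compares the score of the correct class with a soft maximum over the competing classes, and a sigmoid turns it into a smoothed error count. The smoothing constants `eta` and `gamma` are illustrative parameters.

```python
import numpy as np


def mce_loss(scores, correct, eta=1.0, gamma=1.0):
    """Smoothed error count for one utterance.

    scores: discriminant scores (e.g. log-likelihoods), one per class.
    correct: index of the correct class.
    """
    scores = np.asarray(scores, dtype=float)
    g_correct = scores[correct]
    rivals = np.delete(scores, correct)

    # Soft maximum over competing classes: (1/eta) * log(mean(exp(eta * g_j))),
    # computed with the maximum factored out for numerical stability.
    m = np.max(eta * rivals)
    g_rivals = (m + np.log(np.mean(np.exp(eta * rivals - m)))) / eta

    d = g_rivals - g_correct                 # > 0 means the utterance is misrecognized
    return 1.0 / (1.0 + np.exp(-gamma * d))  # sigmoid: smooth approximation of 0/1 error


if __name__ == "__main__":
    print(mce_loss([5.0, 1.0, 0.0], correct=0))  # correct class wins: loss below 0.5
    print(mce_loss([0.0, 5.0, 1.0], correct=0))  # a rival wins: loss above 0.5
```

Because the loss is differentiable in the scores, it can in principle be minimized with respect to the beamformer parameters by gradient descent, in contrast to the likelihood objective, which looks only at the correct class.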
TCC300 Reverberation Experiment
Experimental Setting
Microphone array with 7 microphones, 5.66 cm between adjacent microphones
Speaker 2 m away from the array
Room reverberation time T60 = 0.3 to 1.3 s
TCC300 database, 29 speakers, each with 5 calibration and 10 test
utterances
– Evaluation with free-syllable decoding/syllable error rate (no language model)
Experimental Results
Typical Spectrum Examples
[Figure: four panels showing clean speech, noisy speech, the delay-and-sum output, and the MCE beamformer output]
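For reference, the delay-and-sum baseline shown in the figure can be sketched as follows. The sketch assumes known integer sample delays per channel and uses a circular shift for brevity; a practical implementation would estimate the delays from the array geometry or by cross-correlation and would handle fractional delays.

```python
import numpy as np


def delay_and_sum(channels, delays):
    """Align and average the channels.

    channels: (M, N) array, one row per microphone.
    delays: integer delay (in samples) of each channel relative to the reference.
    """
    channels = np.asarray(channels, dtype=float)
    out = np.zeros(channels.shape[1])
    for channel, delay in zip(channels, delays):
        out += np.roll(channel, -delay)   # advance by the delay (circular shift, a simplification)
    return out / channels.shape[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(1000)
    delays = [0, 1, 2, 3, 4, 5, 6]        # 7 channels, as in the experiment; delays are made up
    channels = np.stack([np.roll(source, d) for d in delays])
    print(np.allclose(delay_and_sum(channels, delays), source))
```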
Summary
Take advantage of the available a priori knowledge, i.e., the
underlying recognition model
Feed the recognizer output directly back to the microphone
array
An error-rate criterion (MCE) works better than a likelihood criterion (ML)