Mandarin Chinese Speech Recognition
Mandarin Chinese
Tonal language (inflection matters!)
Monosyllabic language
1st tone – High, constant pitch (Like saying “aaah”)
2nd tone – Rising pitch (“Huh?”)
3rd tone – Low pitch (“ugh”)
4th tone – High pitch with a rapid descent (“No!”)
“5th tone” – Neutral tone, used for de-emphasized syllables
Each character represents a single base syllable and tone
Most words consist of 1, 2, or 4 characters
Heavily contextual language
Mandarin Chinese and Speech Processing
Acoustic representations of Chinese syllables
Structural Form
(consonant) + vowel + (consonant)
Mandarin Chinese and Speech Processing
Phone Sets
Initial/final phones [1]
e.g. shi, ge, zi = (sh + ib), (g + e), (z + if)
Initial phones: unvoiced
1 phone
Final phones: voiced (tone 1-5)
Can consist of multiple phones
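The initial/final decomposition above can be sketched in a few lines. This is only an illustration under stated assumptions: it works on plain tone-numbered pinyin, treats "y" and "w" as initials for simplicity, and returns ordinary pinyin finals rather than the distinct final symbols (such as "ib" and "if") used in the phone set of [1]. The name `split_syllable` is hypothetical.

```python
# Pinyin initials; two-letter initials come first so that "sh" wins over "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable):
    """Split a tone-numbered pinyin syllable such as 'shi4' into (initial, final, tone)."""
    tone = 5  # assumption: a syllable without a digit carries the neutral tone
    if syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    syllable = syllable.lower()
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    return "", syllable, tone  # zero-initial syllable such as 'an'

print(split_syllable("shi4"))
print(split_syllable("ge"))
```

The tone stays attached to the final, which matches the slide's point that the voiced final carries the tone.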
Mandarin Chinese and Speech Processing
Strong tonal recognition is crucial to distinguish between homonyms [3] (especially without context)
Creating tone models is difficult
Discontinuities exist in the F0 contour between voiced and unvoiced regions
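One common way to deal with these discontinuities is to bridge the unvoiced gaps before modeling. The sketch below uses plain linear interpolation, which is an assumption made for illustration and not necessarily the smoothing used in the cited papers. It also assumes unvoiced frames are marked with an F0 of 0; `interpolate_f0` is a hypothetical name.

```python
def interpolate_f0(f0):
    """Fill unvoiced frames (F0 == 0) by linear interpolation between voiced frames."""
    voiced = [i for i, value in enumerate(f0) if value > 0]
    if not voiced:
        return list(f0)  # no voiced frame to interpolate from
    filled = [float(value) for value in f0]
    # Hold the first and last voiced values at the utterance edges.
    for i in range(voiced[0]):
        filled[i] = float(f0[voiced[0]])
    for i in range(voiced[-1] + 1, len(f0)):
        filled[i] = float(f0[voiced[-1]])
    # Linearly bridge each gap between consecutive voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        step = (f0[right] - f0[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = f0[left] + step * (i - left)
    return filled

print(interpolate_f0([0, 100, 0, 0, 130, 0]))
```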
Prosody
Prosody: “the rhythmic and intonational aspect of language” [2]
Embedded Tone Modeling [4]
Explicit Tone Modeling [4]
Tone Modeling
Embedded Tone Modeling
Tonal acoustic units are joined with spectral features at each frame [4]
Explicit Tone Modeling
Tone recognition is performed independently and combined with the recognition result in post-processing [4]
Tone Modeling
Pitch, energy, and duration (prosody), combined with lexical and syntactic features, improve tonal labeling
Coarticulation
A syllable's tone can change depending on the tone of the syllable that follows (tone sandhi):
Bu4 + Dui4 = Bu2 Dui4 (meaning “wrong”)
Ni3 + Hao3 = Ni2 Hao3 (meaning “hello”)
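The two examples above follow well-known sandhi rules: a 3rd tone before another 3rd tone becomes a 2nd tone, and bu4 before a 4th tone becomes bu2. A minimal sketch of just these two rules follows. It is a simplification: longer runs of 3rd tones depend on prosodic grouping, which this code does not model, and `apply_sandhi` is a hypothetical name.

```python
def apply_sandhi(syllables):
    """Apply two sandhi rules to a list of (pinyin, tone) pairs."""
    result = list(syllables)
    for i in range(len(result) - 1):
        pinyin, tone = result[i]
        next_tone = syllables[i + 1][1]  # citation tone of the following syllable
        if tone == 3 and next_tone == 3:
            result[i] = (pinyin, 2)  # 3rd tone before 3rd tone becomes 2nd tone
        elif pinyin == "bu" and tone == 4 and next_tone == 4:
            result[i] = (pinyin, 2)  # bu4 before a 4th tone becomes bu2
    return result

print(apply_sandhi([("ni", 3), ("hao", 3)]))
print(apply_sandhi([("bu", 4), ("dui", 4)]))
```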
Embedded Tone Modeling:
Two Stream Modeling
Ni, Liu, Xu
Spectral Stream – MFCCs (Mel-frequency cepstral coefficients)
Describe vocal tract information
Distinctive for phones (short time duration)
Pitch/Tone Stream – requires smoothing
Describe vibrations of the vocal cords
Independent of Spectral features
d/dt(pitch) (the tone contour) and d²/dt²(pitch) are added
Embedded in an entire syllable
Affected by coarticulation (requires a longer time window), i.e. tone sandhi (context dependency)
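As a rough illustration of the pitch stream described above, the sketch below builds a per-frame vector of F0 plus its first and second differences. It assumes a smoothed F0 track with one value per frame and uses simple central differences; real front ends often compute deltas by regression over a wider window. The names `deltas` and `pitch_stream` are hypothetical.

```python
def deltas(values):
    """Per-frame first differences, using central differences inside the track."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    out = [float(values[1] - values[0])]  # forward difference at the start
    out += [(values[i + 1] - values[i - 1]) / 2.0 for i in range(1, n - 1)]
    out.append(float(values[-1] - values[-2]))  # backward difference at the end
    return out

def pitch_stream(f0):
    """Per-frame pitch features: [F0, delta F0, delta-delta F0]."""
    d1 = deltas(f0)
    d2 = deltas(d1)
    return [[float(p), a, b] for p, a, b in zip(f0, d1, d2)]

for frame in pitch_stream([100, 110, 130, 160]):
    print(frame)
```

A rising contour like this one yields positive deltas throughout, which is the kind of cue that separates a 2nd tone from a 4th tone.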
Embedded Tone Modeling:
Two Stream Modeling [4]
Tonal Identification Features
F0
Energy
Duration
Coarticulation (cont. speech)
Initially use a two-stream embedded model, followed by explicit modeling during lattice rescoring (alignment?)
Explicit tone modeling uses a maximum entropy framework [4] (a discriminative model)
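A maximum entropy classifier assigns each tone a probability proportional to the exponential of a weighted feature sum, normalized over all tones. The sketch below shows only that scoring step. The feature name and weights are made up for illustration and are not taken from [4]; `tone_posteriors` is a hypothetical name.

```python
import math

def tone_posteriors(features, weights):
    """Maximum entropy (softmax) posterior over tones for one syllable.

    features: dict mapping feature name to value
    weights:  dict mapping tone to a dict of feature name to weight
    """
    scores = {
        tone: sum(w.get(name, 0.0) * value for name, value in features.items())
        for tone, w in weights.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {tone: math.exp(score - top) for tone, score in scores.items()}
    total = sum(exps.values())
    return {tone: value / total for tone, value in exps.items()}

# Hypothetical weights: a rising F0 slope favors tone 2, a falling slope favors tone 4.
weights = {2: {"f0_slope": 2.0}, 4: {"f0_slope": -2.0}}
print(tone_posteriors({"f0_slope": 1.0}, weights))
```

During lattice rescoring, posteriors like these would be combined with the acoustic and language model scores of each hypothesis.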
Explicit Tone Modeling [4]
No. | Feature Description | # of Features
1 | Duration of current, previous, and following syllables | 3
2 | Previous syllable is or is not sp | 1
3 | Slope and intercept of F0 contour of current syllable, its delta, and delta-delta | 6
4 | Statistical parameters of pitch and log-energy of current syllable (e.g. max, min, mean) | 10
5 | Normalized max and mean of pitch and energy in each syllable in the context window | 12
6 | Location of current syllable within word | 1
7 | Tones of preceding and following syllables | 2
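The slope and intercept features listed above can be obtained from a least-squares line fit over the frames of a syllable. The sketch below assumes the contour is sampled at evenly spaced frames indexed 0, 1, 2, and so on; `fit_line` is a hypothetical name, and the exact fitting procedure in [4] may differ.

```python
def fit_line(contour):
    """Least-squares slope and intercept of a contour sampled at frames 0, 1, 2, ..."""
    n = len(contour)
    if n < 2:
        raise ValueError("need at least two frames to fit a line")
    mean_x = (n - 1) / 2.0
    mean_y = sum(contour) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(contour))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line([100, 110, 120, 130])
print(slope, intercept)
```

Applying the same fit to the delta and delta-delta contours gives two more pairs, for six values in total.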
Other Work
Chang, Zhou, Di, Huang, & Lee [1]
3 Methods
Powerful Language Model (no tone modeling): CER = 7.32%
Embedded 2 Stream (Tone Stream + Feature Stream): CER = 6.43%
Embedded 1 Stream (pitch extractor developed; pitch track added to feature vector): CER = 6.03%
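CER (character error rate) is the character-level edit distance between the reference and the recognizer output, divided by the number of reference characters. A minimal sketch of that computation follows; the function name is hypothetical, and scoring tools may normalize the text differently before counting.

```python
def character_error_rate(reference, hypothesis):
    """CER = (substitutions + deletions + insertions) / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    # Levenshtein distance over characters, computed one row at a time.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(reference)

print(character_error_rate("你好吗", "你好"))
```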
Other Work
Qian, Soong [3]
F0 contour smoothing
Multi-Space Distribution (MSD)
Models 2 prob. Spaces
Unvoiced: Discrete
Voiced (F0 Contour): Continuous
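The two-space idea can be illustrated with a toy observation likelihood: an unvoiced frame is scored by the weight of the discrete space, and a voiced frame by the weight of the continuous space times a density over F0. The sketch assumes a single Gaussian for the voiced space and uses `None` to mark unvoiced frames. Both are choices made for illustration, not details taken from [3], and `msd_likelihood` is a hypothetical name.

```python
import math

def msd_likelihood(f0, voiced_weight, mean, variance):
    """Observation likelihood under a two-space multi-space distribution.

    f0 is None for an unvoiced frame, otherwise the (log-)F0 value.
    voiced_weight is the probability of the voiced space; the unvoiced
    space receives the remaining probability mass.
    """
    if f0 is None:
        return 1.0 - voiced_weight  # discrete space: just its weight
    density = math.exp(-0.5 * (f0 - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)
    return voiced_weight * density  # continuous space: weight times Gaussian density

print(msd_likelihood(None, 0.7, 5.0, 0.04))
print(msd_likelihood(5.1, 0.7, 5.0, 0.04))
```

Because unvoiced frames are modeled directly, no artificial F0 values need to be invented for them.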
Other Work
Lamel, Gauvain, Le, Oparin, Meng [6]
Multi-Layer Perceptron Features
Combined with MFCCs and Pitch features
Compare Language Models
N-Gram: Back-off Language Model
Neural Network Language Model
Language Model Adaptation
Other Work
O. Kalinli [7]
Replace prosodic features with biologically inspired
auditory attention cues
Cochlear filtering, inner hair cell, etc.
Other features are extracted from the auditory
spectrum
Intensity
Frequency contrast
Temporal contrast
Orientation (phase)
Other Work
Qian, Xu, Soong [8]
Cross-Lingual Voice Transformation
Phonetic mapping between languages
Difficult for Mandarin and English
Very different prosodic features
References
[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, “Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones”
[2] Merriam-Webster Dictionary, http://www.merriam-webster.com/
[3] Yao Qian & Frank Soong, “A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition”, Science Direct, 2009
[4] Chongjia Ni, Wenju Liu, & Bo Xu, “Improved Large Vocabulary Mandarin Speech Recognition using Prosodic and Lexical Information in Maximum Entropy Framework”
[5] Yi Liu & Pascale Fung, “Pronunciation Modeling for Spontaneous Mandarin Speech Recognition”, International Journal of Speech Technology, 2004
[6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, “Improved Models for Mandarin Speech to Text Transcription”, ICASSP, 2011
[7] O. Kalinli, “Tone and Pitch Accent Classification Using Auditory Attention Cues”, ICASSP, 2011