1. Field of the Invention
The present invention relates to speech recognition and more specifically to vocal tract length normalization in real-time speech recognition.
2. Introduction
One of the fundamental difficulties with speech recognition is that different speakers sound different, even when saying lexically identical utterances. Even casual observers find the differences between speakers drastic and much more pronounced than, for example, the differences between separate utterances of the same string of words by the same speaker. Some of the inter-speaker difference can be attributed to simple acoustical properties of the human speech apparatus. Different people have different physical properties, and thus their speech production organs also differ. If the speech generation process is separated into a source and a channel, where the channel is the vocal tract, then accounting for differences in the length of the vocal tract would greatly reduce the acoustic differences between speakers. This would be true even if differences in the shape of the vocal tract are ignored.
Vocal Tract Length Normalization (VTLN) is a well-established and successful technique for speaker normalization. VTLN attempts to normalize the speech representation by removing differences caused by variations in the length of speakers' vocal tracts. The most popular way of achieving such normalization is by warping the frequency axis of the short-term magnitude spectrum. This method can be applied during the recognition stage alone, but the improvements are roughly doubled if the same algorithm is also applied to the training data before building the acoustic model. The most common implementation requires at least a few minutes of speech per speaker, so the final result has significant latency even if the recognition itself runs faster than real time.
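By way of illustration only, frequency-axis warping of a short-term magnitude spectrum can be sketched as follows. The description above does not fix a particular warping function, so this sketch assumes a piecewise-linear warp that keeps the band edges fixed. The function names, the warp factor alpha, the convention that alpha greater than 1 moves spectral peaks toward lower frequencies, and the 0.85 breakpoint ratio are all hypothetical choices made for the example, not features of any specific implementation.

```python
import numpy as np


def warp_frequencies(freqs, alpha, f_max, cut_ratio=0.85):
    """Piecewise-linear warp of a frequency axis (illustrative assumption).

    Below the breakpoint f0 the axis is scaled by alpha. Above f0 a second
    linear segment is used so that f_max still maps to f_max.
    """
    freqs = np.asarray(freqs, dtype=float)
    # Place the breakpoint so that alpha * f0 never exceeds cut_ratio * f_max.
    f0 = cut_ratio * f_max * min(1.0, 1.0 / alpha)
    upper_slope = (f_max - alpha * f0) / (f_max - f0)
    return np.where(
        freqs <= f0,
        alpha * freqs,
        alpha * f0 + upper_slope * (freqs - f0),
    )


def warp_spectrum(magnitude, alpha, sample_rate):
    """Resample one short-term magnitude spectrum along the warped axis.

    Output bin k takes the (linearly interpolated) input value found at the
    warped frequency of bin k.
    """
    magnitude = np.asarray(magnitude, dtype=float)
    f_max = sample_rate / 2.0
    freqs = np.linspace(0.0, f_max, len(magnitude))
    warped_freqs = warp_frequencies(freqs, alpha, f_max)
    return np.interp(warped_freqs, freqs, magnitude)


if __name__ == "__main__":
    # Synthetic spectrum with a single peak at 2 kHz (hypothetical test data).
    sample_rate = 16000
    freqs = np.linspace(0.0, sample_rate / 2.0, 257)
    magnitude = np.exp(-0.5 * ((freqs - 2000.0) / 100.0) ** 2)
    warped = warp_spectrum(magnitude, alpha=1.1, sample_rate=sample_rate)
    print("peak bin before:", int(np.argmax(magnitude)))
    print("peak bin after: ", int(np.argmax(warped)))
```

In such a scheme a warp factor would be chosen per speaker, for example by trying a small grid of candidate factors and keeping the one whose warped features best match the acoustic model. Selecting the factor in this way is what typically requires the accumulated speech mentioned above.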