Automatic speech recognition (ASR) is a technique for recognizing the content of what a speaker says, and has been under development for many years. Recognizing speech content remains difficult for ASR because acoustic conditions vary widely, for example across different speakers, speaking styles, background noise and audio transmission channels. To resolve these problems, there has been much interest in the use of normalization and adaptation techniques to take this highly non-homogeneous data into account.
The normalization of speech data includes transmission channel normalization, microphone normalization, speech speed normalization, speaker normalization and cepstrum normalization. Among these, speaker normalization is one of the methods often used for removing heterogeneity in speech data. Speaker normalization mainly comprises normalization of the speech spectrum, and the normalization factor is usually referred to as a warping factor, which reflects the characteristics of speakers of a certain type and is used for normalizing the corresponding speech spectrum. After speaker normalization, the differences among the speech of different speakers due to the acoustic characteristics and speaking styles of the speakers may be eliminated, so that the content of the normalized speech is easier to recognize.
Therefore, in speech recognition, the stage of speaker normalization may be regarded as a pre-recognition or pre-processing stage before the speech content recognition stage. That is, before recognition of speech content, the class of the speech data (corresponding to the class of the speaker) is first recognized and the speech spectrum is normalized depending on the class of the speaker; the content is then recognized.
Further, said “pre-recognition” or “pre-processing” stage comprises two sub-stages: recognition of the class of the speaker; and normalization of the speech spectrum in accordance with the recognized class of the speaker. In this way, the influence of vocal tract differences among speakers on the speech content recognition may be removed. Depending on the application, the number of speaker classes may be larger or smaller. As an example with a small number of classes, the speakers may be classified into “male” and “female”, or “adult”, “children” and “aged”. More classes are possible, and even each human being may be regarded as one class, in which case the individual speaker is first recognized as a class. The resulting computation load, however, will be very heavy.
In typical normalization, each speaker class has one or more corresponding normalization factors, that is, warping factors. Physically, the warping factor is a factor for compressing or extending the spectrum of a speaker. In linear normalization, each speaker class corresponds to one normalization factor, that is, the speaker spectrum is linearly normalized; in non-linear normalization, each speaker class may correspond to multiple normalization factors, that is, the speaker spectrum is non-linearly normalized.
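As an illustration of linear normalization with a single warping factor, the following minimal sketch resamples a magnitude spectrum along the frequency axis. The function name `warp_spectrum`, the interpolation scheme, the direction convention for `alpha` and the example values are assumptions made for illustration only; they are not taken from any cited method.

```python
def warp_spectrum(spectrum, alpha):
    """Linearly warp a magnitude spectrum along the frequency axis.

    Output bin k takes its value from input position k * alpha
    (linear interpolation between neighbouring bins). Under this
    assumed convention, alpha > 1 compresses the spectrum toward
    low frequencies and alpha < 1 extends it toward high frequencies.
    """
    n = len(spectrum)
    warped = []
    for k in range(n):
        src = k * alpha
        if src >= n - 1:
            # Positions past the last bin are clamped to the last bin.
            warped.append(spectrum[-1])
            continue
        lo = int(src)
        frac = src - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return warped


# Hypothetical toy spectrum with a single peak at bin 4.
toy_spectrum = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
compressed = warp_spectrum(toy_spectrum, 2.0)  # peak moves to a lower bin
extended = warp_spectrum(toy_spectrum, 0.5)    # peak moves to a higher bin
```

A non-linear normalization could, under the same assumptions, apply different factors to different frequency bands instead of a single `alpha`.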
Like ordinary recognition, the recognition of speaker class also comprises a training stage and a recognition stage. Different speaker classes and a corresponding classifier are obtained after completion of the training stage. In the recognition stage, the classifier obtained in the training stage classifies the speech samples into respective speaker classes.
Conventionally, there are basically two methods for obtaining the warping factor: a parametric approach, disclosed, for example, in U.S. Pat. No. 6,236,963 (Speaker Normalization Processor Apparatus for Generating Frequency Warping Function, and Speech Recognition Apparatus with Said Speaker Normalization Processor Apparatus); and a line search.
Among the widely used normalization techniques, vocal tract length normalization (VTLN) is one of the most popular methods to reduce inter-speaker variability. VTLN is in fact a normalization of the speaker spectrum. The vocal tract is the channel in the human body for producing voice, including the lips, oral cavity and the other vocal organs. The positions and shapes of the various vocal organs determine the voice to be produced. In other words, the shape of the vocal tract determines the voice to be produced. In a broad sense, vocal tract length is one of the shape elements of the vocal tract; but in a narrow sense, in the field of speech normalization, the vocal tract length is distinguished from the vocal tract shape, that is, the vocal tract shape refers to the shape elements other than the vocal tract length.
In the conventional parametric approach, unsupervised GMM (Gaussian Mixture Model) classification is adopted. The classification comprises a training stage and a recognition stage. In the training stage, the training samples are classified without supervision, and the classifier is then described with a GMM. Then, the same warping factor is applied to all the speakers belonging to the same class.
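The training and recognition stages of such an unsupervised GMM classification can be sketched as follows. This is a minimal one-dimensional illustration under stated assumptions: each speaker is represented by a single scalar feature, the feature values and the per-class warping factors are invented for the example, and the helper names (`fit_gmm_1d`, `classify`, `warping_factor_for`) are hypothetical. A practical system would use multi-dimensional features and an established GMM implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, n_classes=2, n_iter=50, min_var=1.0):
    """Training stage: fit a 1-D GMM with EM, without class labels.

    Assumes n_classes >= 2. Initial means are spread evenly between
    the smallest and largest sample.
    """
    n = len(samples)
    lo, hi = min(samples), max(samples)
    means = [lo + (hi - lo) * j / (n_classes - 1) for j in range(n_classes)]
    grand_mean = sum(samples) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in samples) / n, min_var)
    variances = [grand_var] * n_classes
    weights = [1.0 / n_classes] * n_classes
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            if total == 0.0:
                resp.append([1.0 / n_classes] * n_classes)
            else:
                resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and variance per component.
        for j in range(n_classes):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2
                        for r, x in zip(resp, samples)) / nj
            variances[j] = max(var_j, min_var)
    return weights, means, variances


def classify(x, model):
    """Recognition stage: index of the most likely speaker class."""
    weights, means, variances = model
    scores = [w * gaussian_pdf(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))


def warping_factor_for(x, model, class_factors):
    """Every speaker assigned to a class shares that class's factor."""
    return class_factors[classify(x, model)]


# Hypothetical per-speaker feature values (arbitrary units).
training_features = [480, 500, 510, 520, 495, 600, 610, 590, 620, 605]
model = fit_gmm_1d(training_features)
# Hypothetical warping factors, one per class.
class_factors = [1.08, 0.94]
```

Because the factor is looked up per class rather than per speaker, the cost at recognition time is one classification per speaker, which is consistent with the small computation load attributed to the parametric approach.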
Since VTL (vocal tract length), which reflects the difference in vocal tract between different speakers, is related to formant positions, the formant frequency reflecting the VTL can be calculated based on a linear predictive model. The disadvantage is that the formant frequency and its relationship with VTL are highly dependent on the context (Li Lee, Richard C. Rose, “Speaker Normalization Using Efficient Frequency Warping Procedures,” in Proc. ICASSP Vol. 1, pp. 353-356, Atlanta, Ga., May 1996), and can vary largely with different context even for the same speaker. In addition, in the current parametric approach, pitch is not taken into account when selecting features, and only formant frequencies are adopted. Moreover, the current parametric approach does not consider the VTL's high dependency on the context, and thus the classification does not consider the context. Nevertheless, the parametric approach is still widely adopted due to its small computation load and stable computation results.
In the line search method, a speaker is classified by maximizing the probability of recognizing an utterance given a particular acoustic model in the content recognition stage. Strictly speaking, normalization based on the line search warping factor is not exactly vocal tract length normalization, because the classification is conducted so as to increase the matching score of the acoustic model in the content recognition stage. Thus, what is reflected is not only the difference in vocal tract length, but a mixed result of various factors. For example, variation in vocal tract shape could also affect the line search warping factor.
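The selection step of a line search can be sketched as a loop over candidate warping factors that keeps the candidate with the best matching score. In the minimal illustration below, the acoustic model is replaced by a toy stand-in (negative squared distance to a fixed template), and the candidate grid, the spectra and all function names are assumptions made for the example. In a real system each candidate would require a full decoding pass against the acoustic model.

```python
def warp_spectrum(spectrum, alpha):
    """Linearly warp a spectrum: output bin k reads input position k * alpha."""
    n = len(spectrum)
    warped = []
    for k in range(n):
        src = k * alpha
        if src >= n - 1:
            warped.append(spectrum[-1])  # clamp past the last bin
            continue
        lo = int(src)
        frac = src - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return warped


def toy_matching_score(spectrum, template):
    """Toy stand-in for the acoustic-model matching score (higher is better)."""
    return -sum((s - t) ** 2 for s, t in zip(spectrum, template))


def line_search(spectrum, template, candidates):
    """Try every candidate warping factor and keep the best-scoring one."""
    best_alpha, best_score = None, float("-inf")
    for alpha in candidates:
        score = toy_matching_score(warp_spectrum(spectrum, alpha), template)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


# Hypothetical data: the "model" expects a peak at bin 4,
# while this speaker's peak sits at bin 5.
template = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
speaker_spectrum = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
candidates = [0.75, 1.0, 1.25, 1.5]
best_alpha, best_score = line_search(speaker_spectrum, template, candidates)
```

The chosen factor is simply whichever one makes the warped input match the model best, so it absorbs any mismatch the score is sensitive to, not only differences in vocal tract length. The cost also grows in proportion to the number of candidates evaluated.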
The major disadvantage of the line search is that it is very expensive in computation, since the speech decoding process must be carried out for every possible class in order to select the one with the best matching score. Moreover, classification using ML (Maximum Likelihood) to search for the best matching score with the acoustic models in the content recognition stage makes the classification very dependent on the acoustic model in the content recognition stage, and the result is very unstable. See Zhan Puming, Waibel Alex, “Vocal tract length normalization for large vocabulary continuous speech recognition”, CMU-CS-97-148, May 1997.