1. Field of the Invention
The present invention relates to an apparatus and a computer program product for creating a standard speaker model representing a distribution of features extracted from speech of a standard speaker.
2. Description of the Related Art
In a speech recognition apparatus that recognizes speech from various speakers, the recognition rate for a specific speaker may be markedly lower than the recognition rates for other speakers. Speaker normalization is a widely known technology for overcoming this problem. In speaker normalization, the speaker characteristics of a feature vector are normalized by applying a predetermined transformation to the feature vector extracted from speech.
Stemmer et al., “Adaptive training using simple target models”, in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005, discloses a technology for normalizing a feature vector by using an approach called constrained maximum likelihood linear regression (CMLLR). CMLLR requires a standard speaker model representing a distribution of feature vectors of a standard speaker. The feature vectors are transformed so that the series of feature vectors conforms to the standard speaker model as closely as possible.
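As an illustration of what "conforms to the standard speaker model" can mean, the following is a minimal sketch, not the method of Stemmer et al. It assumes the transformation is affine (y = A x + b) and that the standard speaker model is a Gaussian mixture with diagonal covariances. The function names `gmm_log_likelihood` and `cmllr_objective` are hypothetical. The sketch only evaluates the quantity that a CMLLR-style estimator would maximize: the log-likelihood of the transformed frames plus the log-determinant (Jacobian) term of the transform.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Per-frame log-likelihood of x (T, D) under a diagonal-covariance GMM.

    weights: (K,), means: (K, D), variances: (K, D).
    """
    diff = x[:, None, :] - means[None, :, :]                  # (T, K, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                         # (T, K)
    log_joint = log_gauss + np.log(weights)                   # (T, K)
    # Log-sum-exp over components, stabilized by subtracting the row maximum.
    m = log_joint.max(axis=1, keepdims=True)
    return (m + np.log(np.sum(np.exp(log_joint - m), axis=1, keepdims=True)))[:, 0]

def cmllr_objective(x, A, b, weights, means, variances):
    """Log-likelihood of the transformed frames A x + b under the standard
    speaker model, including the T * log|det A| Jacobian term."""
    y = x @ A.T + b
    _, logabsdet = np.linalg.slogdet(A)
    return gmm_log_likelihood(y, weights, means, variances).sum() + x.shape[0] * logabsdet
```

A transform that moves a speaker's frames toward the standard speaker model yields a higher value of `cmllr_objective` than the identity transform; an actual estimator would search over A and b to maximize it.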
The standard speaker model is, for example, a Gaussian mixture model (GMM). Anastasakos et al., “Speaker adaptive training: a maximum likelihood approach to speaker normalization”, in Proceedings of ICASSP, 1997, and Gales, “Maximum likelihood linear transformations for HMM-based speech recognition”, Computer Speech and Language, vol. 12, 1998, disclose a technology for calculating GMM parameters, including the mixing coefficient, mean vector, and covariance matrix of each mixture component, by using an approach called speaker adaptive training, and for creating the standard speaker model from those GMM parameters. In speaker adaptive training, the transformation parameter for speaker normalization and the GMM parameters that maximize the likelihood of the training data are calculated by using an expectation-maximization (EM) algorithm.
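The GMM part of such an EM update can be sketched as follows. This is a simplified illustration under stated assumptions, not the speaker adaptive training procedure of the cited references: it assumes diagonal covariances and treats the feature vectors as already normalized, so the per-speaker transformation parameter is not re-estimated here. The function name `em_step` and the `var_floor` parameter are hypothetical.

```python
import numpy as np

def em_step(x, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a diagonal-covariance GMM.

    x: (T, D) feature vectors; weights: (K,); means, variances: (K, D).
    Returns the updated parameters and the total log-likelihood of x
    under the parameters that were passed in.
    """
    # E-step: posterior probability (responsibility) of each component per frame.
    diff = x[:, None, :] - means[None, :, :]                  # (T, K, D)
    log_joint = np.log(weights) - 0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                         # (T, K)
    m = log_joint.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.sum(np.exp(log_joint - m), axis=1, keepdims=True))
    resp = np.exp(log_joint - log_norm)                       # (T, K)

    # M-step: re-estimate mixing coefficients, means, and variances.
    counts = resp.sum(axis=0)                                 # (K,)
    new_weights = counts / x.shape[0]
    new_means = (resp.T @ x) / counts[:, None]
    new_vars = (resp.T @ (x ** 2)) / counts[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, var_floor)                # guard against collapse
    return new_weights, new_means, new_vars, float(log_norm.sum())
```

Calling `em_step` repeatedly until the returned log-likelihood stops increasing gives a maximum likelihood estimate of the GMM parameters for fixed features. A full speaker adaptive training loop would additionally alternate this update with re-estimation of each speaker's transformation.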
A typical speech recognition system is assumed to work in a noisy environment. Therefore, it is desirable to create the standard speaker model from training data recorded in a noisy environment similar to the actual working environment. If training data recorded in a noiseless environment is used, the accuracy of speech recognition decreases because of the mismatch between the training environment and the test environment.
However, if feature vectors extracted from speech recorded in a noisy environment are used as the training data for the speaker normalization disclosed in Stemmer et al., an incorrect transformation parameter for speaker normalization is calculated in the speaker adaptive training method. This is because the fluctuation in the feature vectors due to noise masks the fluctuation in the feature vectors due to the speaker characteristics. As a result, not only the transformation parameter for speaker normalization but also the GMM parameters are calculated incorrectly. In other words, if training data recorded in a noisy environment is used to prevent a mismatch between the training environment and the test environment, the standard speaker model cannot be created stably.