In general, a speech recognition apparatus recognizes an input speech by the following processing (speech recognition processing).
That is, the speech recognition apparatus acoustically analyzes the input speech, thereby extracting a characteristic vector of a predetermined number of dimensions that indicates the characteristic amounts of the input speech. A Fourier transform or the like is used as the method for analyzing the speech.
Then, the characteristic vector is subjected to matching processing with acoustic models. A word string (words) corresponding to the series of acoustic models that matches the series of characteristic vectors is obtained as the speech recognition result.
In matching processing using, for example, a continuous HMM (Hidden Markov Model) method, the acoustic model is an HMM that uses a probability (density) function, such as at least one Gaussian probability distribution, defined in the characteristic vector space. In the matching processing, the likelihood (score) of observing the series of characteristic vectors is calculated for each series of acoustic models serving as a candidate for the speech recognition result (hereinafter referred to as a hypothesis, as appropriate) by using the Gaussian probability distributions forming the acoustic models, and the final speech recognition result is determined from the plurality of hypotheses based on the scores. In other words, the hypothesis estimated to have the highest score for the series of characteristic vectors is selected from the plurality of hypotheses as the one that best matches the input speech, and the word string corresponding to the series of acoustic models forming that hypothesis is output as the speech recognition result.
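The score-based selection described above can be illustrated with a minimal sketch. It assumes diagonal-covariance Gaussians, a fixed one-to-one alignment between frames and model states, and no HMM transition probabilities; the function names, the toy numbers, and the words "yes" and "no" are hypothetical and do not come from the apparatus described here.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at characteristic vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def score_candidate(features, states):
    """Sum per-frame log-likelihoods; frame t is assumed to be aligned to states[t]."""
    return sum(log_gaussian(x, mean, var) for x, (mean, var) in zip(features, states))

def recognize(features, candidates):
    """Return the word string whose series of models gives the highest score."""
    return max(candidates, key=lambda words: score_candidate(features, candidates[words]))

# Toy two-dimensional example with made-up numbers.
features = [[0.9, 1.1], [2.1, 1.9]]
candidates = {
    # word string -> series of (average vector, variance vector) pairs
    "yes": [([1.0, 1.0], [1.0, 1.0]), ([2.0, 2.0], [1.0, 1.0])],
    "no": [([-1.0, 0.0], [1.0, 1.0]), ([0.0, -1.0], [1.0, 1.0])],
}
best = recognize(features, candidates)
print(best)
```

A practical recognizer would also search over alignments and include transition probabilities; the sketch only shows how scores computed from Gaussian distributions rank the candidates.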
In recent years, various speech recognition apparatuses have been proposed, and they are classified into three types: speech recognition apparatuses for specific speakers, speech recognition apparatuses for unspecific speakers, and model-adaptive speech recognition apparatuses.
The speech recognition apparatus for specific speakers uses an acoustic model learned from the speech of the specific speaker and therefore recognizes the speech of that speaker with high accuracy (a low error rate). However, in the speech recognition apparatus for specific speakers, the accuracy of recognizing the speech of any speaker other than the specific speaker generally deteriorates.
The speech recognition apparatus for unspecific speakers uses acoustic models learned from the speech of a large number of arbitrary speakers. Therefore, the speech of arbitrary speakers is recognized with relatively high accuracy. However, when attention is paid to one particular speaker, the accuracy with which the speech recognition apparatus for unspecific speakers recognizes that speaker's speech is not as high as the accuracy achieved by the speech recognition apparatus for specific speakers.
The model-adaptive speech recognition apparatus initially has the same performance as the speech recognition apparatus for unspecific speakers. However, while the apparatus is used by a specific user (speaker), the acoustic model is adapted based on that user's speech, and the accuracy of recognizing the speech of the specific user improves.
That is, the model-adaptive speech recognition apparatus first recognizes speech by using an acoustic model similar to that used by the speech recognition apparatus for unspecific speakers. At this time, the mismatch between the user's input speech and the acoustic model is analyzed, and a transformation matrix for transforming the acoustic model into a model that matches (is adapted to) the input speech is obtained. Thereafter, speech is recognized by using the acoustic model obtained by transforming the original acoustic model with the transformation matrix, namely, the adapted acoustic model. The model-adaptive speech recognition apparatus performs the above-mentioned adaptation as a training operation before the user begins regular use of the apparatus. Thus, the acoustic model is transformed into one that matches the user's speech, and the accuracy of recognizing the speech of the specific user improves.
As mentioned above, in the model-adaptive speech recognition apparatus, the acoustic model is transformed into an acoustic model suited to recognizing the input speech. Consequently, from the viewpoint of the user (speaker), the speech recognition apparatus adapts to the user. Likewise, from the viewpoint of the environment of the speech recognition apparatus, the apparatus adapts to the environment.
In other words, the environment of the speech recognition apparatus includes noise at the place of use and distortion of the channel through which the user's speech passes until it is input to the speech recognition apparatus. When the model-adaptive speech recognition apparatus is used under a predetermined environment, the acoustic model is transformed so as to adapt to the sound under that environment. In this sense, the model-adaptive speech recognition apparatus adapts to its environment. The channel distortion depends on the characteristics of the microphone that converts the speech into an electrical signal and on the characteristics of a band-limited transmission line, such as a telephone line, over which the speech input to the speech recognition apparatus is transmitted.
When an HMM is used as the acoustic model, the acoustic model is adapted by linearly transforming, with the above-mentioned transformation matrix, the average vector that defines each Gaussian probability distribution forming the HMM. An advantage equivalent to that of transforming the acoustic model is obtained by linearly transforming the characteristic vector with a transformation matrix and calculating the score from the transformed characteristic vector and the acoustic model. Therefore, adaptation to the model means both the transformation of the acoustic model using the transformation matrix and the transformation of the characteristic vector. That is, the acoustic model may be adapted to the characteristic vector obtained from the user's speech, or the characteristic vector obtained from the user's speech may be adapted to the acoustic model.
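The two forms of adaptation can be written side by side as follows. This is only a notational sketch: the symbols x (characteristic vector), μ and Σ (average vector and covariance of one Gaussian probability distribution), and W and W' (transformation matrices) are assumptions introduced here for illustration and do not appear in the description above.

```latex
% Transformation of the acoustic model: the average vector is transformed,
% and the score is computed from the unmodified characteristic vector.
\hat{\mu} = W \mu, \qquad
\text{score}_{\text{model}} = \log \mathcal{N}\!\left(x;\ \hat{\mu},\ \Sigma\right)

% Transformation of the characteristic vector: the characteristic vector is
% transformed, and the score is computed from the unmodified acoustic model.
\hat{x} = W' x, \qquad
\text{score}_{\text{feature}} = \log \mathcal{N}\!\left(\hat{x};\ \mu,\ \Sigma\right)
```

In both cases the aim is to bring the characteristic vector and the average vector closer together so that the score of the correct acoustic model increases.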
The adaptation to the model is performed so as to increase the likelihood with which the characteristic vector of a given target speech is observed from the acoustic model, that is, the score of the characteristic vector calculated from the Gaussian probability distribution forming the HMM that serves as the acoustic model corresponding to the target speech (the acoustic model indicating the phoneme of the target speech, etc.). Therefore, considering adaptation that transforms the characteristic vector, the ideal transformation matrix maps the characteristic vector onto the average vector that defines the Gaussian probability distribution forming the acoustic model.
Accordingly, in the adaptation to the model, in order to make the score of the characteristic vector of the target speech calculated from the acoustic model corresponding to the target speech larger than the score calculated from any other acoustic model, a transformation matrix is obtained that executes a linear transformation under which the characteristic vector of the target speech matches the average vector defining the Gaussian probability distribution forming the acoustic model corresponding to the target speech. The transformation matrix can be calculated periodically or aperiodically. When speech is recognized, the matching processing is performed by using the characteristic vector (or the acoustic model) transformed by the transformation matrix.
The transformation matrix for adapting the acoustic model to one specific speaker is obtained by using a plurality of series of characteristic vectors obtained from a plurality of utterances of the specific speaker. Therefore, a matrix that matches each of the plurality of characteristic vectors with its corresponding average vector must be obtained as the transformation matrix. One method for obtaining a transformation matrix that maps the plurality of characteristic vectors to the corresponding average vectors uses linear regression (the least-squares method). The transformation matrix obtained by linear regression maps the characteristic vectors obtained from the speech of the specific speaker to vectors that minimize a statistical error from the corresponding average vectors (here, the total of the squared errors). Therefore, in general, the transformation matrix does not make an arbitrary characteristic vector obtained from the speech of the specific speaker match the corresponding average vector completely.
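The least-squares estimate mentioned above can be sketched as follows. The sketch assumes a plain linear map W with no bias term that minimizes the total of the squared errors between W x_i and the corresponding average vector, solved through the normal equations; the function names and the toy data are hypothetical, and practical adaptation methods differ in detail.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix; more adaptation data is needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [rv - f * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def estimate_transform(features, means):
    """Return W minimizing sum_i ||W x_i - mu_i||^2.

    With X and M holding the vectors as rows, W = M^T X (X^T X)^-1.
    """
    x, m = features, means
    return matmul(matmul(transpose(m), x), inverse(matmul(transpose(x), x)))

def apply_transform(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def squared_error(w, features, means):
    """Total squared error between transformed vectors and average vectors."""
    return sum((y - mu_k) ** 2
               for x, mu in zip(features, means)
               for y, mu_k in zip(apply_transform(w, x), mu))

# Made-up adaptation data: three characteristic vectors and the average
# vectors of the Gaussians they are assumed to be aligned with.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
means = [[2.0, 0.0], [0.0, 0.5], [2.3, 0.2]]
w = estimate_transform(features, means)
err = squared_error(w, features, means)
print(w, err)
```

Because three vectors must be fitted by a single 2-by-2 matrix, the residual error in this toy example is nonzero, which mirrors the point that the transformed characteristic vectors generally do not coincide exactly with the average vectors.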
The methods for adaptation to the model include not only the above-mentioned method but also other methods that differ in detail. In any of these methods, as in the above-mentioned method, the characteristic vector of the target speech or the acoustic model corresponding to the target speech is basically transformed so that the likelihood of observing the characteristic vector from the acoustic model is maximized.
When, in the model-adaptive speech recognition apparatus, adaptation to the speech of one specific user or to a specific environment is performed many times, the accuracy of recognizing the speech of that user or the speech under that environment improves while the accuracy of recognizing the speech of other users or the speech under other environments deteriorates. As a result, the model-adaptive speech recognition apparatus comes to have the same performance as the speech recognition apparatus for specific speakers.
Even after the model-adaptive speech recognition apparatus has adapted to a specific user or a specific environment as mentioned above, it can adapt to other users or other environments by being used by those users or under those environments.
However, immediately after another user starts to use the apparatus or the apparatus starts to be used under another environment, the acoustic model of the speech recognition apparatus is still adapted to the first user or the first environment. Thus, the speech recognition accuracy deteriorates severely until the acoustic model adapts to the other user or the other environment.
Further, in some cases, the acoustic model adapted to the first user or the first environment cannot completely adapt to other users or other environments. In such cases, the acoustic model adapted to the first user or the first environment must be returned (reset) to the initial acoustic model and then adapted to the other users or the other environments.
To deal with the above-mentioned case, the following speech recognition apparatus exists. That is, a plurality of sets of acoustic models are prepared, and a different set of acoustic models is adapted to each user. The speech recognition apparatus recognizes the speech of each of the plurality of users by using the acoustic models adapted to that user, and therefore high speech recognition accuracy is obtained for all the users, as with the speech recognition apparatus for specific speakers.
However, the above-mentioned speech recognition apparatus recognizes speech by using the acoustic models adapted to the user who is speaking, and therefore it must be informed of which user is speaking. Thus, it is troublesome that the user must input information identifying himself or herself by operating a button or the like before starting to use the apparatus.