1. Field of the Invention
The present invention relates to a method and an apparatus for producing an acoustic model for speech recognition, which is used for obtaining a high recognition rate in a noisy environment.
2. Description of the Prior Art
In conventional speech recognition for a noisy environment, noise data are superimposed on speech samples, and untrained acoustic models are trained on the noise-superimposed speech samples to produce acoustic models for speech recognition that correspond to the noisy environment, as shown in “Evaluation of the Phoneme Recognition System for Noise mixed Data”, Proceedings of the Conference of the Acoustical Society of Japan, 3-P-8, March 1988.
A configuration of a conventional acoustic model producing apparatus which performs the conventional speech recognition is shown in FIG. 8.
In the acoustic model producing apparatus shown in FIG. 8, reference numeral 201 represents a memory, reference numeral 202 represents a CPU (central processing unit) and reference numeral 203 represents a keyboard/display. Moreover, reference numeral 204 represents a CPU bus through which the memory 201, the CPU 202 and the keyboard/display 203 are electrically connected to each other.
Furthermore, reference numeral 205a represents a storage unit on which speech samples 205 for training are stored, reference numeral 206a represents a storage unit on which one kind of noise sample 206 for training is stored, and reference numeral 207a represents a storage unit on which untrained acoustic models 207 are stored. These storage units 205a-207a are each electrically connected to the CPU bus 204.
The acoustic model producing processing by the CPU 202 is explained hereinafter according to a flowchart shown in FIG. 9.
In FIG. 9, reference characters S represent processing steps performed by the CPU 202.
First, the CPU 202 reads the speech samples 205 from the storage unit 205a and the noise sample 206 from the storage unit 206a, superimposes the noise sample 206 on the speech samples 205 (Step S81), and performs a speech analysis of each of the noise-superimposed speech samples for each predetermined time length (Step S82).
Next, the CPU 202 reads the untrained acoustic models 207 from the storage unit 207a and trains the untrained acoustic models 207 on the basis of the result of the speech analysis, thereby producing the acoustic models 210 corresponding to the noisy environment (Step S83). Hereinafter, the predetermined time length is referred to as a frame, and one frame corresponds to ten milliseconds.
The one kind of noise sample 206 is data obtained from noises in a hall, in-car noises or the like, which are collected for tens of seconds.
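Steps S81 and S82 described above can be illustrated by the following minimal sketch. It is not the implementation of the conventional apparatus: the sampling rate, the looping of the noise sample when it is shorter than the speech sample, the non-overlapping framing, and the log-energy feature are all assumptions made for illustration, and the function names are hypothetical. Only the ten-millisecond frame length is taken from the description. The training of Step S83 is omitted.

```python
import math

SAMPLE_RATE_HZ = 16000  # assumption: the description does not state a sampling rate
FRAME_MS = 10           # frame length stated in the description


def superimpose(speech, noise):
    """Step S81 (sketch): add the noise sample to a speech sample.

    Assumption: the noise is looped when it is shorter than the speech.
    """
    if not noise:
        raise ValueError("noise sample must not be empty")
    return [s + noise[i % len(noise)] for i, s in enumerate(speech)]


def split_into_frames(signal, sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Cut the signal into non-overlapping frames; a trailing partial frame is dropped."""
    frame_len = sample_rate_hz * frame_ms // 1000
    n_frames = len(signal) // frame_len
    return [signal[k * frame_len:(k + 1) * frame_len] for k in range(n_frames)]


def log_energy(frame):
    """Placeholder per-frame analysis (assumption): log of the frame energy."""
    energy = sum(x * x for x in frame)
    return math.log(energy + 1e-10)


def analyze(speech, noise):
    """Steps S81 and S82 (sketch): superimpose the noise, then analyze frame by frame."""
    noisy = superimpose(speech, noise)
    return [log_energy(frame) for frame in split_into_frames(noisy)]
```

With the assumed 16 kHz sampling rate, each frame holds 160 samples, so a speech sample of 320 samples yields two analysis results that would then be passed to the training of Step S83.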
According to this producing processing, because the untrained acoustic models are trained on the basis of the speech samples on which the noise sample is superimposed, a comparatively high recognition rate can be obtained.
However, the noise environment at the time of speech recognition is usually unknown. Consequently, in the conventional producing processing described above, the recognition rate deteriorates in cases where the noise environment at the time of speech recognition differs from the noise environment at the time of training the untrained acoustic models.
In order to solve this problem, one could attempt to collect all of the noise samples which can exist at the time of speech recognition, but it is impossible to collect all of these noise samples.
In practice, therefore, a large number of noise samples which can exist at the time of speech recognition are supposed, and the supposed noise samples are collected and used to perform the training operation.
However, training the untrained acoustic models on the basis of all of the collected noise samples is inefficient because it takes an immense amount of time. In addition, in cases where the large number of collected noise samples have offset characteristics, acoustic models trained by using the noise samples having the offset characteristics can hardly recognize a wide range of unknown noises which are not associated with the offset characteristics.