1. Field of the Invention
The present invention relates to a voice recognition apparatus which employs a neural network and is capable of recognizing any word voiced by unspecified persons. More particularly, the invention relates to a voice recognition apparatus which provides more efficient nonlinear matching along the time axis.
2. Description of the Related Art
Neural networks, which model the neurons of the human brain, have recently been applied to the field of voice recognition. The inventors of the present application know that various approaches using a neural network have been attempted, such as a multilayer perceptron type neural network using an error back propagation method (BP method, for short), which is described in Nakagawa: "Voice Information Processing" Bit magazine, September 1989 issue, pp. 183-195, Vol. 21, No. 11, and Shikano: "Application of Neural Network to Voice Information Processing" in the Proceedings of the Communications Society, pp. 27-40, September, 1988.
Voice recognition apparatuses are generally categorized into two types of systems. In one system, an input voice is analyzed frame by frame so that feature vectors (also called characteristic vectors or feature parameters) are extracted for each frame. The extracted feature vectors are applied to an input layer of a neural network as two-dimensional patterns arranged in time series. Meanwhile, a teacher signal for identifying the input voice is applied to an output layer of the neural network, thereby allowing a weight coefficient of each link to be obtained using the BP method. An actually voiced word has a slightly different length each time, even if the same word is voiced by the same person, whereas the number of units included in the input layer of the neural network is constant. The input voice data series is therefore normalized to a predetermined length, and the feature vectors of an unknown input voice are applied to the neural network, which has learned the weight coefficient of each link based on the feature vectors. The input voice is then recognized depending on an output value of each unit included in the output layer of the neural network.
In the other system, referred to as a multi-template system, the voice data of each word, given by many unspecified speakers, are broken into segments. Then, the voice data, based on the center of each segment or an average value of the voice data belonging to each cluster, are stored as a reference pattern. For segmenting the word voice data, several algorithms are used in combination. Then, for an unknown input voice, the distances between the feature pattern of the input voice and the reference patterns of the stored words are all calculated with a DP (Dynamic Programming) matching method, so that the voiced word is recognized as the word matched to the reference pattern with the minimum distance.
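The minimum-distance decision described above can be illustrated with a minimal sketch. The text does not specify the local distance, the DP step pattern or the normalization, so the Euclidean frame distance, the symmetric three-way step and the division by the summed lengths used here are assumptions, and the names `dp_matching_distance` and `recognize` are hypothetical.

```python
import math


def dp_matching_distance(pattern, reference):
    """Accumulated DP matching distance between two feature-vector sequences.

    Assumptions (not from the source): Euclidean local distance, a symmetric
    step pattern (vertical, horizontal, diagonal), and normalization by the
    sum of the two sequence lengths.
    """
    n, m = len(pattern), len(reference)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning the first i frames
    # of the pattern with the first j frames of the reference.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(pattern[i - 1], reference[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / (n + m)


def recognize(pattern, templates):
    """Return the word whose reference pattern has the minimum distance.

    templates maps each word to a list of reference patterns, so a word
    may be represented by several templates.
    """
    best_word, best_dist = None, float("inf")
    for word, references in templates.items():
        for reference in references:
            dist = dp_matching_distance(pattern, reference)
            if dist < best_dist:
                best_word, best_dist = word, dist
    return best_word
```

In this sketch an input whose first frame is held twice as long as in the reference still matches with zero distance, which is the nonlinear time alignment that DP matching provides.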
The foregoing systems both require detection of a head and a tail of the input voiced word. The detection of a voice interval defined by the word head and tail depends on whether or not a short-period power larger than a predetermined threshold value continues for a constant time or more. Two threshold values are prepared for the short-period power, and the voice interval can be detected by using these two threshold values in combination. Alternatively, it may be detected by using zero crossings or a difference between a noise interval and the voice interval itself.
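A minimal sketch of power-based interval detection follows. The text does not say how the two threshold values are combined, so the scheme here is an assumption: the higher threshold must be exceeded for a minimum number of consecutive frames to confirm speech, and the head and tail are then extended outward while the power stays above the lower threshold. The function and parameter names are hypothetical.

```python
def short_period_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_voice_interval(power, low, high, min_frames):
    """Return (head, tail) frame indices, inclusive, or None if no speech.

    Assumed scheme: speech is confirmed once the power exceeds `high` for
    `min_frames` consecutive frames; the boundaries are then widened while
    the power exceeds `low`.
    """
    run_start = None
    for i, p in enumerate(power):
        if p > high:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_frames:
                break  # speech confirmed; i is the last confirmed frame
        else:
            run_start = None  # run too short, e.g. a click or a cough
    else:
        return None  # loop ended without confirming speech

    head, tail = run_start, i
    while head > 0 and power[head - 1] > low:
        head -= 1
    while tail + 1 < len(power) and power[tail + 1] > low:
        tail += 1
    return head, tail
```

The minimum-duration condition rejects short bursts, while the lower threshold lets the detected interval include weaker frames adjacent to the confirmed speech.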
The voiced word is, however, a time-series pattern, so that an actually voiced word has a different duration each time, even if the same word is voiced, and exhibits nonlinear fluctuation of phonemes with respect to time. Further, it is desirable to prevent false recognition due to a cough or paper noise. For distinguishing unnecessary sound from the voice, however, a word-spotting method is required for automatically extracting only predetermined words from the voice of a person reading a manuscript.
One of the foregoing methods, that is, the multi-template system using the DP matching method, requires detection of a voice interval before the recognition processing of the voice. However, it is not easy to detect the voice interval properly, and it is quite difficult to detect a head of a word voice, a tail consonant and a low vowel. Further, it is necessary to properly remove noises such as a respiratory sound added to the voice tail. The aforementioned methods, which depend on the short-period power, zero crossings or the difference between the voice interval and the noise interval, do not meet those requirements. This results in erroneous detection of the voice interval and a lower recognition rate.
If the word-spotting method is used, another shortcoming arises: the continuous DP matching requires a large amount of calculation and may cause an extra word to be added or an actual word or phoneme to be deleted.
The foregoing method using the neural network requires the input voice interval to be normalized, because the input layer included in the neural network includes a predetermined number of units. If the input voice interval is linearly normalized, however, the time of occurrence of the dynamic features that characterize the phonemes needed for identifying the voiced word is very often distorted or shifted, so that the nonlinear lengthening or shortening of the word voice pattern cannot be corrected.
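Linear normalization to a fixed number of frames can be sketched as follows. Linear interpolation between neighboring frames is assumed here as one common way to do it, and the name `normalize_length` is hypothetical.

```python
def normalize_length(frames, target_len):
    """Linearly resample a sequence of feature vectors to target_len frames.

    Each output frame k is taken at the proportional position
    k * (n - 1) / (target_len - 1) on the original time axis and
    interpolated between the two neighboring input frames.
    """
    n = len(frames)
    if n == 0 or target_len < 1:
        raise ValueError("need at least one input frame and one output frame")
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo = int(pos)              # frame just before the position
        hi = min(lo + 1, n - 1)    # frame just after, clamped at the end
        w = pos - lo               # interpolation weight of the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out
```

Because every frame is moved by the same proportional factor, this mapping cannot compensate for a word in which only one phoneme is lengthened: the onsets of the following phonemes land at different positions in the normalized pattern from one utterance to the next.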
Further, an ordinary voice recognition apparatus has to remove the voiceless intervals and the noise intervals before and after the speech from the signal input by the microphone in order to extract the voice interval, that is, it has to detect the voice interval.
The detection of the voice interval is not so difficult if the signal has a high S/N ratio. In this state, the voice interval may be defined as an interval where the power series extracted from the voice signal is larger than a threshold value.
In actual environments, however, there exist various noises, so that the S/N ratio may be degraded. Hence, it is difficult to detect a weak frictional sound or a voiced sound with a small amplitude, which often occur at the head and the tail of the voice. Moreover, an unsteady noise may be erroneously detected as a voice interval.
For distinguishing a voice interval from the background noises, there has been proposed a method for selecting a proper voice interval from a plurality of interval candidates.
This method mainly takes the two steps of voice-recognizing each interval candidate and selecting as a proper voice interval the interval at which the highest checking value can be obtained.
As an improvement of the above method, a method has been proposed for setting all the times on the data as front endpoint and tail candidates, voice-recognizing all the intervals, and finding the interval at which the highest checking value can be obtained. One example of this method is the word spotting using a continuous DP method as mentioned above.
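The exhaustive search over all head and tail candidates can be sketched as below. The recognizer itself is not modeled: `score` is a hypothetical callable standing in for whatever checking value the recognition step returns for an interval, and `best_interval` is likewise a hypothetical name.

```python
def best_interval(num_frames, score, min_len=1):
    """Try every (head, tail) pair and keep the highest checking value.

    score(head, tail) is a stand-in for the recognizer's checking value
    on frames head..tail inclusive. Returns (value, head, tail), or None
    if no interval of at least min_len frames exists.
    """
    best = None
    for head in range(num_frames):
        for tail in range(head + min_len - 1, num_frames):
            value = score(head, tail)
            if best is None or value > best[0]:
                best = (value, head, tail)
    return best


if __name__ == "__main__":
    # Toy checking value: +1 for each voiced frame, -1 for each unvoiced one.
    voiced = [0, 0, 1, 1, 1, 0]

    def toy_score(head, tail):
        return sum(1 if v else -1 for v in voiced[head:tail + 1])

    print(best_interval(len(voiced), toy_score))
```

With N frames this brute-force form evaluates N(N+1)/2 intervals when `min_len` is 1, so the number of recognition passes grows quadratically with the length of the input.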
The voice recognition apparatus employing the word spotting method with the continuous DP method has the disadvantage that it offers low "reject" capability and low noise resistance. Moreover, it may add an unnecessary word or drop a word or phoneme, and it requires a great amount of computation and storage, because the DP matching has to be performed at all times.
Moreover, the foregoing voice recognition apparatus has to detect the front endpoint in advance and may erroneously recognize or reject the subject word if the detection error is large.