1. Field of the Invention
The present invention relates to a voice recognition apparatus that can extract feature variables from an input voice independently of speakers and languages, absorb speaker-dependent fluctuations, and effectively reduce the amount of calculation required in matching for voice recognition.
2. Description of the Related Art
Voice recognition apparatuses are generally divided into two systems. One is a word voice recognition system, in which word voices are recognized through matching against reference patterns that use words as units. The other is a phoneme recognition system, in which the phonemes of words are recognized through matching against standard patterns that use phonemes or syllables, which are smaller than words, as units.
The word voice recognition system is free from false recognition due to articulatory coupling and can provide a high recognition rate. However, the number of reference patterns increases as the vocabulary grows, which requires a large memory capacity and a great deal of calculation in matching. In particular, when recognizing many unspecified speakers, a plurality of reference patterns (multi-templates) is needed for each word, because voices fluctuate greatly from speaker to speaker. Such voice fluctuations are attributable to various factors. Because speakers differ in physiological factors such as sex, age, and vocal tract length, voices fluctuate as speakers change. Even for a single speaker, voice fluctuations arise if the speaker produces voices in a different manner (loudness, speaking rate, etc.) depending on circumstances, or if the surrounding noise varies.
The problem arising from an increased vocabulary has therefore been dealt with as follows. To reduce the number of reference patterns used in matching, a preliminary selection of the reference patterns is performed before the principal matching, based on the intermediate result of DP matching among the reference patterns, durations, global features, and local features of the input voices.
However, no approach has yet been found that completely eliminates voice fluctuations due to a change of speakers.
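The two-stage matching described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the preliminary selection here uses duration alone (one of the several criteria named above), the frame distance is Euclidean, and the names `dtw_distance`, `preselect`, `recognize_word`, and `max_duration_ratio` are hypothetical and do not come from the related art.

```python
import math

def dtw_distance(a, b):
    """DP (dynamic time warping) distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # assumed local distance: Euclidean
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def preselect(input_frames, templates, max_duration_ratio=1.5):
    """Preliminary selection (assumption: by duration only).

    Keeps the reference patterns whose length is within the given ratio
    of the input length, so fewer patterns reach the principal matching.
    """
    n = len(input_frames)
    kept = {}
    if n == 0:
        return kept
    for word, frames in templates.items():
        if not frames:
            continue
        ratio = max(n, len(frames)) / min(n, len(frames))
        if ratio <= max_duration_ratio:
            kept[word] = frames
    return kept

def recognize_word(input_frames, templates, max_duration_ratio=1.5):
    """Principal matching: DP matching against the preselected patterns only."""
    candidates = preselect(input_frames, templates, max_duration_ratio)
    if not candidates:
        return None
    return min(candidates, key=lambda w: dtw_distance(input_frames, candidates[w]))

templates = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(5.0,), (5.0,), (4.0,)],
    "long": [(0.0,)] * 10,
}
spoken = [(0.0,), (1.0,), (1.0,), (2.0,)]
result = recognize_word(spoken, templates)
```

In this toy data the ten-frame pattern is discarded by the duration check before any DP matching is run, which is the saving the preliminary selection is meant to provide.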
Applicants know that, among the fluctuations depending on speakers, sound source characteristics can be compensated to some extent by passing voices through first-order to third-order adaptive inverse filters of the critical damping type. Attempts have also been made to normalize differences between individual speakers by subjecting a voice signal to a simple conversion using the first through third formants.
When an input voice signal is recognized by a voice recognition apparatus of the phoneme recognition system, voice signals are frequency-analyzed in advance by a feature extracting device to extract several feature variables for each phoneme relating to the objects to be recognized. These feature variables are stored in a storage section as reference patterns for the respective phonemes. Each word is then expressed as a series of such phoneme reference patterns, and the resulting series are stored in a storage device word by word, in association with the phoneme series of the words, as a word dictionary. When an unknown voice is input, the feature extracting device extracts feature variables from the input voice for each frame in the same manner as mentioned above. The similarity between the extracted feature variables of each frame of the unknown voice and the phoneme reference patterns stored in the storage section is then checked. The phoneme corresponding to the phoneme reference pattern with the maximum similarity is determined to be the phoneme of that frame. Phonemes of subsequent frames are determined successively in the same way, so that the unknown voice is expressed as a series of phonemes. Afterward, the similarity between the phoneme series obtained from the unknown voice and the series of phoneme reference patterns for the respective words in the word dictionary is checked. The word corresponding to the series of phoneme reference patterns with the maximum similarity is determined to be the word of the input voice.
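The frame-by-frame procedure above can be illustrated with the following sketch. It rests on several assumptions not stated in the related art: frame similarity is taken as the negative Euclidean distance, consecutive identical frame labels are merged into one phoneme, and the similarity between phoneme series is measured with the standard-library `difflib.SequenceMatcher`. All function names and the toy data are hypothetical.

```python
import math
from difflib import SequenceMatcher
from itertools import groupby

def classify_frame(frame, phoneme_refs):
    """Return the phoneme whose reference pattern is closest to the frame."""
    best_phoneme, best_dist = None, float("inf")
    for phoneme, patterns in phoneme_refs.items():
        for pattern in patterns:
            d = math.dist(frame, pattern)  # smaller distance = higher similarity
            if d < best_dist:
                best_phoneme, best_dist = phoneme, d
    return best_phoneme

def frames_to_phonemes(frames, phoneme_refs):
    """Label every frame, then merge runs of the same label (an assumption)."""
    labels = [classify_frame(f, phoneme_refs) for f in frames]
    return [p for p, _ in groupby(labels)]

def recognize_word_by_phonemes(frames, phoneme_refs, word_dictionary):
    """Pick the dictionary word whose phoneme series is most similar."""
    series = frames_to_phonemes(frames, phoneme_refs)

    def similarity(word):
        return SequenceMatcher(None, series, word_dictionary[word]).ratio()

    return max(word_dictionary, key=similarity)

# Toy two-dimensional feature variables; real features would come from
# frequency analysis of the voice signal.
phoneme_refs = {"a": [(1.0, 0.0)], "k": [(0.0, 1.0)], "i": [(1.0, 1.0)]}
word_dictionary = {"aka": ["a", "k", "a"], "aki": ["a", "k", "i"]}
frames = [(0.9, 0.1), (1.0, 0.0), (0.1, 0.9), (0.9, 1.1), (1.0, 0.9)]

phonemes = frames_to_phonemes(frames, phoneme_refs)
word = recognize_word_by_phonemes(frames, phoneme_refs, word_dictionary)
```

Because every decision depends on the per-frame nearest reference pattern, any speaker-dependent shift in the feature variables propagates directly into the phoneme series.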
In acoustic analysis and feature extraction, a voice can be expressed with a smaller number of parameters through linear prediction analysis (LPC) by assuming the voice to be an all-pole model. It has been proposed to use such a model-based approach to directly express the structure of the articulatory organs and their motional characteristics, thereby effectively describing the vocal tract cross-sectional area function with the aid of a model. This is called an articulatory model using an articulatory parameter x (Shirai and Honda: "Estimation of Articulatory Parameters from Speech Waves", Trans. IECE Japan, 61-A, 5, pp. 409-416, 1978). The articulatory parameter x is composed of an opening/closing angle of the lower jaw: X(J), an antero-posterior (longitudinal) deformation of the tongue surface: X(T1), a vertical deformation of the tongue: X(T2), an opening area/extension of the lip: X(L), a shape of the glottis: X(G), and an opening of the velum (degree of nasalization): X(N). Thus, the articulatory parameter can be expressed by: x=[X(T1), X(T2), X(J), X(L), X(G), X(N)]. Assuming that a non-linear articulatory model for converting the articulatory parameter x to an acoustic parameter is given, the articulatory parameter x can be derived in the reverse direction by solving a non-linear optimization problem from the acoustic parameter. While the number of parameter dimensions is normally 12-20 in the aforementioned LPC, the number of dimensions of the articulatory parameter x is 6. This means that, when the articulatory parameter x is used, information is compressed to half or less of that of the LPC parameters.
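For reference, the all-pole assumption behind linear prediction analysis can be sketched as follows. The sketch assumes the autocorrelation method solved by the Levinson-Durbin recursion, which is a common way to compute LPC coefficients but is not specified in the text above; the function names are hypothetical, and windowing and pre-emphasis are omitted.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(signal)
    return [
        sum(signal[t] * signal[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]

def lpc(signal, order):
    """Levinson-Durbin recursion.

    Returns (a, err), where the predictor is
        x[t] ~ a[0]*x[t-1] + a[1]*x[t-2] + ... + a[order-1]*x[t-order]
    and err is the residual prediction error energy.
    """
    r = autocorrelation(signal, order)
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        if err <= 0.0:
            break  # silent or perfectly predicted frame: stop early
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k
    return a, err

# Impulse response of a one-pole system x[t] = 0.5 * x[t-1]
frame = [0.5 ** t for t in range(50)]
coeffs, residual = lpc(frame, 2)
```

For this one-pole test frame the first coefficient comes out close to 0.5 and the second close to 0. A speech frame would typically be analyzed with an order of 12-20, as noted above, against the 6 dimensions of the articulatory parameter x.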
Meanwhile, the narrow degree C at a point of articulation in the vocal tract is difficult to express with high accuracy using the articulatory parameter x, but it is closely related to the type of articulation, such as vowel, fricative, and closure. For this reason, the narrow degree is extracted separately from the articulatory parameter x and the coordinates (x, y) of the narrowed position so that it can be used for voice recognition and the like. Further, both the narrow degree C and the vector (x, y) of the narrowed position can be calculated simply from the acoustic parameter by using a neural network, which avoids the non-linear optimization problem associated with the articulatory parameter x.
However, the above-mentioned conventional voice recognition apparatuses have the following problems. In the method based on phoneme reference patterns, the feature variables of a phoneme extracted by the feature extracting device may differ, even when the voice is produced to express the same phoneme symbol, depending not only on physiological differences between individual speakers (e.g., a difference in vocal tract length) but also, for vowels in a word, on the influence of articulatory coupling in the surrounding phonemic environment. In other words, if voice recognition is performed using the feature variables of phonemes, even voices produced to express the same phoneme symbol may be determined to be different phonemes, and are therefore rejected or incorrectly recognized. Accordingly, high recognition performance cannot be obtained. This problem is attributable to the fact that voice recognition is performed using feature variables of phonemes that may fluctuate depending on speakers and phonemic environment.
Speaker-independent word recognition has the problem, as mentioned above, that the amount of calculation necessary for matching between the feature patterns of an input voice and the reference patterns increases.
Further, the method of predicting an articulatory parameter from an acoustic parameter using an articulatory model is also problematic in that a non-linear optimization problem must be solved, which is disadvantageous in terms of the amount of calculation and the stability of convergence. To avoid this problem, several methods have been attempted, such as taking into account the fluctuation range and continuity of the parameter, or utilizing a table lookup. However, the amount of calculation remains essentially large. Another problem is that the articulatory parameter x is directed to specified speakers, and prediction succeeds well only in steady vowel portions.
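The table-lookup alternative mentioned above, combined with a continuity constraint, might look like the following sketch. Everything in it is hypothetical: `toy_articulatory_model` is an invented two-dimensional stand-in for a real non-linear articulatory model (which would have 6 articulatory dimensions), the grid stands in for the permitted fluctuation range, and the cost weighting is an assumption.

```python
import math
from itertools import product

def toy_articulatory_model(x):
    """Hypothetical non-linear forward model: articulatory -> acoustic parameters."""
    x1, x2 = x
    return (x1 * x1 + x2, math.sin(x1) - x2 * x2)

def build_table(model, grid):
    """Precompute (articulatory, acoustic) pairs over the allowed parameter range."""
    return [(x, model(x)) for x in product(grid, repeat=2)]

def lookup(acoustic, table, previous=None, continuity_weight=0.0):
    """Return the tabulated articulatory parameter that best explains `acoustic`.

    If `previous` (the estimate for the preceding frame) is given, candidates
    far from it are penalized, which models the continuity constraint.
    """
    def cost(entry):
        x, y = entry
        c = math.dist(acoustic, y)
        if previous is not None:
            c += continuity_weight * math.dist(x, previous)
        return c

    return min(table, key=cost)[0]

grid = [i / 10 for i in range(-10, 11)]  # assumed fluctuation range [-1, 1]
table = build_table(toy_articulatory_model, grid)
observed = toy_articulatory_model((0.3, -0.5))
estimate = lookup(observed, table)
```

Even in this toy case the table holds 21 x 21 = 441 entries for two dimensions; with the same grid, 6 dimensions would need 21^6 entries, which illustrates why the amount of calculation remains large.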
Various methods using formant frequencies have also been proposed for adapting to many unspecified speakers, but no decisive method has been found.