1. Field of the Invention
This invention relates to phoneme information extracting apparatus for extracting phoneme information used for recognizing a continuous voice tract in terms of phonemes.
2. Description of the Prior Art
Recently, it has been proposed to use, as the input device of a data processor, a voice recognizing apparatus which recognizes input voice and generates voice data corresponding to the recognized input voice. In such a voice recognizing apparatus, the voice is recognized either by comparing a voice pattern obtained, for instance, by compressing an input voice signal with a preliminarily registered voice pattern, or by comparing information of phoneme strings obtained through conversion of the input voice signal with information of phoneme strings preliminarily registered for each word or phrase. In the case of the former voice recognition apparatus, the accuracy of the voice recognition is high, so that it is advantageous for processing a few words. However, if a large number of words are to be recognized, it is difficult to recognize the voice in real time. In addition, a memory having a large capacity is required for storing a great quantity of word information. In the latter voice recognition apparatus, the precision of the voice recognition greatly depends upon the recognition score of individual phonemes obtained from the input voice. Presently, there are two different types of phoneme recognition apparatus: one in which the input voice signal is divided into successive phoneme-unit sections and the phonemes in the individual sections are recognized, and another in which the input voice signal is divided into frames each covering a constant time period and the phonemes in the individual frames are recognized. The former phoneme recognition apparatus has the merit that the individual phonemes can be recognized with high precision. However, difficulties are encountered in dividing the input voice signal into phoneme units, so that the utility of this apparatus is limited.
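By way of illustration only, the division of an input voice signal into frames each covering a constant time period may be sketched as follows. The function name split_into_frames, the 8 kHz sampling rate, the 10 msec. frame length, and the dummy sample values are assumptions made for this sketch; they are not taken from any particular prior art apparatus.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Split a sample sequence into consecutive frames of equal duration.

    All names and parameters here are hypothetical, chosen for illustration.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop a trailing partial frame so every frame covers the same time period.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# 100 msec. of dummy samples at an assumed sampling rate of 8 kHz.
signal = list(range(800))
frames = split_into_frames(signal, sample_rate_hz=8000, frame_ms=10)
```

Each element of frames would then be passed to a per-frame phoneme recognizer, which is outside the scope of this sketch.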
In the latter phoneme recognition apparatus, the input voice signal is divided into frames covering a fixed time period, for instance 10 to 20 msec., and the phoneme data for each frame are recognized through comparison with preliminarily registered phonemes. In this case, however, the precision of recognition is inferior because the phoneme data are recognized in a sort of forecasting way from the partial data obtained in a divided or limited period of time. Accordingly, in practice a plurality of phonemes obtained as a result of phoneme recognition in individual frames are given respective priorities or probabilities, and voice is recognized for each unit word through comparison of these phoneme data with reference phoneme data that are registered as unit words (for instance [tokjo] for the word "TOKYO"). At this time, warping of the voice in the time axis direction due to variation in the length of the voice producing time can be absorbed by using a dynamic programming method when comparing the input phoneme strings with the registered unit word phoneme data. However, in this phoneme recognition method it is necessary to execute the phoneme recognition in a period as short as one frame. Particularly, when a voice series is continuously produced, the individual uttered phonemes are influenced by the immediately preceding and succeeding phonemes because of the restrictions imposed upon the operation of the voice producing organs (which is referred to as coarticulation), and the input voice signal sometimes contains phoneme data which represent phonemic features different from those of the reference phoneme data that are preliminarily registered for the phoneme recognition. For example, when the words "the eye" (|ði ai|) are uttered comparatively slowly, the phonemic features, for instance power spectra or formants that appear on the power spectra, have substantially the same values as the phonemic features of the separately pronounced phonemes [ði] and [ai]. However, when the words "the eye" are uttered quickly, the phoneme [a] is strongly influenced by the adjacent phonemes [ði] and [i], so that the phonemic feature of this frame is no longer that of the phoneme [a] but is altered to be close to the phonemic features of the preceding and succeeding phonemes [i], and in this case proper recognition of the phoneme can no longer be obtained. In order to solve this problem, it has been known to preliminarily register as reference phoneme data, for instance, an imaginary phoneme ["ia"] that appears at the time of transition from the phoneme [i] to the phoneme [a]. More particularly, when the words "the eye" are pronounced, either a phoneme string [ð]→[i]→[a]→[i] or a phoneme string [ð]→[i]→["ia"]→[i] is obtained, and accordingly phoneme string data ([ð]→[i]→["ia"]→[i]) are preliminarily stored in a dictionary memory so that the utterance of the words "the eye" can be recognized as such through comparison with these data. In this phoneme recognition method, however, an enormous amount of reference phoneme data (inclusive of the imaginary phoneme data) has to be registered for a large number of phoneme combinations, and it is also difficult to extract the phonemes when the voice is uttered quickly.
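By way of illustration only, the comparison of per-frame phoneme candidates with registered unit word phoneme strings by a dynamic programming method may be sketched as follows. The sketch is not the method of any particular prior art apparatus. The names dp_match_cost, recognize and DICTIONARY, the local cost of one minus the candidate score, the rule that each registered phoneme covers one or more consecutive frames, and all numerical scores are assumptions made for this example. The initial consonant of "the" is written as ð, and the imaginary phoneme as "ia".

```python
INF = float("inf")


def dp_match_cost(frames, reference):
    """Minimum cost of aligning frame candidates with a registered phoneme string.

    frames: list of dicts mapping a phoneme label to its score (0 to 1).
    reference: list of phoneme labels registered for one unit word.
    Each reference phoneme covers one or more consecutive frames, which
    absorbs variation in how long each phoneme is held.
    """
    n, m = len(frames), len(reference)
    if n == 0 or m == 0:
        return INF

    def local(i, j):
        # Assumed local cost: low when frame i scores reference phoneme j highly.
        return 1.0 - frames[i].get(reference[j], 0.0)

    cost = [[INF] * m for _ in range(n)]
    cost[0][0] = local(0, 0)
    for i in range(1, n):
        for j in range(m):
            stay = cost[i - 1][j]  # same phoneme continues
            advance = cost[i - 1][j - 1] if j > 0 else INF  # next phoneme starts
            best = min(stay, advance)
            if best < INF:
                cost[i][j] = best + local(i, j)
    return cost[n - 1][m - 1]


# Hypothetical dictionary memory: a word may register several phoneme strings.
DICTIONARY = {
    "the eye": [["ð", "i", "a", "i"], ["ð", "i", "ia", "i"]],
    "TOKYO": [["t", "o", "k", "j", "o"]],
}


def recognize(frames, dictionary=DICTIONARY):
    """Return (word, cost) for the registered string with the lowest cost."""
    scored = [(dp_match_cost(frames, ref), word)
              for word, refs in dictionary.items() for ref in refs]
    cost, word = min(scored)
    return word, cost


# Invented candidate scores for a quickly uttered "the eye" (five frames).
quick_utterance = [
    {"ð": 0.9},
    {"i": 0.8},
    {"i": 0.7, "ia": 0.3},
    {"ia": 0.6, "a": 0.2, "i": 0.2},
    {"i": 0.9},
]
best_word, best_cost = recognize(quick_utterance)
```

With these invented scores, the string containing the imaginary phoneme matches at a lower cost than the string containing [a]. The sketch also makes the stated drawback visible: every additional transition that is to be handled in this way requires further entries in the dictionary.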