The present invention relates to a continuous speech recognition apparatus for recognizing a continuous speech.
It is very important to recognize continuous natural speech effectively and with high reliability in a word processor or speech input typewriter which deals with speech input data. Conventionally, a continuous speech recognition apparatus is known wherein a speech segment is used as the minimum unit of input speech to be recognized and a time-sequence of input speech feature parameters is converted to a series of phonemic symbols or a segment lattice. However, coarticulation often occurs between two adjacent speech segments (phonemes) in continuous speech, so that a given speech segment may have feature parameters different from those of the same speech segment pronounced in isolation. For this reason, it is very difficult to convert a continuous speech pattern to phonemic symbols with high precision.
Another continuous speech recognition apparatus is also known wherein a word unit is used as the minimum unit of input speech to be recognized, each word unit is identified based on a sequence of input speech feature parameters, and a series of identified words is recognized as a sentence. According to this speech recognition apparatus, reference speech patterns indicating respective words are used. A feature parameter pattern indicating the input speech is compared with the corresponding reference speech pattern to calculate a similarity therebetween so as to recognize the input speech pattern in each word unit. The influence of the coarticulation described above can thus be substantially reduced. This recognition apparatus employs one of two word identification methods: one identification method wherein each word interval of an input speech is first detected to identify a word in the word interval; and the other identification method wherein a word is identified without detecting a word interval under the assumption that several words are present during the input speech interval. In the former method, the word interval is determined by sequentially extracting feature parameters such as acoustic power or power spectrum of the input speech, and detecting a maximal or minimal point of change in the feature parameter. However, when the words "I (ai)" and "eat (i:t)" are continuously pronounced to produce a speech input "I eat (ai:t)", the word interval of this speech cannot be correctly detected.
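The word interval detection described above can be illustrated by the following minimal sketch. It assumes non-overlapping frames, mean squared amplitude as the acoustic power, and a strict local minimum of power as a candidate word boundary; the function names `frame_power` and `candidate_boundaries` are hypothetical and are not taken from any known apparatus.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (assumed power measure)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def candidate_boundaries(powers):
    """Indices of frames whose power is a strict local minimum."""
    return [
        i
        for i in range(1, len(powers) - 1)
        if powers[i] < powers[i - 1] and powers[i] < powers[i + 1]
    ]
```

When two words are pronounced without an intervening drop in power, as in the "I eat" example, the power sequence has no local minimum at the word junction, so a detector of this kind reports no boundary there.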
In the latter word identification method described above, reference speech patterns each having feature parameters of a plurality of frames are used to identify a corresponding one of words in the input speech pattern. For each frame, a distance between the feature parameters of the plurality of frames of the input speech and the reference speech pattern is calculated to detect a word giving a shortest distance in each frame. In this case, the distance between the feature parameter pattern of the input speech and the reference speech pattern can be calculated by a dynamic programming method, for example. All possible combinations of a series of words in the speech interval are made, and the input speech is then recognized by detecting one of the series of words giving a minimum total distance.
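The distance calculation by dynamic programming mentioned above can be sketched as follows. This is only an illustration under stated assumptions: a Euclidean distance between frames, the basic symmetric step pattern of dynamic time warping, and the hypothetical function names `frame_distance`, `dtw_distance` and `nearest_word`. It matches one input pattern against whole-word reference patterns and does not perform the search over all combinations of word series.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed frame metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(inp, ref):
    """Accumulated distance along the best time alignment of two frame sequences."""
    n, m = len(inp), len(ref)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning inp[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(inp[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_word(inp, references):
    """Name of the reference pattern giving the shortest distance to the input."""
    return min(references, key=lambda word: dtw_distance(inp, references[word]))
```

Each call to `dtw_distance` costs on the order of the product of the two pattern lengths, which indicates why repeating the calculation for every reference pattern and every frame position leads to a large number of calculations.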
This word identification method is effective when a speaker is specified and word identification can be performed by using a small number of reference speech patterns. However, when a speaker is not specified, the input speech patterns of a word vary greatly from speaker to speaker. In order to process the speech data from nonspecified speakers, a great number of reference word patterns are required. In practice, it is impossible to prepare reference speech patterns for an indefinite number of nonspecified speakers. Therefore, it is impossible to accurately recognize the input speech patterns of an indefinite number of nonspecified speakers.
Speech data processing has recently been proposed wherein a small number of reference patterns are used for the individual words, and speech data of a nonspecified speaker are processed utilizing a clustering technique. However, in this case, the recognition rate of a series of words is greatly decreased. Furthermore, the distance between the reference speech pattern and the feature parameter pattern of the input speech must be calculated in each frame, thus greatly increasing the total number of calculations. Therefore, it is very difficult to recognize the input speech effectively and with high reliability.