The present invention relates to a continuous speech recognition apparatus to recognize an utterance or speech generated continuously.
In word processors and audio typewriters which receive input information in the form of speech, it is important to recognize continuously and naturally generated speech efficiently and with high reliability. Conventionally, there has been known a continuous speech recognition apparatus which recognizes the input speech by first converting the feature parameter sequence of the input speech into a series of phonemic symbols, i.e. into the so-called segment lattice, using a speech segment as the minimum unit of the speech to be recognized. However, in continuously generated speech, a certain speech segment may be coarticulated with the speech segments generated before and after it, so that even the same speech segment can have various different feature parameters. Due to this, it is difficult to convert the input speech into phonemic symbols with a high degree of accuracy.
In addition, there has conventionally been known a continuous speech recognition apparatus which identifies words from the feature parameter sequence of the input speech, using the word as the minimum unit of the input speech to be recognized, and thereby recognizes a series of these identified words as a sentence. In this recognition apparatus, reference patterns representative of the respective words are used, and the input speech is recognized on a word unit basis by calculating the similarity between the feature patterns indicative of the input speech and the reference patterns; therefore, this type of apparatus is hardly affected by coarticulation between successive speech segments. The word identification methods in this recognition apparatus are broadly divided into two: a first method which detects the word intervals of the input speech and identifies the word in each word interval; and a second method which detects the words that could possibly exist in the generation interval of the input speech without detecting the word intervals. In the former word identification method, the word interval is determined by sequentially extracting feature parameters of the input speech, e.g. the acoustic power or acoustic spectrum, and by detecting the maximal or minimal points of change in these feature parameters. However, in the case where, for example, "I (ai)" and "eat (i:t)" are generated continuously so that "I eat (ai:t)" is formed, there is a problem that the word intervals in this speech cannot be correctly detected.
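By way of illustration only, the difficulty with the former method may be sketched as follows. The sketch assumes that the feature parameter is a per-frame acoustic power value and that a word boundary is taken to be a local minimum of the power falling below a threshold; the function name `candidate_boundaries`, the threshold value and the sample power sequences are hypothetical and do not appear in the prior art described above.

```python
def candidate_boundaries(power, threshold):
    """Return frame indices that are local minima of `power` below `threshold`.

    Assumption: a word boundary shows up as a dip in acoustic power.
    """
    boundaries = []
    for t in range(1, len(power) - 1):
        is_local_min = power[t] < power[t - 1] and power[t] <= power[t + 1]
        if is_local_min and power[t] < threshold:
            boundaries.append(t)
    return boundaries


# Hypothetical power contours (arbitrary units), one value per frame.
# Two words separated by a short dip in power.
separated = [1, 5, 9, 6, 1, 6, 9, 5, 1]
# Two words run together, as in "I eat": the power never dips between them.
merged = [1, 5, 9, 8, 8, 8, 9, 5, 1]

found_separated = candidate_boundaries(separated, threshold=3)
found_merged = candidate_boundaries(merged, threshold=3)
```

In the second contour no frame satisfies the boundary condition, so the two words would be treated as a single word interval, which corresponds to the problem noted above.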
In the latter word identification method, reference patterns each having feature parameters of a plurality of frames are used for each word, and the distances between the feature parameters of a plurality of frames of the input speech and the reference patterns are obtained for every frame, thereby determining the word which gives the minimum distance. In this case, the distances between the feature patterns of the input speech and the reference patterns are obtained by, e.g. a dynamic programming method. In this way, all of the word series that can exist in the whole speech interval are detected, and the input speech is recognized by selecting, from among these word series, the one for which the sum of the distances relating to its words is minimum.
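A minimal sketch of distance calculation by a dynamic programming method is given below for reference. It assumes a Euclidean distance between frames and the simplest symmetric warping path, and it matches one whole input pattern against each reference pattern rather than searching all word series in the speech interval; the names `frame_distance`, `dtw_distance` and `identify_word` and the sample patterns are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(inp, ref):
    """Accumulated distance of the best time alignment between two patterns."""
    n, m = len(inp), len(ref)
    inf = float("inf")
    # d[i][j]: minimum accumulated distance aligning inp[:i] with ref[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(inp[i - 1], ref[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def identify_word(inp, references):
    """Return the word whose reference pattern gives the minimum distance."""
    return min(references, key=lambda word: dtw_distance(inp, references[word]))


# Hypothetical one-dimensional feature patterns, one tuple per frame.
references = {
    "a": [(0.0,), (1.0,), (0.0,)],
    "b": [(1.0,), (1.0,), (1.0,)],
}
# The input is pattern "a" with its first frame stretched in time.
spoken = [(0.0,), (0.0,), (1.0,), (0.0,)]
best = identify_word(spoken, references)
```

Each call to `dtw_distance` fills a table whose size is the product of the two pattern lengths, so the amount of calculation grows with the number of reference patterns that must be compared.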
This word identification method is effective in the case where the speakers are limited and the word identification can be executed using a small number of reference patterns. However, for unspecified speakers, the speech pattern of each word differs greatly from speaker to speaker; therefore, it is necessary to prepare a great number of reference patterns for each word in order to process the speech information from unspecified speakers. It is in practice impossible to prepare as many reference patterns as there are unspecified speakers and, accordingly, it is impossible to suitably recognize the speech from all of the unspecified speakers.
To cope with such a drawback, an idea has recently been considered whereby a small, limited number of reference patterns are used for each word and the speech information from unspecified speakers is processed by applying a clustering technique. In this case, however, the rate of correct recognition for a series of words is remarkably reduced, and furthermore it is necessary to calculate the distances between the reference patterns and the feature patterns of the input speech for every frame, causing the total amount of calculation to increase greatly. Thus, it is very difficult to recognize continuously generated speech efficiently and with high reliability.