1. Field of the Invention
The present invention relates to a method and apparatus for extracting an isolated speech word to be used in an isolated speech word recognition apparatus and the like.
2. Description of the Related Art
In, for example, an isolated speech word recognition apparatus, each speech word is first extracted from an input speech signal and then the extracted speech word is compared with isolated speech words stored in a dictionary to conduct an isolated speech word recognition. In this speech word recognition, each isolated speech word must be extracted without error, i.e., no consonant and/or vowel of each speech word can be overlooked.
Prior art related to the present invention is disclosed in the publication, OKI electric company research and development report, No. 128, Vol. 52, No. 4 (1985), "A speech recognition LSI for an independent talker with 10 words". According to this publication, the front and end of each isolated speech word are detected, to extract each speech interval, by the process described below. When the front of the speech word is to be detected, the existence of the front is determined if four or more frames (each frame lasts 10 ms and shows that the level detected is higher than a predetermined level) are produced successively, and the front is placed at the point at which such frames first occurred. In this case, each frame exhibits a speech power which is obtained by averaging the speech powers detected at respective predetermined frequencies within a speech frequency band. If the existence of the front were determined merely by observing a time when the average speech power becomes higher than the predetermined level, to differentiate the speech from the atmospheric noise, an incorrect determination might be made due to an instantaneous rise of power in, for example, the atmospheric noise. Therefore, four or more successive frames are used to determine the existence of the actual front, as mentioned above, to avoid such an incorrect determination.
When the end of the speech word is to be detected, the existence of the end is determined if eight or more frames (each frame lasts 10 ms and shows that the level detected is lower than the predetermined level) are produced successively, and the end is placed at the point at which these frames first occurred. In this case, if the existence of the end were determined merely by observing a time when the average speech power becomes lower than the predetermined level, an incorrect determination might be made due to an interpretation of a pseudo non-speech interval as the end of the speech word. Such a pseudo non-speech interval usually occurs between two adjacent speech segments; thus eight or more successive frames are used to determine the existence of the actual end, as mentioned above, to avoid such an incorrect determination.
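The front and end detection described above can be illustrated by the following minimal sketch. It is not taken from the cited publication. The function names, the exclusive end index, and the fallback used when no end is found are assumptions made for illustration only. The 10 ms frame length and the run lengths of four frames (front) and eight frames (end) follow the description above.

```python
from typing import List, Optional, Sequence, Tuple

FRAME_MS = 10  # frame length stated in the description


def frame_power(band_powers: Sequence[float]) -> float:
    """Average the powers detected at the predetermined frequencies of one frame."""
    return sum(band_powers) / len(band_powers)


def find_run(flags: List[bool], start: int, min_run: int) -> Optional[int]:
    """Return the index at which the first run of min_run True flags begins.

    The search starts at index `start`; None is returned if no such run exists.
    """
    run = 0
    for i in range(start, len(flags)):
        run = run + 1 if flags[i] else 0
        if run == min_run:
            return i - min_run + 1
    return None


def extract_interval(
    powers: Sequence[float],
    threshold: float,
    front_run: int = 4,  # successive high-level frames required for the front
    end_run: int = 8,    # successive low-level frames required for the end
) -> Optional[Tuple[int, int]]:
    """Return (front, end) frame indices of the speech interval, end exclusive.

    Returns None when no front is found. If no end is found, the interval is
    assumed (hypothetically) to run to the last frame of the input.
    """
    above = [p > threshold for p in powers]
    front = find_run(above, 0, front_run)
    if front is None:
        return None
    below = [p < threshold for p in powers]
    end = find_run(below, front, end_run)
    if end is None:
        end = len(powers)
    return front, end


# Example: a one-frame noise spike at frame 2 and a two-frame dip at frames
# 10 and 11 (a pseudo non-speech interval) are both ignored.
powers = [0, 0, 9, 0, 0, 9, 9, 9, 9, 9, 0, 0, 9, 9] + [0] * 10
interval = extract_interval(powers, threshold=5.0)
```

In this example the isolated spike does not satisfy the four-frame condition, and the short dip does not satisfy the eight-frame condition, so neither is taken as the front or the end of the speech word.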
This prior art has a problem in that a consonant may be overlooked when detecting the same from the extracted speech word, and thus a correct speech recognition cannot be made, as explained later in detail.
Other prior art, described in Japanese Unexamined Patent Publication No. Sho 60 (1985)-260096, refers to a measure for overcoming a problem analogous to the above-mentioned problem of the previously illustrated art. Nevertheless, this latter prior art is effective only in avoiding overlooking a word including the vowel "i", for example, "ichi" and "ni", which are Japanese words corresponding to "1" and "2" in English, respectively. It avoids the overlooking by selectively lowering a threshold level for extracting the speech interval whenever it is determined that an input speech signal includes the vowel "i".