1. Field of the Invention
The present invention relates to a speech recognition system and, more specifically, to a word or speech boundary detection system which searches the sound power of speech uttered by non-specific speakers to determine a plurality of start and end point candidates of the speech.
2. Description of the Prior Art
An ordinary speech recognition system is configured as shown in FIG. 1. A speech signal transduced through a microphone, for instance, is analyzed by a sound analyzer 1 in accordance with band-pass filter (BPF) analysis or linear prediction coding (LPC) analysis; the analyzed speech signal is pre-processed by a preprocessor to detect speech boundaries (the start and end of a word) and to normalize the analyzed signal with respect to its amplitude; similarities (distances) between the preprocessed patterns and reference patterns previously registered in a reference pattern memory 4 are calculated by a pattern match detector 3; and the magnitudes of the detected similarities are then compared with a predetermined level by a discriminator 5 to finally recognize the input speech.
In the speech recognition processing described above, the word boundary detection technique involves serious problems, because a speech signal varies with time and includes a considerable level of background noise. A great effort has been made to improve the accuracy of word or speech boundary detection. However, miscellaneous noises produced by the speaker (such as tongue-clicking, breathing, rumble, etc.) are superimposed upon the speech in addition to surrounding noise or circuit noise (in the case of telephone speech). It has therefore been impossible to eliminate detection errors as long as a word boundary, or the start and end of a speech, is determined simply by detecting the duration for which the word or speech exceeds a threshold level.
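By way of illustration only, the simple threshold-based determination referred to above may be sketched as follows in Python. The function name `detect_boundary`, the parameters `threshold` and `min_frames`, and the sample power values are hypothetical and are not taken from any particular prior-art system; the sketch assumes only that a sequence of per-frame sound power values is available.

```python
from typing import List, Optional, Tuple


def detect_boundary(
    power: List[float], threshold: float, min_frames: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first run of frames whose
    power exceeds `threshold` for at least `min_frames` consecutive frames,
    or None if no such run exists. All names here are illustrative."""
    start: Optional[int] = None
    for i, p in enumerate(power):
        if p > threshold:
            if start is None:
                start = i  # candidate start of the word
        else:
            if start is not None and i - start >= min_frames:
                return (start, i - 1)  # run was long enough: accept it
            start = None  # run too short (e.g. a click): discard it
    # Handle a run that lasts until the final frame.
    if start is not None and len(power) - start >= min_frames:
        return (start, len(power) - 1)
    return None


# A short noise burst (frame 1) is rejected; the word occupies frames 3 to 5.
clean = [1, 5, 1, 5, 6, 7, 1]
print(detect_boundary(clean, threshold=3, min_frames=3))  # (3, 5)

# Hypothetical background speech (frames 4 to 6) directly follows the word
# (frames 1 to 3) and also exceeds the threshold, so the end is misplaced.
noisy = [1, 6, 7, 6, 4, 4, 4, 1]
print(detect_boundary(noisy, threshold=3, min_frames=3))  # (1, 6)
```

In the second call the detected end is frame 6 rather than frame 3, because such a detector cannot distinguish the speaker's own utterance from any other sound that exceeds the threshold.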
FIG. 2 is an example of a sonagraph obtained when a telephone speech of "8 (/hatʃi/)" (8 is pronounced as denoted above in phonetic symbols in the Japanese language) is uttered by a woman under background noise in an office room and analyzed in accordance with the LPC method. In this sonagraph, the sound power is plotted on a logarithmic scale (log G) on the upper side, and the frequency components are depicted on the lower side by character shading (MZ7l+-) printed by a computer. Further, a frame represents an aggregation of a series of bits within a periodically repeated digit time slot (window). The time interval (frame rate) between two frames shown in FIG. 2 is about 8 to 10 ms.
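For reference, a per-frame logarithmic sound power of the kind plotted on the upper side of such a sonagraph might be computed as in the following minimal Python sketch. The function name `frame_log_power`, the non-overlapping framing, the base-10 logarithm, the `floor` value and the 8 kHz sampling rate (which gives 80 samples per 10 ms frame) are all assumptions made for illustration and are not specified in the description above.

```python
import math
from typing import List, Sequence


def frame_log_power(
    samples: Sequence[float], frame_len: int, floor: float = 1e-10
) -> List[float]:
    """Split `samples` into non-overlapping frames of `frame_len` samples and
    return log10 of the mean squared amplitude of each frame. `floor` guards
    against log(0) in silent frames. All names here are illustrative."""
    result: List[float] = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        result.append(math.log10(max(power, floor)))
    return result


# Assumed 8 kHz sampling: 80 samples correspond to one 10 ms frame.
signal = [1.0] * 80 + [0.0] * 80
values = frame_log_power(signal, frame_len=80)
print(len(values))  # 2 frames
```

The first frame has unit mean power and therefore a log power of zero, while the silent second frame is clamped to the logarithm of the assumed floor value.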
In the example shown in FIG. 2, since the speech of other office workers is superimposed upon the speech of "8" as background noise, when the conventional speech recognition system is adopted, a position B2 (Frame No. 110) is erroneously detected as the end of the word "8", in spite of the fact that the correct end of "8" is B1 (Frame No. 82).
Further, the conventional method of simply determining the start and end of a word involves other problems, as follows. For instance, in the case of a telephone speech recognition system for recognizing various speeches of a plurality of non-specific speakers, it may be possible to prepare a reference speech dictionary in advance on the basis of the speech of many persons. However, in determining a boundary, a consonant near the start of a word, such as the sound "g" of "5 /go/", is not necessarily detected accurately as the start of the word at the head of "5" (before "g"). Therefore, there exists a problem in that it is impossible to improve the recognition performance unless various modifications, such as starts of consonant sounds, intermediate portions of consonant sounds, transitions to vowel sounds, etc., are included in the reference speech dictionary. Further, because these various modifications must be included in the dictionary, when the number of words to be recognized increases (and with it the number of words having similar syllable series), the difference between /go/ and /o/ is not discriminated and the similarity rate between different words increases. This results in another problem in that the rejection capability is lowered and the recognition performance is therefore deteriorated.