1. Field of the Invention
The present invention relates to a speech section detection apparatus, and more particularly to a speech section detection apparatus capable of reliably detecting a speech section even for a word containing a glottal stop sound or for a word containing a succession of “s” column sounds (sounds belonging to the third column in the Japanese Goju-on Zu syllabary table) or “h” column sounds (sounds belonging to the sixth column in the same table).
2. Description of the Related Art
In speech recognition, speech sections, based on which speech is recognized, must be extracted from a time-series signal captured through a microphone. There is proposed a method that takes a period during which the short-duration power of speech is greater than a predetermined threshold as a speech section but, with this method, it has been difficult to achieve sufficient accuracy for speaker-independent systems intended to recognize a large variety of words spoken by unspecified speakers.
The applicant has previously proposed a pitch period extraction apparatus and method that can detect with high accuracy a pitch, the highness or lowness of tone, in a time domain, from a speech signal (Japanese Unexamined Patent Publication No. 9-50297), but it is also possible to determine a speech section based on the pitch period.
However, in the case of a word A which contains a glottal stop sound in the word (for example, Japanese word “chisso”), a word B which contains a succession of “s” column sounds (sounds in the third column in the Japanese Goju-on Zu syllabary table) (for example, Japanese word “sushiya”), or a word C which contains a succession of “h” column sounds (sounds in the sixth column in the Japanese Goju-on Zu syllabary table) (for example, Japanese word “hihuka”), it has not been possible to avoid the possibility of erroneous detection resulting from a failure to detect all the constituent sounds of the word as one continuous speech section.
FIGS. 1A, 1B, and 1C show the speech section detection results obtained according to the prior art pitch period detection method. FIG. 1A shows the speech section detection result for the “word A”, FIG. 1B for the “word B”, and FIG. 1C for the “word C”. In each figure, the upper part shows the speech signal, and the lower part the detected speech section.
As can be seen from the figures, in the case of the “word A”, the sound in the first half of the word (“chi” in the Japanese word “chisso”) is detected in the speech section, but the sound in the last half (“sso” in the Japanese word “chisso”) is not detected.
In the case of the Japanese word “sushiya”, there is a break in the speech section between “sushi” and “ya”, while in the case of the Japanese word “hihuka”, there is a break between “hifu” and “ka”; in either case, the word is not detected as one continuous speech section.
Possible causes for such erroneous detection include the following.
A: In the word A, the fricative “ss” that follows the glottal stop, and in the word B, the fricative “sh” that follows the “s” column sound “su”, are not only low in level but also difficult to differentiate from noise, and as a result, it is difficult to detect the pitch period itself.
B: When there is no aspirated sound part or noise part preceding the word, and when the tone is low, the pitch period cannot be detected.
C: In the case of the word C, there is a relatively long pause between the series of “h” sounds (“hihu” in the Japanese word “hihuka”) and the succeeding sound (“ka” in the Japanese word “hihuka”).
D: Noise during a pause.