1. Field of the Invention
The present invention relates to voice recognition devices capable of recognizing human voice no matter who the speaker is, e.g., a low-pitched man, a high-pitched woman or a child, and more specifically, to a device for normalizing voice pitch on the basis of a previously-provided sample voice pitch.
2. Description of the Background Art
Recently, with advances in digital signal processing technology and the arrival of LSIs offering higher performance at lower prices, voice recognition technology has become popular in consumer electronic products, where it also improves operability. Such a voice recognition device principally recognizes human voice by converting an incoming command voice into a digital voice signal, and then comparing it against sample voice data prepared in advance in a voice dictionary. Therefore, to make the comparison easier, the voice recognition device often requires the user to utter commands in a specific manner, or to register his/her voice in advance, for example.
The issue here is that restricting a voice recognition device built into a consumer electronic product to a specific user badly impairs its usability and thus its product value. To get around this problem, the voice recognition device is expected to recognize human voices that vary in pitch and speed, no matter who the speaker is. However, as already described, the conventional voice recognition device refers to the voice dictionary for comparison with an incoming command voice. Therefore, if the incoming command voice differs greatly in pitch or speed from the sample in the voice dictionary, the voice recognition device fails to perform voice recognition correctly.
FIG. 7 shows a voice recognition device disclosed in Japanese Patent Laid-Open Publication No. 9-325798 (97-325798) as an attempt to address this problem. The voice recognition device VRAc includes a voice input part 111, a voice speed calculation part 112, a voice speed change rate determination part 113, a voice speed change part 114, and a voice recognition part 115.
A sound, or voice, produced by a user is taken into the voice input part 111 and captured thereby as a command voice. The captured command voice is A/D converted into a digital voice signal. The voice speed calculation part 112 receives the digital voice signal thus produced, and based thereon, calculates the user's voice speed. The voice speed change rate determination part 113 compares the voice speed thus calculated with a reference voice speed, and then determines a speed change rate to compensate for the speed gap therebetween. By referring thereto, the voice speed change part 114 changes the voice speed. Then, the voice recognition part 115 performs voice recognition with respect to the voice-speed-changed voice signal.
Described next is the operation of the voice recognition device VRAc. The user's sound is captured as a command voice, together with background noise, by the voice input part 111 via a microphone and an amplifier equipped therein, and the resulting analog signal including the command voice and the background noise is then subjected to A/D conversion by an equipped A/D converter. From the voice included in the digital voice signal thus obtained, the voice speed calculation part 112 extracts a sound unit which corresponds to the command voice, and calculates the voice speed for the sound unit based on the time taken for the user to produce or utter the sound.
Here, assume that the time taken to utter the sound unit (hereinafter, “one-sound unit utterance time”) is Ts, and that a reference time for utterance of the sound unit (hereinafter, “one-sound unit reference time”) is Th. The voice speed change rate determination part 113 determines a speed change rate α by comparing 1/Ts and 1/Th with each other, which denote a one-sound unit utterance speed and a one-sound unit reference speed, respectively. The speed change rate α is calculated by the following equation (1).
α = Ts/Th  (1)
Equation (1) shows that, when the one-sound unit utterance time Ts is shorter than the one-sound unit reference time Th, i.e., when the incoming command voice speed is faster than that workable by the voice recognition device VRAc, the speed change rate α is smaller than 1. In this case, the incoming command voice should be decreased in speed. Conversely, when the one-sound unit utterance time Ts is longer than the one-sound unit reference time Th, i.e., when the incoming command voice speed is slower, the speed change rate α becomes greater than 1. In such a case, the incoming command voice should be increased in speed.
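The relationship in equation (1) can be illustrated with a minimal Python sketch. The function names `speed_change_rate` and `required_adjustment` and the sample durations are hypothetical and chosen only for illustration; they are not taken from the cited publication.

```python
def speed_change_rate(ts: float, th: float) -> float:
    """Return alpha = Ts / Th as in equation (1).

    ts: one-sound unit utterance time Ts (seconds, assumed > 0)
    th: one-sound unit reference time Th (seconds, assumed > 0)
    """
    if ts <= 0 or th <= 0:
        raise ValueError("Ts and Th must be positive")
    return ts / th


def required_adjustment(alpha: float) -> str:
    """Describe the speed change implied by alpha (hypothetical helper)."""
    if alpha < 1:
        return "decrease speed"   # user speaks faster than the reference
    if alpha > 1:
        return "increase speed"   # user speaks slower than the reference
    return "no change"


# Illustrative values only: the user takes 0.25 s per sound unit,
# while the reference time is 0.5 s.
alpha = speed_change_rate(0.25, 0.5)
print(alpha, required_adjustment(alpha))
```

Multiplying the utterance speed 1/Ts by α gives (Ts/Th)(1/Ts) = 1/Th, i.e., the reference speed, which is why α smaller than 1 corresponds to slowing the command voice down.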
In the voice recognition device VRAc, the voice speed change part 114 refers to the speed change rate α to bring the command voice signal to a constant speed, and produces a speed-changed command voice signal. The voice recognition part 115 performs voice recognition with respect to the speed-changed command voice signal, and outputs a result obtained thereby.
Such a speed change can be done easily with recent digital technology. For example, in order to decrease the speed of voice, several vowel waveforms having correlation with the sound unit included in the command voice are added to the voice signal. To increase the speed of voice, on the other hand, such vowel waveforms are decimated from the command voice signal several times.
This is a technique for changing the voice speed without affecting the pitch of the command voice. That is, this technique is effective for voice recognition in the case that the user speaks faster or slower than the dictionary voice.
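The idea of changing speed by adding or removing waveform segments can be sketched in Python as follows. This is a deliberately simplified illustration, not the method of the cited publication: it assumes the signal is a plain list of samples, that every waveform period has the same known length `period` (in samples), and that periods can be repeated or dropped without cross-fading. The function name `change_speed` is hypothetical.

```python
def change_speed(samples, period, alpha):
    """Repeat or drop whole waveform periods to change speed by alpha.

    alpha < 1 slows the voice down (periods are repeated);
    alpha > 1 speeds it up (periods are skipped).
    Each period keeps its original length, so the pitch is unchanged.
    A trailing partial period, if any, is discarded in this sketch.
    """
    if period <= 0 or alpha <= 0:
        raise ValueError("period and alpha must be positive")
    # Split the signal into complete periods.
    periods = [samples[i:i + period]
               for i in range(0, len(samples) - period + 1, period)]
    n = len(periods)
    if n == 0:
        return []
    # New duration = old duration / alpha.
    n_out = max(1, round(n / alpha))
    out = []
    for j in range(n_out):
        src = min(int(j * alpha), n - 1)  # source period for output slot j
        out.extend(periods[src])
    return out


# Slowing down by alpha = 0.5 repeats every period once.
print(change_speed([1, 2, 3, 4, 5, 6], period=2, alpha=0.5))
# Speeding up by alpha = 2.0 keeps every other period.
print(change_speed([1, 2, 3, 4, 5, 6, 7, 8], period=2, alpha=2.0))
```

Because only whole periods are inserted or removed, the length of each period, and hence the pitch, stays the same while the overall duration changes. A practical implementation would also need to detect the period boundaries and smooth the joins between segments.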
The above-described conventional voice recognition device VRAc works well for voice recognition when the user's voice speed differs greatly from the one-sound unit reference speed 1/Th. However, it is not applicable when the user's voice differs in pitch from a reference pitch.
In detail, although the voice recognition device VRAc can handle various types of speakers whose voices differ in frequency range, e.g., a low-pitched man, a high-pitched woman or a child, the voice recognition achieved thereby is not satisfactory.
A fast speaker can be asked to speak more moderately, but a speaker cannot be asked to speak at a different voice pitch. Note that a speaker's voice pitch is essentially determined by his/her throat, especially its shape and size. Since the speaker cannot intentionally change the shape or size of his/her throat, the voice pitch cannot be changed intentionally either.
To realize voice recognition of various voices with different pitches, the voice recognition device VRAc would have to store a great number of sample voice data groups, each corresponding to a different speaker such as a man, a woman, or a child speaking at a different pitch. Further, the voice recognition device VRAc would have to select one group from among that great number of sample voice data groups according to the incoming command voice.
In order to avoid such a nuisance, it seems effective to process the incoming command voice to a pitch optimal for voice recognition. However, since incoming command voices vary greatly in pitch according to the speaker, it is substantially impossible to process the incoming command voice to a desired pitch in a single step. Even at the desired pitch, correct voice recognition cannot be secured, because the content of the incoming command voice or the manner of speaking may spoil the voice recognition result. As this shows, the pitch considered optimal for voice recognition in terms of the voice recognition device or the sample voice data is not necessarily optimal.
Therefore, an object of the present invention is to provide a device for normalizing voice pitch to a level considered optimal for voice recognition.