1. Field of the Invention
The present invention relates to voice recognition devices capable of recognizing human voice no matter who is the speaker, e.g., a low-pitched man, a high-pitched woman or a child, and more specifically, to a device for normalizing voice pitch on the basis of a previously-provided sample voice pitch.
2. Description of the Background Art
Recently, with advances in digital signal processing technology and the availability of LSIs with higher performance at lower prices, voice recognition technology has become popular in consumer electronic products. Voice recognition technology also improves the operability of such products. A voice recognition device principally recognizes human voice by converting an incoming command voice into a digital voice signal, and then comparing it with sample voice data previously prepared in a voice dictionary. Therefore, to ease the comparison, the voice recognition device often requires the user to utter commands in a specific manner, or to register his/her voice in advance, for example.
The issue here is that tying the voice recognition device in a consumer electronic product to a specific user badly impairs its usability and thus its product value. To get around this problem, the voice recognition device is expected to recognize human voices that vary in pitch and speed, no matter who the speaker is. However, as already described, the conventional voice recognition device compares an incoming command voice with the voice dictionary. Therefore, if the incoming command voice differs greatly in pitch or speed from the sample in the voice dictionary, the voice recognition device fails to perform voice recognition correctly.
FIG. 7 shows a voice recognition device disclosed in Japanese Patent Laid-Open Publication No. 9-325798 (97-325798) as an improvement in this respect. The voice recognition device VRAc includes a voice input part 111, a voice speed calculation part 112, a voice speed change rate determination part 113, a voice speed change part 114, and a voice recognition part 115.
A sound, or voice produced by a user is taken into the voice input part 111, and is captured as a command voice thereby. The captured command voice is A/D converted into a digital voice signal. The voice speed calculation part 112 receives thus produced digital voice signal, and based thereon, calculates the user's voice speed. The voice speed change rate determination part 113 compares thus calculated voice speed with a reference voice speed, and then determines a speed change rate to compensate for the speed gap therebetween. By referring thereto, the voice speed change part 114 changes the voice speed. Then, the voice recognition part 115 performs voice recognition with respect to the voice-speed-changed voice signal.
Described next is the operation of the voice recognition device VRAc. The user sound is captured as command voice together with background noise by the voice input part 111 via a microphone and an amplifier equipped therein, and then an analog signal including the command voice and the background noise is subjected to A/D conversion by an equipped A/D converter. From the voice included in thus obtained digital voice signal, the voice speed calculation part 112 extracts a sound unit which corresponds to the command voice, and calculates the voice speed for the sound unit based on the time taken for the user to produce or utter the sound.
Here, assume that the time taken to utter the sound unit (hereinafter, "one-sound unit utterance time") is Ts, and that a reference time for utterance of the sound unit (hereinafter, "one-sound unit reference time") is Th. Based thereon, the voice speed change rate determination part 113 determines a speed change rate α by comparing 1/Ts and 1/Th with each other, which denote a one-sound unit utterance speed and a one-sound unit reference speed, respectively. The speed change rate α is calculated by the following equation (1).
α = Ts/Th    (1)
Equation (1) shows that when the one-sound unit utterance time Ts is shorter than the one-sound unit reference time Th, i.e., when the incoming command voice is faster than the voice recognition device VRAc can handle, the speed change rate α is smaller than 1. In this case, the incoming command voice should be decreased in speed. Conversely, when the one-sound unit utterance time Ts is longer than the one-sound unit reference time Th, i.e., when the incoming command voice is slower, the speed change rate α is greater than 1. In that case, the incoming command voice should be increased in speed.
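By way of illustration only, the relationship expressed by equation (1) may be sketched as follows. The function names speed_change_rate and required_adjustment are hypothetical and are not taken from the cited publication; the sketch merely evaluates Ts/Th and reports the direction of the speed change described above.

```python
def speed_change_rate(ts: float, th: float) -> float:
    """Return the speed change rate alpha = Ts / Th of equation (1).

    ts: one-sound unit utterance time, th: one-sound unit reference time.
    Both are assumed to be positive and in the same unit (e.g., seconds).
    """
    if ts <= 0 or th <= 0:
        raise ValueError("Ts and Th must be positive")
    return ts / th


def required_adjustment(alpha: float) -> str:
    """Describe the speed change called for by a given alpha."""
    if alpha < 1:
        return "decrease speed"   # the speaker is faster than the reference
    if alpha > 1:
        return "increase speed"   # the speaker is slower than the reference
    return "no change"
```

For example, an utterance time of 0.25 s against a reference time of 0.5 s gives a rate of 0.5, so the command voice would be slowed down.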
In the voice recognition device VRAc, the voice speed change part 114 refers to the speed change rate α to keep the command voice signal constant in speed, and produces a speed-changed command voice signal. The voice recognition part 115 performs voice recognition with respect to the speed-changed command voice signal, and outputs a result obtained thereby.
Such a speed change can be easily done with recent digital technology. For example, to decrease the speed of the voice, several vowel waveforms correlated with the sound unit included in the command voice are added to the voice signal. To increase the speed of the voice, on the other hand, such vowel waveforms are decimated from the command voice signal several times.
This is a technique for changing the voice speed without affecting the pitch of the command voice. That is, this technique is effective for voice recognition in the case that the user speaks faster or slower than the dictionary voice.
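A much simplified sketch of such a speed change is given below for illustration. It assumes that the pitch period of the vowel is already known and fixed at a given number of samples, which is an assumption made here and not a detail of the cited publication; selecting waveforms by correlation is omitted. The function name change_speed and its parameters are hypothetical.

```python
from typing import List, Sequence


def change_speed(samples: Sequence[float], period: int,
                 every: int, slow_down: bool) -> List[float]:
    """Repeat or drop every `every`-th pitch period of `samples`.

    Repeating whole periods lengthens the signal (slower speech) and
    dropping whole periods shortens it (faster speech). Each retained
    period keeps its original length, so the pitch is left unchanged.
    """
    if period <= 0 or every <= 0:
        raise ValueError("period and every must be positive")
    out: List[float] = []
    for count, start in enumerate(range(0, len(samples), period), start=1):
        segment = samples[start:start + period]
        if count % every == 0:
            if slow_down:
                out.extend(segment)   # original period
                out.extend(segment)   # added copy of the same period
            # when speeding up, this period is simply skipped
        else:
            out.extend(segment)
    return out
```

Because only whole periods are added or removed, the waveform within each period is untouched, which is why the pitch of the command voice is not affected.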
The above-described conventional voice recognition device VRAc works well for voice recognition when the user's voice speed differs greatly from the one-sound unit reference speed 1/Th. However, it is not applicable when the user's voice differs in pitch from a reference pitch.
In detail, although the voice recognition device VRAc can cope with various types of speakers who differ in frequency range, e.g., a low-pitched man, a high-pitched woman, or a child, the voice recognition achieved thereby is not satisfactory.
For a fast speaker, it is possible to ask him/her to speak more moderately, but it is impossible to ask him/her to speak in a different voice pitch. Note that a speaker's voice pitch is essentially determined by his/her throat, especially its shape and size. Since the speaker cannot intentionally change the shape or size of his/her throat, the voice pitch cannot be changed intentionally either.
To realize voice recognition of various voices with different pitches, the voice recognition device VRAc would have to store a great number of sample voice data groups, each corresponding to a different speaker such as a man, a woman, or a child speaking in a different pitch. Further, the voice recognition device VRAc would have to select one group from among that great number of sample voice data groups according to the incoming command voice.
To avoid such a nuisance, it seems effective to process the incoming command voice to a pitch optimal for voice recognition. However, since incoming command voices vary greatly in pitch according to the speaker, it is substantially impossible to process the incoming command voice to a desired pitch in a single step. Even at the desired pitch, correct voice recognition cannot be secured because the content of the incoming command voice or the manner of speaking may spoil the voice recognition result. As this shows, the pitch considered optimal for voice recognition from the standpoint of the voice recognition device or the sample voice data is not necessarily optimal in practice.
Therefore, an object of the present invention is to provide a device for normalizing voice pitch to a level considered optimal for voice recognition.
A first aspect of the present invention is directed to a voice pitch normalization device equipped in a voice recognition device for recognizing an incoming command voice uttered by any speaker based on sample data for a plurality of words, and used to normalize the incoming command voice to be in an optimal pitch for voice recognition, the device comprising:
a target voice generator for generating a target voice signal by changing the incoming command voice on a predetermined degree basis;
a probability calculator for calculating a probability indicating a degree of coincidence among the target voice signal and the words in the sample data; and
a voice pitch changer for repeatedly changing the target voice signal in voice pitch until a maximum of the probabilities reaches a predetermined probability or higher.
As described above, in the first aspect, an incoming command voice is so adjusted in voice pitch that a probability indicating a degree of coincidence among the incoming command voice and sample voice data for a plurality of words becomes a predetermined value or greater. Therefore, the incoming command voice can be normalized in a fast and correct manner.
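The interplay of the three elements of the first aspect may be sketched as follows, purely as an illustration. The callables change_pitch and word_probabilities stand in for the target voice generator and the probability calculator; their signatures, the percentage step, and the max_steps bound are assumptions introduced for this sketch and are not specified above.

```python
from typing import Callable, Dict, Optional, Tuple, TypeVar

V = TypeVar("V")  # whatever type represents a voice signal


def normalize_pitch(command_voice: V,
                    change_pitch: Callable[[V, float], V],
                    word_probabilities: Callable[[V], Dict[str, float]],
                    p_threshold: float,
                    step_percent: float,
                    max_steps: int) -> Optional[Tuple[str, float, float]]:
    """Shift the pitch step by step until some word is matched well enough.

    Returns (word, probability, total shift in percent), or None if no
    shift within max_steps reaches p_threshold.
    """
    for n in range(max_steps + 1):
        shift = n * step_percent
        # Target voice signal: the command voice changed by the current degree.
        target = change_pitch(command_voice, shift)
        # Degree of coincidence between the target and every dictionary word.
        probs = word_probabilities(target)
        word, p_max = max(probs.items(), key=lambda kv: kv[1])
        if p_max >= p_threshold:
            return word, p_max, shift
    return None
```

The loop stops as soon as the maximum probability reaches the predetermined probability, so no more pitch changes are tried than necessary.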
According to a second aspect, in the first aspect, when the maximum of the probabilities is smaller than the predetermined probability, the voice pitch changer includes a voice pitch adjustment for increasing or decreasing the target voice signal on the predetermined degree basis.
As described above, in the second aspect, the incoming command voice can be normalized even if being lower or higher in voice pitch compared with the sample voice data.
According to a third aspect, in the second aspect, the voice pitch normalization device further comprises:
a memory for temporarily storing the incoming command voice;
a read-out controller for reading out a string of the incoming command voice from the memory, and generating the target voice signal; and
a read-out clock controller for generating a read-out clock signal with a timing clock determined by frequency, and outputting the timing clock to the memory to change, with the timing specified thereby, the target voice signal in frequency on the predetermined degree basis.
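The effect of reading the memory out with a changed clock may be emulated in software as sketched below. This is an illustrative assumption only: the aspect above describes a read-out clock controller, whereas the sketch approximates it by stepping through the stored samples at a ratio of read-out clock to write clock, with linear interpolation between samples. The function name read_out and the parameter clock_ratio are hypothetical.

```python
from typing import List, Sequence


def read_out(memory: Sequence[float], clock_ratio: float) -> List[float]:
    """Read stored samples with a clock `clock_ratio` times the write clock.

    A ratio above 1 steps through the memory faster, so the resulting
    signal is raised in frequency by that ratio (and becomes shorter);
    a ratio below 1 lowers the frequency (and lengthens the signal).
    """
    if clock_ratio <= 0:
        raise ValueError("clock_ratio must be positive")
    out: List[float] = []
    last = len(memory) - 1
    n = 0
    while n * clock_ratio <= last:
        pos = n * clock_ratio          # fractional read position
        i = int(pos)
        frac = pos - i
        nxt = memory[min(i + 1, last)]
        # Linear interpolation between neighbouring stored samples.
        out.append(memory[i] * (1 - frac) + nxt * frac)
        n += 1
    return out
```

A change of the target voice signal by, say, 5% would correspond to a clock_ratio of 1.05 under this emulation. Note that this approach changes the duration together with the frequency.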
According to a fourth aspect, in the second aspect, the target voice signal is increased in voice pitch on the predetermined degree basis started from a pitch level of the incoming command voice.
According to a fifth aspect, in the fourth aspect, the target voice signal is limited in voice pitch up to a first predetermined pitch, and when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the first predetermined pitch, the target voice signal is decreased in voice pitch on the predetermined degree basis started from the pitch level of the incoming command voice.
As described above, in the fifth aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
According to a sixth aspect, in the fifth aspect, the target voice signal is limited in voice pitch down to a second predetermined pitch, and when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the second predetermined pitch, the incoming command voice is stopped being normalized.
As described above, in the sixth aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
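The search order of the fourth to sixth aspects may be sketched as follows. For simplicity, the sketch expresses the first and second predetermined pitches as limits on the cumulative shift relative to the incoming command voice; this representation and the names up_then_down_shifts, first_acceptable_shift, and score are assumptions made for illustration.

```python
from typing import Callable, Iterator, Optional


def up_then_down_shifts(step: int, upper_limit: int,
                        lower_limit: int) -> Iterator[int]:
    """Yield cumulative pitch shifts in the order they are tried.

    First 0, +step, +2*step, ... up to upper_limit, then, starting again
    from the pitch of the incoming command voice, -step, -2*step, ...
    down to lower_limit (a negative number).
    """
    n = 0
    while n * step <= upper_limit:
        yield n * step
        n += 1
    n = 1
    while -n * step >= lower_limit:
        yield -n * step
        n += 1


def first_acceptable_shift(step: int, upper_limit: int, lower_limit: int,
                           score: Callable[[int], float],
                           threshold: float) -> Optional[int]:
    """Return the first shift whose maximum probability meets the threshold.

    `score(shift)` is assumed to return the maximum word probability for
    the target voice at that shift. None means normalization is stopped.
    """
    for shift in up_then_down_shifts(step, upper_limit, lower_limit):
        if score(shift) >= threshold:
            return shift
    return None
```

With a step of 5 and limits of +10 and -10, the shifts are tried in the order 0, 5, 10, -5, -10. The downward-first order of the seventh to ninth aspects would be obtained by exchanging the two loops.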
According to a seventh aspect, in the second aspect, the target voice signal is decreased in voice pitch on the predetermined degree basis started from a pitch level of the incoming command voice.
According to an eighth aspect, in the seventh aspect, the target voice signal is limited in voice pitch down to a third predetermined pitch, and when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the third predetermined pitch, the target voice signal is increased in voice pitch on the predetermined degree basis started from the pitch level of the incoming command voice.
As described above, in the eighth aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
According to a ninth aspect, in the eighth aspect, the target voice signal is limited in voice pitch up to a fourth predetermined pitch, and when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the fourth predetermined pitch, the incoming command voice is stopped being normalized.
A tenth aspect of the present invention is directed to a voice recognition device for recognizing an incoming command voice optimally normalized for voice recognition based on sample data for a plurality of words, the device comprising:
a target voice generator for generating a target voice signal by changing the incoming command voice on a predetermined degree basis;
a probability calculator for calculating a probability indicating a degree of coincidence among the target voice signal and the words in the sample data; and
a voice pitch changer for repeatedly changing the target voice signal in voice pitch until a maximum of the probabilities reaches a predetermined probability or higher.
As described above, in the tenth aspect, an incoming command voice is so adjusted in voice pitch that a probability indicating a degree of coincidence among the incoming command voice and sample voice data for a plurality of words becomes a predetermined value or greater. Therefore, the incoming command voice can be normalized in a fast and correct manner.
According to an eleventh aspect, in the tenth aspect, when the maximum of the probabilities is smaller than the predetermined probability, the target voice generator includes a voice pitch adjustment for increasing or decreasing the target voice signal on the predetermined degree basis.
As described above, in the eleventh aspect, the incoming command voice can be normalized even if being lower or higher in voice pitch compared with the sample voice data.
According to a twelfth aspect, in the eleventh aspect, the voice recognition device further comprises:
a memory for temporarily storing the incoming command voice;
a read-out controller for reading out a string of the incoming command voice from the memory, and generating the target voice signal; and
a read-out clock controller for generating a read-out clock signal with a timing clock determined by frequency, and outputting the timing clock to the memory to change, with the timing specified thereby, the target voice signal in frequency on the predetermined degree basis.
According to a thirteenth aspect, in the eleventh aspect, the target voice signal is increased in voice pitch on the predetermined degree basis started from a pitch level of the incoming command voice.
As described above, in the thirteenth aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
According to a fourteenth aspect, in the thirteenth aspect, the target voice signal is limited in voice pitch up to a first predetermined pitch, and when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the first predetermined pitch, the target voice signal is decreased in voice pitch on the predetermined degree basis started from the pitch level of the incoming command voice.
As described above, in the fourteenth aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
According to a fifteenth aspect, in the fourteenth aspect, the target voice signal is limited in voice pitch down to a second predetermined pitch, and when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the second predetermined pitch, the incoming command voice is stopped being normalized.
According to a sixteenth aspect, in the eleventh aspect, the target voice signal is decreased in voice pitch on the predetermined degree basis started from a pitch level of the incoming command voice.
According to a seventeenth aspect, in the sixteenth aspect, the target voice signal is limited in voice pitch down to a third predetermined pitch, and when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the third predetermined pitch, the target voice signal is increased in voice pitch on the predetermined degree basis started from the pitch level of the incoming command voice.
As described above, in the seventeenth aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
According to an eighteenth aspect, in the seventeenth aspect, the target voice signal is limited in voice pitch up to a fourth predetermined pitch, and when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the fourth predetermined pitch, the incoming command voice is stopped being normalized.
A nineteenth aspect of the present invention is directed to a voice pitch normalization method utilized for a voice recognition device for recognizing an incoming command voice uttered by any speaker based on sample data for a plurality of words, and applied to normalize the incoming command voice to be in an optimal pitch for voice recognition, the method comprising:
a step of generating a target voice signal by changing the incoming command voice on a predetermined degree basis;
a step of calculating a probability indicating a degree of coincidence among the target voice signal and the words in the sample data; and
a step of repeatedly changing the target voice signal in voice pitch until a maximum of the probabilities reaches a predetermined probability or higher.
As described above, in the nineteenth aspect, an incoming command voice is so adjusted in voice pitch that a probability indicating a degree of coincidence among the incoming command voice and sample voice data for a plurality of words becomes a predetermined value or greater. Therefore, the incoming command voice can be normalized in a fast and correct manner.
According to a twentieth aspect, in the nineteenth aspect, the voice pitch normalization method further comprises a step of, when the maximum of the probabilities is smaller than the predetermined probability, increasing or decreasing the target voice signal on the predetermined degree basis.
As described above, in the twentieth aspect, the incoming command voice can be normalized even if being lower or higher in voice pitch compared with the sample voice data.
According to a twenty-first aspect, in the twentieth aspect, the voice pitch normalization method further comprises:
a step of temporarily storing the incoming command voice;
a step of generating the target voice signal from a string of the temporarily stored incoming command voice; and
a step of determining a timing clock by frequency, in such manner as to change, with the timing specified thereby, the target voice signal in frequency on the predetermined degree basis.
According to a twenty-second aspect, in the twentieth aspect, the voice pitch normalization method further comprises a step of increasing the target voice signal in voice pitch on the predetermined degree basis started from a pitch level of the incoming command voice.
According to a twenty-third aspect, in the twenty-second aspect, the target voice signal is limited in voice pitch up to a first predetermined pitch, and
the method further comprises a step of, when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the first predetermined pitch, decreasing the target voice signal in voice pitch on the predetermined degree basis started from the pitch level of the incoming command voice.
As described above, in the twenty-third aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
According to a twenty-fourth aspect, in the twenty-third aspect, the target voice signal is limited in voice pitch down to a second predetermined pitch, and
the method further comprises a step of, when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the second predetermined pitch, stopping normalizing the incoming command voice.
As described above, in the twenty-fourth aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
According to a twenty-fifth aspect, in the twentieth aspect, the voice pitch normalization method further comprises a step of decreasing the target voice signal in voice pitch on the predetermined degree basis started from a pitch level of the incoming command voice.
According to a twenty-sixth aspect, in the twenty-fifth aspect, the target voice signal is limited in voice pitch down to a third predetermined pitch, and
the method further comprises a step of, when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the third predetermined pitch, increasing the target voice signal in voice pitch on the predetermined degree basis started from the pitch level of the incoming command voice.
As described above, in the twenty-sixth aspect, the capability of the voice recognition device appropriately determines a range for normalizing the incoming command voice.
According to a twenty-seventh aspect, in the twenty-sixth aspect, the target voice signal is limited in voice pitch up to a fourth predetermined pitch, and
the method further comprises a step of, when the maximum of the probabilities fails to reach the predetermined probability or higher before the target voice signal reaches the fourth predetermined pitch, stopping normalizing the incoming command voice.