1. Field of the Invention
The present invention relates to speech recognition apparatuses, and more specifically, to a speech recognition apparatus used for AV equipment such as a TV, radio, and audio system that reproduces multichannel audio including two-channel stereo, capable of controlling the AV equipment through voice, inputting information to the AV equipment through voice, and carrying out other operations even if audio is reinforced by loudspeakers.
2. Description of the Background Art
A conventional speech recognition technique with audio reinforced by a loudspeaker is exemplarily disclosed in Japanese Patent Laid-Open Publication No. 5-22779 (1993-22779) (Title: SPEECH RECOGNITION REMOTE CONTROLLER).
FIG. 23 is a block diagram showing the configuration of a conventional speech recognition apparatus for AV equipment using the technique disclosed in the above publication. The speech recognition apparatus of FIG. 23 is used for AV equipment with a single loudspeaker 201. In FIG. 23, the conventional speech recognition apparatus includes a microphone 202, a speech recognition unit 203, and an echo canceller 204.
With reference to FIG. 24, the operation of the above-configured conventional speech recognition apparatus for AV equipment is now described.
FIG. 24 is a diagram showing time waveforms of signals inputted to or outputted from the components of the speech recognition apparatus of FIG. 23. In FIG. 24, consider the case where a user speaks to control the AV equipment while audio is reinforced by the loudspeaker 201.
When the user speaks without the audio being reinforced by the loudspeaker 201, a speech signal outputted from the microphone 202 is extremely good in S/N ratio, as indicated by a reference numeral 211 in FIG. 24. When an audio signal 212 for a TV program is inputted to the loudspeaker 201, an echo signal 213 that is similar to the loudspeaker input 212 is mixed into an output from the microphone 202.
Therefore, the microphone 202 outputs a signal with the user's speech 211 and the echo signal 213 mixed therein, as indicated by a reference numeral 214 of FIG. 24. This signal is too low in S/N ratio for recognition of the user's speech. Naturally, with such microphone output 214, sufficient speech recognition results by the speech recognition unit 203 cannot be expected.
Thus, in the speech recognition apparatus of FIG. 23, the echo signal 213 echoed to the microphone 202 from the loudspeaker 201 is estimated by an adaptive digital filter provided in the echo canceller 204. A subtraction circuit in the echo canceller 204 subtracts the estimated echo signal from the microphone output 214 to totally cancel out the echo signal 213, thereby extracting only the user's speech 211.
The echo canceller 204 is provided with the loudspeaker input 212, which is an input signal to the loudspeaker 201. The adaptive digital filter in the echo canceller 204 estimates an echo signal 215 from the waveform of the loudspeaker input 212 and an impulse response from the loudspeaker 201 through the microphone 202 that is stored therein. Then, the subtraction circuit provided in the echo canceller 204 subtracts the estimated echo signal 215 from the microphone output 214 to obtain an echo canceller output 216.
As can be seen by comparing the echo canceller output 216 with the waveform of the user's speech 211, the speech recognition unit 203 can be expected to carry out correct speech recognition under the action of echo cancellation by the echo canceller 204 even when audio is reinforced by the loudspeaker 201.
However, the speech recognition apparatus of FIG. 23 supports only monaural AV equipment, and cannot be used for multichannel AV equipment using a plurality of loudspeakers.
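The estimate-and-subtract operation described above can be sketched as follows. This is only an illustration under stated assumptions: the publication does not name the adaptation algorithm, so a normalized LMS (NLMS) update is assumed here, and the function names, tap count, step size, and the three-tap echo path used in the helper are all hypothetical.

```python
import random


def white_noise(n, seed=0):
    """Deterministic test signal standing in for the loudspeaker input 212."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def simulate_echo(reference, path):
    """Convolve the loudspeaker input with a hypothetical echo-path impulse response."""
    return [
        sum(p * reference[n - k] for k, p in enumerate(path) if n - k >= 0)
        for n in range(len(reference))
    ]


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Return the echo canceller output (assumed NLMS adaptation).

    reference: samples fed to the loudspeaker (loudspeaker input)
    mic:       samples picked up by the microphone (microphone output)
    """
    weights = [0.0] * taps   # estimated impulse response of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    output = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate          # role of the subtraction circuit
        power = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / power for w, h in zip(weights, history)]
        output.append(error)
    return output
```

When the microphone signal contains only echo, the output of `nlms_echo_cancel` should decay toward zero as the filter adapts; any user's speech added to the microphone signal would remain in the output.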
FIG. 25 is a block diagram showing the configuration of another conventional speech recognition apparatus for AV equipment. The speech recognition apparatus of FIG. 25 is used for 2-channel AV equipment with two loudspeakers 221 and 222.
In this speech recognition apparatus, sound echoed from the loudspeaker 221 to the microphone 223 and sound echoed from the loudspeaker 222 to the microphone 223 are estimated by adaptive digital filters in the echo cancellers 225 and 226. By subtracting the estimated values from the output signal from the microphone, only the user's speech can be extracted. Unlike the speech recognition apparatus of FIG. 23, the speech recognition apparatus of FIG. 25 is adaptable to stereo AV equipment.
The speech recognition apparatus of FIG. 25, however, requires as many echo cancellers as audio channels. Therefore, it becomes too costly for use in multichannel AV equipment. Moreover, in such a system using a plurality of echo cancellers, mutual interference among the echo cancellers occurs, resulting in major drawbacks such as instability in the adaptive operation of each echo canceller, and an increase in echo and oscillation due to failure in adaptation.
It is strongly desired that speech recognition apparatuses for AV equipment should carry out speech recognition while reproducing audio through a loudspeaker, support multichannel audio, ensure high reliability, and have a low price.
However, as described above, the conventional speech recognition apparatuses require as many echo cancellers as audio channels. Therefore, they become too costly for use in multichannel AV equipment.
Furthermore, mutual interference among the echo cancellers makes adaptive operation of each echo canceller extremely unstable, thereby causing an increase in echo and oscillation due to failure in adaptation, and as a result, decreasing speech recognition performance.
Therefore, an object of the present invention is to achieve a low-cost speech recognition apparatus for multichannel AV equipment capable of speech recognition with high accuracy while multichannel sound is being produced from loudspeakers.
The present invention has the following features to solve the problems above.
A first aspect of the present invention is directed to a speech recognition apparatus used for AV equipment outputting multichannel sound through a plurality of loudspeakers, capable of recognizing a user's speech inputted through a microphone and causing the AV equipment to perform a predetermined process, the apparatus comprising:
a monaural conversion part for converting multichannel signals to the plurality of loudspeakers into a monaural signal;
a single echo canceller, provided with an output from the microphone (microphone output) and an output from the monaural conversion part (monaural output), for estimating echo sound of the multichannel sound based on the monaural signal and eliminating the echo sound from the microphone output; and
a speech recognition part for recognizing the user's speech based on an output from the single echo canceller (echo canceller output).
In the first aspect, the multichannel signals are converted into a monaural signal, which is provided to the single echo canceller. The single echo canceller eliminates echo sound of multichannel sound from the microphone output. Therefore, with only a single echo canceller, speech recognition can be carried out while multichannel sound is produced from the loudspeakers irrespectively of the number of channels. Furthermore, unlike the case where a plurality of echo cancellers are provided, the present invention can prevent mutual interference among the echo cancellers that leads to deterioration in speech recognition performance.
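As a minimal sketch of the monaural conversion part, the channels may be combined sample by sample into one signal that serves as the reference for the single echo canceller. The averaging rule and the name `to_monaural` are assumptions for illustration; this section does not specify how the conversion is carried out.

```python
def to_monaural(channels):
    """Average N equally long channel sample lists into one monaural list.

    channels: list of per-channel sample lists (hypothetical representation).
    """
    if not channels:
        raise ValueError("at least one channel is required")
    count = len(channels)
    # zip(*channels) yields one tuple per sample instant across all channels.
    return [sum(frame) / count for frame in zip(*channels)]
```

The same function applies regardless of the number of channels, which reflects why a single echo canceller suffices in the first aspect.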
According to a second aspect, in the first aspect, the multichannel signals are provided to the plurality of loudspeakers.
In the second aspect, multichannel sound is produced from the plurality of loudspeakers. Therefore, echo sound cannot be completely cancelled out with the monaural signal. However, if a monaural level of the multichannel signals is closer to 1, echo sound can be cancelled out for the most part. At least part of echo sound can be cancelled out unless the monaural level of the multichannel signals is 0.
Here, the monaural level of the multichannel signals is a ratio of signal components (monaural components) commonly included in all channels to one of the signals. If the signals of all channels have no correlation to each other, the monaural level is "0". If these signals are equal, the monaural level is "1".
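One possible way to estimate such a level for two channels is sketched below. It is an assumption, not the method of the invention: the normalized cross-correlation of the two channels is used as a stand-in because it yields 0 for uncorrelated signals and 1 for equal signals, matching the two end points given above. Negative correlation is clamped to 0, and silence is treated as monaural; both choices are hypothetical.

```python
import math


def monaural_level(left, right):
    """Estimate a monaural level in [0, 1] for two channels (assumed measure)."""
    cross = sum(l * r for l, r in zip(left, right))
    energy = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    if energy == 0.0:
        # Assumption: a silent channel carries no stereo component.
        return 1.0
    # Clamp so that anti-phase content does not produce a negative level.
    return max(0.0, min(1.0, cross / energy))
```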
According to a third aspect, in the first aspect, the speech recognition apparatus further comprises a switching part for switching between the multichannel signals and the monaural signal to the plurality of loudspeakers.
In the third aspect, multichannel or monaural sound can be selectively produced from the plurality of loudspeakers.
According to a fourth aspect, in the third aspect, the speech recognition apparatus further comprises a speech detection part for detecting the user's speech based on the monaural signal and the echo canceller output, wherein the switching part:
inputs the multichannel signals to the plurality of loudspeakers when the speech detection part does not detect the user's speech; and
inputs the monaural signal to the plurality of loudspeakers when the speech detection part detects the user's speech.
In the fourth aspect, multichannel sound is produced when speech recognition is not required (user's speech is not detected), while monaural sound is produced when required (detected). Therefore, speech recognition can be carried out with sufficiently high accuracy.
According to a fifth aspect, in the third aspect, the speech recognition apparatus further comprises:
a start instruction part for providing an instruction to start speech recognition operation;
an end instruction part for providing an instruction to end the speech recognition operation; and
a state setting part for setting, responsive to the instructions from the start instruction part and the end instruction part, the speech recognition part to an active state or wait state, wherein the switching part:
inputs the multichannel signals to the plurality of loudspeakers when the state setting part sets the speech recognition part to the wait state; and
inputs the monaural signal to the plurality of loudspeakers when the state setting part sets the speech recognition part to the active state.
In the fifth aspect, multichannel sound is produced when the speech recognition part is in a wait state (OFF state), while monaural sound is produced when in an active state (ON state). Therefore, speech recognition can be carried out with sufficiently high accuracy.
According to a sixth aspect, in the fifth aspect, the speech recognition apparatus further comprises:
a monaural level determination part for determining a monaural level of the multichannel signals; and
an arbitrary level monaural conversion part for converting the multichannel signals at an arbitrary monaural level, wherein:
the monaural conversion part completely converts the multichannel signals; and
when the monaural level determined by the monaural level determination part is lower than a predetermined monaural level, the arbitrary level monaural conversion part converts the multichannel signals at the predetermined monaural level.
In the sixth aspect, the monaural level of the multichannel signals is always higher than the predetermined monaural level. Therefore, even if the speech recognition part is in an active state (ON state), speech recognition performance can be achieved with high accuracy and little loss of a sense of stereo. That is, a sense of stereo and high speech recognition performance can be balanced.
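The idea of converting at an intermediate monaural level can be illustrated for two channels by scaling the difference (stereo) component while keeping the common component. This is a sketch under assumptions: the `width` parameter and the mid/side formulation are hypothetical, and how a predetermined monaural level would map onto such a mixing ratio is not stated in this section.

```python
def narrow_stereo(left, right, width):
    """Reduce the stereo component of two channels.

    width = 1.0 leaves the signals unchanged; width = 0.0 makes both
    outputs equal to the fully monaural (averaged) signal.
    """
    if not 0.0 <= width <= 1.0:
        raise ValueError("width must lie in [0, 1]")
    out_left, out_right = [], []
    for l, r in zip(left, right):
        mid = (l + r) / 2.0    # component common to both channels
        side = (l - r) / 2.0   # component that differs between channels
        out_left.append(mid + width * side)
        out_right.append(mid - width * side)
    return out_left, out_right
```

A controller could apply a `width` below 1.0 only when the determined monaural level falls below the predetermined level, leaving sufficiently monaural material untouched.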
According to a seventh aspect, in the fifth aspect, the multichannel signals are signals of three or more channels, the apparatus further comprises a 2-channel conversion part for converting the multichannel signals into 2-channel signals, the monaural conversion part converts the 2-channel signals into a monaural signal, and the switching part switches among the multichannel signals, the 2-channel signals, and the monaural signal for output to the plurality of loudspeakers.
According to an eighth aspect, in the seventh aspect, the speech recognition apparatus further comprises:
a speech detection part for detecting the user's speech based on the monaural signal and the echo canceller output, wherein:
the switching part:
inputs the multichannel signals to the plurality of loudspeakers when the state setting part sets the speech recognition part to the wait state;
inputs the 2-channel signals to the plurality of loudspeakers when the state setting part sets the speech recognition part to the active state; and
inputs the monaural signal to the plurality of loudspeakers when the speech detection part detects the user's speech.
In the eighth aspect, multichannel sound is produced when the speech recognition part is in a wait state (OFF state); 2-channel sound is produced when the speech recognition part is in an active state (ON state) but speech recognition is not required (user's speech is not detected); and monaural sound is produced when speech recognition is required (user's speech is detected). Therefore, speech recognition performance can be achieved with sufficiently high accuracy and little loss of a sense of stereo.
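The selection performed by the switching part in the eighth aspect can be sketched as a small decision function. The enum and function names are hypothetical. The sketch also assumes that the wait state takes priority, so that detected speech leads to monaural output only in the active state; the text above does not state this explicitly.

```python
from enum import Enum


class RecognizerState(Enum):
    WAIT = "wait"       # speech recognition part is OFF
    ACTIVE = "active"   # speech recognition part is ON


class LoudspeakerInput(Enum):
    MULTICHANNEL = "multichannel"
    TWO_CHANNEL = "2-channel"
    MONAURAL = "monaural"


def select_loudspeaker_input(state, speech_detected):
    """Choose which signals the switching part feeds to the loudspeakers."""
    if state is RecognizerState.WAIT:
        # Assumption: no switching to monaural while recognition is OFF.
        return LoudspeakerInput.MULTICHANNEL
    if speech_detected:
        return LoudspeakerInput.MONAURAL
    return LoudspeakerInput.TWO_CHANNEL
```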
According to a ninth aspect, in the fifth aspect, the speech recognition apparatus further comprises:
a cancellation monitoring part for monitoring, based on the monaural signal and the echo canceller output, whether the echo canceller sufficiently cancels out the echo sound;
a speech detection part for detecting the user's speech based on the monaural signal and the echo canceller output; and
an attenuation part for attenuating the multichannel signals, wherein the attenuation part attenuates the multichannel signals when the speech detection part detects the user's speech while the cancellation monitoring part indicates that the echo sound is not sufficiently cancelled out.
In the ninth aspect, when the user's speech is detected while echo sound is not sufficiently cancelled out, the level of sound produced from the plurality of loudspeakers is reduced, thereby preventing mixing of echo sound. Consequently, speech recognition performance with echo sound not sufficiently cancelled out can be improved.
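The attenuation condition of the ninth aspect reduces to a two-input decision, sketched below. The function names and the attenuated gain of 0.25 are hypothetical; the amount of attenuation is not given in this section.

```python
def loudspeaker_gain(speech_detected, echo_cancelled, attenuated_gain=0.25):
    """Return the gain applied to the multichannel signals.

    Attenuate only when speech is detected while the cancellation
    monitoring part reports insufficient echo cancellation.
    """
    if speech_detected and not echo_cancelled:
        return attenuated_gain
    return 1.0


def apply_gain(channels, gain):
    """Scale every sample of every channel by the given gain."""
    return [[sample * gain for sample in channel] for channel in channels]
```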
According to a tenth aspect, in the fifth aspect, the echo canceller comprises:
an adaptive digital filter for estimating an impulse response on an echo path between the plurality of loudspeakers and the microphone and calculating the echo sound based on the estimated impulse response and the monaural signal; and
a subtraction part for subtracting an output from the adaptive digital filter from the microphone output.
In the tenth aspect, echo sound of the multichannel sound is eliminated from the microphone output, so that only the user's speech is provided to the speech recognition part.
According to an eleventh aspect, in the tenth aspect, the speech recognition apparatus further comprises an adaptation sound generation part for generating monaural adaptation sound for accelerating adaptation of the adaptive digital filter when the switching part switches inputs to the plurality of loudspeakers from the multichannel signals to the monaural signal.
In the eleventh aspect, when inputs to the loudspeakers are switched from the multichannel signals to the monaural signal, monaural adaptation sound is produced from the plurality of loudspeakers. Therefore, even if no sound is produced immediately after switching, the impulse response held by the digital filter can be forcefully adapted to that on an echo path.
According to a twelfth aspect, in the tenth aspect, the speech recognition apparatus further comprises an adaptation control part for controlling an adaptation speed of the adaptive digital filter, wherein the adaptation control part includes a high adaptation speed for monaural and a low adaptation speed for multichannel, selecting the high adaptation speed when the state setting part sets the speech recognition part to the active state and selecting the low adaptation speed when the state setting part sets the speech recognition part to the wait state.
In the twelfth aspect, the adaptation speed of the adaptive digital filter in the echo canceller is controlled to be high when the speech recognition part is set to an active state, and low when it is set to a wait state. Therefore, appropriate echo cancellation can be achieved for monaural and multichannel sound.
That is, when multichannel sound is produced from the loudspeakers, many stereo components, which are noise for the adaptive digital filter, are present therein. Therefore, with a low adaptation speed, noise resistance is increased. On the other hand, when monaural sound is produced, no stereo components are present therein. Therefore, with a high adaptation speed, fluctuations in impulse response on the echo path can be followed more closely.
As a result, an excellent echo canceling effect can be achieved in a wait state, and speech recognition performance immediately after a transition to an active state can be increased.
According to a thirteenth aspect, in the twelfth aspect, the adaptation control part is provided with an identification signal indicating whether the plurality of loudspeakers are provided with the multichannel signals or the monaural signal, and when the identification signal indicates monaural, the adaptation control part selects the high adaptation speed irrespectively of whether the state setting part sets the speech recognition part to the active or wait state.
In the thirteenth aspect, it is determined whether the plurality of loudspeakers are provided with the multichannel signals or the monaural signal. For the monaural signal, the high adaptation speed is selected irrespectively of whether the state setting part sets the speech recognition part to an active or wait state. Therefore, fluctuations in impulse response on the echo path can be followed without degradation. As a result, an excellent echo canceling effect can be achieved in a wait state, and speech recognition performance immediately after a transition to an active state can be increased.
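The selection rule of the twelfth and thirteenth aspects can be sketched as follows, assuming that the adaptation speed corresponds to the step size of the adaptive digital filter. The function name and the two step-size values are hypothetical.

```python
def select_step_size(active, output_is_monaural, fast=0.5, slow=0.05):
    """Choose the adaptation step size for the adaptive digital filter.

    active:             True when the speech recognition part is in the active state
    output_is_monaural: True when the identification signal indicates monaural
    """
    # Thirteenth aspect: monaural output selects the high speed in either state.
    if output_is_monaural or active:
        return fast
    # Wait state with multichannel output: favor noise resistance.
    return slow
```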
According to a fourteenth aspect, in the tenth aspect, the speech recognition apparatus further comprises:
a monaural level determination part for determining a monaural level of the multichannel signals; and
an adaptation control part for controlling the adaptation speed of the adaptive digital filter based on the determined monaural level.
In the fourteenth aspect, the adaptation speed of the adaptive digital filter is controlled based on the monaural level of the multichannel signals. Therefore, appropriate echo cancellation can be made for multichannel signals varying in monaural level.
That is, if the monaural level is low, the adaptation speed is made low, thereby increasing noise resistance. On the other hand, if the monaural level is high, only a small number of stereo components, which are noise for the adaptive digital filter, are present in the multichannel signals, so that little noise resistance is required. Therefore, as in the following fifteenth aspect, with a high adaptation speed, fluctuations in impulse response on the echo path can be followed more closely. As a result, an excellent echo canceling effect can be achieved especially when the monaural level is high, and speech recognition performance immediately after a transition to an active state can be increased.
According to a fifteenth aspect, in the fourteenth aspect, the adaptation control part increases the adaptation speed of the adaptive digital filter as the monaural level of the multichannel signals is higher.
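A monotonic mapping of this kind can be sketched with linear interpolation between a low and a high step size. The linear form, the function name, and the end-point values are assumptions; the aspect only requires that the adaptation speed increase with the monaural level.

```python
def step_size_for_level(level, slow=0.05, fast=0.5):
    """Map a monaural level in [0, 1] to an adaptation step size.

    The step size grows linearly from `slow` (level 0) to `fast` (level 1).
    """
    level = max(0.0, min(1.0, level))   # tolerate slightly out-of-range estimates
    return slow + (fast - slow) * level
```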
According to a sixteenth aspect, in the tenth aspect, the speech recognition apparatus further comprises a non-volatile memory, wherein:
the non-volatile memory receives and stores the impulse response estimated by the adaptive digital filter at power OFF, and provides, at power ON, the estimated impulse response stored at power OFF to the adaptive digital filter; and
the adaptive digital filter starts estimating the impulse response by taking the estimated impulse response provided at power ON by the non-volatile memory as an initial value.
In the sixteenth aspect, the estimated impulse response at power OFF is stored, and estimation of the impulse response is started at power ON by taking the stored estimated impulse response as the initial value. Therefore, compared with the case where 0 is taken as the initial value, estimation error immediately after power ON can be reduced. As a result, speech recognition performance is increased.
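The store-and-restore behavior can be sketched with a file standing in for the non-volatile memory. The JSON file format, the function names, and the fallback to an all-zero initial value when nothing usable is stored are assumptions for illustration.

```python
import json
import os


def save_impulse_response(path, weights):
    """At power OFF: store the estimated impulse response (filter coefficients)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(list(weights), handle)


def load_impulse_response(path, taps):
    """At power ON: return the stored estimate, or zeros if none is usable."""
    if not os.path.exists(path):
        return [0.0] * taps
    with open(path, encoding="utf-8") as handle:
        stored = json.load(handle)
    if not isinstance(stored, list) or len(stored) != taps:
        # Assumption: a mismatched tap count means the stored data is stale.
        return [0.0] * taps
    return [float(value) for value in stored]
```

The returned list would then serve as the initial coefficients of the adaptive digital filter in place of zeros.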
According to a seventeenth aspect, in the fifth aspect, the speech recognition apparatus further comprises a speech detection part for detecting the user's speech based on the monaural signal and the echo canceller output, wherein:
the start instruction part is implemented by a button switch that provides a start instruction to the state setting part when being pressed; and
the end instruction part is implemented by a time switch that provides an end instruction to the state setting part after a predetermined period during which the speech detection part does not detect the user's speech.
In the seventeenth aspect, speech recognition operation can be automatically ended.
According to an eighteenth aspect, in the fifth aspect, the speech recognition apparatus further comprises a speech detection part for detecting the user's speech based on the monaural signal and the echo canceller output, wherein:
the start instruction part is implemented by a voice switch that provides a start instruction to the state setting part when the speech detection part detects the user's speech; and
the end instruction part is implemented by a time switch that provides an end instruction to the state setting part after a predetermined period during which the speech detection part does not detect the user's speech.
In the eighteenth aspect, speech recognition operation can be automatically started and ended.
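The start and end instructions of the seventeenth and eighteenth aspects can be sketched as a small state holder. The class name, the use of caller-supplied timestamps in seconds, and the `voice_start` flag that selects between the button switch and the voice switch are hypothetical.

```python
class RecognitionSwitch:
    """Tracks the active/wait state with a silence timeout (time switch)."""

    def __init__(self, timeout_s, voice_start=False):
        self.timeout_s = timeout_s        # predetermined period without speech
        self.voice_start = voice_start    # True: voice switch; False: button switch
        self.active = False
        self._last_speech = None

    def press_button(self, now):
        """Button switch: start instruction issued when the button is pressed."""
        self.active = True
        self._last_speech = now

    def update(self, speech_detected, now):
        """Feed one speech detection result; return True while in the active state."""
        if speech_detected:
            if self.voice_start and not self.active:
                self.active = True        # voice switch: start instruction
            if self.active:
                self._last_speech = now   # restart the silence period
        elif self.active and now - self._last_speech >= self.timeout_s:
            self.active = False           # time switch: end instruction
        return self.active
```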