A voice input/output device that includes a voice input device such as a microphone and a voice output device such as a headphone, for example, a headset microphone, is known. A voice-based data input device that: recognizes a voice input from a voice input device to convert the voice into text; converts the text of the recognition result into a voice; and outputs the voice from a voice output device is also known. By checking the voice (hereafter referred to as “synthetic voice”) obtained by converting the text of the recognition result, the user can determine whether or not the voice produced by the user is appropriately recognized.
In addition, in the case of also checking (hereafter also referred to as “monitoring”) the input voice itself using the above-mentioned data input device, the data input device outputs not only the synthetic voice but also the input voice to the voice output device.
FIG. 10 is an explanatory diagram depicting an example of the data input device. In the example depicted in FIG. 10, when a voice produced by the user is input to a microphone 71, the voice is output from a speaker 72. The voice produced by the user is simultaneously input to a voice recognition/synthesis device 73, and a synthetic voice generated by a voice recognition and voice synthesis process is output from the speaker 72, too.
One reason for monitoring the input voice from the voice input device through the voice output device is to confirm that the voice can be input from the voice input device. Another reason is to prevent a decrease in the voice recognition rate due to the Lombard effect when speaking in a noisy environment. In the case where a headphone is used as the voice output device, the user's ears are covered and so the user might not be able to hear ambient sound. Even in such a case, outputting the input voice from the voice input device to the voice output device (headphone) enables the user to hear the ambient sound.
Typically, the timing at which the voice input to the voice input device is output and the timing at which the synthetic voice is output are different. This is because a predetermined processing time is taken for voice recognition when generating the synthetic voice. Accordingly, the user hears the synthetic voice a predetermined time after he or she produces the voice.
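The timing relation described above can be illustrated with a minimal sketch. It assumes that the input voice is passed to the voice output device without delay and that the synthetic voice starts a fixed number of samples later; the function name `mix_monitor_and_synthetic` and the parameter `delay_samples` are hypothetical and are not taken from any described device.

```python
def mix_monitor_and_synthetic(input_voice, synthetic_voice, delay_samples):
    """Build the signal sent to the voice output device.

    Assumption: the input voice is output immediately (monitoring), while
    the synthetic voice is added delay_samples later to model the
    processing time taken for voice recognition and voice synthesis.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")

    # The output must be long enough to hold both signals.
    length = max(len(input_voice), delay_samples + len(synthetic_voice))
    output = [0.0] * length

    # Monitored input voice: no delay.
    for i, sample in enumerate(input_voice):
        output[i] += sample

    # Synthetic voice: shifted by the processing delay.
    for i, sample in enumerate(synthetic_voice):
        output[delay_samples + i] += sample

    return output


# Two samples of input voice, followed three samples later by the synthetic voice.
speaker_signal = mix_monitor_and_synthetic([1.0, 1.0], [0.5, 0.5], delay_samples=3)
```

In this sketch the two voices do not overlap when the delay exceeds the length of the input voice; with a shorter delay, the samples are simply summed.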
In the voice input/output device that combines the voice input device and the voice output device, the balance between the voice input level and output level needs to be adjusted in order to prevent howling. Various methods for adjusting these levels are known.
Patent Literature (PTL) 1 describes a karaoke machine having a function of adjusting a microphone used to input a singing voice. In the karaoke machine described in PTL 1, when adjusting the microphone volume or effect, a singer's voice is converted by PCM (Pulse Code Modulation), and the converted data is recorded as a voice. The singer adjusts the microphone volume while repeatedly playing the recorded voice, and records the voice again. This saves the singer from having to produce the voice repeatedly.
PTL 2 describes a karaoke machine that prevents howling by automatically adjusting voices output from a plurality of speakers. The karaoke machine described in PTL 2 prevents howling by, in accordance with the relation between a predetermined speaker position and a designated microphone position, lowering the microphone input voice signal level or lowering the mixing level upon output from each speaker.
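The general idea of lowering the microphone input level in accordance with the speaker-microphone positional relation can be sketched as follows. This is only an illustration under stated assumptions and is not the method of PTL 2: it assumes a linear attenuation rule based on distance, and the function name `microphone_gain` and the parameter `safe_distance_m` are hypothetical.

```python
def microphone_gain(distance_m, safe_distance_m):
    """Return a gain factor in the range 0.0 to 1.0 for the microphone input.

    Assumption: the closer the microphone is to the speaker, the more the
    input level is lowered; at or beyond safe_distance_m no attenuation is
    applied. The linear rule is illustrative only.
    """
    if distance_m < 0 or safe_distance_m <= 0:
        raise ValueError("distances must be non-negative and the safe distance positive")
    return min(1.0, distance_m / safe_distance_m)


def apply_gain(samples, gain):
    """Scale each input voice sample by the gain factor."""
    return [sample * gain for sample in samples]


# Microphone 0.5 m from the speaker, with a hypothetical safe distance of 2.0 m.
gain = microphone_gain(0.5, 2.0)
attenuated = apply_gain([0.8, -0.4], gain)
```

A real device would likely derive the attenuation from the measured acoustic characteristics of the room rather than from distance alone.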