In recent years, there has been a growing need for speech processing that extracts a target speech in a noisy environment in order to improve usability. A typical example is the operation of an in-car navigation system through speech recognition. If a driver can specify an operation in a hands-free manner, through a command or the like issued in the driver's natural voice from the driver's seat without paying much attention to the microphone, the driver can concentrate on driving without looking away from the road ahead, which contributes to the driver's safety.
From a viewpoint of voice activity detection (VAD), the usage of speech recognition can be divided into three types of systems: (1) Push-to-Talk system; (2) Push-to-Activate system; and (3) Always-Listening system. Among them, the Push-to-Activate system in (2) is widely used in consideration of a balance between performance and usability in the field of car navigation systems. In the Push-to-Activate system in (2), first, a talk switch is pressed to notify the system of the start of speech before the issue of a speech command. The end of the speech is automatically detected on the system side.
In a general car navigation system equipped with the current speech recognition, the talk switch is pressed to stop an audio stream being played so as to create a quiet indoor environment for speech recognition in order to maintain a recognition performance and a speech segment (the end of speech) detection performance. On this occasion, other seat members sitting on a passenger seat and the like have to stop their conversation temporarily and to keep quiet so as to make no noise. It is unpleasant for the driver and other seat members to stop the music or to keep waiting in silence and patience because of the speech recognition and it is therefore unfavorable from a viewpoint of usability. Accordingly, it is desired to provide a speech recognition technology equivalent to the Always-Listening system in (3) that does not require muting and is practicable without any change in the car interior acoustic environment.
In the case of attempting to use speech recognition without muting in a car, a first conceivable measure is to remove audio sound being played by means of an echo canceller. This measure, however, requires signal processing with a heavy load and further requires dedicated hardware or wires for reference input, which leads to a heavy burden on an in-car equipment manufacturer or automaker.
On the other hand, methods such as independent component analysis (ICA) and adaptive beamforming have been proposed for the case where a car navigation system is to recognize a command spoken by the driver as the target speech, with the driver set as the target speaker. These methods are capable of preventing a malfunction caused by the voices of other seat members sitting on the passenger seat and backseats. In order to achieve practically sufficient performance, however, it is necessary to provide a large amount of computational resources or to specify the number of noise sources in advance (the number of microphones needs to be greater than the number of noise sources).
In contrast to them, a method of applying gain control to a speech spectrum using a cross-correlation between two-channel signals such as a cross-power spectrum phase (CSP) coefficient requires only a low calculation amount and is capable of efficiently removing voice from unexpected directions and it is therefore expected as a promising method. The conventional method using the CSP technique, however, has not been successful yet in getting practically sufficient recognition performance from the recognition system in an actual environment in the car partly because a combined application with other noise removal techniques has not been sufficiently studied.
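As an illustration of the kind of processing referred to above, the following is a minimal sketch, assuming the commonly used definition of the CSP coefficient as the inverse Fourier transform of the phase-normalized cross-power spectrum of the two channels. The function names csp_coefficients and csp_gain, the parameters target_lag and floor, and the use of NumPy are assumptions made for this example and are not taken from the cited documents. The sketch derives a single gain per frame from the CSP coefficient at the lag that corresponds to the target speaker's direction and applies a lower bound (floor) to that gain.

```python
import numpy as np

def csp_coefficients(x1, x2, eps=1e-12):
    """Return the CSP coefficients of one frame, with lag 0 at index len(x1) // 2."""
    n = len(x1)
    X1 = np.fft.rfft(x1)
    X2 = np.fft.rfft(x2)
    cross = X1 * np.conj(X2)            # cross-power spectrum of the two channels
    cross = cross / (np.abs(cross) + eps)  # discard magnitude, keep phase only
    csp = np.fft.irfft(cross, n=n)      # back to the lag (time-delay) domain
    return np.fft.fftshift(csp)         # move lag 0 to the centre of the array

def csp_gain(x1, x2, target_lag=0, floor=0.1):
    """Hypothetical per-frame gain: CSP value at the target lag, clipped to [floor, 1]."""
    csp = csp_coefficients(x1, x2)
    centre = len(x1) // 2
    return float(np.clip(csp[centre + target_lag], floor, 1.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(512)
    # Same signal on both channels: sound arriving from the target direction (lag 0).
    print(csp_gain(frame, frame))
    # Second channel delayed by 3 samples: sound arriving from another direction.
    print(csp_gain(frame, np.roll(frame, 3)))
```

In this sketch a frame whose two channels are in phase at the target lag receives a gain close to 1, while a frame dominated by sound from another direction is attenuated down to the floor value. How the gain is combined with other noise removal steps is not specified here.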
For example, Nonpatent Document 1, “A Study on Speech Detection and Speech Enhancement Based on the Cross Correlation Coefficient,” does not refer to a relationship between a noise processing technique and a flooring process, and Nonpatent Document 2, “A Study of Hands-free Voice Activity Detection Based on Weighted CSP Analysis and Zero Crossing Detector,” does not suggest gain control. Nonpatent Document 3, “Hands-Free Speech Recognition in Real Environments Using Microphone Array and Kalman Filter as a Front-End System of Conversational TV,” and Nonpatent Document 4, “A Study of Talker Localization Based on Subband CSP Analysis,” are cited as background art of hands-free speech recognition or the CSP method.
“A Study on Speech Detection and Speech Enhancement Based on the Cross Correlation Coefficient,” Yoshifumi Nagata, Toyota Fujioka, and Masato Abe (Department of Computer and Information Science, Iwate University), IEICE technical report, Speech, SP2002-165, pp. 25-31, January 2003 discloses a new method to measure the target speech activity and its application both to target speech detection and to speech enhancement. This measure is the modified cross correlation coefficient calculated from the weighted cross spectrum assuming that the two directional microphones receive the identical target signal. The weighting function which has been derived to obtain the Maximum Likelihood estimator of the generalized cross correlation function is combined with the inter-channel power ratio to attenuate a coherent signal arriving from an unintended direction. We apply this measure not only to target speech detection but also to speech enhancement by utilizing it to control the total output gain. The proposed activity measure is evaluated in the speech enhancement experiments and in the word endpoint detection of an HMM-based speech recognition system across several signal-to-noise ratios (SNR) and three noise conditions. We show that the proposed speech enhancer outperforms both the adaptive beamformer and the coherence based filtering in all the noise conditions. Moreover, we show that, compared with the conventional recognition system which employs adaptive beamforming and spectral subtraction, the system with the proposed detector improves the recognition rate by 64.6% and 60.7% in the presence of impulsive noise and speech noise respectively at SNR=0 dB.
“A Study of Hands-free Voice Activity Detection Based on Weighted CSP Analysis and Zero Crossing Detector,” Takamasa Tanaka, Yuki Denda, Masato Nakayama, and Takanobu Nishiura (Ritsumeikan University), the collected papers presented at the lecture meeting of the Acoustical Society of Japan, 1-2-13, pp. 25-26, September 2006. Hands-free voice activity detection is indispensable in hands-free sound recognition in a noisy environment, but the detection function declines markedly in an extremely noisy environment since detection is done based on the time information for the power in the conventional method. In this article the authors examine hands-free voice activity detection that is robust towards noise by actively employing spatial information as well as time information.
“Hands-Free Speech Recognition in Real Environments Using Microphone Array and Kalman Filter as a Front-End System of Conversational TV,” Masakiyo Fujimoto and Yasuo Ariki (Graduate School of Science and Technology, Ryukoku University), The 4th DSPS Educational Conference, pp. 55-58, August 2002. An interactive type television is desirable so that viewers can make detailed inquiries about televised content of interest to them. Such an interactive television enables users to face the television using a microphone and say “Please tell me more about XXX” while watching a news show and so on. However, since natural speech is inhibited by the speaker's being overly conscious when speaking into the microphone, hands-free speech recognition is needed, but there are the problems of noise and echoes in the latter case. Research has been done on raising the quality of sound reception by forming directionality with a microphone array, and there are two main kinds, the delayed sum array system (for forming directionality for the speaker) and the adaptive array (for forming a dead zone for noise with directionality). In this study, we have assumed a situation where the noise has no directionality and spreads inside the laboratory, and reception of the emitted speech is carried out with a delayed sum array. Although the emitted speech can be emphasized and noise can be inhibited by a delayed sum array, the noise is still superimposed on the received speech signals and this affects speech recognition precision. There is also the problem that such arrays cannot cope adequately with echoes. To solve these problems, the speech signals after beam forming are recognized with a speech recognition method that is robust against noise (see note 3 for the prior invention). We then evaluated this method in an environment where news audio was present in a background that assumes an interactive type television.
“A Study of Talker Localization Based on Subband CSP Analysis,” Yuki Denda, Takanobu Nishiura, Hideki Kawahara, and Toshio Irino (Graduate School of Systems Engineering, Wakayama University; and College of Information Science and Engineering, Ritsumeikan University), IEICE technical report, Speech, NLC 2004-69, pp. 79-84, SP2004-109, December 2004. It is very important to capture distant-talking speech with high quality for hands-free speech acquisition systems. Microphone array steering is an ideal candidate for capturing distant-talking speech with high quality. However, it requires localizing a target talker before capturing distant-talking speech. Conventional talker localization methods cannot localize a target talker accurately in highly noisy environments. To deal with this problem, in this paper, we propose a new talker localization method based on subband CSP analysis with weighting of an average speech spectrum. It consists of subband analysis with equal bandwidth on mel-frequency and analysis weight coefficients based on an average speech spectrum, which are trained in advance with a speech database. As a result of evaluation experiments in a real room, we confirmed that the proposed method could provide better talker localization performance than the conventional methods.