1. Field of the Invention
The present invention relates to automatic speech recognition, and more particularly to a technique for accurately detecting voiced segments of a target speaker.
2. Description of the Related Art
In recent years, demand for automatic speech recognition technology has been increasing, particularly in automobiles. Conventionally, drivers have had to perform manual operations even for tasks not directly related to driving, such as pressing the buttons of a navigation system or an air conditioner. Performing these manual operations raises the risk of accidents caused by careless steering. Consequently, more vehicles are now equipped with systems that enable a driver to perform various operations by voice instruction while concentrating on driving. When the driver issues a voice instruction while driving, a microphone located near the map light unit picks up the driver's voice. The system then recognizes the voice and converts it into a command, which is used to control the car navigation system. In the same manner, the driver can operate an air conditioner and an audio system by voice. Techniques of this kind thus make it possible to perform, hands-free, operations in a car that are not directly related to driving.
In the technical field of automatic speech recognition, there is a known technique of detecting voiced segments and using them as preprocessing for automatic speech recognition. In general, the speech signal segment determined by a voice activity detection (VAD) unit is important to recognition performance, so VAD performance has a decisive influence on the performance of automatic speech recognition. In many cases, the VAD unit includes a feature extractor followed by a discrimination unit. Techniques for extracting features from a speech signal with the aim of accurately detecting voiced segments are currently being studied.
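The two-stage structure described above (a feature extractor followed by a discrimination unit) can be illustrated with a minimal sketch. The feature used here, per-frame log energy, and the fixed threshold are assumptions chosen only for simplicity; they are not the features or the discrimination unit of any cited work, and the function names are hypothetical.

```python
import math

def frame_log_energy(samples, frame_len):
    """Feature extractor (illustrative): log energy in dB of each
    non-overlapping frame of frame_len samples."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on all-zero frames.
        feats.append(10.0 * math.log10(energy + 1e-12))
    return feats

def discriminate(features, threshold_db):
    """Discrimination unit (illustrative): a frame is labeled as speech
    when its feature exceeds a fixed threshold."""
    return [f > threshold_db for f in features]

def voiced_segments(decisions):
    """Merge consecutive speech frames into (start, end) frame-index
    pairs, with end exclusive."""
    segments = []
    start = None
    for i, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(decisions)))
    return segments

# Usage: 4 quiet frames, 3 frames of a 200 Hz tone, 2 quiet frames
# (16 kHz sampling and 160-sample frames are assumed values).
quiet = [0.001] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(3 * 160)]
signal = quiet * 4 + tone + quiet * 2
features = frame_log_energy(signal, 160)
print(voiced_segments(discriminate(features, -30.0)))  # [(4, 7)]
```

A fixed energy threshold of this kind is only dependable when the speech is much louder than the background; as the background noise level approaches the speech level, the two classes overlap in this feature.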
Shikano, et al., “IT Text Automatic speech recognition System,” May 2001, discloses an approach to speech feature extraction that is typically used in automatic speech recognition and voice activity detection. Traditionally, however, studies have focused on the discrimination unit. As a typical discrimination unit, Sohn, et al., “A statistical model based voice activity detection,” January 1999, discloses a technique of using a statistical model based on a Gaussian distribution for VAD in order to improve VAD accuracy by reducing the influence of background noise. Binder, et al., “Speech non-speech separation with GMM,” October 2001, discloses that a mel frequency cepstrum coefficient (MFCC) or the like is used as a feature vector for VAD using the statistical model. In addition, the inventors of the present invention have filed an application for a speech processing method and system capable of stable automatic speech recognition in noisy environments. The method extracts the harmonic structure of human speech from an observed speech signal and, directly from the observed speech, designs a filter weighted on the harmonic structure so as to emphasize the harmonic structure in the speech spectrum (refer to Japanese Patent Application No. 2007-225195).
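As an illustration of a discrimination unit based on a Gaussian statistical model, the following is a minimal sketch of a likelihood-ratio test. It assumes that each spectral coefficient is zero-mean complex Gaussian under both the noise-only and the speech-plus-noise hypotheses, and that the noise and speech variances of every frequency bin are already known. The function names, parameters, and threshold are hypothetical, and the sketch does not reproduce the exact procedure of any cited reference.

```python
import math

def mean_log_likelihood_ratio(power_spectrum, noise_var, speech_var):
    """Average log likelihood ratio over frequency bins.

    For each bin k, with gamma = |X_k|^2 / noise_var[k] and
    xi = speech_var[k] / noise_var[k], the Gaussian model gives
    log LR_k = gamma * xi / (1 + xi) - log(1 + xi).
    """
    total = 0.0
    for power, lam_n, lam_s in zip(power_spectrum, noise_var, speech_var):
        gamma = power / lam_n   # ratio of observed power to noise variance
        xi = lam_s / lam_n      # ratio of speech variance to noise variance
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(power_spectrum)

def is_speech(power_spectrum, noise_var, speech_var, threshold=0.0):
    """Label the frame as speech when the averaged log likelihood
    ratio exceeds the threshold (assumed value: 0.0)."""
    llr = mean_log_likelihood_ratio(power_spectrum, noise_var, speech_var)
    return llr > threshold

# Usage with assumed variances: four bins, noise variance 1, speech variance 9.
noise_var = [1.0] * 4
speech_var = [9.0] * 4
print(is_speech([1.0] * 4, noise_var, speech_var))    # noise-level power: False
print(is_speech([10.0] * 4, noise_var, speech_var))   # speech-level power: True
```

In practice the variances are not known in advance and must be estimated from the observed signal, which is where background noise degrades the decision.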
Automatic speech recognition in cars is adversely affected by various background noises, such as driving noise, air-conditioner noise, and noise from open windows. It has therefore been difficult to achieve high performance not only in automatic speech recognition itself but also in voice activity detection. With the related art, alone or in combination, the difference in the feature vector between speech and non-speech becomes ambiguous as background noise in the car increases, making it difficult to detect voiced segments accurately at a low signal-to-noise (S/N) ratio.