In recent years, demand for speech recognition techniques has been increasing, particularly in the automotive industry. Specifically, in a vehicle, the driver has heretofore been required to manually perform operations not directly related to driving, such as operating buttons in a car navigation system and turning an air conditioner on and off. During such operations, the driver pays less attention to steering, which could involve a risk of accident.
Meanwhile, vehicles have been appearing that are equipped with systems which enable drivers to perform various operations by giving spoken instructions while concentrating on driving. When the driver gives a spoken instruction while driving, a microphone provided in a map light unit receives the speech. The system then recognizes the speech and converts it into a command to operate the car navigation system. Thus, the car navigation system is operated. Similarly, the air conditioner and an audio system can be operated with speech.
However, since speech recognition in the vehicle is exposed to a lot of noise, it is difficult to suppress the noise well enough to achieve a high recognition rate. Typical kinds of noise while the vehicle is running are as follows:
1. Music reproduction
2. Interfering speech of fellow passengers
3. Noise generated when the air volume of a fan is large, and noise generated when windows are open
As to the music reproduction, sounds can be canceled by using an echo canceller technique. As to the interfering speech of fellow passengers, for example, a microphone for speech recognition can be set not to receive speech of the fellow passengers by using a microphone array technique.
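As an illustration of the echo canceller idea mentioned above, the following is a minimal sketch that assumes a normalized least-mean-squares (NLMS) adaptive filter. The source does not say which algorithm is used, so NLMS is a stand-in, and the function name, the tap count, and the step size `mu` are hypothetical choices. The known music signal serves as the reference, and the filter learns how that signal appears at the microphone so that it can be subtracted.

```python
def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Remove an adaptively estimated echo of `reference` from `mic`.

    reference: samples of the known playback signal (e.g. music).
    mic:       microphone samples containing the echo plus anything else.
    Returns (residual, weights). NLMS is an assumed, illustrative choice.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate            # what remains after cancellation
        power = sum(h * h for h in history) + eps
        # Normalized gradient step toward a smaller error.
        weights = [w + mu * error * h / power
                   for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights
```

When the microphone signal contains only the echo, the residual should shrink toward zero as the weights approach the echo path. When the driver speaks, the speech is not predictable from the reference and therefore stays in the residual.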
The publication “Perceptual Harmonic Cepstral Coefficients as the Front-end for Speech Recognition” proposes perceptual harmonic cepstral coefficients (PHCC) as features to extract for speech recognition. According to the publication, pitch estimation and classification into voiced, unvoiced, and transitional speech are performed by a spectro-temporal auto-correlation technique. A peak picking algorithm is then employed to precisely locate pitch harmonics. A weighting function, which depends on the classification and the pitch harmonics, is applied to the power spectrum and ensures accurate representation of the voiced speech spectral envelope. The harmonics-weighted power spectrum undergoes mel-scaled band-pass filtering, and the log-energy of the filters' output is discrete cosine transformed to produce cepstral coefficients. For perceptual considerations, within-filter cubic-root amplitude compression is applied to reduce amplitude variation without compromising the gain invariance properties. Experiments show substantial recognition gains of PHCC over MFCC, with 48% and 15% error rate reduction for the Mandarin digit database and E-set, respectively.
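The back end of the pipeline described above (weighting, within-filter compression, mel-scaled filtering, log-energy, and discrete cosine transform) can be sketched as follows. This is a simplified illustration and not the publication's implementation: pitch estimation and peak picking are omitted, the harmonic weighting is passed in as an optional `weights` list, the cube root is applied to each weighted spectral value before the filter sum, and the mel formula and triangular filter shape are common conventions assumed here. All function names are hypothetical.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters evenly spaced on the mel scale (assumed shape)."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * nyquist / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def phcc_like_cepstra(power_spectrum, sample_rate, n_filters=12, n_ceps=8,
                      weights=None):
    """Weighted spectrum -> cube root -> mel filters -> log -> DCT-II."""
    n_bins = len(power_spectrum)
    if weights is None:
        weights = [1.0] * n_bins          # no harmonic weighting
    compressed = [(w * p) ** (1.0 / 3.0)  # within-filter compression
                  for w, p in zip(weights, power_spectrum)]
    bank = mel_filterbank(n_filters, n_bins, sample_rate)
    log_energy = [math.log(max(sum(h * c for h, c in zip(row, compressed)),
                               1e-12))    # floor avoids log(0)
                  for row in bank]
    return [sum(e * math.cos(math.pi * k * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energy))
            for k in range(n_ceps)]
```

In this sketch, multiplying the input spectrum by a constant gain shifts every log-energy by the same amount, so only the zeroth coefficient changes. This is one way to read the gain invariance mentioned above.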
Japanese Patent Application Publication No. 2001-024976 discloses an image processor, an electronic camera, a control method for these, and a memory medium. This publication discusses providing an electronic camera capable of easily utilizing an external device. According to the disclosure, when an external device such as an image pickup device, a recording device, a display device, or a communication device is mounted on an electronic camera, a system control circuit judges whether or not that external device has a function similar to that of a built-in device. When the built-in and external devices have mutually similar functions, the operation of the external device is validated in place of that of the built-in device.
Japanese Patent Application Publication No. 2003-337594 discloses a voice recognition device, method, and program. This publication discloses a method in which background noise other than the sound source located along an objective direction is efficiently eliminated to realize highly precise voice recognition, as well as a system using the method. In this method, an angle distinctive power distribution is observed by orienting the directivity of a microphone array toward the various sound source directions being considered. This observed distribution is approximated by the sum of coefficient multiples of two reference distributions: a reference angle distinctive power distribution measured beforehand using a reference sound along the objective sound source direction, and a reference angle distinctive power distribution of non-directive background sound. Only the components along the objective sound source direction are then extracted. Moreover, when the objective sound source direction is unknown, a sound source location searching section estimates it by selecting, from among the reference angle distinctive power distributions along the various sound source directions, the one which minimizes the approximation residue. Furthermore, a maximum likelihood operation is conducted using the voice data of the components along the sound source direction being processed and a voice model obtained by applying a prescribed model to the voice data, and voice recognition is conducted based on the obtained estimated value.
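The approximation step described above can be illustrated with a small least-squares sketch. It assumes that the observed distribution is fitted as `a * reference_target + b * reference_background` by ordinary least squares, and that the direction is chosen by the smallest squared residual. The publication may use a different fitting criterion or constraints such as non-negative coefficients. The function names and the dictionary layout of the reference distributions are hypothetical.

```python
def fit_two_references(observed, ref_target, ref_background):
    """Least-squares fit: observed ~ a * ref_target + b * ref_background.

    Returns (a, b, squared_residual). Solves the 2x2 normal equations.
    """
    tt = sum(t * t for t in ref_target)
    gg = sum(g * g for g in ref_background)
    tg = sum(t * g for t, g in zip(ref_target, ref_background))
    ot = sum(o * t for o, t in zip(observed, ref_target))
    og = sum(o * g for o, g in zip(observed, ref_background))
    det = tt * gg - tg * tg
    if det == 0.0:
        raise ValueError("reference distributions are linearly dependent")
    a = (ot * gg - og * tg) / det
    b = (og * tt - ot * tg) / det
    residual = sum((o - a * t - b * g) ** 2
                   for o, t, g in zip(observed, ref_target, ref_background))
    return a, b, residual

def estimate_direction(observed, refs_by_direction, ref_background):
    """Pick the direction whose reference gives the smallest residual.

    refs_by_direction maps a direction label (e.g. degrees) to its
    pre-measured reference distribution. Returns the direction, both
    coefficients, and the extracted target-direction component.
    """
    best = min(refs_by_direction,
               key=lambda d: fit_two_references(
                   observed, refs_by_direction[d], ref_background)[2])
    a, b, _ = fit_two_references(observed, refs_by_direction[best],
                                 ref_background)
    target_component = [a * t for t in refs_by_direction[best]]
    return best, a, b, target_component
```

For example, with reference distributions for -30, 0, and 30 degrees and a flat background reference, an observation built as twice the 30-degree reference plus half the background is assigned to 30 degrees with coefficients close to 2 and 0.5.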
Japanese Patent Application Publication No. 2003-76393 discloses a method for estimating voice in a noisy environment and a voice recognition method. The disclosure discusses providing a voice estimating method and a voice recognition method which operate robustly even for a voice signal inputted in noise or a voice signal in which noise is mixed on a communication line. The voice estimating method includes a step for segmenting an input acoustic signal into short-time segments, an acoustic analyzing step for performing a short-time frequency analysis, an element estimating step for estimating elements required for voice estimation, and a voice estimating step for estimating a voice by using the elements obtained in the element estimating step. Concretely, the input acoustic signal is segmented into short-time segments, and short-time frequency analysis is performed. Spectrum envelopes of voice held in a code book for voice recognition are utilized as knowledge to generate sound models. The spectrum information obtained by the short-time frequency analysis is regarded as a probability density function, and maximum a posteriori probability estimation is used to estimate mixture weight values. The hypothesis that the element generating the sound model having the maximum weight value exists at each time is judged to have the maximum likelihood, and these elements are outputted.
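The weight estimation described above can be sketched for a single short-time frame as follows. The sketch assumes that the normalized spectrum is treated as a probability distribution over frequency bins, that the sound models are fixed spectral templates that each sum to one, and that the weights are found by expectation-maximization with a symmetric Dirichlet prior (`alpha` of at least 1, where `alpha=1.0` reduces to maximum likelihood). These modeling details and all names are assumptions for illustration and are not taken from the publication.

```python
def map_mixture_weights(spectrum, templates, alpha=1.0, iterations=50):
    """Estimate mixture weights of fixed templates for one spectral frame.

    spectrum:  non-negative spectral values for one short-time frame.
    templates: list of non-negative templates, each summing to 1.
    alpha:     symmetric Dirichlet prior parameter (assumed, >= 1).
    """
    total = sum(spectrum)
    density = [s / total for s in spectrum]    # spectrum as a distribution
    k = len(templates)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        counts = [alpha - 1.0] * k             # prior pseudo-counts
        for f, p in enumerate(density):
            mix = sum(weights[j] * templates[j][f] for j in range(k))
            if p == 0.0 or mix <= 0.0:
                continue                       # bin carries no information
            for j in range(k):
                # Share of this bin's mass explained by template j.
                counts[j] += p * weights[j] * templates[j][f] / mix
        norm = sum(counts)
        weights = [c / norm for c in counts]
    return weights

def dominant_element(spectrum, templates, alpha=1.0):
    """Index of the template with the largest estimated weight."""
    weights = map_mixture_weights(spectrum, templates, alpha)
    return max(range(len(weights)), key=lambda j: weights[j])
```

Applying `dominant_element` to each frame in turn gives one selected element per time step, which corresponds to the per-time selection of the model with the maximum weight described above.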
Against this background, an object of the present invention is to provide a system capable of performing speech recognition stably in noisy environments. This object can be achieved by a combination of the features described in the independent claims. In addition, the dependent claims define more advantageous specific examples of the present invention.