The present invention relates to a speech processing apparatus and a speech processing method for distinguishing between noise components and speech components.
A signal generated by capturing voice contains speech segments, in which voice is present, and non-speech segments, which are pauses or breaths containing no voice. A speech (or voice) recognition system distinguishes speech segments from non-speech segments to achieve a higher speech recognition rate and a more efficient speech-recognition process. Mobile communication using mobile phones, transceivers, etc. switches the encoding process for input signals between speech and non-speech segments to achieve a higher coding rate and higher transfer efficiency. Mobile communication requires real-time performance and therefore demands a speech-segment determination process with little delay.
One known low-delay speech-segment determination process detects speech segments using cepstrum analysis. It derives, from a frame of an input signal, harmonic data on the fundamental wave that has the largest number of harmonic overtone components, and then determines whether the harmonic data and power data on the energy in the frame (the power data indicating the energy level relative to a threshold level) exhibit the features of voice. Another known low-delay speech-segment determination process derives the autocorrelation of the spectrum in the frequency domain and detects speech segments based on the level of that autocorrelation.
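The cepstrum-based check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the known process itself: the function names, the pitch search range of 60 to 400 Hz, and both threshold values are hypothetical, and a naive DFT stands in for an FFT to keep the sketch free of dependencies.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform, kept dependency-free."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame):
    """First transform -> log magnitude -> second transform."""
    n = len(frame)
    log_mag = [math.log(abs(c) + 1e-9) for c in dft(frame)]
    # log_mag is real and even for a real frame, so the inverse DFT
    # equals the real part of the forward DFT divided by N.
    return [c.real / n for c in dft(log_mag)]


def frame_power(frame):
    """Mean energy per sample in the frame."""
    return sum(s * s for s in frame) / len(frame)


def looks_voiced(frame, sample_rate, power_threshold=1e-3,
                 peak_threshold=0.1, f0_min=60.0, f0_max=400.0):
    """Hypothetical decision: enough power AND a clear cepstral peak."""
    if frame_power(frame) < power_threshold:
        return False  # too quiet to be speech (assumed threshold)
    ceps = real_cepstrum(frame)
    lo = int(sample_rate / f0_max)  # shortest pitch period, in samples
    hi = min(int(sample_rate / f0_min), len(frame) // 2)
    return max(ceps[lo:hi + 1]) > peak_threshold


# A pulse train with a 40-sample period is rich in harmonics;
# a single click has a flat spectrum and no harmonic structure.
pulse_train = [1.0 if t % 40 == 0 else 0.0 for t in range(240)]
click = [1.0] + [0.0] * 239
silence = [0.0] * 240
```

With an assumed 8 kHz sample rate, `looks_voiced(pulse_train, 8000)` passes both the power check and the cepstral-peak check, while `click` fails the peak check and `silence` fails the power check.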
The known speech-segment determination processes are effective in an environment where noise is relatively low. However, the known processes tend to detect speech segments erroneously when the noise level rises, because the features of voice become buried in the noise. The features of voice are, for example, the flatness of the frequency distribution of a frame of an input signal (indicating how often peaks appear) and the pitch (high tones).
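The text does not give a formula for flatness. One widely used measure, assumed here purely for illustration, is the spectral flatness: the ratio of the geometric mean to the arithmetic mean of the power spectrum. It is close to 1 for a flat, noise-like spectrum and close to 0 for a spectrum dominated by a few peaks. The function names and the small `floor` constant are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectral_flatness(frame, floor=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum."""
    # The floor avoids log(0) for empty bins (an assumed constant).
    power = [p + floor for p in power_spectrum(frame)]
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic = sum(power) / len(power)
    return geometric / arithmetic


# A click has equal energy in every bin; a pure tone has a single peak.
click = [1.0] + [0.0] * 63
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
```

Here `spectral_flatness(click)` is essentially 1 and `spectral_flatness(tone)` is essentially 0. Strong broadband noise pushes a voiced frame toward the flat end of this scale, which is one way to picture how the voice features get buried.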
Moreover, cepstrum analysis requires performing a Fourier transform twice, which imposes a heavy processing load in the frequency domain and consumes considerable power. Thus, if cepstrum analysis is employed in a battery-powered system such as mobile communication equipment, a higher-capacity battery is required to meet the power consumption, resulting in a higher cost, a bulkier system, etc.
Furthermore, for an input signal that carries noise that is periodic, as voice is, a known technique that detects the features of voice based on the periodicity of voice may erroneously determine the noise to be voice.
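This failure mode can be illustrated with a minimal periodicity detector. The sketch below assumes a normalized time-domain autocorrelation searched over an assumed pitch range of 60 to 400 Hz; the function names and the 0.8 threshold are hypothetical and do not come from any particular known technique.

```python
import math


def periodicity_score(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Largest normalized autocorrelation over the assumed pitch lags."""
    n = len(frame)
    lo = int(sample_rate / f0_max)  # shortest period, in samples
    hi = min(int(sample_rate / f0_min), n - 1)
    best = 0.0
    for lag in range(lo, hi + 1):
        head = frame[: n - lag]
        tail = frame[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if den > 0.0:  # skip lags where either segment is silent
            best = max(best, num / den)
    return best


def naive_is_voice(frame, sample_rate, threshold=0.8):
    """Hypothetical rule: treat any strongly periodic frame as voice."""
    return periodicity_score(frame, sample_rate) >= threshold


SAMPLE_RATE = 8000  # assumed sample rate in Hz
# A 100 Hz hum is noise, not voice, yet it is perfectly periodic.
hum = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(400)]
click = [1.0] + [0.0] * 399
```

`naive_is_voice(hum, SAMPLE_RATE)` returns `True` even though the frame contains no voice, because the hum repeats every 80 samples and that lag falls inside the searched pitch range. The aperiodic `click` is rejected.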