1. Field of the Invention
The present invention relates to a speech recognition technology, and more particularly, to an apparatus for normalizing a channel variation for robust speech recognition, wherein the channel variation is generally caused by various factors including different microphone types and variations in communications systems, and a method therefor.
2. Description of the Related Art
Generally, as illustrated in FIG. 4, a speech recognition apparatus includes a characteristic extraction unit 10 and a speech recognition unit 20. The characteristic extraction unit 10 extracts characteristics of an inputted speech signal. The speech recognition unit 20 recognizes speech based on data related to the characteristics extracted by the characteristic extraction unit 10. Although the characteristic extraction unit 10 can be implemented based on various methods, a mel-frequency cepstrum coefficient (MFCC) method or a perceptual linear prediction cepstrum coefficient (PLPCC) method is mainly employed. A hidden Markov model (HMM) method, a dynamic time warping (DTW) method and a neural network method are frequently employed for the speech recognition unit 20.
FIG. 5 illustrates an exemplary conventional characteristic extraction unit for extracting a speech characteristic based on the MFCC method. As illustrated, the characteristic extraction unit includes: a spectrum analysis unit 11; a filter bank unit 12; a logarithmic compression unit 13; and a discrete cosine transformation unit 14. The spectrum analysis unit 11 extracts information on a frequency spectrum of a speech signal. The filter bank unit 12 estimates an envelope curve of a simplified spectrum from the spectrum estimated by the spectrum analysis unit 11. The logarithmic compression unit 13 compresses the magnitude of the simplified spectrum based on a logarithmic function. The discrete cosine transformation unit 14 performs a discrete cosine transformation (DCT) operation on an output of the logarithmic compression unit 13 and calculates cepstrum coefficients.
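The four stages described above may be sketched as follows. This is only an illustrative, simplified implementation for a single windowed frame (the function name `mfcc_frame`, the triangular mel filter-bank construction, and all parameter values are assumptions of this sketch, not part of the disclosed apparatus):

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_ceps=13):
    """Toy single-frame MFCC sketch: spectrum -> mel filter bank -> log -> DCT."""
    # 1. Spectrum analysis unit: power spectrum of the (already windowed) frame
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2            # n_fft // 2 + 1 bins

    # 2. Filter bank unit: triangular filters spaced evenly on the mel scale
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor(n_fft * mel_to_hz(mel_edges) / sample_rate).astype(int)

    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, hi):
            if k < mid:
                w = (k - lo) / max(mid - lo, 1)        # rising edge of triangle
            else:
                w = (hi - k) / max(hi - mid, 1)        # falling edge of triangle
            fbank[i] += w * power[k]

    # 3. Logarithmic compression unit: compress filter-bank energies
    log_fbank = np.log(fbank + 1e-10)

    # 4. Discrete cosine transformation unit: DCT-II yields cepstrum coefficients
    n = np.arange(n_filters)
    return np.array([np.sum(log_fbank * np.cos(np.pi * q * (n + 0.5) / n_filters))
                     for q in range(n_ceps)])
```

In practice, this computation is repeated for every overlapping frame of the speech signal, producing one MFCC vector per time period.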
A channel variation typically occurs in a speech signal due to differences in microphone types, telephone networks and communication systems, as well as speaker variations. A cepstral mean subtraction (CMS) method, a signal bias removal (SBR) method and an affine transform of cepstrum (ATC) method are commonly known methods for compensating the channel variation. Due to computational limitations, most of the introduced channel variation compensation methods, which can improve speech recognition performance by compensating the channel variation, are applied not to the speech signal itself but to a specific characteristic parameter of each time period after characteristic extraction.
On the basis of the fact that a channel variation is expressed as one constant in a mel-frequency cepstral coefficient (MFCC) region, which is a representative speech characteristic, the most widely employed method is the cepstral mean subtraction (CMS) method, which calculates an average value of the MFCC parameters over all time periods and subtracts the calculated average value from each MFCC parameter. Although this CMS method is simple and effective, there is a limitation in that the entire average of the MFCC parameters, which includes a speech component, is treated as the channel variation for each time period. As a result, the CMS method may remove a speech component needed for recognition.
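The CMS method described above can be sketched in a few lines. This is a minimal illustration (the function name and array layout are assumptions of this sketch): the average cepstral vector over all time periods is subtracted from every frame, so any constant offset introduced by the channel vanishes.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """CMS sketch: subtract the per-utterance mean from each cepstral vector.

    cepstra: array of shape (n_frames, n_ceps), one MFCC vector per time period.
    """
    mean = cepstra.mean(axis=0)    # average value over all time periods
    return cepstra - mean          # constant channel offset is removed
```

Because subtraction of the mean cancels any per-dimension constant, an utterance and the same utterance with a constant channel offset added yield identical CMS outputs.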
Describing the CMS method in more detail, since the overall average value of the MFCC parameters is subtracted from each MFCC parameter, a speech component contained in that average is often removed together with the channel variation. For instance, in the case that a signal expressed as ‘sin(t)+a’ is changed into a signal expressed as ‘sin(t)+a+b’ due to a channel variation, the application of the CMS method causes a removal of the average value ‘a+b’, thereby outputting a value of ‘sin(t)’. That is, the speech component ‘a’ is lost along with the channel component ‘b’.
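This loss can be reproduced numerically. The sketch below (the particular values of ‘a’ and ‘b’ are arbitrary choices for illustration) shows that subtracting the overall average removes the entire offset ‘a+b’, not only the channel component ‘b’:

```python
import numpy as np

# One full period of the example signal 'sin(t) + a'
t = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
a, b = 0.5, 0.3                  # a: speech component, b: channel variation

channel = np.sin(t) + a + b      # signal after channel distortion

# CMS: subtract the overall average; mean of sin(t) over a full period is ~0,
# so channel.mean() is approximately a + b
cms = channel - channel.mean()

# The output is 'sin(t)': the speech component 'a' was removed along with 'b'
assert np.allclose(cms, np.sin(t), atol=1e-6)
```

The assertion confirms that after CMS the output equals ‘sin(t)’, demonstrating the limitation described above.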