The present invention relates to speech recognition. In particular, the present invention relates to training acoustic models for speech recognition.
In speech recognition, a speech signal is compared to acoustic models to identify a sequence of phonemes that is represented by the speech signal. In most such systems, the comparison between the speech signal and the models is performed in what is known as the cepstral domain. To place a speech signal in the cepstral domain, the speech signal is sampled by an analog-to-digital converter, and the resulting digital values are grouped into frames. A Discrete Fourier Transform is applied to each frame of digital values to place it in the frequency domain. The power spectrum is computed by taking the squared magnitude of each frequency domain value. Mel weighting is applied to the power spectrum, and the logarithm of each of the weighted frequency components is determined. A truncated discrete cosine transform is then applied to form a cepstral vector for each frame. The truncated discrete cosine transform typically converts the forty-dimensional vector that is present after the log function into a thirteen-dimensional cepstral vector.
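The steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular recognizer. The mel formula, the triangular filter shapes, the 256-sample frame length, the small floor added before the logarithm, and all function names are assumptions chosen for the example; windowing and other refinements used in practice are omitted.

```python
import cmath
import math


def hz_to_mel(hz):
    # One common mel-scale formula (an assumption; the text does not specify one).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Squared magnitude of the DFT for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(total) ** 2)
    return spectrum


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular weights whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = [b * (sample_rate / 2.0) / (num_bins - 1) for b in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def truncated_dct(values, num_coeffs):
    """DCT-II that keeps only the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]


def cepstral_vector(frame, sample_rate=16000, num_filters=40, num_coeffs=13):
    power = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(power), sample_rate)
    # The 1e-10 floor (an assumption) avoids taking the log of zero.
    log_mel = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
               for row in bank]
    return truncated_dct(log_mel, num_coeffs)


# Hypothetical input: one 256-sample frame of a 1 kHz tone sampled at 16 kHz.
frame = [math.sin(2.0 * math.pi * 1000.0 * t / 16000) for t in range(256)]
cepstrum = cepstral_vector(frame)
print(len(cepstrum))  # 13
```

The forty log mel values are reduced to thirteen cepstral values only in the final step, so the truncation discards the higher-order cosine terms rather than any particular frequency band.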
In order for speech decoding to be performed in the cepstral domain, the models must be trained on cepstral vectors. One way to obtain such training data is to convert speech signals into cepstral vectors using a high sampling rate such as sixteen kilohertz. Speech sampled at this high sampling rate is considered wideband data. This wideband data is desirable because it includes information for a large number of frequency components, thereby providing more information for forming models that can discriminate between different phonetic sounds.
Although such wideband speech data is desirable, it is expensive to obtain. In particular, it requires that a speaker be in the same room as the microphone used to collect the speech data. In other words, the speech cannot pass through a narrowband filter before it is recorded. This requirement forces either the speaker or the designer of the speech recognition system to travel in order to collect training speech.
A second technique for collecting training speech is to collect the speech through a telephone network. In such systems, people are invited to call a telephone number and provide examples of speech.
In order to limit the amount of data passed through the telephone network, it is common for telephone network providers to sample the speech signal at a low sampling rate. As a result, the speech received for training is narrowband speech that is missing some of the frequency components that are present in wideband training speech. Because such speech includes less information than wideband speech, the models trained from such narrowband telephone speech do not perform as well as models trained from wideband speech.
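The loss of frequency components can be illustrated with a short calculation. A signal sampled at a given rate can only represent frequencies up to half that rate. The sketch below assumes a narrowband rate of eight kilohertz, which is typical of telephone networks but is not stated in the text, and it assumes forty mel filters spaced with a common mel formula. It counts how many filter center frequencies of a sixteen kilohertz wideband front end lie above the band that the narrowband signal can represent. All names are hypothetical.

```python
import math


def hz_to_mel(hz):
    # One common mel-scale formula (an assumption).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_centers(sample_rate, num_filters=40):
    """Center frequencies of filters evenly spaced in mel up to sample_rate / 2."""
    top = hz_to_mel(sample_rate / 2.0)
    return [mel_to_hz(top * m / (num_filters + 1))
            for m in range(1, num_filters + 1)]


WIDEBAND_RATE = 16000    # sampling rate given in the text
NARROWBAND_RATE = 8000   # assumed telephone sampling rate

centers = filter_centers(WIDEBAND_RATE)
narrowband_limit = NARROWBAND_RATE / 2.0  # highest representable frequency

observed = [c for c in centers if c <= narrowband_limit]
missing = [c for c in centers if c > narrowband_limit]

print("filters observable in narrowband speech:", len(observed))
print("filters with no narrowband information:", len(missing))
```

Under these assumptions, the filters counted as missing correspond to frequency components for which narrowband training speech provides no direct evidence.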
Although systems have been developed that attempt to decode speech from less-than-perfect data, such systems have operated in the spectral domain and have not provided a way to train models from less-than-perfect data. Because the discrete cosine transform that places vectors in the cepstral domain mixes frequency components, and often involves a truncation of features, such systems cannot be applied directly to training cepstral domain acoustic models.
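The mixing effect of the discrete cosine transform can be seen in a small numerical sketch. The forty-value log spectrum below is made up for illustration, and the function name is hypothetical. A single frequency component is changed, and the sketch counts how many of the thirteen cepstral coefficients change as a result.

```python
import math


def truncated_dct(values, num_coeffs=13):
    """DCT-II that keeps only the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]


# Hypothetical 40-value log spectrum (illustrative values only).
log_spectrum = [math.log(1.0 + i) for i in range(40)]

# Change one high-frequency component, as if it were missing or unreliable.
perturbed = list(log_spectrum)
perturbed[35] += 1.0

before = truncated_dct(log_spectrum)
after = truncated_dct(perturbed)

changed = [k for k in range(13) if abs(after[k] - before[k]) > 1e-9]
print(len(changed))  # 13: every cepstral coefficient is affected
```

Because each cepstral coefficient is a weighted sum over all forty components, a missing frequency band cannot be isolated to a subset of cepstral dimensions, which is why a spectral domain technique does not carry over directly.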
Thus, a system is needed that can construct better wideband acoustic models in the cepstral domain using narrowband telephone data.