This invention relates to low bit rate speech coding and to speech recognition for the purpose of speech-to-text conversion.
In the following description reference is made to the following publications:
[1] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences”, IEEE Trans. ASSP, Vol. 28, No. 4, pp. 357-366, 1980.
[2] S. Young, “A review of large-vocabulary continuous-speech recognition”, IEEE Signal Processing Magazine, pp. 45-47, September 1996.
[3] R. J. McAulay and T. F. Quatieri, “Speech analysis-synthesis based on a sinusoidal representation”, IEEE Trans. ASSP, Vol. 34, No. 4, pp. 744-754, 1986.
[4] R. J. McAulay and T. F. Quatieri, “Sinusoidal coding”, in W. Kleijn and K. Paliwal (eds.), “Speech Coding and Synthesis”, ch. 4, pp. 121-170, Elsevier, 1995.
[5] Y. Medan, E. Yair and D. Chazan, “Super resolution pitch determination of speech signals”, IEEE Trans. ASSP, Vol. 39, No. 1, pp. 40-48, 1991.
[6] W. Hess, “Pitch Determination of Speech Signals”, Springer-Verlag, 1983.
[7] G. Ramaswamy and P. Gopalakrishnan, “Compression of acoustic features for speech recognition in network environment”, Proceedings of ICASSP 1998.
In digital transmission of speech, a speech coding scheme is usually employed. At the receiver, the speech is decoded so that a human listener can hear it. The decoded speech may also serve as input to a speech recognition system. However, low bit rate coding, as used to transmit speech over a bandwidth-limited channel, may impair recognition accuracy compared with uncompressed speech. Moreover, the need to decode the speech adds computational overhead to the recognition process.
A similar problem occurs when the coded speech is stored for later playback and deferred recognition, for example in a hand-held device with limited storage.
It is therefore desirable to encode speech at a low bit-rate so that:
1. Speech may be decoded from the encoded bit-stream (for a human listener); and
2. A recognition system may use the decoded bit-stream without impairment of the recognition accuracy and without added computational overhead.
It is therefore an object of the invention to provide a method for encoding speech at a low bit-rate to produce a bit stream which may be decoded as audible speech.
This object is realized in accordance with a first aspect of the invention by a method for encoding a digitized speech signal so as to generate data capable of being decoded as speech, said method comprising the steps of:
(a) converting the digitized speech signal to a series of feature vectors by:
i) deriving at successive instances of time an estimate of the spectral envelope of the digitized speech signal,
ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window occupies a narrow range of frequencies, and computing the integrals thereof, and
iii) assigning said integrals or a set of predetermined functions thereof to respective components of a corresponding feature vector in said series of feature vectors;
(b) computing for each instance of time a respective pitch value of the digitized speech signal, and
(c) compressing successive acoustic vectors each containing the respective pitch value and feature vector so as to derive therefrom a bit stream.
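Steps (a)(ii) and (a)(iii) can be illustrated with a minimal Python sketch. It assumes the spectral envelope of step (a)(i) is already available as a list of magnitudes on a uniform frequency grid. The linearly spaced triangular windows, the function names `triangular_windows` and `feature_vector`, and the choice of the logarithm as the predetermined function are assumptions made for illustration; the text above does not fix any of them.

```python
import math

def triangular_windows(num_bins, num_windows):
    """Build overlapping triangular windows over num_bins frequency bins.

    Linear spacing of the window centres is an assumption of this sketch.
    """
    centers = [i * (num_bins - 1) / (num_windows + 1) for i in range(num_windows + 2)]
    windows = []
    for k in range(1, num_windows + 1):
        # Window k rises from centers[k-1], peaks at centers[k],
        # and falls back to zero at centers[k+1].
        lo, mid, hi = centers[k - 1], centers[k], centers[k + 1]
        w = []
        for b in range(num_bins):
            if lo <= b <= mid:
                w.append((b - lo) / (mid - lo))
            elif mid < b <= hi:
                w.append((hi - b) / (hi - mid))
            else:
                w.append(0.0)
        windows.append(w)
    return windows

def feature_vector(envelope, windows, func=math.log):
    """Weight the envelope by each window, integrate, and apply func.

    The integral is approximated by a discrete sum over frequency bins.
    """
    feats = []
    for w in windows:
        integral = sum(e * wk for e, wk in zip(envelope, w))
        feats.append(func(integral))
    return feats

# Example: a flat envelope over 33 bins and 4 narrow-band windows.
envelope = [1.0] * 33
windows = triangular_windows(33, 4)
feats = feature_vector(envelope, windows)
```

Each window covers only a narrow band of bins, so each component of `feats` summarizes the envelope energy in one frequency region.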
According to a second, complementary aspect of the invention there is provided a method for decoding a bit-stream representing a compressed series of acoustic vectors each containing a respective feature vector and a respective pitch value derived at a respective instance of time, each of the feature vectors having multiple components obtained by:
i) deriving at successive instances of time an estimate of the spectral envelope of a digitized speech signal,
ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window occupies a narrow range of frequencies, and computing the integrals thereof, and
iii) assigning said integrals or a set of predetermined functions thereof to a respective one of said remaining components of the feature vector;
said method comprising the steps of:
(a) separating the received bit-stream into compressed feature vectors data and compressed pitch values data,
(b) decompressing the compressed feature vectors data and outputting quantized feature vectors,
(c) decompressing the compressed pitch values data and outputting quantized pitch values, and
(d) generating a continuous speech signal, using the quantized feature vectors and pitch values.
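Decoding steps (a) to (c) can be sketched as a round trip in Python. The frame layout used here (one byte per feature component followed by one byte of pitch), the uniform scalar quantizer, the value ranges, and all names such as `FEATURE_DIM`, `encode_frame` and `decode_stream` are hypothetical choices for illustration only; they are not the compression scheme of [7] or of the invention. Step (d), speech synthesis, is not shown.

```python
import struct

FEATURE_DIM = 3                 # hypothetical feature-vector length
FEAT_RANGE = (-10.0, 10.0)      # assumed range of feature values
PITCH_RANGE = (0.0, 400.0)      # assumed pitch range in Hz

def quantize(x, lo, hi, levels=256):
    """Uniform scalar quantizer: clip x to [lo, hi] and map it to an integer code."""
    x = min(max(x, lo), hi)
    return round((x - lo) / (hi - lo) * (levels - 1))

def dequantize(code, lo, hi, levels=256):
    """Map an integer code back to its reconstruction value."""
    return lo + code * (hi - lo) / (levels - 1)

def encode_frame(features, pitch_hz):
    """Pack one acoustic vector: FEATURE_DIM feature bytes, then one pitch byte."""
    codes = [quantize(f, *FEAT_RANGE) for f in features]
    codes.append(quantize(pitch_hz, *PITCH_RANGE))
    return struct.pack(f"{FEATURE_DIM + 1}B", *codes)

def decode_stream(bitstream):
    """Split each frame into feature data and pitch data, then dequantize both."""
    frame_size = FEATURE_DIM + 1
    features, pitches = [], []
    for off in range(0, len(bitstream), frame_size):
        codes = struct.unpack_from(f"{frame_size}B", bitstream, off)
        features.append([dequantize(c, *FEAT_RANGE) for c in codes[:FEATURE_DIM]])
        pitches.append(dequantize(codes[FEATURE_DIM], *PITCH_RANGE))
    return features, pitches

# Example round trip over two frames.
stream = encode_frame([1.0, -2.0, 0.5], 120.0) + encode_frame([0.0, 3.0, -4.0], 0.0)
quantized_features, quantized_pitches = decode_stream(stream)
```

The decoded values differ from the inputs by at most half a quantization step, which is the sense in which the decoder outputs quantized feature vectors and quantized pitch values.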
The invention will best be appreciated with regard to speech recognition schemes as currently implemented. All speech recognition schemes start by converting the digitized speech to a set of features that are then used in all subsequent stages of the recognition process. A commonly used feature set is the Mel-frequency cepstral coefficients (MFCC) [1, 2], which can be regarded as a specific case of the above-described feature vectors. Transmitting a compressed version of the set of feature vectors removes the overhead required for decoding the speech. The feature extraction stage of the recognition process is replaced by feature decompression, which requires an order of magnitude fewer computations. Furthermore, low bit rate transmission of the Mel-cepstral features (4-4.5 kbps) is possible without impairing the recognition accuracy [7].
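To illustrate how Mel-cepstral features fit the general scheme, the following Python sketch shows the two ingredients that specialize it: window centres spaced uniformly on the mel scale, and a predetermined function consisting of a logarithm followed by a discrete cosine transform. The mel formula used is one widely used approximation, and the function names and parameter values are assumptions of this sketch rather than details taken from [1] or [2].

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to mel using a widely used approximation (other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(num_windows, f_min, f_max):
    """Window centre frequencies in Hz, equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (num_windows + 1)
    return [mel_to_hz(m_lo + step * k) for k in range(1, num_windows + 1)]

def cepstrum(log_energies, num_coeffs):
    """Unnormalized DCT-II of the log window energies."""
    n = len(log_energies)
    return [
        sum(e * math.cos(math.pi * c * (j + 0.5) / n) for j, e in enumerate(log_energies))
        for c in range(num_coeffs)
    ]

# Example: 5 windows between 0 and 4000 Hz, and the cepstrum of flat log energies.
centers = mel_center_frequencies(5, 0.0, 4000.0)
coeffs = cepstrum([1.0, 1.0, 1.0, 1.0], 3)
```

The centres come out closer together at low frequencies than at high frequencies, and a flat set of log energies yields a cepstrum whose only non-zero coefficient is the zeroth.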
The invention is based on the finding that if compressed pitch information is transmitted together with the speech recognition features, it is possible to obtain a good quality reproduction of the original speech.
The encoder consists of a feature extraction module, a pitch detection module and a features and pitch compression module. The decoder consists of a decompression module for the features and pitch and a speech reconstruction module.
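As an illustration of what a pitch detection module computes, the Python sketch below estimates the pitch of one frame from the peak of its normalized autocorrelation. This is a basic textbook approach chosen for brevity; it is not the super-resolution method of [5], and the function name `estimate_pitch`, the search range of 60 to 400 Hz, and the convention of returning 0.0 when no peak is found are all assumptions of this sketch.

```python
import math

def estimate_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return a pitch estimate in Hz, or 0.0 if no positive correlation peak exists."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Normalized correlation between the frame and a copy delayed by `lag` samples.
        a, b = frame[:-lag], frame[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0
```

For a 50 ms frame sampled at 8000 Hz that contains a 100 Hz fundamental, the correlation peaks at a lag of one period (80 samples), so the estimate is close to 100 Hz. A production module would also need a voicing decision and finer lag resolution.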
It should be noted that in some recognition systems, especially those for tonal languages, the pitch information is used for recognition, and pitch detection is applied as part of the recognition process. In that case, the encoder only compresses information that is already obtained during the recognition process.
It is possible to encode additional components that are not used for speech recognition, but may be used by the decoder to enhance the reconstructed speech quality.